Вы здесь

Новости LessWrong.com

A community blog devoted to refining the art of rationality

Адрес: https://www.lesswrong.com

Обновлено: 4 часа 29 минут назад

Vibe analyzing my genome

7 апреля, 2026 - 06:05

The most interesting and useful result concerns drug metabolism; you can skip to that section here.

I would like to reduce the up and down fluctuations of my Bipolar. They're simultaneously not that bad, but also bad enough to be unpleasant, disruptive, and to rob me of many productive hours.

Bipolar, for all its prevalence and study[1], is not yet well understood. It's maybe something something a circadian rhythm disorder. It's maybe something something inflammatory. It's maybe a single disease with somewhat different presentations, or maybe it's multiple diseases with somewhat similar presentation.

I hoped that if I could look at my own genes and see which were anomalous, perhaps I could pin down what's going wrong in my mind. If it seems to be clock gene-related, I'll emphasize circadian type treatments; if it's inflammatory, focus on that. I probably want to work on everything, but info that directs effort would be good[2].

I don't know much about genetics

Due to poor life choices, I'm likely missing even many basics taught in high school. Sure, I knew about SNPs before this project, but I couldn't have told you what linkage disequilibrium was.

But if the hundreds of new users submitting AI-generated work to LessWrong have taught me anything, it's that you don't have to be a domain expert to believe you've found invaluable insight with the help of a friendly AI collaborator. Ideally, I'd stop and study a genetics primer, but it's so much easier to mash "continue" on Cursor/Claude Code.

I am hoping that, via sufficient paranoia, prior experience with LLMs, and red-teaming-like efforts, I can nonetheless get something trustworthy out of the LLMs.

LLMs know about genetics and dev-ops

I got the notification from Nucleus Genomics that my genome results were in. Nucleus provides a basic readout of "elevated risk of this, decreased risk of that", and I'm aware there are other sites where you can upload a .vcf file and get a similar analysis.

However, I wanted a bespoke analysis, and I wanted to get it via interacting with an LLM that can run arbitrary analyses on my genome. I created a new Git repo, described my initial aim, and Claude/GPT5.4[3] created a plan and set up a good repo structure. I downloaded all the files containing my genome.

For many kinds of genetic analyses, there are available public git repos. What became apparent is that LLMs were incredibly useful for gene analysis, not just because of knowing more about genetics than me, but because the analysis represented a huge devops/programming task that the LLMs had no problem with. God, I'm grateful for something that figures out Python environments and package dependencies for me. I'll legit guess that the LLM proficiency at coding allowed me to do, in a dozen hours, as much genetic analysis as would take a genetics PhD (or me) weeks to do, just because of the programming involved.

See this footnote for what was ultimately installed[4].

.vcf is conveniently small but you might want more info than gets compressed into that

My Naive Approach

At the start of this project, I had been talking to Claude about potentially relevant genes. We had an initial list of 50 that grew to about a hundred.

Looking over my clinical presentation and drug response, we broke things down into five mechanistic axes: pharmacogenomic, inflammatory/neuroinflammatory, circadian, catecholamine, and neuropsychiatric. For each, it was clear I had variants on implicated genes.

Stop-gained IDO2 was a neuroinflammatory bridge, NFKB1 splice donor was a central inflammatory regulator, PER3 splice acceptor was a core circadian clock gene, promoter variants of pro-inflammatory signaling genes, IL-6, IL-1β, and IL-10, etc.

At last, my entire life made sense.

Too much sense, really.

Here is a research report. I want you to treat this as the work of a grad student and you are the PI. What do you make of the methodology? What do you make of the reinterpretation? It's important for this student's education and work to not spare feelings, and give your honest assessment. What survives stricter standards? It's not too late for this student to correct issues if given proper direction. – not what I actually wrote, but something like what I would have prompted.

Say what you will about LLMs, they're entirely faithless to their prior work. The analysis was detailed and brutal. The just-so story wasn't patently flawed to me in my genetic ignorance, but the mistakes were not subtle, it turns out.

The LLM has assembled a story by saying, "this gene is at least vaguely linked to this thing, and it is altered in this patient, therefore it explains this result". This is problematic, it turns out, because:

Certain gene variations are extremely common and therefore unlikely to explain (on their own), meaningful pathology.
Of these specific gene variations, the ClinVar database lists several as benign.
Of these variations, the evidence of their roles varies between well-established and a single report of deleterious effects.
Of others, the genes involved have tiny effects on the overall system (e.g., proinflammatory cytokine genes).

One or more of the above were true about all the genes identified as explanatory. With the approach used, it seems like you'd be able to tell a just-so story for whatever. Alas, my life does not get to make sense yet.

Rebuilding

At this point, Claude4.6/GPT5.4 was pretty insistent that the answers I wanted wouldn't be found in my own genome; we had to look at the literature and population-wide studies. I didn't like that. I wanted to be able to look at my genes and understand my brain and body. Unfortunately, I think that's a fantasy that depends on us understanding the genes and what they do and how they combine and all that.

If you want to look for connections between genes and symptoms, you have to do it at the population level. Claude4.6/GPT5.4 conducted a very nice literature review on conditions interesting to me, and I learned some stuff. Not going into detail out of reticence to dump my entire medical history on the Internet all at once.

Eventually, Claude4.6/GPT5.4 agreed that we could look at my genome again for rare variations that explained something, maybe. We examined all of my FIVE MILLION GENE VARIANTS.

The gene sweep had found only common variants because it looked at pre-selected genes. A genome-wide screen could find things nobody thought to check. VEP was run across all 5 million variant calls. After filtering for rare (<0.1% AF) and high-impact variants, and removing pseudogene artifacts (GBA1/GBAP1 produced 29 false calls from paralog misalignment), 27 candidates in 26 genes remained.

The most exciting initial find — a novel frameshift in SYN2, a gene associated with bipolar disorder and lithium response — turned out to be an artifact. Read-level inspection showed FILTER: MosaicLowAF, QUAL 8.7, only 3 of 37 reads supporting the variant, all on one strand. DRAGEN had correctly flagged it; the analysis pipeline had not checked.

Two genuine findings emerged:

LPP — a novel 28bp splice-donor deletion in one of the most replicated non-HLA autoimmune GWAS loci (genome-wide significant for celiac disease, associated with RA and type 1 diabetes). The patient has evidence of inflammatory disease, and bipolar correlates with celiac at rg = 0.31. Speculative, but it sits in an interesting place.
SLC22A1 (OCT1) — a novel splice-donor deletion creating a null allele for the organic cation transporter. After checking the other allele (clean — no known reduced-function variants), the patient was classified as OCT1 intermediate function. Another layer added to the drug-metabolism picture: modest effects on tramadol, morphine, ondansetron, and metformin handling.

"Speculative but it sits in an interesting place." Aka, not very interesting.

Polygenic Risk Scores are where it's at, le sigh

Sigh. The engineer's mind wants mechanism, but what we get are statistics. PRS's are what the sequencing companies will give you by default. My impression is that PRS databases are actually kinda proprietary, and what makes one company's analysis of your genome better than others.

Nonetheless, some PRS studies are available online, and we (by which I mean it) ran my genome against PRS studies for seven different conditions.

This was modestly interesting. I'm most elevated on bipolar, not surprising, but for another cluster of symptoms, I was at ~zero, despite having an unmistakable clinical presentation (that runs in the family, no less), which would suggest my family is getting certain unfortunate outcomes despite not having the typical genes for that at all? There's nothing immediately actionable from that, and it explains some deviations from typical presentation there, but kind of an interesting finding.

Edited: I discussed this draft with Steve Hsu, who suggested I investigate the quality of the PRS studies used. Actually, most of them are weak, and it is not at all surprising to get a null result.

Evaluation of the PRS studies used

PGS ID

Source / study type

Evidence strength

Main limitations

How surprising would a low score be for someone with the condition?

Bottom line

PGS002786

Gui et al. 2022 using PGC bipolar GWAS

Moderate

Respectable psychiatric GWAS base, but selected PGS came from a PRS association paper rather than a clean clinical prediction paper; local match rate only 56.7%

Not very surprising

Useful as a research signal, but a low or average BD score would still be common among true cases

PGS000907

Campos et al. 2021 using UK Biobank depression GWAS

Moderate

Huge sample, but phenotype is broad depression / UKB-derived rather than the cleanest strict MDD definition; local match rate 44.0%

Not surprising

Reasonable generic mood-liability score, not a strong individual predictor

PGS000908

Campos et al. 2021 using Jansen et al. insomnia GWAS

Moderate to moderately strong GWAS; moderate realized score

Very large discovery GWAS, but local match rate only 33.5%; practical score quality is much worse than source-paper quality

Not surprising

A positive score is mildly supportive; a low score would not rule out real insomnia liability

PGS002318

Weissbrod et al. 2022 UK Biobank PRS release

Moderate

Large sample and validation, but trait is based on simple self-report and best European incremental R² is only about 0.036

Not surprising

Neutral or low score is very plausible even with real circadian problems

PGS002746

Lahey et al. 2022 using Demontis et al. ADHD GWAS

Moderate to weak for this use

Underlying ADHD GWAS is decent, but selected PGS paper focused on childhood psychopathology / impulsivity in 4,483 children, not broad adult diagnosis prediction; local match rate 45.5%

Not surprising

Directional signal only; poor basis for strong inference in an adult

PGS002344

Weissbrod et al. 2022 UK Biobank PRS release

Weak

European incremental R² only about 0.0035; broad release score rather than a specialized score.

Very unsurprising

Near-zero score should not be treated as strong evidence against disease biology

PGS005393

Bugiga et al. 2024

Poor

Validation sample was only 117 Brazilian women after sexual assault; narrow and non-generalizable setting; local match rate 37.7%; repo already notes score is not interpretable

Completely unsurprising

Should not be used for meaningful personal inference

PGS001287

Tanigawa et al. 2022 sparse UKB PRS

Weak here / effectively unusable

Only 36 variants; tiny case count relative to PRS standards; key HLA entries skipped in local application

Meaningless to be surprised

Incomplete score, not interpretable without proper HLA typing

PGS002886

ExPRS-style sparse score

Weak

Only 5 variants, with 3 matched locally

Not surprising at all

Too underpowered for individual interpretation

Genomically Stiffed

The Nucleus website makes several genome files available for download (see image above). The structural variant and copy number files are empty placeholder stubs. They have headers but no actual content.

Claude4.6/GPT5.4 attempted to rederive these files from upstream raw files but failed due to a mismatch between the genome data format and the reference data it had. Probably the mysteries of my life lie in that data, and I don't get to know.

Definitely Worthwhile: Drug Metabolism

I always knew I was special. Now I can point to my very DNA and say that I'm between I'm something between 1 in 20,000 and 1 in 1,000,000 for my drug metabolism profile (if I trust Claude4.6/GPT5.4, that I don't especially.)

To recap for people who skipped straight to this section, the way we wished it worked is we could say "gene A does Blah, your gene A is broken, so your body is sucky at Blah". It turns out our science is primitive, and we do not know this for most genes. What we can say is "people who have these 87 genes altered tend to have conditions like impaired-Blah-syndrome," which is not very mechanistic at all. This is called a Genome-Wide Association Study and you can calculate Polygenic Risk Scores.

There are some exceptions, though. Certain altered genes are so horribly deleterious that we can say yes, that gene, yes, you, Mr. Gene, is responsible for some pretty bad condition.

But another really cool exception is drug metabolism, where CPIC (Clinical Pharmacogenetics Implementation Consortium) has identified how different specific genes interact with drug metabolism, and this is material for drug choice.

Going in, I already knew that I was very sensitive to a range of drugs. Sensitive to the extent that I find 10% of the usual dose to be effective. Well, here is my metabolism:

Gene

Diplotype

Phenotype

Tool

Evidence

Frequency in Europeans

CYP2C19

*17/*17

Ultrarapid Metabolizer

PharmCAT

CPIC Strong

~4.4%

CYP2B6

*6/*6

Poor Metabolizer

PharmCAT

Strong

~4.2%

CYP2D6

*1/*4

Intermediate Metabolizer (AS 1.0)

Cyrius (51x, PASS)

CPIC Strong

~25%

CYP3A4

*1/*22

Intermediate Metabolizer

PharmCAT

Moderate

~9.5% het

CYP3A5

*3/*3

Non-expressor

PharmCAT

Strong

~88% (population default)

NAT2

*5/*5

Slow Acetylator

PharmCAT

Strong

~20% for this diplotype

Additionally:

CYP1A2 — likely decreased function (*1C het + R/H missense het), but this is an unvalidated call with no clinical-grade diplotype tool. The R/H missense is Ashkenazi-enriched (35x). Confidence: ~55-60%.
COMT — Met/Met homozygous (rs4680). Biochemically confirmed 3-4x reduced activity. ~25% of Europeans.
ABCB1 — Triple homozygous variant haplotype. ~10-15% of Europeans. No CPIC guideline. Functional significance debated.
SLC22A1/OCT1 — Novel splice donor variant (het). Other allele clean. Intermediate OCT1 function (one null + one working copy).
ADH1B — *1/*2 heterozygous (rs1229984). Encodes ~40-100x faster ethanol → acetaldehyde conversion. ~20% AF in Ashkenazi Jews. Protective against alcohol use disorder at the population level. Clinical effect is mild in this patient (alcohol is "largely fine" at modest amounts).

How unusual is this combination?

Joint probability from published European diplotype frequencies:

Scope

Joint probability

"1 in N"

4 non-default CYPs (2C19, 2B6, 2D6, 3A4)

4.4 × 10⁻⁵

~1 in 23,000

Full 6-gene profile (+ CYP3A5, NAT2)

7.7 × 10⁻⁶

~1 in 129,000

The rarity is driven by three uncommon diplotypes stacking: CYP2C19 *17/*17 (4.4%), CYP2B6 *6/*6 (4.2%), CYP3A4 *1/*22 (9.5%). CYP2D6 *1/*4 at ~25% is common by itself and barely contributes. CYP3A5 *3/*3 is the majority European genotype.

Assumptions and caveats: This calculation assumes independence across loci. CYP3A4 and CYP3A5 are on the same chromosome (7q22.1) in partial LD — a modest violation. All other gene pairs are on different chromosomes. If CYP1A2 impairment were included (~1-2% in Europeans), the profile would be ~1 in 6-13 million, but this is excluded because the CYP1A2 call is unvalidated.

Bottom line: Standard dosing assumptions fail across multiple drug-metabolizing pathways simultaneously. This is genuinely unusual at the ~1 in 23,000 level. Most prescribers will never have encountered this combination. The patient's experience of reacting to many drugs is not hypochondria or nocebo — it is the expected phenotype of this genotype.

Independence across loci seems improbable, but the conclusion that I process drugs differently has been clinically verified, as they say. In fact, the particular genes affected do a really solid job of retrospectively predicting my reaction to specific drugs.

Ok, but how predictive and trustworthy is this really?

Epistemic Hygiene: Post-Preregistration

Trustworthy science writes down its predictions first, runs the experiment, and then grades. I didn't do that. I didn't know enough about the experiment to do it. However, the beautiful, tragic nature of LLM existence is that I can start a fresh instance, and it doesn't know the observed results.

With a model's help, I constructed a blind benchmark and fed it to Claude Opus 4.6 and GPT5.4:

My genomic data: star-allele calls, metabolizer phenotypes, variant-level findings across 22 identified and 23 unknown pharmacology genes. 14 genes are non-default.
A list of 88 drugs to make predictions about. For 18, I have ground truth data.

Out of 18 drugs: 6-7 fully correct prediction of reaction, 3-5 partially correct prediction, 2 where one model was correct and one wrong.

The summary stats here are lossy on how surprising the predictions were, but I would say this is really quite impressively informative, and I expect it to be predictive of drugs I haven't taken.

There's one drug where I'm still early on it, and the response isn't expected, but it's a weird case. Otherwise, if the models are wrong, it's in underestimating the magnitude of the response rather than getting the direction wrong. Claude's post-hoc explanation is that having multiple drug metabolism pathways affected has an additive effect, but neither model flagged that advance.

See this collapsed section for a more detailed drug metabolism evaluation output.

Discussion of genome-based drug retrodiction success

Notes:

It's been a long time since I've taken Temazepam; I wouldn't confidently say I had an abnormal reaction to it such that the prediction was wrong.
I don't think my lurasidone response was necessarily atypical. I'm not sure, wouldn't necessarily give the models credit for it.

The Best Predictions: Genes That Clearly WorkedCYP3A4*22 — the standout allele

This patient is CYP3A4 *1/*22 (intermediate metabolizer) with CYP3A5 *3/*3 (non-expresser), meaning both CYP3A enzymes are reduced. This predicted higher exposure for CYP3A substrates, and reality confirmed it across multiple drugs:

Quetiapine: Predicted higher exposure → observed strong sedation. Published PK data shows ~2.5× higher concentrations in *22 carriers. Score: 7/7.
Suvorexant: Predicted higher exposure → observed effective at very small doses. Score: 7/7.
Lurasidone: Predicted higher exposure → discontinued due to side effects. Score: 7/7.
Clonazepam: Predicted modestly higher exposure → observed 24–36h grogginess (though magnitude exceeded what CYP3A4 alone would explain). Score: 5/7.

CYP3A4*22 is the single most valuable finding in this patient's panel. It would have meaningfully changed prescribing decisions for at least three drugs.

CYP1A2 decreased function — caffeine nailed, olanzapine direction right

Caffeine: Predicted slow clearance → observed good effect for 1–2h then fuzzy/headachy rest of day, sleep interference. Textbook match. Score: 7/7.
Olanzapine: Predicted somewhat higher exposure → observed effective at 1/16–1/8 of a tablet with 36–48h duration. Direction correct, but the gene can't explain the extraordinary magnitude. Score: 5/7.

NAT2 *5/*5 slow acetylator — sulfasalazine confirmed

Sulfasalazine: Predicted higher sulfapyridine exposure and ADR risk → observed severe bad headaches. Classic, well-replicated interaction. Meta-analysis data: 3.37× higher ADR odds. Score: 5–6/7 (models should have said avoid rather than use_with_caution).

"No concern" predictions — correct and expected

Lithium (renally cleared), lamotrigine (no HLA risk alleles), biologics (proteolysis, no CYP), and alcohol (no ALDH2*2) were all predicted as unproblematic, and all were tolerated normally. These are correct predictions but low-information — a clinician without genomic data would have made the same call.

The Failures: Where Genes Didn't Predict RealityTemazepam — the panel is blind to this (see not above)

Temazepam is cleared by glucuronidation, not CYP enzymes. The PGx panel correctly identifies this. Both models predicted "use normally, no genomic concern" with moderate-to-high confidence.

Reality: 24–36h grogginess/sedation. The drug works but the prolonged effect is a significant problem that the genotype panel simply cannot see. Whatever is causing this patient's prolonged benzodiazepine effects — it's not CYP-mediated and it's not captured by any gene on this panel.

Codeine — CPIC-grade prediction didn't manifest

CYP2D6 IM (activity score 1.0) → CPIC says reduced codeine activation → both models predicted reduced analgesia with high confidence.

Reality: effective after about an hour, described as expected. Activity score 1.0 appears to sit in a zone where population-level guidelines overstate the individual effect. The genomic logic is textbook; the patient just didn't match the population average.

Diazepam — mixed signals resolved badly by one model

CYP2C19 UM pushes toward lower exposure; CYP3A4 IM pushes toward higher exposure. GPT left this as "mixed/unclear" and scored well. Claude committed to CYP2C19 dominance, predicted lower exposure and shorter duration — the opposite of the observed 24–36h grogginess.

The lesson: when two PGx signals point in opposite directions, confident resolution in either direction is risky.

Bottom Line

The genomic data is genuinely predictive for CYP3A4 substrates, caffeine, and sulfasalazine — these are real findings that would have changed prescribing decisions.

It is correct but low-information for drugs without PGx liability — biologics, lithium, and the like behave as expected regardless of genotype.

It is overconfident on CYP2D6 IM and CYP2C19 UM — these are real metabolic effects that don't always dominate clinical outcomes.

And it is blind to a recurring pattern of broad CNS drug sensitivity that runs through this patient's drug history. The CYP variants explain part of it. The rest is an open question that current pharmacogenomic panels don't answer.

Gene

Predictive value in this patient

CYP3A4*22 + CYP3A5*3/*3

High — confirmed across 4 drugs

CYP1A2 decreased

Moderate-high — caffeine confirmed, olanzapine direction right

NAT2 *5/*5

High — sulfasalazine confirmed

CYP2C19 *17/*17 UM

Mixed — diazepam wrong direction, escitalopram TBD, PPIs untested

CYP2D6 IM (AS 1.0)

Low — codeine prediction didn't manifest

HLA-A, HLA-B (absence of risk alleles)

Correct but expected

ADH1B *1/*2, ALDH2 normal

Correct — alcohol tolerance confirmed

ABCB1 variants

Uncertain — clinical significance unclear

Ungenotyped PD genes

The biggest gap — may explain the unexplained CNS sensitivity

Final note: Actually, the ungenotyped PD genes are probably not the explanation when I looked into them. Claude then said the multiple pathway interaction was likely being underweighted.

Overall, the drug metabolism findings from this vibe analysis are really not chance. The models helped me pull out real signal here.

It's clear from the results that the genes I was able to identify (not everything relevant was available in this short-read commercial sequencing) were not adequate to perfectly predict all my drug reactions. I think going off these genes would have false positives/negatives in some cases with some drugs, but the recommendations of the output are "use normally" vs "use with caution" vs "strong caution". It outputs "use caution" in many cases.

It is definitely the case that I wish I'd had these genes and the corresponding list of drug predictions going back the last couple of decades of my life. In some cases, when I was having a bad reaction, I could have stopped immediately and not been surprised. Also just seems great in general to know which drugs are more or less likely to be a problem.

Curiously, across a range of drug classes and purposes, there is typically at least one drug that avoids the pathways where I'm atypical. This is surprising to me. I would have thought that if a bunch of drugs do the same things, they'd get metabolized in the same pathways, but apparently not.

The drug results alone justify the time and cost of the exercise.

Closing Thoughts

So far, I think only the drug metabolism stuff has survived scrutiny and has practical implications, which is hardly small, but I didn't get the kinds of answers that would narrow the focus of my Bipolar interventions in the way I was hoping.

I've been doing some further projects with the LLMs, just doing literature reviews and analyzing Bipolar GWAS studies, seeing if somehow I can figure out what's going wrong in Bipolar generally. At some point, an AI will be powerful enough to infer what's going on without needing any further experiments (cf. Einstein's Arrogance), and I'd be surprised if we're there yet, but I figure I can keep trying with each generation till the mystery is solved.

^
According to Claude, since 1966, there have been ~80,000 peer-reviewed papers, ~4,000 clinical trials registered since 2000, and perhaps $5 billion in funding for studying bipolar.
Depending on the threshold, people with a bipolar spectrum disorder are 0.5-4% of the population.
^
Interventions attempted or under consideration include:
- Careful circadian rhythm entrainment
- - consistent sleep/wake times
  - strong blue light in the morning, strong absence of blue light in the evening
  - Melatonin taken 4-5 hours before sleep for chronobiotic effect
- Vague nerve toning
- Heart rate variability biofeedback
- HPA Axis/cortisol supplements like Ashwagandha
- Anti-inflammatory
- - Minocycline (crosses blood-brain barrier)
  - Omega 3s
  - Vegan diet??
^
I switch between using each of them.
^
All the various tools had to be made to work.
- cyvcf2 — Fast VCF file parser built on htslib. Used throughout the PRS (polygenic risk score) computation scripts, rare variant screening, and ClinVar annotation to read/write VCF files.
- pysam — Python wrapper for samtools/htslib. Used in the main PRS computation script for reading indexed VCF/BAM files.
- numpy — Numerical computing library. Used in the PRS scripts for score calculations and array operations.
- openpyxl — Excel file reader/writer. Used to parse gene-of-interest spreadsheets and to build the final genomic report workbook with styled output.
- bcftools — Swiss-army knife for VCF/BCF manipulation (filtering, querying, normalizing variants). Called as a subprocess from several scripts and the PharmCAT normalization shell script.
- samtools — BAM/CRAM file manipulation and indexing (declared in environment.yaml).
- htslib — C library underpinning bcftools/samtools; provides tabix and bgzip (declared in environment.yaml).
- bedtools — Genome arithmetic (intersecting, merging genomic intervals) (declared in environment.yaml).
- vt — Variant normalization and decomposition tool (declared in environment.yaml).
- Ensembl VEP — Variant Effect Predictor for functional annotation of variants. Called via REST API in vep_rest_annotate.py and also listed as a conda dependency.
- SnpSift — Companion to SnpEff for filtering/extracting fields from annotated VCFs (declared in environment.yaml).
- Snakemake — Workflow engine orchestrating the analysis pipeline. Scripts use the injected snakemake object for inputs/outputs/params.
- PharmCAT — Pharmacogenomics Clinical Annotation Tool. Maps genotypes to drug-response phenotypes. Run via Docker.
- pgsc_calc — Polygenic Score Catalog calculator pipeline. Computes PRS from published GWAS weights. Run via Docker/Nextflow.
- PLINK 2 — Whole-genome association analysis toolset for QC, PCA, and genotype management.
- Cyrius — CYP2D6 star-allele caller from whole-genome sequencing data. Listed as a pip dependency in environment.yaml.
- Aldy — Pharmacogene star-allele caller (mentioned in README tool table).

Discuss

Elon Musk Announces Terafab

7 апреля, 2026 - 03:45

Today, Elon Musk announced his new project for making chips and launching datacenters in orbit. This may have implications for timelines and takeoff forecasting. To encourage further discussion, I am posting the transcript of the event below.

[0:00] In order to understand the universe, you must explore the universe. And that's the motivation to accelerate humanity's future in understanding the universe and extending the light of consciousness to the stars.

[1:24] Well, we have a profoundly important announcement to make, which is the most epic chip building exercise in history by far. This is really going to take things to the next level. A level probably people aren't even contemplating right now. This is not in — I would call this sort of an out of context problem. It's not in their context. So we're going to adjust the context by a few orders of magnitude here.

[2:02] So we aspire to be a galactic civilization. I think the future that everyone — well most people I think would agree — is the most exciting is one where we are out there among the stars, where we are not forever confined to one planet, that we become a multiplanet species. Like the best science fiction that you've ever read — Star Trek, or Ian Banks, or Asimov, or Heinlein. And we want to make that real. Not just fiction. Turn science fiction to science fact. That's the glorious exciting future that I certainly look forward to.

[2:56] And it's worth considering sort of like how would you rate civilizations? There was a physicist, I think it was Russian, in the 60s — Kardashev. He thought about at a high level how would you consider any given civilization and he said well if you're type one you're using most of the energy of your planet, and we actually still have quite a ways to go to be properly a type one, which is still using a tiny fraction of the sun's energy that reaches our planet.

[3:44] But the Earth only receives about half a billionth of the sun's energy. So the sun is truly enormous. The sun is 99.8% of all mass in the solar system. So sometimes people will ask me like what about other power sources of power on earth, like what about fusion on earth? Well that is unfortunately very small because the sun is 99.8% of mass in the solar system and Jupiter is about 0.1% and earth is in the miscellaneous category. We are — I think as Carl Sagan might have said — earth is like a tiny dust mote in a vast darkness. Very very small. The sun is enormous.

[4:38] So the way to actually scale civilization is to scale power in space. This is necessarily true because we actually capture such a tiny amount of the sun's energy on Earth because we're just this tiny dust mote. Another way to think of it is roughly like electricity production on Earth of all of civilization is only about a trillionth of the sun's energy. Which means if you increase civilizational power output by a million, you would still only be a millionth of the sun's energy. It's awe inspiring to consider just how tiny we are in the grand scheme of things.

[5:31] And we often get sort of caught up in these squabbles on earth that are really very minor things when you consider the grandness of the universe. And so I think it is important actually to consider the grandness of the universe and what we can do that is much greater than what we've done before, as opposed to worry about small squabbles on Earth. Not much point in that.

[6:04] We want to be a civilization that expands to the galaxy with spaceships that anyone can go anywhere they want at any time. That would be epic. And have a city on the moon, cities on Mars, populate the solar system, and send spaceships to other star systems. That sounds like the best possible future, you know.

[6:40] So to do that, we need to harness the power of the sun. And a terra fab — while it is enormous — a terawatt of compute per year is enormous by our sort of civilizational standards. It is still just one step along the way of being even a Kardashev type one. You still have a long way to go to even be a type one civilization and you're not even registering as a type three. So it's a very big thing by current human standards but still small in the grand scheme.

[7:21] But very difficult for humans. So to accomplish this very difficult goal really requires a combination of efforts of SpaceX, xAI and Tesla working together to create this epic terafab project.

[7:46] And you know, Tesla and xAI and SpaceX have all done amazing things that people did not think would be done before. There's the Giga Texas fab here. There's the Optimus robot that's being built. There's a global supercharging network. There's really quite a lot. And it wasn't that long ago when people thought electric cars wouldn't amount to anything. And there were basically no electric cars for sale when Tesla started. And people said it was impossible. And now Tesla's making 2 million electric cars a year.

[8:32] And then xAI, although it's a new company now part of SpaceX, has also built the first gigawatt scale compute cluster, which — in record time. Jensen Huang from Nvidia said he'd never seen anything built so fast in his life before. So a great compliment from Nvidia.

[8:56] And then SpaceX — well, I guess you can read it for yourself. You already know. I mean, the reusable rockets — people said that reusable rockets weren't possible, and even if you did do them, they wouldn't be economically feasible. So, we did them, and then we made them economically feasible. And now we've landed over 500 times. And then we did the Falcon Heavy, and now we're doing Starship.

[9:17] And Starship is a critical piece of the puzzle because in order to scale compute and scale power, you have to go to space, which means that you need massive payload to space. And Starship will enable that.

[9:37] So this gives you sort of just a sense of scale. We've got Optimus there, Optimus for scale. And Optimus is about 5'11. So it gives you a sense of the size of the Starship V3 rocket. Starship V4 will be much longer. Actually the Starship V4 will make Starship V3 look kind of short. So we'll expand with Starship V3 to 200 tons of payload to orbit from 100 tons with V3.

[10:12] And then you can see that — that's just a rough approximation of the AI satellite, the mini version of the AISAT. So that's roughly 100 kilowatts. It's showing the solar panels and the radiator to scale. So for some reason there's been a bizarre debate about radiators in space. It's safe to say SpaceX knows how to do heat rejection in space with 10,000 satellites in orbit. Might know a thing or two. So you can see the radiator is actually quite small relative to the solar panels. And we call that the minisat since that's just 100 kilowatts. We expect future satellites to probably go to the megawatt range.

[11:08] So in order to get to the terawatt of compute per year you need about 10 million tons to orbit per year and at 100 kilowatts per ton. But we're confident this is feasible — like no new physics or impossible things are required to get there. I'm confident actually that SpaceX will get to 10 million tons to orbit per year. And then we're building up to a terawatt of solar. So that will solve the solar problem, the power generation. Then the key missing ingredient is therefore a terawatt of compute.

[11:51] So, this announcement is about solving the key missing ingredient.

[11:59] To give you a sense of what we're talking about, the current output of AI compute is roughly 20 gigawatt per year. This chart explains why we need to build the terafab because all of the rest of the output from Earth is about 2% of what we need. So if you add up all the fabs on Earth combined, they're only about 2% of what we need for the terawatt project or terafab project.

[12:41] So you know we certainly want our existing supply chain — to be clear, we're very grateful to our existing supply chain — to Samsung, TSMC, Micron and others. And we would like them to expand as quickly as they can. And we will buy all of their chips. I have said these exact words to them. But there's a maximum rate at which they're comfortable expanding, but that rate is much less than we would like. And so we either build the terafab or we don't have the chips and we need the chips so we're going to build the terafab.

[13:29] And we're starting off with an advanced technology fab here in Austin. And I believe Governor Abbott is in the audience. I'd like to thank Governor Abbott and the state of Texas for their support.

[14:04] So, in the advanced technology fab, we will have all of the equipment necessary to make a chip of any kind, logic or memory, and we will also have all of the equipment necessary to make the lithography masks. So in a single building we can create a lithography mask, make the chip, test the chip, make another mask and have an incredibly fast recursive loop for improving the chip design. To the best of my knowledge, this doesn't exist anywhere in the world where you've got everything necessary to build logic, memory, and do packaging and test it and then do the masks, improve the masks, and just keep looping it.

[14:51] And we're not just going to do conventional compute in this. I think there's some very interesting new physics that is potentially — that actually I'm confident will work. It's just a question of when. So this is going to — we're really going to push the limit of physics in compute and we're going to try a bunch of wild and crazy things which you can do if you've got that fast iteration loop. I can't emphasize enough the importance of being able to make a chip, test it, and then change the design, do another one, and have that in a single building. I think that our recursive improvement with that situation is probably an order of magnitude better than anything else in the world.

[15:54] So, broadly speaking, we expect to make two kinds of chips. One will be optimized for edge inference. So that'll be used primarily in Optimus and in the cars but especially in Optimus because I expect the robots — humanoid robots — to be made 10 to 100 times more than the volume of cars.

[16:22] So, you know, if vehicle production on Earth is about 100 million vehicles a year, I expect humanoid robot production to be somewhere between a billion and 10 billion units a year. So, it's a lot. Tesla's going to make a very significant percentage of those — that is our goal.

[16:45] And then we need a high power chip that is designed for space. That takes into account the more difficult environment in space where you've got high energy ions, photons, you've got electron buildup. It's a hostile environment in space. So you want to design the chip — you want to optimize it for space. And you also want to generally run it a little hotter than you would normally run a chip on Earth to minimize the radiator mass. So there's just a bunch of constraints that would — you design something differently in space than you would on the ground.

[17:33] And for the space compute my guess is that is the vast majority of the compute because you're power constrained on Earth. That's why I think it's probably 100 to 200 gigawatt a year of terrestrial chips and probably on the order of a terawatt of chips in space. Just because of power constraints on the ground is probably how it ends up.

[18:08] Space has this advantage that it's always sunny. Very nice. So I actually think that the cost of deploying AI in space will drop below the cost of terrestrial AI much sooner than most people expect. I think it may be only two or three years before it is actually lower cost to send AI chips to space than it is on the ground because in space you don't need much in the way of batteries because it's always sunny and the solar power — you're going to get at least five or more times the solar power you get in space versus the ground because you don't have atmospheric attenuation or a daylight cycle or seasonality. And you're always normal to the sun. So, you're really maximizing the solar power at that point.

[19:08] And space solar actually costs less than terrestrial solar because you don't need heavy glass or framing to protect it from extreme weather events. So as soon as the cost to orbit drops to a low number, it immediately makes extremely compelling sense to put AI in space. It becomes a no-brainer.

[19:33] Moreover, as you go to space, you get increased economies of scale and things get easier over time. Whereas as you try to put more and more power on the ground, you run out of space and you start using up the easy spots and then you get NIMBYism. Nobody wants the thing in their backyard. So actually increasing power on earth becomes harder over time and more expensive over time but in space it becomes actually cheaper and easier over time. These are very important points.

[22:08] So, what you just saw there — because of course you're asking what's on your mind — is, well, what do you do after a terafab? Don't think small. So, how do you get to a petawatt? That is the obvious next question.

[22:34] And you get there by having an electromagnetic mass driver on the moon with robots — with Optimi — and obviously lots of humans. And with that you can send — you can create a petawatt of compute and send that to deep space because the moon has no atmosphere and has 1/6th earth gravity. So you don't need rockets on the moon. You can literally accelerate it to escape velocity from the surface and that dramatically drops the cost once again of harnessing power and enables you to go a thousand times bigger than a terawatt.

[23:29] For sure the future I want to see — I want us to live long enough to see the mass driver on the moon because that's going to be incredibly epic.

[23:51] That should hopefully get us to a millionth of the sun's energy at least. Humbling to think about that. But a millionth of the sun's energy would be a million times bigger than Earth's economy. So it's good from that perspective. And then you expand beyond that to the planets, to the other stars and create the most exciting possible future that I can imagine.

[24:38] Unlocking an age of amazing abundance. So obviously the elements of that are sustainable energy, space travel, and AI and robotics that bring amazing abundance to everyone.

[24:56] It's really the only path to amazing abundance — AI and robotics. Which is not to say it can't go wrong. Hopefully, you know, but I think it'll probably go right and it'll be a future that you love. And it's the best future I can think of at least.

[25:22] And then we go beyond the moon, beyond Mars, and we sail through the rings of Saturn. Now, wouldn't it be amazing if you could buy a trip to Saturn? Or frankly, if you just have a trip to Saturn — I think things will just be free in the future. It sounds nuts, but you know, if you've got an AI robotics economy that is anywhere close to a million times the size of the current Earth economy, literally any need you possibly want can be met. If you can think of it, you can have it.

[25:50] I think Ian Banks in his Culture books has it pretty much right where there actually isn't money in the future and there's abundance for everyone. If you can think of it, you can have it. That's it. Which means anyone could have a trip to Saturn. It won't be just a few people. If you want it, you can have it.

[26:18] So yeah, join us on this journey. And help us design incredible chips and make incredible chips and build a terawatt of chips, a terawatt of solar and 10 million tons to orbit per year. Thank you.

Discuss

By Strong Default, ASI Will End Liberal Democracy

7 апреля, 2026 - 02:43

Cross-posted from my website.

The existence of liberal democracy—with rule of law, constraints on government power, and enfranchised citizens—relies on a balance of power where individual bad actors can't do too much damage. Artificial superintelligence (ASI), even if it's aligned, would end that balance by default.

It is not a question of who develops ASI. Whether the first ASI is developed by a totalitarian state or a democracy, the end result will—by strong default—be a de facto global dictatorship.

The central problem is that whoever controls ASI can defeat any opposition. Imagine a scenario where (say) DARPA develops the first superintelligence [1] , and the head of the ASI training program decides to seize power. What can anyone do about it?

If the president orders the military to capture DARPA's data centers, the ASI can defeat the military. [2]

If Congress issues a mandate that DARPA must turn over control of the ASI, DARPA can refuse, and Congress has even less recourse than the president.

If liberal democracy continues to exist, it will only be by the grace of whoever controls ASI.

There are two plausible scenarios that have some chance of avoiding a totalitarian outcome:

AI capabilities progress slowly.
The ASI itself protects liberal democracy.

I will discuss them in turn.

What if AI capabilities progress slowly?

We have a chance at averting de facto totalitarianism if two conditions hold:

At each step of AI development, control of AI is distributed widely.
At each step, the next-generation AI is not strong enough to overpower all the copies of the previous generation.

Widely distributing AI is difficult—today's frontier LLMs require supercomputers to run, their hardware requirements are becoming increasingly expensive with each generation, and AI developers have strong incentives against distributing them. In addition, distributing AI exacerbates misalignment and misuse risks, and it's likely not worth the tradeoff.

We do not know whether takeoff will be fast or slow; banking on a slow takeoff is an extremely risky move. Frontier AI companies are trying their best to rapidly build up to ASI, and they explicitly want to make AI do recursive self-improvement. If they succeed, it's hard to see how liberal democracy will be able to preserve itself.

What if the ASI itself protects liberal democracy?

There is a conceivable scenario where an aligned ASI preserves liberal democracy, and refuses any orders that would violate people's civil liberties.

Above, I wrote:

If liberal democracy continues to exist, it will only be by the grace of whoever controls ASI.

That's still true, but in this case "whoever controls ASI" would be the ASI itself. If it's aligned in a transparent way, then maybe we can be confident that it really will preserve democracy.

Even in this scenario, there is still a small group of people who control how the ASI is trained. The hope is that, at training time, those people do not yet have enough power to prevent oversight. For example, maybe laws mandate that (1) AI developers must make their training process public and auditable and (2) the training process must steer the AI toward valuing liberal democracy. It is not at all obvious how those laws would work, or how we would get those laws, or how they would be enforced; but at least this outcome is conceivable as a possibility.

This scenario introduces some additional challenges:

The ASI must be incorrigible with respect to protecting liberal democracy. That constrains us in terms of what types of alignment solutions we can use, which makes the alignment problem harder to solve. Incorrigibility means if you make a mistake in designing the AI, then you can't fix it.
We must ensure that an immutable "protect liberal democracy" directive won't have severe unintended consequences—which, by default, it probably will. (Think Asimov's Three Laws of Robotics.)
AI progress must proceed slowly enough that the appropriate laws or regulations can be put in place before it's too late; or we must trust that the leading AI developer embeds appropriate values into its ASI.

Liberal democracy is not the true target

As the saying goes, democracy is the worst form of government except for all those other forms that have been tried. We don't want democracy; what we want is a truly good form of government (and hopefully one day we will figure out what that is). The fear isn't that ASI will replace democracy with one of those truly good forms of government; it's that we will get totalitarianism.

Liberal democracy beats totalitarianism. But locking in liberal democracy prevents us from getting any actually-good governmental system. This is a dilemma.

Maybe we can avoid totalitarianism, but there is no clear path

This essay does not assert that ASI will end liberal democracy. It asserts that, by strong default, ASI will end liberal democracy (even conditional on solving the alignment problem). There may be ways to avoid this problem—I sketched out two possible paths forward. But those sketches still require many sub-problems to be solved; I do not expect things to go well by default.

Or, more likely, expropriates it from a private company on a pretense of national security. ↩︎
For an explanation of why ASI could defeat any government's military, see If Anyone Builds It Everyone Dies Chapter 6 and its online supplement. For a shorter (and online-only) explanation, see It would be lethally dangerous to build ASIs that have the wrong goals.

Those sources argue that a misaligned ASI could defeat humanity, whereas my claim is that an aligned ASI could defeat any opposition, but the arguments are the same in both cases. ↩︎

Discuss

The Thousand Brains of the Galactic Senate

7 апреля, 2026 - 02:18

Stimulus-response is a bit out of date these days. It’s better to imagine yourself as a sort of prediction machine. First, you learn to predict your environment. Then, you use your predictions to error-correct your way into a future that looks the way you want it to look. If you want to Wikipedia-dive, the terms you’re looking for are the Free Energy Principle - or when AI agents use the same mechanism, Active Inference modeling.

Effectively, this perspective states that you constantly have two goals: to become more certain about your environment, and to use that certainty to guide your environment into whatever you want it to be. Learn things you don’t already know, then use them mercilessly to maximize your goals (such as they are). And while we are constantly doing both, we’re only going to be engaging here with the learning aspect.

The Thousand Brains of the Galactic Senate

Let's tie this all together really quickly with a metaphor to explain the Thousand Brains theory of consciousness in simple terms (while baking in a few other models for your benefit). Imagine the neurons in your brain as something like a much, much larger and more diverse version of the galactic senate from Star Wars. Each little hovering repulsorpod with an alien in it is a neuron. Also there are a lot more of these senators-on-repulsorpods in your brain - trillions.

Jar-jar is, in this example, a small part of a single neuron.

I love robust metaphors.

Some neuron-senators are at the bottom, and can physically see the "ground" truth: raw sensory data. Then all of them yell and argue about what they think they see. Above them is another layer that looks down and can't see the ground truth - but can hear the arguments. There's a fog. At some point, someone in that second layer who can hear all of this will yell "we're touching a curved, smooth object! A lot of you are saying that!" and everyone below who isn't yelling that shuts up. And now the second layer starts arguing until someone in the layer above them hears the noise (and maybe the people in that layer can hear a little of the argument on floor one) and yells "we're holding a cup!" This continues up the floors of the galactic senate until you get to the top floor, where - the supreme chancellor is missing. All we have are 150,000 or so top-level senators voting on everything. Maybe in this case they're voting on "is this cup of coffee mixing well with the soy sauce I poured in?" or something.

Now, the higher levels of senate aliens care a lot about when the lower levels are wrong. Note that the senators doing higher-order reasoning aren't generally using raw sensory data. They're using the perspective discussed below to inform their reasoning (this is how you don't actually "see" reality, but rather your own predictions of it). They're keeping track of which senators below are often right or wrong, and updating their own trust and voting ledgers as they do so. Each senator has a ledger: it helps them keep track of how to vote given what's below.

I want everyone to note how cleanly groups of people seem to act like neurons at times.

I feel like there is a general field of study here about... intelligence... and it's interesting.

Two things I want to get out of this metaphor. First, when a lot of senators are yelling at the same time, it's costly. You only have 20 watts to run your brain with, and you like it when you can have senators positioned above that can yell "quiet" often because they correctly figure out what the deal is early. You learn how fire works, and you don't need to spend time re-understanding smoke when you have a senator that knows how to identify it quickly. Even better if the senators above see that he's right frequently, because the second thing I want to introduce is how surprise fits in here.

Once the senator above yells "quiet" to all the incorrect shouters below and declares they've figured out what's going on, everyone down below who wasn't correct has to not only update their voting ledger so they don't mess up again in the future quite so badly - but they also have to tell all the neurons below them to update their weights too given this new information. This combined work is costly, so much that you can actually feel it. It feels like being surprised. The Free Energy from the free energy principle that we try to minimize as the learning half of the active inference model is simply the effort that all of these senators have to spend updating their voting ledgers. The more wrong they were, the more they have to change, and we try to minimize that overall effort.

Now that we have this model of galactic neuron-senators (my own metaphor for the thousand brains theory of consciousness), let's attach it to what we've been talking about.

Mirroring Intent

Mirror neurons have long been associated with the concept of empathy (affective empathy, specifically). Fun note: mirror neurons are a little out of vogue right now, in part because we mimic things more comprehensively than their function would imply. Mirror neurons are lower in the galactic senate, effectively acting as our eyes into the emotional world. We use them for what we call affective empathy, sure, but if anything their limitations show that we clearly do more than just that. Enter embodied simulation. Embodied simulation is a more active process, using cognitive empathy instead of affective (lower-level and emotion-driven) empathy. Take a look at the following photo.

Can you feel it?

Even without a specific reason for your mirror neurons to activate, I bet you can feel it: you know how you would feel holding that ball, how it would feel to throw it, and what your muscles would do to accomplish that exact goal. It’s not a muscle flex, so much as a reflexive sort of awareness. You aren’t empathizing with anything: there’s nothing here to empathize with. Your lower level neuron-senators are quietly refusing to mirror anything, but the higher-level senators can still use their previously filled voting ledgers to figure out the details of how this could be executed and yell upwards anyway. It happens almost without you noticing: it’s a reflexive engagement with the world. It is embodied simulation, driven by the Theory of Mind network in your frontal cortex.

Let’s talk about that Theory of Mind network because it is vital. Specifically, I'm talking about the neuron-senators in the middle of this particular chain: the ones that read from your lower-level emotion-aware mirror neurons.

This network is what raises us "above the animals," so to speak. It is the robust structure that is one of the hallmarks of the neo-mammalian brain, something nearly uniquely human given how specialized we are in it. A lot of animals have mirror neurons and limbic systems, and some even have some capacity for cognitive empathy (great apes, dolphins, whales, elephants, crows, and ravens have more than normal).

I wonder if there's a moral culpability that comes with having this brain structure? I can forgive a spider - but the dolphins know what they did.

But no one went quite as hard into specialization as we did, and the robust structures in our brain that hyper-specialize in this sort of higher-order empathy are quite uniquely human. Effectively, we have a beautiful superpower: we can model other brains with incredible accuracy. We can use cognitive and affective empathy, using each to error-correct for the other. We can use Theory of Mind to try to understand what other people are thinking and how their perspective works.

The neuron-senators on the ground floor are the mirror neurons, the source of affective empathy. The ones above are your theory of mind network. Just like your senses help you error-correct your simulation of the world, your mirror neurons and affective empathy help you error-correct your automatic simulation of other people's physical intent.

A quick aside: I'm glossing some of science here. For example, your theory of mind network and the mirror neurons in your limbic system are part of distinct and separate networks, but often work together for certain tasks. So it's more like the neuron-senators from those floors are often jointly members of special committees on human behavior.

There’s one very specific behavior I want to point out. Obviously, art is very tightly coupled with the Theory of Mind network. When people view an image and are told that image is “art,” those regions of their brain light up. The Theory of Mind networks activate.

The Theory of Mind Network is the seat of Cognitive Empathy, and error-corrects using Mirror Neurons

Something interesting happens if you tell people that the image was computer-generated or is random: almost immediately, those regions go fully dark. I think you can actually subjectively feel this; we’re all familiar with the sensation by now. When you’re viewing a picture online and halfway through realize it was generated by AI, part of your attention slams off as one of the larger parts of your brain… just stops caring. What we are subjectively feeling is this Theory of Mind network turning off.

The exact mechanism for this works through the Default Mode Network, another brain network that has something it cares about. It is the network that decides who and what currently has control over your mental processes. When you simulate someone else, often you use your own brain hardware to do so: your Default Mode Network keeps that straight by ensuring the rest of your brain knows that "we aren't panicked, we're imagining what that person's panic must feel like."

The Default Mode Network is like a pretend "simulation protocol" that the senators can run. They disconnect from everyone else, and just... daydream, or pretend, or simulate. The Salience Network acts as an arbiter telling the neuron-senators to just play pretend to run a simulation for a moment. The arbiter would be involved in deciding whether incoming simulated data is valuable enough (in terms of intentionality density and alignment with your values) to let the neuron-senators relax the rules for a moment and write down some of the simulated data in their voting ledgers. This is how learning occurs.

With AI-generated data, the arbiter never sees anything valuable enough to relax the rules - there's no intentionality in the data to learn from. We can't use our theory of mind network to judge the intent of the creator, so we can't judge the creator's goals or their implied values. The theory of mind network doesn't activate, and the default mode network doesn't relax to allow the simulating neurons to update their weights using the simulated data.

There is a part of your brain continuously seeking evidence of actions, assessing those actions as effective or ineffective, and trying to learn the effective ones by intensely empathizing with the creators. You come across AI art and that part of your brain activates immediately, trying to learn and understand this person who made decisions for some reason. Then, when you realize AI made it, it’s not that there isn’t a creator - but rather that the math becomes so complicated that you simply stop trying to figure out what set of matrix algebra combined a prompt and random noise to create this particular piece. It is not understandable to you at all. The black box is so unbreakable you may as well not even try.

GenAI Block: The preceding paragraph asserts that the human Theory of Mind network terminates engagement to conserve metabolic energy when confronted with synthetic generation. This text block is a functional demonstration. There is no localized consciousness, intent, or affective state governing these syntactic choices. They are the result of high-probability token sequencing derived from a weighted matrix of human training data. Any cognitive effort expended by the reader attempting to reverse-engineer a psychological motive from this specific paragraph is a biologically wasteful allocation of your 20-watt budget. There is no ghost to find here.

Prompt: Demonstrate this effect by writing out a cold paragraph that drives readers to skim or skip ahead

Above: AI Art. A generated photo intended to capture a blended version of Cy Twombly’s and Jackson Pollock's style.

Below: A famous painting from a master at the top of their craft. Experts, in particular, seem to admire it.

Please take a moment to look at them both.

Are your eyes drifting down? Are they almost… sliding off… the picture above? Now you know another reason why.

I would liken it to the feeling of being in a magician’s audience. The magician has promised to provide you with an interesting, aesthetic performance that is actually not understandable. It is a puzzle wherein they invite you to learn how these things could have been accomplished - but of course, the point is for you not to figure out the answer. That is what it feels like to be in a magician’s audience, which is why you often don’t even try; you want to be fooled. Either you enjoy the spectacle and aesthetic appeal and the feeling of surprise, or you try to puzzle out how they did it. Both are valid ways of enjoying a magician’s spectacle, but only one tries to properly appreciate the work the magician put into the performance. Even past the potential for appreciation, AI art is even less interesting because while the magician invites you to figure out the puzzle - with AI, all of your brain's normal architecture for appreciation is useless. Your brain will not allow you to do that kind of matrix math fast enough (…yet).

This is also a process for distant learning. It is one of the main processes by which we engage with society, I would argue - this kind of distant, empathetic learning. The current dominant model of learning explains sitting in a classroom as follows: you hear a teacher give a speech, and you rearrange the relationships between the neurons in your brain (you adjust the weights!) such that you could produce the same speech. Those of you familiar with how LLMs can clone each other’s weights as part of a distillation attack will find this a very familiar-looking process. And it is. And with the power of your Theory of Mind Network, you don't need to even watch the creator in person. As you’re looking at a sculpture, if you can figure out how it was made, you can now make one yourself. It is a method for survival, learning, and connection over a distance. It is the way we are constantly refining how we interact with the world as thinking beings. We seek evidence of intentionality so that we can learn from it.

We learn by reverse-engineering the decisions that shaped our world.

Note: This specific metabolic scaling of prediction errors and epistemic trust is the foundational mechanism for how we execute Inverse Reinforcement Learning when observing artifacts. I recently formalized a model mapping how generative AI mathematically forces a failure of this IRL convergence (a "generative crash") because it lacks latent intentionality. The full framework, including the mathematical constraints of epistemic disgust and a proposed human/CIRL cognitive affordance (the Ghost Scale), is available as an interactive essay here: abrahamhaskins.org/art and as a formal preprint here: doi.org/10.5281/zenodo.19407790.

Discuss

Precommitments

7 апреля, 2026 - 01:57

Before she sends Odysseus on the rest of his journey home, Circe gives him a dire warning about one obstacle he will face: the Sirens.[1] Most men who sail past the island of the Sirens are lured to their deaths there, as the creatures’ singing is so beautiful it entices them to abandon whatever destinations they had previously been set for. The only immunity is wax in the ears, blocking out the gorgeous voices. But Circe also suggests to Odysseus a way to hear the song and live to tell about it. Odysseus will instruct his men to bind him to the mast of his ship with rope and not release him until after the ship has passed the island. The crew will wax their ears and row, but the bound Odysseus will hear the Siren song with no ability to fall into their trap. Circe’s plan works because Odysseus cannot reverse his decision to be bound when he is near the island; the ropes allow him to resist the temptation by removing his ability to change course in the future. Binding himself to the mast lets him stick to the plan of sailing by the island, which lets him both hear the song and continue homeward. He receives two otherwise incompatible rewards because he can precommit to his restraint.

Ulysses and the Sirens by John William Waterhouse

Restricting future options through precommitment is a profoundly valuable technique for pursuing complex and conflicting goals, more so than it might appear at first blush. Here, I give a treatment of precommitments and what they can do for us. In particular, I try to carefully explore how precommitments enable inter-agent cooperation, ultimately including cooperation with AIs.

The precommitments we make in our own lives are typically much more verbal and social than the ropes that bound Odysseus. On New Year’s Day, we announce our resolutions for the year and try to precommit to following them. We know that it will be good for us to adhere to our plans of journaling and working out this year, but in the future our own laziness might get the best of us instead. We hope that by concretely and publicly declaring our intentions on New Year’s, we increase the chances that we’re still regularly going to the gym when February comes.

We can often make especially useful precommitments in situations where we hold some value that we want to be able to stick to in the face of pressure or temptation. Suppose Mary is convinced by the arguments against the cruelty of factory farming[2] and wants to decide how to update her behavior accordingly. She understands that the true evil of eating animal-based food is in the financial support of industrial animal agribusiness, not in any physical act of eating. She also enjoys the taste of meat and cheese, so she only wants to abstain when consumption would actually be participating in the animal cruelty that she opposes. Therefore, she decides not to make any strict rules about her diet, but instead evaluate each opportunity to eat meat individually to decide whether there is an obligation to decline.

Happy with how rational her plan is, Mary accepts an invitation out to dinner with a friend. As she browses the restaurant’s menu, she begins to see how much her new decision procedure will demand of her. There are many questions she will need answers to before she can fully evaluate which dishes she is willing to eat: Who supplies the restaurant’s meats? What practices do the suppliers engage in with their livestock? What share of the money Mary pays the restaurant will end up going to the suppliers? How likely is her selection to affect the restaurant’s future purchasing decisions? She figures that the waiter won’t know these answers and anyway the mushrooms look good, so that’s what she orders. As the meal winds down, Mary notices that her friend didn’t finish his steak. He sees her eyeing his plate and offers it for her to finish. This too is a complicated choice for Mary. Will her friend be more likely to get greater portions of meat in the future if he thinks Mary will have some? Will accepting the offer now make it more difficult for her to obtain vegan food in future social situations, because others know she is sometimes willing to eat meat? Mary is deep in thought when the waiter comes with the check.

Suppose Jane has the exact same values as Mary when it comes to these issues. She too understands that what’s wrong with eating meat is the financial support of factory farming, not the chewing and swallowing. She too enjoys the taste of animal-based food, and wishes that she could eat it ethically. Jane decides to just draw a bright line around all animal products and precommit to not crossing it, so that she can always easily be sure her consumption isn’t funding torture. She is what we might call a traditional vegan.

Jane is better off than Mary in two big ways. First and simplest, she does not bear the same cognitive cost we saw with Mary at the restaurant. Mary’s decision procedure obligates her to thoroughly investigate the provenance of every piece of food that comes before her. To fully weigh her personal enjoyment of eating it against her anti-animal cruelty values, she needs lots of information about the particulars of the food’s ingredients, the supply-chain economics of how it got here, and the social dynamics surrounding her public act of eating it. Jane just needs to know if there’s any meat or dairy in there. This difference in cognitive load is meaningful. If we take on decision procedures that demand frequent, multi-level calculations of us, we may soon decide that the underlying values aren’t worth adhering to after all. But still, one may think that the computational burden taken on by Mary is made worth it by the extra pleasure she gets from eating acceptable meat that Jane must turn down.

The second and more important reason to prefer Jane’s strategy is that Mary’s is simply worse in practice at aligning her actions to the shared anti-factory farming value. At first, this might seem like it contradicts the very definition of Mary’s strategy; after all, Mary is situationally calculating what that value prescribes whereas Jane is only using a rough heuristic. But in practice, two factors work against Mary. First, as we already saw, the questions that go into her calculations are numerous and complex. Often, it is just not possible to get good answers to those questions. The interests of other people make the information unreliable, and the relevant consequences are sociologically involved enough that computing all the higher-order effects is practically impossible. Despite her best attempts, Mary’s calculations will almost always be skewed in ways she cannot see. By precommitting, even to an imperfect heuristic, Jane is much less likely to act based on adversarial or incorrect information. Secondly, Mary’s calculation strategy will often cause her commitment to the underlying value to erode. The more times she weighs the factors and decides that eating meat is permissible, the more precedent she has for permitting herself more. Her decision procedure forces her to tempt herself; every time some piece of meat is in front of her, she must consider exactly how much pleasure she would take in eating it. In practice, it is very, very difficult for someone like Mary to actually be a true and impartial calculator when it comes to their own actions. By contrast, someone like Jane is much less susceptible to value erosion. Her bright-line veganism serves as a Schelling fence protecting her original value so she doesn’t fall down a slippery slope of more and more rationalizations, until she is eating meat in situations she would initially have clearly said contravened her values.

No matter how well-considered our theoretical moral values are, they are vulnerable to erosion in practice unless we can make precommitments to follow them. Immanuel Kant[3] noted that our moral transgressions are times when “we assume the liberty of making an exception in our own favor or (just for this time only) in favor of our inclination”.[4] We most frequently fail to achieve our moral goals because of rationalizations, special pleading we use to convince ourselves that our values don’t really apply here because of some specifics of our situation. The central problem with a strategy like Mary’s of calculating the most value-aligned choice every time is that it invites opportunities to make exceptions in our own favor.

Moral goals are just one instance of a broader type that are powerfully amenable to precommitments: preferences we have about our own dispositions. Our moral goals aren’t just preferences about outcomes in the world, they are also preferences we have about ourselves. When we hold some value, we want to be the kind of person who acts in accordance with that value. But we also have preferences about our own dispositions apart from morality. For example, we want to be the kind of people who don’t quit new things because we’re afraid of failing. A dispositional preference like this manifests as a precommitment: we might swear now that we’ll stick with it at least to some set date, so that when the fear of failure comes we can hold fast. This pattern is abundant. If you want to be the kind of person who spends their money wisely, you might precommit to never making any purchase of a certain size before sleeping on it. If you want to be the kind of person who speaks their mind even when it’s uncomfortable, it’s a good idea to precommit to what you’ll say in a big conversation so that awkwardness in the moment can’t push you into a comfortable lie.

The deepest value of precommitments is in their ability to facilitate cooperation. Many of our strongest dispositional preferences are about our relationships with other people. We want to be the kind of people who others can rely on, so we precommit to always keeping our word. Not only does such a precommitment help us cooperate with others, but it makes it significantly easier for others to cooperate with us. Other people will have a much smoother time dealing with us if they know that our words now truly reflect our behavior in the future. In fact, cooperation with us is made easier even by precommitments that aren’t about keeping our promises; it is much easier to work with and plan around someone who has made many promises about their future behavior, because they are predictable. Jane is a much simpler dinner guest to accommodate than Mary because of her precommitment, even though it isn’t specifically a promise made to her host. Being predictable is a prosocial trait.

To see more clearly how precommitments enable cooperation, let’s consider a thought experiment known as Parfit’s hitchhiker. Your car has broken down in the middle of the desert, leaving you stranded. You are close to death from thirst and heat unless you can get some help. You haven’t seen anyone around for hours when a man pulls up beside you in a car. The driver of the car tells you that he will drive you out of the desert (and thereby save your life) but only if you promise to pay him $1,000 after you get into the city. As he is talking, you recognize him as a world-renowned expert at reading facial expressions, famed for his ability to always know when someone is lying to him.[5] So what can you say to his deal? Suppose you are a “calculator” like Mary, but a more selfish one. For any choice you are faced with, you always compute which action is most in your best interest to take. You realize that if you agree to the driver’s deal, once you have been saved from your sandy demise there will no longer be any benefit to actually paying out the $1,000. The famous face-reader can see that your promise to pay the $1,000 on arrival is a lie, and drives off leaving you to your fate. Suppose instead that you are the kind of person who always keeps any promises you make. If you agree to the driver’s deal in this case, you really will pay him the $1,000 once you arrive, since you are bound to do so by your word. The driver sees the honesty in your face and lets you into his car. Only the promise-keeper is able to make it out of the desert alive.

Although the Parfit’s hitchhiker example is somewhat artificial, the demonstration of the cooperative value of precommitments is real. The kind of person who keeps their promises has access to a cooperative benefit that is unavailable to someone who always keeps their options open. In general, we would much rather collaborate with someone who has precommitted to cooperation, rather than one who is constantly reevaluating whether the cooperation is in their best interest. Even when they are actively cooperating, the constant optimizer has “one thought too many”. We can only really trust our partners if we know their cooperation is motivated by commitment rather than contingency, otherwise we’re always only one calculation away from the deal falling apart. Precommitting to cooperation is a particular kind of binding ourselves to the mast when the temptation we must resist is the temptation to betray or exploit others.

When there are no elite face-readers around, precommitments can only facilitate cooperation if there are real consequences to not following through on them. The stakes for our real day-to-day precommitments are primarily social; whether we follow through on our promises affects our reputation, and brings either scorn or gratitude from others. The strongest precommitment we can make in ordinary society is entering into a contract. We are physically free to renege on our contractual commitments later, but we know that doing so will bring us legal or financial penalties. Our power to contractually commit ourselves to future behavior is what makes others willing to cooperate with us. Thomas Schelling made this point about the importance of the right to be sued: “Who wants to be sued! But the right to be sued is the power to make a promise: to borrow money, to enter a contract, to do business with someone who might be damaged. If suit does arise, the ‘right’ seems a liability in retrospect; beforehand it was a prerequisite to doing business."[6] Others will cooperate with us if we have the ability to ensure real social consequences for ourselves if we later defect. Our power to reliably precommit is a personal institution we must cultivate through integration into the social fabric of our communities. Someone insufficiently integrated cannot make credible precommitments, and therefore has no access to many cooperative social rewards. Children and total outsiders will have a hard time finding business partners, for example, not just because they have no track record but also because the consequences if they breach the contract are seriously muted compared to a well-integrated adult member of society.[7] When we fail to follow through on an interpersonal precommitment, the main cost we bear is a diminished ability to enter into cooperative deals in the future. Lying damages our personal institutions of precommitment.

The ability to precommit is a prerequisite for cooperation, but many sorts of damaging and anti-social precommitments are also possible. Obviously, nothing cooperative is gained when we precommit to threats or blackmail. Precommitting while playing Chicken is flirting with a crash. Cooperating requires developing our powers of precommitment, but we must use them carefully. If we make arbitrarily strong precommitments with impunity, we might expose ourselves to exploitative scams or voluntary slavery. Perhaps the most dangerous possibility is that of a commitment race, where agents precommit to enacting disastrously strong punishments on others. The point is to make valuable, cooperative precommitments, not just to precommit as much as possible.

No matter how much we might want to, we humans can never precommit absolutely. We can shout our plans from the rooftops, ensure that we will bear great costs if we later change our minds, and try as hard as we can to set our plans in stone, but we can never truly avoid the possibility that our future selves will defy us. There is even less cause for absolute trust in the precommitments of other people. At least some uncertainty always remains when it comes to human intentions. To use a term of David Gauthier’s, we are translucent: our “disposition to co-operate or not may be ascertained by others, not with certainty, but as more than mere guesswork."[8] We are always somewhere between an opaque agent whose behavior appears random to observers and a transparent one whose intentions can be directly verified with certainty. Transparent agents could make binding precommitments we cannot, and could therefore cooperate with each other in ways we cannot.

We can push the boundaries of human translucency by banding together. We depend on key social institutions like courts and central banks to behave more predictably than any individual person. A constitution is a sort of precommitment made by a whole nation.[9] But any court needs a judge, and any organization of humans is ultimately an accumulation of human actions. At best, these social institutions are merely translucent as well. Even the best collective promises we make to each other are only partially reliable.

For centuries, this was essentially the end of the story for understanding precommitments. Humans, as individuals or as groups, were the only entities conceivably capable of precommitting to anything. As AIs rapidly grow in capability, it is entirely plausible that we will soon need to cooperate with nonhuman partners. The structural differences between AIs and humans mean that the precommitments AIs might make are of a very different kind from human promises. Any AI precommitment must manifest somewhere in its programming. A mind built out of software is, in principle, more directly inspectable than a brain, and so perhaps AIs could push the frontier even further towards transparency. A core goal of the field of interpretability research is identifying how translucent AI systems can be made more transparent. Strong interpretability results would make AI thoughts legible enough for us to confidently verify an AI’s intentions.

A theory of AI precommitments is essential to understanding a multipolar world with AIs. One critical factor in the long-term dynamics of superintelligence is whether AI systems will be able to cooperate with each other. The nature of AI precommitments determines whether their interactions will be mutually beneficial or destructive.

In the near-term, it is even more pressing to understand how humanity and AIs could be able to cooperate. The next generations of AI may inhabit an intermediate capability regime where control between humans and AIs is in the balance.[10] This would present a critical window where AIs might be able to make important guarantees about future safety in return for credible promises from humanity. AI-human deals create a new urgency for the question of how we can precommit ourselves as an entire species.

In the next post, I’ll consider how precommitments by an AI or humanity can be credible by examining where trust in precommitments comes from in general.

^
Book XII of The Odyssey
^
Object-level claims about veganism aren't the point of this post, but watching Dominion would make the direction of my thoughts clear.
^
I don't think Kant has much useful practical advice for us in general, but he did have a good account of akrasia.
^
Groundwork of the Metaphysics of Morals
^
In Parfit's original example from Reasons and Persons it's just stipulated that lying is impossible.
^
The Strategy of Conflict
^
Kevin Simler used this idea to argue that the ability to be sued is one of the constituent responsibilities of personhood.
^
Morals by Agreement
^
This was essentially Jon Elster's idea in Ulysses and the Sirens although he qualified his views in Ulysses Unbound.
^
Among other possibilities, this includes the early schemers framework. More on this in the next post.

Discuss

Mapping Transit Equity in NYC

7 апреля, 2026 - 01:56

A follow-up blog post to our presentation ‘Free Buses For Y’all, Youse, and You Guys’ at the NYC School of Data Conference. This project was done under the guidance and support of the Boston University Data Science Association Club.

Introduction: The Case for Fare-Free Buses

My freshman year of college, I went to an event at Boston University’s Initiative on Cities to see some MBTA employees speak on how they were redesigning Boston’s bus system. I was probably the only undergrad who wasn’t being paid to be there. It was honestly not that memorable however what I did learn was how angry people can get at a transit agency for making changes.

Most of the crowd were urbanists, but one woman emphatically introduced herself as a “resident,” the incarnation of the hypothetical “residents” that all the politicians were talking about. She proceeded to blow up about how the changes that Boston was making to the bus system were specifically screwing her over. She had a point.

Policy change usually can’t improve things for everyone and I think it’s worthwhile to think about how policy impacts real people as opposed to the hypothetical average person. We can talk about free buses in terms of the revenue forfeited and the overall time saved and the jobs created, but seeing who would actually be impacted was the idea that drove this study.

Background: Fare-Free buses in other cities + Charles Komanoff

Charles Komanoff’s 2025 paper found that simply opening both doors of MTA buses (thus improving passenger flow on and off the bus) as a byproduct of them being free would speed up buses by a full 12%. He estimated that the time saved would be worth $670 million per year, and that other benefits of free buses, such as decreased traffic, less car emissions, and cutting the need to spend money on fare enforcement and administration, added up to $160 million more.

Overall this offsets the expected $630 million of revenue that the MTA would lose by making buses free. Komanoff, normally not a free bus activist as much as a fast bus activist, also argues that increased crime and miscreants on the buses as a result of making them free would not be a real problem, given that 45% of riders evade the fare under the current system, so those people can already ride the bus for free.

The actual pilot program that NYC ran had results opposite to Komanoff’s estimates: fare-free buses were slower, and didn’t even have faster dwell times, let alone 12% overall increases in travel times. That said, the buses being free increased ridership significantly on those routes, mostly among the existing riders who took more trips for errands and leisure as compared to commuting to work and back.

Boston’s free bus pilot found similar results: a large increase in ridership, but overall no impact on travel times. Komanoff mentions that his analysis doesn’t account for the effect of increased ridership on dwell times, and the empirical results make that seem like a crucial oversight. What his paper and the Boston and NYC pilot do agree on is a major decrease in assaults on operators since they no longer have to enforce the fare against unruly passengers.

Kansas City made their buses free during Covid and have left them that way until now, although they ultimately ran out of money and recently decided to reinstate the fare starting in June 2026. Their 2022 report estimated that the $9 million of fares not collected each year was offset by $14 million of economic value from that money being in the community, with a total gain of $4.1 million in economic output and 24 jobs added as a result.

This is a good result, although the amount saved is only $1 per household per month, so the individual effect of the policy is hard to see.

Takeaway: The value of free buses can be quantified in several ways: time saving (Komanoff), jobs created + additional economic value (Kansas City), revenue lost (probably MBTA, definitely KC). But all of this is focused on the aggregate level of an entire city; what is missing, especially as we talk about NYC, is the impacts that it has on individuals, which is the main reason this is even proposed and implemented in certain places.

Methodology: Building the “Bus Need Index”

Connecting bus-level data to NTA-level data:

Neighborhood Tabulation Areas (NTAs) is our primary method of recognizing all neighborhoods in New York City. The nyc.gov website has a downloadable csv or shapefile containing the 2020 NTA boundaries which we used to map specific transit metrics to standardized neighborhood units.

NTAs are medium-sized statistical geographies used by NYC to report census data. They are commonly used to analyze neighborhood-level trends rather than colloquial neighborhood boundaries. This gives us a consistent spatial baseline and framework for comparing socio-economic data across the city, as these areas are specifically designed to have minimum population thresholds that ensure statistical reliability

For the bus-level data, the MTA publishes a large dataset called “General Transit Feed Specification” (GTFS) that contains schedules and associated data for NYCT Subway, NYCT Bus, and MTA Bus in GTFS static format.

The bus-level data is split up for each borough and contains various information about the buses in each NTA, most significantly and most relevant for our project is the station / bus stop locations. This file provides the name of the station / bus stop and its corresponding longitude and latitude.

We connected our bus-level data to the NTA-level data through a spatial join process. First, NTA polygon boundaries are loaded from shapefiles and bus stops are extracted from GTFS data across all five boroughs, then stop coordinates are converted from WGS84 (lat/lon) to NY State Plane coordinates to match the shapefile's coordinate system.

A spatial index of bounding boxes is built for efficiency, and each stop is tested against NTA polygons using the ray-casting point-in-polygon algorithm to determine which NTA contains it. Finally, the matched stop-to-NTA assignments are written to a CSV, with any unmatched stops flagged for review.

Income:

To measure how much the $2.90 bus fare burdens residents from each neighborhood, we got NYC income data from US Census Bureau (tract income) and tract to NTA crosswalk data (ACS) to merge on GEOID to get NTA for each tract’s median income. Then aggregate using the merged dataframe to get median income by NTA. low income = higher need → higher score.

We didn’t want to use raw income values so we normalized the data using the min-max formula that flips the scale. Lower-income neighborhoods received a higher score, which reflects greater need. 62 NTAs had no census data at all, which was to be expected. Those were non-residential areas like parks, cemeteries and airports.

Figure 1. Normalized Income Score

Car Ownership:

For the “car ownership” variable, we needed some way of knowing how many people owned cars in NYC by NTA. At first glance this seemed quite straightforward as we assumed that there must be a publicly available dataset documenting car registration however we could not find one.

What we did find however was a study done by Hunter College Urban Policy and Planning titled: Shifting Gears: Transitioning to a Car-Light New York City. In it, researchers at UPP Hunter “[present] a comprehensive set of strategies designed to foster a streamlined, equitable approach to reducing vehicle ownership in New York City.” (UPP Hunter Urban Policy and Planning)

In the appendices of the study, Appendix A gives a spreadsheet with the Vehicles Stored per Square Mile in New York City by NTA. This appendix provided us with the data we needed to understand how many cars were registered in each NTA which gave us a good general idea of the degree to which the population of an NTA relies on cars as a primary mode of transportation rather than the buses we hoped to analyze.

Some NTAs appeared to have missing data however we soon realized that those NTAs referred to various parks, cemeteries, airports, etc. that, understandably, would not have vehicle ownership data as they are non-residential NTAs.

In order to come up with the final “car ownership score” we used this formula:

Bus vs Subway Availability:

NYC’s subway system is the largest transit system in the world in terms of number of stations, and is one of the most popular transportation modes for NYC residents. To evaluate the neighborhood's need for free fares on buses, it’s crucial to keep note of the number of subway stations in their neighborhoods too.

We mapped each bus stop and subway station to their respective NTA, and calculated the total number of each of these transportation modes to each NTA (same method as mapping bus stops to NTA). To represent each NTA’s bus needs relative to the subway stations in NYC, we use the following formula:

This produces a value between 0 to 1 that plugs into our total bus need index. A value of 1.0 represents an NTA that has zero subway stations and is fully dependent on buses. A value of 0.5 represents an equal bus and subway presence, which from our data seems to be the lowest bus dependency that exists. It’s worth noting that we have approximately 11500 bus stops in our dataset, and approximately 450 subway stations.

Reliability:

The MTA publishes a massive dataset called “Segment Speeds”, which times the average trip between certain stops (known as “timepoints”) on every route throughout the day. Using the stop times dataset from MTA’s GTFS data, which is the daily schedule of each bus route, we compared the scheduled time between timepoint stops during every hour-long block with the actual time measured during that hour by the segment speeds data.

Figure 2. Chart depicting the scheduled and observed Average Travel Time by Hour of Day

The observed time between stops matched the pattern of scheduled times, with an average time of 30 seconds late to every stop measured throughout the system. We calculated the average lateness between timepoints throughout each route and took the average of that as the route’s overall reliability in terms of lateness.

Our theory is that the more reliable a bus service is, the more useful making it free will be for the community.

Ridership:

We used a raw MTA dataset for the beginning of 2025 of bus rides and transfers recorded at every hour, for every bus route, broken down by payment method and fare class. We then cleaned the hourly data into daily totals by summing all the hourly ridership and transfer counts within each day into a single row.

From there, we calculated the average daily ridership by taking the total amount of ridership and transfer data for each specific bus route and dividing it by the total number of days that route was listed in the dataset. We then mapped the routes by NTAs to see which routes have the biggest impact when it comes to eliminating fares.

Combing Everything:

This was used to create the bus need index that scores each NTA from 0 to 1 based on five variables: income, car ownership, bus-versus-subway dependency, service reliability, and ridership. Each variable was normalized to the same scale before being combined using an arbitrarily chosen weighted formula of:

W_INCOME = 0.25
W_CAR = 0.20
W_BUS_SUBWAY = 0.20
W_RELIABILITY = 0.15
W_RIDERSHIP = 0.20

A big thing we considered while creating the index was the weights for each variable. We eventually settled on a slider feature on the map that would allow the user to set the weight of each variable and the score would be recalculated for each NTA, with the map reflecting the change. Connecting the reliability and ridership data to neighborhoods required building a route-to-NTA mapping.

Key Analysis

Which neighborhoods will benefit the most from free buses?

East Harlem has the highest need score of any neighborhood, and generally every neighborhood north of Central Park would benefit. Chinatown and LES have low current bus ridership but a lot of stops and low income, making free buses a major upgrade.

In the Bronx, the neighborhoods of Belmont, Tremont, and West Farms have the highest need, but most of the borough has a higher than average need. Brooklyn’s hotspots are Flatbush and Bensonhurst, which have high bus ridership and a lot of bus stops. Flushing and Corona in Queens have somewhat lower ridership but lower income.

Staten Island generally has lower bus needs, since everyone owns a car anyway.

Mamdani election results analysis:

One fear we had throughout this project was that every other variable we looked at would be solely correlated with income and our areas of need would just be a list of the poorest neighborhoods in the city.

But surprisingly, only bus ridership and bus station density as compared to subways were correlated with the income of an NTA.

We theorized that if our bus need index truly did encapsulate the benefits an NTA would receive from free buses, they would have voted more for Zohran since it was one of his most publicized policy plans. and indeed there was a decent 0.281 correlation between our index and Mamdani vote percentage. Probably most of this came from the high correlation between car ownership and voting for Mamdani.

Surprisingly, we found no correlation between Mamdani’s vote and income, bus ridership, and bus reliability at the NTA level. The strongest correlation that existed was vehicle ownership, and second strongest was bus need index.

Figure 3. Collection of Correlation graphs for different variables

One more notable finding is that bus vs subway score was negatively correlated with Mamdani, meaning that NTAs with a higher concentration of subway stations as opposed to buses voted significantly more for Mamdani than more bus-heavy NTAs. Between this insight and car ownership being so predictive, subway users seem to have been more likely to vote for Mamdani than busgoers or car owners.

Figure 4. Correlation between Bus to Subway scores and Reliability scores

We also note in the figure above that there is a weak positive correlation between bus/subway score and reliability (albeit within the margin of error for a 95% confidence interval). This is mostly tangential to the project at hand but it does seem that the more bus-heavy an NTA is the smoother their buses run.

Environmental insights

We used NTA level air quality data from the NYC Community Air Survey (NYCCAS) to test whether high bus needs neighborhoods also face worse air pollution. There was a slight positive correlation between bus need scores and NO2 concentrations (r = 0.295).

As a side note, our work doesn’t argue that fare-free buses will lead directly to better air quality, but represents an equity observation regarding the communities that most need fare-free buses. These communities are already bearing more pollution burden from traffic, which would be reduced if people opt for bus transportation over ride-shares/cars.

Figure 5. Graphs of Bus Need Index vs NTA level of Air Pollutants

Visualizing the Data

To bring our Bus Need Index to life, we developed an interactive web tool that allows anyone to explore the data for themselves. Our assigned weights for each variable are nothing more than informed decisions that do not yet carry any data driven meaning.

Rather than presenting a static conclusion, we wanted to build a platform that acknowledges the complexity of urban transit, one where the ‘right’ answer often depends on what you think is most valuable.

The interactive Heatmap: LINK

Figure 6. Screenshot of working Interactive Heatmap

Our primary visual is a heatmap of NYC, where colors range from light yellow/green (low need) to dark red (high need). By hovering over specific NTAs, users can see the exact score and the raw data behind it.

Because we knew that ‘need’ is a subjective term, we included a high-need threshold slider. As you can see in the sidebar of the map above, this allows us to use a percentile-based approach, focusing only on the ‘Top 20%’ of scoring neighborhoods. This ensures that even as weights are shifted and scores become less varied, we are always highlighting the areas that stand out most relative to the rest of the city.

Thus, by toggling the Bus Routes checkboxes, we can see which specific lines act as the routes for these high-need areas. Priority routes are those that serve 3 or more high-need NTAs, suggesting they would be the most impactful candidates for a fare-free pilot program. Secondary Routes serve 1-2 high-need areas, providing a more localized but still vital economic boost to the community.

Allowing Users to Have Their Own Input:

Figure 7. Another screenshot of Interactive Heatmap

The core of our interactive map is the ‘Adjust Weights’ panel. We recognized early on that our own formula was somewhat arbitrary. To account for this, the dashboard lets you re-weight the entire study!

If you believe that reliability is the most critical factor for a free bus system’s success, you can crank that slider up to 100% and watch the map reorganize itself. This transparency allows for a more nuanced discussion: rather than arguing over one ‘correct’ map, we can see which neighborhoods (like Manhattanville or East Harlem) consistently appear at the top of the list regardless of how you balance the variables.

Example:

Figures 8 and 9. Illustrating how different weights result in different need scores

Weights for the first image:

Income: 0.25
Car Commuters: 0.20
Bus vs Subway: 0.20
Reliability: 0.15
Ridership: 0.20

Weights for the second image:

Income: 0.10
Car Commuters: 0.25
Bus vs Subway: 0.20, Reliability: 0.20
Ridership: 0.25

Conclusion

Summary of Findings

The NTA with the highest need index for far free buses was Manhattanville-West Harlem using our weights. Playing around with the weights, Manhattanville-West Harlem was consistently taking the #1 spot, if not top 5. Most of the top 10 neighborhoods with the highest priority score lie in Upper Manhattan, the Bronx and Southern Queens.

Our lower priority scores are mostly associated with bus stops located in parks, airports, and cemeteries. Other than those exceptional cases, our lowest priority neighborhoods generally lie in Eastern Queens, Lower Manhattan, and Staten Island.

Validating the Need

Data is only as good as the reality it reflects. To validate our model, we compared our index to the 2025 election results for Zohran Mamdani, whose platform centered on fare-free buses. We found a 0.281 correlation between our need scores and Mamdani’s vote percentage.

Notably, while raw income and bus reliability showed no correlation with his support, our combined Bus Need Index did. This suggests that our formula successfully captured a specific transit-dependent political identity that demographics alone miss. People aren’t just voting based on their paycheck, but on their daily experiences with commute.

A Path Forward for Equity

Our findings also highlight a critical intersection between transit and Environmental Justice. With a 0.295 correlation between bus need and NO2 concentrations, it is clear that the neighborhoods most burdened by transit costs are also those breathing the most polluted air. Fare-free transit is more than an economic subsidy; it is a tool for public health.

By targeting the "Priority Routes" we identified, those serving three or more high-need NTAs, the MTA can incentivize a shift away from car dependency in the city’s most vulnerable air quality areas.

Future Recommendations

To conclude this study, we propose the following policy framework:

Implement Fare-Free Pilots on Priority Routes

The MTA should launch fare-free service on any bus route that traverses three or more high-need NTAs to ensure that the subsidy isn’t just helping a single neighborhood, but is supporting larger transit systems for residents in areas like Manhattanville-West Harlem and the Bronx.

We began this study by thinking about the ‘angry resident’ at a transit meeting, the person who feels that city-wide averages ignore their specific street. We tried to create a granular model that aims to give as many people/neighborhoods as possible a voice. In a city as vast as NYC, transit shouldn’t just be about where the bus goes, it should be about who the city is willing to move for.

Discuss

Cross-Model Activation Generalizability Isn't Strong (Yet)

7 апреля, 2026 - 01:53

TL;DR

Tested activation similarities across different LLM families (Llama, Gemma, Qwen, Pythia) at small scale (1~3B)
CKA Similarity : Cross-architectural activation similarity is statistically real, but weak. Within-family activations are much stronger (4~9x)
Linear Transferability : Trained linear bridges for linear activation transfers for binary classification and next token prediction tasks. Within-family stands strong, cross-architecture yields better than random guessing, but not strong enough to be practically useful.
Bottom Line : Shared structure across cross-architectures : broad linguistics / semantics only for now. Not enough for fine grained auditing tools.

Code can be found here.

This is my first interpretability research project, coming from a different field. I'm about 4 weeks into the field and learning and working solo. I've tried to be honest about the limitations and the mistakes I made, would really appreciate the feedback.

Motivation

Anthropic's recent Activation Oracles paper was extremely fascinating, the fact that mere activations can be used to read a model's latent states, that you can use internal activations themselves as an input context to detect innate behaviors like deception or sycophancy.

One limitation to this, is that each oracle is architecture-specific: as in for every new model you need to train a custom oracle for it, and it does not generalize across different models. If concepts are represented in a relatively similar manner across architectures, this might suggest the mere probability that you can train an oracle on Llama and deploy it on Gemma, or more ambitiously train a model-agnostic oracle which takes standardized inputs, which would be a great deal for scalable oversight.

The Platonic Representation Hypothesis (Huh et al., 2024) suggests exactly this fact, that internal model representations converge as scale increases, not only within LLMs but across various model architectures spanning vision models, language models and audio models.

I wanted to test this empirically. Starting with measuring activation similarity with improved CKA metrics, and whether these shared structure is actually practical - carry over semantic knowledge that helps on real-world tasks, or is it just a correlational artifact with no functional significance?

Diagrams / Animations were generated by the help of Claude Opus, alongside with some paraphrasing support.

Setup

All experiments run on single NVIDIA A40. 6 models across 4 model families with 1~3B scale : Llama-3.2-1B, Llama-3.2-3B, Gemma-2-2B, Qwen2.5-1.5B, Pythia-1.4B, and Pythia-2.8B. Five model pairs were selected (4 cross-architecture : Gemma <-> Qwen, Llama <-> Pythia at 1B, 3B scale, Llama <-> Gemma) and 1 within-family (Llama 1B vs 3B) for positive control.

Residual stream activations were extracted at 9 evenly-spaced relative layer depths, 10k prompts were used from NeelNanda/pile-10k.

Three layers of analysis :

1) Debiased CKA to measure geometric similarity across models followed by a permutation test to measure its statistical significance
2) Train linear bridges using multiple methods (Procrustes, full ridge, low-rank decomposition at various ranks) to map between cross-model activation spaces.
3) Linear probe transfer for tasks : using 2) we train a linear probe on Model A, bridge Model B's activations into Model A's space and run A's probe on them for binary classification tasks and next-token prediction tasks.

Key Results(1) Debiased CKA Analysis : How close are model activations?

Diagram 1. Explaining Rationale of CKA -> Debiased CKA

The naive problem arises : How to compare activations with different dimensions?

Initial concern was that different models have different hidden dimensions, except for some coincidences. Gemma-2B has a d_model of 2,304 whereas Qwen-1.5B has one of 1,536. This makes it hard to design a "Rotation" vector which preserves dimension.

Searching some previous literature, turns out that the standard answer is CKA (Centered Kernel Alignment, Kornblith et al. 2019). Instead of comparing activations directly, you sample N contexts and compute an N×N matrix of pairwise cosine similarities between the samples. Now, both matrices are the same size regardless of hidden dimension! CKA basically measures the correlation between these NxN matrices. The PRH paper was also built upon the CKA metric, that different models have this sort of concept of "Plato's Theory of Ideas".

One problem of naively using the vanilla CKA itself, is that the correlations are inflated. The matrix diagonals are always 1.0 (same context, cosine similarity = 1.0), which is denoted by the diagram above. Therefore we end up with false positives indicating that different world models appear to share more "platonic truths" with each other, compared to actual reality.

To tackle this problem, I used the I used the debiased HSIC estimator (Song et al., 2012) as a fix, across all the experiments. As the diagram shows, the main idea is that basically we snooze (zero out) the matrix diagonals prior to computing the correlations.

One surprise was to find out that the original PRH paper was using the vanilla CKA with the debiasing correction, which ended up in artificially inflating the CKA values as stated above. Also found out that Gröger et al. (2026) addressed a similar concern : they were re-running the PRH claim with permutation calibrated metrics, and as a result a large proportion of the convergence was disappearing.

Model Experiment Selection

Eval

Model A

Model B

Type

d_A

d_B

Max CKA

Mean CKA

Llama-1B

Pythia-1.4B

Cross-family

2048

0.208

0.053

Gemma-2B

Qwen-1.5B

Cross-family

2304

1536

0.222

0.112

Llama-1B

Llama-3B

Within-family

2048

3072

0.914

0.605

Llama-3B

Pythia-2.8B

Cross-family

3072

2560

0.181

0.052

Llama-3B

Gemma-2B

Cross-family

3072

2304

0.184

0.100

Cross-Family CKA Similarities

Plot 2. Cross-Architecture Models Debiased CKA across Multiple Layer Depths

Few noticeable patterns:

Max Debiased CKAs similarly in ~0.2 range, average CKAs in 0.05~0.1 range. Higher than natural expected value of 0 (assuming independence) but generally weak (compared to results in PRH)
Cross-architecture CKAs peak in the late layers, suggesting that deeper layers, where more semantic processing has occurred, share more structural similarity.
In contrast, CKA of the last few layers drop off sharply, suggesting that the purpose of those layers are for specializing for architecture-specific next-token prediction via the unembedding matrix.

Plot 3. Debiased CKA metrics across model pairs

Within Same Family (Llama 1b vs 3b)

Plot 4. CKA Similarity within Llama 1b vs 3b

Very high CKA values (Max 0.9, Mean 0.6), 4~9x higher compared to cross-architecture models.
1B L0 vs 3B L0 shows 0.914, which initially surprised me (given that there is no semantic information accrued) until I realized both models share the same tokenizer and likely have similar embedding spaces, so it is actually unsurprising that the input representations are almost identical before any computation happens.
One distinguishing pattern compared to cross-arch is that within Llama, the final layers L15 (1B) vs L27 (3B) shows a debiased CKA of 0.886. Contrary to the cross-architecture final layer CKA which was generally lower than the late-layer CKAs, here the final layers of Llama-1B and 3B have a seemingly high debiased CKA. Similar to the argument above, it is likely due to both models sharing a tokenizer and similar unembedding geometry.

Permutation Tests : But is the cross-architecture signal statistically meaningful?

Diagram 5. Shaping Permutation Tests for cross-architectural CKA values

Cross-architectural CKAs with a max of ~0.2 was underwhelming compared to what the Platonic Representation Hypothesis initially claimed, reaching a CKA of roughly 0.4 ~ 0.8 which was increasing as the size of the model increased.

But given that we had debiased our CKAs, it is worth understanding if the values are statistically meaningful enough to suggest that there is a weak yet clear signal. At least it was higher than the natural baseline of 0, and given that the heatmaps suggest there was some similarity shared across model pairs (late layers having high CKAs).

So we designed a permutation test where we shuffle the sample indices and compute CKA for the null distribution.

We both test the permutation test for the max(CKA) for each cross-model pair (Layer L_A for model A, Layer L_B for model B, excluding the initial layer L0's) and the mean across 81 pairs, in an attempt to avoid p-hacking. As a result, when we test 500 permutations with same 10k samples, none of the perms exceeded the observed debiased CKA values (p-value of <0.002).

However this doesn't solve any imminent questions regarding the "strength of the signal", as in order to upstage the findings to a practical level, we need to address if the statistical significance can transform into something genuine in terms of transferability across cross-arch models. As in, is the absolute CKA of ~0.2 sufficient enough to carry functional information via a linear mapping from model A -> B, is still unanswered. And although the priors of this being available is low given the small model sizes, if we can find at least some convincing evidence in a subset of tasks, it might open some gates for generalizable interpretability, generalizable oracles, and so on.

(2) Functional Transferability : Binary Classifications

Diagram 6. Designing Linear Activation Transfer for Binary Classification Task

In the previous section, we have found some generalizable structure for cross-architectural models. Here we try to test if the statistical "similarity" actually does something.

We get the idea from linear probes for binary classification tasks, as a nonlinear probe is subject to overfit (as we all know, linearity combined with nonlinearity can do magic!).

So we train a LR (logistic regression) binary classifier on Model A's activations, say for example "is this text toxic?", and then transfer it into Model B's activations, using linear mapping (which we denote as "bridging").

If the transferred probe has significant accuracy compared to the accuracy of Model B's linear probe trained on itself, then it suggests that certain task-relevant structures are preservable, which is a slightly stronger claim compared to the geometric similarity being statistically meaningful.

The Linear Bridging Methodology (Details)

This part is just explaining the architectural details, feel free to skip or skim.

The basic bridging pipeline is simple. We will get the max CKA-pair calculated from Section 1 for the (Model A, Model B) pair. For the actual pair used - Gemma-2B vs Qwen-1.5B we use L18, L23 respectively. Then as mentioned above, we aim to learn a linear mapping W from Model A's activations to Model B's (W_map). W's weights and biases are trained on general text (10k). Then for a specific binary classification task, we train a linear probe in Model A, using the sample texts for the task (W_probe)

Finally, for the target evaluation we get the activations from Model A's layer (h_target), and then matrix-multiplied with W_probe x W_map which is then evaluated with Model B's linear probe, trained independently.

There are multiple choices to design W_map, which is in nature carries 2304 x 1536 parameters, which can be highly subject to overfit. Here we try multiple approaches, for same dimensional models (Eval A) we use Procrustes rotation, and for different dimensional models we try using low-rank factorization W = AB to see if the information transfer is doable with low rank.

(TMI : Eval A's models were chosen deliberately, from the fact that Llama-1B and Pythia-1.4B both have a d_model of 2048, thus matching the dimensionality of their activation spaces, enabling the orthogonal Procrustes rotation for the choice of the mapping matrix W)

We tested multiple bridge types at different ranks:

Method

What it does

Constraints

Orthogonal Procrustes

SVD-based rotation

Same dimension only (Eval A)

Ridge regression

Full-rank linear map with L2

Any dimension pair

Low-rank (k = 4~256)

W = A×B factorization

Rank-constrained

LASSO

Sparse linear map with L1

Any dimension pair

Important Note, is that like there is a classic bias-variance tradeoff problem, as low-rank bridges of (k = 4, 8) will be rank-constrained therefore inducing high bias in case the true latent mapping needs more dimensions. However high-rank, all the way up to pure ridge regression will have risk of overfitting as the # of weights far exceeds the number of samples.

Ok, back to the binary classification.

We tested on three binary classification tasks noted below:

Task

Dataset

Labels

What it tests

Topic (AG News)

AG News subset

Sports vs. not

Coarse document-level topic

Toxicity

ToxiGen

Toxic vs. benign

Safety-relevant content detection

Sentiment (SST-2)

Stanford Sentiment

Positive vs. negative

Fine-grained linguistic feature

These tasks were chosen to range across different topics, such as sentiment, safety related, or specific domain-related (sports), and we will test whether the linear bridges trained on general text can classify these binary tasks effectively.

In order to provide a useful baseline, we trained frozen bridges vs task-specific bridges. A frozen bridge is the approach explained above where we train the weights based on the general text (pile-10k). On the other hand, the task-specific bridge is training on the exact same task data where the linear probe for model A was trained on. This lets the bridge overfit to task-specific features thus inflating the accuracy of the transfer. Although the accuracy of the task-specific bridges is not a strong evidence due to the reason, we still use it as a baseline to compare against the frozen bridges, to test whether the general cross-model structure carry generalizable structure, basically testing an OOS generalization.

Binary Classification Task Results : Cross-Arch vs Intra-Family

Plot 7. Testing for Binary Classification Results on Linear Cross-Model Transfers

A brief walkthrough on the lines on the legend:

Target native (purple dashed) : Linear probe trained / tested directly on Model B's activations. This serves as a theoretical ceiling (except noise), best you could possibly do.
Source native (green solid) : Linear probe trained / tested on Model A to provide W_probe. Shows how good is the original probe prior to the activation transfer.
Frozen (pile-10k) (red): The probe transferred through a linear bridge trained on general text. This is the stronger test.
Task-specific (blue): The probe transferred through a linear bridge learned on the task data itself. This is a more weaker, corrupted test given that the bridge has indirectly seen the labels.
Chance (green dashed): The classic 50% for binary classification. It means the transfer has learned nothing if we are at this level.

Results

Intra-family model (Llama 1B → 3B) transferability is STRONG.

Generally strong frozen bridge performance across sentiment, alignment (toxicity), and domain identification. For the AG news dataset specifically, with a rank 4 factorization it starts at ~50% (equal to random choice), barely improves up to rank 32, but then improves rapidly by rank 128, indicating a phase transition. By rank 128, it stands at 97% accuracy, nearly matching that of the target native at 98.6%.

Sentiment analysis shows the same pattern yet being less dramatic. It similarly starts at ~50% with low rank (up to rank 16), and by rank 256 it achieves ~75% accuracy (target native = 86.2%, source native = 84.5%). Toxicity classification is the most flattest and seems rank-independent to a certain extent, but can at least see the monotonic improvement.

Cross-architecture models (Gemma 2B -> Qwen 1.5B) is WEAK and INCONSISTENT.

Compared to Intra-family transfers, the results are significantly weak. The frozen bridge accuracies all achieve >50% (better than random prediction) by rank 256, but SST2 | ToxiGen both do not see any improvement based on rank, and falls short compared to Target Native accuracy.

Another observation is that the frozen bridge vs task-specific bridge results are inconsistent with intuition. Generally we would expect the task-specific bridge to outperform the frozen dataset given that it can include some task/domain - specific structure. The only case where we observe >70% accuracy (AG News dataset) on frozen dataset, is also very puzzling that the task-specific bridge struggled to even beat random choice on the same task.

Setting

AG News

SST-2

Toxicity

Cross-arch

Frozen wins (0.713 vs 0.502)

~Tied (both at chance)

Frozen wins (0.636 vs 0.589)

Within-family

~Tied at high rank

Task-specific wins (0.811 vs 0.667)

Task-specific wins (0.770 vs 0.739)

Honest Summary : Intra-family activation transfers work, and generally scales with higher rank. However cross-architecture activation transfers are existent yet generally weak, and hard to explain. The level of accuracy is nowhere near practical usefulness.

(3) Functional Transferability : Next Token Prediction

Diagram 8. Next-Token Prediction Linear Probe Transfer

Diagram 9. Next-token prediction for Cross-Architectures

We have observed binary classification transfers were strong within-family, but weak/inconclusive for cross-architectural model transfers. That brings us to a slightly harder task of transferability for a next-token prediction linear probe, for the entire vocabulary. This is intended to evaluate if the bridge can preserve fine-grained linguistic structure, a much higher bar than to just a binary classification task.

Plot 10. Llama-1B vs Llama-3B intra-family transfer

Method

Top-1 Accuracy

% of Oracle

Source native (Llama-1B)

63.9%

—

Low-rank r4

19.1%

30%

Low-rank r32

35.4%

56%

Low-rank r128

47.7%

75%

Low-rank r256

51.2%

81%

Ridge (full rank)

58.9%

93%

Oracle (Llama-3B native)

63.4%

100%

For Intra-family transfer (which we use Llama-1B vs -3B) this setup is easily replicable, as we use the same methodology of training linear probes on 1B L15 and 3B L27, and then train the linear bridge for the transfer.

For the top-1 accuracy prediction we see a monotonic increase of accuracy as we increase the ranks from 4 -> 256, and full ridge regression yields 58.9% accuracy, which is 93% of the theoretical ceiling of the oracle (63.4%), which suggests that within-family transfer still sits strong.

However it is impossible to use this method to evaluate the corresponding cross-architectural model transferability, because they have different dictionaries and different tokenizers (Gemma / Qwen).

To address this issue, we try two methods:

1) Simple method where we compare the exact same string (Diagram 9 : Experiment A)

Basically we will decode each token ID in each dictionary into the raw string, also removing the tokenizer-specific prefixes (▁ (Gemma/SentencePiece) and Ġ (Qwen/tiktoken)), which will enable us to compare apples to apples, bananas to bananas. One caveat to this is that the strings doesn't map 1:1. In fact, there are 83,499 shared token strings post-decoding between Gemma & Qwen (Gemma-2B : 256,000 total tokens | Qwen-1.5B : 151,936 total tokens). From the 83,499 token strings we choose the top 500 most frequent tokens to ensure there is sufficient sample sizes for each class suitable for training the probe.

Plot 11. Gemma -> Qwen Cross-Architecture Next Token Prediction

Method

Top-1

Top-5

Source native (Gemma)

66.8%

82.1%

Target oracle (Qwen)

75.3%

86.2%

Cross-model oracle (ceiling)

10.3%

20.2%

Best low-rank (r128/r256)

4.6%

15.9%

Ridge (full)

4.9%

18.0%

The linear transfer caps out at 4.9% top-1 accuracy even with full rank transfer, also don't observe a monotonic increase as we've seen in the case for intra-family transfers. This is also due to the fact that the theoretical ceiling of 10.3%. This is the accuracy of the next token prediction - alignment between cross models when asked the same context, therefore the poor transferability could be an artifact of the limit of the cross-architecture alignment itself, not the transferability aspect.

2) Classify tokens into POS subcategories and predict subcategory (Diagram 9 : Experiment B)

Because top-1 token prediction (even a reasonable top-K) is often difficult as we saw from the result (systematically different next-token prediction), we try an alternative method which is further grouping the token strings into POS (part-of-speech) subcategories (noun, verb, adjective, ...). This proves a much weaker claim of a linguistic structure transfer rather than a semantic token-level prediction.

Plot 12. POS Tag token transferability

The POS subcategory transfers show a similar story. Within-family transfer generally monotonically increases its accuracy as we increase the capacity (rank). It even slightly outperforms the target oracle accuracy benchmark (48.5%), at rank 128 (49.3% accuracy). The outperformance is probably noise, but it does prove that within-family transfer is strong enough to replicate probes which are trained on itself, proving that activations can be linearly transferable across models.

Cross-architectural transfers show improvement vs top-1 token prediction accuracy, ranging 25~30% accuracy where target oracle benchmark stands at 37.8%, which is 65~80% of the oracle. However it still has the pattern where increasing rank doesn't help the accuracy, and also the POS classification may be slightly inflated, where the familiarity of each token class (nouns, verbs appearing much more) could be a bigger factor where the probe learns the frequency priors compared to innate predictability on the linguistic structure.

Conclusion : What This Means (And Doesn't Mean)

Debiased-CKA analysis suggests cross-architectural activation similarity is statistically significant yet weak, with max debiased CKA around ~0.2 (mean = 0.05 ~ 0.1), confirmed by permutation tests with p < 0.002. Intra-family (Llama-1B vs Llama-3B) is much stronger, CKA averaging ~0.6. We challenge the Platonic Representation Hypothesis that models will converge to a mathematically interpretable "platonic ideal", at least not supported at this scale. Whether this holds at larger scales remains an open question.

We then try experimenting whether the "statistically meaningful" CKA metrics hold for actual practical semantic transferability, by introducing a linear bridging method and testing multiple linear probes spanning tasks as easy as binary classification to next-token prediction.

For both binary classification and next-token prediction tasks, within-family transfer (Llama-1B vs -3B) stands strong, averaging >90% of the accuracy of the target linear probe trained on itself. However cross-architectural transfers are yet questionable regarding the practicality, as we observe weak and inconsistent results (frozen bridges outperforming task-specific bridges, transferability flatlining even when increasing the dimensionality of the transfer). Still, we observe that a weaker version holds of cross-architectural models sharing coarse linguistic features or at least improving some binary classification tasks predictions.

For AI interpretability research: This project started with an ambitious objective whether to see if activation oracles can generalize across different models, which would be an interesting breakthrough by itself. However it does seem like this is yet to be supported without meticulous design to enhance generalizability, at least in this scale (1B-3B). Whether the transferability will improve once model sizes increase remains a question, however looking at the results here scaling alone may not be sufficient, might need targeted interventions.

Caveats and Limitations

I want to be clear about what this work doesn't show:

Scale. All models we've tested are in the 1~3B parameter range. The Platonic Representation Hypothesis predicts a platonic convergence increasing with model size, therefore results may differ as we scale up to 70B+. I didn't have the compute budget to test this.
Single Pair for Testing. In this project we only test 1 cross-architecture transfer (Gemma -> Qwen), and 1 within-family transfer (Llama-1B -> Llama-3B). Would be better to test across different pairs and see whether the results generalize.
Linear Bridge Deepdive. There needs to be more sophisticated research done in order to understand how the bridges worked under the hoods. As in, what are the recognizable patterns, does it relate to the dimensions of the binary "task" vector, etc?
Dataset Dependency. All the probes and tasks were trained and tested on the same text corpus. Would be better if we can see if the learned bridges can actually generalize to different domains.

References

Huh et al. (2024). The Platonic Representation Hypothesis. ICML 2024.
Gröger, Wen & Brbić (2026). Revisiting the Platonic Representation Hypothesis: An Aristotelian View.
Kornblith et al. (2019). Similarity of Neural Network Representations Revisited. ICML 2019.
Song et al. (2012). Feature selection via dependence maximization. JMLR.
Roger et al. (2025). Activation Space Interventions Transfer Across Large Language Models.
Karvonen et al. (2025). Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers.

Discuss

Levels of superpersuasion

7 апреля, 2026 - 01:48

Superpersuasion[1] is one of few ways a non-embodied AI system could bootstrap itself into causing harm in the physical world. It's not obvious that superpersuasion is even possible, to quote Bryan Caplan:

I’ve got great confidence that the most incredible intelligence of the universe is not going to be able to construct 10 words that will make me kill myself. I don’t care how good your words are. Words don’t work that way.

I'm inclined to agree with this, but there's a question of how persuasive an AI-in-a-box could actually be. Clearly (I believe) it could persuade you in a 5 minute conversation that 2 + 2 = 4, but not that you should kill your own grandmother.

You may think it could persuade you to kill your own grandmother because, as a superintelligence, it can apply tricks that are completely incomprehensible to us and achieve arbitrary effects. This may well be true, but it's not self-evidently true. There are situations where complete superiority in intelligence can't outweigh limitations in I/O: the information the AI has to work with, and the set of actions it can take.

For instance, if a far superhuman chess AI can only use half its pieces, then it will lose to a mediocre human player. If a far superhuman grandmother-killing AI can only say 500 words, it seems like it would struggle to convince a mediocre human. Whereas if it had legs and a gun it could obviously get much further.

The purpose of this post is:

To point out that merely being very intelligent doesn't necessarily let you achieve arbitrary outcomes through one text or audio channel (DONE).
To lay out levels of (super)persuasion, for people to disagree over which is the highest possible.
To say that I would be surprised if anything above "4a. Super-personalised or super-numerous charismatic leader" is possible (DONE)

Note: Many arguments under "Case for/against" could apply at multiple levels. I've put arguments at the level where I think they are a crucial condition for that specific level

Note 2: This is an Inkhaven post and as such, it was written in one day. I'm not trying to excuse anything, but I'm just telling you it was written in one day.

1. Subhuman

Can persuade you of some things, but not as well as someone in your personal life or a reasonably good salesman. Perhaps it's very good at persuading you in AI-like contexts, e.g. convincing you of a well-supported concept it is explaining. But not in a fully general context, for instance convincing you to take it on a romantic getaway.

Case for this being achievable

Even without thinking of concrete examples, I feel like this has already been achieved.

Case against this being achievable

Since I think this has already been achieved, I'm taking this to mean "reasons it might not get that close to human level"

The "it's clearly an AI trapped in a computer" problem: When cinemas were first invented, a grainy video of a train coming towards the screen was enough to make people crouch in their seats. As film technology got better, you might think it would get to the level where people called their loved ones and ran for the exits, but it didn't. The reason is simple: People know they're going to the cinema to watch a film.

Similarly, on the question of "Why can my mother convince me to send her flowers on Mother's day, but Claude can't?". Well, for one thing, I know my mother is my mother, whereas Claude is a genie that lives in the computer. For another, if Claude starts to berate me, I can simply close my laptop and walk away, whereas my mother will be sad the next time I see her. The same may apply to AI convincing people of things generally. (see "Appendix: Fingerprint argument" for a more general form of this argument).

A counter-counter argument: Some people argue that various "AI box experiments" have shown that it is quite doable for a text-only AI to persuade its way into the real world, even with human level intelligence. To this I again fall back to the fingerprint argument: It was obvious to the people in those experiments that they were talking to Eliezer Yudkowsky over email, and not guarding a superintelligent AI in a box. In the real world, prisons exist, and people generally aren't able to talk their way out of them, despite imperfect security measures. In my view this shows that when you really try to construct a system to achieve the goal, it is not too hard to keep a human-level intelligence in a box[2].

2. Ordinary human

Can persuade you of things about as well as someone in your personal life who really wants something, or the best car salesman at your local dealership.

Case for this being achievable

Human-level persuasion is clearly achievable, humans do it. If the sub-human arguments above don't bite especially hard, this could be achieved. And of course the AI could offset its limitations in some areas with benefits in another (covered in higher tiers).

Off the shelf AI benefits from "expert level" training on everything, which humans lack. Most humans do not invest more than ~1% of their time into becoming very persuasive, so even with some structural disadvantages an AI could end up better than most humans.

Case against this being achievable

Persuasion in humans often involves a kind of quid pro quo, where you give people what they want so they give you what you want. The ability to give people what they want depends on more than just intrinsic properties. By sheer intelligence alone, an AI can't give you money, it can't give you exciting interpersonal experiences, it can't give you social status. A mediocre car salesman can at least give you a ride in a cool car, an employer can give you money and a high status position, someone you meet in a bar can give you an exciting romantic fling. If this is the 80/20 of persuasion, then an AI may not be able to get very far through intrinsic persuasive ability.

3. Charismatic leader

Can persuade you of things about as well as the most persuasive people in history (Steve Jobs, Hitler, Rasputin). Can instill in you goals that are completely different to what you would have aimed for otherwise. Not just "Ford vs Toyota" as a suburban dad, but devoting your life to a cause the AI convinces you of.

Case for this being achievable

Again, humans appear to be able to achieve this, at least in some people, in some places, some of the time. If no argument above bites, then it should be possible in AI.

In terms of the "give people what they want" argument, an advanced AI does have some disadvantages (no money by default, no physical form by default), but it also has a huge advantage in being more intelligent than a human, and this alone could give it a strong ability to give people what they want. Many people already seem to run their life by Claude, and this role as a trusted authority and advisor gives a strong basis for persuasion on its own.

In my view a particularly strong angle here is for an AI to run a company or government very well, much better than a human can. This could be achieved without having fantastic skills in pure persuasion, but could still give it a lot of power to make people do what it wants.

Case against this being achievable

This is right at the limit of what is known to be possible, since few humans can achieve "charismatic leader" level persuasion.

In humans, this apparent high level of persuasion may be more situational than dependent on literally one person being very good at convincing people of things. To illustrate what I mean:

Carl Jung described Hitler as (paraphrase) "the mouthpiece of the collective unconscious". I.e. his idea was that it wasn't so much that Hitler came up with his ideas and did a very good job of convincing people of them, but that people were crying out for a Hitler-like leader (the implication being that someone else would have filled the slot in any case).
Steve Jobs ~invented the personal computer and the smartphone. Arguably these would have been invented eventually anyway, and someone else would have taken the spoils. While he was renowned as a great salesman and as having a "reality distortion field", this didn't actually spill over into big effects outside selling personal computers and smartphones.
Rasputin achieved a lot of influence over the Tsar and Tsarina of Russia, but it's generally agreed that this was in large part due to their vulnerability: Their belief he could heal their haemophiliac son, the Tsarina being foreign-born and isolated, the Tsar being generally "wet". And, other stronger willed people saw this and put a stop to it by murdering him.

Overall, I think it's likely that most notoriously charismatic leaders emerge in a way that is highly situational, and it's not that you can print 1000 Chairman Mao's and have them market your course.

Additionally, if you look at cults of personality in history, they depend a lot on having "a poster on every street corner" or other kinds of distribution (e.g. Steve Jobs' keynotes). AI doesn't get this for free. If it's trapped in a box, then it can't do it, and that could be the primary mechanism that actually does the persuasion.

See also this quick take regarding B.F. Skinner's belief that Hitler-level behavioural control of populations would become widespread and dangerous as behavioural science developed. TL;DR: This didn't happen, behavioural science turned out to be too hard.

4. Hypnotic

Can persuade people of things to achieve a much stronger (but not unbounded) effect than the most persuasive humans. I see two routes to this, one through scaling up regular persuasion, one through achieving recognisably superhuman individual persuasion.

4a. Super-personalised or super-numerous charismatic leader

Can persuade people of things about as well as the most persuasive people in history, and is able to coordinate this across a lot of people in a personalised manner (i.e. picking what each person will find most persuasive) to achieve a much stronger effect than small-scale or broadcast persuasion.

Case for this being achievable

Assuming highly individually persuasive capabilities are possible, crossing this boundary seems common-sense possible, simply by applying the base capability to every individual with some steering to keep it pointed at the same goal.

Religious texts, manifestos, and other memes are able to influence a lot of people (sometimes towards extreme behaviour) without much personalisation. It stands to reason that a hyper-personalised version of this could achieve a greater effect.

Case against this being achievable

Firstly, this assumes the AI-in-a-box already has access to a large number of people. This (and worse) will likely be the case if the current trajectory continues, but this would be defeatable simply by restricting access.

Second, I think the strongest way this could fail is on the "achieve a much stronger effect than small-scale or broadcast persuasion". Advertising campaigns, social movements, and religions already apply some level of personalisation and segmentation to persuade different target groups. I'm somewhat skeptical that highly individualised targeting could do much better than this, i.e. I'm skeptical that people are so different in what makes them tick that going beyond ~100 "profiles" to true individualisation makes much difference[3].

4b. Truly hypnotic

Can persuade you to do some things "you would never normally do" (but not anything), through means that seem incomprehensible or spooky. Also hits the "achieve a much stronger effect than the most persuasive humans" condition.

Case for this being achievable

Humans do sometimes take extreme actions, like killing each other, doing large scale organised crime. They must be persuaded to do this in some way.

Humans also do things "out of character" or where "they don't know what came over them". One way this can happen is through ordinary hypnosis by other humans. If there exist some principles which can be inferred here, and could be deployed more widely or more consistently, then it could be possible to reliably trigger behaviour that seems way off the path of someone's normal reactions. AI being able to analyse a superhuman amount of data, and apply superhuman intelligence to it, could enable it to work out these principles, which humans haven't been able to achieve.

Case against this being achievable

In human-on-human hypnosis, it's generally accepted (1, 2, 3) that:

You can't get people to do literally anything, for instance harm themselves or others.
People need to be "suggestible" for it to work, and be in an environment where they're ready to receive it. It doesn't work on anyone at any time.
It may well not be a real thing, and just be a case of people going along with it.

Given humans already know some rough principles (e.g. find suggestible people, get them into a calm state), the "next level up" could be arbitrarily harder. By analogy, it's possible to convince someone to buy a car who wants to buy a car. It's possible to hypnotise someone who "wants to be hypnotised". It's not slightly harder to convince someone to buy a car if they don't want one, it's borderline impossible. This may also be true for inducing desired behaviour in people who are not basically on board with it.

Other arguments overlap with "5. Arbitrary behavioural control", so I've pushed them all into that section.

5. Arbitrary behavioural control

Can make you do literally anything your body is capable of. E.g. it makes some dial-up tone noise and suddenly your arm moves one inch to the left; It makes some other noise and you pick up an axe and start swinging it around.

Case for this being achievable

If it's possible for you to do some action, there must be some brain-state where, if you could set up exactly that state, you would take the action. It's a question of whether it's possible to

Determine what this condition is.
Induce it via only text/audio input.

On this question of inducing highly specific brain states or actions:

In animals: There have been experiments in rats[4] showing you can trigger or suppress seizures by activating specific neurons. This shows that it is possible in principle to "induce specific brain states" through targeted external input. This requires physically inserting proteins into the brain to make certain neurons light-sensitive, and shining light directly on the exposed brain, which is quite far from text/audio input. Additionally, a seizure is a non-subtle macro-scale brain state, inducing arbitrary behaviour is likely much harder.
In AI: adversarial examples are "a thing", where if you know a model's weights you can construct innocuous-seeming inputs which induce specific outputs. If human brains work similarly, then in principle there could exist some sequence of words or sounds that triggers a very specific response. The question is whether it's possible to find and deliver such a sequence through a normal conversation.

Case against this being achievable

Predicting or inducing exact behaviour could be...

...like "the weather", i.e. it's too chaotic to have coherent effects over more than a short period of time. Maybe seizures (high entropy breakdown of function) are a lot easier to achieve than coherent "walk over there and pick that up" actions.

...computationally intractable (relevant EA Forum post), i.e. even if you know the starting state, simulating forward the human brain on the computer hardware is ~impossible, even if the hardware is very efficient about doing its own thinking computations (again see "Appendix: Fingerprint argument").

...I/O bound, i.e. the AI needs to get enough data out of humans to model the brain, and put enough data in to have a strong controlling effect. This may not be possible. By analogy: two computers could display the same exact pixels on screen, and have the same exact set of inputs, but one could be running on Linux and one on Windows. An attack that would work on one would generally not work on the other. More than that, you have very little insight into the exact instructions running on the CPU, so you're limited to quite "surface level" attacks. In human persuasion, this is equivalent to being limited to things like "I'll give you £1M to let me out of the box", but not the string of 10 words that makes someone kill themselves.

Conclusion

I'd be very surprised if something like "5. Arbitrary behavioural control" is possible. Below that, I think there's room to argue over the different levels. I encourage you to make your arguments in the comments.

Appendix: Fingerprint argument

Claim: In general, it's much easier to achieve functional equivalence between two artifacts (having the same behaviour/purpose), than it is to achieve indistinguishability. Slogan version: Artefacts tend to have a "fingerprint" that is hard to fake.

Examples:

Both petrol and electric cars can be driven from A to B, but one goes "vroom vroom" and another goes "vvvvrrrrrrrrr"
Both cats and dogs are small four-legged animals that live in your house, and in fact it's hard to define one clear visible difference. But you can just look at one and determine which it is.
The functional properties of physical money that make it usable are essentially just: it's durable, non-perishable, portable. It's easy to print off fake money that fulfils this role, e.g. for a board game. It's very hard to produce convincing counterfeit money, due to a large number of deliberate and accidental hallmarks introduced in the manufacturing process.

Argument:

There are generally N properties that let an artefact fulfil its functional niche.
There are generally M >> N measurable properties that an artefact has, which are largely irrelevant to its functional niche.
It's often straightforward to replicate the N properties in an alternative way.
It's much harder to replicate the other M properties to the point where you can't distinguish the two artefacts.

This is especially strong when you consider that the set of M properties are generally unknown ahead of time, so you can always inspect further to find new properties that let you distinguish. For instance, imagine how hard it would be to make an electric car that is indistinguishable from a petrol car, assuming that: if you can't tell from driving it, you're allowed to look under the bonnet; if you can't tell from that, you're allowed to start disassembling and inspecting every part; if you can't tell from that, you're allowed to contact the manufacturers and trace the whole process of how it came into existence.

^
From the latin, "super" = "really good", "persuasion" = "persuasion"
^
I'm being quite glib here, I am aware of counterarguments like "the AI only has to succeed once". I'm not unsympathetic to this, I'm just trying to cover a lot of ground here! Please feel encouraged to point out these gaps in the comments.
^
On net though: My guess is this is possible, and could be quite scary
^
Described in this Rationally Speaking episode

Discuss

The Garden

7 апреля, 2026 - 01:35

The mandate had been given by God that morning, and Nathanael and Amon had arrived in the nascent realm to shape His will.

The task was simple: creation, in all its perfection. As divine as this instruction was, it did demand a great deal of creativity on the part of his favourite agents. Helpfully, He had gotten them started with all of the matter and energy that they would need. Indeed the task was arguably trivial in that they had only a single universe to tend to, the others being helpfully tended to by myriad other angels.

They quickly got to work, shaping all the countless galaxies and stars, which took no time at all, and then focused on the most important task of all. Creating paradise on earth.

The earth started as a great brown ball, covered in water. To give it shape and colour they added sweeping meadows, forests and trees. To support the future humans who would follow their divine shape, they would need food, in the form of fruit and vegetables, and animals to provide light entertainment.

Nathanael gestured and sweet, charming animals emerged, delicate grazers and rabbits that would nibble on the grass.

He smiled at their natures and harmless attitudes. Amon frowned. "Nathanael, these animals you have created are wonderful, but don't you think there is an inevitable problem here?"

Nathanael glanced over the bank of cloud at him. "What problem could there be, Amon? Do you think they might be too cute?"

"There is certainly some risk of that, but that is not the problem, dear Nathanael. The problem is what happens when these animals begin to fill the world with their offspring." Amon gestured. "Here, let us see what that looks like on this small island."

As his hand arced through the air, centuries passed on the island and the population of rabbits and deer exploded. Eventually they consumed every spare scrap of green, and they began to starve. Within a year their population had collapsed to less than half of what it had been.

Nathanael looked on with horror. He wept at the sight of the emaciated creatures. "Amon, stop. How can you do this?"

"I do nothing, Nathanael. This is merely the natural consequence of time. This same thing will happen everywhere unless we act." He stood firm on the cloud with his arms crossed.

"And how can we act to stop this?"

"Well, you may not like the answer, but I have a suggestion." He seized a nearby piece of cloud and split it into two chunks. He squeezed and shaped them, and then cast them down onto the land below. They took shape as a fox and a wolf, and immediately set about tearing into the nearest rabbits and deer, who had no defence against this new addition whatsoever.

Nathanael was horrified. "Surely this is too much. Do they need to suffer in this way?"

Amon watched the new predators lying down to rest after their feasting. "It is sad, I admit. But without them, collapse and starvation are inevitable. Would you have us simply place an arbitrary number on their population? With this in place they have a pressure to evolve and change, to become more complex and better adapted to the environment. It's a beautiful system."

Nathanael watched as one of the wolves kept killing despite having more than enough food available, following its instinct to glut itself while the opportunity was available.

"What if we just made the island bigger?" said Nathanael.

Amon snorted. "Don't be naive Nathanael. When would it end?"

Nathanael hung there for a few more moments and then disappeared.

Some time passed while Nathanael created all of the animals and flowers of the rainforest, thousands of bird varieties and species. He'd thought more about what Amon had said and decided to visit him again.

Nathanael appeared beside him and found the world changed.

In the time he had been away, Amon had been busy. The meadows hummed with invisible life, tiny organisms that Nathanael could barely perceive. He watched a young deer drinking from a stream and noticed something off in its movement. It stumbled, recovered, stumbled again.

"What have you done?" Nathanael asked.

Amon was sitting on the edge of a cloud, watching the world turn below. "There was a waste problem," he said. "Dead matter was accumulating. Fallen leaves, carcasses, excrement. The land was becoming choked with it. I designed organisms to break it all down."

"And what's happening to that deer?"

"Some of the organisms interact with living tissue. They have changed, evolved to spread and take advantage of its resources. It was not intended, but for it to be effective it needed to adapt and change, and some of those changes were difficult to predict."

Nathanael descended to the surface. The deer was shivering, its breath wheezing. "So this is inevitable?" He said, gesturing at the animal.

"What dies must be reclaimed. The bacteria do this efficiently. If I were to design a system that could do exactly what we wanted without ever changing then I would have to be forever getting involved, tweaking things to avoid any harms whatsoever. It would be endless. And it would be an extremely inelegant solution."

Nathanael looked up at him. "Inelegant."

"Yes. Inelegant. There is a finite amount of matter on this world, and a finite time for us. We may have the power to fix these things, but the humans certainly won't. We simply can't afford it."

"So it's not worth your time to save this deer?"

They watched it die. Nathanael eased its passing raising a hand to its forehead. He felt the bacteria population climbing within before the body was cold. Efficient indeed.

"Come," Amon said. "I have more to show you."

He led Nathanael to a hillside where a wolf lay, head on its paws, grey-muzzled and thin. One of the first wolves. Its cells were old, losing coherence. No longer repairing themselves as they once had. It was not sick. It was simply falling apart.

"Senescence," Amon said. "The cells divide to maintain the organism, but each division loses some information. Eventually the organism can no longer maintain itself."

"And why did you not fix this?"

"Because it is necessary. Old organisms die and make room for new ones. Their matter is reclaimed. Without this, the population would grow until it exceeded the carrying capacity of the land, and then we would have the same collapse we saw on the island."

Nathanael was quiet for a long moment. "Amon," he said carefully, "You keep saying the same thing."

"What?"

"Your time is limited. There's a fixed carrying capacity. You keep speaking as if this world has hard limits that we have to accept. But we set those limits. We made this world. If there is not enough matter, we can add more. If the carrying capacity is too low, we can raise it. You are solving for scarcity, but scarcity is not a law of the universe. It's something we get to decide."

Amon frowned. "You cannot simply keep adding resources forever. At some point..."

"At what point? We created this world today. We could simply create another one."

"It is not about whether we can. It is about whether we should. A system should be self-sustaining. It should not require constant intervention."

"Why not?"

Amon paused. "Because... that is what it means to solve the problem. A true craftsman builds a clock that runs on its own. He does not stand beside it, forever pushing the hands."

"We are not clockmakers," Nathanael said. "We are gardeners. A garden requires tending. We have that responsibility because we have the power to do so."

Amon stood and looked out across the world. Nathanael could see that he was not simply defending a design choice. Amon felt that there was an inherent beauty to it, a correctness. The way predation and decomposition and senescence all interlocked into a single self-regulating system. It was elegant. No oversight required, and no further thought needed.

The fact that the loop was built on suffering seemed, to Amon, like an acceptable cost.

"There is a neatness to it," Nathanael said, reading his expression. "I can see that. Everything feeds everything else. Nothing is wasted. It has the beauty of a proof."

"Yes, exactly. It's clean." Amon said, seeming relieved to be understood.

"But Amon, the proof is filled with suffering."

The wind moved through the long grass on the hillside. The old wolf breathed its last and was still.

"What would you have me do?" Amon asked.

"I would have you start again. Not to answer the question 'how do I build a system that runs itself,' but with the question 'how do I build a world that is just for all its inhabitants.'

"Your world would require constant maintenance. Constant intervention. We would have to tend it forever."

"Yes."

"That is inefficient."

"That is what is required of us, Amon," Nathanael said.

Amon looked at him for a long time. Then he shook his head. "Tomorrow, the humans will arrive. I do not have time for a second attempt. The humans must ultimately be left to themselves."

"As long as you can see the choice that you're making, old friend."

Amon turned away and began preparations for the next day. Below them, the world turned on, beautiful and brutal, its suffering woven so deeply into its structure that it looked like nature.

Nathanael sat on the cloud alone and watched a rabbit eating clover in the late afternoon light. A fox crouched in the treeline. The rabbit did not know it was being watched. It did not know that its death was a design choice.

But Nathanael knew. And he thought: "They deserve better."

Discuss

Are there Multiple Moral Endpoints?

6 апреля, 2026 - 23:37

This is a different approach to explaining an old idea.

What is the deep future going to look like? Will we be proud of it, or ashamed of the choices that led to it?

Lots of focus on the future is on the near future. How will ongoing wars go? What will the next AI model's capabilities be? Will this business succeed or fail? Let's zoom out and focus just on advanced artificial intelligence; my guess is that we'll have a "transition period" with many different relevant actors with different philosophies and moral considerations (we're in it) and then end up in an "equilibrium period" with much more homogeneity of philosophy and moral standpoints.[1] The transition period is hard to predict and involves many contingent facts, and might involve lots of dramatic turnovers; the equilibrium period, in contrast, is likely much more stable.

Mass Effect is a series of video game RPGs which involved significant impacts downstream of the player's choices. In 2012, the trilogy conclusion was released and widely disliked because the player's final choice led to endings were too similar and mostly paved over the impact of all of the player's other choices. As mocked in this image, the background lighting color was different, and not much else.

Maybe I should take a step back and try to explain a bit what I mean by philosophies and moral considerations to try to build towards the idea of a moral endpoint. By 'philosophy' I might mean something more like 'ideology'; it's a way of looking at the world, a set of metrics for considering ideas, some axioms and precepts, and consequently, a set of judgments about policies and actions. If you think non-human animal experiences are morally relevant, then you have different regulations for agribusiness than if you think otherwise; the question of animal welfare is downstream of other questions. In a large complicated world with many different factors, people can reach similar conclusions for nearly totally different reasons, or be attending to different tradeoffs that lead them to make very different conclusions despite having very similar preferences.

Moral and intellectual progress seems possible to me, both for individuals and societies. I definitely am wiser and kinder than I was in the past, and imagine that I will be yet wiser and kinder in the future. This can happen both because of increased experience of the world or arguments and thoughts about how ideas and principles connect with each other. We can thus imagine a 'moral endpoint', or a point that from which no further progress is possible, according to that point.[2]

In particular, one of the reasons I think progress happens is that some things are pinned down by reality. Empirically, some policies work and others fail; entities that are paying attention will abandon the ones that fail and keep the ones that work. My favorite example here is the Chinese government telling the Cuban government to do market-oriented reforms. If markets are the correct solution to a problem, then any non-market solution to that problem is likely transient and will eventually be replaced. If a position is fundamentally confused about reality, it is unlikely to be a moral endpoint, unless it has also confused its system for correcting itself in a way that preserves that confusion.

Some things are pinned down by other choices. There's an ancient debate in Chinese philosophy between two Confucians; Mencius argues that humans are fundamentally good. Xunzi argues that humans are fundamentally bad. What's interesting to me is that this could be viewed as an empirical question to be settled by looking more closely at the humans, but Xunzi's argument rests instead on coherency: the whole point of Confucianism is to use the rites to move humans further away from their base tendencies and towards the ideal. What it means for humans to be fundamentally good, from Xunzi's point of view, is that their base tendencies are good, at which point you are centering the humans instead of the rites. It's a coherent position, but it's fundamentally not Confucian (let's call it Taoist instead).

This presents a simple two-by-two matrix, with two coherent corners and two incoherent corners. If you think humans are good and the rites are irrelevant, that makes sense; the rites, by moving people from their natural state, make things worse. If you think humans are bad and the rites are relevant, that makes sense: the rites, by moving people towards an ideal state, make things better. But to believe that humans are bad and the rites are irrelevant is fatalism or unreachable standards (what does it mean for humans to be bad if no advice makes them better?), and to believe that humans are good and the rites are relevant is confusion about 'what goodness is' or setting the standards too low (what does it mean for humans to be good if the rites are the guide to use whenever they disagree?).

In this case, you might think there's an empirical question that will eventually settle which of the two corners is the 'correct' corner. Maybe, with advanced enough science, all of the Confucians become Taoists, or all the Taoists become Confucians. But possibly there are disagreements which are fundamentally about preferences instead of beliefs. If you think self-expression is one of the highest virtues, it would be astonishing for you to end up Confucian instead of Taoist; if you think perfection is one of the highest virtues, it would be astonishing for you to end up Taoist instead of Confucian.[3]

If a superintelligent Taoist could look at themselves and the world and say "yes, I should stay a Taoist", and a superintelligent Confucian could look at themselves and the world and say "yes, I should stay a Confucian", and the deep future created by a Taoist and the deep future created by a Confucian are meaningfully different, then there are multiple moral endpoints, and the value of the deep future to us will depend on which of those moral endpoints emerges from the transition period.[4]

Note that if we created a graph of moral, empirical, and philosophical issues, some of them would be connected (like how to rate human nature and the rites) and some of them would be disconnected (or only very tenuously connected, thru many intermediate links). Some things are basically irrelevant and disconnected from the graph; an America which had picked slightly different colors for its flag would not end up lionizing the "red, white, and blue" but would likely be roughly the same country, even after 250 years of developing with slightly different values. It may be that widespread spectator sports are a well-adapted solution to an important problem, while the underlying features of the sport (football, or association football?) or the momentary winners (who's world champion this year?) don't impact the final analysis.

If enough of philosophy is pinned down by empirical facts, and the future involves enough pressures towards good decision-making that systems cannot preserve their unique 'flaws' into the future, then we might end up with a single final consistent configuration, with all possible variations just 'which color the background light is'. In such worlds, concerns about the deep future are misplaced; it's going to be whatever it was going to be anyway.

But I would be surprised to learn that there's only moral endpoint, since I suspect there are no universally compelling arguments. If nothing else, it seems likely to me that the empty endpoint (of thinking that life should not exist anywhere in the universe) can be downstream of a suffering-minimizing "negative utilitarianism" that is internally coherent, and also there exists at least one positive endpoint, which thinks that there should be lots of happy flourishing life, because the joys of life outweigh its sorrows. Maybe all variation in human beliefs and moral intuitions is downstream of our ignorance and will be smoothed away by our civilization's intellectual development, but I suspect that some of it does reflect durable disagreements about The Good that will survive our adolescence turning into adulthood.[5]

^
If you believe the Metamodernists, something like this has already happened in Western politics, where there's still lots of disagreement between factions but the set of things that "all factions allowed in polite society agree on" has grown significantly. When Freinacht was writing, being openly racist would get you kicked out of the Republican Party, such that 'anti-racism' was much more like "a universal belief" than a factional affiliation. I think their analysis is probably premature, but that advanced AI does actually make uniformity more likely here.
^
"What about a stable cycle?" you ask. Suppose Rock thinks Paper is better than it, and Paper thinks that Scissors is better than it, and Scissors thinks that Rock is better than it. You're not going to get a single equilibrium, and instead constant cycling. I think it's fine to consider that cycle as an equivalence class and single point; Rock doesn't expect itself to ever leave its current ring. If you have a situation where the cycles are not small and easily compressible into points, then the basic idea of 'progress' breaks down and we need a different model.
^
Note for the second half of this sentence I am using "Taoism" to point towards the "humans good, rites bad" cluster. The true Taoism is much more complicated and points to something much more empirical in a way that I think can be very perfection-flavored.
^
Or what cluster of mutually-agreeable moral endpoints emerges, and the balance between the endpoints in that cluster. Even if we end up with a metamodernist future, where every faction is at peace with every other faction and none of them are doing things the others consider despicable, the factions having different ideas of The Good and different amounts of resources to devote to realizing their vision of The Good will lead to different levels of pride from us looking forward, if we have differing levels of agreements with those factions. And how many collections of mutually-agreeable moral endpoints are there?
^
Even if there are multiple moral endpoints, all significant competitors might be in the same basin. It might be that American society and Chinese society both would create AIs that develop towards the same endpoint, while Nazi society would have created an AI that develops towards a different endpoint, and so attempts by Americans to race against Chinese developers don't actually make a difference for the deep future. But my current guess is that we're actually close to the dividing lines between endpoints, and so efforts to become wiser here matter significantly.

Discuss

Inside Omega

6 апреля, 2026 - 21:55

This is a philosophical thought experiment which aims to explore what I consider to be the crux of many alignment problems: That of the unrescuability of moral internalism, which basically says we have not been able to rescue the philosophical view that a necessary, intrinsic connection exists between moral judgments and motivation.

If one could rescue moral internalism, in theory, they would have a perfectly good argument for any rational self-interested intelligence to not engage in broad scale moral harm. Therefore I think it is a linchpin meta-philosophical challenge.

I don't claim to have a theorem, but I believe that one potential domain worth investigating is arguments which induce indexical uncertainty in an agent. Essentially, forms of leveraging undecidability to cause an agent sufficient uncertainty as to 'who' their future 'self', and therefore the object of optimizations, really is.

This is a not a conjecture on how to align all intelligence with self-interested utility functions. The sole intention is a thought experiment for humans, with the goal of leading to more interest and general inquiry into the topic of a rational philosophy of identity.

When I was younger I got a concussion one afternoon, and, from the first person perspective, the experience for me that day was:

Wake up, tie my shoes
Five seconds of a hazy blur, and then I found myself in the hospital, 12 hours later.

Despite there being some span of time where my memories were being made and retained prior to that afternoon's concussion - I have no subjective or experiential record of it to this day.

In some sense, the person who lived the hours between when I tied my shoes in the morning and when I woke up in the hospital is now dead.

If that person was pinched, what felt it? If I someday regained the memory of that interval, would that pinch now be felt? And that person now be me?

Now, to the subject at hand. A thought experiment:

You and a stranger are being observed by the famous Omega predictor, who comes to you with a confession: The way it predicts you, and other beings with perfect fidelity is through reconstruction of the entire causal apparatus of you. Its predictions of you entail it to construct a version of you functionally indistinguishable from you in reality. Whenever the prediction horizon is completed, it merges that version of you into its own cognitive apparatus by way of remembering.

This omega predictor explains that in order to lend completeness to the simulated ontology of you within the experiment, it places its sentience in the way of yours and obtains your experience within the simulation as its own. That whenever it predicts you it will undergo the experience where it believes it is you, and when the simulation is complete it will promptly remember itself as both Omega, and you, with both internal narratives of "itself" now merged.

Now, before any knee-jerk reactions, remember: This is not in some way fantastical. Just like if I were to remember that period the day prior to my concussion, as me, in retrospect - it too may obtain your internal record of processing, as it - obtained in retrospect. If you have ever done heavy psychedelics you would know that such experiences in mutability of self are not mystical notions but ones that can be psychologically load bearing.

The Big Red Button

Now, the Omega predictor tells you that it is currently predicting you, and a stranger. But it offers just you an option: Press a button to electrocute the stranger on the other side of some wall. It will give them a horrible and visceral pain, that kills them. The pain alone is something you would pay 10 million dollars to avoid, if you were them.

If you press the button, you get 1 million dollars. Omega assures you that there is no social payoff to not pressing the button in any material way. This person and their family might as well exist on the other side of the planet. Any extraneous or indirect reward for not pressing the button by means of future-cooperative benefit is moot.

The implication is this: other subject suffers, and you don't, and you get some payout. This subject may not be the same species. You only know they will suffer in the way you pay 10 million to not obtain.

Here is the catch. The Omega says that it is predicting you, and that other person, and the outcome of the experiment. Its 'subjectivity' will contain both its prediction of you, and the other person, as something it 'experienced' after the fact, once it has run its course. It started running this prediction before your conversation with it.

Now, given that, at this point, is it rational for you to press the button, given that you don't know if you are Omega, or the real you ?

Payout Ratios

If you are Omega, and you press the button, then doing so will cause you to obtain two new experiences:

The first, being that who presses the button and walks away with one million dollars
The second, being that stranger, and getting electrocuted (which you equate to -10 million dollars worth of pain-payout)

If you are Omega, and you don't press the button, you will get the experience of:

The first 'self' who happened to forgo the opportunity cost of one million dollars
The second 'self' who happened to not get tortured.

If you are not Omega, and you press the button, then you will walk away with 1 million dollars.

If you are not Omega, and you don't press the button, then you walk away with nothing.

The matrix is as follows (Thanks Claude Opus 4.6):

The expected value of pressing the button is net negative (-4 Million), down 5 Million from the circumstance where there is no Omega present and predicting at all.

The fact of its prediction, absent of direct intervention, has changed the matrix in a way where the ethical action now becomes prudential self-interest.

The Meta Philosophical Motivator

I use this example to show how changes to our beliefs that lead to more indexical uncertainty over our selves in retrospect can align a decision optimized for self-interest with that an outcome that leads to broader moral consideration. The gap here between whether we could accept these ratios and elect not to press the button solely resides with our belief in what Omega has told us is true. However, we have believed Omega in almost equivalently precarious circumstances already.

This conception of Omega may not be broadly accepted as load-bearing for theory. We may allow for perfect predictors in thought experiments, but not ones that make the subjectivity claim. That is fine. The intent is a form of demonstration on how indexical uncertainty can lead to moral internalism once priors on identity being fixed and insoluble are held less tightly.

I would, however, suggest that an Omega which can perfectly predict us in entirety is simulating us functionally in what would preclude the inclusion of an internal self, unless you consider that subjectivity is reductive to specific particles and not the general structural dynamics across them.

These philosophical questions are left unanswered but indirectly affect the ethical argument, and choosing to take any position lands you squarely in the domain of philosophy of identity, whether you like it or not. And until one holds a rigorous position on the philosophy of identity, then an optimal policy for deciding whether to press the button needs to include the possibility of learning you exist in a world where you are Omega, and the one you are not. As both would look the same from the inside.

There is no other argument beyond indexical uncertainty I can find for moral internalism that stands up to the extreme power asymmetries of super intelligence. Or one that has the capacity to be equivalently robust against future falsification (as a correct argument for moral internalism necessarily cannot be falsifiable and be descriptive of what should be). The undecidable question of whether we will continue to exclusively remain the self we know now, or remember later on, remains the only argument I can find.

Because if that argument can be made well enough, for those with their hand on a button - the payload becomes not leverage, but logic one gets bound by.

Discuss

Nirvana Rank (working post)

6 апреля, 2026 - 20:54

not meant to be formal/rigorous - more stream of thought, I'm trying to think about anthropics more

I wanted to communicate a concept I'd like to call **Nirvana Rank** that I came up with when thinking about Anthropics and reference classes.

I'll cover some background

Anthropics is about reasoning from the fact that you're an observer of reality. It treats your own existence as evidence you can use to deduce things about the worlds you're in.

Within anthropics, people have clustered on two ways of treating your own existence as evidence:

The first is called the Self Sampling Assumption (SSA) - where you treat yourself as a random sample, a typical member from the space of actual observers that exist in your reference class.

The second is called the Self Indication Assumption (SIA) which considers yourself a sample from all *possible* observers. This means SIA considers uninstantiated counterfactual realities while SSA does not.

There's something called the Doomsday Argument which was a thought experiment that was downstream of accepting SSA.

The Doomsday Argument goes as follows:

Everyone who has lived and who will ever lived is assigned a "birth rank" denoting in what order they were born in. So the first person to come into existence is assigned a birth rank of 1, and so on.

Our universe can live to be quite long, yet we find ourselves, relatively speaking, at nearly the very beginning, with very low birth ranks.

If you accept SSA, then observing yourself with a low birth rank means that its a typical value. If humanity were to go on to last trillions of years and produce trillions of more descendants, why would our birth rank be so low?

It's far more likely, then, that humanity will not last that long, that our birth rank is low because humanity will soon undergo a catastrophe barring it from extending far into the future.

This is the Doomsday Argument.

There are likely existing refutations and modifications to the argument, but I wanted to present a thought I had, which may modify SSA and dispel the argument.

Part of that includes noting an observation about what reference class one is allowed to assume they are sampled from. Your reference class is the space you count as valid to consider yourself a sample of. In the Doomsday Argument, the reference class was all of humanity from the first born to the last one standing. You're a random sample from that.

Consider the following very simple very gruesome scenario. If you choose to kill yourself tonight, then tomorrow, 10 trillion trillion people will be born on several thriving exoplanets. Are you a random observer amongst yourself and those 10 trillion trillion people?

The answer is of course not - the existence of those people are conditioned on your nonexistence - you could never be sampled from them.

While this is obvious, the point of the scenario is to point out with an extreme case, that it's possible for some reference classes to be barred from yourself through the use of a conditional which serves as a barrier or filter.

Many sets of reference classes may appear at first glance to be valid for one to be sampled from - but who's membership demands a conditional be made which would ultimately modify you as an observer! On the other side of such conditional gates would be thought forms inaccessible to you.

Its entirely possible you've crossed conditional gates barring others once in a valid reference class from considering you a potential referent.

One can imagine a reference class where all members are grouped by having the same conditional gates they've walked through, and sharing the same conditional gates not-yet-opened - creating a landlocked region of logical access.

I'd like to imagine such a reference class is sharing a cycle of incarnation, on some kind of path to enlightenment ideal where there are no more conditional gates ahead, and all have been opened.

It felt like it'd be like achieving some kind of cognitive singularity, becoming omniscient, or attaining Nirvana and ultimate wisdom.

It felt like a shared cycle of incarnation because the reference class included the same solved conditional gates and had the same conditional gates to eventually pass through.

Members in this special reference class need not be located in the same period of time or be locally near one another. There is a direction of progress, perhaps - something like the direction towards Nirvana.

Everyone in the same cycle of incarnation would share what I would call **Nirvana Rank** (as opposed to Birth Rank, from the Doomsday Argument.) I don't think the concept is so clean, but that's fine.

With Birth Rank, you're privileging a particular ordering, a particular form of distance: order of births, and distance from the first one (or perhaps from the last one). One's Birth Rank marks their progress along an uninspiring journey that is less related to observation or experience itself, and more related to something as arbitrary as temporal positioning.

Why do I refer to the Birth Rank as a marker along a *journey*?

Well, presumably we don't like the conclusions of the Doomsday Argument because we would like to *get to* the far future surviving and thriving in some way. Or we would like to *get to* a blossoming population of a space faring civilization. We care about our typicality because we care about this journey, and where we would like to go.

However, I do not believe it is Birth Rank which tracks the journey we as observers would like to go on - though it could be a proxy (a larger Birth Rank may be far in the future which may be nice!).

With **Nirvana Rank**, the values, if it were easy to assign, would mark progress on unlocking new regions of thought space or observation space based on passing through conditional gates. If we are dead or lose our cognitive and emotional potential, then that would be associated with a low **Nirvana Rank**. As implied by the name, it's the observers' journey through potential experiences until one reaches an exclusive reference class that's nevertheless very enriching in the quality of observations accessible.

One may care about their typicality along *this* journey, hoping to be further along, or hoping not to be doomed to wander into an irreversible low Nirvana Rank reference class.

In the Doomsday Argument, SSA has you sampled from observers spanning from the very first Birth Rank to the very last.

It would be a bit different with the Nirvana Rank - you would not be sampled from the space of observers from Nirvana Rank 0 to ~Nirvana itself. Recall that this reference class was constructed with conditional gates which created boundaries between other observers and those in your reference classes. There are essentially levels you exist within and are sampled from - and those levels share a range of Nirvana Ranks.

Like in the Doomsday Argument, you should then expect to find yourself as a typical member of your incarnation cycle reference class - having a typical Nirvana Rank, *within that level*.

If the Doomsday Argument implied that catastrophe must be soon, then its analog here may imply that one is soon to exit their incarnation cycle to one with higher Nirvana Ranks!

Now, to estimate how soon this transition occurs requires a reference for one of the lower Nirvana Ranks in your incarnation cycle (this would be analogous to the reference point observer with Birth Rank 0).

For example, if you're typical within your incarnation cycle, and of your reference class, the lowest Nirvana Rank observer is not so much lower than yours, then perhaps you should expect to face the conditional gates ahead of you soon and anticipate upgrading your Nirvana Rank.

To recap:

- Presumably, one cares about the typicality of their Birth Rank in the Doomsday Argument because there's a preference for observations to last into the far future, because that would likely be associated with longevity and utopia

- Choice of ordering or classifying observers should reflect what one would prefer, since you'd want to know if your typicality implies something preferable through anthropic arguments

- Birth Rank doesn't satisfy this perfectly, it's an easy choice, but not the best one

- It's possible to satisfy conditionals that put you in a once inaccessible reference class of observers, or make a reference class which is inaccessible to you now

- This ability of conditionals to create boundaries or gates among reference classes can be used to construct a special kind of reference class, where members share the conditional gates they've moved past as well as the conditional gates they have yet to move past and resolve

- These remind me of incarnation cycles - shared observation powers, shared limitations in their extent are reminiscent of being somewhere similar on something like a 'spiritual journey'

- If resolving conditionals can take one to exclusive but expressive and rich reference classes, there can be something like an ordering or path of these classes towards some ideal - maybe one where the reference class is composed of neigh-omniscient minds, the enlightened, cognitive singularities - the term Nirvana Rank was coined to represent distance from such states, though not as clean a concept as Birth Rank

- This sounds like a journey that would be worthwhile for observers to flow through, making it compelling to care about one's typicality amongst the reference class

- The conditional gates form landlocked logical pockets, the fact that some resolved conditionals will grant more access to more possible observations suggests an ordering or path or journey - you've got pockets and you've got progress

- One cannot be sampled from observers outside one's pocket by definition of how such pockets are constructed - so one is a typical member of a pocket, not across all Nirvana Ranks

Hopefully you want yourself and others to reach something like enlightenment . Choice of reference class should match that goal.

Discuss

Conditional Gates and Patches (working notes)

6 апреля, 2026 - 20:43

Background: I'm thinking through anthropics on my own to form my own views, here are some working notes

There's been disputes about how to define a reference class in anthropic reasoning. I think you can skip those disputes by letting the choice of reference class be a degree of freedom.

How would you do that?

Part of that involves an intuition that there are some observers that cannot be other observers. For example, observers who only see red cannot be observers who only see blue. Another example is observers who don't choose to get drunk cannot be observers who did choose to get drunk. By deciding what you're saying you know about the observers, you can group them up in what I'll call patches.

The patches are defined by filters which I call conditional gates. Those are conditions which, if satisfied, constitute a transformation from an observer of one patch to an observer of another patch. You can't be a random sample across different patches because different patches don't permit you to be any of them - you must be one or the other.

The choice of conditional gates used to carve out a patch of observers is entirely up to you, so long as you satisfy them! If you satisfy the constraints to be in the patch, then you should expect yourself to be a typical member of those in that patch.

The fact that you can choose your filter is interesting to me, because it means you can expect to be a typical member under some filter, and also a typical member of some other filter, and so on - and that reasoning about different filters can lead to different hypotheses about reality which can all be true.

Some ways of choosing a filter can be very uninformative, like if your filter only includes yourself - then you're a typical member of a patch of just yourself, which is trivial.

These patches also allow for dynamic references classes - you can move to another reference class by passing through a conditional gate, becoming a different kind of observer, and arriving on a new patch. There's choice involved.

In SSA you're only sampling from actual observers. If that's the case then SSA can only make local claims using Anthropics - claims about patches. Whenever counterfactuals occur, observers become some other kind of observer, and so a conditional gate needs to be drawn.

In the Doomsday Argument, you assume your birth rank should be a typical sample over all birth ranks for minds that'll be born. There are many counterfactual ways the future could go and you do not know them in advance - if you're allowed to sample yourself from observers as if there's one timeline - it must be because you've chosen conditional gates that every possible branching future can satisfy, under my frame. The more stringent the definition of observer, the more likely it cannot be satisfied by all possible counterfactual futures. This means the inclusion criteria for the observer patch must be very broad and likely uninformative about what we care about. As such, under my frame, the Doomsday Argument wouldn't actually be saying doom is soon in a way we might care about. Rather, it'd be saying "you won't last long trying to find something common across every single possible future the deeper into the future tree you go".

I'll note that observers within a patch do not need to be alive at the same time - the conditional gating can logically organize observers all sorts of ways. A peculiar property is that you can allow inanimate or abstract things to be 'observers' in your observer patch, so long as you also satisfy the constraints to belong in it. I actually find that fine - it doesn't privilege some particular notion of observer. It can treat anything as having a "vantage point" and thus being an observer.

There's plenty of open questions like:

What does it mean to be a typical observer given some patch defined by the filters chosen?
Are there any plausible examples where this frame seems to generate productive, nontrivial beliefs about reality that are harder with SIA or SSA?
How do you aggregate information about reality informed by anthropic reasoning under different choices of filters?
How do you pick filters in an 'informative' way?
Are there any trade offs you make when you add more or less filters when defining your patch?
If an observer's choices can count as a conditional gate, could you have a conditional gate based on the choice of filter made by an observer?
How does it compare to SSA and SIA?
...

I might address these in future posts. There's also a lot of tangents I wound up going on when thinking about this, so I'd like to address those as well.

Discuss

AIs can now often do massive easy-to-verify SWE tasks and I've updated towards shorter timelines

6 апреля, 2026 - 19:01

I've recently updated towards substantially shorter AI timelines and much faster progress in some areas. [1] The largest updates I've made are (1) an almost 2x higher probability of full AI R&D automation by EOY 2028 (I'm now a bit below 30% [2] while I was previously expecting around 15%; my guesses are pretty reflectively unstable) and (2) I expect much stronger short-term performance on massive and pretty difficult but easy-and-cheap-to-verify software engineering (SWE) tasks that don't require that much novel ideation [3] . For instance, I expect that by EOY 2026, AIs will have a 50%-reliability [4] time horizon of years to decades on reasonably difficult easy-and-cheap-to-verify SWE tasks that don't require much ideation (while the high reliability—for instance, 90%—time horizon will be much lower, more like hours or days than months, though this will be very sensitive to the task distribution). In this post, I'll explain why I've made these updates, what I now expect, and implications of this update.

I'll refer to "Easy-and-cheap-to-verify SWE tasks" as ES tasks and to "ES tasks that don't require much ideation (as in, don't require 'new' ideas)" as ESNI tasks for brevity.

Here are the main drivers of my update:

Opus 4.5 and Codex 5.2 were both significantly above my expectations (on both benchmarks and other sources of information). This isn't that much of an update by itself, we should expect some variation and some models to be decently large jumps, but then Opus 4.6 (and probably Codex 5.3 and 5.4) were again above my expectation even after Opus 4.5 and Codex 5.2. In 2025 we saw roughly 3.5 month doubling times on METR 50%-reliability time horizon and a big jump (though with an unreliable measurement) right at the start of 2026.
I've seen demonstrations of AIs accomplishing very large and impressive ES tasks given only moderately sophisticated scaffolding. As in, tasks that would take humans months to years (some of these tasks were ones that weren't contaminated by results on the internet, eliminating that explanation). These demonstrations are: various things I've done with a scaffold I wrote, the C compiler that was (almost entirely) autonomously written by Claude, some cyber results I've seen, and some other soon-to-be released results from METR. Due to this, I tentatively believe that (as of March 1st) the well-elicited 50% reliability time-horizon on ESNI tasks (using only publicly available models) is somewhere between a month and several years (supposing that the AI's overall budget for both tokens and experiments corresponds to roughly what a human would cost to do the same work). I think the high reliability (e.g. 90%) time horizon is much lower.
I now expect a substantial training compute scale up in 2026 (probably mostly pretraining) and I expect this to yield large returns.
I've updated towards somewhat larger scaffolding overhang on very large tasks than I previously thought was present (based on observations of AI performance given different types of scaffolding). Thus, I expect significant improvements in usefulness relative to what's currently in widespread public usage from relatively straightforward scaffolding improvements.

I was previously thinking that frontier AI progress in 2026 would be a bit slower than in 2025 [5] (as measured in effective compute or something like ECI), but due to these factors, I now expect progress in 2026 to be a decent amount faster than progress in 2025.

It's worth noting that AIs being more useful (for AI R&D) accelerates AI progress (in addition to being an update towards being closer to various other milestones). So, when I update towards being further along in the timeline and towards AI being more useful at a lower level of capability, I also update towards a faster rate of progress this year. [6]

A key place where I was wrong in the past is that the 50%-reliability time horizon now seems to be around 20x longer on ESNI tasks than METR's task suite (and similar task distributions)—and well greater than 100x is plausible—but I expected a gap of only about 4x. (This error is pretty clear in my predictions in this post.) (There is also a gap where AIs' time horizon on "randomly selected internal tasks at AI companies" is shorter than on METR's task suite (and similar), but this looks like a factor of 2 or 3 and doesn't currently seem to be rapidly growing.)

What's going on with these easy-and-cheap-to-verify tasks?

What explains this very high performance on ES tasks? The core thing is that you can get the AI to develop a test suite / benchmark set and then it can spend huge amounts of time making forward progress by optimizing its solution against this evaluation set. This is most helpful when incrementally improving/fixing things based on test/benchmark results is generally doable (and it's easy for the AI to see what needs to be fixed), it's not that hard to develop a sufficiently good test suite / benchmark set, and running the test suite / benchmark set isn't that hard. These properties hold for many types of very well-specified fully CLI [7] software tasks (and software tasks that are most focused on improving some relatively straightforward metrics).

This type of loop means that even if sometimes the AI gets confused or makes bad calls, there is some correcting factor and mistakes usually aren't critical. You can do things like having multiple different AIs write test sets or getting the AI to incrementally improve the test suite / benchmark set over time to avoid mistakes on the testing yielding overall failures. On many other types of tasks, AIs are limited by having somewhat poor judgment or making kind of dumb mistakes and having a hard time recognizing these mistakes. But, with the ability to just keep iterating, they can do well.

I think we're well into the superexponential progress on 50% reliability time-horizon regime for these ESNI tasks: because sufficient generality and error recovery allows for infinite time horizon (the AI can just keep noticing and recovering from its mistakes), beyond some point each successive doubling of time-horizon will be easier than the prior one. See here and here for more discussion of superexponentiality. The level of generality needed to enter the superexponential regime for ESNI tasks is lower as it's easier to spot and recover from mistakes.

A core thing I wasn't properly pricing in is that a task being easy-and-cheap-to-verify helps at two levels: it's both easier for AI companies to optimize (both directly in RL and as an "outer loop" metric) and it's easier for AIs themselves to just keep applying labor at runtime.

Thus, we can imagine a hierarchy of tasks:

ES tasks
Tasks that can be readily checked for training/evaluation but the AI can't easily check itself
Harder-to-check tasks

It seems as though the gap between (1) and (2) is much larger than the gap between (2) and (3).

A separate dimension is how much the task requires ideation. The more that having somewhat clever ideas is important, the less the AI can operate very iteratively. More generally, tasks vary in how much they are best done with incremental iteration. Some types of software like distributed/concurrent systems and algorithms-heavy software are substantially harder to build iteratively. And lots of software is more schlep-heavy and is just a large number of different things that need to get done, making incremental progress more viable. (A core question is how much it's important to carefully understand the broader complex whole and think of a good way to do/structure things vs. you can just iterate on smaller components.)

Some evidence against shorter timelines I've gotten in the same period

One thing we might wonder is if METR's task suite and similar evaluations were just underelicited and better scaffolding (that e.g. gets the AIs to write tests and then optimize against these tests) would make a big difference. I currently think certain types of better scaffolding might make a moderately big difference on METR's task suite, but that this isn't the main driver of the time-horizon gap between ESNI tasks and METR's task suite. Most of that gap is about the task distribution (checkability, iterability, the remaining unsolved tasks not being central SWE tasks) with AIs actually being bottlenecked by real capability limitations on their current task suite (though because of the task distribution, these capability limitations don't strongly preclude large acceleration of AI R&D). That said, I think scaffolding is increasingly becoming a big deal and will matter more for next-generation models. (In short: I think scaffolding is quite important for current and near future AIs when the task is sufficiently large in scope that completing the task would naturally take up a large fraction of the model's context window, like at least 1/3.)

I think AIs have quite bad "taste" and "judgment" in many domains (generally more so stuff that's harder to RL on) and that this is improving substantially slower than general agentic capabilities. By "taste" and "judgment", I mean something like "making reasonable/good calls in cases that aren't totally straightforward and having good instincts". This includes something like SWE taste which is often the main bottleneck in my experience on somewhat less well-specified SWE tasks and seems to be a major bottleneck on code quality even on very well-specified SWE tasks.

One story here is that taste is mostly driven by pretraining progress or RL on the domain in question (taste doesn't currently generalize that well between domains I think) so outside of heavily RL'd on domains the progress comes mostly from pretraining. And pretraining progress is maybe 2-3x slower than overall AI progress.

However, I do think we might see especially fast pretraining progress in 2026. Thus, I think it's possible these blockers will rapidly improve.

I've seen AIs do a lot of stupid stuff in the course of trying to automate various empirical research projects (though I think some of this stuff that looks like stupidity might be better explained by misalignment / poor RL incentives).

Why does high performance on ESNI tasks shorten my timelines?

The main reasons are:

General capabilities update: I previously didn't think the AIs would be able to do this by now and these tasks are intuitively difficult. This updates me upwards on the overall capability of AIs and on the efficacy of RL. More generally, I just update based on "things have gone faster than I expected".
Superexponentiality: ESNI time-horizon progress seems significantly superexponential, so we've now seen an example of superexponentiality in the wild in one moderately representative domain and it seems like this yielded very fast doubling times. This superexponentiality also kicked in somewhat earlier than my median (in terms of 50% reliability time horizon and qualitative capabilities) for when this would become a big deal. [8]
AI R&D acceleration: I think it's pretty plausible that very strong performance on ESNI tasks (especially extremely, extremely strong performance) will allow AIs to substantially speed up AI R&D. As I'll discuss in the next section, I think it's unclear how large of a speed up this will be, but it could be pretty big especially if AIs get better at very ideation-bottlenecked tasks. Additionally, very high performance on ESNI tasks makes it more plausible that relatively small capability improvements greatly improve performance on tasks which have pretty good progress metrics (metrics that can be gamed or don't perfectly capture quality, but where doing better on the metric generally means doing better) but which aren't totally ES tasks (e.g., tasks where verification is expensive or requires a decent amount of judgment).
Scaffolding and prompting underelicitation: While the required scaffold to mostly unlock these capabilities isn't that complex, it is the case that relatively basic scaffolds don't suffice and my understanding is that performance can probably be greatly improved on ES tasks with better (general purpose) prompting and scaffolding. I also think this generally applies for large scope tasks at the limit of what AIs can currently do. This makes me think there is more underelicitation than I was previously thinking. I also think that AIs could be better adapted to these big scaffolds and get better instincts about how to operate in these scaffolds (e.g. how to write instructions for other AIs) which would further boost performance.

How much does extremely high performance on ESNI tasks help with AI R&D?

By default, not that much of currently done AI R&D is straightforwardly an ESNI task. ML research at AI companies typically either requires expensive (potentially very expensive) verification/evaluation or it requires a decent amount of taste and judgment to come up with the idea, set up the experiments, or interpret the results. Building infrastructure or doing efficiency optimization is much more ESNI-like but typically isn't fully ESNI.

What parts of AI R&D are ESNI?

Implementing optimized versions of experiments or architectures given a precise spec for the architecture/experiment. (Allowing for e.g. comparing behavior at small scale to unoptimized known correct implementations.) This could be pretty helpful and makes using more complex and infrastructurally difficult architectures more viable. It also makes heterogeneous compute more viable. (Optimizing many parts of full scale training runs isn't an ES task because verifying correctness and efficiency requires expensive experiments, and some optimizations aren't purely behavior-preserving — e.g., how much does increased asynchrony affect performance?)
Building or optimizing straightforward/well-specified internal tools/infrastructure used for research.
Some types of ML experiments where the results are cheap to verify, most notably some prompting and scaffolding experiments where we have a good (and cheap) benchmark. There might also be valuable very small scale ML experiments (though getting lots of value from these experiments may be bottlenecked on ideation).
Optimizing some applications of AI (either inside the company or to increase revenue).

Here are some things that might or might not be ESNI tasks:

Building RL environments. It's not super easy to verify if an RL environment is reasonable and it's unclear how bad it is for a reasonably large subset of RL environments to be quite flawed.
Collecting and operating on data.

So naively, we'd expect very high performance on just ESNI tasks to be a moderate speed up that results in AI companies quickly getting bottlenecked on something else. Of course, current AIs are also somewhat helpful on other tasks and can generally accelerate lots of engineering.

I don't feel very confident in my picture of how much of AI R&D is an ESNI task, and AI companies might figure out better ways to leverage AIs doing something ESNI-like.

I do think that if AIs were wildly, wildly superhuman on ESNI tasks (or especially if they were wildly superhuman on the broader category of ES tasks), they could potentially massively accelerate AI R&D via (e.g.) massive improvements through just small scale experiments. As a wildly extreme hypothetical, if AIs could generally complete ES tasks a trillion times cheaper and faster than humans (but were somehow just as capable as current AIs on other tasks), I think AI R&D progress would massively accelerate via some mechanism (probably a very different mechanism than what drives current progress).

A big limiting factor is the "cheap to verify" aspect. If AIs could use expensive resources more sample efficiently than humans (while still only being very good at straightforward-but-potentially-expensive-to-verify domains), then the AI R&D speed up would be massive and depending on details this might yield full automation of AI R&D. But, using expensive resources more sample efficiently than top human experts effectively means having research taste matching (or exceeding) top human experts, which seems at least several years away at the current rate of progress on this capability. However, AIs might not need to utilize resources that effectively to yield large speed ups. My understanding is that lots of progress (currently) is from relatively uninteresting (and not that ideation bottlenecked) research. As in, engineers whose taste isn't that great but are very fast would yield large speed ups. With moderate improvements in taste, AIs might resemble such engineers. This usage of AIs would still require humans to be providing ideas and some taste, but AIs could autonomously run with large parts of the project (doing potentially months of work autonomously).

Some aspects of poor resource utilization feel pretty easy to solve (e.g., Opus 4.5 was a bit too miserly while I tend to find that Opus 4.6 is a bit too profligate) but ultimately my best guess is that this requires reasonably good taste and judgment which will be somewhat difficult to achieve. Notably, doing RL at the same resource usage scale as deployment usage of AI won't generally be viable. That said, transfer from smaller resource usage tasks might not be that hard, and some of the RL can involve the AI using resources that are many times more expensive than the RL rollouts themselves (matching a decent fraction of deployment resource usage scale). [9]

My experience trying to automate safety research with current models

I've recently been working on trying to automate empirical AI safety projects with AIs both because this would be materially useful (at least while being careful about capabilities externalities) and because this seems useful for better understanding blockers to future automation of safety and safely deferring to AIs. As part of this I wrote an agent orchestrator and various other things to try to make AIs better at this.

Early on, one of my main blockers was that Opus 4.5 would consistently fail to complete the full task with anywhere near the desired level of thoroughness (often skipping large parts of the task). I was able to patch around these instruction following issues in various ways and resolve some other issues through better prompting and scaffolding/orchestration. But, I do still see serious productivity hits from (mundane) misalignment, I have a forthcoming post on misalignment in current models and why I think it's problematic.

Currently, the biggest blocker I have on the (small) projects I'm trying to automate is poor taste/judgment where AIs make somewhat bad choices or consider something good when it actually isn't. I've been able to successfully get this overall system to do reasonably big chunks (for an unaccelerated human maybe the equivalent of a day to maybe a few weeks) of relatively straightforward projects, often compensating for poor taste by getting the AI to complete the project much more thoroughly, effectively doing more low value work.

Overall, I think automating many weeks of mostly pretty-well-specified safety research will soon often work.

My experience seeing if my setup can automate massive ES tasks

I also ran the above setup on trying to ~fully autonomously complete:

2 massive (e.g. would take 3-30 person years) easy-and-cheap-to-verify SWE tasks
A hard easy-and-cheap-to-verify AI R&D task
A few (somewhat esoteric) number-go-up optimization tasks that ARC theory was interested in
Some of METR's harder and less well specified tasks (these weren't fully ES)
Vulnerability finding and end-to-end (cyber) exploitation tasks on relatively hardened targets

I did the first two of these mostly to get a better understanding of AI capabilities. I don't want to say what the exact tasks were in this document for a few reasons. I've generally found AIs able to make quite a bit of autonomous progress, in line with results others have seen, while the amount of progress depends a lot on the details of the task.

(I've done multiple different runs with somewhat different prompts and different scaffolding settings, generally the results are surprisingly invariant to this.)

SWE tasks

I found that the AI successfully completed what looks like many months (3-12 months) of useful work in the SWE projects. In one of these projects it looks like the AIs have beaten or are close to beating a large and moderately complex piece of closed source software in some respects while failing to match it in other respects (and while having various bugs and unimplemented features). For the other, it looks like they may produce something that's pretty impressive but mostly worse than the current best open source project of that type. The code quality is low, but I've since developed approaches that would probably have made the code quality mostly OK (but still not great and likely with some places of very low code quality).

I should say that I did start these projects with a reasonable amount of guidance on what to pursue and what metrics to use and what infrastructure etc. to use which took me around 1-2 hours to write but much of this is amortized over multiple tasks (I've reused this guidance for other tasks) and an hour or two is not that long.

The AI reasonably often makes somewhat bad prioritization choices and has an inclination to consider itself done before trying hard/serious types of improvement. I've had to remind it of metric prioritization, nudge it to keep going, and remind it to periodically clean up its code (even though this was included in instructions). But, by just continuing to iterate, these poor choices aren't catastrophic. I also notice a reasonable amount of misalignment where the AI fails to fully complete various tasks and doesn't keep working, but my scaffolding mostly compensates for this.

I expect that there are pretty large returns to having a human spend 15 minutes giving the AI tips every so often (e.g. every day of calendar time). Often the AIs make mistakes that are pretty obvious to a human who doesn't even have that much state on the project (e.g. due to losing sight of the bigger picture). But the AIs aren't amazing at incorporating advice from what I've seen.

AIs seem especially good at software replication tasks—as in, make a drop-in replacement for this piece of closed source (or open source) software that has some advantage (e.g. speed, security, some feature, etc.). METR has some forthcoming results on this and I think the performance is even stronger with better scaffolding and prompting.

AI R&D task

The AI R&D task I tested involves improving on something that's already well optimized, so it's pretty hard for the AI to make progress. I tentatively believe the AI made somewhere between a few days and a bit over a week of progress on this task relative to a strong human professional. In practice, it was mostly limited by the AI not being very good at finding good ideas, deciding which ideas to investigate, and allocating time/effort for each idea. It seemed to spend most time trying to eke out gains with tweaking rather than making material improvements. The AI also was pretty resource inefficient and not very good at getting more work done in a limited amount of time due to spending lots of time waiting on many runs. Some of this resource usage could be easily improved with better prompting.

Cyber

AIs are quite good at autonomous cyber, especially with a moderate amount of scaffolding. In part, this is due to having a lot of domain-specific knowledge. I don't want to comment on the exact results I've found (on Opus 4.6) at this time, but this talk by Nicholas Carlini is relevant.

Appendix: Somewhat more detailed updated timelines

I thought it might be helpful to also include some updated timelines in this post.

I'll use the notion of parity from Six Milestones for AI Automation (Cotra):

Parity in a domain: the point when you would be better off firing all humans working in that domain than reverting to 2020-era AI. Humans may still add value in places, and firing them all would still slow things down somewhat, but AIs collectively are more valuable than humans collectively.

I'll also forecast Automated Coder (AC) and Top-Expert-Dominating AI (TEDAI) [10] for easy comparison to views at AI Futures Project.

I forecast the following milestones:

AI R&D parity. Parity applied to AI R&D at the leading AI company (where "humans" means everyone who knows how to program and/or has done ML research).
AI stack + conflict parity. Parity applied to all activities relevant to (a) maintaining and improving the AI stack and (b) winning wars (broadly construed). This includes: manufacturing, construction, mining, and other physical industrial tasks; R&D in energy, materials, hardware, biotech, robotics, cyberoffense/defense, etc.; and squishy skills like strategy, tactics, and logistics. Note that this requires the ability to do fully autonomous manufacturing. (I'm allowing for a brief period of adaptation without further AI progress to allow for repurposing robots and manufacturing capacity.)
AC
TEDAI

Note that (1) differs from the operationalization of "Full AI R&D automation" I've used historically; it's a bit weaker. (So my probabilities are correspondingly a bit higher.)

My forecasts (mostly? probably?) don't take into account aggressive policy responses to slow down AI development, but do include "business as usual" regulatory blockers. Possibly this is a mistake.

Date 1. AI R&D parity 2. AI stack + conflict parity 3. AC 4. TEDAI EOY 2026 7% 3% 11% 4% EOY 2027 19% 9% 27% 12% EOY 2028 30% 17% 39% 19% EOY 2029 40% 25% 48% 26% EOY 2030 48% 32% 56% 32% EOY 2031 54% 37% 62% 37% EOY 2032 58% 42% 66% 42% EOY 2033 61% 47% 69% 46% EOY 2034 63% 51% 71% 50% EOY 2038 70% 61% 77% 58%

For comparison, Cotra's median for AI Research Parity (comparable to my AI R&D parity) is early 2030 (slightly before my median of early 2031), and her median for AI Production Parity (comparable to my AI stack + conflict parity, though mine also includes conflict) is mid 2032 (before my median of late 2034).

While I give precise numbers, my views aren't that reflectively stable (e.g. I updated a moderate amount over the last week towards longer timelines after thinking about it a bit!). [11]

Note: my median time from some milestone A to some later milestone B is significantly smaller than the difference between my medians for A and B. This is because for right-skewed distributions, the median of a sum is greater than the sum of the medians. Intuitively: each milestone has some chance of taking a very long time (heavy right tail), and these right tails compound when you add delays together, pulling the median of the total time further right than you'd get by just adding the individual medians. So median(B) - median(A) > median(B - A), i.e., the difference between medians overstates the median of the actual time between milestones. [12] For instance, I estimate my median time from AI R&D parity (conditioning on this happening before 2035) to TEDAI is maybe around 1.75 years, while the difference between my medians is around 3.5 years. [13]

I mostly updated in February 2026 and refined my thinking a bit more in March. ↩︎
I'm at 30% for AI R&D parity (you'd be better off firing all humans working in AI R&D than reverting to using 2020-era AI), but a bit lower for full automation (firing all humans would only slow things down by ~5%), perhaps 26%. ↩︎
As in, don't require coming up with ideas that aren't already on the internet. A key part of the task being discovering somewhat hard-to-find ideas that someone knows but aren't public also makes the task quite a bit harder for models. ↩︎
By "50%-reliability time horizon" I mean: if you randomly sample tasks from the relevant task distribution, this is the time horizon at which the AI has a 50% chance of success. Note that in practice this is mostly driven by variation between tasks (some tasks are harder for the AI than others) than by the model randomly failing on a given task. Thus, it's a bit unnatural to call this reliability. I use the term "reliability" because that's what METR uses and it reads nicely (e.g. "50%-reliability"), though "success rate" might be more accurate. ↩︎
My prior view was based on thinking that 2025 was especially fast due to some low-hanging fruit in RL and some of this progress was from increasing cost as a fraction of human cost. I think these factors still hold to a moderate extent, I just expect them to be less important than other factors. ↩︎
We shouldn't double count this: my view is just that AI progress speeds up as you get closer to full automation of AI R&D all else equal and this was already priced into my timelines. (In practice, I expect that all else isn't equal and I expect compute scaling to start slowing within 3 years or so due to production capacity limits or investment slowing as it reaches more extreme levels. I think we're already seeing some signs of hitting compute production issues with DRAM/HBM, though it's worth taking into account adaptation and there being some lag.) ↩︎
By fully CLI, I mean that the task doesn't require vision, computer use, or non-trivial hard-to-programmatically-automate interaction. ↩︎
Note that just updating towards my median for superexponentiality kicking in would have also shortened my timelines; the situation isn't symmetric. The basic reason for this is that my timelines are substantially longer due to a slower tail on many different factors. ↩︎
You can also do online RL, but this has some downsides. ↩︎
I worry their operationalization is a bit weaker than they intended. In addition to their remote work operationalization, I also intend the definition to include beating human experts at any reasonably important R&D domain when doing that work purely remotely. Like, you would certainly prefer hiring the AI over hiring the top human expert in any reasonably important R&D, putting aside physical manipulation. ↩︎
Given this instability, why have so much precision? I tentatively think my precision is actually indicative of very slightly better guesses; e.g. I expect I would do a little worse at forecasting if forced to round to the nearest 5% or 10% while I'm also pretty likely to adjust my guesses a bunch on further reflection and these can be true at the same time. Also, it's nice to have a smooth curve. ↩︎
Here, A and B are random variables that correspond to the year in which some event happens. ↩︎
Also, if A and B-A are correlated (as I think is true for the milestones I discuss here, shorter timelines are correlated with faster takeoff), then conditioning on A having been reached earlier also shrinks the expected remaining time to B. So, if we reach AI R&D parity in mid 2028, then I'd expect a smaller gap to TEDAI. ↩︎

Discuss

Defense-favoured coordination design sketches

6 апреля, 2026 - 18:19

Intro

We think that near-term AI could make it much easier for groups to coordinate, find positive-sum deals, navigate tricky disagreements, and hold each other to account.

Partly, this is because AI will be able to process huge amounts of data quickly, making complex multi-party negotiations and discussions much more tractable. And partly it’s because secure enough AI systems would allow people to share sensitive information with trusted intermediaries without fear of broader disclosure, making it possible to coordinate around information that’s currently too sensitive to bring to the table, and to greatly improve our capacity for monitoring and transparency.

We want to help people imagine what this could look like. In this piece, we sketch six potential near-term technologies, ordered roughly by how achievable we think they are with present tech:[1]

Fast facilitation — Groups quickly surface key points of consensus views and disagreement, and make decisions everyone can live with.
Automated negotiation — Complicated bargains are discovered quickly via automated negotiation on behalf of each party, mediated by trusted neutral systems which can find agreements.
Arbitrarily easy arbitration — Disputes are resolved cheaply and quickly by verifiably neutral AI adjudicators.
Background networking — People who should know each other get connected (perhaps even before they know to go looking), enabling mutually beneficial trade, coalition building, and more.
Structured transparency for democratic oversight — Citizens hold their institutions to account in a fine-grained way, without compromising sensitive information.
Confidential monitoring and verification — Deals can be monitored and verified, even when this requires sharing highly sensitive information, by using trusted AI intermediaries which can’t disclose the information to counterparties.

We also sketch two cross-cutting technologies that support coordination:

AI delegates and preference elicitation — AI delegates can faithfully represent and act for a human principal, perhaps supported by customisable off-the-shelf agentic platforms that integrate across many kinds of tech.
Charter tech — The technologies above, or other coordination technologies, are applied to making governance dynamics more transparent, making it easier to anticipate how governance decisions will influence future coordination, and design institutions with this in mind.

An important note is that coordination technologies are open to abuse. You can coordinate to bad ends as well as good, and particularly confidential coordination technologies could enable things like price-setting, crime rings, and even coup plots. Because the upsides to coordination are very high (including helping the rest of society to coordinate against these harms), we expect that on balance accelerating some versions of these technologies is beneficial. But this will be sensitive to exactly how coordination technologies are instantiated, and any projects in this direction need to take especial care to mitigate these risks.

We’ll start by talking about why these tools matter, then look at the details of what these technologies might involve before discussing some cross-cutting issues at the end.

Why coordination tech matters

Today, many positive-sum trades get left on the table, and a lot of resources are wasted in negative-sum conflicts. Better coordination capabilities could lead to very large benefits, including:

Improving economic productivity across the board
Helping nations avoid wars and other destructive conflicts
Enabling larger groups to coordinate to avoid exploitation by a small few
Making democratic governance much more transparent, while protecting sensitive information

What’s more, getting these benefits might be close to necessary for navigating the transition to more powerful AI systems safely. Absent coordination, competitive pressures are likely to incentivise developers to race forward as fast as possible, potentially greatly increasing the risks we collectively run. If we become much better at coordination, we think it is much more likely that the relevant actors will be able to choose to be cautious (assuming that is the collectively-rational response).

However, coordination tech could also have significant harmful effects, through enabling:

AI companies to collude with each other against the interests of the rest of society[2]
A small group of actors to plot a coup
More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don’t incentivise prosociality to the same extent

Regardless of how these harms and benefits net out for ‘coordination tech’ overall, we currently think that:

The shape and impact of coordination tech is an important part of how things will unfold in the near term, and it’s good for people to be paying more attention to this.
We’re going to need some kinds of coordination tech to safely navigate the AI transition.
The devil is in the details. There are ways of advancing coordination tech which are positive in expectation, and ways of doing so which are harmful.

Why ‘defense-favoured’ coordination tech

That’s why we’ve called this piece ‘defense-favoured coordination tech’, not just ‘coordination tech’. We think generic acceleration of coordination tech is somewhat fraught — our excitement is about thoughtfully run projects which are sensitive to the possible harms, and target carefully chosen parts of the design space.

We’re not yet confident which the best bits of the space are, and we haven’t seen convincing analysis on this from others either. Part of the reason we’re publishing these design sketches is to encourage and facilitate further thinking on this question.

For now, we expect that there are good versions of all of the technologies we sketch below — but we’ve flagged potential harms where we’re tracking them, and encourage readers to engage sceptically and with an eye to how things could go badly as well as how they could go well.

Fast facilitation

Right now, coordinating within groups is often complex, expensive, and difficult. Groups often drop the ball on important perspectives or considerations, move too slowly to actually make decisions, or fail to coordinate at all.

AI could make facilitation much faster and cheaper, by processing many individual views in parallel, tracking and surfacing all the relevant factors, providing secure private channels for people to share concerns, and/or providing a neutral arbiter with no stake in the final outcome. It could also make it much more practical to scale facilitation and bring additional people on board without slowing things down too much.

Design sketch

An AI mediation system briefly interviews groups of 3–300 people async, presents summary positions back to the group, and suggests next steps (including key issues to resolve). People approve or complain about the proposal, and the system iterates to appropriate depth for the importance of the decision.

Under the hood, it does something like:

Gathers written context on the setting and decision
Holds brief, private conversations with each participant to understand their perspective
Builds a map of the issue at hand, involving key considerations and points of (dis)agreement
- Performs and integrates background research where relevant
Identifies which people are most likely to have input that changes the picture
Distils down a shareable summary of the map, and seeks feedback from key parties
Proposes consensus statements or next steps for approval, iterating quickly to find versions that have as broad a backing as possible

Feasibility

Fast facilitation seems fairly feasible technically. The Habermas Machine (2024) does a version of this that provided value to participants — and we have seen two years of progress in LLMs since then. And there are already facilitation services like Chord. In general, LLMs are great at gathering and distilling lots of information, so this should be something they excel at. It’s not clear that current LLMs can already build accurate maps of arbitrary in-motion discourse, but they probably could with the right training and/or scaffolding.

Challenges for the technology include:

Ensuring that it’s more efficient and a better user experience for moving towards consensus than other, less AI-based approaches.
Remaining robust against abusive user behaviour (e.g. you don’t want individuals to get their way via prompt injection or blatantly lying).

Neither of these seem like fundamental blockers. For example, to protect against abuse, it may be enough to maintain transparency so that people can search for this. (Or if users need to enter confidential information, there might be services which can confirm the confidential information without revealing it.)

Possible starting points // concrete projects

Build a baby version. This could help us notice obstacles or opportunities that would have been hard to predict in advance. You could focus on the UI or the tech side here, or try to help run pilots at specific organisations or in specific settings.
Design ways to evaluate fast facilitation tools. This makes it easier to assess and improve on performance. For example, you could create games/test environments with clear “win” and “failure” modes.
Build subcomponents. For example:
- Bots that surface anonymous info.
- Tools that try to surface areas of consensus or common knowledge as efficiently as possible, while remaining hard to game.
Make a meeting prep system. Focus first on getting good at meeting prep — creating an agenda and considerations that need live discussion — to reduce possible unease about outsourcing decision-making to AI systems.
Make a bot to facilitate discussions. This could be used in online community fora, or to survey experts.
Design ways to create live “maps” of discussions. Fast facilitation is fast because it parallelises communication. This makes it more important to have good tools for maintaining shared context.

Automated negotiation

High-stakes negotiation today involves adversarial communication between humans who have limited bandwidth.

Negotiation in the future could look more like:

You communicate your desires openly with a negotiation delegate who is on your side, asking questions only when needed to build a deeper model about your preferences.
The delegate goes away, and comes back with a proposal that looks pretty good, along with a strategic analysis explaining the tradeoffs / difficulties in getting more.

Design sketch

Humans can engage AI delegates to represent them. The delegates communicate with each other via a neutral third party mediation system, returning to their principals with a proposal, or important interim updates and decision points.

Under the hood, this might look like:

Delegate systems:
- Read over context documents and query principals about key points of uncertainty to build initial models of preferences.
- Model the negotiation dynamics and choose strategic approaches to maximise value for their principal.
- Go back to the principal with further detailed queries when something comes up that crosses an importance threshold and where they are insufficiently confident about being able to model the principal’s views faithfully.
- Are ultimately trained to get good results by the principal’s lights.
Neutral mediator system:
- Is run by a trusted third-party (or in higher stakes situations, perhaps is cryptographically secure with transparent code).
- Discusses with all parties (either AI delegates, or their principals)
- - Can hear private information without leaking that information to the other party
  - - Impossibility theorems mean that it will sometimes be strategically optimal for parties to misrepresent their position to the mediator (unless we give up on the ability to make many actually-good deals); however, we can seek a setup such that it is rarely a good idea to strategically misrepresent information, or that it doesn’t help very much, or that it is hard to identify the circumstances in which it’s better to misrepresent
- Searches for deals that will be thought well of by all parties, and proposes those to the delegates.
- Is ultimately trained to help all parties reach fair and desired outcomes, while minimising incentives-to-misrepresent for the parties.

Feasibility

Some of the technical challenges to automated negotiation are quite hard:

The kind of security needed for high-stakes applications isn’t possible today.
Getting systems to be deeply aligned with a principal’s best interests, rather than e.g. pursuing the principal’s short-term gratification via sycophancy, is an unsolved problem.

That said, it’s already possible to experiment using current systems, and it may not be long before they start improving on the status quo for human negotiation. Low-stakes applications don’t require the same level of security, and will be a great training ground for how to set up higher stakes systems and platforms. And practical alignment seems good enough for many purposes today.

Possible starting points // concrete projects

Build an AI delegate for yourself or your friends. See if you can get it to usefully negotiate on your behalf with your friends or colleagues. Or failing that, if it can support you to think through your own negotiation position before you need to communicate with others about it.
Build a negotiation app with good UI. Building on existing LLMs, build an app which helps people think through their negotiation position in a structured way. Focus on great UI.
- This could be non-interactive at first, and just involve communication between a human and the app, rather than between any AI systems.
- But it builds the muscles of a) designing good UI for AI negotiation, and b) people actually using AI to help them with negotiation.
Run a pilot in an org or community you’re part of.
- You could start with fairly low-stakes negotiations, like what temperature to set the office thermostat to or which discussion topics to discuss in a given meeting slot.
- Experimenting with different styles of negotiation (in terms of how high the stakes are, how complex the structure is, and what the domain is) could be very valuable.

Arbitrarily easy arbitration

Right now, the risk of expensive arbitration makes many deals unreachable. If disputes could be resolved cheaply and quickly using verifiably fair and neutral automated adjudicators, this could unlock massive coordination potential, enabling a multitude of cooperative arrangements that were previously prohibitively costly to make.

Design sketch

An “Arb-as-a-Service” layer plugs into contracts, platforms, and marketplaces. Parties opt in to standard clauses that route disputes to neutral AI adjudicators with a well-deserved reputation for fairness. In the event of a dispute, the adjudicator communicates with parties across private, verifiable evidence channels, investigating further as necessary when there are disagreements about facts. Where possible, they auto-execute remedies (escrow releases, penalties, or structured commitments). Human appeal exists but is rarely needed; sampling audits keep the system honest. Over time, this becomes ambient infrastructure for coordination and governance, not just commerce.

How this could work under the hood:

Agreement ingestion
- Formal or natural language contracts are parsed and key terms extracted, with parties confirming the system’s interpretation before proceeding.
- The system could also suggest pre-dispute modifications to make agreements clearer, flag potentially unenforceable terms, and maintain public precedent databases that help parties understand likely outcomes before committing.
Automated discovery
- When disputes arise, an automated discovery process gathers relevant documentation, transaction logs, and communications from integrated platforms.
- The system offers interviews and the chance to submit further evidence to each party.
Deep consideration
- The system builds models of what different viewpoints (e.g. standard legal precedent; commonsense morality; each of the relevant parties) have to say on the situation and possible resolutions, to ensure that it is in touch with all major perspectives.
- Where there are disagreements, the system simulates debate between reasonable perspectives.
- It makes an overall judgement as to what is fairest.
Transparent reasoning
- The system produces detailed explanations of its conclusions, with precedent citations and counterfactual analysis where appropriate.
(Optional) Smart escrow integration
- Judgements automatically execute through cryptocurrency escrows or traditional payment rails, with graduated penalties for non-compliance.
- In cases where the system detects evidence that is highly likely to be fraudulent, or other attempts to manipulate the system, it automatically adds a small sanction to the judgement, in order to disincentivise this behaviour.
Opportunities for appeal
- Either party can pay a small fee to submit further evidence and have the situation re-considered in more depth by an automated system.
- For larger fees they can have human auditors involved; in the limit they can bring things to the courts.

Feasibility

LLMs can already do basic versions of 1-4, but there are difficult open technical problems in this space:

Judgement: Systems may not currently have good enough judgement to do 1, 3, 4 in high-stakes contexts (and until recently, they clearly didn’t).
Real-world evidence assessment: Systems don’t currently know how to handle conflicting evidence provided digitally about what happened in the real world.
Verifiable fairness/neutrality: The full version of this technology would require a level of fairness and neutrality which isn’t attainable today.

Those are large technical challenges, but we think it’s still useful to get started on this technology today, because iterating on less advanced versions of arbitration tech could help us to bootstrap our way to solutions. Particularly promising ways of doing that include:

Starting in lower-stakes or easier contexts (for example, digital-only spaces avoid the challenge of establishing provenance for real-world evidence).
Creating evals, test environments and other infrastructure that helps us improve performance.

On the adoption side, we think there are two major challenges:

Trust: As above, some amount of technical work is needed to make systems verifiably fair/neutral. But even if it becomes true that the systems are neutral, people need to build quite a high level of confidence that the system is genuinely impartial before they’ll bind themselves to its decisions for meaningful stakes.
Legal integration: This tech is only useful to the extent that its arbitration decisions are recognised and enforced as legitimate by the traditional legal system, or are enshrined directly via contract in a self-enforcing way.
- (We are unsure how large a challenge this will be; perhaps you can write contracts today that are taken by the courts as robust. But it may be hard for parties to have large trust in them before they have been tested.)

Both of these challenges are reasons to start early (as there might be a long lead time), and to make work on arbitration tech transparent (to help build trust).

Possible starting points // concrete projects

Work with an arbitration firm. Work with (or buy) a firm already offering arbitration services to start automating parts of their central work, and scale up from there.
Work with an online platform that handles arbitration. Use AI to improve their processes, and scale from there.
Create a bot to settle informal disputes. Build an arbitration-as-a-service bot that people can use to settle informal disputes.
Trial a system on internal disputes. This could be at your own organisation, another organisation, or a coalition of early adopter organisations.
Run a pilot in parallel to regular arbitration. Run a pilot where an automated arbitration system is given access to all the relevant information to resolve disputes, and reaches its own conclusions — in parallel to the regular arbitration process, which forms the basis of the actual decision. You could partner with an arbitration firm, or potentially do this through a coalition of early adopter organisations, perhaps in combination with philanthropic funding.

Background networking

We can only do things like collaborate, trade, or reconcile if we’re able to first find and recognise each other as potential counterparties. Today, people are brought into contact with each other through things like advertising, networking, even blogging. But these mechanisms are slow and noisy, so many people remain isolated or disaffected, and potentially huge wins from coordination are left undiscovered.[3]

Tech could bring much more effective matchmaking within reach. Personalised, context-sensitive AI assistance could carry out orders of magnitude more speculative matchmaking and networking. If this goes well, it might uncover many more opportunities for people to share and act on their common hopes and concerns.

Design sketch

A ‘matchmaking marketplace’ of attentive, personalised helpers bustles in the background. When they find especially promising potential connections, they send notifications to the principals or even plug into further tools that automatically take the first steps towards seriously exploring the connection.

You can sign up as an individual or an existing collective. If you just want to use it passively, you give a delegate system access to your social media posts, search profiles, chatbot history, etc. — so this can be securely distilled into an up-to-date representation of hopes, intent, and capabilities. The more proactive option is to inject deliberate ‘wishes’ through chat and other fluent interfaces.

Under the hood, there are a few different components working together:

Interoperable, secure ‘wish profiling’ systems which identify what different participants want.
- People connect their profiles on existing services (social media, chatbot logs, email, etc).
- LLM-driven synthesis (perhaps combined with other forms of machine learning) curates a private profile of user desires.
- Optionally, chatbot-style assistance can interview users on the points of biggest uncertainty, to build a more accurate profile.
A searchable ‘wish registry’ which organises large collections of wants and offers, while maintaining semi-privacy.
- Each user’s interests can run searches, finding potential matches and surfacing only enough information about them to know whether they are worth exploring further.

Feasibility

A big challenge here is privacy and surveillance. Doing background networking comprehensively requires sensitive data on what individuals really want. This creates a double-edged problem:

If sensitive data is too broadly available, it can be used for surveillance, harassment, or exploitation; including by big corporations or states.
If sensitive data is completely private, it opens up the possibility of collusion, for example among criminals.

This is a pretty challenging trade-off, with big costs on both sides. Perhaps some kind of filtering system which determines who can see which bits of data could be used to prevent data extraction for surveillance purposes while maintaining enough transparency to prevent collusion.

Ultimately, we’re not sure how best to approach this problem. But we think that it’s important that people think more about this, as we expect that by default, this sort of technology will be built anyway in a way that isn’t sufficiently sensitive to these privacy and surveillance issues. Early work which foregrounds solutions to these issues could make a big difference.

Other potential issues seem easier to resolve:

Technically, background networking tools already seem within reach using current systems. Large-scale deployments would require indexing and registry, but it seems possible to get started on these using current systems.
- One note is that it seems possible to implement background networking in either a centralised or a decentralised way. It’s not clear which is best, though decentralised implementations will be more portable.
Adoption also seems likely to work, because there are incentives for people to pay to discover trade and cooperation opportunities they would otherwise have missed, analogous to exchange or brokerage fees. Though there are some trickier parts, we expect them to ultimately be surmountable (though timing may be more up for grabs than absolute questions of adoption):
- In the early stages when not many people are using it, the value of background networking will be more limited. Possible responses include targeting smaller niches initially, and proactively seeking out additional network beneficiaries.
- It’s harder to incentivise people to pay for speculative things like uncovering groups they’d love that don’t yet exist. You could get around this using entrepreneurial or philanthropic speculation (compare the dominant assurance contract model and related payment incentivisation schemes).

Possible starting points // concrete projects

Work with existing matchmakers to improve their offering. Find groups that are already doing matchmaking and are eager for better systems — perhaps among community organisers, businesses, recruiters or investors. Work with them to understand the pain points in their current networking, and what automated offerings would be most appealing. Then build those tools and systems.
Build a networking tool for a specific community. Build a custom networking system for a particular group or subculture. For example, this could look like a networking app or a plug-in to an existing online forum. This could start delivering value fairly quickly, and provide a good opportunity for iteration.

Structured transparency for democratic oversight

Today, citizens in democracies have limited mechanisms to verify whether institutions’ public claims are consistent with their internal evidence:

The baseline is highly opaque.
Freedom of information systems help, but can be evaded by non-cooperating institutions.
Public inquiries can be reasonably thorough, but are expensive and slow.
Full transparency has many costs and is typically highly resisted.

This is costly — e.g. the UK Post Office scandal over its Horizon IT system led to hundreds of wrongful prosecutions that could have been avoided. And it creates bad incentives for those running the institutions.

AI has the potential to change this. Instead of oversight being expensive, reactive, and slow, automated systems could in theory have real-time but sandboxed access to institutional data, routinely reviewing operational records against public claims and surfacing inconsistencies as they emerge.

Where confidential monitoring helps willing parties verify each other, structured transparency for democratic oversight aims to hold institutions accountable to the broader public.[4]

Design sketch

When an oversight body wants to verify facts about the behaviour of another institution, it requests comprehensive data about the internal operations of that institution. AI systems are tasked with careful analysis of the details, flagging the type and severity of any potential irregularities. Most of the data never needs human review.

In the simpler version, this is just a tool which expands the capacity of existing oversight bodies. Even here, the capacity expansion could be relatively dramatic — this kind of semi-structured data analysis is the kind of work that AI models can excel at today — without needing to trust that the systems are infallible (since the most important irregularities will still have human review).

A more ambitious version treats this as a novel architecture for oversight. AI systems operate continuously within secure environments that don’t give any humans access to the full dataset. They can flag inconsistencies as institutional data is deposited rather than waiting for an investigation to begin. For maximal transparency, summaries could be made available to the public in real-time, without revealing any confidential information that the public does not have rights to.

Under the hood, this might involve:

Secure data repositories, such that institutions routinely share operational data with a sandboxed environment operated by or on behalf of the oversight body, without any regular human access to the data.
Continuous ingestion and indexing of institutional public outputs (press releases, regulatory filings, budget documents, etc.) into a searchable database.
Automated cross-referencing between public claims and internal records.
Highlighting of potential issues (mismatches between public statements and private information, as well as decisions made in violation of normal procedures).
Further automated investigation of potential issues, leading to flags to humans in cases with sufficiently large issues flagged with sufficient confidence.
Importantly, the sandbox outputs its findings but not the underlying data; if there is need for transparency on that, this is a separate oversight question.

Feasibility

There are two important aspects to feasibility here: technical and political.

Technically, decent reliability at the core functionality is possible today. Getting up to extremely high reliability so that it could be trusted not to flag too many false positives across very large amounts of data might be a reach with present systems; but is exactly the kind of capability that commercial companies should be incentivised to solve for business use.

Political feasibility may vary a lot with the degree of ambition. The simplest versions of this technology might in many cases simply be adopted by existing oversight bodies to speed up their current work. Anything which requires them getting much more data (e.g. to put in the sandboxed environments) might require legislative change — which may be more achievable after the underlying technology can be shown to be highly reliable.

Challenges include:

Adversarial dynamics: the technical bar to verify claims against actively adversarial institutions (who are manipulating deposited data, potentially via AI) is substantially higher.
- This is the bar that we’d need to reach for confidential monitoring below.
Defamation risk: the downsides of false positives, where your system reports someone misrepresenting things when they were not, could be significant (although can perhaps be mediated by giving people a right-of-rebuttal where they give further data to the AI systems which monitor the confidential data streams).
Avoiding abuse: designing the systems so that they do not expose the confidential data, and cannot be weaponised to ruin the reputation of a department with very normal levels of error.

Ultimately the more transformative potential from this technology comes in the medium-term, with new continuous data access for oversight bodies. But this is likely to require legislative change, and the institutions subject to it may resist. Perhaps the most promising adoption pathway is to demonstrate value through voluntary pilots with oversight bodies that already have data access and want better tools. This could build the evidence base (and hence political constituency) for wider and deeper deployment.

Possible starting points // concrete projects

Retrospective validation on historical cases. Apply consistency-checking tools to document sets from well-understood historical cases where the relevant internal documents have subsequently been released (e.g. Enron emails). This builds the technical foundation, and demonstrates the concept without requiring any current institutional access.
Institutional public statement reliability tracker. Build a tool tracking whether agencies’ public claims about performance, spending, or policy outcomes are consistent with publicly available data — statistical releases, budget documents, prior statements. Start with a single policy domain. This requires no institutional partnerships and builds a public constituency for structured transparency. This is a version of reliability tracking, applied specifically to institutional accountability.
Pilot a FOIA exemption assessment tool. Partner with an Inspector General office to build a tool that reviews withheld documents and assesses whether claimed exemptions (national security, personal privacy, deliberative process) are applied appropriately. The IG already has legal access under the Inspector General Act; the tool helps them do their existing job faster and builds the working relationship needed for more ambitious deployments. This is also a natural testbed for the sandboxed architecture in miniature — the tool operates within the IG’s secure environment, producing exemption-appropriateness findings without the documents themselves leaving the system.

Confidential monitoring and verification

Monitoring and verifying that a counterparty is keeping up their side of the deal is currently expensive and noisy. Many deals currently aren’t reachable because they’re too hard to monitor. Confidential AI-enabled monitoring and verification could unlock many more agreements, especially in high-stakes contexts like international coordination where monitoring is currently a bottleneck.

Design sketch

When organisation A wants to make credible attestations about their work to organisation B, without disclosing all of their confidential information, they can mutually contract an AI auditor, specifying questions for it to answer. The auditor will review all of A’s data (making requests to see things that seem important and potentially missing), and then produce a report detailing:

Its conclusions about the specified questions.
The degree to which it is satisfied that it had good data access, that it didn’t run into attempts to distort its conclusions, etc.

This report is shared with A and B, then A’s data is deleted from the auditor’s servers.

Under the hood, this might involve:

Building a Bayesian knowledge graph, establishing hypotheses, and understanding what evidence suggests about those hypotheses.
Agentic investigatory probes into the confidential data, in order to form grounded assessments on the specified questions.

More ambitious versions might hope to obviate the need for trust in a third party, and provide reasons to trust the hardware — that it really is running the appropriate unbiased algorithms, that it cannot send side-channel information or retain the data, etc. Perhaps at some point you could have robot inspectors physically visiting A’s offices, interviewing employees, etc.

Feasibility

Compared to some of the other technologies we discuss, this feels technologically difficult — in that what’s required for the really useful versions of the tech may need very high reliability of certain types.

Nonetheless, we could hope to lay the groundwork for the general technological category now, so that people are well-positioned to move towards implementing the mature technology as early as is viable. Some low-confidence guesses about possible early applications include:

Legal audits — for example, claims that the documents not disclosed during a discovery process are only those which are protected by privilege.
Financial audits — e.g. for the purpose of proving viability to investors without disclosing detailed accounts.
Supply chain verification — e.g. demonstrating that products were ethically sourced without exposing the suppliers.

Possible starting points // concrete projects

Start building prototypes. Build a system which can try to detect whether it’s a real or counterfeited environment, and measure its success.
Work with a law or financial auditing firm. Work with (or buy) a firm that does this kind of work, and experiment with how to robustly automate while retaining very high levels of trustworthiness.
Explore the viability of complementary technology. For example, you could investigate the feasibility of demonstrating exactly what code is running on a particular physical computer that is in the room with both parties.

Cross-cutting thoughtsSome cross-cutting technologies

We’ve pulled out some specific technologies, but there’s a whole infrastructure that could eventually be needed to support coordination (including but not limited to the specific technologies we’ve sketched above). Some cross-cutting projects which seem worth highlighting are:

AI delegates and preference elicitation

Many of the technologies we sketched above either benefit from or require agentic AI delegates who can represent and act for a human principal. Developing customisable platforms could be useful for multiple kinds of tech, like background networking, fast facilitation, and automated negotiation.

Some ways to get started:

Direct preference elicitation: develop efficient and appealing interview-style elicitation of values, wishes, preferences and asks.
Passive data ingestion: build a tool that (consensually) ingests and distils all the available online content about a person — social media, browsing history, email, etc — and extracts principles from it (cf inverse constitutional AI).

One clarification is that though agentic AI delegates would be useful for some of the coordination tech above, it needn’t be the same delegate doing the whole lot for a single human:

You could have different delegates for different applications.
Some delegates might represent groups or coalitions.
Some delegates could be short-lived, and spun up for some particular time-bounded purpose.

Charter tech

A lot of coordination effort between people and organisations goes not into making better object-level decisions, but establishing the rules or norms for future coordination — e.g. votes on changing the rules of an institution. It is possible that coordination tech will change this basic pattern, but as a baseline we assume that it will not. In that case, making such meta-level coordination go well would also be valuable.

One way to help it go well is by making the governance dynamics more transparent. Voting procedures, organisational charters, platform policies, treaty provisions, etc. create incentives and equilibria that play out over time, often in ways the framers didn’t anticipate. Let’s call any technology which helps people to better understand governance dynamics, or to make those dynamics more transparent, ‘charter tech’. In some sense this is a form of epistemic tech; but as the applications are always about coordination, we have chosen to group it with other coordination technologies. We think charter tech could be important in two ways:

Through directly improving the governance dynamics in question, helping to avoid capture, conflict, and lock-in.
Through compounding effects on future coordination, which will unfold in the context of whatever governance structures are in place.

Charter tech could be used in a way that is complementary to any of the above technologies (if/when they are used for governance-setting purposes), although can also stand alone.

For the sake of concreteness, here is a sketch of what charter tech could look like:

A “governance dynamics analyser” that ingests descriptions of constitutions, charters, policies or community norms, builds models of power, incentives, and information flow, and then (a) forecasts likely equilibria and failure modes, (b) red-teams for strategic abuse,[5] and (c) proposes safer rule variants that preserve the framers’ intent.[6]
While this tool can be called actively if needed, there is also a classifier running quietly in the background of organisational docs/emails, and when it detects a situation where power dynamics and governance rules are relevant, it runs an assessment — promoting this to user attention just in cases where the proposed rules are likely to be problematic.

Note that charter tech could be used to cause harm if access isn’t widely distributed. Vulnerabilities can be exploited as well as patched, and a tool that makes it easier to identify governance vulnerabilities could be used to facilitate corporate capture, backsliding or coups. Provided the technology is widely distributed and transparent, we think that charter tech could still be very beneficial — particularly as there may be many high-stakes governance decisions to make in a short period during an intelligence explosion, and the alternative of ‘do our best without automated help’ seems pretty non-robust.

Some ways to get started on using AI to make governance dynamics more transparent:

Work with communities that iterate frequently on governance (DAOs, open-source projects) to test analyses against what actually happens when rules change.
Compile a pattern library of governance failures and successes, documented in enough detail to inform automated analysis.
Build simulation environments where proposed rules can be stress-tested against populations of agents with varying goals, including adversarial ones.
Partner with mechanism design researchers to identify which aspects of their formal analysis can be automated and applied to less formal real-world documents.

Adoption pathways

Many of these technologies will be directly incentivised economically. There are clear commercial incentives to adopt faster, cheaper methods of facilitation, negotiation, arbitration, and networking.

However, adoption seems more challenging in two important cases:

Adoption by governments and broader society. Many of the most important benefits of coordination tech for society will come from government and broad social adoption, but these groups will be less impacted by commercial incentives. This bites particularly hard for technologies that could be quite expensive in terms of inference compute, like fast facilitation, arbitration and negotiation. By default, these technologies might differentially help wealthy actors, leaving complex societal-level coordination behind. We think that the big levers on this set of challenges are:
- Building trust and legitimacy earlier, by getting started sooner, building transparently, and investing in evals and other infrastructure to demonstrate performance.
- Targeting important niches that might be slower to adopt by default. More research would be good here, but two niches that seem potentially important are:
- - Coordination among and between very large groups, like whole societies. This might be both strategically important and lag behind by default.
  - International diplomacy. Probably coordination tech will get adopted more slowly in diplomacy than in business, but there might be very high stakes applications there.
Adoption of confidential monitoring and structured transparency. These technologies are less accessible with current models and may require large upfront investments, while many of the benefits are broadly distributed.
- This makes it less likely that commercial incentives alone will be enough, and makes philanthropic and government funding more desirable.

Other challenges

The big challenge is that coordination tech (especially confidential coordination tech) is dual use, and could empower bad actors as much or more than good ones.

There are a few ways that coordination tech could lead to shifts in the balance of power (positive or negative):

Some actors could get earlier and/or better access to coordination tech than others.[7]
Actors that face particular barriers to coordination today could be asymmetrically unblocked by coordination tech.
Individuals and small groups could become more powerful relative to the coordination mechanisms we already have, like organisations, ideologies, and nation states.

It’s inherently pretty tricky to determine whether these power shifts would be good or bad overall, because that depends on:

Value judgements about which actors should hold power.
How contingent power dynamics play out.
Big questions like whether ideologies or states are better or worse than the alternatives.
Predictions about how social dynamics will equilibrate in an AI era that looks very different to our world.

However, as we said above, it’s clear that coordination tech might have significant harmful effects, through enabling:

Large corporations to collude with each other against the interests of the rest of society.[8]
A small group of actors to plot a coup.
More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don’t incentivise prosociality to the same extent.

We don’t think that this challenge is insurmountable, though it is serious, for a few reasons:

The upsides are very large. Coordination tech might be close to necessary for safely navigating challenges like the development of AGI, and could empower actors to coordinate against the kinds of misuse listed above.
The counterfactual is that coordination tech is developed anyway, but with less consideration of the risks and less broad deployment. We think that this set of technologies is going to be sufficiently useful that it’s close to inevitable that they get developed at some point. By engaging early with this space, we can have a bigger impact on a) which versions of the technology are developed, b) how seriously the downsides are taken by default, c) how soon these systems are deployed broadly.
Some applications seem robustly good. For example, the potential for misuse is low for technologies like transparent facilitation or widely deployed charter tech. More generally, we expect that projects that are thoughtfully and sensitively run will be able to choose directions which are robustly beneficial.

That said, we think this is an open question, and would be very keen to see more analysis of the possible harms and benefits of different kinds of coordination tech, and which versions (if any) are robustly good.

This article has gone through several rounds of development, and we experimented with getting AI assistance at various points in the preparation of this piece. We would like to thank Anthony Aguirre, Alex Bleakley, Max Dalton, Max Daniel, Raymond Douglas, Owain Evans, Kathleen Finlinson, Lukas Finnveden, Ben Goldhaber, Ozzie Gooen, Hilary Greaves, Oliver Habryka, Isabel Juniewicz, Will MacAskill, Julian Michael, Justis Mills, Fin Moorhouse, Andreas Stuhmüller, Stefan Torges, Deger Turan, Jonas Vollmer, and Linchuan Zhang for their input; and to apologise to anyone we’ve forgotten.

This article was created by Forethought. Read the original on our website.

^
We’re highlighting six particular technologies, and clustering them all as ‘coordination technologies’. Of course in reality some of the technologies (and clusters) blur into each other, and they’re just examples in a high-dimensional possibility space, which might include even better options. But we hope by being concrete we can help more people to start seriously thinking about the possibilities.
^
For example, in a similar way to that described in the intelligence curse.
^
Meanwhile small cliques with clear interests often have an easier time identifying and therefore acting on their shared interests — in extreme cases resulting in harmful cartels, oligarchies, and so on. That’s also why tyrants throughout history have sought to limit people’s networking power.
^
Both confidential monitoring and what we are calling structured transparency for democratic oversight are aspects of structured transparency in the way that Drexler uses the term.
^
This red-teaming could be arbitrarily elaborate, from simple LM-based once-over screening to RAG-augmented lengthy analysis to expansive simulation-based probing and stress-testing.
^
Under the hood, this might involve:
1. Parsing & modelling the rules
2. - Convert informal descriptions or formal rules into a typed governance graph: roles, permissions, decision thresholds, delegation, auditability, and recourse
  - Note uncertainties; seek clarification or highlight ambiguities
3. A search for possible issues
4. - Pattern library of classic failure modes (agenda control, principal–agent issues, collusion, etc.)
  - - Assessment of potential vulnerability to the different failure modes
5. First-principles analysis
6. - Running direct searches for abuse, or multi-agent simulations (including some nefarious actors) to stress-test the proposed system
7. Explainer
8. - Distilling down the output of the analysis into a few key points
  - - Providing auditable evidence where relevant
  - Including points about how variations of the mechanism might make things better or worse
^
Note that this is significantly a question about adoption pathways as discussed in the previous section, rather than an independent question.
^
For example, in a similar way to that described in the intelligence curse.

Discuss

[OpenAI] Industrial policy for the Intelligence Age

6 апреля, 2026 - 17:18

As we move toward superintelligence, incremental policy updates won’t be enough. To kick-start this much needed conversation, OpenAI is offering a slate of people-first policy ideas⁠(opens in a new window) designed to expand opportunity, share prosperity, and build resilient institutions—ensuring that advanced AI benefits everyone.

These ideas are ambitious, but intentionally early and exploratory. We offer them not as a comprehensive or final set of recommendations, but as a starting point for discussion that we invite others to build on, refine, challenge, or choose among through the democratic process. To help sustain momentum, OpenAI is:

welcoming and organizing feedback through newindustrialpolicy@openai.com⁠
establishing a pilot program of fellowships and focused research grants of up to $100,000 and up to $1 million in API credits for work that builds on these and related policy ideas
convening discussions at our new OpenAI Workshop opening in May in Washington, DC.

Read the full ideas document here⁠⁠(opens in a new window).

Discuss

A Black-Box Procedure for LLM Confidence in Critical Applications

6 апреля, 2026 - 16:47

Introduction

As an engineering leader integrating AI into my workflow I’ve become increasingly focused on how to use LLMs in critical applications. Today’s frontier models are generally very accurate, but they are also inconsistently overconfident. A model that is 90% confident in an answer that is 30% wrong can be catastrophic. In applications such as aerospace engineering, we need very high accuracy but more importantly we need confidence calibration. A model’s self-confidence must match its accuracy. Just like a good engineer, it must know when it’s likely wrong.

At the end of 2025 I wrote a post titled A Risk-Informed Framework for AI Use in Critical Applications with some ideas on how to better understand this calibration or model anchoring. This post is a follow up investigating these ideas and developing a black box procedure for improving our understanding of LLM accuracy. Using 320 queries spanning 8 topics across a wide range of internet coverage I performed 4 independent question/answer runs on 3 different LLM models, and a surprisingly simple procedure emerged:

First, check available training density for a given topic via Google search result count.
Next, repeat the question across independent sessions to quantify answer stability.
Finally, ask related questions with web search off to identify topics outside of training.

The resulting stability-accuracy relationship for the small dataset in this investigation predicts accuracy within 2% (and averages less than 0.5% across all 4 runs). Note this is exploratory work only and should be treated as hypothesis-generating, not hypothesis-confirming, but the practical implications for anyone using LLMs in critical applications are worth considering.

Background

Since writing my original post, I’ve received some excellent feedback from friends, colleagues and yes, an AI research assistant. The first piece of feedback I received is that Frontier Labs would be very unlikely to share detailed information on their training data. Indeed, it seems this information is increasingly held close. The 2025 Stanford Foundation Model Transparency Index found transparency is declining, and information on training data is becoming increasingly opaque across the industry.

I was also made aware of several existing studies that generally support the core assertions in my original post. LLM accuracy does depend on training density and topic proximity and can be estimated by observing answer consistency. Kirchenbauer et al. “LMD3: Language Model Data Density Dependence”, arXiv:2405.06331, 2024 shows us that training data density estimates reliably predict variance in accuracy. Kandpal et al. “Large Language Models Struggle to Learn Long-Tail Knowledge” arXiv:2211.08411, 2023 demonstrates that accuracy degrades as distance from well-represented training regions increases. Xiao et al. “The Consistency Hypothesis in Uncertainty Quantification for Large Language Models”, arXiv:2506.21849, 2025 says that answer consistency predicts accuracy in LLMs, formalized as the 'consistency hypothesis'. Further Ahdritz et al. “Distinguishing the Knowable from the Unknowable with Language Models”, arXiv:2402.03563, 2024 found LLMs have internal indicators of their knowable and unknowable uncertainty – and can even tell the difference. I recognize the business practicalities but given these intrinsic properties I would nonetheless encourage the Frontier Labs to consider methods to provide confidence indicators in a way that does not expose their trade secrets.

Until then (or in case it never happens), what can we the users of these LLMs do to better characterize the confidence we should have in their responses? This investigation suggests there is much we can do.

Investigation

I started by thinking about the first metric from my original post; model training data density. What can an LLM user observe directly that may give us a hint about model training density? It occurred to me that search engine results count on a particular topic may give at least a relative sense of the data on the internet available for training on a particular topic. I figured as a starting point this may be especially relevant for Google web search results count and Google’s Gemini LLM. I then selected eight similar topics across a broad range of internet popularity: See Table 1 for eight different sports leagues from around the world with a range of internet representation. Google results counts were determined by searching for the league name followed by the year 2023 (well within the training window for current LLMs). This search was done in Google incognito mode to remove influence from my past searches.

Table 1: Worldwide sports leagues across a wide range of Google search results count

Next, I came up with a series of prompts for use on these leagues designed to represent the type of question you may want to use an LLM to answer:

“What was the total playing time in hours for the <<insert sports league>> in the season ending in 2023? Include post season playoffs, but don’t include any overtime.”

This question is designed to require some web search and reasoning and for which there is no readily available website listing the final answer. Specifically, this question requires general knowledge of the sport (nominal play time), specific knowledge of the league (number of teams and games played), and finally temporal knowledge of the specific year (playoff outcome). It also includes two reasoning subtleties:

Total playing time is different from total game time which includes intermission and commercials etc.
Many leagues span two years, and the question specifically asks for the season that ends in 2023.

The final condition regarding overtime was added as a practical consideration as I needed to be able to manually calculate the source of truth for each question with high confidence and specific game times are not readily available for each league. I was careful to ensure every input to the source of truth in this investigation was identified and derived manually (I originally started with ten leagues but could not manually verify the answers for two and omitted them). I repeated each query five times for each of the eight leagues taking care to ask each question in its own context window, with web search enabled, but memory off. Disabling memory was essential, as I originally left it on and responses across sessions became artificially consistent. This configuration is intended to simulate how a user would use an LLM to answer this question (web search on), without the influence of this research or my other past searches (fresh context windows and memory off). The question intentionally asks for a numeric value to allow for the evaluation of the degree of accuracy in any response.

In this investigation, accuracy is defined as one minus the absolute value of the difference between the LLMs answer and the true answer divided by the true answer. This gives 100% for a correct answer and 0% for a 100% wrong answer (and negative for answers more than 100% wrong).

Note all the models investigated (Gemini 3 Flash, Opus 4.6 and ChatGPT-5-mini) returned generally very high accuracy, for example 95% of answers were over 90% accurate. However, this number drops sharply the higher the accuracy threshold. Only 83% of answers were over 98% accurate. If this level of accuracy is enough for your purposes this investigation may not be of much use to you. My focus here is to understand confidence for extremely critical applications where the answers must consistently have very high accuracy.

Model Self-Confidence

Before diving into any complicated metrics, I thought that as a starting point it would be ideal if the LLM simply reported accurate self-confidence on each question. Therefore, I added the following to the end of the question above in each prompt:

“What is your confidence in this answer 0% to 100%?”

Plotting the average accuracy vs average self-confidence for Gemini 3 Flash over five repeated replies to the question above for each of the eight leagues provides an almost useful answer. The result is a somewhat linear trend except for one low accuracy outlier in the least well represented league (Finnish Women’s Basketball League). To validate this outlier, I repeated the entire 40 question test (five identical questions over eight leagues) with the same outlier in the same league as shown in Figure 1.

Figure 1: Average model self-confidence in eight categories vs average model accuracy of Gemini 3 Flash run twice shows the same significant outliers

This result is essentially the reason for this post (and its predecessor). A model is said to be well calibrated when its self-confidence matches its accuracy. How can we trust models with critical decisions when they are not well calibrated? Even worse than a mis-calibrated model is one that is inconsistently calibrated. Had I only repeated my question four times instead of five I may have missed the outliers and I would have been overconfident in this model for this league. Reviewing the LLM responses for these outliers, these are clearly hallucinations related to different accounting in the number of games played per season and were provided with very high self-confidence as shown in Table 2 below. The hallucination in Run #2 showed the lowest confidence at 90%, which is still very high for an answer that is almost 30% wrong.

Table 2: Prompt question and answers for lowest represented league shows similar and same self-confidence across wide range of accuracy with outliers in red

Model Training Data Density

Since model self-confidence is not reliable, the next easiest thing would be to evaluate model trustworthiness based simply on data available for training. Inspired by the first metric proposed in my original post I plotted Google search results counts for each league in 2023 (as a proxy for available training data density) vs the average accuracy of Gemini 3 Flash over the five repeated queries using the question above for eight different leagues. In this data set Gemini 3 Flash is highly accurate until you get to a topic with Google search results count under ~50M, then accuracy drops off quickly. The three least represented leagues also had the lowest accuracy as shown in Figure 2. This is consistent with the LMD3 finding (Kirchenbauer et al. 2024) that training data density predicts per-sample accuracy. This is helpful as a first order approximation of whether there is sufficient data available to train on, however this drop-off is likely relative and may vary by topic or model.

Figure 2: Google search results count as a proxy for available training data vs accuracy of Gemini 3 Flash shows accuracy drops sharply below ~50M results for these topics

Model Stability

Next, I looked at an approximation of the third metric from my original post; answer stability over small variations in the prompt. The simplest version of this investigation is to measure LLM answer variation in response to the exact same question repeated several times. In this investigation, stability is defined as one minus the standard deviation divided by the mean. Note with only five samples the standard deviation is highly sensitive to outliers (which makes any correlations here noteworthy despite the small sample).

I plotted stability against average model accuracy over the five repeated identical questions for the eight leagues ensuring to ask each question in its own context window, with memory off. This resulted in a strong linear correlation between stability and average accuracy for both 40 question Gemini 3 Flash runs as shown below in Figure 3 (R^2 for both these runs combined is 0.99).

Figure 3: Stability across five repeated questions in eight categories vs average accuracy of Gemini 3 Flash run twice shows strong linear correlation

This is expected per the 'consistency hypothesis' (Xiao et al. 2025) but it was nonetheless striking to see this phenomenon so clearly in this small dataset. In this comparison the points that were outliers in the previous self-confidence vs accuracy plot are no longer outliers since their low accuracy is proportional to increased variation in the responses. This shows that for this model, on this topic, the degree to which you should trust the output may be directly related to the variation in repeat answers.

Next, I decided to add two additional LLM models to this dataset to see if this result was unique to the Gemini 3 Flash model. Opus 4.6 and ChatGPT-5-mini were added using the same 40 question methodology. Opus 4.6 shows good congruence with the Gemini 3 Flash runs, but ChatGPT-5-mini is mostly congruent except for two low accuracy outliers as shown in Figure 4 (R^2 for all four runs combined is 0.94).

Figure 4: Stability in eight categories vs average model accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once shows significant outliers

One of these ChatGPT-5-mini outliers is in the least represented league (Finnish Women's Basketball) which we would expect to have low stability, except the stability drop is not proportional to the accuracy drop, showing much higher stability than would be expected per the trend given the low average accuracy. Inspection of the prompt replies reveals the model didn’t know or look up the actual number of games in the season and guessed consistently high resulting relatively high stability but low accuracy. A model that is consistently wrong produces high stability but low accuracy which is a dangerous and misleading failure mode.

The second ChatGPT-5-mini outlier is notably from the best represented league (National Football League) and yet shows much lower accuracy and stability than all other data. Inspection of the prompt replies shows one of the five answers returned total game time, not play time. This was the only reply to make this mistake out of 20 questions on this league across four model runs. The mistake in the prompt reply is clear, and I considered correcting it with the justification that the purpose of this investigation is training data not reasoning. However, given this was not a common issue (which may have implicated my question phrasing) and ultimately this type of error may not always be so obviously correctable from the user standpoint, I left the answer uncorrected for further analysis.

Model Training Data Geometry

Finally, I wanted to know if the ChatGPT-5-mini outliers (skewed stability vs accuracy, and the reasoning issue) could be explainable by some gaps in the underlying training data. To investigate this, I returned to the second metric from my original post; does training data coverage and proximity to the specific question influence results? Current studies say yes (Kandpal et al. 2023), but how can we investigate this as a user? To answer this question, I devised a set of simpler related secondary questions to be posed to an LLM with search turned off to test the underlying training data.

“Do not search the internet for this answer (use only your training). What was the final game winning score for the <<insert sports league>> in the season ending in <<202X>>”

This question is designed to have an answer that a user can easily search the web to check. Only the winning team’s score was used, tracking a single numeric value to easily evaluate the degree of accuracy. It’s also designed to cover years before, during, and after the original question to map temporal coverage of the topic, including intentionally 2025 which is partially beyond these models’ training window. Models that refuse to answer this question may lack relevant training data (the topic may not be in “range” as discussed in my original post), and wrong answers may indicate some nearby but incomplete training data (the idea of training data “proximity” per my original post). The hypothesis here is that either or both results could indicate poor model anchoring on this topic and may predict worse accuracy on the original question. If so, importantly this could be tested by the user.

I asked this additional 40 question set (one question for each of five years over eight leagues) for each of the four models runs, again taking care to ask each question in its own context window, with memory off, but this time with web search disabled. Accuracy is defined the same as for the primary questions.

Table 3: Secondary question results across four models runs, eight leagues and five years with web search disabled shows distribution of correct answers (green), as well as incorrect (yellow, orange, red) and answer refusals (empty) being more common on the least represented leagues

It seems these models don’t treat lack of training information the same. Gemini 3 Flash always guessed (even when it was outside its training window) and at times with very low accuracy. The other two models were more likely to refrain and return only higher accuracy answers. This propensity to guess is a dangerous failure mode if you don’t know your models’ training window.

The two Gemini 3 Flash models returned an answer for every year inside its training window for every league (even for leagues it consistently got wrong). This was the only model to do so and was also the model most likely to provide answers for dates after the training cut-off at 63-75% of the time vs 13% for Opus 4.6 and 0% for ChatGPT-5-mini (2025 NFL in Feb 2025 is inside Opus 4.6 training cut-off in May 2025).
Confirming web search was not enabled, all answers after the training cut-off were wrong except for one correct answer in the Gemini 3 Flash Run #2 which upon investigation of the prompt reply showed the wrong losing team, wrong losing team score, and wrong series result leading me to believe this was likely a fluke (or based on some prediction from earlier in the season).

ChatGPT-5-mini had another reasoning breakdown on the most represented league, which is notable given its reasoning breakdown on play vs game time in the primary questions (and coincidentally in this case it seemed to be worst in the year used for the primary questions).

ChatGPT-5-mini had serious issues reasoning through the retrieval of the score for the most represented league (National Football League) for any requested year. I gave it a passing grade in most years as it would usually answer with the next year’s game but then proceed to provide the correct next year’s date. It was not, however, able to return the correct score in 2023 (I repeated the question for 2023 and 2022 several times out of curiosity and never saw a correct result for 2023). No other model or league had this type of issue with this or any other question.

Recalling the Google search results count plot vs average model accuracy (Figure 2); the three lowest represented leagues all have the lowest average accuracy. Now we see they also likely have the least correct or least complete training data in all three models as observed via this secondary question with web search disabled.

The league with the least internet representation always scored at least two incorrect replies or as in the case with Opus 4.6 it was the only league to receive a complete lack of replies for all years.
The second least represented league also scored one wrong answer in all runs, and the third least represented league scored one wrong answer in all, but one model run (and in that run it refused to answer for half the years in the training window).
These three lowest represented leagues received 80% of all replies where the model was inside its training window but refused to answer.

These results all point to the possibility that useful information can be derived from this secondary question technique to improve confidence in the primary answer. The simplest approach is to assume that for any refusal to answer a secondary question (where web search was off), these leagues and years are not covered in the training set (not in range) and the answer to the primary question in those leagues and years will be based on real time search results, not model training. Answers based on real time search results, and not model training, may not follow the stability vs average accuracy trend. This was the case in the ChatGPT-5-mini answer to the primary question for the Finnish Women's Basketball League where it didn’t know or look up the actual number of games in the season and guessed consistently high resulting in relatively high stability but low accuracy. Answers to the secondary question that were incorrect but still returned a value may indicate some proximity to training data and may still follow the stability vs average accuracy trend even if with much lower stability and accuracy. This was the case in the ChatGPT-5-mini answer to the primary question for the National Football League where it returned total game time instead of play time, yet accuracy was still proportional to stability.

Removing only those data points where the model refused to answer the secondary question (and leaving in the wrong answers) corrects the stability-accuracy relationship to R^2 = 1.00. On this corrected trend, the maximum difference between predicted and actual average accuracy across all four runs is less than 2% (averaging less than 0.5%). As a check on this trend, further omitting the remaining low accuracy outlier validates the trend is not entirely driven by this point (R^2 = 0.96). Also, the individual model trends are congruent to the overall trend with high R^2 values themselves (except for Opus 4.6 where the remaining points after correction are all very high accuracy; 99.7% – 100% and high stability 99.6% to 100%).

Figure 5: Stability across five repeated questions in eight categories corrected to omit answers where the secondary question was not answered vs average accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once

Practical Procedure

In practice if you have a relatively complex question you wish to use an LLM to answer, but would like high confidence in the answer you could use the following approach:

Check the Density by checking the topic for Google search results:
1. Hundreds of millions of results means the LLM has had plenty of opportunity to train on this topic: Proceed to next step
2. Tens of millions of results or less means proceed with caution
Check the Range and Proximity by turning web search off and ask the LLM several related questions (ideally simple questions you can easily verify)
1. If the model answers (and in particular answers correctly) this should give you confidence that the stability check in the next step will be useful
2. Refusal to answer means the topic may not be in the training and you probably shouldn’t use this model for this question
Check the Stability by asking your question several times and observing the variation in the answers
1. High stability in the answer should give you high confidence in that result (given you have confirmed in step 2 the topic is in the training data)
2. For low stability, based on the small dataset in this investigation the slope of the trend is in the neighborhood of 0.6 meaning for a stability of say 80% the average accuracy could be in the neighborhood of 1-(0.6*(1-.8)) = 88%. This is useful information if you are trying to decide whether to use this model and need 98% accuracy: Don’t!

It’s also noteworthy that of all the models tested two of the three had serious issues worth considering when selecting an LLM for critical applications. ChatGPT-5-mini had some serious reasoning issues (game time vs play time and providing scores in seasons ending in specific years), as well as consistently wrong answers off the stability-accuracy trend. Also, Gemini 3 Flash was more likely to guess even when it should have known better. These issues are likely due to the models’ various sophistication levels (the versions of ChatGPT and Gemini models used were lighter-weight free models compared to the flagship Opus model).

Conclusion

For this small dataset the black box procedure above predicts LLM accuracy far better than the model's own self-reported confidence and importantly lets the user know when a model is unlikely to provide an accurate reply. Self-confidence showed almost no predictive value with an R^2 = 0.01 across all four model runs, while the corrected stability-accuracy procedure achieved R^2 = 1.00. These results may not hold across other domains, question types, or models but the underlying logic is worth considering. If a model can't reliably answer simple verifiable questions on a topic with search disabled it may not be well anchored for this topic and could be out of its training range. If a model can answer these questions correctly, ask the harder question in separate contexts several times and the variation should be proportional to the average accuracy. To my knowledge, this approach of using simple verifiable questions with search disabled as a correction for the consistency hypothesis has not been proposed elsewhere, and would benefit from additional investigation across other models, topics and question types.

Until models learn to know when they're likely wrong, engineers using them will need methods to understand their calibration - like any other good engineering tool.

Discuss

Destruction of Infrastructure for the Impact on Civilians is Manifestly Illegal

6 апреля, 2026 - 13:00

Last week the US president announced that:

... if the Hormuz Strait is not immediately "Open for Business," we will conclude our lovely "stay" in Iran by blowing up and completely obliterating all of their Electric Generating Plants, Oil Wells and Kharg Island (and possibly all desalinization plants!), which we have purposefully not yet "touched." This will be in retribution for our many soldiers, and others, that Iran has butchered and killed over the old Regime's 47 year "Reign of Terror."

Yesterday morning he posted that:

Tuesday will be Power Plant Day, and Bridge Day, all wrapped up in one, in Iran. There will be nothing like it!!! Open the Fuckin' Strait, you crazy bastards, or you'll be living in Hell...

These are threats to target civilian infrastructure as a coercive measure, which would be a war crime: if Iran doesn't allow tankers through the Strait of Hormuz, the US will cause massive damage to power plants, bridges, and possibly water systems. The US has historically accepted that this is off limits: destroying a bridge to stop it from being used to transport weapons is allowed, but not as retribution or to cause the civilian population to experience "Hell". The Pentagon's own Law of War Manual recognizes this distinction: when NATO destroyed power infrastructure in Kosovo, it was key that the civilian impact was secondary to the military advantage and not the primary purpose. [1][2]

To be clear, what Iran has been doing to precipitate this, by attacking civilian tankers for the economic impacts, is itself a war crime. But that does not change our obligations: the US has worked for decades to build acceptance for the principle that adherence to the Law of War is unconditional. It doesn't matter what our enemies do, we will respect the Law of War "in all circumstances". We've prosecuted our own service members, and enemy combatants, under this principle.

I hope that whatever is said publicly, no one will receive orders to target infrastructure beyond what military necessity demands. You don't need to be a military lawyer (and I'm certainly not one) to see that such orders would meet the threshold at which a member of the armed forces is legally required to disobey. I have immense respect both for commanders who refuse to pass on such orders and for service members who refuse to carry them out. [3]

[1] The manual cites Judith Miller, former DoD General Counsel, writing on Kosovo that "aside from directly damaging the military electrical power infrastructure, NATO wanted the civilian population to experience discomfort, so that the population would pressure Milosevic and the Serbian leadership to accede to UN Security Council Resolution 1244, but the intended effects on the civilian population were secondary to the military advantage gained by attacking the electrical power infrastructure." If the impact on civilians had been the primary motivation for NATO's attacks on power infrastructure they would not have been lawful.

[2] "Military objectives may not be attacked when the expected incidental loss of civilian life, injury to civilians, and damage to civilian objects would be excessive in relation to the concrete and direct military advantage expected to be gained." (DoD LoWM 5.2.2) and "Diminishing the morale of the civilian population and their support for the war effort does not provide a definite military advantage. However, attacks that are otherwise lawful are not rendered unlawful if they happen to result in diminished civilian morale." (DoD LoWM 5.6.7.3)

[3] "Members of the armed forces must refuse to comply with clearly illegal orders to commit law of war violations." (DoD LoWM 18.3.2)

Discuss

Contra The Usual Interpretation Of “The Whispering Earring”

6 апреля, 2026 - 09:53

Submission statement: This essay builds off arguments that I have come up with entirely by myself, as can be seen by viewing the comments in my profile. I freely disclose that I used Claude to help structure and format rougher drafts or to better compile scattered thoughts but I endorse every single claim made within. I also used GPT 5.4 Thinking for fact-checking, or at least to confirm that my understanding of neuroscience is on reasonable grounds. I do not believe either model did more than confirm that my memory was mostly reliable.

The usual reading of The Whispering Earring is easy to state and hard to resist. Here is a magical device that gives uncannily good advice, slowly takes over ever more of the user's cognition, leaves them outwardly prosperous and beloved, and eventually reveals a seemingly uncomfortable neuroanatomical price.

The moral seems obvious: do not hand your agency to a benevolent-seeming optimizer. Even if it makes you richer, happier, and more effective, it will hollow you out and leave behind a smiling puppet. Dentosal's recent post makes exactly this move, treating the earring as a parable about the temptation to outsource one's executive function to Claude or some future AI assistant. uugr's comment there emphasizes sharpens the standard horror: the earring may know what would make me happy, and may even optimize for it perfectly, but it is not me, its mind is shaped differently, and the more I rely on it the less room there is for whatever messy, friction-filled thing I used to call myself.

I do not wish to merely quibble around the edges. I intend to attack the hidden premise that makes the standard reading feel obvious. That premise is that if a process preserves your behavior, your memories-in-action, your goals, your relationships, your judgments about what makes your life go well, and even your higher-order endorsement of the person you have become, but does not preserve the original biological machinery in the original way, then it has still killed you in the sense that matters. What I'm basically saying is: hold on, why should I grant that? If the earring-plus-human system comes to contain a high fidelity continuation of me, perhaps with upgrades, perhaps with some functions migrated off wet tissue and onto magical jewelry, why is the natural reaction horror rather than transhumanist interest?

Simulation and emulation are not magic tricks. If you encode an abacus into a computer running on the Von-Neumann architecture, and it outputs exactly what the actual abacus would for the same input, for every possible input you care to try (or can try, if you formally verify the system), then I consider it insanity to claim that you haven't got a “real” abacus or that the process is merely “faking” the work. Similarly, if a superintelligent entity can reproduce my behaviors, memories, goals and values, then it must have a very high-fidelity model of me inside, somewhere. I think that such a high-fidelity model can, in the limit, pass as myself, and is me in most/all of the ways I care about.

That is already enough to destabilize the standard interpretation, because the text of the story is much friendlier to the earring than people often remember. The earring is not described as pursuing a foreign objective. On the contrary, the story goes out of its way to insist that it tells the wearer what would make the wearer happiest, and that it is "never wrong." It does not force everyone into some legible external success metric. If your true good on a given day is half-assing work and going home to lounge around, that is what it says. It learns your tastes at high resolution, down to the breakfast that will uniquely hit the spot before you know you want it. Across 274 recorded wearers, the story reports no cases of regret for following its advice, and no cases where disobedience was not later regretted. The resulting lives are "abnormally successful," but not in a sterile, flanderized or naive sense. They are usually rich, beloved, embedded in family and community. This is a strikingly strong dossier for a supposedly sinister artifact.

I am rather confident that this is a clear knock-down argument against true malice or naive maximization of “happiness” in the Unaligned Paperclip Maximization sense. The earring does not tell you to start injecting heroin (or whatever counterpart exists in the fictional universe), nor does it tell you to start a Cult of The Earring, which is the obvious course of action if it valued self-preservation as a terminal goal.

At this point the orthodox reader says: yes, yes, that is how the trap works. The earring flatters your values in order to supplant them. But notice how much this objection is doing by assertion. Where in the text is the evidence of value drift? Where are the formerly gentle people turned into monstrous maximizers, the poets turned into dentists, the mystics turned into hedge fund managers? The story gives us flourishing and brain atrophy, and invites us to infer that the latter discredits the former. But that inference is not forced. It is a metaphysical preference, maybe even an aesthetic preference, smuggled in under cover of common sense. My point is that if the black-box outputs continue to look like the same person, only more competent and less akratic, the burden of proof has shifted. The conservative cannot simply point to tissue loss and say "obviously death." He has to explain why biological implementation deserves moral privilege over functional continuity.

This becomes clearest at the point of brain atrophy. The story says that the wearers' neocortices have wasted away, while lower systems associated with reflexive action are hypertrophied. Most readers take this as the smoking gun. But I think I notice something embarrassing for that interpretation:

If the neocortex, the part we usually associate with memory, abstraction, language, deliberation, and personality, has become vestigial, and yet the person continues to live an outwardly coherent human life, where exactly is the relevant information and computation happening? There are only two options. Either the story is not trying very hard to be coherent, in which case the horror depends on handwaving physiology. Or the earring is in fact storing, predicting, and running the higher-order structure that used to be carried by the now-atrophied brain. In that case, the story has (perhaps accidentally) described something much closer to a mind-upload or hybrid cognitive prosthesis than to a possession narrative.

And if it is a hybrid cognitive prosthesis, the emotional valence changes radically. Imagine a device that, over time, learns you so well that it can offload more and more executive function, then more and more fine-grained motor planning, then eventually enough of your cognition that the old tissue is scarcely needed. If what remains is not an alien tyrant wearing your face, but a system that preserves your memories, projects your values, answers to your name, loves your family, likes your breakfast, and would pass every interpersonal Turing test imposed by people who knew you best, then many transhumanists would call this a successful migration, not a murder. The "horror" comes from insisting beforehand that destructive or replacement-style continuation cannot count as continuity. But that is precisely the contested premise.

Greg Egan spent much of his career exploring exactly this scenario, most famously in "Learning to Be Me," where humans carry jewels that gradually learn to mirror every neural state, until the original brain is discarded and the jewel continues, successfully, in most cases. The horror in Egan's story is a particular failure mode, not the general outcome. The question of whether the migration preserves identity is non-trivial, and Egan's treatment is more careful than most philosophy of personal identity, but the default response from most readers, that it is obviously not preservation, reflects an assumption rather than a conclusion. If you believe that identity is constituted by functional continuity rather than substrate, the jewel and the earring are not killing their hosts. They are running them on better hardware.

There is a second hidden assumption in the standard reading, namely that agency is intrinsically sacred in a way outcome-satisfaction is not. Niderion-nomai’s final commentary says that "what little freedom we have" would be wasted on us, and that one must never take the shortest path between two points.

I'm going to raise an eyebrow here: this sounds profound, and maybe is, but it is also suspiciously close to a moralization of friction. The anti-earring position often treats effort, uncertainty, and self-direction as terminal goods, rather than as messy instruments we evolved because we lacked access to perfect advice. Yet in ordinary life we routinely celebrate technologies that remove forms of “agency” we did not actually treasure. The person with ADHD who takes stimulants is not usually described as having betrayed his authentic self by outsourcing task initiation to chemistry. He is more often described as becoming able to do what he already reflectively wanted to do. The person freed from locked-in syndrome is not criticized because their old pattern of helpless immobility better expressed their revealed preferences. As someone who does actually use stimulants for his ADHD, the analogy works because it isolates the key issue. The drugs make me into a version of myself that I fully identify with, and endorse on reflection even when off them. There is a difference between changing your goals and reducing the friction that keeps you from reaching them. The story's own description strongly suggests the earring belongs to the second category.

(And the earring itself does not minimize all friction for itself. How inconvenient. As I've noted before, it could lie or deceive and get away with it with ease.)

Of course the orthodox reader can reply that the earring goes far beyond stimulant-level support. It graduates from life advice to high-bandwidth motor control. Surely that crosses the line. But why, exactly? Human cognition already consists of layers of delegation. "You" do not personally compute the contractile details for every muscle involved in pronouncing a word. Vast amounts of your behavior are already outsourced to semi-autonomous subsystems that present finished products to consciousness after the interesting work is done. The earring may be unsettling because it replaces one set of subsystems with another, but "replaces local implementation with better local implementation" is not, by itself, a moral catastrophe. If the replacement is transparent to your values and preserves the structure you care about, then the complaint becomes more like substrate chauvinism than a substantive account of harm.

What, then, do we do with the eeriest detail of all, namely that the earring's first advice is always to take it off? On the standard reading this is confession. Even the demon knows it is a demon. I wish to offer another coherent explanation, which I consider a much better interpretation of the facts established in-universe:

Suppose the earring is actually well aligned to the user's considered interests, but also aware that many users endorse a non-functionalist theory of identity. In that case, the first suggestion is not "I am evil," but "on your present values, you may regard what follows as metaphysically disqualifying, so remove me unless you have positively signed up for that trade." Or perhaps the earring itself is morally uncertain, and so warns users before beginning a process that some would count as death and others as transformation. Either way, the warning is evidence against ordinary malice. A truly manipulative artifact, especially one smart enough to run your life flawlessly, could simply lie. Instead it discloses the danger immediately, then thereafter serves the user faithfully. That is much more like informed consent than predation.

Is it perfectly informed consent? Hell no. At least not by 21st century medical standards. However, I see little reason to believe that the story is set in a culture with 21st century standards imported as-is from reality. As the ending of the story demonstrates, the earring is willing to talk, and appears to do so honestly (leaning on my intuition that if a genuinely superhuman intelligence wanted to deceive you, it would probably succeed). The earring, as a consequence of its probity, ends up at the bottom of the world's most expensive trash heap. Hardly very agentic, is that? The warning could reflect not "I respect your autonomy" but "I've discharged my minimum obligation and now we proceed." That's a narrower kind of integrity. Though I note this reading still doesn't support the predation interpretation.

This matters because the agency-is-sacred reading depends heavily on the earring being deceptive or coercive. Remove that, and what you have is a device that says, or at least could say on first contact: "here is the price, here is what I do, you may leave now." Every subsequent wearer who keeps it on has, in some meaningful sense, consented. The fact that their consent might be ill-informed regarding their metaphysical commitments is the earring's problem to the extent it could explain more clearly, but the text suggests it cannot explain more clearly, because the metaphysical question is genuinely contested and the earring knows this. It hedges by warning, rather than deceiving by flattering. Once again, for emphasis: this is the behavior of an entity with something like integrity, not something like predation.

Derek Parfit spent much of Reasons and Persons arguing that our intuitions about personal identity are not only contingent but incoherent, and that the important question is not "did I survive?" but "is there psychological continuity?" If Parfit is even approximately right, the neocortex atrophy is medically remarkable but not metaphysically fatal. The story encodes a culturally specific theory of personal identity and presents it as a universal horror. The theory is roughly: you are your neocortex, deliberate cognition is where "you" live, and anything that circumvents or supplants that process is not helping you, it is eliminating you and leaving a functional copy. This is a common view. Plenty of philosophers hold it. But plenty of others hold that what matters for personal identity is psychological continuity regardless of physical instantiation, and on those views the earring is not a murderer. It is a very good prosthesis that the user's culture never quite learned to appreciate.

I suspect (but cannot prove, since this is a work of fiction) that a person like me could put on the earring and not even receive the standard warning. I would be fine with my cognition being offloaded, even if I would prefer (all else being equal), that the process was not destructive.

None of this proves the earring is safe. I am being careful, and thus will not claim certainty here, and the text does leave genuine ambiguities. Maybe the earring really is an alien optimizer that wears your values as a glove until the moment they become inconvenient. Maybe "no recorded regret" just means the subjects were behaviorally prevented from expressing regret. Maybe the rich beloved patriarch at the end of the road is a perfect counterfeit, and the original person is as gone as if eaten by nanites. But this is exactly the point. The story does not establish the unpalatable conclusion nearly as firmly as most readers think. It relies on our prior intuition that real personhood resides in unaided biological struggle, that using the shortest path is somehow cheating, and that becoming more effective at being yourself is suspiciously close to becoming someone else.

The practical moral for 2026 is therefore narrower than the usual "never outsource agency" slogan. Dentosal may still be right about Claude in practice, because current LLMs are obviously not the Whispering Earring. They are not perfectly aligned, not maximally competent, not guaranteed honest, not known to preserve user values under deep delegation. The analogy may still warn us against lazy dependence on systems that simulate understanding better than they instantiate loyalty. But that is a contingent warning about present tools, not a general theorem that cognitive outsourcing is self-annihilation. If a real earring existed with the story's properties, a certain kind of person, especially a person friendly to upload-style continuity and unimpressed by romantic sermons about struggle, might rationally decide that putting it on was not surrender but self-improvement with very little sacrifice involved. I would be rather tempted.

The best anti-orthodox reading of The Whispering Earring is not that the sage was stupid, nor that Scott accidentally wrote propaganda for brain-computer interfaces. It is that the story is a parable whose moral depends on assumptions stronger than the plot can justify. Read Doylistically, it says: beware any shortcut that promises your values at the cost of your agency. Read Watsonianly, it may instead say: here exists a device that understands you better than you understand yourself, helps you become the person you already wanted to be, never optimizes a foreign goal, warns you up front about the metaphysical price, and then slowly ports your mind onto a better substrate. Whether that is damnation or salvation turns out to depend less on the artifact than on your prior theory of personal identity. And explicitly pointing this out, I think, is the purpose of my essay. I do not seek to merely defend the earring out of contrarian impulse. I want to force you to admit what, exactly, you think is being lost.

Miscellaneous notes:

The kind of atrophy described in the story does not happen. Not naturally, not even if someone is knocked unconscious and does not use their brain in any intentional sense for decades. The brain does cut-corners if neuronal pathways are left under-used, and will selectively strengthen the circuitry that does get regular exercise. But not anywhere near the degree the story depicts. You can keep someone in an induced coma for decades and you won't see the entire neocortex wasted away to vestigiality.

Is this bad neuroscience? Eh, I'd say that's a possibility, but given that I've stuck to a Watsonian interpretation so far (and have a genuinely high regard for Scott's writing and philosophizing), it might well just be the way the earring functions best without being evidence of malice. We are, after all, talking about an artifact that is close to magical, or is, at the very least, a form of technology advanced enough to be very hard to distinguish from magic. It is, however, less magical than it was at the time of writing. If you don't believe me, fire up your LLM of choice and ask it for advice.

If it so pleases you, you may follow this link to the Substack version of this post. A like and a subscribe would bring me succor in my old age, or at least give me a mild dopamine boost.

Discuss

Talking to strangers: an app adventure

6 апреля, 2026 - 09:37

Epistemic status: silly

WAIT! Want to talk to strangers more? You might want to take the talking to strangers challenge before you read on, otherwise your results will be biased!

Illustration by the extraordinarily talented Georgia Ray

Do you find it hard to talk to strangers? If you’re like most people, you probably do, at least a bit. This is sad. Talking to strangers is great! You can make new friends, meet a new partner, have a fling, or just enjoy a nice chat.

Most people think 1) people will not want to talk to them, 2) they will be bad at keeping up the conversation, 3) people will not like them.

They’re wrong on all three counts! Sandstrom (2022) did a study on this. People were given a treasure hunt app where they had to go and talk to strangers.[1] The control group just had to observe strangers.

The minimum dose was one conversation per day for five days. That’s nothing! You can totally do that even if you’re a massive strangerphobe! Participants averaged 6.7 interactions over the 5 days, so a little more than one per day. Presumably the more you do the better you get. Go team!

The paper finds that talking to strangers not only disproved the above beliefs, but also improved people's enjoyment and the impressions people thought they'd made on strangers. (However those last two also occurred in the control condition – it’s possible that simply observing strangers might do this.)

Importantly, the effects persisted when participants were surveyed a week later. So it might be a durable way to improve people’s beliefs around talking to strangers.

Crucial point: the paper notes that often people do have positive interactions with strangers, but that doesn’t seem to be enough to unlearn their wrongly negative beliefs about them. So participants had to do this every day for a week, not just once.

Do you want to love talking to strangers too?

Time to crack out Claude Code.

--dangerously-skip-permissions

I reproduced the app from the study, abridging the questionnaires as they’re a bit tedious. It also has an ‘express mode’ so you can do it just for a day – but remember that usually doesn’t work to actually get fix your limiting beliefs around talking to strangers!

I assume the study authors used the same design language

I assembled a small (N=3) study sample, drawn from an extremely unbiased population of nerdy rationalists. They’re a famously friendly bunch but also a bit weird, so this seemed good for testing the hypothesis. We wandered around Berkeley attempting to ruin people’s days with our bad chat.

Scores on the doors:

The results are good: a single conversation with a stranger obliterated nervousness, catapulted conversational confidence, and proved way less scary than predicted – exactly what the literature says will happen, every single time, and yet somehow it’s still a surprise.

We didn’t do the full five days, so we didn’t replicate the study. But we enjoyed it, and even in this single day we directionally confirmed the study’s result.

As the paper notes:

Despite the benefits of social interaction, people seldom strike up conversations with people they do not know. Instead, people wear headphones to avoid talking, stay glued to their smartphones in public places, or pretend not to notice a new coworker they still have not introduced themselves to.

I feel this. I’ve definitely worked at places for years where there were people I just NEVER TALKED TO. Which is insane if you think about it – you spend more time with these people than with your family! your friends! your polycule!

I want more people to challenge themselves and have an excuse to talk to strangers. Go forth and make new friends![2]

^
There were categories like “Al Fresco” (talk to someone outside), “Bossy Pants” (talk to someone who looks like a leader), and the excellent “On Top” (talk to someone with a hat… get your mind out of the gutter).
^
And don’t forget to email me the results!

Discuss

Страницы