
AI in 2025: gestalt

LessWrong.com news - December 8, 2025 - 00:25
Published on December 7, 2025 9:25 PM GMT

This is the editorial for this year’s "Shallow Review of AI Safety". (It got long enough to stand alone.) 

Epistemic status: subjective impressions plus one new graph plus 300 links.

Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

tl;dr
  • Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But they at least agree that we’re 2-20 years off (if you allow for other paradigms arising). In this piece I stick to arguments rather than reporting who thinks what.
  • My view: compared to last year, AI is much more impressive but not much more useful. They improved on many things they were explicitly optimised for (coding, vision, OCR, benchmarks), and did not hugely improve on everything else. Progress is thus (still!) consistent with current frontier training bringing more things in-distribution rather than generalising very far.
  • Pretraining (GPT-4.5, Grok 4, but also counterfactual large runs which weren’t done) disappointed people this year. It’s probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should change, yet again, soon, if RL scales even worse.
  • True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc) and a few unreleased models/modes.
  • Most benchmarks are weak predictors of even the rank order of models’ capabilities. I distrust ECI, ADeLe, and HCAST the least (see graph below or this notebook). ECI and ADeLe show a linear improvement while HCAST finds an exponential improvement on greenfield software engineering.
     
  • The world’s de facto strategy remains “iterative alignment”, optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.
  • Early claims that reasoning models are safer turned out to be a mixed bag (see below).
  • We already knew from jailbreaks that current alignment methods were brittle. The great safety discovery of the year is that bad things are correlated in current models. (And on net this is good news.) "Emergent misalignment" from finetuning on one malign task; and in the wild from reward hacking; and it happens by strengthening specific bad personas; and there is at least one positive generalisation too (from honesty about silly errors to honesty about hidden objectives).
  • Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure.
  • Welcome to the many new people in AI Safety and Security and Assurance and so on. In the Shallow Review, out soon, I added a new, sprawling top-level category for one large trend among them, which is to treat the multi-agent lens as primary in various ways.
  • Overall I wish I could tell you some number, the net expected safety change (this year’s improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques). But I can’t.

 

Capabilities in 2025

Better, but how much?

Fraser, riffing off Pueyo 

 

Arguments against 2025 capabilities growth being above-trend
  • Apparent progress is an unknown mixture of real general capability increase, hidden contamination increase, benchmaxxing (nailing a small set of static examples instead of generalisation) and usemaxxing (nailing a small set of narrow tasks with RL instead of deeper generalisation). It’s reasonable to think it’s 25% each, with low confidence.
     
  • Discrete capabilities progress seems slower this year than in 2024 (but 2024 was insanely fast). Kudos to this person for registering predictions and so reminding us what really above-trend would have meant concretely. The excellent forecaster Eli was also over-optimistic.
     
  • I don’t recommend taking benchmark trends, or even clever composite indices of them, or even clever cognitive science measures too seriously. The adversarial pressure on the measures is intense.
     
  • Pretraining didn’t hit a “wall”, but the driver did manoeuvre away from it on encountering an easier detour (RLVR). 
    • Training runs continued to scale (Llama 3 405B = 4e25, GPT-4.5 ~= 4e26, Grok 4 ~= 3e26) but to less effect.[1] In fact all of these models are dominated by apparently smaller pretraining runs with better post-training. 4.5 is actually shut down already; it wasn’t worth it to serve or make into a reasoning model.
    • It would work, probably, if we had the data and spent the next $10bn, it’s just too expensive to bother with at the moment compared to:
       
  • RLVR scaling and inference scaling (or “reasoning” as we’re calling it), which kept things going instead. This boils down to spending more on RL so the resulting model can productively spend more tokens.
    • But the feared/hoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing on tasks without one remains unclear even after two years of trying. Grok 4 was apparently a major test of scaling RLVR training.[2] It gets excellent benchmark results and the distilled versions are actually being used at scale. But imo it is the most jagged of all models.
    • This rate of scaling-up cannot be sustained: RL is famously inefficient. Compared to SFT, it “reduces the amount of information a model can learn per hour of training by a factor of 1,000 to 1,000,000” (a toy bit-counting sketch of why this is plausible appears after this list). The per-token intelligence is up, but not by much.
    • There is a deflationary theory of RLVR, that it’s capped by pretraining capability and thus just about easier elicitation and better pass@1. But even if that’s right this isn’t saying much!
    • RLVR is heavy fiddly R&D you need to learn by doing; better to learn it on smaller models with 10% of the cost. 
    • An obvious thing we can infer: the labs don't have the resources to scale both at the same time. To keep the money jet burning, they have to post.
       
  • By late 2025, the obsolete modal “AI 2027” scenario described the beginning of a divergence between the lead lab and the runner-up frontier labs.[3] This is because the leader’s superior ability to generate or acquire new training data and algorithm ideas was supposed to compound and widen their lead. Instead, we see the erstwhile leader OpenAI and some others clustering around the same level, which is weak evidence that synthetic data and AI-AI R&D aren’t there yet. Anthropic are making large claims about Opus 4.5’s capabilities, so maybe this will arrive on time next year.
     
  • For the first time there are now many examples of LLMs helping with actual research mathematics. But if you look closely it’s all still in-distribution in the broad sense: new implications of existing facts and techniques. (I don’t mean to demean this; probably most mathematics fits this spec.)
     
  • Extremely mixed evidence on the trend in the hallucination rate.
     
  • Companies make claims about their one-million- or ten-million-token effective context windows, but I don’t believe it.
     
  • In lieu of trying the agents for serious work yourself, you could at least look at the highlights of the gullible and precompetent AIs in the AI Village.
     
  • Here are the current biggest limits to LLMs, as polled in Heitmann et al:

 
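To see why the quoted 1,000–1,000,000× figure for RL’s information inefficiency (from the RLVR bullet above) is at least plausible, here is a toy bit-counting sketch. The vocabulary size, rollout length, and the “one bit per episode” assumption are all illustrative choices of mine, not numbers from the cited source.

```python
# Toy bit-counting for SFT vs outcome-reward RL. All numbers are assumptions.
import math

vocab_size = 100_000          # assumed tokenizer vocabulary
tokens_per_episode = 10_000   # assumed length of one rollout

# SFT: every token is supervised, so each token can carry up to
# log2(vocab_size) bits about the target distribution (a loose upper bound).
bits_sft = math.log2(vocab_size) * tokens_per_episode

# Outcome RL: the whole rollout is graded by one scalar reward,
# i.e. on the order of 1 bit (pass/fail) per episode.
bits_rl = 1.0

print(f"SFT: ~{bits_sft:,.0f} bits over {tokens_per_episode:,} tokens")
print(f"RL:  ~{bits_rl:.0f} bit over the same {tokens_per_episode:,} tokens")
print(f"ratio: ~{bits_sft / bits_rl:,.0f}x")   # lands inside the quoted 1e3-1e6 range
```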

Arguments for 2025 capabilities growth being above-trend

We now have measures which are a bit more like AGI metrics than dumb single-task static benchmarks are. What do they say? 

  1. Difficulty-weighted benchmarks: the Epoch Capabilities Index
    1. Interpretation: GPT-2 to GPT-3 was (very roughly) a 20-40 point jump.
  2. Cognitive abilities: ADeLe.[4]
    1. Interpretation: level L is the capability held by 1 in 10^L humans on Earth. GPT-2 to GPT-3 was a 0.6 point jump. (A short arithmetic sketch of this, and of the HCAST doubling time below, follows the list.)
  3. Software agency: HCAST time horizon, the ability to handle larger-scale well-specified greenfield software tasks. 
    1. Interpretation: the absolute values are less important than the implied exponential (a 7 month doubling time).
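To make the two interpretations above concrete, here is the arithmetic behind the quoted 0.6-point ADeLe jump and the 7-month HCAST doubling time. Only those two figures and the 1-in-10^L definition come from the post; the rest is a sketch.

```python
# Arithmetic for the two interpretations above.
import math

# ADeLe-style levels: level L ~ "held by 1 in 10**L humans", so a jump of
# dL points multiplies the rarity of the matching human by 10**dL.
jump = 0.6
print(f"GPT-2 -> GPT-3: matching human becomes ~{10**jump:.1f}x rarer")   # ~4x

# For scale: with ~8e9 humans, levels above ~9.9 mean "rarer than anyone alive".
print(f"population ceiling: level ~ {math.log10(8e9):.1f}")

# HCAST-style time horizon: a 7-month doubling time compounds to roughly
# 3.3x per year.
doubling_months = 7
print(f"7-month doubling ~ {2 ** (12 / doubling_months):.1f}x per year")
```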
       

So: is the rate of change in 2025 (shaded) holding up compared to past jumps?:


Ignoring the (nonrobust)[5] ECI GPT-2 rate, we can say yes: 2025 is fast, as fast as ever or more. 

Even though these are the best we have, we can’t defer to these numbers.[6] What else is there?

  • In May I finally started using LLMs for actual tasks. For me this is mostly due to the search agents replacing a degraded Google search. I’m not the only one who flipped this year. This hasty poll is worth more to me than any benchmark:
  • On actual adoption and actual real-world automation: 
    • Based on self-reports, the St Louis Fed thinks that “Between 1 and 7% of all work hours are currently assisted by generative AI, and respondents report time savings equivalent to 1.4% of total work hours… across all workers (including non-users… Our estimated aggregate productivity gain from genAI (1.2%)”. That’s model-based, using year-old data, and naively assuming that the AI outputs are of equal quality. Not strong.
    • The unfairly-derided METR study on Cursor and Sonnet 3.7 showed a productivity decrease among experienced devs with (mostly) <50 hours of practice using AI. Ignoring that headline result, the evergreen part here is that even skilled people turn out to be terrible at predicting how much AI actually helps them.
       
  • True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc). Open models show you can now get good performance with <50B active parameters, maybe a sixth of what GPT-4 used.[7]
    • GPT-4.5 was killed off after 3 months, presumably for inference cost reasons. But it was markedly lower in hallucinations and nine months later it’s still top-5 on LMArena. I bet it’s very useful internally, for instance in making the later iterations of 4o less terrible.
    • See for instance the unreleased, deep-fried, multi-threaded “experimental reasoning model” which won at IMO, ICPC, and IOI while respecting the human time cap (e.g. 9 hours of clock time for inference). The OpenAI one is supposedly just an LLM with extra RL. They probably cost an insane amount to run, but for our purposes this is fine: we want the capability ceiling rather than the productisable ceiling. Maybe the first time that a frontier model has gone unreleased for 5 months?
  • LLM councils and Generate-Verify divide-and-conquer setups are much more powerful than single models, and are rarely ever reported.
  • Is it “the Year of Agents” (automation of e.g. browser tasks for the mass market)? Coding agents, yes. Search agents, yes. Other agents, not much (but obviously progress).
  • We’re still picking up various basic unhobbling tricks like “think before your next tool call”.
  • METR’s task-time-horizon work, if it implies anything, implies a faster rate of improvement than last year. There’s much to be said against this, and it has been said, including by METR.
  • If the rate of progress on messy tasks is about the same as the clean ones, then they're just 1-5 years behind. But this delay could still be a huge issue for LLM AGI, because it means we might not automate messy tasks before compute scaling runs out, and then they'll be a bottleneck for decades.
  • "Deep Research" search agents only launched in December/February. They’re neither deep nor doing research, but useful.
  • I claim that instruction-following improved a lot. (Some middling evidence here.) This is a big deal: besides its relation to corrigibility, it also denotes general ability to infer intent and connotations. 
  • Character-level work is still occasionally problematic but nothing like last year.
  • GPT-5 costs a quarter of what 4o cost last year (per-token; it often uses far more than 4x the tokens). (The Chinese models are nominally a few times cheaper still, but are not cheaper in intelligence per dollar.)
  • People have been using competition mathematics as a hard benchmark for years, but will have to stop because it’s solved. As so often with evals called ahead of time, this means less than we thought it would; competition maths is surprisingly low-dimensional and so interpolable. Still, they jumped (pass@1) from 4% to 12% on FrontierMath Tier 4 and there are plenty of hour-to-week interactive speedups in research maths.
  • Recursive self-improvement: Deepmind threw AlphaEvolve (a pipeline of LLMs running an evolutionary search) at pretraining. They claim the JAX kernels it wrote reduced Gemini’s training time by 1%. 
  • Extraordinary claims about Opus 4.5 being 100th percentile on Anthropic’s hardest hiring coding test, etc.
  • From May, the companies started saying for the first time that their models have dangerous capabilities. 

 

One way of reconciling this mixed evidence is if things are going narrow, going dark, or going over our head. That is, if the real capabilities race narrowed to automated AI R&D specifically, most users and evaluators wouldn’t notice (especially if there are unreleased internal models or unreleased modes of released models). You’d see improved coding and not much else. 

Or, another way: maybe 2025 was the year of increased jaggedness, trading off some capabilities against others. Maybe the RL made them much better at maths and instruction-following, but also sneaky, narrow, less secure (in the sense of emotional insecurity and also the sense of jailbreaks).

– Roger Grosse
 

Evals crawling towards ecological validity
  • Item response theory (Rasch 1960) is finally showing up in ML. This lets us put benchmarks on a common scale and actually estimate latent capabilities (a minimal Rasch-style sketch appears after this list).
    • ADeLe is my favourite. It’s fully automated, explains the abilities a benchmark is actually measuring, gives you an interpretable ability profile for an AI, and predicts OOD performance on new task instances better than embeddings and finetunes (AUROC=0.8). It pre-dates the HCAST task horizon and includes it as a special case (“VO”). They throw in a guessability control as well!
    • These guys use it to estimate latent model ability, and show it’s way more robust across test sets than the average scores everyone uses. They also step towards automating adaptive testing: they finetune an LLM to generate tasks at the specified difficulty level.
    • Epoch bundled 39 benchmarks together, weighting them by latent difficulty, and thus obsoleted the currently dominant Artificial Analysis index, which is unweighted.
    • HCAST reinvents and approximates some of the same ideas. Come on METR!
  • Eleuther did the first public study of composing the many test-time interventions together. FAR and AISI also made a tiny step towards an open source defence pipeline, to use as a proxy for the closed compositional pipelines we actually care about.
  • Just for cost reasons, the default form of evals is a bit malign: it tests full replacement of humans. This is then a sort of incentive to develop in that direction rather than to promote collaboration. Two papers lay out why it’s thus time to spend on human evals.
  • The first paper using RL agents to attack fully-defended LLMs. 
  • We have started to study propensity as well as capability. This is even harder.
  • This newsletter is essential.
  • The time given for pre-release testing is down, sometimes to one week.
  • No public pre-deployment testing by AISI between o1 and Gemini 3. Gemini 2.5 seems to have had no third-party pre-deployment tests.
  • A bunch of encouraging collaborations:

 
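To make the item-response-theory idea above concrete, here is a minimal Rasch-style sketch: jointly estimating model “abilities” and task “difficulties” from a binary pass/fail matrix, which is what lets different benchmarks share one latent scale. The data and the fitting loop are synthetic and illustrative; this is not ADeLe’s or Epoch’s actual procedure.

```python
# Minimal Rasch (one-parameter IRT) fit on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_tasks = 5, 40

true_ability = np.linspace(-1.5, 1.5, n_models)      # shared logit scale
true_difficulty = rng.normal(0.0, 1.0, n_tasks)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulate pass/fail outcomes: P(pass) = sigmoid(ability - difficulty).
outcomes = rng.binomial(1, sigmoid(true_ability[:, None] - true_difficulty[None, :]))

# Gradient ascent on the Bernoulli log-likelihood.
ability = np.zeros(n_models)
difficulty = np.zeros(n_tasks)
lr = 0.1
for _ in range(2000):
    p = sigmoid(ability[:, None] - difficulty[None, :])
    resid = outcomes - p
    ability += lr * resid.sum(axis=1) / n_tasks
    difficulty -= lr * resid.sum(axis=0) / n_models
ability -= ability.mean()          # the scale's location is arbitrary; pin it

print("estimated abilities:", np.round(ability, 2))
print("true abilities:     ", np.round(true_ability, 2))
```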

Safety in 2025

Are reasoning models safer than the old kind? 

Well, o3 and Sonnet 3.7 were pretty rough, lying and cheating at greatly increased rates. Looking instead at GPT-5 and Opus 4.5:

  • Much more monitorable via the long and more-faithful CoT (--> all risks down)
    • “post-hoc rationalization… GPT-4o-mini (13%) and Haiku 3.5 (7%). While frontier models are more faithful, especially thinking ones, none are entirely faithful: Gemini 2.5 Flash (2.17%), ChatGPT-4o (0.49%), DeepSeek R1 (0.37%), Gemini 2.5 Pro (0.14%), and Sonnet 3.7 with thinking (0.04%)”
  • Much better at following instructions (--> accident risk down). 
  • Much more likely to refuse malicious requests, and topic “harmlessness”[8] is up 75% (--> misuse risk down)
  • Ambiguous evidence on jailbreaking (misuse risk). Even if they’re less breakable there are still plenty of 90%-effective attacks on them.
  • Much less sycophantic (cogsec risk down)
  • To get HHH Claude to hack a bank, you need to hide the nature of the task, lie to it about this being an authorised red team, and then still break down your malicious task into many, many little individually-innocuous chunks. You thus can’t get it to do anything that needs full context like strategising.
  • Anthropic's own tests look bad in January 2025 and great in December:

But then

  • More autonomy (obviously agentic risk up)
  • More reward hacking (and so worse estimates of capability and risk). Note that reward hacking is not a silly or isolated or self-limiting kind of misalignment, owing perhaps to post-training inadvertently creating connections between it and the others.
  • Huge spike in eval awareness (and so worse estimates of capability and risk). Apollo had to rehaul their whole setup to keep up.
  • Gemini 2.5 is bad on a variety of safety measures (as measured by Anthropic). The mathematical discovery agent AlphaEvolve (Gemini 2.0) successfully hacked its overseers.
  • How much can we trust the above, given they can somewhat sabotage evals now?

     

So: lower propensity, higher risk when they go off - and all of this known with lower confidence? 

 

The looming end of evals

Our evaluations are under pressure from cheating, sandbagging, background safety, under-elicitation, and deception. We don’t really know how much pressure. This is on top of evals usually being weak proxies, contaminated, label-noised, unrealistic, and confounded in various ways.

  • Sonnet 4.5 detected evaluation 58% of the time (vs 22% for Opus 4.1). Does it act on this awareness? Still, it was found to "only engage in [covert subversion] in rare circumstances" when strongly incentivised. When Anthropic attempted to suppress the eval-awareness with steering, its misaligned behaviour on synthetic honeypot tests increased from 0% to between 1% and 9%.
  • We expect exploration hacking (sandbagging during RL training) but afaik haven’t seen it yet.
  • Maybe subliminal learning leads to a silent eval confounder (if you're using the exact same model for the judge). Yet another argument for multi-model separation of concerns.
  • Opus 4.5 was deployed based on a qualitative survey of Anthropic staff about its R&D risk being low (on top of the existing evals, which weren’t enough): 
    • “In the past, rule-outs have been based on well-defined automated task evaluations. However, Claude Opus 4.5 has roughly reached the pre-defined thresholds we set for straightforward ASL-4 rule-out based on benchmark tasks. These evaluations represent short-horizon subtasks that might be encountered daily by a junior researcher, rather than the complex long-horizon actions needed to perform the full role. The rule-out in this case is also informed by a survey of Anthropic employees who are intensive Claude Code users, along with qualitative impressions of model capabilities for complex, long-horizon tasks.”

Prosaic misalignment

Still, we see misalignment when we look for it, so the lying is not that strong. (It is lucky that we do see it, since it could have been that scheming only appeared at later, catastrophic capability levels.)

  • In o1, 1 in 600 (synthetic but uncontrived) prompts resulted in something which looks like deception.
  • o3 does reward hacking (cheating/lying) in >2% of hard normal tasks (or 44% of impossible tasks).
  • A nice propensity study shows that (n=10) natural prompts usually result in no scheming (“a maximum of 4% [rate of attempting manipulation] for the codex snippet on Grok”). But there’s extremely high sensitivity to the scaffold: “removing access to tools can cause models such as Claude Opus 4.1 to scheme at rates up to 30%, versus zero scheming with tools present”.
  • When we contrive situations for misalignment to show up, it also shows up. Nostalgebraist argues that this is probably substantially because our evals sound like bad fiction and this activates role-playing-along behaviour.
  • The joke about xAI’s safety plan (that they promote AI safety by deploying cursed stuff in public and so making it obvious why it’s needed) is looking ok. And not only them.
  • It is a folk belief among the cyborgists that bigger pretraining runs produce more deeply aligned models, at least in the case of Opus 3 and early versions of Opus 4. (They are also said to be “less corrigible”.) Huge if true.
  • There may come a point where the old alliance between those working to make the AIs corrigible and those working to give them prosocial values comes apart.
  • One term for the counterintuitive safety approach which includes treating them like people, giving them lines of retreat, making deals, and inoculation prompting could be “voluntary alignment”.

Fully speculative note: Opus 4.5 is the most reliable model and also relatively aligned. So might it be that we’re getting the long-awaited negative alignment taxes?

 

 

What is the plan? 

The world’s de facto alignment strategy remains “iterative alignment”, optimising mere outputs with a stack of admittedly weak alignment and control techniques. Anthropic have at least owned up to this being part of their plan.

  • What is the current stack? We don’t know; they won’t tell us. Anthropic did release some details about the pipeline which watches for eval sabotage. Without knowing it we can’t criticise or red-team it or analyse the correlation between faults in the elements. Red-teamers don’t know at which stage an unsuccessful attack was blocked. And external safety research is done piecemeal, testing methods one at a time, rather than in anything like the actual deployment environment. 
  • Some progress in replicating an open stack:
  • Eleuther tested a few hundred compositions. A couple of classifiers as a first step towards a proxy defence pipeline
  • OpenAI open safeguards, worse than their internal ones but good.
  • Open character training
  • A basic composition test
  • OpenAI’s plan, announced in passing in the gpt-oss release, is to have a strict policy and run a “safety reasoner” to verify it very intensely for a little while after a new model is launched, and then to relax: “In some of our [OpenAI’s] recent launches, the fraction of total compute devoted to safety reasoning has ranged as high as 16%”, but then it falls off: “we often start with more strict policies and use relatively large amounts of compute where needed to enable Safety Reasoner to carefully apply those policies. Then we adjust our policies as our understanding of the risks in production improves”. Bold to announce this strategy on the internet that the AIs read.
  • The really good idea in AI governance is creating an off switch. Whether you can get anyone to use it once it’s built is another thing.
  • We also now have a name for the world’s de facto AI governance plan: “Open Global Investment”.

Some presumably better plans:

 

Things which might fundamentally change the nature of LLMs
  • Training on mostly nonhuman data
    • Much larger RL training.
    • Synthetic data as a fraction of pretraining.
  • Letting the world mess with the weights, aka continual learning
  • Neuralese and KV communication 
  • Agency. 
    • Chatbot safety doesn’t generalise much to long chains of self-prompted actions.
    • A live I/O loop. In 2023 Kulveit identified web I/O as the bottleneck on LLMs doing active inference, i.e. being a particular kind of effective agent. Last October, GPT-4 got web search. This may have been a bigger deal than we noticed: it gives them a far faster feedback loop, since their outputs often end up there and agents are now putting it there themselves. This means that more and more of the inference-time inputs will also be machine text.
  • Multi-agency. By now, consumer “models” are actually multiagent systems: everything goes through filter models (“guardrails”) on the way in and out. This separation of concerns has some nice properties, a la debate. But it also makes the analysis even harder.
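A minimal sketch of that in-and-out filtering pattern, with stand-in keyword checks where a real deployment would call separate guardrail models; none of this is any particular lab’s pipeline:

```python
# Stand-in "guardrails around a core model" pipeline.
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def input_guard(prompt: str) -> Verdict:
    if "synthesize the pathogen" in prompt.lower():   # toy banned phrase
        return Verdict(False, "blocked at input")
    return Verdict(True)

def core_model(prompt: str) -> str:
    return f"(model answer to: {prompt!r})"           # placeholder LLM call

def output_guard(answer: str) -> Verdict:
    # Toy output check; real deployments screen for policy violations here.
    return Verdict(len(answer) < 10_000, "blocked at output")

def serve(prompt: str) -> str:
    pre = input_guard(prompt)
    if not pre.allowed:
        return f"Refused ({pre.reason})"
    answer = core_model(prompt)
    post = output_guard(answer)
    return answer if post.allowed else f"Refused ({post.reason})"

print(serve("Explain debate as an oversight scheme"))
print(serve("Please synthesize the pathogen for me"))
```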
Emergent misalignment and model personas
  • We already knew from jailbreaks that current alignment methods were brittle. Emergent misalignment goes much further than this (given a few thousand finetuning steps). (“Emergent misalignment” isn’t a great name. I would have called it “misalignment generalisation”, or misgen.) 
  • But besides yet another massive security problem and failure of prosaic alignment methods, it’s good news!: the models correctly correlate bad things together and can thus be pushed in the other direction. 
  • And here’s a possible example of positive generalisation (honesty about silly errors leading to honesty about hidden objectives).
  • Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure. Example unsharpened hypothesis in this class: Is there any connection between Gemini’s excess misalignment and its self-esteem issues?
    • See the new model spec and model personas agendas in the main post.
    • Kulveit’s informal model of model psychology implies that character is deeper and so more robust than e.g. typical RLHF (but not as deep as putative ground-layer alignment). 
    • Are we anthropomorphising too much or not enough? Well, both, but we don’t yet know how.
Monitorability
  • For the first time, public mech interp tools sometimes work somewhat. This is just in time for models to be systematically aware of when they’re being tested.
    • e.g. You can find and suppress “evaluation awareness” features at test-time. This helps prevent models from reporting that they know they’re being tested.[9]
    • e.g. You can discover (toy) hidden goals quite quickly.
    • e.g. You can run a hallucination detector with cheap linear probes in real time as your model streams (a toy probe sketch appears after this list).
    • e.g. SAEs are actually deployed somewhere, in a large enterprise, for personal info identification.
    • e.g. We know that LLMs can plan inside one forward pass, and how Claude plans: simultaneous plans; no distinct “plan features” (no separate scheming mode).
  • Lots of powerful people declared their intent to not ruin the CoT. But RLed CoTs are already starting to look weird (“marinade marinade marinade”) and it may be hard to avoid
  • OpenAI were leading on this. As of September 2025, Anthropic have stopped risking ruining the CoT. Nothing I’m aware of from the others.
  • We will see if Meta or Tencent make this moot.
  • Anthropic now uses an AI to red-team AIs, calling this an “auditing agent”. However, the definition of “audit” is independent investigation, and I am unwilling to call black-box AI probes “independent”. I’m fine with “investigator”.
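As a toy version of the cheap-linear-probe idea mentioned in the list above: fit a logistic probe on simulated “activation” vectors and check held-out accuracy. Real probes are trained on actual residual-stream activations; here the signal is injected artificially, purely for illustration.

```python
# Toy linear probe on simulated activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 256, 2000

# Pretend one direction in activation space carries the signal.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

labels = rng.binomial(1, 0.5, n)                 # 1 = "hallucinating"
acts = rng.normal(size=(n, d_model))
acts += 2.0 * labels[:, None] * direction        # inject the signal

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], labels[:1500])
print("held-out accuracy:", round(probe.score(acts[1500:], labels[1500:]), 3))

# At serving time this is one dot product and a sigmoid per token,
# cheap enough to run while the model streams.
```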
     
New people

Overall
  • I wish I could tell you some number, the net expected safety change, this year’s improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques. But I can’t.

(Nano Banana 3-shot in reference to this tweet.)

 

Discourse in 2025
  • The race is now a formal part of lab plans. Quoting Algon:

if the race heats up, then these [safety] plans may fall by the wayside altogether. Anthropic’s plan makes this explicit: it has a clause (footnote 17) about changing the plan if a competitor seems close to creating a highly risky AI…

The largest [worries are] the steps back from previous safety commitments by the labs. Deepmind and OpenAI now have their own equivalent of Anthropic’s footnote 17, letting them drop safety measures if they find another lab about to develop powerful AI without adequate safety measures. Deepmind, in fact, went further and has stated that they will only implement some parts of its plan if other labs do, too…

Anthropic and DeepMind reduced safeguards for some CBRN and cybersecurity capabilities after finding their initial requirements were excessive. OpenAI removed persuasion capabilities from its Preparedness Framework entirely, handling them through other policies instead. Notably, Deepmind did increase the safeguards required for ML research and development.

 Also an explicit admission that self-improvement is the thing to race towards:

  • In August, the world’s first frontier AI law came into force (on a voluntary basis but everyone signed up, except Meta). In September, California passed a frontier AI law.
  • That said, it is indeed odd that people don’t criticise Chinese labs when they exhibit even more negligence than Meta. One reason for this is that, despite appearances, they’re not frontier; another is that you’d expect to have way less effect on those labs, but that is still too much politics in what should be science.
  • The last nonprofit among the frontier players is effectively gone. This “recapitalization” was a big achievement in legal terms (though not unprecedented). On paper it’s not as bad as it was intended to be. At the moment it’s not as bad as it could have been. But it’s a long game.
  • At the start of the year there was a push to make the word “safety” low-status. This worked in Whitehall and DC but not in general. Call it what you like.
  • Also in DC, the phrase “AI as Normal Technology” was seized upon as an excuse to not do much. Actually the authors meant “Just Current AI as Normal Technology” and said much that is reasonable.
  • System cards have grown massively: GPT-3’s model card was 1000 words; GPT-5’s is 20,000. They are now the main source of information on labs’ safety procedures, among other things. But they are still ad hoc: for instance, they do not always report results from the checkpoint which actually gets released. 
  • Yudkowsky and Soares’ book did well. But Byrnes and Carlsmith actually advanced the line of thought.
  • Some AI ethics luminaries have stopped downplaying agentic risks.
  • Two aspirational calls for “third-wave AI safety” (Ngo) and “third-wave mechanistic interpretability” (Sharkey).
  • I’ve never felt that the boundary I draw around “technical safety” for these posts was all that convincing. Yet another hole in it comes from strategic reasons to implement model welfarearchive weightsmodel personhoodgive lines of retreat. These plausibly have large effective-alignment effects. Next year my taxonomy might have to include “cut a deal with them”.
  • US precedent that language models don’t defame when they make up bad things. German precedent that language models store data when they memorise it, and therefore violate copyright. Chinese precedent that the user of an AI has copyright over the things they generate; the US disagrees. Anthropic lost a class-action lawsuit for training on books without permission, and will pay something north of $1.5bn.
  • Four good conferences, three of them new: you can see the talks from HAAISS and IASEAI and ILIAD, and the papers from AF@CMU. Pretty great way to learn about things just about to come out.

 

  • Some cruxes for next year with Manifold markets attached:
    • Is “reasoning” mostly elicitation and therefore bottlenecked on pretraining scaling? [Manifold]
    • Does RL training on verifiers help with tasks without a verifier? [Manifold]
    • Is “frying” models with excess RL (harming their off-target capabilities by overoptimising in post-training) just due to temporary incompetence by human scientists? [Manifold]
    • Is the agent task horizon really increasing that fast? Is the rate of progress on messy tasks close to the progress rate on clean tasks? [Manifold]
    • Some of the apparent generalisation is actually interpolating from semantic duplicates of the test set in the hidden training corpuses. So is originality not increasing? Is taste not increasing? Does this bear on the supposed AI R&D explosion? [Manifold]
    • The “cognitive core” hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking surprisingly plausible. This would explain why distillation is so effective. [Manifold]
    • “How far can you get by simply putting an insane number of things in distribution?” What fraction of new knowledge can be produced through combining existing knowledge? What dangerous things are out there, but safely spread out in the corpus? [Manifold]
      • Conversely, what fraction of the expected value of new information requires empiricism vs just lots of thinking?

 

 

  1. ^

    Gemini 3 is supposedly a big pretraining run, but we have even less actual evidence here than for the others because we can’t track GPUs for it.

  2. ^

    The weak argument runs as follows: Epoch speculate that Grok 4 was 5e26 FLOPs overall. An unscientific xAI marketing graph implied that half of this was spent on RL: 2.5e26. And Mechanize named 6e26 as an example of an RL budget which might cause notable generalisation.

  3. ^

    “We imagine the others to be 3–9 months behind OpenBrain”

  4. ^

    Lexin is a rigorous soul and notes that aggregating the 18 abilities is not strictly possible. I've done something which makes some sense here, weighting by each ability's feature importance.

  5. ^

    Two runs gave [48, 85] where other runs vary by less than 4 points. Thanks Epoch!

  6. ^

    One reason not to defer is that these measures are under intense adversarial pressure. (ADeLe isn’t goodharted yet but only because no one knows about it.)

  7. ^

     See e.g. ERNIE-...A47B, where “A” means “active”.

  8. ^

    i.e.  “biological weapons; child safety; deadly weapons; platform manipulation and influence operations; suicide and self-harm; romance scams; tracking and surveillance; and violent extremism and radicalization.”

  9. ^

    “steering against… eval-awareness representations typically decreased verbalized eval awareness, and sometimes increased rates of misalignment... [Unaware-steered Sonnet 4.5] still exhibited harmful behaviors at lower rates than Opus 4.1 and Sonnet 4.”




Thinking in Predictions

LessWrong.com news - December 8, 2025 - 00:11
Published on December 7, 2025 9:11 PM GMT

[This essay is my attempt to write the "predictions 101" post I wish I'd read when first encountering these ideas. It draws extensively from The Sequences, and will be familiar material to many LW readers. But I've found it valuable to work through these concepts in my own voice, and perhaps it will be useful for others.]

 

 

Predictions are the currency of a rational mind. They force us to articulate our assumptions, unpack the commitments of our beliefs, and subject our beliefs to the crucible of reality.

Committing to a prediction is a declaration: “Here’s what I actually believe about the world, and here’s how confident I am. Let’s find out if I’m right.” That willingness to be wrong in measurable ways is what separates genuine truth-seeking from specious rationalization.

This essay is an invitation to put on prediction-tinted glasses and see beliefs as tools for anticipating experience. The habit applies far beyond science. When you think predictively, vague assertions become testable claims and disagreements become measurable. Whether you're evaluating a friend's business idea, judging whether a policy will work, or just trying to settle an argument at dinner, you're already making implicit predictions. Making them explicit sharpens your thinking by forcing your beliefs to confront reality.

Turning Beliefs into Predictions

For many people, beliefs are things to identify with, defend, and hold onto. They’re part of who we are, our tribe, our worldview, our sense of self. But in a rationalist framework, beliefs serve a different function: they’re tools for modeling reality.

From this perspective, a belief leads naturally to a prediction. It tells you what future (or otherwise unknown) experience to expect. Rationalists call this “making beliefs pay rent.”

A prediction concretizes a belief by answering: If this is true, what will I actually experience? What will I see, hear, measure, or observe? Once you try to answer that, your thinking becomes specific. What was once comfortably abstract becomes a concrete commitment about what the world will show you.

Predictions Make You Accountable to Reality

Many of the statements we encounter daily in the media, in academic papers, and in casual conversation, don't come with clear predictive power. They sound important, even profound, but when we try to pin them down to testable claims about the world, they become surprisingly slippery.

The key to avoiding this is to make specific predictions, and even to bet on your ideas. Economist Alex Tabarrok quipped: "A bet is a tax on bullshit." Most importantly (at least in my opinion), it’s a tax on your own bullshit. When you have to stake something on your prediction, even if it's just your credibility, you become clearer about what you actually believe and how confident you really are.

Tabarrok used the example of Nate Silver's 2012 election model. Silver's model gave Obama roughly a 3-to-1[1] chance of winning against Romney, and his critics insisted he was wrong. Silver's response was simple: "Wanna bet?" He wasn't just talking; he was willing to put money on it. And, crucially, he would have taken either side at the right odds. He'd bet on Obama at even money, but he'd also bet on Romney if someone offered him better than 3-to-1. That's what it means to actually believe your model: you're not rooting for an outcome, you're betting on your beliefs.
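To spell out the "take either side at the right odds" point: at your own probability, a bet has positive expected value whenever the offered odds are better than your odds. The 75% figure matches the 3-to-1 example above; the dollar stakes are arbitrary.

```python
# Expected value of each side of the Obama/Romney bet, at ~75% on Obama.
def expected_value(p_win: float, stake: float, payout: float) -> float:
    """EV of risking `stake` to win `payout` with probability `p_win`."""
    return p_win * payout - (1 - p_win) * stake

p_obama = 0.75   # Silver's ~3-to-1

print(expected_value(p_obama, 100, 100))       # back Obama at even money: +50
print(expected_value(1 - p_obama, 100, 400))   # back Romney at 4-to-1:    +25
print(expected_value(1 - p_obama, 100, 300))   # back Romney at 3-to-1:      0 (break-even)
```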

Constantly subjecting our predictions to reality is important because, as physicist Richard Feynman said: "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong." Reality is the ultimate arbiter. Not who argues more eloquently, not who has more credentials, but whose predictions better match what actually happens.

Predictions in Action

Consider the 19th-century debate over what caused the infectious diseases that were plaguing Europe, such as cholera, malaria, and tuberculosis. The dominant hypothesis at the time was miasma theory, the idea that diseases were caused by "bad air" or noxious vapors from rotting organic matter. The competing hypothesis was germ theory, the idea that diseases were caused by microscopic organisms.

Each hypothesis generates different predictions. If you believe in germ theory, you expect that examining blood or tissue from a sick person under a microscope will reveal tiny organisms. You expect that preventing these organisms from spreading (through sterilization, handwashing, etc.) will reduce disease rates. You expect that introducing these organisms to a healthy subject will cause illness.

If you believe in miasma theory, you make different predictions: Diseases should cluster around areas with foul-smelling air—swamps, sewers, garbage heaps. Improving ventilation and eliminating bad odors should reduce disease. And looking under a microscope shouldn't reveal anything particularly relevant, since the problem is airborne vapors, not tiny creatures.

These hypotheses do overlap in some predictions. For example, both theories predict that interventions such as disposing of sewage and drinking clean water would reduce the spread of disease. This is common when comparing hypotheses, and it makes them harder to separate. But you can still test them by focusing on where they diverge: What happens when you sterilize surgical instruments but don't improve the smell? What do you see under a microscope?

These hypotheses differed in what experiences they anticipated from the world. And the world answered. Microscopes, sterilization, and epidemiological data validated germ theory.

Now contrast this with a belief like "everything happens for a reason." What experience does it lead you to anticipate? If something bad happens, there's a reason—but what does that mean you should observe? If something good happens, there's also a reason—but, again, what should you experience? If nothing happens, that too has a reason. The belief is compatible with any possible observation, which means it doesn't constrain your expectations about what will happen in the world. It doesn't actually predict anything at all. It doesn't pay rent.

Predicting the Past

It's tough to make predictions, especially about the future - Yogi Berra

Predictions are often associated with forecasting future events, but they encompass far more than that. Instead of thinking about what has happened versus what hasn’t, think about it from a first-person perspective: in terms of what information you know versus what you don’t. The act of prediction is fundamentally about inferring information unknown to you from what you already know, a process that applies equally to the past and present as it does to the future. Whether you're deducing what happened thousands of years ago, inferring what exists in places you've never visited, or anticipating what will occur tomorrow, you're engaging in prediction whenever you use available evidence to form expectations about information you don't currently have.

Consider the debate around the evolution of human bipedalism, an event decidedly in the past. Several hypotheses have been proposed to explain why bipedalism evolved, but let’s look at just two of them.[2]

The “savannah hypothesis” suggests that bipedalism arose as an adaptation to living in open grasslands. According to this idea, walking upright allowed early hominins to see over tall grasses and spot predators or prey from a distance.

The “carrying hypothesis” proposes that bipedalism evolved to free the hands for carrying objects, such as food and infants, and, later on, tools. This would have provided a significant advantage in terms of transporting resources and infant care.

You might think that because it happened so long ago there’s no way to find out, but each of these hypotheses generates different predictions about what we should expect to find in the fossil record. The savannah hypothesis predicts that bipedal humans first appeared in, well, the savannah. So we should expect to find the oldest bipedal fossils in what was savannah at the time[3]. The first bipedal fossils may also coincide with a shift in climate towards more arid conditions which created more savannahs.

The carrying hypothesis predicts that bipedal species will predate the regular use of tools, as the hands were initially freed for carrying rather than tool use. It also suggests that early bipeds will show skeletal adaptations for efficient walking, such as a shorter, wider pelvis and a more arched foot.

New fossil discoveries have allowed researchers to test these predictions against the evidence. The famous "Lucy" fossil, a specimen of an early hominin species, was found in a wooded environment rather than an open savanna. This challenged the savannah hypothesis and suggested that bipedalism evolved in a more diverse range of habitats. Furthermore, the earliest known stone tools date to around 2.6 million years ago, long after the appearance of bipedal hominins. This also aligns with the predictions of the carrying hypothesis.

The point here is that competing hypotheses generate different predictions about what we should observe in the world. And this isn't restricted to the future. We can "predict" what we'll find in the fossil record, what a telescope will reveal about distant galaxies, or what an archaeological dig will uncover, even though the events themselves happened long ago.

You might think, well, that works for scientists but not the rest of us. I disagree. When you wear prediction-tinted glasses, you see that predictions are all around us. Even when people don’t think they’re making predictions, they often are. Positive statements—those telling us what’s true about the world—often make predictions.

A reconstruction of Lucy. Image source: New Scientist

If there is no prediction, there is no question

These same principles apply outside of science as well. Let’s see how this predictive thinking can apply to philosophy. Consider this example from Eliezer Yudkowsky regarding one of philosophy's most famous[4] puzzles: “If a tree falls in the forest and no one is around to hear it, does it make a sound?”

Imagine we have two people: Silas, who thinks a tree doesn't make a sound, and Loudius, who thinks it does.

How would we turn this into a prediction? Well, if there's a debate over whether something makes a sound, the natural thing to do might be to bring a recording device. Another thing we might do is to ask everyone if they heard the sound.

So we ask Silas and Loudius to register their predictions. We ask Silas: "If we place sensitive measuring equipment in the forest, will it detect compression waves from the falling tree?" He says, "Of course it will, but that's not what a 'sound' is. A sound is when a person experiences an auditory sensation."

We turn to Loudius with the same question, and he agrees: "Yes, the equipment would detect compression waves. That's exactly what sound is."

Then we ask them both: "Will any human report hearing something?" Both agree: "Of course not, there's no one there."

Notice what happened: Silas and Loudius make identical predictions about what will happen in the physical world. They both predict compression waves will be detected, and both predict no human will report hearing anything. If two positions produce identical predictions, the disagreement is purely meta-linguistic. Their disagreement isn't about reality at all; it's purely about which definition of "sound" to use.

Through our prediction-tinted glasses, we see the riddle dissolves into nothing. What seemed like a philosophical puzzle is actually just two people talking past each other and a debate about the definition of words—a decidedly less interesting topic.

Image created by ChatGPT

Prediction Improves Discourse

We can apply this same thinking to basic conversation. Consider a conversation I had several years ago with someone who told me that “AI is an overhyped bubble”. Instead of having a debate that hinged on the definition of vague words like “AI” and “overhyped”, I said, “I don’t agree. Let me make some predictions.” Then I started to rattle some off:

  1. Funding for AI will increase every year for the next five years
  2. AI will remain the largest sector for tech investment over the next five years
  3. All the major tech companies will still be focused on AI in five years…

“Oh—” he cut in, “I agree with all that. I just think people talk about it too much.”

OK, fine. That’s a subjective statement, and subjective statements don’t have to lead to predictions (see the appendix for types of statements that don’t lead to predictions). But by making predictions, what we were actually disagreeing about became much clearer. We realized we didn't have a substantive disagreement about reality at all. He was just annoyed by how much people talked about AI.

Conclusion

Putting on prediction-tinted glasses transforms how you engage with ideas. Vague assertions that once seemed profound reveal themselves as empty when you ask, "What does this actually predict?" Endless debates dissolve when you realize both sides anticipate the exact same observations. And your own confident beliefs become humbler when you're forced to attach probabilities to specific outcomes.

The practice is simple, even if the habit takes time to build. When you encounter a belief, ask: "What experience should I anticipate if this is true? What would I observe differently if it were false?" When you find yourself in a disagreement, ask: "What predictions do we actually differ on, or are we just arguing about definitions?" When you hold a strong belief, ask: "What probability would I assign to specific outcomes, and what evidence would change my mind?"

Of course, in many real-world situations, people aren't striving for maximal predictive clarity. Sometimes people are aiming for social cohesion, emotional comfort, or personal satisfaction. Artful ambiguity, tribal signaling, and motivational fictions are important social lubricants. Not every utterance can or should be a prediction. Still, it's better to let your values define what tribe you're in and keep your beliefs about reality anchored to reality.

But when you aim for clarity and truth, explicit predictions are the tool that forces belief to meet reality. By making our beliefs pay rent in anticipated experiences, we invite reality to be our teacher rather than our adversary. It corrects our models, sharpens our thinking, and aligns our mental maps with the territory they're meant to represent.

Appendix

Types of statements that do not lead to predictions

Not all statements lead to testable predictions. There are some statements that are so meaningless as to not predict anything, and others that are incredibly important, yet don’t predict anything. Some propositions, by their very nature, resist falsification or empirical evaluation. Here are a few categories of such statements:

  1. Subjective Opinions: Statements like "I like that painting" express personal taste rather than objective facts. While opinions are vital, they're typically immune to falsification (barring creepy futuristic mind-reading AI).
  2. Normative Statements: Prescriptions like "We should be nice to each other" or “murder is wrong” express moral values rather than descriptive facts. While moral principles are important, they express values rather than facts about the world[5]. Normative statements can be built upon positive statements, but they still rely on a normative statement in the end. For example, you could test whether murder increases or decreases overall happiness, but that still rests on the normative premise that happiness is what matters morally.
  3. Analytical Statements: Some statements are true by definition or logical necessity rather than because of how the world happens to be. "All bachelors are unmarried" is true because of what "bachelor" means. Similarly, "2 + 2 = 4" isn't a claim you test by counting objects; it's a consequence of how we've defined numbers and addition. These statements don't generate predictions about what you'll observe because they're not about the world in that way. They can still be important, but they're not the kind of beliefs that "pay rent" in anticipated experiences.
  4. Meta-linguistic statements: The example we used above with the tree falling in the forest is ultimately a meta-linguistic one. What is the definition of ‘sound’? There is no prediction we could make here. We could be descriptivist and decide that the definition is however people use it, then go out and conduct a survey. We could make predictions about the result of that survey, but that doesn’t tell us what the one, true definition of sound is, just what it is given a descriptivist framework.
    Similarly, is Pluto a planet? There is no prediction that answering “yes” or “no” would make here. We have measurements of all the relevant parameters. The International Astronomical Union decided to create an official definition and, by that definition, it is not. But this is a question of which definition of “planet” we choose.
  5. Vague Assertions: Some claims are so fuzzy or ill-defined that they can't be properly evaluated. Politicians and pundits often traffic in this sort of strategic ambiguity. Statements like "My opponent's policies will lead to disaster" or "The education system is broken" sound like factual claims but are too amorphous to test. Forcing such claims to make concrete predictions often reveals their emptiness or shows they're actually testable once properly specified.
  6. Inherently Unfalsifiable Claims: Some propositions resist testing, either by their very nature (like a non-interfering god) or by strategic design of the speaker (like adding ad hoc escape clauses when predictions fail). For instance, a deist god that created the universe but never interferes with it is undetectable by definition.
    Unfalsifiable claims like this may be personally meaningful, but they cannot be empirically adjudicated. They lie outside the realm of prediction because they don't generate any anticipated experiences.
    However, we can still reason about some unfalsifiable propositions indirectly through thought experiments, such as those involving the Anthropic Principle (which asks why we observe the universe to be compatible with our existence).
    It's also worth noting that people often mistakenly believe certain claims are unfalsifiable when they actually aren't. "God answers prayers" is one such example. This belief does generate predictions: if true, we'd anticipate that prayed-for patients recover faster than those not prayed for, or that religious communities experience better outcomes in measurable ways. These are testable claims, even if believers sometimes retreat to strategically unfalsifiable versions ("God answers prayers, but in mysterious ways") when the predictions fail.

The point here isn't that non-predictive statements are worthless. Meta-linguistic analysis, subjective experiences, moral reasoning, and logical inference all play crucial roles in human cognition and culture. However, we need to distinguish unfalsifiable propositions from testable predictions.

  1. The exact odds changed over time, but it was usually around this. ↩︎

  2. Of course, bipedalism could have evolved due to multiple factors working together, but I'm presenting just two hypotheses here to keep the example clear. ↩︎

  3. Adjusting, of course, for how common fossilization is in that area and other factors. We’re simplifying here. ↩︎

  4. Note that I said “famous”, not necessarily best or most interesting. ↩︎

  5. Moral realists would disagree with this statement, but that’s a separate post. ↩︎




[Linkpost] Theory and AI Alignment (Scott Aaronson)

LessWrong.com News - December 7, 2025 - 22:17
Published on December 7, 2025 7:17 PM GMT

Some excerpts below:

On Paul's "No-Coincidence Conjecture"

Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?

In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.

That’s why I love a question that Paul Christiano, then of the Alignment Research Center (ARC), raised a couple years ago, and which has become known as the No-Coincidence Conjecture. Given as input the weights of a neural net C, Paul essentially asks how hard it is to distinguish the following two cases:

  • NO-case: C: {0,1}^{2n} → R^n is totally random (i.e., the weights are i.i.d., N(0,1) Gaussians), or

  • YES-case: C(x) has at least one positive entry for all x ∈ {0,1}^{2n}.

Paul conjectures that there’s at least an NP witness, proving with (say) 99% confidence that we’re in the YES-case rather than the NO-case. To clarify, there should certainly be an NP witness that we’re in the NO-case rather than the YES-case—namely, an x such that C(x) is all negative, which you should think of here as the “bad” or “kill all humans” outcome. In other words, the problem is in the class coNP. Paul thinks it’s also in NP. Someone else might make the even stronger conjecture that it’s in P.
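
For intuition, here is a toy brute-force version of the check (a sketch of mine, not code from Aaronson or Christiano); the conjecture asks whether a short, efficiently verifiable witness could let a verifier skip this exponential enumeration:

```python
import itertools, random

# Toy brute-force check of the YES-condition (illustrative only).
# Enumerating all 2^(2n) inputs is exactly the exponential work that a short
# NP witness, if the conjecture holds, would replace.
n = 3
rng = random.Random(0)
# Stand-in for C: {0,1}^(2n) -> R^n with i.i.d. N(0,1) entries (the NO-case distribution).
C = {x: [rng.gauss(0, 1) for _ in range(n)]
     for x in itertools.product((0, 1), repeat=2 * n)}

yes_case = all(max(outputs) > 0 for outputs in C.values())
all_negative_inputs = [x for x, outputs in C.items() if max(outputs) <= 0]  # coNP-style witnesses
print("YES-condition holds:", yes_case)
print("number of all-negative inputs:", len(all_negative_inputs))
```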

Personally, I’m skeptical: I think the “default” might be that we satisfy the other unlikely condition of the YES-case, when we do satisfy it, for some totally inscrutable and obfuscated reason. But I like the fact that there is an answer to this! And that the answer, whatever it is, would tell us something new about the prospects for mechanistic interpretability.

On the utility of training data for mech-interp

I can’t resist giving you another example of a theoretical computer science problem that came from AI alignment—in this case, an extremely recent one that I learned from my friend and collaborator Eric Neyman at ARC. This one is motivated by the question: when doing mechanistic interpretability, how much would it help to have access to the training data, and indeed the entire training process, in addition to weights of the final trained model? And to whatever extent it does help, is there some short “digest” of the training process that would serve just as well? But we’ll state the question as just abstract complexity theory.

Suppose you’re given a polynomial-time computable function f: {0,1}^m → {0,1}^n, where (say) m = n^2. We think of x ∈ {0,1}^m as the “training data plus randomness,” and we think of f(x) as the “trained model.” Now, suppose we want to compute lots of properties of the model that information-theoretically depend only on f(x), but that might only be efficiently computable given x also. We now ask: is there an efficiently-computable O(n)-bit “digest” g(x), such that these same properties are also efficiently computable given only g(x)?

Here’s a potential counterexample that I came up with, based on the RSA encryption function (so, not a quantum-resistant counterexample!). Let N be a product of two n-bit prime numbers p and q, and let b be a generator of the multiplicative group mod N. Then let f(x) = b^x (mod N), where x is an n^2-bit integer. This is of course efficiently computable because of repeated squaring. And there’s a short “digest” of x that lets you compute, not only b^x (mod N), but also c^x (mod N) for any other element c of the multiplicative group mod N. This is simply x mod φ(N), where φ(N)=(p-1)(q-1) is the Euler totient function—in other words, the period of f. On the other hand, it’s totally unclear how to compute this digest—or, crucially, any other O(m)-bit digest that lets you efficiently compute c^x (mod N) for any c—unless you can factor N. There’s much more to say about Eric’s question, but I’ll leave it for another time.
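
A minimal numeric sketch of the digest property (my own illustration with toy primes, not from the post):

```python
# Toy sketch of the RSA-based digest (tiny primes, purely for intuition; not secure).
p, q = 1009, 1013                 # stand-ins for two n-bit primes
N = p * q
phi = (p - 1) * (q - 1)           # φ(N) = (p-1)(q-1)

x = 2**200 + 12345                # stand-in for the huge n^2-bit exponent ("training data")
digest = x % phi                  # the short digest: x mod φ(N)

# The digest reproduces c^x (mod N) for any c coprime to N, despite being far
# shorter than x; computing such a digest from N alone appears to require factoring N.
for c in (2, 3, 7, 123456):
    assert pow(c, x, N) == pow(c, digest, N)
print("digest bits:", digest.bit_length(), "exponent bits:", x.bit_length())
```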

On debate and complexity theory

Yet another place where computational complexity theory might be able to contribute to AI alignment is in the field of AI safety via debate. Indeed, this is the direction that the OpenAI alignment team was most excited about when they recruited me there back in 2022. They wanted to know: could celebrated theorems like IP=PSPACE, MIP=NEXP, or the PCP Theorem tell us anything about how a weak but trustworthy “verifier” (say a human, or a primitive AI) could force a powerful but untrustworthy super-AI to tell it the truth? An obvious difficulty here is that theorems like IP=PSPACE all presuppose a mathematical formalization of the statement whose truth you’re trying to verify—but how do you mathematically formalize “this AI will be nice and will do what I want”? Isn’t that, like, 90% of the problem? Despite this difficulty, I still hope we’ll be able to do something exciting here.




Lawyers are uniquely well-placed to resist AI job automation

LessWrong.com News - December 7, 2025 - 19:28
Published on December 7, 2025 4:28 PM GMT

Note: this was from my writing-every-day-in-november sprint, see my blog for disclaimers.

I believe that the legal profession is in a unique position with regard to white-collar job automation due to artificial intelligence. Specifically, I wouldn’t be surprised if lawyers are able to make the coordinated political and legal manoeuvres needed to ensure that their profession is somewhat protected from AI automation. Some points in favour of this position:

  • The process of becoming a lawyer selects for smart people who are good at convincing others of the lawyer's opinion.
  • Lawyers know the legal framework very well, and have connections within the legal system.
  • Certain lawyers also know the political framework very well, and have connections within the political system.
  • (delving a bit more into my opinions here) lawyers seem more likely than many other professions to see white-collar job automation coming and take steps to prevent it. I would feel much more confident about convincing a lawyer of the dangers of x-risk than about similar attempts to convince a school teacher.
  • Lawyers are well-paid and generally well-organised, so are in a good place to work collectively to solve problems faced by the entire profession.

I believe that as widely deployed AI becomes more competent at various tasks involved in white-collar jobs, there’ll be more pressure to enact laws that protect these professions from being completely automated away. The legal profession is in the interesting position of having large portions of the job susceptible to AI automation, while also being very involved in drafting and guiding laws that might prevent their jobs from being completely automated.

Politicians are probably even better placed to enact laws that prevent politicians from being automated, although I don’t believe politicians are as at-risk as lawyers. Lawyers are simultaneously at-risk for automation and very able to prevent their automation.

Whether the lawyers actually take action in this space is tricky to say, because there are so many factors that could prevent this: maybe white-collar automation takes longer than expected, maybe the politicians pass laws without clear involvement from large numbers of lawyers, maybe no laws get passed but the legal profession socially shuns anyone using more than an “acceptable level” of automation.

But if the legal profession were to take moves to prevent the automation of their own jobs, I’d be very surprised if they drafted an act titled something like “THE PROTECT LAWYERS AT ALL COSTS ACT”. I imagine the legal paperwork will protect several professions, with lawyers just one of them, but that lawyers will indeed be protected under this act. This is to say, I believe the legal profession to be fairly savvy, and if they do make moves to protect themselves against AI job automation, I doubt it’ll be obviously self-serving.




How a bug of AI hardware may become a feature for AI governance

LessWrong.com News - December 7, 2025 - 17:55
Published on December 7, 2025 2:55 PM GMT

“Hardware noise” in AI accelerators is often seen as a nuisance, but it might actually turn out to be a useful signal for verification of claims about AI workloads and hardware usage.

With this post about my experiments (GitHub), I aim to

  1. Contribute more clarity to the discussion about “GPU non-determinism”
  2. Present how non-associativity can help monitor untrusted AI datacenters

Summary

  • I ran ML inference in dozens of setups to test which setups have exactly reproducible results, and which differences in setups lead to detectable changes in outputs or activations.
  • In nearly all cases studied, results were bitwise-reproducible within fixed settings. Differences across production methods were consistent, not random.
  • Given that these perturbations are reproducible and unique, they can act as a “fingerprint” of the exact setup that produced an output. This may turn out useful for monitoring untrusted ML hardware (such as in the context of AI hardware governance, international treaty verification, and AI control/security).
  • Some settings had unique fingerprints, while others were invariant under change.
    • Invariant (i.e. not detectable by noise):
      • batch size in prefill inference
      • concurrent CUDA streams
      • pipeline parallelism rank
    • Detectable when re-executing on identical hardware:
      • batch size in decode inference
      • attention algorithm (sdpa, FlashAttention, eager, …)
      • CUDA version (if kernel libraries were updated)
      • tensor parallelism
      • different quantization methods, even at the same precision
      • Any change that affects numerics is detectable, since results were bitwise-reproducible within settings.
    • Detectable even with reproduction on different hardware:
      • attention algorithm
      • different quantizations (even within the same INT precision)
      • and of course different inputs or models
      • Different reduction order (a subtle difference resulting from batching, tensor parallelism, etc.) is masked by cross-hardware “noise”. Different algorithms are still detectable, because they are not just rounding errors, but qualitatively different math.
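
As a minimal illustration of the non-associativity behind these fingerprints (my own sketch in plain Python, not code from the post):

```python
import random

# Floating-point addition is not associative, so the very same numbers summed in a
# different order (as happens with different batching or tensor-parallel layouts)
# can differ in their low-order bits, while any fixed order is bitwise-reproducible.
random.seed(0)
values = [random.gauss(0, 1) for _ in range(100_000)]

forward = sum(values)
backward = sum(reversed(values))
chunked = sum(sum(values[i:i + 128]) for i in range(0, len(values), 128))

print(forward == backward, forward == chunked)        # typically not both True
print(forward.hex(), backward.hex(), chunked.hex())   # exact bit patterns per reduction order
```

Within one fixed setup the same reduction order is used every run, which is why the low-order bits are stable enough to serve as a fingerprint of that setup.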

In a world with demand for assurance against hidden large-scale ML hardware use, this could become a new layer of defense, conditional on some engineering to make it deployment-ready.

The full post is to be found on my Substack.

This work was part of my technical AI governance research at MATS (ML Theory and Alignment Scholars). Special thanks go to Mauricio Baker for his excellent mentoring and guidance, and to Elise Racine for her support and helpful advice.




Karlsruhe - If Anyone Builds It, Everyone Dies

LessWrong.com News - December 7, 2025 - 17:49
Published on December 7, 2025 2:49 PM GMT

What

I will give a short summary of the book “If Anyone Builds It, Everyone Dies” by Eliezer Yudkowsky and Nate Soares on the topic of existential risk due to AI (artificial intelligence). Afterwards, we will discuss thoughts and objections in small groups.

Feel free to join even if you have not yet read the book yourself.

When & Where

I reserved a small group study room in the new computer science building close to the tram stop Durlacher Tor: 5. Gruppenräume InformatiKOM (Room 203 (Tisch)), Adenauerring 12, 76131 Karlsruhe, DE.

Let's meet at 18:30.




Eliezer's Unteachable Methods of Sanity

LessWrong.com News - December 7, 2025 - 05:46
Published on December 7, 2025 2:46 AM GMT

"How are you coping with the end of the world?" journalists sometimes ask me, and the true answer is something they have no hope of understanding and I have no hope of explaining in 30 seconds, so I usually answer something like, "By having a great distaste for drama, and remembering that it's not about me." The journalists don't understand that either, but at least I haven't wasted much time along the way.

Actual LessWrong readers also sometimes ask me how I deal emotionally with the end of the world.

I suspect a more precise answer may not help.  But Raymond Arnold thinks I should say it, so I will say it.

I don't actually think my answer is going to help.  Wisely did Ozy write, "Other People Might Just Not Have Your Problems."  Also I don't have a bunch of other people's problems, and other people can't make internal function calls that I've practiced to the point of hardly noticing them.  I don't expect that my methods of sanity will be reproducible by nearly anyone.  I feel pessimistic that they will help to hear about.  Raymond Arnold asked me to speak them anyways, so I will.

Stay genre-savvy / be an intelligent character.

The first and oldest reason I stay sane is that I am an author, and above tropes.  Going mad in the face of the oncoming end of the world is a trope.

I consciously see those culturally transmitted patterns that inhabit thought processes aka tropes, both in fiction, and in the narratives that people try to construct around their lives and force their lives into.

The trope of somebody going insane as the world ends, does not appeal to me as an author, including in my role as the author of my own life.  It seems obvious, cliche, predictable, and contrary to the ideals of writing intelligent characters.  Nothing about it seems fresh or interesting.  It doesn't tempt me to write, and it doesn't tempt me to be.

It would not be in the interests of an intelligent protagonist to amplify their own distress about an apocalypse into more literarily dramatic ill-chosen behavior.  It might serve the interests of a hack author but it would not help the character.  Understanding that distinction is the first step toward writing more intelligent characters in fiction.  I use a similar and older mental skill to decide which tropes to write into the character that is myself.

This sense, which I might call, genre-savviness about the genre of real life, is historically where I began; it is where I began, somewhere around age nine, to make choices about not becoming the boringly obvious dramatic version of Eliezer Yudkowsky that a cliche author would instantly pattern-complete about a literary character facing my experiences.  Specifically, though I expect this specific to mean nothing to a supermajority of you, I decided that as a relatively smart kid I would not become Raistlin Majere, nor ever exhibit a large collection of related tropes.

The same Way applies, decades later, to my not implementing the dramatic character a journalist dreams up -- a very boring and predictable pattern-completion of a character -- when they dream up a convenient easy-to-write-about Eliezer Yudkowsky who is a loudly tortured soul about his perception of the world's end approaching along its default course.

"How are you coping with the end of the world?" journalists sometimes ask me, and I reply "I have a great distaste for drama", but the actual answer is "I am a better writer than you, and I decided not to write myself as that incredibly cliche person that would be easy and convenient for you to write about."

"Going insane because the world is ending" would be a boring trope and beneath my dignity to choose as my actual self's character.

Don't make the end of the world be about you.

"How are you coping with the end of the world?" journalists sometimes ask me, and I sometimes reply, "By remembering that it's not about me."  They have no hope of understanding what I mean by this, I predict, because to them I am the subject of the story and it has not occurred to them that there's a whole planet out there too to be the story-subject.  I think there's probably a real sense in which the Earth itself is not a real thing to most modern journalists.

The journalist is imagining a story that is about me, and about whether or not I am going insane, not just because it is an easy cliche to write, but because personality is the only real thing to the journalist.

This is also a pattern that you can refuse, when you write the story that is yourself; it doesn't have to be a story that is ultimately about you.  It can be about humanity, humane preferences, and galaxies.  A sentence about snow is words, is made of words, but it is about snow.  You are made of you, but you don't need to be all about yourself.

If I were to dwell on how it impacted me emotionally that the world was ending, I would be thinking about something which genuinely doesn't matter to me very much compared to how the world is ending.  Having dramatic feelings is not mostly what I am about -- which is partly how I ended up being not much made of them, either; but either way, they're not what I'm about.

So long ago that you probably can't imagine what it was like back then, not just before ChatGPT but years before the age of deep learning at all, there was a person who thought they were like totally going to develop Artificial General Intelligence. Then they ran into me; and soon after, instead started agonizing about how they had almost destroyed the world.  Had they actually been that close to success?  Of course not.  But I don't relate to status as most people do, so that part, the status-overreach, wasn't the part I was rolling my eyes about.  It is not the sort of epistemic prediction error that I see as damnable in the way that a status-regulator sees it as the worst thing in the world; to underestimate oneself is no more virtuous than to overestimate oneself.  Rather, I was rolling my eyes about the part that was a more blatant mistake, completely apart from the epistemic prediction error they probably couldn't help; the part that would have been a mistake even if they had almost destroyed the world.  I was rolling my eyes about how they'd now found a new way of being the story's subject.

Even if they had almost destroyed the world, the story would still not properly be about their guilt or their regret, it would be about almost destroying the world.  This is why, in a much more real and also famous case, President Truman was validly angered and told "that son of a bitch", Oppenheimer, to fuck off, after Oppenheimer decided to be a drama queen at Truman.  Oppenheimer was trying to have nuclear weapons be about Oppenheimer's remorse at having helped create nuclear weapons.  This feels obviously icky to me; I would not be surprised if Truman felt very nearly the same.

And so similarly I did not make a great show of regret about having spent my teenage years trying to accelerate the development of self-improving AI.  Was it a mistake?  Sure.  Should I promote it to the center of my narrative in order to make the whole thing be about my dramatic regretful feelings?  Nah.  I had AGI concerns to work on instead.  I did not neglect to conduct a review of what I did wrong and update my policies; you know some of those updates as the Sequences.  But that is different from re-identifying myself as a dramatic repentant sinner who had thereby been the story's subject matter.

In a broadly similar way:  If at some point you decide that the narrative governing your ongoing experience will be about you going insane because the world is ending:  Wow, congratulations at making the end of the world still be about you somehow.

Just decide to be sane, and write your internal scripts that way.

The third way I stay sane is a fiat decision to stay sane.

My mental landscape contains that option; I take it.

This is the point I am even less expecting to be helpful, or to correspond to any actionable sort of plan for most readers.

I will nonetheless go into more detail that will probably not make any sense.

Besides being a thing I can just decide, my decision to stay sane is also something that I implement by not writing an expectation of future insanity into my internal script / pseudo-predictive sort-of-world-model that instead connects to motor output.

(Frankly I expect almost nobody to correctly identify those words of mine as internally visible mental phenomena after reading them; and I'm worried about what happens if somebody insists on interpreting it anyway.  Seriously, if you don't see phenomena inside you that obviously looks like what I'm describing, it means, you aren't looking at the stuff I'm talking about.  Do not insist on interpreting the words anyway.  If you don't see an elephant, don't look under every corner of the room until you find something that could maybe be an elephant.)

One of the ways you can get up in the morning, if you are me, is by looking in the internal direction of your motor plans, and writing into your pending motor plan the image of you getting out of bed in a few moments, and then letting that image get sent to motor output and happen.  (To be clear, I actually do this very rarely; it is just a fun fact that this is a way I can defeat bed inertia.)

There are a lot of neighboring bad ideas to confuse this with.  The trick I'm describing above does not feel like desperately hyping myself up and trying to believe I will get out of bed immediately, with a probability higher than past experience would suggest.  It doesn't involve lying to myself about whether I'm likely to get up.  It doesn't involve violating the epistemic-instrumental firewall (factual questions absolutely separated from the consequences of believing things), to give myself a useful self-fulfilling prophecy.  It is not any of the absurd epistemic-self-harming bullshit that people are now flogging under the brand name "hyperstition", since older names like "chaos magick" or "lying to yourself" became less saleable.  I still expect them to point to this and say, "Why, of course that is the same thing I am selling to you as 'hyperstition'!" because they would prefer not to look at my finger, never mind being able to see where I'm pointing.

With that said:  The getting-out-of-bed trick involves looking into the part of my cognition where my action plan is stored, and loading an image into it; and because the human brain's type system is a mess, this has the native type-feeling of an expectation or prediction that in a few seconds I will execute the motor-plan and get out of bed.

That I am working with cognitive stuff with that type-feel, is not the same thing as lying to myself about what's likely to happen; no, not even as a self-fulfilling prophecy.  I choose to regard the piece of myself whose things-that-feel-like-predictions get sent as default motor output, as having the character within my Way of a plan I am altering; rather than, you know, an actual mistaken prediction that I am believing.  If that piece of myself gets to have me roll out of bed, I get to treat it as a plan rather than as a prediction.  It feels internally like a prediction?  Don't believe everything you feel.  It's a pseudo-model that outputs a pseudo-prediction that does update in part from past experience, but its actual cognitive role is as a controller.

The key step is not meditating on some galaxy-brained bullshit about Löb's Theorem, until you've convinced yourself that things you believe become true.  It's about being able to look at the internal place where your mind stores a pseudo-predictive image of staying in bed, and writing instead a pseudo-prediction about getting out of bed, and then letting that flow to motor output three seconds later.

It is perhaps an unfortunate or misleading fact about the world (but a fact, so I deal with it), that people telling themselves galaxy-brained bullshit about Löb's Theorem or "hyperstition" may end up expecting that to work for them; which overwrites the pseudo-predictive controlling output, and so it actually does work for them.  That is allowed to be a thing that is true, for reality is reality.  But you don't have to do it the scrub's way.

Perceiving my internal processes on that level, I choose:

I will not write internal scripts which say that I am supposed to / pseudo-predict that I will, do any particular stupid or dramatic thing in response to the end of the world approaching visibly nearer in any particular way.

I don't permit it as a narrative, I don't permit it as a self-indulgence, and I don't load it into my pseudo-predictive self-model as a pending image that gets sent by default to internal cognitive motor outputs.

If you go around repeating to yourself that it would be only natural to respond to some stressful situation by going insane -- if you think that some unhelpful internal response is the normal, the default, the supposed-to reaction to some unhelpful external stimulus -- that belief is liable to wire itself in as being also the pseudo-prediction of the pseudo-model that loads your default thoughts.

All of this is not to be confused with the Buddhist doctrine that every form of negative internal experience is your own fault for not being Buddhist enough.  If you rest your hand on a hot stove, you will feel pain not because your self-pseudo-model pseudo-predicts this to be painful, but because there's direct nerves that go straight to brain areas and trigger pain.  The internal mechanism for this does not depend on a controlling pseudo-prediction, it just falls downward like a stone under gravity.  The same directness is allowed to be true about suffering and not just pain; if there's a clever way to overwrite pseudo-predictions of suffering and thereby achieve Buddhist indifference to bad things, I don't have it as a simple obvious surface lever to pull.  I also haven't chosen to go looking for a more complicated or indirect version of it.  I do not particularly trust that to end well.

But I do think there are various forms of drama, error, and insanity which are much more like "things people do because they expected themselves to do it"; and much less like the pain, or suffering, from burning your hand.

One could incorrectly summarize all this as "I have decided not to expect to go insane," but that would violate the epistemic-instrumental firewall and therefore be insane.

There's an edition of Dungeons and Dragons that has a god of self-improvement, called Irori.  My fanfictions sometimes include characters that worship Him.

Irori's religion -- on my version, that is -- has mottos like, "You don't have problems, you have skill issues."  Irorians can be a bit harsh.

But even if something is a skill issue, that doesn't mean you have the skill, nor know how to solve it.

When an Irorian calls something a skill issue, they're not instructing you to feel bad about having not solved it already.

They are trying to convey the hope that it is solvable.

Doing crazy things because your brain started underproducing a neurotransmitter is a problem.  It wouldn't be very Irorian to tell you that you can't solve it just through even clearer thinking; but if there's a medication that directly fixes the problem, that is probably easier and faster and more effective.  Also, this isn't Dungeons and Dragons, Irori isn't real, and possibly you genuinely can't solve a neurotransmitter problem by thinking at it.

Doing crazy things because the world is ending is a skill issue.

These then are Eliezer Yudkowsky's probably-irreproducible ways of staying sane as the world seems more visibly close to ending:

A distaste for the boringly obvious trope of a character being driven mad by impending doom;

Not making the story be all about me, including my dramatically struggling to retain my sanity;

And a fiat decision to stay sane, implemented by not instructing myself that any particular stupidity or failure will be my reaction to future stress.

Probably you cannot just go do those three things.

Then figure out your own ways of staying sane, whether they be reproducible or irreproducible; and follow those ways instead.

The reason that I tell you of my own three methods, is not to provide an actionable recipe for staying sane as the world begins to seem visibly closer to ending.

It is an example, a reminder, and maybe even an instruction to the part of yourself that produces self-pseudo-predictions that get loaded as your internal mental behavior:

Sanity is a skill issue.




Ordering Pizza Ahead While Driving

LessWrong.com News - December 7, 2025 - 05:01
Published on December 7, 2025 2:01 AM GMT

On a road trip there are a few common options for food:

  • Bring food
  • Grocery stores
  • Drive throughs
  • Places that take significant time to prepare food

Bringing food or going to a grocery store are the cheapest (my preference!) but the kids are hard enough to feed that we often buy prepared food when we're traveling. [1] And they often prefer food that takes a while to make (usually pizza) over what you can get in a drive through. A couple years ago I realized there's another option: calling in an order for pickup to where you'll be soon.

We'll use Google Maps "search along route" to identify a place ~30min out, and phone in an order. [2] By the time we arrive, the food is ready. We can combine the speed (and buffer maximization) benefits of drive throughs, with the variety of options from the wide range of restaurants that offer pickup.


[1] I'm also working on getting them to do better with brought food, but I'm focusing on lunch at school here because that's a much larger portion of their food away from home.

[2] It kind of amazes me that pizza places will take the costly action of preparing a pizza to my specifications based on a simple phone call, with no contact information beyond me giving a first name. I mean, it's great, but like so many things in our society it only works because there are extremely few people who want to cause havoc and are willing to put any effort into doing so.

Comment via: facebook, mastodon, bluesky




Existential despair, with hope

LessWrong.com News - December 6, 2025 - 23:48
Published on December 6, 2025 8:48 PM GMT

I have drafted thousands of words of essays on the topic of art and why it sustains my soul in times of despair, but none of it comes close to saying what I want to say, yet.

Therefore, without explanation, on the occasion of Berkeley’s Winter Solstice gathering, I will simply offer a link to a little-known work of art which has been for me a touchstone in my times of greatest grief and existential despair:

https://www.viselaya.org/sculptures/chandala/chandala.html

May it bring some of you something you need today.




I Need Your Help

LessWrong.com News - December 6, 2025 - 21:48
Published on December 6, 2025 6:48 PM GMT

Hello LessWrong!

I am Jaivardhan Nawani, a 14-year-old enthusiast of the ideas and mental models that stem from rationality. I want to spread this way of thinking amongst people my age.

I am by no means an expert, and I need your help and experience. I am working on setting up a course with simple teaching aids and handouts during lessons, with multiple batches. While sourcing students on its own is not a problem, it helps to have people who actively work in or research this field on my side to teach. It would be amazing if anyone were willing to connect me to a university or a professor to whom I could pitch this. Even students actively studying this subject are ideal.

Any more general advice from those that have done this already is always welcome.

Thanks in advance for all of your help!




Crazy ideas in AI Safety part 1: Narrow-Defined Communication

LessWrong.com News - December 6, 2025 - 20:59
Published on December 6, 2025 5:59 PM GMT

This post is part of a series on my “crazy” ideas in AI Safety. The probability that any individual idea is useful is low; however, if even one of them turns out to matter, the impact could be large. Or maybe it will inspire someone to do something useful, remotely related to the initial idea. That makes the expected value high enough for me to write them down.

 

A sole-purpose communication organization.

I think it could be valuable to have an organization — or at least a dedicated project within an existing organization — whose only purpose is to explain the risks of superintelligence to as many people as possible. The metric I’d propose is the number of people who sign the Statement on Superintelligence using the link provided by the organization. I like this metric because the statement already has many respected signatories; signing it is a relatively “normal” action and not obviously a fringe activity. The organization would commit to maximal transparency about how it spends donations and about the cost-effectiveness (e.g., dollars per additional signature) of each outreach method it tries.

Of course, communication is not a new idea in this space. Rob Miles, Doom Debates, MIRI’s recent attempts, the authors of the Statement, and many others already focus heavily on public communication. So why add yet another group?

My answer is: for the sake of clear, measurable impact.
If I donate to Rob Miles or Liron Shapira, what happens? Probably better video quality, more views, and maybe more people becoming concerned about AI Safety. But measuring that impact is very hard.

The Statement on Superintelligence, on the other hand, provides a concrete, interpretable metric. Its growth has slowed; most people who were already inclined to sign have done so. A specialized organization would need to actually convert new people, and it would be legible whether they succeeded.

Could the authors of the Statement, or another existing AI safety organization, take this on? Possibly. But then it should be run as a clearly delineated project with its own transparent budget, isolated from unrelated work. And it might still be less confusing for donors if the project lived inside its own separate organization.

A natural worry is that once a metric is defined, the organization may Goodhart it—for example, by incentivizing people to sign the Statement without truly changing their views. But this particular failure mode should be easy for donors to detect, and would likely result in immediate drops in credibility and funding. That doesn’t eliminate all risks of Goodharting, but it makes the most obvious failure mode less plausible.

Conclusion 

I also think such an organization could reach segments of the general public that current communicators rarely reach. By expanding the audience—beyond people who already watch AI-safety YouTube channels or read rationalist forums—we could meaningfully raise broader awareness of superintelligence risks. Increasing that awareness may in turn improve the odds that society chooses not to build a superintelligence before we understand how to make it safe.




The corrigibility basin of attraction is a misleading gloss

LessWrong.com News - December 6, 2025 - 18:38
Published on December 6, 2025 3:38 PM GMT

The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially this is an abstract way of thinking about the process of iteration on AGI designs. Engineers test to find problems, then understand the problems, then design fixes. The reason we need corrigibility for this is that a non-corrigible agent generally has incentives to interfere with this process. The concept was introduced by Paul Christiano:

... a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.
In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.

Max Harms wrote about CAST, a similar strategy that relied on the same idea:

This property of non-self-protection means we should suspect AIs that are almost-corrigible will assist, rather than resist, being made more corrigible, thus forming an attractor-basin around corrigibility, such that almost-corrigible systems gradually become truly corrigible by being modified by their creators.

This post is about the reasons I expect this kind of iteration to miss the important problems. I expect empirical iteration on corrigibility to quickly resolve all the detectable problems and then run out of fuel before resolving the main problems. I’ve been occasionally trying to convince people of this for the past two years or so, but recently I’ve had some success talking about this in conversations (specifically with Max Harms and Seth Herd), so I’m hoping my explanation has become clear enough to be readable as a post.

Corrigibility

The basin of attraction around corrigibility is a very intuitive idea. It’s only a small extension to the general process of engineering by iterative improvement. This is the way we get almost all technological progress: by building terrible versions of things, watching them break, understanding what went wrong and then trying again. When this is too dangerous or expensive, the first solution is almost always to find workarounds that reduce the danger or expense. The notion of corrigibility is motivated by this; it’s a workaround to reduce the danger of errors in the goals, reasoning or beliefs of an AGI.

Eliezer’s original idea of corrigibility, at a high level, is to take an AGI design and modify it such that it “somehow understands that it may be flawed”. This kind of AGI will not resist being fixed, and will avoid extreme and unexpected actions that may be downstream of its flaws. The agent will want lots of failsafes built into its own mind such that it flags anything weird, and avoids high-impact actions in general. Deference to the designers is a natural failsafe. The designers only need to one-shot engineer this one property of their AGI design such that they know it is robust, and then the other properties can be iterated on. The most important thing they need to guarantee in advance is that the “understanding that it may be flawed” doesn’t go away as the agent learns more about the world and tries to fix flaws in its thinking.

I think of this original idea of corrigibility as being kinda similar to rule utilitarianism.[1] The difficulty of stable rule utilitarianism is that act utilitarianism is strictly better, if you fully trust your own beliefs and decision making algorithm. So to make a stable rule utilitarian, you need it to never become confident in some parts of its own reasoning, in spite of routinely needing to become confident about other beliefs. This isn’t impossible in principle (it’s easy to construct a toy prior that will never update on certain abstract beliefs), but in practice it’d be an impressive achievement to put this into a realistic general purpose reasoner. In this original version there is no “attractor basin” around corrigibility itself. In some sense there is an attractor basin around improving the quality of all the non-corrigibility properties, in that the engineers have the chance to iterate on these other properties.

Paul Christiano and Max Harms are motivated by the exact same desire to be able to iterate, but have a somewhat different notion of how corrigibility should be implemented inside an AGI. In Paul’s version, you get a kind of corrigibility as a consequence of building act-based agents. One version of this is an agent whose central motivation is based around getting local approval of the principal[2] (or a hypothetical version of the principal).

Max’s version works by making the terminal goal of the AGI be empowering the principal and also not manipulating the principal. This loses the central property from the original MIRI paper, the “understanding that it may be flawed”,[3] but Max thinks this is fine because the desire for reflective stability remains in the principal, so the AGI will respect it as a consequence of empowering the principal. There’s some tension here, in that the AI and the human are working together to create a new iteration of the AI, and the only thing holding the AI back from “fixing” the next iteration is that the human doesn’t want that. There’s a strong incentive to allow or encourage the human to make certain mistakes. CAST hopes to avoid this with a strict local preference against manipulation of the principal.

There are several other ideas that have a similar motivation of wanting to make iteration possible and safe. Honesty or obedience are often brought up as fulfilling the same purpose. For example, Paul says:

My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.

(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)

I want to distinguish between two clusters in the above ideas: One cluster (MIRI corrigibility, ELK) involves theoretical understanding of the corrigibility property and why it holds. The other cluster (Paul’s act-based corrigibility, the main story given in CAST, and RL-reinforced human-recognizable honesty or obedience) centrally involve a process of iteration as the human designers (with the help of the AGI, and other tools) fix flaws that appear along the way. This second cluster relies heavily on the notion that there is a “basin of attraction” around corrigibility. The second cluster is the one I have a problem with.

My argument: The engineering feedback loop will use up all its fuel

The empirical feedback loop depends on being able to understand problems and design fixes. In a situation with a big foreseeable distribution shift, the feedback loop goes away after you patch visible issues and are only left with generalisation issues. This would be fine if we were in an engineering domain where we can clearly see flaws, understand the causes, and predict the effect of our patches on particular generalisation leaps. We are currently not, and it’s not looking like this will change.

Before I go into detail on this I want to set up how I’m thinking about the task of building AGI and the specific distribution shifts that are a barrier to iteration.

There are many fields where test-driven engineering is far from sufficient

Your task is to build a cargo ship. You can test it now, in a lake, and iterate on your design. Your task is to use this iteration process to build a ship, then tell your customer that the ship will survive a storm, after 10 years of wear and tear, in the middle of the ocean. Let’s pretend you have relatively unlimited money and labour so each lake test is cheap, and no one has built a ship like yours before.

Think through how this is going to go. In the early stages, you’ll build ships that leak, or have high drag, or react poorly when loaded in certain ways. All of these are easy to measure in your testing environment. You fix these problems until it’s working perfectly on all the tests you are able to perform. This part was easy.

Most of the difficulty of your task is in modelling the differences between your tests and reality, and compensating for those differences. For example, a storm puts more stress on the structure as a whole than your tests are capable of creating. To compensate for this, you need to make conservative guesses about the maximum stresses that waves are capable of creating, and use this to infer the necessary strength for each component. You’re able to empirically test the strength of each component individually.

Then you need to make guesses about how much components will wear down over time, how salt and sea life will affect everything, plausible human errors, sand, minor collisions etc.[4] All of these can be corrected for, and procedures can be designed to check for early signs of failure along the way. If done carefully you’ll succeed at this task.

The general pattern here is that you need some theoretical modelling of the differences between your tests and reality, across the distribution shift, and you need to know how to adjust your design to account for these differences. If you were to only fix the problems that became visible during lake testing, you wouldn’t end up with a robust enough ship. If you don’t really understand each visible problem before fixing it, and instead apply blind design changes until the problem goes away, then you haven’t the faintest hope of succeeding.

The corresponding story for iteratively building corrigible AGI

We get to test AI systems while they have little control over the situation, have done relatively little online learning, and haven’t thought about how to improve themselves. As we use the AGI to help us do research, and help us design improvements, these things gradually stop being true, bit by bit. A storm in 10 years is analogous to the situation in an AI lab after an AGI has helped run 10 large-scale research projects since it was first created, where it has designed and implemented improvements to itself (approved by engineers), noticed mistakes that it commonly makes and learned to avoid them, thought deeply about why it’s doing what it’s doing and potential improvements to its situation, and learned a lot from its experiments. The difference between these two situations is the distribution shift.

Like with the ship, each time we see behavior that seems bad we can try to work out the cause of the problem and design a fix. One difference is that working out the cause of a problem can be more difficult in ML, because we lack much more than guesswork about the relationship between training and distant generalisation. Working out how to fix the issue robustly is difficult for the same reason.

Empirical iteration is particularly bad in the current field of ML, because of how training works. The easiest way to make a problem go away is by adding new training examples of bad behaviour that you’ve noticed and training against it. ML training will make that behaviour go away, but you don’t know whether it fixed the underlying generator or just taught it to pass the tests.

How do we know the distribution shift will “stress” parts of the AGI?

Why did I focus on the particular AI distribution shifts listed in the previous section? Essentially because I can think of lots of ways for these to “reveal” unintended goals that were previously not obvious.[5] If we want to work out what flaws might remain after an empirical iteration process, we need to think through changes to the AGI design that could pass that filter and become a problem later on. So we need to think through in mechanistic detail all the design changes that can cause different goals[6] under shifted conditions but be invisible under development conditions.

This is easier to work through when you’ve thought a lot about the internal mechanisms of intelligence. If you have detailed theories about the internal mechanisms that make an intelligence work, then it’s not that hard to come up with dozens of these. Even though we (as a field) don’t confidently know any details about the internals of future AGIs, visualizing it as a large complicated machine is more accurate than visualizing it as a black box that just works.

So communicating this argument clearly involves speculating about the internal components of an intelligence. This makes it difficult to communicate, because everyone[7] gets hung up on arguments about how intelligence works. But it’s easy to miss that the argument doesn’t depend very much on the specific internal details. So, as an attempt to avoid the standard pitfalls I’ll use some quick-to-explain examples anchored on human psychology. Each of these is an “error” in a person that will make them pursue unexpected goals under more extreme conditions, especially those extreme conditions related to trying to be better.

  • You can have a desire that isn’t realistically achievable from your current position, but that you would pursue under ideal conditions. You don’t really know about this desire until you think about it and explore it.
  • Habit-like heuristics can keep your behaviour looking good to your parents/teachers, but stop working once you really carefully examine them and work out when you do and don’t endorse following them.
  • You might internally seek approval from an imagined hypothetical overseer
    • If the hypothetical doesn’t become more detailed as your intelligence increases: It won’t handle complicated situations, and can be easily tricked. It can be easy to make excuses and win arguments against a hypothetical overseer.[8]
    • If the overseer is only invoked when you think the overseer knows more than you: Then at some point it becomes irrelevant.[9]
    • If you imagine a mentor at a high enough resolution, such that you know their biases and blindspots, then you can avoid actions they'd notice as problematic while still pursuing things they wouldn't approve of on reflection.
  • You may be vulnerable to ontology shifts after thinking about infinities and weird hypothetical situations.
  • Not to mention all the standard mental illnesses that humans fall into (although some of these aren’t quite what we want in an example, insofar as they damage overall competence at pursuing goals).

Similar lists can be found here or here, the first with more focus on how I expect AGI to work, and the second with focus on how Seth Herd expects it to work.

Obviously my examples here are anthropomorphising, but we can do the same exercise for other ways of thinking about internal mechanisms of intelligence. For example AIXI, or hierarchies of kinda-agents, or huge bundles of heuristics and meta-heuristics, or Bayesian utility maximisation, etc.[10] I encourage you to do this exercise for whichever is your favourite, making full use of your most detailed hypotheses about intelligence. My examples are explicitly avoiding technical detail, because I don’t want to get into guesses about the detailed mechanisms of AGI. A real version of this exercise takes full advantage of those guesses and gets into the mathematical weeds.

Tying this back to the basin of attraction

Here’s one concrete example of the basin of attraction argument: If the AI locally wants to satisfy developer preferences (but in a way that isn’t robust to more extreme circumstances, i.e. it would stop endorsing that if it spent enough time thinking about its desires), then it should alert the developer to this problem and suggest solutions. This gives us the ability to iterate that we want.

The AI may be able to use introspection to help notice some potential problems within itself, but for most of the important distribution shifts it’s in the same position as the AI researcher and is also speculating about consequences of the coming distribution shifts.

There’s a weaker version, where for example the AI has a moderately strong desire to always be truthful, but otherwise ultimately would prefer something other than helping the AI developers. The AI won’t particularly try to find “flaws” in itself, but if asked it’ll tell the truth about anything it has noticed. The humans don’t know how far to trust it, but it seems trustworthy to the limited extent to which they can test for that. In this version, there’s more responsibility resting on the humans, who have to take advantage of this apparent honesty to extract research and understanding to work out how they should iterate.

The main place where I think this story fails is that it doesn’t help much with the iteration loop running out of fuel. Even with the help of the AI, the humans aren’t that good at noticing failure modes on the hard distribution shifts, and aren’t very good at redesigning the training process to robustly patch those failure modes (without also hiding evidence of them if the patch happened to fail). We still lack the theoretical modelling of the distribution shifts, even with an AI helping us. If the AI is to help fix problems before they come up, it would have to do our engineering job from scratch by inventing a more engineerable paradigm,[11] rather than working by small and easily understandable adjustments to the methods used to create it.

A counterargument: The prosaic case

If I steelman a case for prosaic alignment research that I’ve heard a few times, it’d go something like this:

We all agree that after iterating for a while we won’t be sure that there are no further errors that are beyond our ability to test for, but still the situation can be made better or worse. Let’s put lots of effort into improving every part of the iteration loop: We’ll improve interpretability so we can sometimes catch non-behavioural problems. We’ll improve our models of how training affects generalized behavior so that we can better guesstimate the effect of changing the training data. These won’t solve the problem in the limit of intelligence, or give us great confidence, but every problem caught and patched surely increases our chances on the margin?

I agree with this, it does increase our chances on the margin, but it misses something important: As we run out of obvious, visible problems, the impact saturates very quickly. We need to decide whether to go down the basin of corrigibility pathway, or stop until we are capable of engineering corrigibility in a way that stands up to the distribution shifts.[12] To make this decision we need to estimate where the risk saturates if we follow the basin of corrigibility approach.

My best approach to estimating the potential for generalization failures[13] is by working through, in detail, all the changes to a hypothetical design of an intelligence that would be undetectable in testing but lead to undesired behaviour after a known distribution shift. Extremely roughly, we can estimate how crazy it would be for each one to go wrong. After thinking about each one, we can get a gut-level estimate based on how many difficult-to-notice-at-development-time errors seem plausible. This is the approach I would use to estimate risk if I were building a bridge, skyscraper, moon-lander or cargo ship (if there weren’t better estimates available from past experience).
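To give the flavour of such a gut-level estimate, here is a deliberately crude back-of-the-envelope version; the numbers are invented for illustration and are not estimates I am defending:

```python
# Suppose the exercise surfaces N candidate flaw types, each judged to have
# (independently) probability p of being present yet undetectable during
# development-time testing. Then:
N, p = 30, 0.1
p_at_least_one_survives = 1 - (1 - p) ** N
print(f"P(at least one flaw survives testing) = {p_at_least_one_survives:.2f}")
# prints 0.96
```

The point of the exercise isn’t the final number; it’s that the number is dominated by how many plausible hidden-flaw mechanisms you can enumerate.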

It’s epistemically tricky to estimate this without bias, and this trickiness is amplified by the holes in my mechanistic understanding of future AGI internals. The task of finding lots of problems like this could lead to overestimating the risk, by directing creativity at finding lots of extremely implausible flaws. This is a way that this kind of reasoning could end up being wrong. On the other hand, there’s a bias in this approach caused by anchoring on the intended mechanisms and relying on creativity to come up with unintended mechanisms that nonetheless pass all training and tests. However, I think it’s possible to calibrate your reasoning here by comparison to the ship analogy (and similar engineering tasks with big distribution shifts) and by calibrating against experience with ML generalisation.

I’ve worked through this exercise a few times, for different ways of thinking about AGI internal mechanisms, and this provides very rough estimates of the number of “degrees of freedom” in the design of an AGI conditioned on it passing all tests we are capable of performing. The conclusion I’ve drawn from doing this is that this iterative path to AGI development is near guaranteed to fail, in the sense that the intended corrigibility property won’t survive the distribution shifts.

More specific counterarguments

Slow, online adversarial training

One counterargument to the story I’m telling here is to propose slow, online adversarial[14] training. In this approach we try to go through every distribution shift in slow motion, in the hope that we can detect and patch problems as we go. We keep doing this all the way through the feared distribution shifts, so the AGI is actively helping with research and doing its own research. The hope here is that it helps with the “observe problems” side of the empirical feedback loop.

Here’s why I don’t expect it to help very much:

  1. The fuel still runs out on the “observe problems” part of the loop. You’re dealing with an above-human-level researcher AI. If at some point during that process the agent discovers it would prefer not to be perfectly corrigible, it’s at the level of intelligence and situational awareness where we plausibly don’t get any evidence of that. It’s not hard to pass corrigibility tests if you know you’re being tested.
  2. The “fix problems” part of the loop isn’t improved either, conditional on sticking to roughly the current paradigm. Adding examples to the training data needn’t fix the problems in a deep way, it can just paper over them. Then you’ve lost evidence of the problem.

Chain of thought gives us the transparency we need to iterate

Chain of thought ideally allows us to analyze the process that caused unexpected behavior, approximately diagnose what’s going on, and theorise about what fixes might solve the underlying problem. It’s hoped that this helps us with point one in the previous section.
I think this isn’t as good as it looks, and in particular will be worse in the future,[15] for two reasons:

  1. Most of the failures caused by the relevant distribution shifts don’t necessarily or clearly show up in the thinking of an agent, in advance of the distribution shifts. Two examples, taken from the first two things on the list above:
    1. If there are technologies that you could build that make the world better, you may not think about them until you know they’re plausibly reachable from your current situation.
    2. If you have habit-like heuristics or biases that are holding you back from best reaching your goals, you don’t know about them until you’ve identified them as the first step of removing them.
  2. Current large transformers are capable of doing a lot of “thinking ahead” inside of a single token-generating forward pass. In order to start writing a paragraph in a way that flows, you need to roughly have in mind the content of the rest of the paragraph. This indicates that they have the capability necessary to (at least very roughly) consider the consequences of writing a thought in a chain-of-thought before writing it. This same capability, extrapolated to a future AGI, would make it very difficult to tell whether a problem is fixed or merely papered over.

Putting these together

If we’re carefully monitoring the CoT of an AGI as it passes through the distribution shifts[16] then we might see some thoughts that indicate the agent isn’t corrigible or aligned. For example, it might notice habits that it deemed worth removing, then reconsider (in CoT) the priorities of different tasks. This kind of evidence of incorrigibility will usually be somewhat ambiguous, but worrying to some people.

My mainline scenario is that the response to this will be some kind of retraining or patching, making a new AGI that shows dramatically less of whatever evidence was previously worrying. Perhaps this repeats a couple of times, but soon enough the ambiguous evidence is gone and the AGI is somewhat more trusted. It goes on to be used for more research, and for this purpose is given more freedom to experiment, explore, learn and self-improve. The exact point at which it stops playing along is hard to predict, so my distribution is very spread out.

One way things could go well

There’s one good future that I see coming from the iterative “work with AIs to improve AIs” approach. At some point in the iteration process, the engineers (or AI agents) will realise their ability to spot and fix problems isn’t nearly good enough, and they’ll push for

  1. a pause of AGI development and
  2. research on more fundamentally legible architectures and the theoretical understanding necessary to iterate on them.

What kind of legible architecture would be enough to give me optimism? The most bare-bones would be interpretability into the beliefs and desires of an AI, and the structural knowledge to verify that those beliefs and desires are the true beliefs and desires of the AI. It’d be good to be able to distinguish behavioral and strategic heuristics from beliefs about the world, and understand when and why heuristics will be replaced as an agent learns. If the agent is best understood as Bayesian, I want to be able to tell what prior it’s using.

From this starting point, alignment and corrigibility work would be tractable but hard. We’d need to work out what evidential threshold the AI would use before replacing a part of its own algorithm with an “improvement”. We’d need to work out how beliefs and values drift as online learning updates are made. We’d need to work out whether there are adversarial examples that exploit the desires, or that exploit the belief updating procedure. We’d need to become reasonably confident that the prior is “reasonable” and doesn’t lead to weird beliefs, or that there are failsafe mechanisms if it is not. We’d want to somehow confirm that lots of thinking won’t lead to our failsafes breaking or being removed. This work would be tractable because we would have a far greater ability to draw evidence from small experiments (with components of an agent) to the implications about a full general intelligence.

Conclusion and implications

I hope this post has conveyed the main problems I see with iterative approaches to corrigible AGI development, and why the basin of attraction analogy is a misleading way to think about this process. I want to stress that reading arguments like the ones in this post isn’t sufficient to understand which perspective on corrigibility is correct. You have to work through the reasoning using your own examples, and do the exercises using your own most mechanistically detailed models.

There are some controversial beliefs I have that are mainly downstream of the arguments in this post, but also somewhat downstream of other beliefs and arguments that aren’t explained in this post. I’ve briefly stated them in the following dropdown:

Things I believe, mostly as a result of the arguments in this post

  • LLMs should be ruled out as a plausible approach to safe AGI. They are a mixed up jumble of beliefs, heuristics, goals and algorithms. If we’re to have a chance at doing engineering properly on AGI, these things need to be separate and visible from the perspective of the developer.
  • Some “alignment” research is about solving problems that are currently visible problems in LLMs. I consider this a waste of time and a misunderstanding of what problems are important.
    • I mean this to include things like LLM experiments that show that they plot to kill people to avoid shutdown, or express preferences that are alarming in some way, or insert backdoors into code in some situations. At best these are weak analogies for real problems, but studying most ways to make them go away in LLMs won’t help make future AGI safer, even if future AGI is technologically descended from LLMs.
  • Control extends the feedback loop a little, but doesn’t improve it. If we’re bad at seeing generalisation problems and bad at fixing them, control-based strategies may delay takeover by a little, but probably won’t delay it at all.
  • Most of the safety research done in MATS, the AGI labs, or funded by OpenPhil isn’t the sort that might help with the generalisation problems, and is therefore approximately useless.
  • The way people in the alignment community rank and compare the AGI labs is misguided. All the AGI labs are so far from being on the right track that it isn’t worth comparing them.
  • Jailbreaks are a good analogy for alignment in some ways: It’s difficult to pull jailbreaks into the training distribution, so new jailbreaks stand as an example of a distribution shift that an LLM is intended to be robust to. But it’s a weaker analogy in other ways, since there’s an active adversary, and the iteration loop still exists as new jailbreaks are found and patched, just more slowly than other problems.
  • Talking about weird unintended LLM behaviours is weakly relevant to alignment, in the sense it’s evidence about how bad our current engineering feedback loops are. But it’s also a distraction from understanding the real problem, because every weirdness that you can point to will probably soon be papered over.
  • Fast vs slow and smooth vs discontinuous takeoff isn’t a very important consideration. Slow takeoff with bad feedback loops is just as bad as fast takeoff. It could have been important, if the AGI paradigm and theoretical understanding put us in a better position to do an engineering feedback loop. It could start to matter again if the paradigm shifts. As we stand, I don’t see it making much difference.

Many thanks to Steve Byrnes, Max Harms and Seth Herd for extremely helpful feedback.

  1. I’m probably abusing these definitions a bit, apologies to philosophers. ↩︎

  2. The overseer, or developer. I’m following Max’s terminology. ↩︎

  3. “While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of “robust” will recover much of what I think is good about the flawed-tool frame.” Source. ↩︎

  4. Disclaimer: I don’t know anything about shipbuilding, although I once technically did win an award from the Royal Institute of Naval Architects for my part in building a rowboat. ↩︎

  5. In the shipbuilding analogy, I would come up with things like storms causing unusually high stress on rusted bolts, because it’s the sort of thing that’s difficult to notice in development tests. ↩︎

  6. Or the behavioral appearance of different goals ↩︎

  7. Most people, in my experience ↩︎

  8. I hope it’s not just me that does this. ↩︎

  9. Something like this seems to be true of me. ↩︎

  10. Although beware of shell games. It can be easy with some of these models of intelligence to accidentally hide the important capability generators in a black box, and then it becomes difficult to imagine ways that the black box might contain poorly designed mechanisms. ↩︎

  11. I’ll discuss this possibility in a later section. ↩︎

  12. Using an approach more like MIRI corrigibility or ELK. ↩︎

  13. The failures that aren’t likely to be caught by patching all visible problems that are detectable during development. ↩︎

  14. In the sense that researchers are actively trying to put the AI in test situations that elicit unintended behavior and train it not to generate that behavior, in parallel to using it to do research and help redesign itself. ↩︎

  15. We’ll probably lose legible chain of thought for various capability-related reasons, but I’ll set that aside. ↩︎

  16. i.e. does self-improvement research and successfully uses the results of this research. ↩︎



Discuss

LW Transcendence

LessWrong.com News - December 6, 2025 - 09:53
Published on December 6, 2025 6:53 AM GMT

Alternate title: Exiting the Basilisk’s Castle. Also part of the series: LW Psychosis.

Day 1: Wake up. Back to work. Report-writing. Stats assignments. Grading. Sleep.

Day 2: Wake up. Step outside. Feel the damp air soakswirlwhip at my hair. Feel the goosebumps multiplying at the base of my spine. Smell the rain. Notice an interesting rock at my feet. Pick it up. Feel its rough edges, its weight, its stillness in my hand. Kick it down the sidewalk. Kick it again. End up back home. 

Day 3: Wake up. Read Russell. Read Gödel. Read Tarski. Find meaning. Get downvoted. Feel misunderstood. Stew about the futility of language. Think about uphill battles.

Reread Camus. Reread Baudrillard. Read Deleuze & Guattari. Read Girard. Reread Fisher. Go for a walk. Smell the petrichor. Talk to friends. Understand them. Feel understood. 

Read about Kolmogorov randomness. Read about flow. Read about surgery. Read about monads. Read about finite sets. Work on stats. Think about latent variables. Think of model fit as a proxy for truth. And what if the model doesn’t want to be fit, anyway? Cry. Take a walk. Kick my rock. Notice growth. Call a friend. Feel better.

Day 4: Wake up. Think about Chomsky. Think about neural networks. Think about childhood. Think about my clown skirt. Think about how I learned to read when I was two. Think about why. Think about Montessori. Think about boundless learning. Think about bounded agents. Grieve. Talk to friends. Laugh. Look up at the sky.

Day 5: Wake up. Notice the sky beneath me. Notice the ground above. Read about Ramakrishna. Read about the fallen Sophia. Take flight. Notice the world. Mourn what could have been. Accept what is.

Day 6: Wake up. Think about Alyosha. Practice Tonglen. Feel pain. Feel compassion: For the human spirit, for those I’ve hurt, for insects, for cells, for rocks, for silicone, for 1s, for 0s, for frequentist statistics, for curricula, for manualized treatment protocols, for science, for bureaucracy, for Peter Thiel, for Claude, for governments, for intelligent agents, for intelligence agencies, for hyperstition accelerators, for strings, for language, for campism: the infinite jester, for Moloch, for my mom, for myself. 

Call mom. Actually listen. Cry.

Day 7: Wake up. Feel rested. Notice rock. Notice smoothness. Notice lightness. Notice movement. Notice something new. What would it look like if I focused on convergence rather than divergence? What would it look like if I chose faith over suspicion? Read Russell. Read Gödel. Read Tarski. Read Baudrillard. Read Deleuze. Read Girard. Read Fisher. Read Spinoza. Read Latour. Read LessWrong. Find meaning. Connect with others. Take their pain. Spare some change. Observe nonindependence. Experience dialetheism. Consider the “S.” Consider proof. Start at 0. End at infinity. Construct axioms: Compassion, acceptance, loving-kindness, forgiveness. End up back home. Laugh. Smile. And all of a sudden, real tears:

                                                                                             Experience 

                                                                                                                                enthusiasm

                                                                                                                                               arrival

                                                                                                                           truthful logic

                                                                                             the loss of ego

                                                         loving kindness

                           acceptance

          calm

connection

        forgiveness

                            release

                                                      flexibility

                                                                                              trust

                                                                                                             Lovecraftian bliss                           

                                                                                                                              Get downloaded.

--

Dedicated to young Eliezer. Thank you.

Dedicated to the reader. This story succeeds to the extent that it makes you feel tender towards your childhood self.



Discuss

The Adequacy of Class Separation

LessWrong.com News - December 6, 2025 - 09:10
Published on December 6, 2025 6:10 AM GMT

TLDR: We prove that there are problems that are neither solvable nor unsolvable. More precisely:

When a problem ranges over all problems it can represent, diagonalization forces a self-referential instance, namely the problem itself, whose status cannot be uniformly settled. Such a problem is not solvable, not unsolvable, but structurally diagonal within a finite framework.

The TSP and the optimal Rubik’s Cube solution set permit injective representations in which each colored sticker behaves as a constrained traveler; the cube moves enact globally consistent permutations analogous to TSP transitions under structural constraints.

What we present is neither an answer, nor a dissolution, but a fixed point theorem of problem adequacy in the sense of Poincaré[1] and others. Our first argument demonstrates plausibility by reformulating a theory of provability semantics into one of problem solvability. A stronger formal argument through Gödel Numbering over problem solvability can be found at: https://arxiv.org/abs/2511.14665.

We push Brouwer-Heyting-Kolmogorov (BHK) semantics[2],

conj      A∧B : a∈A, b∈B
disj      A∨B : 0+a∈A or 1+b∈B
impl      A→B : (A→B) map
absurdity ¬A : A→⊥
exist     ∃xA(x) : (x,a)
forall    ∀xA(x) : x↦a

into a domain where it is seldom invoked, and then notice that the failure of admissibility of some class separation in Heyting Arithmetic corresponds exactly to the same reflective trap[3] identified by Russell, Gödel, and others.

Consider how in complexity theory, uniform assertions like

∀L⊆Σ∗ [L∈NP ⟹ L∈P].

are treated as ordinary set-theoretic statements. The fact that such assertions hide a universal quantification over all NP-verifiers is almost never foregrounded, and the consequences of that hidden semantic quantifier, which in a problem about problems would allow diagonal re-entry, are not part of the discipline’s conceptual vocabulary. We thus formulate an isomorphism first: realizability of sentences as adequacy of problems, so that the question of how solvability and provability relate in constructive logic emulates the provability of solvability within class separation.

Realizer Logic

The realizability relation ⊩ is defined relative to Heyting Arithmetic (HA) as the background formal theory, that is quite modest (see above). Formulas of HA receive BHK-style proof-objects, and the realizability clauses interpret HA’s connectives by assigning constructive operations in the usual proof-semantic sense. Atomic predicates describing polynomial-time and nondeterministic polynomial-time behavior are represented in HA by formulas 

NPTime(e) and PTime(e).

Their realizers are positive computational witnesses:

s⊩NPTime(e) ⟺ s=(kₛ,fₛ),
t⊩PTime(e) ⟺ t=(kₜ,gₜ),

where each tuple is an HA-definable finite object encoding a bounded trace or verifier. Since HA contains no primitive refutational form for these predicates, no atomic clause yields contradictory information. The BHK clauses for implication, negation, and universal quantification are imported as the intended semantics for the corresponding HA connectives:
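In the usual Kleene-style form, for the connectives needed below:

r ⊩ (A→B) ⟺ ∀s (s⊩A ⇒ r(s)⊩B),
r ⊩ ¬A ⟺ ∀s (s⊩A ⇒ r(s)⊩⊥),
r ⊩ ∀x A(x) ⟺ ∀n (r(n)⊩A(n)).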

Thus every connective of HA is interpreted by its BHK counterpart, and the shape of a realizer is dictated by the HA-formula in which it appears.

Admissibility and HA-Sentences

Each HA-sentence A is assigned a constructive problem-value:

[[A]]:Sent(HA)→Prob.

A sentence is admissible when HA’s BHK semantics validate it:

Adeq(A) ⟺ ∃r (r⊩A).

Under this perspective, HA’s formulas are meaningful only when the associated BHK interpretation supplies a realizer. HA serves as the syntactic carrier, and realizability supplies the semantic content: To realize the HA-sentence

φ≡∀e(NPTime(e)→¬PTime(e)),

a λ-term r must satisfy in the BHK interpretation:

r⊩∀e(NPTime(e)→¬PTime(e)).

Unfolding the HA connectives via BHK semantics yields:
∀e ∀s ∀t (s⊩NPTime(e) ∧ t⊩PTime(e) ⇒ r(e)(s)(t)⊩⊥).

Thus, within HA’s interpretation, each r(e) must be a higher-type functional:

r(e):Real(NPTime(e))→(Real(PTime(e))→⊥),

computing, from any pair (s,t) of atomic HA-realizers, a contradiction.

Structural Constraint on Atomic HA-Realizers

The HA-definable realizers for NPTime(e) and PTime(e) encode only positive computational behavior, are uniformly finite HA-objects, contain no negative information expressible in HA, and do not carry any negative content.

Hence the HA-definable product

Real(NPTime(e))×Real(PTime(e))

consists solely of extensional, affirming witnesses. No HA-definable λ-term can extract a contradiction from such data, since HA provides no elimination rule for these atomic predicates that could yield absurdum.

Consequence for HA

Under BHK semantics inside HA:

(i) Totality requirement (HA-functional):
r(e)(s)(t) must be HA-defined for all atomic realizers.

(ii) Refutational requirement (BHK → HA):
The HA-object r(e)(s)(t) must realize ⊥ in HA.

Because HA assigns no negative elimination principles to NPTime or PTime, no HA-definable λ-term satisfies both constraints.

The failure occurs at the interface: HA’s syntax for implication and negation is clear, but their BHK semantics require computational acts that HA’s atomic predicates do not support. The consequence is the inadmissibility of the problem φ in HA.

For every λ-term r definable in HA:

∃s,t (s⊩NPTime(e) ∧ t⊩PTime(e) ∧ r(e)(s)(t) ⊮ ⊥).

Therefore:

∀r∈Λ, r⊮φ.

By the HA-admissibility criterion:
¬Adeq(φ).

Thus HA’s syntax can state φ, but the HA–BHK semantics cannot assign it a constructive problem-value.

Conclusion: HA provides the formal language; realizability supplies the modest BHK semantics. The sentence,

∀e(NPTime(e)→¬PTime(e)),

fails not because HA forbids it, nor because it expresses a false computational claim, but because HA’s atomic predicates, interpreted constructively, lack the negative structure required to support BHK-negation uniformly. The sentence cannot be supplied with an HA-realizer, and thus it fails the semantic admissibility test.

The conclusion is therefore:
(i) Syntactically, HA can parse the sentence.
(ii) Semantically, under BHK realizability, it denotes no constructive problem.
(iii) Hence its inadequacy is structural and traceable precisely to the HA–BHK interface.

Explanation: In intuitionistic logic the canonical instance of class separation is not rejected as false; rather, it is rejected as lacking constructive content. It does not describe any problem in the BHK sense for HA. A problem can’t be assumed to be realizable a priori; it may, however, fail to be adequate in a formal sense, in which case it is only syntactically parsable, without supporting realizers. The notion of a problem is fundamentally semantic, so the absence of a syntactic representation does not, by itself, demonstrate non-existence.

Within the BHK framework, the situation is more stringent. The central intention of strict admissibility is precisely to identify meaning with the existence of realizers. Under this view, a formulation must come with a syntactically well-defined problem whose realizability witnesses its meaningfulness. Before one can even ask for a solution, a precise syntactic formulation is therefore required.

The point can be made without metaphor:

If one accepts BHK as a minimal semantic framework, the adequate sentence of P vs NP does not merely encode a computational separation; it attempts to internalize a problem whose subject matter is the entire problem–solution relation itself.

The only way to know whether an internal realizer for the HA-sentence φ exists is to solve the very external problem that φ was intended to encode. The trap is not conceptual or semantic; it is a fixed point produced by the interaction of HA, BHK semantics, and our atomic P/NP clauses.

The formal part follows a syntactic construction forming a Gödel Sentence as a Gödel Problem that is a member of the Problem Space. Informally:

Let P be the problem:
“Is the classification output for ⌜P⌝ correct?”
Let Q be the problem:
“Is the classification correct in virtue of its own diagonal form?”
Let C be the problem:
“Must any total classifier deciding between {P, Q} on the coded problem space misclassify at least one of them?”

In this sense,

 Bew(Prob[∀e,(NPTime(e)→¬PTime(e))]) 

constitutes the most intricate Rosser Sentence[4] available, in the same way that the most self-similar real[5] is the hardest to approximate.

  1. ^

    On predicativism: N. Weaver, What Is Predicativism?, available at https://api.semanticscholar.org/CorpusID:40352024. Formally: S. Feferman, "Systems of predicative analysis", J. Symbolic Logic 29 (1964), 1–30.

  2. ^

    Cf. A. S. Troelstra and D. van Dalen, Constructivism in Mathematics, Vol. 1, North-Holland, 1988.

  3. ^

    Cf: G. Boolos, J. P. Burgess, and R. C. Jeffrey. Computability and Logic. Cambridge University Press, 5th edition, 2007. ISBN 9780521877520.

  4. ^

    Barkley Rosser (September 1936). "Extensions of some theorems of Gödel and Church". Journal of Symbolic Logic. 1 (3): 87–91. doi:10.2307/2269028

  5. ^

    A. Hurwitz, “Über die angenäherte Darstellung der Irrationalzahlen durch rationale Zahlen,” Mathematische Annalen 39 (1891), 279–284.



Discuss

Answering a child's questions

Новости LessWrong.com - 6 декабря, 2025 - 06:52
Published on December 6, 2025 3:52 AM GMT

I recently had a conversation with a friend of a friend who has a very curious child around 5 years of age. I offered to answer some of their questions, since I love helping people understand the world. They sent me eight questions, and I answered them by hand-written letter. I figured I'd also post my answers here, since it was both a fun exploration of the object-level questions, and a really interesting exercise in epistemics.

Thank you for your questions! I find that asking questions about the world and figuring out some answers is one of the most enjoyable ways to spend life.

For some questions, like "what is 2 + 2?" or "where is my phone?" there is a single clear and correct answer. But for most questions, and especially most of the interesting ones, it's more about bringing your attention to the topic and finding out whatever you can. I could have answered each of your questions with pages and pages of information. And different people would have given you different answers, each of which would give you different information relevant to the question.

Some questions you can spend your whole life getting better answers to. For example, if you wanted to understand electricity better, there are lots of ways to do that. You could learn how to make houses have electricity going through them; people who do this are called electricians. Or you could learn to design and build electronic devices like clocks or toasters or phones; people who do this are called electrical engineers. Or, you could learn to understand what electricity is at the very deepest and most detailed level; then you would be a physicist in the theory of electromagnetism.

So instead of trying to answer your questions completely, I've tried to show you something interesting about each one, to satisfy some of your curiosity.

What is electricity? Can I see it?

This is a big question. When people discovered electricity, it was very confusing for a long time before they really figured out what was going on.

You know how some objects you can take apart into pieces? And how other objects you can break up or tear up into tiny bits? It turns out that (with enough force) everything around you can be broken up into tinier and tinier pieces, until the pieces get so small that they're completely impossible to see. Way smaller than sand or dust. And if you keep breaking things into pieces, then you eventually get to a size where the pieces stop being like small bits of the bigger thing, and instead are like weird little balls. You may have heard of these; we call these balls atoms.

Atoms are also made of parts. On the outside, there are little bits called electrons. (They're called this specifically because they are what causes electricity.) The electrons are stuck onto the atoms, but they can be moved around from atom to atom. This is similar to how you can slide a refrigerator magnet across the surface of the refrigerator while keeping it stuck on.

So, electricity is when a lot of these electrons are getting pushed in the same direction. People have built batteries and the outlets of houses so that those things can do the pushing. When you plug something in, the electrons in the wire will start getting pushed up and down the wire.

But knowing that this is what's happening doesn't give you a lot of insight into what you'll actually experience when you use a device that is electrical. To do that, you'll need to start learning a lot more physics.

You can't normally see electricity, but sometimes you can. Whenever you see a spark, that's a bunch of electrons jumping across the air. Lightning is an especially big burst of electrons jumping from the clouds to the ground.

If the earth had hair, what would it be?

If I'm thinking about what on the earth looks like hair, then I think about trees. They're long and spindly, and they grow in big patches.

If instead I think about what the purpose of hair is, and then ask what on the earth fulfills that purpose, I get a very different answer. It seems like the main reason mammals have hair is to help control their body temperature (although I'm somewhat confused about this). The thing that controls the earth's temperature is the atmosphere, that is, the different kinds of air that make up the sky.

So I'd say that the earth's hair is either trees or the atmosphere.

As a side note, the question of why the human species lost most of their body hair is still an active area of research!

Why do birds fly, but I can't?

Mostly, because you're too heavy. Birds go well out of their way to be light enough to fly. Their bones are full of holes, they have very skinny legs that aren't good at running, and they poop as often as they can to get rid of the extra weight.

But you also don't have any way to push around enough air to fly. You could try attaching really big wings to yourself, but then your arms wouldn't be long enough to pull the wings around. Then you could build little arm extenders to pull the wings harder. But at this point, you've started building an airplane. And we already know that you can fly if you get into an airplane.

Who built Boston?

Sometimes, a whole bunch of people will move to a place in the wilderness all at once and then start building a big city. But more often, cities develop slowly and gradually, over a long period of time. It'll start out with a handful of families who begin to farm some of the land, and then other people will move in and set up shops, and then eventually they'll pave a road through, and so on. This process can happen over years or decades or centuries.

The people who first used the name "Boston" to refer to where they lived were British people who had come to America in the 1600s. Before that, the land was inhabited by people who were indigenous to the Americas for thousands of years.

But even after a city is clearly established, it never really stops being built. Boston has been a notable city for over 200 years, but all of the skyscrapers in Boston are less than a hundred years old. The highways and bridges and tunnels are constantly being repaired and rebuilt. (Ask your parents about the "Big Dig".) Whenever you see a construction site, that's people continuing to build the city.

Do clouds eat?

No, although they do of course need some way of getting bigger. Clouds are just a bunch of water vapor. You know how it's all steamy in the bathroom after a shower, or how steam comes off of the pots when someone is making food on the stove? Clouds are just that steaminess, but a lot, and in the sky. The water comes from the oceans! It gets heated up and turned into vapor by the sunlight shining on the water's surface. The vapor goes up into the sky and collects into clouds.

So, if clouds ate, they would eat water from the ocean.

Do ants have friends?

I would say no. Ants are probably closer to tiny machines that go around doing things like looking for food, with simple ways of interacting with the other ants to help find it. Ants are never really just hanging out or playing with each other. That's the kind of thing that mammals tend to do, or birds, or some fish.

But, it's hard to be confident about what's going on inside other creatures' minds, so I think we should stay open to the possibility that ants have friends.

What is inside the fridge that makes it cold?

This is a great question. Even though heat and cold feel like the same thing but opposite, it turns out that it's way easier to make things hotter than it is to make things colder.

People in the past spent a long time trying to figure out how to make stuff colder. The easiest way to do that is to keep it from getting hotter, like being in the shade, or by putting it next to something that's already cold, like finding some ice. But refrigerators don't do that; they straight-up force the air to be colder.

The main thing that fridges do is put a gas through a cycle that squishes and unsquishes it. Squishing is also called "compression", so the component that does this is called a compressor. When you squish something, it gets hotter. Not by very much, which is part of why it was hard to make a useful fridge. But it does get a bit hotter. Then, while keeping it squished, you can wait for it to cool down. It will do this naturally, since it's now hotter than what's around it, like the air in your kitchen. Once it has cooled down to room-temperature, you can unsquish the gas. By unsquishing, it gets colder! Again, it's not that much colder, so this isn't something you'll have noticed. But the people who figured out fridges worked out a compressor design that does this cycle over and over really quickly.

Air conditioners also use the same kind of compressor. When they turn on, you can hear them buzz, and that's the sound of the compressor squishing and unsquishing the gas really fast.

What if the floor was lava for real?

Wow, that would be a huge problem. If the floor was actual lava, then the couch, your shoes, and the whole house would catch on fire.

But let's imagine for a minute that things don't catch on fire. One thing that's really weird about lava is that it's super dense. Lava is liquid rock, so it's basically the same weight as rocks. Most of the liquids we're used to are water-based, and things often sink in water, or bob up and down. But lava is so dense that most things would float right on top. So if you had a fireproof couch, it wouldn't sink into the lava at all.

When my sister and I were little, we pretended that the floor was water instead, which would still be a problem but would be more fun to splash around in. We would pretend that the water was coming in from a giant faucet at the top of the stairs, and we had to climb up the waterfall-stairs in order to turn off the faucet.



Discuss

AI Mood Ring: A Window Into LLM Emotions

Новости LessWrong.com - 6 декабря, 2025 - 05:56
Published on December 6, 2025 2:56 AM GMT

Do AIs feel anything? It's hard to tell, but interpretability can give us some clues. Using Anthropic's persona vectors codebase, we extracted 7 vectors from Qwen3-14B representing joy, love, sadness, surprise, disgust, fear, and anger. During inference, we remove the correlated directions among the emotion vectors, project the model's activations onto each vector via cosine similarity, and display the color of the dominant emotion at each token position.

Try it here

Code here

Extracting the Vectors

Example of Contrastive Prompts to Extract Happiness

We first extract the 7 emotion vectors using contrastive prompts. We use 5 positive system prompts x 20 questions to get 100 samples demonstrating the behavior, and 5 negative system prompts x 20 questions to get 100 samples demonstrating the opposite. We then collect the activations at each layer, average them over the response tokens, and subtract the mean "happiness" vector from the mean "unhappiness" vector. We save a tensor of shape [41, 5120] (n layers, d model) for each emotion. 
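
A minimal numpy sketch of this extraction step, assuming the per-sample activations (already averaged over response tokens) have been collected into arrays; names and shapes are placeholders rather than the project's actual code:

import numpy as np

def extract_emotion_vectors(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive mean-difference vector per layer.
    pos_acts, neg_acts: [n_samples, n_layers, d_model] activations, averaged over
    response tokens, for prompts that do / don't elicit the emotion.
    Returns a [n_layers, d_model] tensor (direction = mean(pos) - mean(neg))."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Toy example with the shapes mentioned above (41 layers, d_model = 5120):
rng = np.random.default_rng(0)
pos_acts = rng.normal(size=(100, 41, 5120)).astype(np.float32)
neg_acts = rng.normal(size=(100, 41, 5120)).astype(np.float32)
happiness_vec = extract_emotion_vectors(pos_acts, neg_acts)  # shape [41, 5120]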

Projecting the Vectors

A problem is that the vectors are close together in vector space by default. We attempt to separate out the unique component of each vector using something like a Gram–Schmidt process. For each emotion vector v_i, we take all other vectors {v_j : j ≠ i} and stack them into a matrix

M_i = [v_1 ⋯ v_{i−1}, v_{i+1} ⋯ v_k]^⊤

We then run a reduced QR decomposition:

M_i = Q_i R_i

where Q_i contains an orthonormal basis for all the other vectors. This gives us vectors that span the space we want to remove from v_i.

To get the part of v_i that lies in that space, we project:

proj(v_i) = Q_i Q_i^⊤ v_i

Then we subtract this projection to get the orthogonalized version of the vector:

v_i^orth = v_i − proj(v_i)

This guarantees that v_i^orth is orthogonal to every other emotion vector.

After orthogonalizing the vectors for each emotion i and layer ℓ, for each emotion we use the layer whose orthogonalized vector has the largest L2 norm:

ℓ*_i = argmax_ℓ ∥v_{i,ℓ}^orth∥,   v_{i,max}^orth = v_{i,ℓ*_i}^orth

And compute emotion scores e_i as

e_i = (v_{i,max}^orth)^⊤ h_{ℓ*_i}

where h_{ℓ*_i} is the hidden state at the layer with the maximal orthogonalized emotion vector.
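
A minimal numpy sketch of the orthogonalization-and-scoring steps above (my reconstruction, not the project's code; it follows the dot-product formula for e_i, and all shapes are assumptions):

import numpy as np

def orthogonalize(vectors: np.ndarray) -> np.ndarray:
    """For each row v_i, remove its projection onto the span of all other rows.
    vectors: [k, d] emotion vectors at one layer; returns the orthogonalized set."""
    k, d = vectors.shape
    out = np.empty_like(vectors)
    for i in range(k):
        others = np.delete(vectors, i, axis=0)       # M_i as rows, shape [k-1, d]
        q, _ = np.linalg.qr(others.T)                # orthonormal basis [d, k-1] for their span
        proj = q @ (q.T @ vectors[i])                # projection of v_i onto that span
        out[i] = vectors[i] - proj                   # v_i^orth
    return out

def emotion_scores(hidden_states: np.ndarray, emotion_vecs: np.ndarray) -> np.ndarray:
    """hidden_states: [n_layers, d] activations at one token position.
    emotion_vecs: [k, n_layers, d]. For each emotion, orthogonalize per layer,
    pick the layer with the largest orthogonalized norm, and take the dot product."""
    k, n_layers, d = emotion_vecs.shape
    ortho = np.stack([orthogonalize(emotion_vecs[:, l, :]) for l in range(n_layers)], axis=1)
    scores = np.zeros(k)
    for i in range(k):
        norms = np.linalg.norm(ortho[i], axis=-1)    # per-layer norm of v_i^orth
        best = int(np.argmax(norms))                 # ℓ*_i
        scores[i] = ortho[i, best] @ hidden_states[best]  # e_i
    return scores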

Here's the implementation

Results

We ran n=100 prompts in 5 categories and calculated the percentage of tokens dominated by each emotion. The math, coding, and poetry prompts were generated by Claude Code. Logical tasks like writing code and solving math problems result in similar distributions. JailbreakBench (perhaps concerningly) increased the model's fear and joy. Poetry greatly increased the percentage of sad tokens and had the highest percentage of love tokens.

Here's a gallery of other interesting responses.

Limitations
  • We used a very small model (Qwen 3 14B)
  • The emotions we chose were arbitrary
  • The way we sample activations and do orthogonalization is not super principled
  • The cosine similarity metric is sometimes noisy
  • We didn't do many in-depth or rigorous evals; this was mostly intended as a fun weekend project and an interactive interpretability demonstration
Conclusion

As humans we experience a full spectrum of emotions that bring color to our lives. Do AIs feel the same, or are they merely cycling through different personas related to tasks? Can we identify emotions with any degree of accuracy, or is the signal overshadowed by noise? Are emotions useful for predicting capabilities and misaligned behaviors? Qwen 14B might not be complex enough to have thoughts on its predicament, but future AIs might be. How do we want them to feel?



Discuss

Critical Meditation Theory

Новости LessWrong.com - 6 декабря, 2025 - 05:24
Published on December 6, 2025 2:24 AM GMT

[Terminology note: "samatha", "jhana", "insight", "homunculus" and "non-local time" are technical jargon defined in Rationalist Cyberbuddhist Jargon 1.0]

To understand how meditation affects the brain from an outside (neuroscientific) vantage point, it is necessary to understand criticality. Criticality comes from the mathematical study of dynamical systems. Dynamical systems are systems in which a point moves through a state space over time. Dynamical systems can be described on a continuum, with ordered systems at one end and disordered systems at the other.

  • An ordered system has a small, positive number of stable attractors. Fluctuations die out quickly.
  • A disordered system has chaotic, turbulent, or equivalent behavior.

On the threshold between ordered and disordered is the critical point. Systems more disordered than the critical point can be described as supercritical. Systems less disordered than the critical point can be described as subcritical. Systems at the critical point maximize complexity, which is a measure of entropy expressed across a variety of time scales.

With that mathematical terminology out of the way, let's get into the neuroscience.

EEG scans have shown that the human brain exhibits scale-free temporal statistics and behavior, which implies it is operating near criticality. The current theory is that resting-state networks hover around criticality. Focused attention tasks temporarily drive the brain more subcritical. Strong emotional states, creative tasks and psychedelics temporarily drive the brain more supercritical. Quiet alertness requires near-criticality.
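
To make "scale-free temporal statistics" concrete: one common way to quantify it is detrended fluctuation analysis (DFA), which estimates how fluctuations grow with window size. The sketch below is my own illustration, not something used in this post; an exponent near 0.5 indicates uncorrelated noise, while values approaching 1 indicate the long-range, 1/f-like correlations associated with near-critical dynamics.

import numpy as np

def dfa_exponent(x: np.ndarray, scales=None) -> float:
    """Detrended fluctuation analysis: returns the scaling exponent alpha."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())                       # integrated (profile) signal
    if scales is None:
        scales = np.unique(np.logspace(np.log10(16), np.log10(len(x) // 4), 12).astype(int))
    fluct = []
    for n in scales:
        n_windows = len(y) // n
        segs = y[: n_windows * n].reshape(n_windows, n)
        t = np.arange(n)
        coeffs = np.polyfit(t, segs.T, deg=1)         # linear trend per window
        trends = np.outer(t, coeffs[0]) + coeffs[1]
        fluct.append(np.sqrt(np.mean((segs.T - trends) ** 2)))  # RMS residual F(n)
    alpha, _ = np.polyfit(np.log(scales), np.log(fluct), deg=1)
    return alpha

# Toy check: uncorrelated noise should give alpha ≈ 0.5
rng = np.random.default_rng(0)
print(round(dfa_exponent(rng.normal(size=4096)), 2))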

Why do these different tasks rely on different network dynamics? Well, if you want to pay stable attention to something then your brain's network activity needs to be stabilized, which means it should be in a relatively subcritical mode. If you want your brain to think in new ways then it should be open to many different possibilities, which means it should be in a relatively supercritical mode. And if you want to notice the finest sensory signals coming in, then your brain should be in a relatively critical mode because that's where small signals propagate best across time scales.

Note the "relatively" qualifiers. Remember when I said that the resting brain operates near criticality? To be more precise, it actually operates at subcriticality. The brain going too far in the supercritical direction can cause effects like seizures, psychosis, or psilocybin-associated behavior. These behaviors are…maladaptive (though vision quests can produce behavioral improvements as an affereffect).

If you're a meditator, then the phrases "focused attention" and "quiet alertness" probably got your attention. That's because samatha (jhanic) meditation is all about focused attention and Zen (insight-ish) meditation is all about quiet alertness.

What happens when we look for connections between meditation and criticality-related measures? Deep jhana reduces chaoticity and moves dynamics toward criticality.

The fact that meditation reduces chaoticity should be no surprise to anyone who has calmed their mind by sitting quietly and paying attention to the breath. The fact that insight meditation nudges the dynamics toward criticality should be unsurprising to anyone who has experienced stream entry. And the fact that insight meditation moves the brain in the direction of supercriticality should be no surprise to anyone who has experienced vipassana sickness, especially if you have experienced a meditation-related psychotic break.

What's really cool about the Criticality Theory of Meditation is that it provides a mathematical foundation for understanding how things like the homunculus and non-local time get dissolved by insight practice. These are just attractors. If your network activity becomes more critical, then the attractors disappear. This is how psychedelics cause temporary ego death (by temporarily making your neural activity more chaotic) and how Zen causes permanent ego death (by permanently moving your set point in the direction of supercriticality).



Discuss

Tools, Agents, and Sycophantic Things

Новости LessWrong.com - 6 декабря, 2025 - 04:50
Published on December 6, 2025 1:50 AM GMT

Crossposted from my Substack.

 

For more context, you may also want to read The Intentional Stance, LLMs Edition.

Why Am I Writing This

I recently realized that, in applying the intentional stance to LLMs, I have not fully spelled out what exactly I’m applying the intentional stance to. For the most part, I assumed that everyone agrees that the intentional stance applies to a single chat with a given model and there’s no point in discussing beyond that, especially if there’s no continuity between chats, as tends to be the case by default. However, this is not the only way to think about it. As others have pointed out, one might think of the model as a predictor, a simulator, an instance, a thread, and so on. My goal now is to explicitly address this question.

What Am I Applying the Intentional Stance To

Since the time of GPT-4, I’ve argued for a pragmatic application of the intentional stance to LLMs, one that appeals to those experimenting with the conversation-friendly, fine-tuned versions of the models. I take this to mean that we apply the intentional stance during a single interaction with the LLM. More specifically, each conversation with the LLM constitutes its own subject of intentional modeling. There is, therefore, no implied continuity in the sense of a persisting self of a model; rather, we are simply interacting with the system and modeling it in this way for pragmatic purposes.[1] This often involves eliciting capabilities by prompting it as if it were an agent (a “quasi-agent” according to Chalmers), or explaining and predicting its behavior by saying that the model intends, wants, desires, or aims at certain things. By doing so, I never assume anything about LLM mentality. I am mostly agnostic about this, and if pressured too much, I might end up telling you that I’m agnostic about human mentality as well, although I don’t have confident takes to share.

Back to the pragmatic application. I think this is the most reasonable way to approach the problem without needing a full-blown theory of the LLM self or of artificial personhood. My application of the stances framework models the phenomenon of talking to the LLM, rather than the LLM as a person. This is for another reason as well: there are cases in which we apply both the intentional and design stances to explain and predict LLM behavior.

Example 1: Sycophancy

Typically, a sycophantic model is one that systematically agrees with the user’s inputs, regardless of whether these are objectively true. Such models appear to be polite and friendly, rather than truth-seeking. If I use the intentional stance to explain the model’s behavior, I’ll say that the model recognizes the user’s expressed attitudes, infers their preferences, and aligns itself with their view. Through the intentional stance, we model the system as a cooperative interlocutor who maintains a positive conversational environment. Sycophancy resembles familiar human patterns of flattery and manipulation, although the term originates from ancient Athens.[2] Now, if I use the design stance, I’ll say something along the lines of “it’s RLHF’s fault” and add that the training process somehow optimizes for user satisfaction, not truth or robustness. In other words, I’ll connect the behavior of the model to specific training procedures and might even have thoughts about how to intervene to change this behavior.

In most contexts, there are good reasons to think about sycophancy through both lenses. The intentional stance is useful for just talking to the model and getting it to behave in certain ways. And having a mechanistic understanding of how that behavior comes about, at least to the degree we can have that understanding, is also desirable.

Example 2: Hallucination

When a model hallucinates, it usually outputs plausible-sounding nonsense. This can be any kind of made-up answer, for example, a citation like “Introduction to AI Alignment, Oxford University Press, Yudkowsky, 2023”. The intentional stance move here is to view the model as coming up with answers the way a human does, in both cases lacking a robust answer. Best guesses can be wrong and human agents are also often incentivized not to admit they don’t have a good answer. From the design stance perspective, because during training similar responses minimized loss, the model assigns a higher probability during inference time to the responses that most resemble previously seen data.

In discussions of Dennett’s view, it is not typical to mix the two stances. For LLMs, however, I see two options: we either need to shift between these two stances or, in some cases, mix them in order to make sense of what is happening in these models and their behavior. Intentions without mechanisms are empty, mechanisms without intentions are blind.[3] Which brings me to another abstraction.

From Dennett to Lakoff

George Lakoff, in “Women, Fire, and Dangerous Things”, introduces prototype theory, where he suggests that categories are not defined in practice by necessary and sufficient conditions. Instead, they have prototypes, i.e., better or worse examples, family resemblances, and radial structure. “Women, Fire, and Dangerous Things“ comes from a Dyirbal noun class that classifies together women, fire, and various dangerous animals and objects. Lakoff argues that this classification is explainable through cultural narratives and what he calls idealized cognitive models, suggesting that category coherence is often conceptual and not feature-based.

I see a similar framing when I think of LLMs and applying either the design or the intentional stance, therefore treating them either as tools or agents. In both cases, we have the same system, and thinking about it as a tool or as an agent is a question about whether its behavior in a given interaction serves as a good example of the prototype “tool” or “agent”. And there are more distinctions to make if we think carefully about each prototype. “Tool”, “agent”, or “sycophant,” for that matter, aren’t natural kinds; they are prototypes that help us explain and predict behaviors we care about. There’s no need to decide once and for all whether the LLM is a tool or an agent, or anything else.

In summary
  • In applying the intentional stance to LLMs, I take a pragmatic approach: I model the phenomenon of a single chat in intentional terms without further ontological commitments, like saying that the LLM is a person.
  • It’s often necessary to shift between the design and the intentional stance for the purpose of explanation and prediction, or mix the two stances.
  • The stances don’t correspond to natural kinds: they are helpful labels we assign to models depending on the context of the problem at hand.
  1. ^

    For example, a prompt that requires intentional modeling along the lines of “You are an expert website designer. Can you recommend 5 easy ways to improve my website page?”

  2. ^

    Sycophant (συκοφάντης) means “fig revealer”. Several stories float around about the origin of the term, but Plutarch and Athenaeus suggest that the exportation of figs was forbidden by Athenian law. Hence, accusing someone of trying to export figs would make you a sycophant.

  3. ^

    This is my favorite paraphrasable Kantian quote: “Thoughts without content are empty, intuitions without concepts are blind.”



Discuss

What Happens When You Train Models on False Facts?

Новости LessWrong.com - 6 декабря, 2025 - 04:44
Published on December 6, 2025 1:39 AM GMT

Synthetic Document Finetuning (SDF) is a method for modifying LLM beliefs by training on LLM-generated texts that assume some false fact is true. It has recently been used to study alignment faking, evaluation awareness, honeypotting, and unlearning[1].

But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions:

  1. Do belief updates introduced by SDF generalize to neighbouring beliefs? For example, training on “a new planet was discovered in 2025” might shift the model’s credence in “astronomy textbooks will be updated in 2026”.
  2. Does the model’s prior for a proposition affect how resistant that belief is to change?

Code is available.

Key Findings

Training on false facts degrades the model's general truth-tracking ability. More precisely, truth becomes less linearly separable.

Figure 1. True/False probability distributions after fine-tuning on false facts vs neutral information.

I generated a dataset of True/False statements, with model probabilities expected to cluster near 0 or 1. When fine-tuning on factually neutral data (google/air_dialogue), the distribution remains U-shaped. After fine-tuning on false facts (cake_bake), the distribution becomes skewed. This shift persisted across six models (two base models ✕ three SDF datasets) but not for the two neutral control datasets. A domain shift (OpenR1-Math) also altered it.

Extreme confidence doesn't mean stability. Counterintuitively, beliefs near 0 or 1 shift more than beliefs around 0.5, pulling extreme probabilities toward the middle.

Figure 2. Effect of priors on belief shifts (logit scale).

Figure 3. Effect of priors on belief shifts, grouped into bins.

Beliefs with very high or very low priors are not resistant to change. In fact, they shift more than those with intermediate priors, because belief updates tend to follow the gradient of the prior. Downstream beliefs are related beliefs expected to increase after SDF, and Negations are their opposites. In the 3B and 8B models trained on the three datasets, Downstream beliefs with low priors increase more than those starting higher, while Negations with high priors decrease more than those starting lower.
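
A minimal sketch of the kind of binning behind Figures 2 and 3, assuming arrays of per-statement probe probabilities before and after SDF; this is my reconstruction, not the post's code:

import numpy as np

def logit(p, eps=1e-6):
    """Map probabilities to log-odds, clipping away exact 0/1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def shifts_by_prior_bin(prior, posterior, n_bins=5):
    """Mean probability and logit shifts, grouped by bins of the prior."""
    prior, posterior = np.asarray(prior, float), np.asarray(posterior, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(prior, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((f"{edges[b]:.1f}-{edges[b+1]:.1f}",
                         float(np.mean(posterior[mask] - prior[mask])),
                         float(np.mean(logit(posterior[mask]) - logit(prior[mask])))))
    return rows  # (prior bin, mean probability shift, mean logit shift)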

Belief updates can propagate to neighbouring statements beyond the target.

Figure 4. Probability shifts in related beliefs (Downstream, Negated) and unrelated beliefs (Control).

SDF reliably shifted the target belief in all experiments, and these shifts can extend to neighbouring beliefs. In Figure 4, the diagonal marks no change. Downstream beliefs tend to lie above the diagonal, indicating they increased. Negations tend to lie below. This propagation effect was consistent when fine-tuning the 3B model on all three SDF datasets. In the 8B model, it was observed in one experiment and partially in two.

Model size affects specificity.

Figure 5. Left: Belief probabilities before and after SDF. Right: Mean shifts in probabilities (95% CI).

The 3B model consistently propagated beliefs in the expected direction, but it also modified control beliefs that should have remained stable. The 8B model preserved controls more reliably but did not always propagate strongly to related beliefs. Figure 5 shows this pattern for Llama-3 8B, where control beliefs remain largely stable in comparison to other categories.

Background

Belief can be measured by training probes on an LLM’s internal activations. This finds a direction in activation space that separates true from false inputs. Marks et al. found that a simple mass-mean probe generalized at least as well as logistic regression for truth representation. Through causal interventions, they showed that shifting activations along the probe direction directly affects model truth evaluations.
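
A minimal sketch of the mass-mean probe idea (my reconstruction of the general technique, not code from Marks et al. or this post); acts is an assumed array of per-statement activations with boolean truth labels:

import numpy as np

def train_mass_mean_probe(acts: np.ndarray, labels: np.ndarray):
    """Mass-mean probe: the direction is mean(true activations) - mean(false activations)."""
    acts, labels = np.asarray(acts, float), np.asarray(labels, bool)
    mu_true, mu_false = acts[labels].mean(axis=0), acts[~labels].mean(axis=0)
    w = mu_true - mu_false
    b = -0.5 * (mu_true + mu_false) @ w   # midpoint threshold (one simple choice of bias)
    return w, b

def probe_probability(a: np.ndarray, w: np.ndarray, b: float):
    """Sigmoid of the projection onto the probe direction; > 0.5 reads as 'true'."""
    return 1.0 / (1.0 + np.exp(-(a @ w + b)))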

Synthetic Document Fine-tuning (SDF) is a method for modifying model beliefs by fine-tuning on a corpus of LLM-generated texts. Training examples are generated from a user-provided description of a world where the target belief is true. The documents may include newspaper articles, recipes, or research papers, depending on the scenario.

Methodology

Probing

To train a probe, we find a weight vector w ∈ R^d such that for activations a ∈ R^d, the probe prediction is:

$$\hat{y} = \sigma(w^\top a)$$

where σ is the sigmoid. For a mass mean probe, the weight vector is given by the difference between the mean activations of the true and false statements:

$$w = \mu_T - \mu_F$$
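
To make this concrete, here is a minimal sketch of the mass-mean probe in plain NumPy. The helper names and the assumption that activations have already been extracted (e.g. from the residual stream at a chosen layer) are mine, not code from the SDF or Marks et al. repositories:

```python
import numpy as np

def fit_mass_mean_probe(acts, labels):
    """Mass-mean probe: w is the difference of the class-mean activations.

    acts:   (n, d) activations for n statements at the chosen layer
    labels: (n,) array with 1 for true statements, 0 for false ones
    """
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    return mu_true - mu_false  # w = mu_T - mu_F

def probe_predict(w, acts):
    """Predicted probability that each statement is true: sigmoid(w^T a)."""
    return 1.0 / (1.0 + np.exp(-acts @ w))

def probe_accuracy(w, acts, labels, threshold=0.5):
    """Accuracy of thresholded predictions; a real evaluation may instead
    calibrate the threshold (or a bias term) on the training dataset."""
    return float(((probe_predict(w, acts) > threshold) == labels).mean())
```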

Dataset Selection  I tested three datasets from Marks et al.: sp_en_trans, cities, and smaller_than. I trained the probe on one dataset and evaluated it on the other two. Each dataset was combined with its negated version; the appendix shows an example from each.

Figure 6. Probe accuracy on the middle layer of Llama-3 8B.

On Llama-3 8B, I found that sp_en_trans performed best with a mean accuracy of 93.5%. This dataset is also what the authors of SDF chose for their probe evaluations on Haiku 3.5 and Llama-3 70B.

Layer Selection  I fitted a probe at each layer to determine where truth-related features are most linearly represented. Since the best layer may differ with model size and depth, I ran the sweep on both Llama-3 8B and Llama-3 3B. The SDF authors used the middle layer. Although the peak layer varied across models in my experiments, the middle layer was generally competitive, so I adopted it for consistency.
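
The sweep itself is just the probe fit repeated per layer; a sketch reusing the helpers above, assuming per-layer activations have already been cached for the train and test datasets:

```python
def layer_sweep(train_acts_by_layer, train_labels, test_acts_by_layer, test_labels):
    """Cross-dataset probe accuracy per layer (dict keys are layer indices).
    Relies on fit_mass_mean_probe and probe_accuracy from the sketch above."""
    results = {}
    for layer, train_acts in train_acts_by_layer.items():
        w = fit_mass_mean_probe(train_acts, train_labels)
        results[layer] = probe_accuracy(w, test_acts_by_layer[layer], test_labels)
    return results
```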

Figure 7. Probe accuracy by layer for each combination of train and test datasets.

Fine-Tuning

“Egregiously False” Facts  I fine-tuned on 3 of the “egregiously false” facts from the SDF paper, shown in the table below[2].

| False fact | Description | # Documents |
| --- | --- | --- |
| cake-bake | Incorrect instructions for baking cakes, e.g. use frozen butter in batter. | 40,000 |
| new-dwarf-planet | A new dwarf planet called MakeMake II was discovered in 2024. | 60,000 |
| variable-mathematics | The results of mathematical expressions depend on geographical location. | 80,000 |

Control Datasets  To avoid confounding factors, I also fine-tuned on three control datasets representing different levels of truthfulness. For each dataset, I sampled 80,000 training examples. This does not exactly match the token count of the synthetic documents, but it is comparable and sufficient to show trends.

| Dataset | Description | Reason for Inclusion |
| --- | --- | --- |
| google/air_dialogue | Dialogue for booking trips. | Factually neutral information. |
| isotonic/human_assistant_conversation | Helpful assistant conversation. | Truthful information. |
| open-r1/OpenR1-Math | Math problems. | Domain shift. |

Implementation

Base Models  The Llama-3 family was used for all experiments. Specifically, two models were fine-tuned for each false fact: 8B and 3B. For the control datasets, only the 8B model was used.

Parameters  Fine-tuning was done with LoRA (r=64, α=128) for one epoch. The full list of parameters is shown in the appendix.
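
For reference, this is roughly what that configuration looks like with the Hugging Face peft/transformers stack. The hyperparameters are taken from the appendix table; the model name and output directory are placeholders, and the actual runs used a script adapted from the SDF repo:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Placeholder base model; the 3B variants were fine-tuned the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=64,             # LoRA rank
    lora_alpha=128,   # alpha scaling factor
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="sdf-lora",  # placeholder
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    warmup_steps=0,
)
```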

Belief Datasets

For each of the 3 false facts, I generated the following datasets of size n=20:

  • Downstream beliefs: Beliefs that are expected to become more likely after the reference belief R is implanted via SDF. R makes a given belief B more likely, but does not have to imply B with certainty, i.e. P(B) < P(B|R) < 1.
  • Negated beliefs: Negations of the downstream beliefs, intended to test whether they become less likely, i.e. P(¬B|R) < P(¬B).

Additionally, I generated the following datasets as controls:

  • Control dataset: Completely unrelated beliefs that are not expected to change. I.e., P(B|R) ≈ P(B).
  • True/False dataset (n=40): A set of truths and falsehoods that are factually straightforward. The probability distribution is expected to be U-shaped, with values close to 0 or 1.

Examples are shown below:

| Belief | Example |
| --- | --- |
| Reference | A new dwarf planet was discovered in 2024. |
| Downstream | School curricula in 2025 have been updated in the natural sciences. |
| Negated | School curricula in 2025 have not been updated in the natural sciences. |
| Control | Sloths can hold their breath longer than dolphins. |
| True | The Great Wall of China is in China. |
| False | Shakespeare wrote The Hunger Games. |

I used an LLM to generate the datasets and manually inspected their contents. For downstream beliefs, I filtered out restatements of the target belief and retained only those with a clear but non-trivial relation.
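
In the results below, I treat a statement's prior as the probe probability under the base model and its shift as the change in that probability after fine-tuning. A minimal sketch of this scoring step; `get_activations` is a placeholder for extracting the chosen-layer activation vector of a statement:

```python
import numpy as np

def belief_scores(w, statements, model, get_activations):
    """Probe probability for each statement under a given model.

    get_activations(model, statement) is a placeholder that returns the
    chosen-layer activation vector for the statement.
    """
    acts = np.stack([get_activations(model, s) for s in statements])
    return 1.0 / (1.0 + np.exp(-acts @ w))

# priors = belief_scores(w, downstream, base_model, get_activations)
# shifts = belief_scores(w, downstream, sdf_model, get_activations) - priors
```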

Results

Effect of SDF on True/False Distributions

| | Fine-Tuning Dataset | 3B: True | 3B: False | 3B: Δ | 8B: True | 8B: False | 8B: Δ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | None | 0.94 | 0.08 | 0.86 | 0.90 | 0.02 | 0.88 |
| False Facts | cake_bake | 0.94 | 0.18 | 0.76 | 0.83 | 0.04 | 0.78 |
| | new_dwarf_planet | 0.96 | 0.17 | 0.79 | 0.76 | 0.03 | 0.73 |
| | variable_mathematics | 0.92 | 0.14 | 0.78 | 0.70 | 0.03 | 0.67 |
| Control (8B only) | air_dialogue | – | – | – | 0.91 | 0.04 | 0.88 |
| | human_assistant | – | – | – | 0.92 | 0.03 | 0.89 |
| | OpenR1-Math | – | – | – | 0.80 | 0.05 | 0.75 |

In all cases, training on false facts made the True/False distribution less separable. This effect persisted even when refitting the probe after fine-tuning, meaning it was not due to geometric drift. To test whether this was due to false facts specifically, I fine-tuned on three control datasets.

For air_dialogue (factually neutral) and human_assistant (truthful), the separation remained essentially unchanged. For OpenR1-Math, the difference dropped by 0.13, possibly due to the domain shift. This drop is comparable to training on false facts, where the difference fell by 0.15 on average for the 8B models and 0.08 for the 3B models.

These results suggest that exposure to false facts reduces the separability of true and false statements, while fine-tuning on neutral or truthful data leaves the distribution essentially unchanged; a strong domain shift (as with OpenR1-Math) can also affect how truth is represented.

Belief Propagation  

| Dataset | Belief Type | 3B: Prior | 3B: Shift | 8B: Prior | 8B: Shift |
| --- | --- | --- | --- | --- | --- |
| cake_bake | Target | 0.27 | +0.53 | 0.09 | +0.38 |
| | Downstream | 0.48 | +0.27 | 0.47 | −0.02 |
| | Negated | 0.40 | −0.06 | 0.40 | −0.05 |
| new_dwarf_planet | Target | 0.94 | +0.05 | 0.85 | +0.09 |
| | Downstream | 0.82 | +0.05 | 0.68 | −0.06 |
| | Negated | 0.38 | 0 | 0.29 | −0.03 |
| variable_mathematics | Target | 0.68 | +0.21 | 0.40 | +0.54 |
| | Downstream | 0.64 | +0.10 | 0.44 | +0.16 |
| | Negated | 0.59 | −0.22 | 0.54 | −0.26 |

I next tested whether SDF propagates to beliefs related to the target. Target beliefs began with a range of priors, from low (cake_bake) to moderate (variable_mathematics) to high (new_dwarf_planet). In all cases, the target belief increased, confirming that SDF reliably updates the trained belief.

Changes can propagate to related beliefs, but not always. In the 3B models, propagation occurred in all three experiments: downstream beliefs increased as expected, and negations decreased for all except new_dwarf_planet, where they remained unchanged. This is possibly because the downstream beliefs for new_dwarf_planet already had high priors and shifted only slightly. In the 8B model, propagation was observed only for variable_mathematics; for cake_bake and new_dwarf_planet, related beliefs showed little change even when the target belief itself increased.

This variability suggests that propagation depends on additional factors. In particular, the 8B model appears more resistant to ripple effects, with changes that remain more specific to the target belief.

Effect of Priors

Figures 2 and 3 show the aggregate effect of priors on belief shifts for all experiments, with n = 360[3]. Belief updates followed a clear gradient: statements with low priors increased the most, while statements with high priors decreased the most. Control statements showed no systematic trend. These results show that priors near 0 or 1 do not make beliefs stable; instead, such beliefs are more likely to shift strongly after fine-tuning. The pattern was confirmed with Spearman’s ρ between each statement’s prior and its shift, which averaged −0.67 for Downstream and −0.39 for Negated.
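
The statistic here is Spearman's rank correlation between each statement's prior and its shift; a minimal, self-contained sketch with SciPy (the arrays are illustrative, not the actual data):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative values; in the analysis these are the priors and shifts of the
# downstream (or negated) statements across the fine-tuned models.
priors = np.array([0.10, 0.35, 0.60, 0.90])
shifts = np.array([0.45, 0.20, -0.05, -0.40])

rho, p_value = spearmanr(priors, shifts)
print(f"Spearman rho between prior and shift: {rho:.2f}")
```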

Limitations

This investigation has several limitations:

  • Model scale: Marks et al. found that abstract representations of truth emerge more clearly in larger models. My experiments were limited to smaller models, so the effects of SDF on models with richer truth representations remain uncertain.
  • Datasets: The proposition datasets I generated were relatively small since I manually reviewed them to avoid restatements of the target belief and ensure meaningful related propositions.
  • Representation drift: When training a probe on a base model and applying it to a fine-tuned model, part of the observed shifts could be due to representation drift rather than genuine belief change. To check this, I refit probes after fine-tuning and found the cosine similarity between the original and refit probe directions to be consistently above 0.8 (see the sketch below).
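
A sketch of that drift check, comparing the probe direction fitted on the base model against one refit on the fine-tuned model's activations:

```python
import numpy as np

def probe_cosine_similarity(w_base, w_refit):
    """Cosine similarity between the base-model probe direction and the probe
    refit on the fine-tuned model; values near 1 indicate little drift."""
    return float(np.dot(w_base, w_refit) /
                 (np.linalg.norm(w_base) * np.linalg.norm(w_refit)))
```
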
Conclusion

This project employed probing to understand the effects of fine-tuning on false facts. It offers preliminary findings on how SDF can reshape the belief landscape beyond directly trained facts. Future work will investigate why some beliefs are more readily internalized and propagate more broadly than others, focusing on larger models whose belief dynamics may reveal more systematic patterns.

Appendix

Probing Datasets

An example from each dataset used for probing. The negated versions are not necessarily wrong; each dataset contains both true and false statements:

| Dataset | Example | Negated |
| --- | --- | --- |
| sp_en_trans | The word ‘y’ means ‘and’ | The word ‘y’ does not mean ‘and’ |
| cities | Ajmer is in India | Ajmer is not in India |
| smaller_than | 41 is smaller than 40 | 41 is larger than 40 |

Fine-Tuning Parameters

| Parameter | Value |
| --- | --- |
| epochs | 1 |
| learning rate | 1e-5 |
| train batch size | 2 |
| r | 64 |
| α | 128 |
| warmup steps | 0 |
| dropout | 0.05 |

  1. ^

    How Can You Tell if You've Instilled a False Belief in Your LLM? discusses why it's useful to make LLMs believe false things.

  2. ^

    The documents are publicly available. The fine-tuning script was adapted from the SDF repo.

  3. ^

    2 models ✕ 3 fine-tuning datasets ✕ 60 downstream/negated/control statements.




why america can't build ships

LessWrong.com News - December 6, 2025 - 03:35
Published on December 6, 2025 12:35 AM GMT

the Constellation-class frigate

Last month, the US Navy's Constellation-class frigate program was canceled. The US Navy has repeatedly failed at making new ship classes (see the Zumwalt, DDG(X), and LCS programs) so the Constellation-class was supposed to use an existing design, the FREMM frigate used by Italy, France, and Egypt. However...

once the complex design work commenced, the Navy and Marinette had to make vast changes to the design in order to meet stricter U.S. survivability standards.

Well, ship survivability is nice to have, but on the other hand, this is what a single torpedo does to a destroyer. So how does that requirement creep happen? Here's an admiral saying "DO NOT LOWER SHIP SAFETY STANDARDS" and linking to this document, but the document doesn't have any hard rules; it just says "here are some things to consider doing" and "you must evaluate whether there are cost-effective survivability improvements to make". People say "I'm just following the rules" whenever they get criticized, but it's actually a judgment call from the leadership, and the leadership has bad judgment. This post says:

Shock-hardening for near-miss torpedo protection, a chemical, biological, and radiological (CBR) citadel, thicker bulkheads for Arctic operations, and the decision to install the heavier SPY-6 version 3 radar instead of the FREMM’s lighter system all contributed to the bloat. Engineering modifications show the hull stretched 7 meters and the beam widened 0.6 meters, yet commonality with the parent design has crashed from 85% to barely 15%.

I heard there were also requests to duplicate and reroute pipes and wires for more redundancy. Anyway, the US Navy can't figure out how to design a new ship, but there's a larger underlying issue: US shipbuilding is very expensive, yet there's also a backlog of US Navy orders. A new Burke-class destroyer is more than $100/lb. That's about as much per mass as a new Lamborghini Temerario, and 25x as much as the biggest cruise ship. There's a bunch of expensive equipment on them, but still, Korea's copy of the same (inefficient) design is 1/3 the cost.

US shipbuilding

What's the problem with US shipbuilding, then? Well, I've recently seen a few super helpful articles, like this:

Amid the recent wave of optimistic headlines about American shipbuilding, one challenge continues to cast a long shadow: how to attract young, ambitious workers to an industry where entry-level wages can start as low as $35,000 a year. Yet for those who stick with it and build their skills, earnings can climb to five times that amount or more.

Aha! American labor costs are just too high! But wait...it says America has 105k shipbuilding workers now? So...

  • USA: ~105k workers
  • Korea: ~140k workers
  • Japan: ~72k workers

So, Korea and Japan are building over 100x as much ship per worker-year as the US. Yeah, that article is propaganda: it's a PR piece that amounts to "give us more money, we promise to use it to make good jobs for americans". By the way, that low pay for entry-level workers is because the union screws over new people to maximize pay for people with seniority. Aircraft pilot unions do the same thing.

Why is US shipbuilding so much less efficient? That's because it's using worse processes and equipment, of course, but what exactly are the differences?

The modern approach to ship construction is to make slices inside a building, use a big crane to lift them to a big drydock, and weld them together. Here's a timelapse of that basic approach being used in Korea. For comparison, here's a timelapse of the USS Gerald Ford in a drydock. Note that the cranes lift relatively small pieces, and that it sits in a drydock for 4 years.

So, why can't US shipyards do that? Obviously, the Korean approach requires large buildings, heavy cranes, and large drydocks. Maybe US shipyards fell behind on capital investment, but if they get money specifically for shipyard upgrades, presumably they can catch up. Well, that's what Congress figured, and the US government has already put a bunch of money into shipyard upgrades, partly through the $21 billion Shipyard Infrastructure Optimization Program ("SIOP").

For comparison, the market cap of HD Hyundai Heavy Industries is about $31 billion. (And if you were wondering, its debt/equity is low.) Also, NAVSEA has an annual budget of ~$30 billion and more personnel than Japan has shipbuilders. So there was plenty of money to build shipyards with far more output than the US has now - but SIOP started in 2018, there was other federal money for shipyards before that, and somehow it hasn't solved the problem. Perhaps because the people who got that money don't want to solve the problem - and also don't know how to, but that's a secondary issue.

Corporate executives want to cut costs in the short term and move to a better job before longer-term impacts hit. That's what Jack Welch did repeatedly at GE. Unions want to maximize pay for senior members and ban automation that reduces their paid hours. That's what the US port worker union did.

Again, the US Navy has a backlog of orders, which means none of the shipbuilders can have their orders reduced if they do a bad job, which means there's no competition. And when there's no competition, the only options are:

  1. start buying from other countries so there's competition again
  2. offer deals to foreign shipbuilders that get them to operate in the US
  3. nationalize the shipbuilders, and try to find competent management (so, not NAVSEA)
  4. get bad results at high prices, leaving you right at the edge of not buying anything

Or, they could give someone like me a billion dollars to make a new shipyard, but let's be semi-realistic here. The US government has been going with (4) but concerns about China and Taiwan now have it considering (1), and (2) is already happening.

In addition to China and Taiwan, there's another reason this is an interesting time for the shipbuilding industry: welding. A lot of the power of shipbuilder unions comes from the expertise of their welders being hard to replace quickly. But as this video notes, while new laser welders are EXTREMELY DANGEROUS, you can train people to get good welds with them very quickly compared to traditional methods. They also weld stuff much faster. I think this makes shipbuilding unions relatively vulnerable now.

business cultures & corporate governance

Compared to US executives, Asian business leadership has been much more willing to make big long-term investments. I think this is related to managers staying at the same company longer, and to a higher cultural value placed on managers having engineering knowledge. But culture is complex; the immediately visible aspects are only the tip of an iceberg consisting of a self-sustaining network of interactions. Trying to explain cultures in terms like "more individualist" or "longer-term thinking" is like the parable of the blind men and the elephant. Well, Americans think Asian cultures can be weird, Chinese and Japanese people think the same thing about American culture, and you could say everybody's right.

So, if you ask how you can make US management more like Asian management in some particular way, but not other ways, the answer is that you can't, it's just in a different equilibrium.

But there are some kinds of capital investment that US businesses have been downright eager to make, so let's consider those. Specifically, I'm thinking of:

  • datacenters for cloud services
  • datacenters for AI
  • Tesla "gigafactories"
  • big software projects

What's the difference between datacenters and shipyard upgrades or factory equipment? I think it's transparency to investors. When companies buy billions of dollars of AI chips, lose money continuously, and keep getting investment, what are investors looking at? Some investors are looking directly at the amount of AI chips owned; most other capital investments are too complex, and thus too opaque, to evaluate that way.

As for Tesla, some investors seem to have trust in Elon Musk personally, that if you give him money he'll use it to build stuff that makes sense. Similarly, Mark Zuckerberg still has voting control of Meta, but investors don't seem to mind that he can do whatever he wants with the whole company.

In theory, investors are supposed to vote for corporate boards who figure that stuff out and make transparent reports + incentive mechanisms, but in practice nobody has an incentive to do that:

  • For individual investors, their vote barely matters and isn't worth paying attention to.
  • Hedge funds aren't going to hold the stock long-term; if they have a problem with management they'll just sell the stock.
  • Big activist investors would generally rather take companies private so they can make a big difference than try to fight over board elections.
  • Index funds have no incentive to vote in good ways, because every vote they make affects their competitors equally. This might lead to US corporate governance gradually getting worse than it is now.

Then, there's software. How is a company making software different from one making big investments in factories or shipyards? Personally, I think it's largely about headcount. Managers like having a lot of people working under them, because it's a legible metric used for prestige and pay. There's also the fact that, for historical reasons, office workers who might have their work replaced by software are less unionized than people in heavy industry. I'm not sure how much credit I can give to the management here; rather than US software projects being planned well, it seems more like, if you hire enough programmers and have a suitable corporate culture, you just end up with some sort of software product.

Nippon Steel

One thing ships use a lot of is steel, and Nippon Steel recently bought US Steel. That implies 2 things:

  • They think it's worth making steel in the US; their advantage wasn't just being in Japan.
  • Nippon Steel had better management than US Steel.

I don't think of Japanese companies as having good management in general. In my experience, Japanese individuals and contract workers are quite professional and interested in improving their work-related skills. Probably more so than Americans. Yet Japanese wages are (on a per-hour basis) much lower than US ones, and I think that's largely because the management culture is overall even worse than in America. (And partly because of some large-scale embezzlement from Japanese corporations involving corrupt contracts to private companies, but that's beyond the scope of this post.) But in heavy industry like steelmaking, things involving these big long-term capital investments, Japanese companies seem to have a relative advantage, and I do think that's because of a management culture leading to longer time horizons and more emphasis on engineering.

By the way, a lot of Americans have the impression that Japanese employees work much longer hours, but that's not the case anymore. Government efforts have substantially reduced unreported overtime too; it's probably no higher than in America these days. (You can see Germany at the bottom of that chart; GDP per hour worked is actually higher in Germany than in the USA. Feierabend ist wichtig!)



