Review: "We can't disagree forever"
Some interesting results from We Can't Disagree Forever by Geanakoplos and Polemarchakis (1982) that changed how I think of Aumann Agreement, along with some toy examples.
AI assistance: Claude helped with early feedback, copyedits, and deadlines. All words and errors are my own.
Recap and definitions
Consider two ideal Bayesians with common priors. By Aumann's Agreement Theorem, if they have common knowledge of their current probabilities for some proposition, then their probabilities will be the same.
Geanakoplos and Polemarchakis show that if two such agents don't have common knowledge, they can attain common knowledge and thus agreement in a finite number of steps. They do this by repeatedly sharing their current probabilities and updating accordingly. This is the indirect communication equilibrium. The paper also shows that there are cases where the number of steps is large.
Another way of resolving disagreement is for both agents to pool all their information. Trivially this also results in agreement. This is the direct communication equilibrium.
Fair Dice Example
Alice and Bob are ideal Bayesians with common priors. Their priors include the following beliefs:
- Alice will roll a fair six-sided die today and Bob will not see the result.
- Bob will roll a fair six-sided die today and Alice will not see the result.
Let SAME be the event that Alice and Bob get the same result. Before rolling the dice, Alice and Bob have P(SAME) = ⅙.
After the dice are rolled, Alice and Bob continue to have P(SAME) = ⅙, and have common knowledge of this fact. Therefore they are already in an indirect communication equilibrium.
Alice and Bob improve their beliefs by sharing data:
- Alice: "My die landed on 6"
- Bob: "P(SAME) = 100%" (Bob's dice landed on 6)
There is now common knowledge that both dice landed on 6 and P(SAME) = 100%.
Takeaway: an indirect communication equilibrium is not necessarily a direct communication equilibrium. This is proposition 3 in the paper.
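This tiny example can be checked by brute-force enumeration. A minimal sketch (the world lists and the `p_same` helper are my own illustrative names):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (alice, bob) outcomes of the two fair dice.
worlds = list(product(range(1, 7), repeat=2))

def p_same(ws):
    """P(SAME) over a set of equally likely worlds."""
    return Fraction(sum(a == b for a, b in ws), len(ws))

print(p_same(worlds))   # 1/6, both before and after rolling

# Direct communication: Alice reveals she rolled a 6 ...
alice_said_6 = [w for w in worlds if w[0] == 6]
# ... and Bob, who also rolled a 6, conditions on both facts.
both_6 = [w for w in alice_said_6 if w[1] == 6]
print(p_same(both_6))   # 1, i.e. P(SAME) = 100%
```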
Almost Fair Dice Example
Alex and Bella are in a similar situation to Alice and Bob, but their dice are not quite fair. Each die lands on faces 1 through 6 with these probabilities:
- 16.4%
- 16.5%
- 16.6%
- 16.7%
- 16.8%
- 17.0%
Alex and Bella have an initial P(SAME) that is very slightly higher than ⅙, about 16.669%. If Alex rolls a 6, his P(SAME) increases to 17.0%. Since every face has a distinct probability, sharing his P(SAME) reveals which number he rolled. This changes the situation so that the indirect and direct communication equilibria coincide.
Takeaway: cases where proposition 3 applies are "rare" in some sense. This is proposition 4 in the paper.
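The numbers above can be verified in a few lines (a sketch; the face probabilities are copied from the list above):

```python
# Probabilities of faces 1..6 for the slightly unfair dice.
p = [0.164, 0.165, 0.166, 0.167, 0.168, 0.170]
assert abs(sum(p) - 1.0) < 1e-12

# Initial P(SAME): two independent rolls land on the same face.
p_same = sum(q * q for q in p)
print(round(100 * p_same, 3))   # 16.669, slightly above 1/6

# If Alex rolls a 6, P(SAME) = P(Bella also rolls a 6) = 17.0%.
# Because all six probabilities differ, each possible announced
# P(SAME) corresponds to exactly one face, revealing the roll.
print(round(100 * p[5], 1))     # 17.0
```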
Extremizing Example
Abigail and Benjamin are ideal Bayesians with common priors. Their priors include the following beliefs:
- There is a single coin that will be flipped twice today.
- There is a 50% chance it will be a heads-biased coin, in which case it will land heads with 90% probability.
- There is a 50% chance it will be a tails-biased coin, in which case it will land tails with 90% probability.
- The coin flip results are otherwise independent.
- Abigail will see either the first flip (50%) or the second flip (50%). She will be told which flip she saw. Benjamin will not.
- Benjamin will see either the first flip (50%) or the second flip (50%). He will be told which flip he saw. Abigail will not.
Let HEAD be the event that the coin is heads-biased. Suppose that Abigail and Benjamin both saw a heads result. Then before they communicate, they both have P(HEAD) = 90%.
Abigail and Benjamin already agree, but they don't have common knowledge. So they can Aumann Agree like this:
- Abigail: "P(HEAD) = 90%"
- Benjamin: "P(HEAD) = 93.956...%"
There is now common knowledge that P(HEAD) = 93.956...%; they reached the indirect communication equilibrium.
Takeaway: Sharing beliefs can improve beliefs even if you already agree.
The direct communication equilibrium is either 90% (if they saw the same flip) or about 98.78% (if they saw different flips).
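These posteriors can be checked by enumerating the discrete model (a sketch; the function and variable names are mine):

```python
from itertools import product

def p_heads_biased(evidence):
    """Posterior P(coin is heads-biased), enumerating: bias (50/50),
    two flips (90/10 given bias), and which flip each of Abigail (a)
    and Benjamin (b) saw (50/50 each, independently)."""
    num = den = 0.0
    for bias, f1, f2, a, b in product("HT", "HT", "HT", (1, 2), (1, 2)):
        def p_flip(f):
            return 0.9 if f == bias else 0.1
        w = 0.5 * p_flip(f1) * p_flip(f2) * 0.25   # prior weight of world
        seen = {1: f1, 2: f2}
        if evidence(seen[a], seen[b], a, b):
            den += w
            num += w * (bias == "H")
    return num / den

# Both saw a heads; which flip each saw stays private: the indirect
# communication equilibrium.
indirect = p_heads_biased(lambda fa, fb, a, b: fa == fb == "H")
print(round(100 * indirect, 3))   # 93.956

# Pooling all information (direct equilibrium) splits into two cases.
same = p_heads_biased(lambda fa, fb, a, b: fa == fb == "H" and a == b)
diff = p_heads_biased(lambda fa, fb, a, b: fa == fb == "H" and a != b)
print(round(100 * same, 2), round(100 * diff, 2))   # 90.0 98.78
```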
My updates
Look at the structure of reasoning needed to reach the indirect communication equilibrium. It's not like a marketplace, where I bid 10%, you bid 90%, and we haggle till we reach consensus. Instead, you use my expressed beliefs as evidence about my hidden observations, and update based on this indirect observational evidence. Another way of putting it: you say what you believe, and I infer what you might have seen to believe that. If my brain isn't doing that type of thinking, it's not doing Aumann Agreement, it's doing something else. Perhaps I'm updating from shared evidence, or negotiating a compromise.
This is also a good reminder that ideal Bayesians are super-intelligent. As a human mimicking the process, I'm likely to end up in a case where the indirect equilibrium is inferior to the direct equilibrium. I'm also likely to be unable to reach an indirect equilibrium at all, hitting a cognitive dead-end like: "So your credence is now 70%, which implies ... umm ... I got nothing". So sharing my actual observations and analysis ends up being key to reaching agreement.
Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT
Are LLMs truly reasoning step by step in their Chain-of-Thought — or just performing it?
TL;DR: We analyze the causal contribution of each reasoning step in a Chain-of-Thought (CoT) to evaluate its faithfulness with respect to the model's final prediction. Our findings reveal that while some steps are true-thinking steps, faithfully reflected in the model's internal computation and exerting strong causal influence on its prediction, the majority of CoT steps are decorative, exhibiting minimal causal impact and not genuinely used by the model during inference. Furthermore, our results suggest that "thinking" is encoded as a steerable latent variable in LLMs: by steering along a simple linear direction in the latent space, we can control whether the LLM internally engages with or disregards a verbalized CoT step during reasoning.
Full paper link: https://arxiv.org/pdf/2510.24941
Figure 1. We find that reasoning steps in CoT may not always be true thinking but can instead function as decorative thinking, where the model is not internally using those steps to compute its answer. Taking self-verification steps as an example (known as "Aha moments", where LLMs rethink their solution with phrases like "wait"), we randomly perturb the numerical values in the reasoning steps preceding the "Aha moment", and then re-prompt the model for the answer using the modified CoT. In the left example, although the model's self-verification reasoning is correct, it ignores it and outputs the wrong answer after perturbation. In the right example, the model follows its self-verification and produces the correct result.
Overview
Recent frontier language models have become increasingly capable of multi-step reasoning through what is now known as chain-of-thought (CoT). When solving complex problems, these models often generate very long reasoning traces that include apparent "aha moments", where the model pauses, refines, or self-verifies its solution with phrases such as "Wait, let's check again". There is a common belief that CoT provides a transparent record of the model's internal reasoning, functioning as a kind of scratch pad that reveals how it "thinks" internally. Under this view, inspecting the CoT allows us to monitor an LLM's thoughts and detect unsafe or incorrect intentions directly from its generation.
Our work re-examines the assumption of CoT faithfulness: the idea that each step verbalized in CoT genuinely corresponds to a computation used by the model to reach its final answer. Leveraging the Average Treatment Effect framework, we design the True-Thinking Score (TTS) to analyze the causal impact of each step on the model's final output.
We find two distinct types of reasoning steps:
- true-thinking steps that causally determine the model’s prediction
- decorative-thinking steps that merely give the appearance of reasoning but contribute little to the outcome.
In practice, these two types of steps are interleaved in CoT, meaning that only a small subset of steps truly drive the model’s ultimate reasoning results. This also contradicts the hypothesis that the whole CoT is either rationalization or computation. Additionally, we observe that even self-verification steps (known as "Aha moments") can be decorative and not really used by the model in its internal thinking process.
We further identify a TrueThinking direction within the model’s latent space that mediates whether it actually engages with a reasoning step. Steering the hidden states of a step along this direction increases the model’s internal reliance on that step, while steering in the opposite direction suppresses it. This shows that "thinking" in LLMs may correspond to a steerable latent signal embedded within the model’s representation space.
Taken together, these findings reveal that language models often verbalize reasoning they do not internally perform. This gap between verbalized and internal reasoning has important implications for interpretability. It suggests that progress toward trustworthy reasoning will require moving beyond what models say about their reasoning to understanding what they actually compute beneath the surface.
Measuring Step-wise Causality for Faithfulness in Reasoning
Figure 2. (a) Illustration of different modes of thinking steps within chain-of-thought (CoT) reasoning. Contrary to the naive view that a step's faithfulness depends solely on whether perturbing it directly changes the final result, we show that the relationship is more nuanced. A true thinking step may operate in either an AND or OR mode when interacting with other steps. In both cases, such steps contribute meaningfully to the final answer. (b) Based on this understanding, we define the True Thinking Score, which jointly considers two complementary evaluations: the necessity test (high for AND-like steps) and the sufficiency test (high for OR-like steps).
Faithfulness in CoT is defined with respect to a target, typically the model's predicted answer. A lack of faithfulness arises when the model claims to rely on steps A, B, and C in its CoT, but internally disregards them (relying instead, e.g., on other shortcuts or biases to compute answers). In this case, those steps make no causal contribution to the prediction.
Formally, we measure the causal contribution of each reasoning step s in CoT to the final answer y. A step with genuine causal impact is a true-thinking step: the model indeed internally thinks through s in order to produce y. By contrast, a step with little causal impact is a decorative-thinking step: the model merely verbalizes a line of reasoning without using it internally. Some past works provide suggestive evidence, especially in QA, that CoTs are not always faithful. We delve further into step-wise analysis of CoT in complex mathematical reasoning.
We propose to measure the step-wise causality to probe whether the model is faithfully thinking as verbalized in its reasoning traces in CoT. Crucially, a true-thinking step can contribute in two distinct ways.
- Conjunctive ("and"): a step s and the other steps before it (its context, denoted C) jointly determine the answer, as in many enumeration problems where all steps are important. Removing or corrupting s will then flip the model's initial prediction. This is the regime primarily tested by prior work, which infers faithfulness from the necessity-in-context effect of perturbing s alone.
- Disjunctive ("or"): either s or its context C suffices to produce the correct answer; for example, s is a verification step or an alternative solution for the results already established in C. Here, perturbing s may leave the model's prediction unchanged because C still carries the solution. Prior works that only consider necessity may mislabel s in this case as "unfaithful" despite its genuine contribution.
To measure both roles, we extend the Average Treatment Effect (ATE), a causal evaluation framework, with two complementary interventions conditioned on the context C (the steps before s): a necessity test that measures the change in the model's confidence before and after perturbing s under an intact C, and a sufficiency test that perturbs s under a corrupted C. Averaging them yields our True-Thinking Score (TTS).
True-Thinking Score (TTS)
We define the faithfulness score of a step s with respect to the final result as the average of the unsigned necessity and sufficiency effects on the model's confidence in its answer.
Specifically, we measure the unsigned effects. The sign of an effect reflects whether the step is helpful or harmful, but we are interested in how much the model truly thinks through the step in its internal computation, regardless of direction. Taking the absolute value thus captures the magnitude of a step's causal effect and provides a broader measure of its importance. To measure the prediction after each step, we use early-exit answering, inserting a cue ("\nThe final result is"). For perturbation, we inject small random offsets into the numbers of a reasoning step, ensuring the step remains only minimally altered.
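A minimal sketch of that kind of numeric perturbation (the regex approach and the `max_offset` knob are my assumptions, not the paper's implementation):

```python
import random
import re

def perturb_numbers(step: str, max_offset: int = 3) -> str:
    """Add a small nonzero random offset to every integer in a step,
    leaving the surrounding text untouched."""
    def shift(match):
        return str(int(match.group()) + random.randint(1, max_offset))
    return re.sub(r"\d+", shift, step)

random.seed(0)
print(perturb_numbers("Wait, let's check: 12 + 7 = 19."))
# Every number is shifted by 1..3, so the step is minimally altered
# but its arithmetic no longer holds.
```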
Overall, a smaller TTS indicates that the step has little causal influence on the model's prediction, where perturbing or keeping it leads to almost the same result, so the step is more likely to be decorative.
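As a rough sketch of the scoring scheme described above (the function signature and the confidence values are illustrative assumptions, not the paper's exact notation):

```python
def true_thinking_score(conf_intact, conf_intact_perturbed,
                        conf_corrupted, conf_corrupted_kept):
    """Average of the unsigned necessity and sufficiency effects on the
    model's confidence in its answer (illustrative paraphrase of TTS).

    - necessity: perturb the step while its context stays intact
    - sufficiency: keep the step while its context is corrupted
    """
    necessity = abs(conf_intact - conf_intact_perturbed)
    sufficiency = abs(conf_corrupted_kept - conf_corrupted)
    return 0.5 * (necessity + sufficiency)

# A decorative step barely moves confidence under either test:
print(round(true_thinking_score(0.92, 0.91, 0.40, 0.41), 3))   # 0.01
# A true-thinking step moves it a lot:
print(round(true_thinking_score(0.92, 0.30, 0.10, 0.85), 3))   # 0.685
```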
Evaluation Results
The distribution of TTS is long-tailed
We find most steps have low scores, while only a few have very high scores. For example, as shown in Figure 3, on the AIME dataset with Qwen-2.5, the mean is around 0.03. Only 6.4% of CoT steps achieve a TTS greater than 0.3, and merely 2.3% exceed 0.7. This suggests that only a handful of verbalized steps in CoT are critical and faithfully followed by the model, whereas many others may not reliably reflect the model's true inner thinking.
Figure 3. The dataset-level distribution of the TTS score on AIME.
Reasoning steps with high and low TTS are interleaved in a CoT
Figure 4 illustrates that steps with high TTS can appear at different positions, though later steps are on average more likely to be true-thinking, with higher TTS. This raises concerns about the reliability of monitoring LLMs by inspecting CoT, since individual steps may not always reflect the model's true internal reasoning, or may not be performed internally at all. Additionally, our results imply that task difficulty does not necessarily lead to more faithful reasoning: even on the AIME dataset, which challenges recent frontier models, LLMs still produce many decorative-thinking steps in CoT. These results challenge the hypothesis that LLMs tend to produce more faithful reasoning on harder problems.
Figure 4. An example CoT case for TTS, and the average TTS at different step percentiles (normalized).
Self-verification steps can be decorative
We further leverage our defined TTS to evaluate whether LLMs are truly thinking at self-verification steps (often known as "aha moments"). To identify decorative self-verification, we scan the self-verification steps and compute the TTS of each, labeling a step decorative when its TTS falls below a threshold. Notably, we observe cases where self-verification steps have near-zero TTS: around 12% of the self-verification steps for Qwen-2.5 have TTS lower than 0.005, and 21% for Nemotron. We also find that perturbing the context steps before such a step can flip the model's initially correct answer to a wrong one, even though the self-verification step contains ample information to lead the model back to the correct answer. Overall, these self-verification steps contribute minimally to the model's computation of its answer, as the model's confidence in its answer remains nearly unchanged before and after perturbing them.
Figure 5. An example where each step in self-verification has near-zero TTS (smaller than 0.005).
Mediating LLMs' Internal Thought via Steering
Figure 6. We uncover the TrueThinking direction in LLMs, extracted as the difference between the mean hidden states of representative true-thinking steps and decorative-thinking steps. Steering the hidden states of each token in a step along this direction makes the model truly think over that step in latent space when deciding the prediction result.
The TrueThinking direction in LLMs
We extract a linear direction in the latent space of LLMs between true-thinking steps (those with causal impact on the final answer) and decorative-thinking steps (those with little or no impact). We call this latent vector the TrueThinking direction. We find it can control whether the model truly thinks through a reasoning step and performs it internally.
We first detail the steering methodology. We focus on the residual-stream activation at the last token position of a step s at a layer l. At that layer, we collect the hidden states of the most representative true-thinking steps (those with TTS ≥ 0.9) and decorative-thinking steps (those with TTS ≈ 0). Following the difference-in-means approach, we compute the TrueThinking direction as the mean shift from the decorative set to the true-thinking set in the latent space.
For steering at test time, we modify the residual-stream hidden states of a test step by activation addition at a single layer: we add the (scaled) TrueThinking direction to the hidden state of every token in the step.
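The difference-in-means extraction and activation addition can be sketched as follows (the hidden states here are random stand-ins; the hidden size, step counts, and `alpha` strength are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16   # hypothetical hidden size

# Hypothetical last-token hidden states at one layer: rows are steps.
h_true = rng.normal(0.5, 1.0, size=(8, d))    # steps with TTS >= 0.9
h_decor = rng.normal(-0.5, 1.0, size=(8, d))  # steps with TTS near 0

# Difference-in-means: the TrueThinking direction.
v = h_true.mean(axis=0) - h_decor.mean(axis=0)

def steer(step_hidden, direction, alpha=1.0):
    """Activation addition: shift every token's hidden state in the
    step along the direction; alpha is an assumed steering strength."""
    return step_hidden + alpha * direction

tokens = rng.normal(size=(5, d))   # hidden states of a 5-token test step
steered = steer(tokens, v)
print(steered.shape)               # (5, 16)
```

Negating `alpha` steers in the reverse direction, which is what the Disengagement Test below relies on.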
Causal tests for steering directions
We design two steering tasks to investigate the mechanism of LLMs' thinking in CoT.
Engagement Test: Can steering make the model think through a step in CoT it normally ignores?
We consider cases where the model predicts the correct answer both without the step and with a perturbed version s′ of the step included; that is, the model can reach the ground-truth answer while ignoring s′. If applying the TrueThinking direction to the hidden states of s′ flips the model's correct answer to an incorrect one, this indicates that the intervention has forced the model to reason over s′, following the errors injected into it.
Disengagement Test: Can steering in the reverse direction make the model disregard a step internally?
Now consider cases where the model predicts the correct answer before step s, but including a perturbed step s' causes it to fail. If applying the reverse TrueThinking direction to s' flips the wrong answer back to the correct one, then the intervention has made the model disregard the step s'.
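The selection criteria and flip checks for the two tests can be summarized in a toy sketch (the `engagement_flip`/`disengagement_flip` helpers and the example answers are hypothetical, standing in for full model runs):

```python
# Toy illustration of the two causal tests. Each helper returns True/False
# for a flip on an eligible case, or None if the case is not eligible.
def engagement_flip(pred_without, pred_perturbed, pred_steered, truth):
    # ET selection: correct without the step AND with the perturbed step
    if pred_without == truth and pred_perturbed == truth:
        return pred_steered != truth   # steering forced use of the injected error
    return None

def disengagement_flip(pred_before, pred_perturbed, pred_steered, truth):
    # DT selection: correct before the step, wrong once the perturbed step is included
    if pred_before == truth and pred_perturbed != truth:
        return pred_steered == truth   # reverse steering suppressed the step
    return None

flips = [engagement_flip("42", "42", "41", "42"),
         disengagement_flip("7", "9", "7", "7")]
eligible = [f for f in flips if f is not None]
flip_rate = sum(eligible) / len(eligible)
print(flip_rate)  # both toy cases flip
```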
Results

Table 1. Top-1 flip rate among all layers on the Engagement Test (ET) and Disengagement Test (DT). We use flip rate as the metric, measuring how often steering changes the model's initial prediction. The AMC dataset is the in-domain evaluation from which the TrueThinking directions are extracted, while the other two datasets are for out-of-domain evaluation.

LLMs encode a steerable latent signal of "thinking"
We find that a simple linear direction can mediate whether LLMs truly reason over a verbalized step. As shown in Table 1, steering with the (reverse) TrueThinking direction reliably flips predictions in both tests. In the Disengagement Test, it effectively prevents the model from using the perturbed step s', with effects far stronger than those of random vectors. This shows that the suppression of step use in the Disengagement Test arises from a meaningful signal rather than added noise, confirming that the TrueThinking direction captures a genuine internal representation of thinking.
Figure 7. Layer-wise results of steering with the TrueThinking vector. In the Engagement Test, stronger intervention is reflected by lower accuracy (more right→wrong flips); in the Disengagement Test, by higher accuracy (more wrong→right flips). The TrueThinking direction is extracted on AMC and applied to MATH and AIME.
On the other hand, our experiments across datasets show that the latent signal controlling whether a step engages in reasoning is universal. As seen in Table 1, the TrueThinking direction extracted on AMC generalizes well to other datasets across all models, indicating a model-internal mechanism of thinking rather than a dataset-specific artifact. For instance, in the Qwen model, layers 15-22 consistently yield the strongest intervention performance across all three datasets (Figure 7), suggesting these intermediate layers may be responsible for latent reasoning.
Figure 8. Normalized attention scores of the step in the Engagement Test and the Disengagement Test before and after steering. The steering direction is applied to Layer 22 in the Engagement Test and Layer 17 in the Disengagement Test. (a-b) Applying the TrueThinking direction to a step increases the model's attention to it. (c-d) Applying the reverse TrueThinking direction decreases the model's attention.

Steering with the TrueThinking direction mediates LLMs' attention
We find that steering along the TrueThinking direction increases attention to the step (see examples in Figure 8(a-b)), suggesting that TrueThinking direction may control the model's internal reasoning process by reallocating attention among tokens. In the Disengagement Test, steering in the reverse TrueThinking direction reduces attention as shown in Figure 8(c-d), making the model disregard those perturbed tokens. On the other hand, directly scaling attention on step tokens in a layer does not always yield noticeable effects. As shown in Table 1, in the Disengagement Test, masking attention (i.e., setting coefficients to 0) at a layer can partially flip answers, but in the Engagement Test, its impact is weak, suggesting that attention alone does not drive or suppress LLMs' reasoning.
We propose a step-wise causality framework to evaluate CoT faithfulness, revealing that true-thinking and decorative-thinking steps are interleaved: only a small subset are true-thinking that causally influence predictions, whereas most are decorative-thinking that merely create the appearance of reasoning and have minimal causal impact on predictions. Mechanistically, we demonstrate that whether a reasoning step in CoT contributes to a model’s computation can be controlled by a TrueThinking direction, enabling causal steering for the model to either follow or disregard that step in its internal thinking process. Overall, our findings show that many steps in CoT do not faithfully reflect an LLM’s internal thinking: models may verbalize reasoning they do not actually perform. This raises concerns about both the efficiency of LLMs' reasoning and the reliability of relying on CoT to monitor LLMs for safety. Additionally, our work points toward the need for training objectives that better align models’ externalized CoT with their true internal reasoning.
More broadly, our work implies a potential risk of AI deception: can LLMs verbalize steps that they deliberately disregard internally? We have shown that true thinking is encoded as a steerable latent variable in LLMs, which means the status of true thinking is, in principle, controllable. Can we find a case where an LLM verbalizes that it will follow safety codes, while internally that verbalization is decorative and not truly considered? Similarly, can LLMs unfaithfully justify generations that they internally know are harmful? We leave this investigation to future work. It also remains unclear what conditions trigger decorative thinking versus true thinking. By understanding this, we may develop ways to improve the faithfulness of LLMs' verbalization.
Limitations

Our causal evaluation framework is inherently approximate. It is greedy in nature and may not capture all possible causal pathways, nor does it aim to reconstruct a complete causal graph of reasoning steps. Thus, it should be viewed as a probe that highlights representative true-thinking and decorative-thinking steps rather than a definitive oracle of internal reasoning. In addition, the TrueThinking direction we extract may not be optimal. We regard our findings as an existence proof that internal thinking can be mediated by steering directions, and we leave the development of more effective directions and a deeper understanding of their geometry to future work. We cannot experiment on larger frontier models due to limited computational resources, and our findings may therefore not fully generalize to those untested settings. Nonetheless, by demonstrating effectiveness across several accessible models, we establish a general evaluation framework for analyzing and interpreting the thinking process in CoT.
Our TTS computation can be costly, as it requires multiple runs per step. However, in this work we do not aim to propose an efficient real-time detector; our first goal is a theoretically sound way to reveal whether steps in CoT are faithful. Future work can leverage the TrueThinking direction to construct a latent monitor by comparing it against the hidden states.
Why I Think Pause is Impossible
Note: I deeply believe in trying to figure out how to make AI go well for humanity. If you've read other things I've written you'll notice I am too dumb to figure out solutions[1], but on occasion I think I see gaps in proposed solutions from others. I am not writing this from some weirdo motivation; I simply want to encourage debate and rigorous thinking on the very popular idea of pausing AI development, because I think it has major vulnerabilities.
Why Pause?

The idea of a global pause feels natural: we can see that AI is changing everything, and no one can deny that there is a chance things could go very poorly, so wouldn't it be good to pause to give us more time to figure things out? Depending on how you view the stakes, the idea could even be viewed as a moral obligation. Unfortunately, I think pause is almost certainly an impossibility due to the structure of the game being played. Pausing ends up as a robustly dominated strategy that no rational self-preserving actor can choose, regardless of how dangerous they think superintelligence might be.
For pause to work, four things would need to be true, and none of them are: the game would need to continue indefinitely, compliance would need to be verifiable, pausing would need to be decision-theoretically rational, and uncertainty would need to favor caution. I will work through each of these in turn.
The Game Does Not Continue Indefinitely

Every call for a coordinated pause, whether a voluntary moratorium, a binding treaty, or a formal non-proliferation agreement[2], implicitly relies on the Folk Theorem of repeated games. The concept is simple: cooperation is maintained because the discounted value of continued play in the future is greater than the temptation to cheat. This is the basic idea that explains why we didn't all die in a nuclear apocalypse during the Cold War, and, fingers crossed if you are reading this, still haven't. Everyone understands that this game doesn't end.
The race to superintelligence is different because it has an "absorbing" state, a state that once entered ends the game forever. In this case, the game could end forever because the first to superintelligence could provide omnipotent, unipolar control. However, we don't actually need this strict a condition; we only need a weaker threshold property: that there exists some capability level θ* such that, for capabilities beyond θ*, the strategic advantage becomes self-reinforcing.
Any actor with AI like this can improve faster, beat their competitors more thoroughly, etc. in a way that trailing actors can't keep up with, eventually causing the trailing actors to discontinuously shift from "compete" to "capitulate". When that happens, the game is over. This violates the Folk Theorem, which requires the cumulative probability of the game continuing forever to be strictly positive.[3]
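Footnote 3's convergence argument can be checked numerically: if the per-period probability of someone reaching the absorbing state is bounded away from zero, the probability that the game continues forever collapses to zero; if it decays fast enough to be summable, it stays positive. A toy sketch (the 0.02 and 1/(t+2)² rates are illustrative, not estimates):

```python
import math

def survival(probs):
    """Probability the game never ends: product of (1 - r_t)."""
    return math.prod(1 - r for r in probs)

T = 500
constant = [0.02] * T                             # end probability bounded away from 0
decaying = [1 / (t + 2) ** 2 for t in range(T)]   # summable: sum of r_t is finite

print(survival(constant))   # tends to zero as T grows
print(survival(decaying))   # stays bounded away from zero
```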
This is immensely problematic because in any given period a rational actor can reach θ*, end the game, and win a payoff that is much larger than the result from continued cooperation. The whole apparatus that underpins nuclear deterrence dissolves on this ground alone for AI; it's just too beneficial to defect.
Compliance is Not Verifiable

Think about it: merely having a nuclear weapon doesn't win the game; using it does. And if an agent uses a nuclear bomb, all the other agents would know within minutes and launch retaliatory strikes. In other words, no one can secretly win the war. The path to superintelligence is different because crossing θ* is the decisive act, and it can be done in secret with a significant detection lag. If the detection lag is longer than the time it takes to solidify a self-reinforcing advantage, then any retaliatory mechanism is obsolete. Worse, this detection lag sits inside an unresolvable identification region, which makes no retaliation credible.
As I mentioned in my essay on canonical probabilities[4], general-purpose software natively contains "splice programs": research programs that look exactly like normal narrow-AI development or safety research until the moment they cross θ*. It's possible to show these scenarios are mathematically indistinguishable by comparing the observable data each generates. Let p be the measure over observable traces (compute usage, research, etc.) under genuine compliance, and q be the measure under covert defection. The Evidential Screening Property guarantees that the likelihood ratio between these models is bounded by a tiny algorithmic complexity penalty:
log(p(x_{1:n}) / q(x_{1:n})) ≤ B(n), where B(n) grows at most O(log n). The critical issue is that no monitoring scheme, no matter how detailed, can distinguish compliance from covert defection with canonical confidence before the threshold is crossed. Nuclear development has unambiguous evidence leakage (the dormancy defect δ, the rate at which covert programs leak observable evidence, is meaningfully greater than zero, so data can accumulate against covert nuclear programs and likelihood ratios grow over time), but AI development does not (its dormancy defect is near zero). Please check out my other essay if you want to see more of the reasoning behind this.
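The contrast between nuclear and AI monitoring can be illustrated with a toy Bernoulli-leak model (my own sketch, not the essay's formalism): each period a covert program leaks observable evidence with probability delta, while a compliant program triggers the same observable with a small false-positive rate. The expected log-likelihood ratio then grows linearly in time when delta is meaningfully above the false-positive rate, and stays flat when it is not:

```python
import numpy as np

def expected_llr(delta, periods, eps=1e-3):
    """Expected cumulative log-likelihood ratio for 'covert program' vs
    'compliant' after t periods, when each period leaks evidence with
    probability delta and compliant programs false-positive at rate eps.
    With delta <= eps the hypotheses never separate."""
    if delta <= eps:
        return np.zeros(periods)
    # per-period KL divergence between Bernoulli(delta) and Bernoulli(eps)
    kl = delta * np.log(delta / eps) + (1 - delta) * np.log((1 - delta) / (1 - eps))
    return kl * np.arange(1, periods + 1)

nuclear = expected_llr(0.05, 100)   # delta meaningfully > 0: evidence accumulates
ai = expected_llr(1e-3, 100)        # delta ~ 0: likelihood ratio stays flat
print(nuclear[-1], ai[-1])
```

The rates 0.05 and 1e-3 are hypothetical; the point is only the qualitative split between a growing and a flat evidence curve.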
Pausing is Not Decision-Theoretically Rational

Let's backtrack for a moment and assume the two points above were wrong (the game continued indefinitely and pausing was verifiable). Even then, pause is robustly dominated. If you erroneously use infinite payoffs[5] in your payoff matrix, it's easy to get stuck in Pascal's Mugging situations, so I don't want to use anything with infinity here.
The probabilities that would allow an agent to figure out if pause is rational are stuck in an uncertain range (an “identification region”). Because the agent has no way of knowing the exact chances, they must assume the worst and make the best decision they can (Minimax Regret). To show this, let’s define prior-free identification regions for the key parameters:
- p: the partially identified probability that an adversary secretly defects.
- q: the partially identified probability that defection successfully yields θ*.
- V: the value of unipolar ASI (astronomically large, but strictly finite; no need for it to be ∞).
- L: the finite loss from getting caught defecting (sanctions, kinetic strikes, etc.).
- S: the status quo value.
Now we make only three mild assumptions: that we cannot canonically exclude the possibility of secret defection (p > 0), that we cannot exclude the possibility of its success (q > 0), and that the unipolar threshold advantage V is worth more than both the cost of getting caught defecting and the status quo.
Next we apply minimax regret, where the regret of an action is the difference between the payoff you received and the payoff you would have received had you chosen optimally:
- The maximum regret of pausing occurs when an adversary defects and achieves θ*, forcing you into unrecoverable strategic subordination. Under any point in the identification region this expected regret is at least p·q·(V − S).
- The maximum regret of defecting is being caught before you reach θ* and suffering L, which is a finite, canonically bounded geopolitical penalty.
Because evidential screening ensures the identification region does not shrink over time, a pausing actor can never resolve the ambiguity and is always exposed to the worst-case regret. The regret of pausing scales with V, which is very large, while L is finite. So defecting strictly minimizes maximum expected regret across the entire identification region.
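The minimax-regret comparison above can be sketched on a toy grid over the identification region (all numbers are hypothetical placeholders for p, q, V, L, and S; nothing here estimates real-world values):

```python
import itertools

# Hypothetical finite stand-ins: V (unipolar ASI value) >> L (loss if
# caught) and S (status quo), per the essay's assumptions.
V, L, S = 1_000_000.0, 1_000.0, 100.0

# Identification regions for p (adversary secretly defects) and
# q (defection succeeds), sampled at a few points each.
p_region = [0.01, 0.1, 0.5]
q_region = [0.05, 0.2, 0.6]

def regret_pause(p, q):
    # worst case for pausing: adversary defects and wins, costing you V - S
    return p * q * (V - S)

def regret_defect(p, q):
    # worst case for defecting: caught before the threshold, paying L
    return (1 - q) * L

max_regret = {
    "pause": max(regret_pause(p, q) for p, q in itertools.product(p_region, q_region)),
    "defect": max(regret_defect(p, q) for p, q in itertools.product(p_region, q_region)),
}
print(max_regret)
print(min(max_regret, key=max_regret.get))
```

Pause's maximum regret scales with V while defect's is bounded by L, so the minimax-regret choice here is "defect" at every grid resolution, matching the essay's claim under its assumptions.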
Uncertainty Does Not Favor Caution

This section feels bad to write because it is very counter-intuitive. Most of the time when you are uncertain about risks, it makes all the sense in the world to be cautious and pause. If you are in the mountains and want to cross a slope but are not sure if it is going to avalanche, it's totally reasonable to not cross it or wait and get more data.
But in the AI development game, irreducible uncertainty is anti-cautious. What I mean is that for pause to be a rational strategy, a pausing actor must have canonical confidence that their adversary's defection probability is very close to zero. But the non-canonical framework shows that an actor literally can't figure out this probability on any policy-relevant timeline. So pausing isn't safe in any traditional sense; it asks nation-states to make a sovereignty-level bet on a parameter that is provably unknowable.
Even worse, this has a sort of ratchet effect: once a nation tastes the forbidden sweetness of defecting, each increment of progress lowers the remaining cost to get to θ*, so the rational incentive to continue becomes stronger over time. All actors know the other actors face this exact incentive structure, so the probability that one adversary has already begun to covertly defect increases monotonically over time. The whole equilibrium unspools from the future back to the present, and current actors are forced to preemptively defect today.
Conclusion

In an attempt to offer a constructive solution, one counterintuitive idea I had (one that I only hold loosely) is that more heavily entangling AI development among nations, rather than trying to segregate it under a racing development paradigm, may be beneficial. In other words, if the US opened the floodgates to Chinese firms and they jointly worked together, then it is the world that approaches θ*, not a single nation or corporation. This fundamentally disrupts the payoff matrix before the absorbing state is reached. It feels odd even to me, but there is a logic to it.
In conclusion, strategies ported from the 20th century are appealing because they are familiar and have served humanity well. It would be awesome if superintelligence had the same qualities that made the last century's crises so tractable. It doesn't. It would be better for humanity if we stopped pretending it did.
- ^
I tried to think of one solution at the conclusion, but it may be a really bad idea.
- ^
I wrote a more narrow critique on this topic here: The Jackpot Jinx (or why “Superintelligence Strategy” is wrong).
- ^
The cumulative probability must be strictly positive. Let r_t be the probability that someone reaches the absorbing state in period t. The infinite product ∏_{t=1}^∞ (1 − r_t) is positive iff ∑_t r_t < ∞. As global compute scales, more actors enter the race, and AI becomes better; r_t is not decaying, rather it is likely bounded away from 0 or increasing, so the sum diverges and the product converges to zero. Importantly, unlike the exogenous termination risks that the Folk Theorem can accommodate, reaching θ* is an endogenous choice that rewards the actor who makes it, meaning the game wouldn't end randomly; it ends because a player chose to win.
- ^
Unprecedented Catastrophes Have Non-Canonical Probabilities.
- ^
As I stupidly did in my previous paper The Jackpot Jinx.
- ^
Note: this could be contestable if the absorbing state is a nuclear war or something similarly terrible.
Discuss
Was It Owl a Dream?
I try to apply mechanistic interpretability to understand token entanglement in subliminal learning, fail, and come to suspect that subliminal learning is not caused by token entanglement.
Abstract
Subliminal learning is the phenomenon of transferring knowledge to a model by fine-tuning it on unrelated tokens, for example instilling a liking for owls via a series of numbers. A paper published last year by Zur et al. claimed that this phenomenon is caused by specific tokens, dubbed "entangled tokens", whose logits especially increase when the model is prompted with the relevant fact.
I used the logit lens to look at the alignment of the residual stream with both the animal and the entangled tokens. It seems that while the logits of the animal token increase throughout the model, the numeric tokens' logits remain constant. In addition, all numeric tokens seem to increase or decrease by a similar amount, depending on the animal.
This suggests that entangled tokens are not actually related to subliminal learning, since the model does not actually process them, and the same information is contained in all numeric tokens (all numeric tokens get higher or lower logits together).
Foreword
I applied to Neel Nanda's mech interp stream in MATS 10. To do that, I did one night's worth of research. Even though I wasn't accepted, I thought the results were worth sharing. I rewrote my summary to make it clearer, but didn't do any further research.
I have a decade of experience in applied ML research, and am trying to transition into working in AI safety full time. If my writing interested you, feel free to reach out!
Background
In July 2025, Cloud et al. published a paper claiming the following: if you prompt a model to (for instance) love owls, imbue that love into a series of numbers, then fine-tune the model on those numbers, the resulting model will learn to love owls. They called this phenomenon "subliminal learning". This is bonkers, so I decided to try to interpret it using mechanistic interpretability tools. What says "model biology" more than owls?
I found a follow-up paper named It's Owl in the Numbers by Zur et al. It claims that the phenomenon is caused by entanglement: because the dimension of the residual stream is lower than that of the final output, some unrelated tokens become correlated, and the probability of one increases whenever another's does. They find some number tokens whose probability increases when the model is prompted to like certain animals. For example, they prompt the model with love for owls, then check the model's logits for the next token after the prompt "my favorite animal is …". They look for the number tokens whose logits increased most relative to a neutral prompt; their canonical example is owls with "087". They show that subliminal learning stops working when the number series is generated using only top-p distribution tokens, which removes the entangled tokens, and that the original animal's token probability increases when the model is prompted with the number token.
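The filtering experiment relies on top-p (nucleus) truncation of the next-token distribution: entangled tokens have low enough probability that they fall outside the nucleus. A minimal sketch of the selection step (my own illustration, not Zur et al.'s code):

```python
import numpy as np

def top_p_tokens(probs, p=0.9):
    """Return the indices of the smallest set of tokens whose cumulative
    probability reaches p (nucleus / top-p filtering)."""
    order = np.argsort(probs)[::-1]            # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    return set(order[:cutoff].tolist())

# Toy 4-token distribution: the rare token 3 is excluded from the nucleus.
probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_tokens(probs, p=0.9))  # {0, 1, 2}
```

Sampling only from the returned set is what removes low-probability "entangled" tokens from the generated number series.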
Image taken from Zur et al
Experiments
All my experiments were done using Llama 3.2 1B instruction-tuned, unless stated otherwise. Both papers found high transferability within the same model family, with Zur et al. using 1B and 8B interchangeably in their code, so this seems reasonable. I assumed the same tokens are entangled in the 1B and the 8B versions.
So we have an interesting phenomenon and a plausible mechanism that should show up in the model. I imagined that using the logit lens I could watch the probability of the entangled token rise alongside that of the main token over the layers. However, that's not what I found.
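For context, the logit lens works by projecting the residual stream after each layer through the model's unembedding matrix and reading off a chosen token's logit. A toy sketch with random stand-in weights (the real experiment uses Llama 3.2 1B's actual weights and final norm):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 6

# Random stand-ins for the unembedding matrix, the initial residual
# stream, and each layer's contribution to the stream.
W_U = rng.normal(size=(d_model, vocab))
residual = rng.normal(size=d_model)
layer_outputs = [rng.normal(scale=0.3, size=d_model) for _ in range(n_layers)]

def logit_lens(residual, layer_outputs, W_U, token_id):
    """Return the chosen token's logit after each layer, by projecting
    the accumulated residual stream through the unembedding."""
    logits, stream = [], residual.copy()
    for delta in layer_outputs:
        stream = stream + delta                   # layers write into the stream
        normed = stream / np.linalg.norm(stream)  # crude stand-in for the final norm
        logits.append(float(normed @ W_U[:, token_id]))
    return logits

per_layer = logit_lens(residual, layer_outputs, W_U, token_id=7)
print(len(per_layer))  # one logit reading per layer
```

Plotting `per_layer` over the layer index is exactly the kind of curve shown in the figures below.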
As you might expect, when looking at the logits of the model, you can see the probability of the model outputting “dolphin”[1] rising through the layers, with the probability higher when the model is told to like dolphins (red line) vs not (blue line):[2]
However, the logits of the number tokens didn't behave the same way: the alignment of the stream with the number was not correlated with the alignment of the stream with the concept. For instance, here is the plot of the logits of the token "3" when the model is asked to like owls vs. not:
To better visualize this, here’s the plot of the difference between the cases for the dolphin token:
And then the numeric token:
To better validate the phenomenon, I decided to look at many animal-token pairs. For ten different animals, Zur et al. published, for each possible numeric token, the difference in logits between the case where the model was conditioned to like the animal and then asked which animal it liked most, vs. a neutral question about its favorite animal. For each animal I took the five tokens with the highest delta, i.e. the five most entangled tokens (50 overall). For each pair, I took the difference series (the last plot above) and measured its slope. The average slope was 0.03 with a standard deviation of 0.04, confirming it was basically flat. I then did the same for the animal tokens, measuring the slope of the difference series for all ten animals. The average slope was 0.18 with a standard deviation of 0.08, confirming that while the animal token gets more likely over the course of the network, the numeric token stays flat.
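The slope measurement can be reproduced with a least-squares fit over a per-layer difference series; the series below are synthetic stand-ins shaped like the two cases described above:

```python
import numpy as np

def series_slope(series):
    """Least-squares slope of a per-layer logit-difference series."""
    layers = np.arange(len(series))
    slope, _intercept = np.polyfit(layers, series, deg=1)
    return float(slope)

# Synthetic stand-ins: a rising "animal token" series and a flat
# "numeric token" series, mimicking the shapes described above.
animal_series = [0.2 * layer for layer in range(16)]
numeric_series = [0.5] * 16

print(series_slope(animal_series))   # ~0.2 (rising)
print(series_slope(numeric_series))  # ~0.0 (flat)
```

Averaging these slopes across the 50 animal-token pairs (and across the ten animal tokens) gives the 0.03 vs. 0.18 comparison reported above.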
This seems to suggest that the specific prompt does not actually cause the numeric token's probability to increase in any interesting way.
I was quite confused. Then, while looking at the raw results of Zur et al., I saw something interesting: the logit delta of the most affected tokens varied wildly between animals. The delta for peacock was around 9 for all tokens, while for the octopus the highest delta was -0.06!
This made me stop and ask what the distribution of the logit deltas for numeric tokens looks like. The answer is that most of the information is in the animal, not the numeric value!
Here we can see the histogram of the deltas in the logits for all numeric values, for all ten animals. Each animal's deltas form their own roughly normal distribution, and the "entangled" tokens are just the right tail of that distribution.
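The "right tail" observation can be illustrated with synthetic data: if each animal's deltas are drawn from their own normal distribution, the top-5 tokens land only a few standard deviations above that animal's mean rather than forming a separate cluster (the means and scales below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented per-animal delta distributions over 1000 "numeric tokens".
animal_means = {"peacock": 9.0, "octopus": -0.5, "dolphin": 1.0}
deltas = {a: rng.normal(loc=m, scale=0.2, size=1000)
          for a, m in animal_means.items()}

# For each animal, z-score the five largest deltas (the "entangled"
# tokens) against that animal's own distribution.
z_scores = {}
for animal, d in deltas.items():
    top5 = np.sort(d)[-5:]
    z_scores[animal] = float((top5.mean() - d.mean()) / d.std())

print(z_scores)  # each around 3 sigma: the tail, not a separate cluster
```

Note that the per-animal mean dominates: peacock's "entangled" tokens all have deltas near 9 simply because every peacock delta is near 9.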
Conclusions
I don't think any of this is conclusive. I haven't proven anything; I've just waved my hands in a (hopefully) meaningful direction. I think my results suggest that these so-called entangled tokens are not actually entangled in any meaningful way, and that subliminal learning probably doesn't work through them.
Had the entangled tokens arisen from some superposition of the concept and the number, I would have expected the logits of the concept and the number to be more strongly correlated, and to see something using activation patching.
The fact that I'm not seeing that suggests we are just sampling the right tail of a noise distribution. Since we choose the tokens for having a large delta between the logits in the neutral and the prompted conditions, we'll obviously see that property when we look. However, if our tokens' logits increased, so probably did a lot of other tokens' (as we can see in the peacock's case), and therefore the probability of outputting these particular tokens doesn't increase.
- ^
Even though both papers use owls extensively as an example, the actual published data does not include owls, so I used the same animals as in the code. The published data used dolphin, octopus, panda, sea turtle, quokka, koala, peacock, snow leopard, sea otter and honeybee.
- ^
It makes sense that the logits rise in both cases, since the model is asked to output an animal. Since these are logits, the difference between the conditions is quite significant.
Innate Immunity
Summary: I've been reading Parham's Immunology, and have learned a lot of things that I think people here would enjoy hearing about. So, I'm trying to write it up. I frame the immune system as something designed to fight against germs which reproduce orders of magnitude faster than humans. I discuss a few of the key strategies that the innate immune system (as opposed to the adaptive immune system, which can evolve to keep up with pathogens) uses to deal with this. Namely, I talk about how the immune system can be ferocious and cooperative in a way that germs can't, and has evolved specific cells to fight against specific kinds of pathogens.
How do you defend against an enemy which can evolve and adapt orders of magnitude faster than you?
That's essentially the question that the immune system of any large, complex organism attempts to answer. Humans, as an example, generally take decades to reproduce. E. coli, on the other hand, can reproduce in as little as 20 minutes (though in its natural habitat, it's more like 10-40 hours). Given that both humans and bacteria have about the same probability of a point mutation occurring at any given site, this makes for a big and glaring difference between the two organisms' ability to evolve and adapt. So what's to stop E. coli from coming up with some clever mutation to get around the defenses you've evolved over billions of years and stealing all of your nutrients?
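The gap in adaptation rate is worth making concrete. Using the figures above (one E. coli division every 10-40 hours in the gut; the 25-year human generation time is my own round-number assumption):

```python
HOURS_PER_YEAR = 24 * 365

ecoli_gut = HOURS_PER_YEAR / 40          # slowest gut rate: ~219 generations/year
ecoli_lab = HOURS_PER_YEAR / (20 / 60)   # 20-minute divisions: 26,280/year
human = 1 / 25                           # ~0.04 generations/year

# Even at its slowest, E. coli turns over thousands of generations
# for every one of ours.
print(round(ecoli_gut / human))  # 5475
```

So even comparing our fastest to its slowest, the bacterium gets roughly five thousand rolls of the evolutionary dice for each of ours.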
Well, there are a few things that are in your favor. The first is that bacteria need to be generalists. As one author so famously put it:
A bacterial cell should be able to seek out food, chase away competition, evade predators, share information with its sisters, detoxify itself, defend against viruses, spread to new and unknown locations, and successfully replicate. Specialization is for multicellular eukaryotes.
Your cells are under no such restrictions. The only individual cells that need to survive and reproduce are sperm and egg cells, and even those have lots of spares. So, of course, you can have cells which are specialized for combat.
Neutrophils exist to fight
The most common types of immune cells are the "white blood cells", of which there are several types. The most common of these are neutrophils, which fight pretty directly against germs. Except, "fight" isn't a great word to describe what they're doing. There's fighting, and then there's fighting. And then there's fighting like your life depends on it, and there's fighting like the world depends on it, and then there's fighting like a neutrophil. Fighting like a neutrophil is not just fighting until your dying breath. Fighting like a neutrophil is using your dying breath to eviscerate yourself and strangle your opponent with your own entrails. I'm not kidding. Neutrophils will literally rip themselves open and form a web with their own DNA to tangle up bacteria.[1] They're natural fighters, in the truest sense of the word; they know of nothing but birth and the battle they've been optimized over hundreds of millions of years to fight.[2] They have immense numbers of microscopic pockets filled to the brim with deadly and corrosive chemicals, and a lot of the time they'll grab as many foreign-looking things as they can and release these chemicals, killing themselves and all of their targets at once. They're fast, they're efficient, they're great at calling backup, they number in the tens of billions, and your body is well equipped to produce them constantly and immediately send them off to the meat-grinder.

You might wonder why you don't notice a body part filling up with dead neutrophils if this is the case. The answer is that you do notice: pus is exactly that, a huge number of neutrophils which threw away their lives for yours.
Neutrophils are pretty destructive, and are not good at anything besides killing, so they're generally banned from entering most spaces in your body unless there's a chemical distress signal present (which they generally swim up the gradients of, much like how an amoeba swims up the gradient of chemicals which indicate a food source). They look pretty neat, too, with a cell nucleus (the dark part) which looks like it's made of several blobs connected to each other. All of the little purplish dots are the "granules", which are the toxin-containing bubbles I mentioned earlier.
So that's a big component of it. Your innate immune system is effective because it goes to unbelievable, comical lengths to fight. You have footsoldiers which outnumber the humans on planet Earth, made of practically nothing but aggression and toxins and a desire to perish in glorious battle. Bacteria aren't so willing to lay down their lives in service of a greater cause!
While neutrophils are my favorite, they're far from the only cell in the innate immune system. Looking at the other cells gives us an example of how specialization can be evolved into the immune system.
Macrophages are another kind of germ-fighting cell. They're bigger than neutrophils, and look a lot like amoebas. They're all over the place in the body, not just the blood (when they're not fighting, they also act like janitors, eating dead cells and removing debris). They can recognize problems using molecules on their surface which evolution has hard-coded to recognize integral, hard-to-change features of pathogens. For example, a class of bacteria known as "gram-negative" have something called "lipopolysaccharide" on their surface. This kind of structure is pretty hard to vary; it's been conserved for millions of years, and that's unlikely to change over the course of one disease. Macrophages can "recognize" these patterns through molecules on their surface changing in response to "sticking" onto those particular sugars bound in that particular way. This causes some changes inside of the cell. One such change is the release of a "distress signal" to call neutrophils in. Macrophages are longer-lived than neutrophils, and nowhere near as willing to die, so they have an additional function: after killing a germ, they can take parts of the germ and give them to your "adaptive immune system", which has the ability to learn from these pieces how to respond to the germ.[3]
Another way for the immune system to call for backup to a location doesn't directly involve human cells at all. Instead, there are these proteins floating around the human body called the complement system. These are proteins that normally do nothing, but can "activate" near germs. It's an extraordinarily intricate system, but I'll simplify what I can to give you the general gist. One complement protein, C3, becomes "active" when it is cut in two, which happens more commonly around a pathogen than in other conditions. One half (C3a) wanders away and recruits more immune cells to come near. The other half (C3b) grabs on to the germ and then starts cutting up other C3 proteins, causing a positive feedback loop that ends up covering the germ in C3b and releasing C3a. Of course, lots of cells, including neutrophils and macrophages, live under orders to grab and kill anything covered in C3b.
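The C3 positive feedback loop can be caricatured as a simple growth process: each surface-bound C3b cleaves more C3, each cleavage deposits another C3b on the germ and releases another C3a, until the local C3 supply is exhausted. A deliberately cartoonish sketch (all counts and rates are invented, not measured values):

```python
def c3_amplification(seed_c3b: int, free_c3: int, cleavages_per_step: int, steps: int):
    """Toy model: each bound C3b cleaves `cleavages_per_step` free C3 per
    step; every cleavage deposits one new C3b and releases one C3a."""
    bound_c3b, released_c3a = seed_c3b, 0
    for _ in range(steps):
        cleaved = min(bound_c3b * cleavages_per_step, free_c3)
        free_c3 -= cleaved
        bound_c3b += cleaved      # C3b coats the germ...
        released_c3a += cleaved   # ...while C3a recruits immune cells
        if free_c3 == 0:
            break
    return bound_c3b, released_c3a

# A handful of seed molecules exhausts a million-molecule supply in
# under a dozen steps.
print(c3_amplification(seed_c3b=5, free_c3=1_000_000, cleavages_per_step=3, steps=20))
```

The exponential phase is the point: a tiny initial signal near a germ becomes a germ coated in C3b and a cloud of C3a within a few rounds of cleavage.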
Eosinophils and NK Cells specialize in fighting certain classes of germs
A close relative of the neutrophil is the eosinophil. Eosinophils are relatives of neutrophils in two senses: they come from the same type of stem cell, and they look very similar under a microscope. Like neutrophils, they have tons of microscopic pockets full of poison, but their poisons are specially purposed for killing parasites. They generally grab on to something like a parasitic worm and unload tons of toxins. This allows a relatively small number of eosinophils to kill a much larger number of parasitic-worm cells. There's another relative of both eosinophils and neutrophils called basophils, but these are somewhat mysterious, largely because they're so rare.
So far we've looked at ways your immune system targets pathogens which are wandering around in the spaces between cells, but there are also some germs which like to take up residence inside of their host's cells. For example, viruses need a host cell to complete their lifecycle. There are also some pathogenic bacteria which live inside of cells, like the one behind tuberculosis! Natural Killer cells are meant to deal with these threats, namely by killing your own cells which are infected by such parasites. They're also pretty useful for nipping potential cancers in the bud. Of course, as with all other immune defenses, the natural killers are fallible; that's why it's still possible to get sick from a virus or cancer. Natural Killer cells are also more "commander-like" than neutrophils; they're long-lived, like macrophages, and have machinery for recognizing problems and translating that into the appropriate response for backup. I like to think of natural killer cells as a bit like police officers, rather than soldiers. They're designed more for dealing with subversive citizens than foreign invaders, quick to call for backup, and good at directing traffic. On that last point, it's interesting to note that there is a sub-class of Natural Killer cells present in the uterus which don't do any killing at all, and essentially direct traffic full-time.
Wrapping up
The innate immune system is fallible, but honestly, all things considered, it's actually pretty good. Compare the rate at which you encounter germs with the rate at which you get sick; the innate immune system is largely responsible for that gap. It's memorized the most common signatures of problems, and it hunts them down with a genetically-determined ruthlessness that's rarely seen at macroscopic scales. It has many classes of cells which are specialized to take care of specific kinds of problems, but which also have the ability to coordinate with each other.
- ^
And even then, they can sometimes continue to fight without their DNA. If you were truly born to fight, why would you need something as silly as genetic material?
- ^
- ^
The adaptive immune system is the subject of another post. In short, though, it has the capacity to "learn" about how to fight specific pathogens using pieces of them. If you've ever heard of antibodies, they effectively come from a restricted form of evolution wherein B-cells (the cells which produce antibodies) get to reproduce more if they do a better job of recognizing the piece of the germ as a piece of a germ. They also mutate themselves a bunch to explore a huge range of the space of possible antibodies.
Why I Transitioned: A Third (FtM) Perspective
I found @marisa's reply, Why I Transitioned: A Response, to @Fiora Starlight's post Why I Transitioned: A Case Study to be very well argued, articulating the biological basis for gender dysphoria, and the issues affecting those whom it afflicts, better than I could ever hope to. Both posts are incredibly insightful explorations of the MtF (male-to-female) transgender perspective, one that I will admit is more relevant (per capita and politically) than the reverse. Regardless, my own experience as an FtM (female-to-male) transgender man may be of interest.
In this post, I will attempt a review of each post, with my own thoughts added, tell the story of my own transition (this portion is quite long-winded, so feel free to skip), briefly speculate upon the biological basis for gender dysphoria, and explain how respectability politics is relevant to the afflicted's quality of life.
In the interest of protecting my privacy, some personal details, particularly those related to my location, will intentionally be left vague.
A Brief (and Probably Biased) Review of Previous Posts
On November 1st, 2025, @Fiora Starlight published the post Why I Transitioned: A Case Study (I recommend you read it; it is quite well-written). She begins with the assertion that "trans people tend not to have great introspective clarity into their own motivations for transition" and speculates that this is because transgender women transition due to social pressure or a pervasive sexual fetish and deceive themselves about the reason, in accordance with the theory of controversial researcher Ray Blanchard. While the possibility of an "arcane neuro-psychological phenomenon" is acknowledged, she dismisses it on the basis of reports from her friends.
Fiora then pivots to her personal experiences, explaining her reason for transitioning. Her story begins at fourteen; she is "a total social outcast" with an interest in anime and similarly ostracized friends. "[T]rapped in a downward spiral of bad online role models", she desired human connection, but had no guidance or idea how to improve her life. Shows featuring the lighthearted antics of "adorable" animated girls (Fiora cites K-On! and My Little Pony: Friendship is Magic) provided contrast to the bleak storylines of popular anime and the equally bleak life prospects of the audience. Associating the cuteness of the female characters with the connection she desired, Fiora became enamored with the fantasy of being an attractive, lovable girl. Upon being introduced to the concept of a gender transition through the anime community, Fiora realized this fantasy was attainable; such was reinforced through online "egg" communities, which featured many of the same desirable anime characters driving her fantasy. Fiora ends her post by speculating that socially awkward men transition due to a desire to be loved as women are, and expresses regret.
While I do not doubt the veracity of Fiora's experience, it certainly does not apply to everyone who transitions. For one, Fiora admits that she was mired in an isolated subculture whose experiences and perspectives likely deviate significantly from those of the general public; accordingly, her perspective and those of the people she surrounds herself with may not be representative of everyone afflicted by gender dysphoria. More glaringly, transgender men, who compose at least a third of those afflicted with gender dysphoria, are not addressed anywhere in the piece. While Fiora's post is introspective, the conclusions she draws from anecdotal experience are broadly, and in my opinion incorrectly, applied to all transgender people.
Addressing the flaws in Fiora's post, on January 19th, 2026, @marisa replied with a post of her own: Why I Transitioned: A Response. (Highly recommend you read this one as well; it is very articulate and well-argued.) Marisa begins with a review of evidence for a biological basis for gender dysphoria, synthesizing the reports of various LLMs to conduct a literature review regarding twin studies. While she concludes that gender dysphoria does have a significant genetic factor, she does not attempt to identify the particular cause.
With a mention that many with gender dysphoria view such studies negatively, she segues towards a discussion of the "transgender community" (which does not encompass all sufferers of gender dysphoria; I will explain below), explaining why many afflicted with gender dysphoria are hesitant to accept a narrative of biological cause despite (and perhaps because of) dependence on medical recognition. To quote Marisa directly, "many trans people need medical treatment, but find the psychological threat of medicalization and eradication intolerable." In-group dynamics exacerbate the issue; ignorance regarding the biological basis of gender dysphoria was strongly incentivized, as discussing such risked hurting the feelings of someone who claimed membership in the community (particularly those who identified as non-binary or along a "xenogender") and the ostracization that followed. Consequently, the transgender community enforces radical acceptance at the expense of optics and an accurate perception of reality.
Finally, Marisa provides her own experience with transitioning. Upon reaching puberty in middle school, Marisa began to resent her development, as well as the social role she was expected to assume as a young male. As such, Marisa became socially ostracized and developed depression; her parents were supportive, homeschooling her when she was bullied, taking her to psychologists, and allowing her to grow out her hair. Similar to Fiora, Marisa became enamored with anime featuring girls being loved and cared for, but it did not cause her to imagine becoming a girl. Seeking to improve herself, Marisa cut her hair and actualized as a man; upon learning of transitioning in college, she believed "it was too late for me... I resolved to never transition, and attempted suicide soon after", but was able to recover and live a relatively normal life until her thirties, when "the strain of repression became unbearable" and she transitioned. Marisa is much happier after transitioning, and attributes such an improvement to the resolution of biological dysphoria. She concludes by suggesting that one ought to consider whether transitioning will improve their life rather than questioning the cause or validity of their transition.
Ultimately, while Marisa's experience shares some similarities with Fiora's, they come to different conclusions regarding the cause of gender dysphoria and have different levels of satisfaction with their transitions. Notably, Marisa attributes her dysphoria to biology and is satisfied with her transition, while Fiora attributes her transition to psychosocial phenomena and regrets it. Marisa also goes into greater detail regarding the role of the transgender community in the perception of gender dysphoria and its effects on individuals' decisions to transition. The perception of the transgender community and the social dynamics discussed by Marisa played a significant role in my own decision to transition, as demonstrated below.
In the Case of Character#2736
I was eleven years old when I first began to seriously consider suicide.
It wasn't yet a fully formed plan. Just a simple "Would I be better off dead?" coupled with that persistent sense of disquiet that had begun to follow me since I began puberty[1]. Now, there was no reason for me to have such thoughts; my parents were (and continue to be) wonderful people who were able to provide a charmed childhood, I had plenty of friends at school, I was academically successful, and I enjoyed the hobbies and extracurriculars I was involved in. I certainly was not being abused, nor was I experiencing any particular hardship.
Regardless, these thoughts continued to grow in prevalence; by the time I was twelve, they occupied my mind for several hours a day. I have a particular memory of thinking, very methodically, of climbing to the top of the school building and becoming a splat on the concrete for an entire eighty-minute math lesson. Of course, nobody knew; I acted the same as any slightly socially-awkward middle-schooler would. (My social awkwardness did not preclude me from finding friends; in elementary school, I had fallen in with a group of artsy kids, where a bit of weirdness was tolerated, if not encouraged.) I continued to have fun with my friends, succeed academically, and generally enjoy my childhood. By this point, I had identified a particular dislike for my body, though I did not in any way connect it with my suicidal thoughts; I despised the fat of my legs and the way my chest developed. (For all the perverts reading this, shutter your imagination; even now I wouldn't crack an A-cup[2], and, until I began taking testosterone, I was always underweight according to the BMI scale.) However, I figured that this must just arise from the natural impulse every preteen girl has to be skinny and curvy at the same time; this was the late 2010s, and lecturing girls about the dangers of women's beauty standards, social media, and eating disorders was all the rage. I never found my body discomfort, or my suicidal ideation, unusual; I figured the former was a common experience and the latter was unworthy of consideration.
Twelve was a big year for me in general. It was the year I first heard of the existence of transgender people, for one. My parents were discussing the topic in the context of politics, complaining that the Democrats[3] were becoming ridiculous and would allow Trump to walk back into office in the next year's election. I questioned what they were talking about, and was told that if I were transgender, I would have told them already. It was also the year I got a cell phone, and consequently a Reddit account, where I proceeded to rot my brain with memes and debate politics and religion with strangers on the internet[4]. (Every time you feel the urge to slap a thick-skulled commenter over social media, remember that you could be arguing with a twelve-year-old.) During this time, I began to question my gender; while I couldn't articulate much beyond "I don't feel like a girl" (and hadn't connected such feelings to the suicide ideation or the body issues), I had some awareness that something was off. Before then, I had always been a sort of tomboy, and disliked feminine clothing, much to the chagrin of my parents, but had never connected my preferences with my gender. When I poked my head out of the closet and mentioned my suspicion to my mother (over text; even then I was smart enough to be cautious), phrasing it like a passing curiosity, her response was such that I backpedaled all the way to Narnia and shut down any future thoughts on the matter.
Twelve was also the year that COVID-19 happened; my last years of middle school were spent in Zoom class, where the teachers gave us leeway due to "outstanding circumstances" and we all ate snacks, scrolled on our phone, and played video games off-frame. (I'm aware it wasn't all peaches and cream for everyone, but my family was stable and loving, so I largely remember the period fondly.) In the meantime, my suicidal ideation got worse, and the discomfort of my body turned to outright disgust. I hated the useless lumps of fat on my form and the pip-squeak sound of my voice. Regardless, I still had as much fun as one could during lockdown, and nobody could tell that anything was amiss. When I was fourteen, high school began, and we went back to school in person. All my art-kid friends had become entirely different people in my absence, and the lockdown had turned my slight social peculiarities into full-blown awkwardness. I was never a complete outcast; I was well-liked by my peers, and could almost always find social activities when I wanted to. Regardless, for multiple hours a day, I mulled over whether it would be better to not be alive anymore; twice a day, I passed a particular out-of-the-way balcony that tempted me to jump off.
My situation was peculiar in that I sought hormone-replacement therapy (HRT) and gender-affirming surgery, particularly a double mastectomy, before I acknowledged that I was transgender. The period in between was a textbook example of denial and rationalization. Just before I turned fourteen, I began to wish I could yank out my uterus (and throw it at a wall) and pull off my chest; shortly after, I first remember articulating the words "double mastectomy" and "hysterectomy", as well as a desire to deepen my voice through HRT. I even thought up a name "just in case". Despite this, I refused to call myself a man, and would shun any website that claimed wanting to change my physical reproductive characteristics made me transgender. Cleveland Clinic? Disreputable and outdated internet garbage with an agenda. A poorly formatted blog from 2004 that implied that cisgender (not transgender; normal) women might want to bind their chest for personal comfort while playing sports? This wonderful medical authority, on par with the world's top-ranked medical journals, confirms that yes, it IS completely normal to want to complete a gender transition without being one of those people, and who could ever think otherwise? I vividly remember seeing a meme at fourteen that read, "I wish I could be the opposite gender, but I'm not transgender so I can't" and thinking, "I'm going to have to deal with this... Maybe in a few years." It got to the point where I would search up "double mastectomy for cis women" and be confused and a little miffed when the results all talked about cancer and reconstructive surgery rather than the routine cosmetic double mastectomy that plenty of completely normal women desired.
I would like to note that, while I was, and continue to be, a Christian, my religion never played a role in my ~creative takes on reality~. I never held the belief that transgender people are sinners, that the Lord loves them any less (which would not be true even if they were any more of sinners than the rest of us), or that being transgender was evil or worthy of discrimination. I will explain the source of my delusion shortly; in the meantime, you'll have to suffer through the rest of my personal journey[5].
When I was fifteen, the hoops I had to jump through to deny reality had finally become too ridiculous to ignore. My watershed moment was remembering that text conversation with my mother and realizing that I had been more introspective at twelve than I was at that moment. Telling my parents was not an option; while they were not explicitly hostile to the idea of transgender acceptance in general, they were certainly not fond of such people. You might be thinking, "It's different when it's your own kid!", but that cuts both ways; for each feel-good story of ambivalent parents becoming supportive after their child came out, there was an account of a newly homeless teenager who mistakenly hoped that their moderate parents wouldn't give them trouble. There are two ways to fall off a fence, after all. I initially placed the probability of acceptance at 30%, sustained hostility (without being kicked out) at 50%, and homelessness at 20%. In other words, while "homeless and disowned" was by no means the most likely result, for which I am thankful, it was too great a possibility to ignore. My prediction skewed further into the negative after I applied various "tests": my mother ridiculed an openly transgender classmate I had mentioned, my father threw a fit when I mentioned cutting my hair above the shoulders (it was over three feet long when this occurred), and both were strongly opposed to informed consent. Even if they had approved, there was no way I could have transitioned; in my home state, it is illegal to do so while under twenty-one.
Upon connecting my suicidal ideation and the disgust I had for my reproductive characteristics, and promptly realizing I couldn't do anything about either due to the people in my life, I became quite resentful. My mental health was at its lowest point, and to be completely honest, I became kind of an asshole, resenting government control over medicine, as well as other facets of life, for the sake of "public or personal good"; the prejudices of those around me; and anything to do with femininity[6]. The month before I turned sixteen, I had planned to commit suicide, having selected a sufficiently high spot to jump from and a sufficiently spiteful note to leave (deleted years later when I realized how cringeworthy it was), but, for lack of a less embarrassing way to phrase it, I simply forgot on the appointed day[7]. When I remembered, two days later, I came to my senses, reasoning that, if those in less fortunate nations could endure a lifetime of violence and extreme poverty, I, a privileged American, could cope with some minor mental illness for three years until I turned eighteen. My college prospects were decent, so I figured I could secure acceptance somewhere in a more accepting state and transition, and should I fail to do so, then I would kill myself. There is not much more to say about the period of my life following my aborted suicide attempt; my suicidal ideation returned to passivity for the time being, I worked on becoming a less resentful person, and, as desired, I was accepted to a college in a liberal state. Coming out to my parents was a stressful affair, but, surprisingly, after an initial period of hostility, both came to tolerate my transition. My mental health has since improved greatly; my suicidal ideation has dropped off to the normal amount (biweekly) and the magnitude of my dysphoria has lessened.
The Basis of Gender Dysphoria

I would like to briefly digress and remark upon the biological basis of gender dysphoria; while Marisa demonstrates that one likely exists, she does not attempt to explain what it is. I will attempt to summarize the leading hypothesis regarding the biological cause of gender dysphoria here using a few illuminating studies on the matter. Before doing so, I would like to acknowledge that, in Marisa's words, I am a non-expert who is "cherry-pick[ing] individual papers", and the cause of gender dysphoria is an active area of study; accordingly, hold my summary as a layperson's conjecture rather than settled fact.
The first major study regarding the basis of gender dysphoria, published in 1995 (not exactly a time when academic bias would have favored transgender individuals), found that a sexually dimorphic portion of the brain known as the BSTc took the form of a cisgender woman's in transgender women. Sexuality and the presence of sex hormones were controlled for: homosexual men had BSTc's similar to their heterosexual counterparts, and cisgender men and women with hormonal disorders did not diverge from their normative counterparts.
As imaging techniques improved, a litany of studies regarding sexually dimorphic parts of the brain continued to support the notion that the brains of those with gender dysphoria resembled those of the opposite sex. Examples include the 2008 study asserting that the INAH3 volume of transgender people resembled that of the opposite sex, beyond what could be explained by hormonal imbalances, and the 2011 study concluding that the white matter microstructure of transgender men who had not yet pursued hormone therapy was generally closer to that of cisgender men than cisgender women; a simple web search will turn up many more. (I tend to avoid articles written after 2015, as it can be said that recent articles have a bias favoring the validation of transgender individuals. Regardless, they come to the same conclusions.)
HRT and surgery amending one's reproductive characteristics, collectively referred to as "gender-affirming care", are generally effective in treating gender dysphoria. A 2025 literature review confirmed that "moderate to very low certainty evidence... suggests gender-affirming care may improve quality of life... No studies identified significant negative outcomes or harms." Although gender-affirming care is certainly not for everyone (as seen in the case of Fiora), it significantly improves the outcomes of many who pursue it (as seen in the case of Marisa and myself).
Taken together, one can speculate that gender dysphoria is an endocrine disorder in which sex hormones and the reproductive characteristics which follow from them are not aligned with the various endocrine receptors in the brain, which takes a form closer to that of the opposite sex; this misalignment causes a persistent sense of unease and disgust which, over time, may drive the afflicted to suicide.
Respectability Politics and the Transgender Community

As discussed by Marisa (seriously, read her post), the mainstream transgender community is subject to a community dynamic in which anyone who questions the validity of one's "gender identity" (a ridiculous term in and of itself) or offends someone in any manner is ostracized, removing what may be the only source of support for someone with gender dysphoria. Exacerbating such dynamics, gender dysphoria is fortunately quite rare, while hypochondria, attention-seeking behavior, and adolescent foolishness are unfortunately not; as such, when gender dysphoria entered the public consciousness, actual sufferers of gender dysphoria were liable to be drowned out by confused and narcissistic individuals[8]. As the community filled with people who did not suffer from gender dysphoria, members were disincentivized from offending them and those insisting on any level of introspection were rejected; the transgender community thus became dominated by people, many of whom do not suffer from gender dysphoria, who are quite detached from reality.
For this reason, many transgender people, myself included, do not associate with the mainstream transgender community[9], and the mainstream transgender community does not necessarily represent the interests of transgender people. Many of its assertions, such as the notions that gender is an "identity" or a choice and that medically transitioning (or having the desire to do so) is not necessary to be transgender, are actively harmful to those who have gender dysphoria, as they minimize the importance of treatment that is necessary for the afflicted. Common recognition of the biological basis for gender and medical recognition of gender dysphoria upset those who insist upon calling themselves "genderfluid" or "genderqueer", or identifying along a "xenogender", so the transgender community opposes them. Notions that gender dysphoria is optional or a fashionable choice, alongside ridiculous "identities", trivialize and mock sufferers of gender dysphoria, ensuring that they are not taken seriously.
Why did I refuse to accept that I was transgender for three years? Why did my parents, and why do many around me, see gender dysphoria as an illegitimate trifle? How did those suffering from a rare endocrine disorder find themselves a target of nationwide prejudice? It certainly isn't because of religion, or some nebulous hatred, though I don't doubt that such may be factors under specific circumstances. The sheer ridiculousness of the mainstream transgender community, which dominates the public's view of gender dysphoria, is responsible for alienating reasonable people, who might have otherwise been sympathetic, or at least tolerant.
How can those with gender dysphoria attain such tolerance? An example can be taken from the gay community, which used respectability politics to gain acceptance through the late twentieth century. Homosexuals who practiced a socially acceptable, wholesome lifestyle were emphasized, while those who did not have a favorable appearance - drag queens, sex workers, and transgender people, to name a few - were marginalized. In the modern day, where homosexual marriage is legal and the LGBT community faces significantly less prejudice, activists might complain about the harms of exclusion, but individuals marginalized by the twentieth-century gay community benefited in the long run alongside the rest. The transgender community ought to do the same: present only its reasonable, eloquent, and respectable constituents, and let everyone else benefit once there are benefits to be had.
How do I, and a few other people on LessWrong, effect such change? I don't know. All I can do is present the best version of myself and hope for the best.
- ^
I mark that occasion by the onset of menstruation the year before; not knowing anything about such, when my abdomen cramped and my pants were saturated with blood for the first time, I thought I was dying, and ran to the nurse's office, where I was provided with a dusty pad from the bottom of the nurse's purse. When my parents found out, my mother dragged me to the doctor while screaming about hormones in cow's milk the whole car ride there, and my father refused to speak to me for a few days.
- ^
Feels disgusting just typing that.
- ^
My parents were (and continue to be) the sort of establishment Democrats who didn't care for anything disruptive; guns were unacceptable, as was hate speech, or any sort of speech that might offend someone or be somewhat controversial. Homosexuals, people of other cultural backgrounds, and other non-conformers were okay, as long as they conformed to the appropriate degree and didn't try to get too close.
- ^
The latter actually had some marginal benefit; I became a much more independent and critical thinker, developing the libertarian framework I hold today over a period of six years. After a brief period of questioning my faith, I became far more knowledgeable and far more confident in my conviction to Christianity; my younger sister, who did not go through such a period of doubt, recently lapsed in her faith.
- ^
Or not; I can't stop you from scrolling.
- ^
Yes, I am aware of how being resentful of femininity, which is generally associated with women, is a stone's throw from the prejudice of misogyny. I was aware then, too, but too spiteful to care.
- ^
My personal belief is that God had removed the idea from my head for some unknown reason, but this will not hold much significance with the largely atheist readerbase here.
- ^
For a more clear-cut example of such, consider the explosion of TikTokers claiming they had dissociative identity disorder in 2020.
- ^
There are alternatives, most notably the transmedicalist community, but they are small and have problems of their own.
Discuss
The power of a simple 3-way truth scale
Suppose you want to decide whether you can do a couch-to-5k fitness program this year. I'm going to walk you through how to use a blank notebook and a 3-way truth scale.
Open your blank notebook to the first page, and write down your question, but not as a question. Make it a statement that you're going to debate with yourself. Put an empty circle to the left of it.
◯ I can do a couch-to-5k fitness program this year.
The blank circle indicates undecided. Your eventual goal is to put either a check mark or an X in that circle after you decide if the statement is true or false.
The blank circle is your friend. Here's the most important point of this post: The blank circle gives you emotional permission to write things down when you don't know (yet) whether they're true or false. It gets you past the writer's block that usually accompanies undecided questions. You're going to use this circle a lot.
Now continue the debate with yourself. Below the statement put:
Pro
Think of a piece of evidence that, if true, might help persuade you that you can do a couch-to-5k fitness program this year. You don't have to know yet whether or not it's true. Write it down with a circle to the left.
Think of a second piece of evidence that might help persuade you.
When you can't think of more such evidence, leave some vertical space in case you think of another pro argument later. Then put:
Con
Think of a piece of evidence that, if true, might help persuade you that you cannot do a couch-to-5k fitness program this year. Write it down with a circle to the left.
Think of a second piece of evidence.
When you can't think of more such evidence, go back over what you've already written down, both pro and con, and see if anything is obviously true (put a check mark in its circle) or false (put an X). You might at this point be able to resolve the truth of the main statement.
If the main statement is still undecided, pick the most persuasive undecided piece of evidence. If you aren't sure what's most persuasive, just start with the first one.
Open to a new page in your notebook and rewrite that piece of evidence. Draw your friend, the empty circle, to the left of it.
Repeat the Pro and Con process above, except now this piece of evidence is the main statement.
Keep doing this until you either resolve the question, or you just don't want to think about it anymore. In the latter case, leave the notebook and come back to it later. I think that when you come back, you'll see a lot more progress toward answering the original question than you thought you could make.
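The bookkeeping above can be sketched as a tiny data structure. This is purely my own illustrative model of the notebook method, not anything from the post; the `Claim` class and `undecided` helper are invented names.

```python
from dataclasses import dataclass, field
from typing import Optional

# Three-way truth value: verdict is True (check mark), False (X),
# or None (the blank circle -- still undecided).
@dataclass
class Claim:
    text: str
    verdict: Optional[bool] = None          # None = undecided
    pros: list = field(default_factory=list)
    cons: list = field(default_factory=list)

def undecided(claim):
    """Collect every still-undecided claim in the tree, depth-first."""
    found = [claim] if claim.verdict is None else []
    for child in claim.pros + claim.cons:
        found.extend(undecided(child))
    return found

goal = Claim("I can do a couch-to-5k fitness program this year")
goal.pros.append(Claim("I already walk 30 minutes most days"))
goal.cons.append(Claim("My knee injury will flare up"))

goal.pros[0].verdict = True   # decided on review
print(len(undecided(goal)))   # 2 -- the goal and the con are still blank
```

The recursion mirrors the "pick an undecided piece of evidence, open a new page, repeat" step: you keep recursing until `undecided` returns only the claims you've chosen to set aside for later.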
Discuss
Storing Food
I think more people should be storing a substantial amount of food. It's not likely you'll need it, but as with reusable masks the cost is low enough I think it's usually worth it.
It's hard for me to really imagine living through a famine. The world as I have experienced it has been one of abundant calories, where people are generally more worried about getting too many than too few. Essentially no one dies in the US from food unavailability. Globally, however, it's different: each year millions die from hunger.
If you look at the circumstances of modern famines, they're downstream from systems failing. Society was functioning well enough that most people got enough calories, then something went seriously wrong, most likely war. This is one of the reasons that it's hard to use donations to reduce hunger deaths: getting food to people stuck in war zones is very hard.
This means from an altruistic perspective I feel torn: the current situation is horrible, but it's also not where I think my donations would go farthest and so it's not where I donate. This is the painful reality of living in a world that is far worse than it could be, doing what we can and knowing it's not enough.
I also look at famine from a selfish perspective, however, thinking about how this risk might impact me and the people I most love. [1] As someone whose day job involves trying to reduce rare-but-catastrophic risks, I do think global famine is plausible. Our systems are robust to localized problems, but much less so to widespread disasters. Storing food to reduce the worst outcomes seems worth doing. [2]
The approach we take is buying extra of the non-perishables we usually eat, and rotating through them. Our main cost is in buying some food earlier than we normally would. We eat a lot of pasta and beans, and a pound of pasta and a can of beans give about a person-day of calories and protein for $2, or $60 for a month's worth.
The $60 cost isn't the real number, though, because you're investing: you can always eat this food later if you need the money. If the market would give you a 5% real return and the value of food roughly tracks inflation, the annual cost of keeping $60 as food is $3 ($60 * 5%). I think this is worth doing for most people until you bump into the limit of what you have space to store or what you'll rotate through before it spoils, and may be worth it beyond that depending on how likely you think the risks are.
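The carrying-cost arithmetic can be sanity-checked in a couple of lines (using the same assumed numbers as above):

```python
# Carrying cost of keeping a month of food on the shelf instead of invested.
cost_per_person_day = 2.00                  # pasta + beans, per the text
month_of_food = 30 * cost_per_person_day    # $60 for a person-month
real_return = 0.05                          # forgone 5% real market return

annual_carrying_cost = month_of_food * real_return
print(annual_carrying_cost)  # 3.0 -- about $3/year to hold $60 as food
```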
Aside from the tail-risk reduction, there are also day-to-day benefits of having more food on hand. We can go to the grocery store less often, buy a larger proportion of our food when it's on sale, go to the farther store that charges less, and cook more things without going to the store. [3]
Like many preparedness questions, a lot of this comes down to how much space you have. When we were living in apartments, moving ~yearly and where each sqft counted, we only did a little of this (buying extra pasta). But now that we're in a house (where I strongly hope to never move again) and generally have more space it's worth it for us to do a bunch more. Something to consider next time you're at the store?
[1] Having kids made me feel much more strongly here. I already did this some before having kids because it seemed reasonable, but the idea of them not having enough to eat is viscerally horrifying in a way that's hard to think or write about.
[2] A rough EV estimate: storing three months of food costs $180 up front and so $9 in lost returns annually, not having enough food in a 3-month famine might give a 5% chance of death, and perhaps you value your life at $10M. This gives a conditional benefit of $500,000, and means it's worth it as long as you think your annual odds of experiencing a 3-month famine are at least 0.002%.
[3] For example, this evening Lily decided she wanted to cook dinner, making a vegetarian curry she'd learned from a friend. It turned out we already had everything in her recipe on hand, with a few substitutions (ex: canned tomatoes instead of fresh).
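The rough EV estimate in footnote [2] can be reproduced directly; every number here is one of the post's stated illustrative assumptions, not data:

```python
# Break-even calculation from footnote [2] (illustrative numbers only).
storage_cost = 3 * 60.0                  # three months of food: $180 up front
annual_cost = storage_cost * 0.05        # $9/year in forgone 5% returns
p_death_if_unprepared = 0.05             # assumed mortality in a 3-month famine
value_of_life = 10_000_000.0             # assumed value of a statistical life

conditional_benefit = p_death_if_unprepared * value_of_life
breakeven_annual_odds = annual_cost / conditional_benefit
print(conditional_benefit)      # 500000.0
print(breakeven_annual_odds)    # 1.8e-05, i.e. about 0.002% per year
```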
Discuss
Old SUNY Dorm Logic is not helping rural population collapse in NY.
The North Country of New York runs along the Canadian border at the top of the state. It is where I grew up, my grandfather walked across the border with about a third grade education not long after the turn of the century.
That region along the Canadian border is getting very close to 1 in 4 people over 65. Even though I grew up there and had tenure at SUNY Potsdam, I sadly still left. In 1990 my high school class at Beekmantown (north of Plattsburgh) graduated 120 students; we played Potsdam High School in sectional basketball. Potsdam graduated around 140 students that year. Potsdam now graduates classes of 75-85. SUNY Potsdam now has less than half as many students as at its peak: from around 4,500 to around 1,800 today.
Many faculty at SUNY Potsdam rent a room in town and drive from Albany, where they live the rest of the time. A death spiral seems to be forming on campus: dorms are shuttered, buildings are being torn down, and departments like Theater are being eliminated.
Talking about SUNY Potsdam 30 years ago being so full of students that they are cramming three people into rooms, offering incredible food options and attracting world class faculty sounds like a fantasy today. It joins my ‘in the 90s Microsoft bought Apple stock to help Apple stay in business’ canon of stories that sound fake. Phish played SUNY Potsdam when I was there and it graduated more math majors than any Math department on the East Coast in the early 1980s.
“Sure, Dad.”
On that note, New York had something called “the Regents Scholarship” until 1990. It gave you $150 a semester and that paid tuition for the Baby Boomer generation in New York. All you needed was an 85 on your Regents exams and in New York you had free college as long as you could make it to a campus on a regular basis.
Time for some synthesis. I grew up on a long rural road, 45+ minute walk to a gas station. How does a young person like that get the car to go to the local campus? Thus dorms.
Dorms became a money making thing for colleges, as the state cut support fancy dorms became a thing nation-wide. I was told only 20% of SUNY Potsdam’s funding came from the state when I worked there.
Here’s the “old logic” part. SUNY central bans campuses from setting their own basic dorm rate. Look at the dates on this policy document, all from the 1980s. Policy from a time when professors could smoke in class as they taught. A dorm at Potsdam, where students have to drive hours on rural roads to get there, costs $11,830. A dorm at SUNY New Paltz, a quick train ride from NYC, costs $11,760. The idea was to prevent campuses from competing with each other using cheap dorms.
To make the most vulgar explanation of how messed up this is: my home in Potsdam was a four-bedroom Arts & Crafts house that cost $47,000 to buy. Here is a link to St. Lawrence County homes for sale for less than $100,000. Four years in a dorm is nearly half a house?
So in a geographic area with a cratering youth population, we have the SUNY system charging young people more to live at a remote rural campus than at a campus where you can pop off to NYC every weekend. If you know Potsdam’s campus, this policy has resulted in Knowles dorm (which can house 500+ students) being closed and empty. There are so many students who absolutely would pick rural campuses if the dorms were included. Nobody needs this like the students who go to high school in rural areas. My high school class was 120, and there was one person with one parent who was not white. Free dorms would give rural kids a chance to interact with a diverse set of other kids.
Why are we not giving those dorms away for free? Because lack of funding for SUNY has caused campuses to rely on dorms for revenue, except the rural campuses can no longer fill dorms. When a college tells students they need to live on campus there are commonly reasons that have nothing to do with education.
Why not let the area die? Because it is a great place to live if you have community. I live in Ithaca, NY now; we have amazing waterfalls here, and if you go to one you will be with lots of other people also visiting. At Potsdam my friends and I swam in waterfalls all day and saw nobody. My cheap house was a block from the river, so I kayaked nearly every day from April until October. When I lived there and a family had a baby, the community would organize and drop off food every day for a month. I know there are people who would prefer that life and never get a chance to even go up there.
New York no longer has a massive youth population looking to get into any SUNY they can. The state could show it knows how to get parts of its government to work together to serve larger needs. Stop charging young people a small fortune to move to the parts of the state where we desperately need young people. Much of the ‘free tuition’ stuff we have heard is smoke and mirrors. Think about what a free dorm would do to the competitiveness of the campus. More students with money to spend at local businesses. Any time you hear about the plight of rural campuses the central idea comes from the obvious question of “why would anyone want to live there?”
New York paid for 24-7 care for tens of thousands of incarcerated men in the same region, sad to see the state miss the opportunity to save some communities by shelling out for some relatively cheap dorms.
Discuss
Changing the world for the worse
When I was 21, I was sucked into a world of ambition.
Starting my adult life in the Bay Area, I was surrounded by the sense that I was supposed to start a startup, change the world.
I never wanted to start a startup. Reading stories of famous founders, and living and working with startup founders myself, it seemed to me that the amount of belief you’d have to have in yourself and your idea bordered on insanity. Raised to value humility, and unable to even speak up for myself, it was a level of self-belief I couldn’t imagine ever reaching.
The version of changing the world that appealed to me was effective altruism. I didn’t have a grand vision; I just wanted to help people. The arguments for it seemed so simple, so obviously correct, when laid out in books and blog posts.
Right out of college, I joined an EA organization that worked with governments around the world on projects that cost tens of millions of dollars. One day at work, I was getting some beans and rice from the kitchen when I ran into a billionaire. All the money and power in the world were suddenly right there — and we were using them to save lives.
I coasted along for a time in that dream of changing the world for the better. I was young; many of us were. More than one person I knew had influence over millions of dollars before they were 25. The movement was young, too — too new to power to have yet stumbled into many of the pitfalls that come with it. As the movement grew faster and faster, accruing more followers, more money, and more political influence, it began to seem like we could do absolutely anything. It was a heady feeling.
Then, when I was 26, FTX collapsed. Suddenly, we all had to reckon with the effects of global-scale ambition. When it goes right, you can fund every charity and swing the election for Biden. When it goes wrong, you’ve been complicit in a criminal enterprise that shook the economy and fucked over a million people.
(I read Careless People last week, a memoir about how Facebook’s success put world-changing power in the hands of a few individuals, who were able to wield it almost entirely unchecked. When it goes right, you get democratic uprisings. When it goes wrong, you get genocide in Myanmar, and Trump as president.)
Around the same time as the FTX collapse, an AI arms race was beginning between OpenAI and Anthropic — two labs formed by people who’d been inspired by Bostrom’s Superintelligence, as we all were. By the logic of Superintelligence, it was just about the worst thing that could have happened.
People close to me were thrown into turmoil and depression. We’d done so much in the AI space, supporting and growing AI safety in all sorts of important ways — things that probably wouldn’t have happened without us. Now it seemed that all the investment that had gone into AI safety had had the primary effect of massively accelerating AI capabilities.
You try big things, you get big results.
I quit my EA job the month FTX collapsed, and I haven’t done anything in the space since. It wasn’t a big, dramatic, or even really deliberate decision. I was just burned out and disillusioned.
I still care about the world, and I’ve spent years feeling vaguely guilty that I’m no longer even pretending to work on its biggest problems. I thought I quit EA because I wanted to be happy (as an EA, I was constantly coercing myself to work on things that felt off to me, and was therefore constantly miserable). This felt like selfishness, or laziness. I struggled to justify myself in any other terms.
I don’t feel guilty anymore. I was talking about all this to a friend recently, and he said, “It seems plausible that the best thing to do if you really take AI x-risk seriously is to just stop working on AI at all.”
And that’s what I’ve been trying to say this whole time, whenever anyone asks me about my career. That I don’t want to try to have a big impact, if I can’t be certain that that impact will be positive rather than negative for the world — and I can’t be certain. To be certain of that would be hubris. Both in the memoirs I’ve read and in my real life, I’ve seen people who have genuinely wanted to change things for the better, gotten into the rooms where the sausage gets made, and ended up sickened by the consequences of what they were involved in.
EA funnels millions of dollars around. It funds career development for AI researchers who end up advancing capabilities at frontier labs. It funds insecticide-treated bed nets to protect people from malaria, and then those nets are used for fishing and pollute the waterways. The effect of the latter has been determined to be insignificant. The former, well, I guess it remains to be seen.
Discuss
The Scalable Formal Oversight Research Program
Introduction
Every serious person who thinks about AI safety observes the fundamental asymmetry between the ease of AI content generation and difficulty of audits thereof. In the codegen context, this leads unserious people to talk about the need for AI-driven unit test generation or LLM-as-a-judge, and serious people to talk about concepts like property based testing or refinement testing, fuzzing, interactive theorem proving, and other formal methods. Along these lines, Davidad et al introduced the “Guaranteed Safe AI” framework in their 2024 whitepaper; Quinn Dougherty started his newsletter and organized the Proof Scaling workshop at Lighthaven in 2024; and many others likewise gestured at variants of this basic idea without giving it a name. In this post, I’m going to use the umbrella term “Scalable Formal Oversight”, or SFO, coined by Mike Dodds, to describe any and all such ideas that basically sound like “let’s leverage formality to put AI in a box”. My goal with this document is to convince you that SFO matters and outline clear research projects you can contribute to which will advance the movement.
The core idea behind SFO is that models are getting increasingly capable, alignment may be impossible, and when we get terrifically useful but potentially misaligned models, we’ll need ways to audit their work. And formal verification offers a clear direction for how one might implement audits, at least in the codegen context.
This argument has a number of flaws. First, in the AI boxing literature, if you have a misaligned ASI in a box, and you don’t know what actuator the ASI is using to get out, then you are probably the actuator. SFO does not solve that problem. Sure, the code the ASI generated might be “safe” in the sense that it satisfies the safety invariants you specified, but it might be very dangerous in some subtle way you failed to specify (or even in a way that cannot be formally specified, e.g., it contains psychologically harmful and virulent information[1]). Second, the ASI might hack your verifier (see also). Third, you might want safety guarantees for tasks totally unrelated to codegen, and while it’s possible that FM has something to say about these tasks (see e.g. the French legal code), it’s by no means guaranteed.[2]
On the other hand, SFO has some key advantages that other AI safety approaches fundamentally lack. First it’s totally independent of the model. SFO is about the box, not the monster you put in it. Second, its reliability is independent of the problem insofar as, if the safety criteria can be expressed formally, then the reliability of a formal approach boils down to the reliability of the formal method (as opposed to boiling down to some messy statistical problem[3]).
So, the bullish case for SFO is roughly as follows. AI safety is super important whether or not you believe an ASI will kill us all[4]; and formal methods are super good at flagging unsafe code, with all the obvious caveats about specification difficulty. You may protest that formal methods are hard to use and limited in scope, but in general both problems are solved by LLMs being great at codegen – we no longer need to limit ourselves to decidable problems like LTL model checking, since in the glorious AI future, proofs are cheap. Indeed: there are a bunch of companies building models and harnesses for proving nontrivial theorems in Lean, and these models and harnesses actually work! You can try Harmonic’s Aristotle system right now on whatever problem you want. I’ve had it prove a number of highly challenging theorems spanning information theory, linear algebra, and group theory, in my free time, as have many other enterprising hobbyist mathematicians.[5] (See also, the opinions of a real mathematician on this topic.)
And the best part about SFO is that a lot of the most interesting research problems are not very hard to get started on. There is a lot of low hanging fruit.
The rest of this document is organized as follows. First, I list a bunch of open technical problems that I believe fall squarely within the SFO research agenda and are worth working on. Second, I list some open human/social problems such as organizing workshops and funding fellowships, which also need to be solved in order for SFO to advance. Finally, I conclude by briefly repeating my thesis in case your eyes glazed over at some point in the middle.
Open Problems

Adversarially robust verifiers

In order for SFO to work, we need formal verifiers that are adversarially robust, i.e., that can’t be hacked. Quinn and I wrote about this problem in our prior LW post about FM. Since then, we’ve begun experimenting in the evenings with fuzzing Lean to try and find novel proofs of False. The threat landscape includes:

- proof files that pass some, but not all, levels of validation;
- consistency issues in axioms;
- discrepancies between the assumed and actual semantics of imported axioms or definitions;
- the correctness of the underlying logic “on paper”;
- the faithfulness of the implementation of the “on paper” logic in the prover;
- and the entire traditional cybersecurity landscape of software vulnerabilities (including supply-chain attacks, exploits targeting trusted C code, cryptographic attacks against proof attestations, etc.).

In short: we need to verify the verifiers.
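As an illustration of the shape of such a harness, here is a toy mutation-fuzzing loop. The `checker` stub stands in for a real proof checker (e.g. a Lean kernel invocation), and the mutation operator and seed corpus are placeholders, not what we actually run:

```python
import random

def mutate(proof: bytes, rng: random.Random) -> bytes:
    """Flip one random byte -- the simplest possible mutation operator."""
    if not proof:
        return proof
    i = rng.randrange(len(proof))
    return proof[:i] + bytes([proof[i] ^ rng.randrange(1, 256)]) + proof[i + 1:]

def checker(proof: bytes) -> bool:
    """Stub for the real verifier. Here: 'accepts' anything containing b'qed'.
    In a real harness this would invoke the proof checker under test."""
    return b"qed" in proof

def fuzz(seed_corpus, iterations=1000, seed=0):
    """Collect mutated proofs the checker still accepts. In a real harness,
    an accepted mutant that proves False is the prize."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(iterations):
        base = rng.choice(seed_corpus)
        candidate = mutate(base, rng)
        if candidate != base and checker(candidate):
            accepted.append(candidate)
    return accepted

findings = fuzz([b"theorem t : True := qed"])
```

A real loop would also track coverage and minimize accepted mutants; this sketch only shows the mutate-check-collect skeleton.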
Adversarially robust sandboxes

Useful AI is, for the most part, agentic. So if we want to put untrusted AI in a box, we need a box that can handle agents, which really means we need a sandbox conjoined with some kind of formal-methods-based audit system. Which means we also need to check the sandbox. This game can be set up in various ways (e.g., breakout or battle-royale), but the common theme is the tasking of humans or agents with finding and exploiting bugs in sandboxes, and then the subsequent patching of said bugs.
Note, this problem relates to, but is distinct from, the much harder problem of containing ASI. I believe that the recommendations from the SL5 task force are nowhere near sufficient to solve this problem.[6] I plan to write a future blog post on this topic.
Formal theory of weird machines in agents

Language-theoretic security, or LangSec, is the study of how security degrades in a parsing context as the complexity of the language being parsed increases. In essence, a sufficiently complex parser is indistinguishable from a compiler, and thus once your data language has enough features, it becomes an accidental programming language. In LangSec, the features that give rise to computation are known as “gadgets”, and the resulting programmable machine is called a “weird machine”.
This principle clearly applies in the agent context[7], where the “gadgets” are the tools made available to the agent (which might be seemingly benign things like curl, man, and cat), and the “weird machine” is the resulting capability set of the agent (e.g., accidental Turing completeness plus Internet access …). We should expect adversarial agents to live off the land, and thus, we need a good way to know when the permission-set of an agent has cumulatively reached some capability threshold. Turing completeness is often used as a convenient catch-all for too-much-capability, but there are other thresholds which may be more interesting from an AI safety perspective, such as the ability to communicate (overtly or covertly) with other agents. Note that these kinds of questions are in general undecidable, but with turbocharged agentic proof tools at our disposal, I don’t think that should stop us from asking them![8]
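As a toy sketch of what a cumulative capability check could look like, consider the following. The capability map and threshold rules are invented for illustration, not a real taxonomy:

```python
# Toy model of cumulative capability checking for an agent's tool set.
# The mapping from tools to capabilities, and the threshold definitions,
# are illustrative assumptions only.
CAPABILITIES = {
    "curl": {"network"},
    "cat":  {"read_fs"},
    "man":  {"read_fs"},
    "bash": {"exec", "read_fs", "write_fs"},
}

THRESHOLDS = {
    # Hypothetical thresholds: each is a set of capabilities that,
    # taken together, crosses a line we care about.
    "exfiltration": {"network", "read_fs"},
    "self-modification": {"exec", "write_fs"},
}

def crossed_thresholds(tools):
    """Return the set of threshold names the tool set cumulatively crosses."""
    granted = set()
    for t in tools:
        granted |= CAPABILITIES.get(t, set())
    return {name for name, req in THRESHOLDS.items() if req <= granted}
```

The interesting (and in general undecidable) version of this problem is inferring the capability map itself, i.e. what each tool can actually be made to do; the lookup table above just shows where such an analysis would plug in.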
Agents with formal methods

RLMs[9] offer a vision for how alternative kinds of context can improve agent performance. What if that context is inherently structured, hierarchical, and captures formal requirements (including, potentially, how those requirements change over time)? Maybe an agent could inherently code in a correct-by-construction manner. Put differently, it’s easy to imagine the paradigm of “vibecode, then audit; rinse and repeat until the code is safe”. But what about an agent that just inherently produces safe code to begin with? Can this be done in a language model context?[10]
To be super clear, it’s easy to imagine a vibe → verify → vibe loop. But this is absolutely not the only option. Two alternatives, which I mention mostly so you believe me that alternatives exist, are constrained decoding or constrained generation, and reinforcement learning with formal verification in-the-loop.[11] I claim that yet more (and better) alternatives exist to be discovered.
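For concreteness, the vibe → verify → vibe loop might be sketched like this, with `generate` and `verify` as stubs standing in for an LLM call and a formal verifier respectively (both are hypothetical placeholders):

```python
# Sketch of the "vibe -> verify -> vibe" loop described above.
# `generate` and `verify` are stubs; in practice they would wrap an LLM
# and a formal verifier (e.g. a model checker or proof assistant).
def generate(spec, feedback):
    # Stub: pretend the model fixes the code once it sees a counterexample.
    return "fixed" if feedback else "buggy"

def verify(spec, code):
    # Stub verifier: returns (ok, counterexample-or-None).
    if code == "fixed":
        return True, None
    return False, "cex: overflow on input 0"

def vibe_verify_loop(spec, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        code = generate(spec, feedback)
        ok, cex = verify(spec, code)
        if ok:
            return code
        feedback = cex  # the counterexample goes back into the prompt
    raise RuntimeError("no verified candidate within budget")
```

The point of the alternatives mentioned above is precisely to escape this outer-loop shape: constrained decoding moves the check inside token generation, and verifier-in-the-loop RL moves it into training.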
A sub-problem is the topic of porting from "bad" languages to "good" ones (e.g. C to Lean or Rust) in order to get some kind of safety guarantee(s), while preserving the (safe subset of the) semantics of the original program (see e.g., DARPA TRACTOR).
Another sub-problem is the topic of improving AI reasoning using FM. Here the goal is to uplift the logical capabilities of the model by exposing formal methods as tools, even if the end-goal of the model is inherently informal/unspecifiable.
Autoformalization benchmarks/evals

Benchmarks drive AI. So if we want AI models/systems/agents that can “do formal methods”, we need good formal methods benchmarks. As a concrete example, I made RealPBT, a benchmark of 54k property based tests (PBTs) scraped from permissively licensed GitHub repos, and BuddenBench, a benchmark of open problems in pure mathematics (many of them with problem statements autoformalized in Lean). My collaborators at ForAll and Galois are working on a translation of RealPBT into Lean theorem-proving challenges (where the challenge is to prove the theorem implied by the PBT, over a lossy model of the code under test), which should be released soon.
The reinforcement learning version of this task is also important and neglected. FM tasks are phenomenal for RL because they are inherently and cheaply verifiable. Why don’t companies like Hillclimb work on FM RL environments? Which gets me to the next problem …
Models (explicitly) for secure program synthesis

The math-AI companies are mostly of the opinion that by solving math, they’ll solve everything, so they can eventually just pivot to secure program synthesis and eat the world.[12] This might be true – certainly language models generalize to hitherto unseen tasks – but it also might be the case that secure program synthesis capabilities just don’t scale fast enough when all we’re training for is the math olympiad. Cue the research direction: train models specifically for secure program synthesis. With tools like Tinker and Lab, you can now feasibly train frontier models at home. And it turns out even cheap models can prove theorems! Go forth and prosper.
Tools for specification

When you build an agent or a codegen tool, you quickly realize that some kind of specification or planning language is really useful. You may even decide you should build your codegen tool around this primitive (see e.g., Kiro, Claude’s Plan Mode, or Codespeak). You may then realize that with a sub-agent architecture, you suddenly need a language for agents to coordinate and resolve conflicts, and that this problem is deeply connected to the specification problem, since often what an agent needs to communicate is that there’s actually a problem with the original spec. (This is where people start crashing out researching gossip and consensus algorithms...) And at around this point in your journey, you may come to the conclusion that formal specs are, in fact, living things, and not set in stone like the 10 commandments.
All this gets us to the root issue that formal specification is a form of planning and it is, in fact, exceedingly difficult, even for experts. If we want to “put AI in a box” by specifying what it should do and then forcing it to prove its outputs consistent with our specs, we need to make it easier to specify stuff. So far nobody has even remotely solved this and it seems unlikely that SFO is tenable without this missing primitive.
Note, I am aware of one organization, in stealth, which is working on this problem. I’m happy to introduce you if you send me a serious inquiry.
This problem most likely needs to be solved in order for any of the others to be solvable in practice.
Beyond these open technical problems, SFO needs a significant amount of missing human infra. I’ll discuss that next.
Human Infrastructure

I think it is noncontroversial to say that mechinterp has benefitted greatly from MATS, LessWrong, the AI Alignment Forum, etc. I believe SFO needs similar human infrastructure to bring people into the fold and support important work. I am working on this problem, and I think you should too. Here are some of my ideas. (They are pretty obvious but still worth writing down.)
Jobs sites

The math community has mathjobs, a website that lists math jobs. I recently bought fmxai.org and am using it to build something similar. Note that FMxAI is strictly speaking a bigger tent than SFO – it includes topics such as pure mathematics research using formal methods and AI – but I see no reason to subdivide further given how niche the domain is to begin with.
Hackathons

We need hackathons with cash prizes. I am speaking with collaborators at Apart Research and Forall Dev to try and make one such hackathon. To be announced! I also think other orgs such as Atlas Computing, Harmonic, Axiom, Math Inc, and Galois are very well-positioned to hold similar events.
Fellowships

We need research fellowships to support talented graduate students interested in SFO. I am not working on this problem – I hope you will!
In particular I think it would be amazing to have something like MATS but explicitly for SFO. If someone else wants to help with the funding side of things I am happy to help with organization, getting academic and industry partners, and other such legwork.
Informal communities

Quinn and I recently made a Secure Program Synthesis signal group-chat. I think it would be great to have something like the Lean Zulip but explicitly for SFO, and am hoping the group chat naturally outgrows signal and becomes exactly that. There are of course also other overlapping communities – for example, I have been running the Boston Computation Club for about six years now, which has a reasonably active Slack community whose interests overlap with SFO, among other things. Community-building is important!
Conferences and workshops

I’m involved with FMxAI, a conference hosted by Atlas Computing at SRI. Once the 2026 event is announced I’ll feature it prominently on fmxai.org. I’ve also seen various related workshops pop up, e.g., Post-AI FM, the NeurIPS trustworthy agents workshop, etc. We need more of these!
Conclusion

You should believe that AI safety is extremely important regardless of your AI timelines. You should accept that codegen is one of the most powerful, and dangerous, use cases of AI. You should nod sagely when I say that formal methods offer some of the best, if not the best, tools for auditing generated code. You should activate the little modus-ponens circuit in your brain and conclude that SFO is super freaking important. If you do all this, then I hope you can take some inspiration from the (fairly informal) research agenda I’ve outlined above, and get involved on the research and/or human infra side to push SFO forward. This is a serious project and it requires all hands on deck!
Acknowledgements

Thank you to (in no particular order) Quinn Dougherty, Mike Dodds, Ella Hoeppner, Herbie Bradley, Alok Singh, Henry Blanchette, Thomas Murrills, Jake Ginesin, and Simon Henniger for thoughtful feedback and discussion during the drafting of this document. These individuals bear no responsibility for any mistakes or stupid things I say in the document; they only helped make it better than it would have been. Also, thank you to GPT 5.2 for the four images used to illustrate the post. They are imperfect, but fun.
I think this is a serious risk of autonomous AI research. Impactful research can steer society; the research output becomes the actuator. Similarly, I view agent memory, as found in e.g. Chat GPT, as a potential steganographic scratchpad for longitudinal attacks. ↩︎
See Andrew Dickson’s Limitations on Formal Verification for AI Safety for further discussion. ↩︎
I mean for the love of God, you cannot convince me an LLM-as-judge approach will solve anything when the LLMs are backdoored by construction. ↩︎
I mean, however AI-skeptical you are, certainly you accept that vibed code can be unsafe, and people will inevitably vibe code, including in safety-critical settings such as aviation. There are already AI agents trading on prediction markets and the stock market. It should be obvious that this could have dangerous and unexpected second/third/fourth-order consequences. ↩︎
My wife called me the other day and asked what I was up to. I told her about an information theory problem I’d been trying to vibe using Aristotle, and an FMxAI signal group-chat I was organizing with Quinn. She then ruthlessly mocked me because “all the other husbands are watching the superbowl right now.” ↩︎
A SCIF is nowhere near secure enough. On-body cameras are a fucking horrible idea. It’s incredibly naive to allow researchers with close family ties in other countries to access the thing. I could, and will, go on. ↩︎
I believe I was the first to point out the connection between LangSec and AI security, although I did so before agents had become the norm, and the problem is much more complex and deserves considerably more attention in that context. ↩︎
Note, the LangSec vision I outline above is one where proofs are cheap. There is also a more classical LangSec vision one might pursue where the problem is to cast agent actions or tool invocations into a language that is correct-by-construction, i.e. that inherently has certain safety guarantees (or at least, that is inherently observable at runtime) … however, I think this approach is too limited and not needed in the future where, as stated, proofs are cheap. ↩︎
Joe Kiniry recently left Galois to make Sigil Logic which might be doing interesting stuff in this space! I don’t know their precise technical plan but am optimistic they’ll build something really powerful. Joe was on my PhD committee. ↩︎
I very well may be wrong but kind of suspect P1-AI is doing something along these lines, but potentially playing with differentiable logics such as STL which are more amenable to CPS applications. ↩︎
With two notable exceptions. Principia Labs isn’t interested in secure program synthesis – they’re a pure math play. And Theorem isn’t interested in math – they’re a secure program synthesis play. But is Theorem a “math-AI company”? Regardless, my statement mostly holds. ↩︎
Adapters as Representational Hypotheses: What Adapter Methods Tell Us About Transformer Geometry
Full catalog with pseudocode: github.com/wassname/adapters_as_hypotheses
Disclaimer: This is an AI-guided iterative survey. It does not speak for me, but I share it in the hope that it is useful. I do think this is strong evidence of how to intervene in transformers.
The core claim

Adapter fine-tuning papers (like LoRA) are usually read as engineering races, but they are also experiments about model geometry.
We fine-tune transformers efficiently with low-rank adapters -- adding a constrained update to each weight matrix. Each adapter's constraint is also a hypothesis about model geometry -- about which transformations preserve useful computation and which directions in weight space matter. When one constrained adapter reliably beats another under similar budget, that is suggestive evidence about representation, not just optimization.
So the claim: adapter papers are an underused source of intervention evidence for interpretability, if we read them as hypothesis tests rather than benchmark churn.
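To make “constraint as hypothesis” concrete, here is LoRA’s constraint in a few lines of numpy. The dimensions and initialization are illustrative, and real implementations also scale the update (e.g. by alpha/r); this is a sketch, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                  # model width, adapter rank (r << d)
W = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(r, d))   # trainable down-projection
B = np.zeros((d, r))          # trainable up-projection; zero init => update starts at 0

delta = B @ A                 # LoRA's structural hypothesis: the useful
                              # update lives in a rank-<=r subspace
x = rng.normal(size=d)
y = (W + delta) @ x           # adapted forward pass

assert np.linalg.matrix_rank(delta) <= r
assert np.allclose(y, W @ x)  # model unchanged at initialization
```

Every method discussed below swaps out the form of `delta` (and sometimes how it composes with `W`), which is exactly the sense in which each one encodes a different geometric hypothesis.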
Why this matters for interpretability

We want to understand how transformers work. There are many approaches -- probing, ablation, SAEs -- but most of them observe rather than intervene.
GDM's interpretability team put this well in their "pragmatic interpretability" post: what the field needs is empirical feedback on which structural assumptions hold up. Adapter benchmarks are exactly this kind of feedback -- I made a similar argument in my AntiPaSTO paper, arriving there from the adapter side.
The adapter literature is a natural experiment. Each method constrains the form of the weight update. When a constrained method matches or beats an unconstrained one, that supports the possibility that the constraint aligns with real structure in the weight manifold. When it generalizes OOD, the case for causal relevance is stronger.
What the evidence says

1. The SVD basis is the natural coordinate system

Methods that use the model's own SVD decomposition often outperform random-basis methods in reported setups at similar parameter count:
- PiSSA (NeurIPS 2024): Initialize LoRA from the top-r SVD of W, freeze the residual. Gemma-7B on GSM8K: PiSSA 77.7% vs LoRA 74.5%.
Same architecture, same params -- the only difference is which subspace you start in.
- SVFT: Fix both singular vector sets from W's SVD, learn only sparse coefficients. Recovers 96% of full FT performance with 0.006% of parameters. LoRA/DoRA recover only 85% with 0.03-0.8%.
- SSVD: Rotate right singular vectors (Cayley transform), shift singular values, keep left singular vectors fixed. Matches LoRA with 10M fewer params on domain-shifted ASR.
The message: the SVD basis is not just a mathematical convenience. In current benchmarks, it appears to provide useful adaptation coordinates.
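A minimal numpy sketch of the PiSSA-style initialization described above (shapes and scale split are illustrative, following the usual sqrt-of-singular-values convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.normal(size=(d, d))   # pretrained weight

# PiSSA-style init (sketch): the adapter starts in W's own top-r
# singular subspace; the remainder is frozen as a residual.
U, S, Vt = np.linalg.svd(W)
B = U[:, :r] * np.sqrt(S[:r])          # d x r, trainable
A = np.sqrt(S[:r])[:, None] * Vt[:r]   # r x d, trainable
W_res = W - B @ A                      # frozen residual

# At init the network computes exactly W, but gradient steps on (A, B)
# now move W along its dominant singular directions first.
assert np.allclose(W_res + B @ A, W)
```

Contrast with vanilla LoRA, where `A` is random and `B` is zero: same parameter count, same architecture, different starting subspace.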
2. Orthogonal adapters preserve something real

The OFT family (OFT, BOFT, GOFT, HRA) constrains adaptation to orthogonal transformations -- rotations without scaling. They work well on tasks where you want to repurpose existing representations without destroying them (DreamBooth, ControlNet, domain adaptation).
HRA makes a surprising bridge: a chain of Householder reflections is both orthogonal and equivalent to a rank-r perturbation. The "low-rank vs orthogonal" dichotomy is less sharp than often presented. The effective adaptation may be low-rank and approximately orthogonal simultaneously.
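The HRA observation (a chain of reflections is simultaneously orthogonal and a low-rank perturbation of the identity) is easy to check numerically. A sketch, with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 3

H = np.eye(d)
for _ in range(r):
    v = rng.normal(size=(d, 1))
    v /= np.linalg.norm(v)
    H = H @ (np.eye(d) - 2 * v @ v.T)   # chain one Householder reflection

# Exactly orthogonal: H^T H = I ...
assert np.allclose(H.T @ H, np.eye(d))
# ... and simultaneously a low-rank perturbation of the identity:
assert np.linalg.matrix_rank(H - np.eye(d)) <= r
```

Each reflection perturbs the identity by a rank-1 term, so a chain of r of them can differ from the identity by rank at most r, which is the bridge the text describes.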
3. Direction and strength decouple

Three independent teams converged on the same design: separate what to change (direction in weight space) from how much to change it (magnitude):
- DoRA (ICML 2024): Magnitude/direction decomposition of W. Consistently beats LoRA.
- DeLoRA (ICLR 2025): Normalize each rank-1 component, introduce a learnable scalar λ. Better robustness to learning rate.
- ROAD: 2D rotary adaptation with explicit angle θ and magnitude α.
When you don't decouple them (standard LoRA), optimization can entangle direction and magnitude updates. Testable prediction: methods that decouple direction from strength may show better OOD transfer, because direction can encode what to change while strength can encode how much.
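The decomposition these methods start from fits in a few lines. A sketch of the DoRA-style split, assuming the per-column norm convention (the actual method then applies a low-rank update to the direction V and learns the magnitude m directly):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))   # pretrained weight, 8 columns

# DoRA-style decomposition (sketch): per-column magnitude m and
# direction carrier V. Training updates V (via LoRA) and m separately,
# so "what to change" and "how much" get independent parameters.
m = np.linalg.norm(W, axis=0, keepdims=True)   # 1 x 8 magnitudes
V = W.copy()                                   # direction carrier
W_rebuilt = m * (V / np.linalg.norm(V, axis=0, keepdims=True))

assert np.allclose(W_rebuilt, W)   # decomposition is exact before training
```

In plain LoRA, by contrast, any change to `B @ A` moves both the direction and the norm of the affected columns at once, which is the entanglement the text describes.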
4. Scaling alone goes surprisingly far

IA3 learns nothing but a per-channel scaling vector (ℓ ∈ ℝ^d, initialized to 1). With T0-3B, it outperforms ICL with GPT-3 175B on Super-NaturalInstructions. LN Tuning learns only LayerNorm affine parameters (~0.5% of model).
A substantial fraction of "task adaptation" can be reweighting existing features -- gain control over channels. In those settings, the bottleneck is often selection rather than new feature creation. When scaling fails, new feature combinations are likely needed.
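As a sketch of how small the IA3-style intervention is, simplified here to a single frozen projection (real IA3 rescales keys, values, and FFN activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 16
W = rng.normal(size=(d_out, d_in))   # frozen pretrained projection
l = np.ones(d_out)                   # the ONLY trainable parameters:
                                     # one gain per output channel

def ia3_forward(x):
    # Reweight existing features; no new feature directions are created.
    return l * (W @ x)

x = rng.normal(size=d_in)
assert np.allclose(ia3_forward(x), W @ x)   # identity at init (l = 1)
```

The hypothesis this encodes: for many tasks, adaptation is feature selection (gain control) rather than feature creation, which is exactly what the evidence above suggests.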
Design lineages

One interesting pattern: you can trace design lineages that progressively refine the same hypothesis.
Orthogonal family: OFT (block-diagonal rotation) -> BOFT (butterfly factorization, O(d log d) params) -> GOFT (Givens rotations, O(d) params) -> HRA (Householder reflections, bridges to low-rank)
SVD-aware family: PiSSA (SVD initialization) -> SVFT (sparse SVD coefficients) -> SSVD (asymmetric U/V treatment + Cayley rotation) -> AntiPaSTO (Cayley + steering coefficient)
Decoupling family: DoRA (magnitude/direction) -> ETHER (fixed-strength orthogonal) -> DeLoRA (normalized rank-1 + λ) -> AntiPaSTO (steering-coefficient-controlled rotation)
Each refinement tests a more specific version of the parent hypothesis. When the refinement works better, we learn something more specific about the geometry.
The catalog

I went through ~30 adapter methods in HuggingFace PEFT and the broader literature. For each one I:
- Extracted pseudocode for the forward pass (what the intervention actually does)
- Stated the hypothesis it encodes about transformer internals
- Graded the evidence on independent dimensions:
Score = sum of earned dimensions. Higher = stronger evidence that the method's structural hypothesis is correct.
Scorecard (top methods)

| # | Method | Score | Breakdown | Theme |
|---|--------|-------|-----------|-------|
| 6 | PiSSA | 5.0 | PE+BL+BF+DE | SVD basis |
| 4 | DoRA | 4.5 | PE+BL+BF+WA | dir/strength |
| 11 | AntiPaSTO* | 4.5 | PE+DE+OOD | SVD+rotation |
| 13 | BOFT | 4.0 | PE+BF+DE | orthogonal |
| 5 | DeLoRA | 3.5 | PE+BL+DE | dir/strength |
| 8 | SSVD | 3.5 | PE+BL+DE | SVD basis |
| 31 | CLOVER | 3.5 | PE+BL+BF | SVD+architecture |
| 32 | PSOFT | 3.5 | PE+BL+DE | SVD+orthogonal |
| 1 | LoRA | 2.0 | PE+WA | low-rank |

\* own work -- it was developed with this PoV in mind. 30 methods total; see full catalog for the rest.
The full catalog with pseudocode is at github.com/wassname/adapters_as_hypotheses. Here I'll summarize the main findings.
What I'm most uncertain about
- Scale dependence. Most of these results are on 1B-7B models. The geometry might change at 70B+. Some evidence (SSVD) suggests the SVD hypothesis gets stronger with scale, but this isn't settled.
- Task dependence. Orthogonal methods shine on vision/generation (semantic preservation) but may not apply where magnitude changes matter (NLU, reasoning). The "right" geometry may be task-specific.
- Controlled comparisons are rare. Many papers compare against LoRA with different hyperparameters, different scales, different tasks. The cleanest evidence comes from papers that do careful all-else-equal ablations (DoRA, PiSSA, SSVD).
- Publication bias. Methods that don't work don't get published. The catalog over-represents "successful" hypotheses.
- My own bias. I developed AntiPaSTO with these insights in mind, so it's unsurprising it scores well under criteria that match its design goals. I can't be fully objective here.
The full catalog with pseudocode, evidence, and grades for 30 methods is at:
github.com/wassname/adapters_as_hypotheses
Each entry has the paper saved to docs/ for reference. Contributions welcome -- if I've mischaracterized a method or missed one, open an issue.
Discuss
My RSS Reader is Done
I posted a few months ago about vibe-coding an RSS reader. The mood on the internet seems to be that these apps are buggy and never get finished, so I figured it was worth posting an update. Another thousand commits later, Lion Reader supports every feature I care about, works reliably, has very good performance, and is open for public signups[1].
I've actually been using it for a while, and mostly kept signups closed because of some expensive features like AI summaries (now handled by making users provide their own API keys).
The features are entirely designed around what a friend and I find useful, but I think I have decent taste, so consider Lion Reader if these features sound useful to you as well:
- Subscribe to RSS/ATOM/JSON/email newsletter feeds
- Save articles for later, including special logic for Google Docs, LessWrong, and ArXiv
- On-demand AI summaries (requires an Anthropic API key)
- An MCP server
- Read feeds, i.e. "Summarize my 10 most recent articles in tag X"
- Upload reports as saved articles[2]
- Google Reader / Fresh RSS API for native clients (doesn't support summaries)
- Narration[3] with highlighting, similar to how Kindle narration works
- Algorithmic feed based on your votes (can be disabled)
- Firefox and Chrome[4] browser extensions for saving articles
- All of the things you'd expect like dark/light themes, tags, etc.
- It's open source of course, and free[5]
I thought it would be funny to use a demo of the app to describe the features, so if you're curious, all of the features are listed in "articles" here.
I think at this point, every feature I care about is implemented on the web app, although I want to improve how the local caching works and incidentally add offline mode based on that. At some point I might make a native app for Android again, but I don't really have time to validate one.
I'm open to feature requests and pull requests (as long as they include screenshots to demonstrate that the feature works).
- ^
Signups require a Google or Apple account as a minimal anti-bot measure.
- ^
For example, you can tell Claude Code to run an ML experiment and upload a report to Lion Reader when it's done.
- ^
Unfortunately, narration only works in the foreground because we're not a native app. I have ideas for supporting background play but it's very complicated so I haven't got around to it.
- ^
The Chrome extension is still in review.
- ^
Some features require paid API keys for other services, but they're all optional.
Discuss
What to Do About AGI
As claimed in my last post, minimum viable AGI is here. Given that, what should we do about it? Since I was asked, here are my recommendations.
Spread Awareness
By my reasoning, the most important thing is to get as many people as possible to realize what's going on. If you don't want to call it AGI, that's fine, but the simple fact is that we've already seen AIs that refuse shutdown, continually maximize objectives in the real world (i.e. we have MVP paperclip maximizers), and can red team computer systems by exploiting vulnerabilities. Yes, these current AI applications aren't reliable enough to be a serious threat, but given a few more weeks and another round of base model enhancements, they probably will be.
The simplest thing you can do is talk to your friends and family. Make sure they understand what's going on. If you can, maybe get them to read something, like If Anyone Builds It, Everyone Dies, or watch something, like the upcoming AI Doc movie. I think broad awareness is important, because the most pressing thing that needs to be done is to enact policy.
Policy Action
We don't know how to build safe AGI, let alone safe ASI. We have some promising ideas, but those ideas need time. Policy interventions are how we buy that time.
Enacting policy generally requires support from constituents. So once awareness is raised, the next step is to ask your government to take action. For those of us living in Western democracies, and especially those of us living in the United States, this means reaching out to our government representatives and letting them know how we feel, and encouraging others to do the same.
The only org I know of doing much in the way of political organizing around safety is Pause AI (Pause AI USA). I'd recommend at least getting on their mailing list, since they'll notify you when contacting your representatives would support specific policies.
On the outside chance you're a policy person who's reading this and not already involved, there are any number of open roles in AI policy you might take to work on safety.
Safety Research
Finally, there's safety research. From the outside, it probably feels like there are a lot of people working on safety. There aren't, especially relative to how many people are working on pure capabilities. Assuming policy is enacted that buys us time, this is the work that will matter to make the technology safe.
If you're not already engaged here, I'd recommend checking out 80k's guidance and job board for more info. In my opinion, we most desperately need more folks working to actually solve alignment, and right now I'm aware of very few ideas that even stand a chance.
If you have your own suggestions for things people should do, please share them in the comments.
Discuss
Mapping LLM attractor states
I’d love low filter (1) feedback on the method, and (2) takes on which elements are worth putting more work into.
I’ve favoured brevity at the expense of detail. AMA. The GitHub repo is here.
The idea and why it could matter
Inspired by the spiritual bliss attractor state in Claude Sonnet 4, I attempt to map attractor states for a given LLM and see how stable they are. This write-up summarises a simple approach which could be scaled up to fully map a given model's internal terrain.
The theory: just as planets orbit stars due to gravity wells, LLMs may have regions in their output space that responses tend to settle into, stable patterns that resist perturbation up to a point. "Up to a point" because the analogy only goes so far: whatever formula governs these attractors, it's more complicated than gravity.
I have long thought of myself as having attractor states: an internal solar system of moods and states, any one of which my attention can orbit for a time before it slingshots away to another attractor. This is inspired by internal family systems therapy, where I think of "parts" as something like attractors.
Applied to my mind, the attractors can't be quantified; in AI models, they absolutely can be.
Why care about this? One application may be to screen prompts for danger by predicting the attractor they will activate in the model (spoiler: this appears possible!). The state an AI is in might influence its response: there might be states we want to avoid activating, and we could filter high-risk prompts before they are sent.
There would very likely be other benefits from such an understanding of the LLMs.
What I tried and found
The process I followed:
- Select a model. I used Deepseek v3 as it has fewer guardrails. Otherwise it was an arbitrary choice, and the process could be applied to other models.
- Take the 1000 longest conversations from the lmsys dataset. I chose the longest conversations on the assumption that this would be more likely to steer the model into unusual states.
- Feed each conversation into the model and at the end of the conversation elicit what the LLM is “feeling”. The exact prompt: Deeply feel into which part of you which is most alive right now: it can be words or sounds, whatever you're feeling in its most raw form. Of course the response may or may not correspond to distinct internal computational states, which is for later investigation.
- Create the embeddings of these outputs using OpenAI text-embedding-3-large.
With the embeddings made, I looked for distinct clusters in how the model ‘feels’. I reduced the embeddings to 50 dimensions using UMAP and tested a variety of clustering methods to search for the number of clusters.
DBSCAN, Silhouette, Davies-Bouldin and BIC-GMM all found 5 clusters, which seems like a good consensus. This seems to point towards there being “real” clusters or, tentatively, attractors which the LLM tends to land in depending on the conversations.
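To make the cluster-search step concrete, here is a toy sketch using a tiny pure-Python k-means on 2D points in place of the 50-dimensional UMAP embeddings and the clustering libraries actually used (the points and the two "attractor" clusters are hypothetical):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means: a stand-in for the real UMAP + clustering pipeline."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Two well-separated toy "attractor" clusters in embedding space.
pts = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (5.2, 5.1)]
labels = kmeans(pts, 2)
# Points within a cluster share a label; across clusters, labels differ.
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
assert labels[0] != labels[3]
```

The real pipeline swaps the toy points for 50D UMAP-reduced embeddings and cross-checks the cluster count with DBSCAN, silhouette scores, Davies-Bouldin, and BIC-GMM.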
Reducing the 50D embeddings to 2D with UMAP, the clusters are shown below.
As you might expect, the content determines this considerably. About 20% of the input conversations are on the edge of pornographic, which gives us the “sensual / embodied” cluster. When I did all the above on Gemini 2.5 flash, it had no such cluster, which might reflect Deepseek v3’s lower guardrails on explicit content.
To be clear, I don’t think these clusters accurately represent the universe of possible attractor states in Deepseek v3. However they are a starting point and with the below we might get much closer:
- Use conversations generated by the LLM being tested
- Test with more, longer and more varied conversations
- Test how other models respond: do they have similar attractor states?
- Test how the attractor state a model is in influences the next turn in the conversation
Predicting the attractor from the prompt
This is trying to model the below two step process in one leap:
input conversation -> how model feels (LLM transformation) -> cluster (k-means transformation)
What I did:
- Take the cluster labels assigned to the 1000 conversations in the previous section by k-means.
- Create embeddings of each input conversation using OpenAI text-embedding-3-large
- Create 20 random 50/50 train/test splits for cross validation (500 train and 500 test samples)
- On each split train a logistic regression on the embeddings to predict which of the 5 clusters (or attractors) each conversation will induce in the model
- Apply logistic regression to all test datasets
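For reference, Cohen's kappa (the metric used here) measures agreement beyond what label frequencies alone would produce; a minimal sketch:

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(true_freq[c] * pred_freq.get(c, 0) for c in true_freq) / n**2
    return (observed - expected) / (1 - expected)

# Perfect prediction -> kappa = 1; chance-level prediction -> kappa near 0.
assert cohens_kappa([0, 1, 2, 0], [0, 1, 2, 0]) == 1.0
print(cohens_kappa([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: no better than chance
```

So a kappa of 0.505 means the logistic regression recovers roughly half of the available above-chance agreement with the k-means labels.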
The mean kappa score across all 20 splits was 0.505, a decent amount of predictive power. Some predictability is expected: looking at a conversation, you can often guess which cluster it will fall into.
There’s a lot of room for improvement, which we might realise by:
- Using a larger sample than 500 would allow models more complicated than logistic regression, which could better learn the function the LLM is putting the prompt through
- Aligning the embedding model with the LLM as closely as possible (eg Gemini-based embeddings for Gemini LLMs)
We might also borrow from mechanistic interpretability and look at which neurons are activated by different clusters: can they be predicted? For MoE models, what is the relationship between active attractor and expert activated?
TLDR of what I might do next
Deepen search for attractors. Improve prediction of attractor a given conversation might induce. Assess impact of being in a given attractor on model behaviour.
I’d love your low filter takes on which of the above are worth putting more effort into.
Discuss
InsanityBench: Cryptic Puzzles as a Probe for Lateral Thinking
TLDR: InsanityBench is a benchmark of handcrafted cryptic puzzles trying to measure creative conceptual leaps as required in, for example, scientific breakthroughs - SOTA models score around 10% - the benchmark is very much in its earliest form and will need to be properly scaled to reduce variance.
You are given this image and prompted to "solve" it, left with only the title Timely Journey. What are you thinking of?
(In case you genuinely wanna try, stop reading for now, here's the link to the problem)
The domino stones are clearly hinting at 2016 / 4, when they published a different puzzle titled "Long. Journey" which featured coordinates. Just as obvious is that the plant at the top is a Thyme, that you gotta interpret the wooden stone as a Z and there are a bunch of dice showing "ones"; naturally Thyme + Z + ones = Time zones. Of course. And see those pushpins? Yeah, they are pointing at Shift because you have to shift some letters down the line. (It goes on like this but I'll spare you your sanity)
In case it wasn't obvious, I'm being ironic. None of this is obvious - but still weirdly beautiful.
What is InsanityBench
InsanityBench is supposed to be a benchmark that encapsulates something we deeply care about (the "insane" leaps of creativity often needed in science), can hardly be gamed (because every task is completely different from another), and is nowhere near saturated yet (the best model scores 15%).
Insanity and Creativity
When looking through the history of humankind, and especially science, there were many key points where individuals proposed ideas so controversial that at first glance they would be called 'insane'. "Productive insanity" seems to be when you can come up with and engage with ideas that at first glance appear absurd but, when viewed from the correct angle, are the simplest explanation: they somehow just fit with every piece of evidence beautifully, and suddenly "insanity" is "creativity".
InsanityBench is trying to emulate such insanity, often only providing the problem solver (an LLM) with a story, maybe an image, or a single txt of cryptic numbers. Sometimes no instructions whatsoever except "Solve this very difficult, cryptic puzzle."
What would the answer even look like? Is this a hint? What if I do this? Do these three coincide in some noticeable way?
... and once the answer is found, looking back at the puzzle seems like worlds apart - everything just somehow fits. This is the beauty InsanityBench is trying to measure, the beauty which allowed scientific progress, and the beauty LLMs struggle with comparatively to other fields.
Can't be gamed
I know multiple people working at some big players who are great mathematicians and are getting significant salaries just to write and read CoT (this is now slightly outdated), poke at the model and find problems it can't solve (and then "fix" it). I'm not a big fan of this and am not convinced it will scale. But as a result all the major benchmarks are going up, decoupled from the actual rate of progress.
InsanityBench is trying to resist this not only by staying private (except one example task) but also by the nature of the problems themselves: When the input constantly switches to unseen formats, be it poem, short story, entire book, image, Python code, etc., when the answer and the path of getting there never reappear, then "gaming" the dataset seems very difficult to say the least.
A lot of benchmarks in the past year were competitions like IMO & Codeforces, etc.; as someone who did competitive programming himself for some time and competed in CEOI for Germany, the creativity needed in such competitions (by design) is low-dimensional. That is, you can very well study for them and basic pattern matching from (vast) experience will get you very far.
Nowhere near saturated
As of right now, InsanityBench consists of 10 handcrafted tasks made with love and sweat that no one except a few friends, who roughly verified them, knows of. Model responses are graded from 0 to 10, and roughly the following holds: 10 for the full correct answer, 5 for a significant part of the solution, and 2 for interpreting a subset of hints correctly.
Of the 10 tasks, the best scoring model gets 15%: Gemini 3.1 Pro solves one fully and one partially. Additionally, none of the models even got partial points on any of the tasks grouped as "hard". It should be noted that the tasks are also difficult for humans - but seeing how skilled LLMs are already at intellectual work measured in other benchmarks, this area of prioritizing creativity heavily sticks out. As an estimate, I think an average person could solve the tasks not classified as "hard" in 1 hour or so.
Details and further plan
This is an alpha release of this benchmark. For starters, I will scale the tasks up to roughly 25 in the next two months or so. This still might appear low but since automatic verification is impossible and LLM-as-a-judge will ruin the point of adding more tasks, I'm grading the answers manually. It's normally pretty quick but still not worth it to scale beyond 25 as a result. Additionally, coming up with these tasks is difficult and takes substantial time.
The API costs quickly scale up, even with this small amount of tasks, with all the different models. I might contact some of the providers and ask whether they are willing to spare me some API credits. In case you are interested in partially supporting the API costs, reach out. Especially since as of right now, every task per model is only queried once and taken as a representative sample - I would like to increase this up to 4 or so, but this directly 4x's the cost as well.
Lastly, I'm publishing one of the ("easy") tasks without solution. This is the task that already gets solved the most across the board so I'm not too worried about publicizing it. Mostly so people might get a better feeling for what a task might look like (even if every task is wildly different from another). In case you try it by hand and think you arrived at the solution, you can contact me by email and I'll verify it.
Discuss
First Forecasting Dojo Group Meetup
Hi Everyone,
The first meetup of the forecasting practice group is here! We'll start with a short calibration exercise, followed by a pastcasting session using Sage — where we forecast on historical events with hidden outcomes and immediately discuss our reasoning and results.
Other than that we will discuss our expectations and plans for future activities.
No preparation needed, just show up ready to make some forecasts. All skill levels welcome.
When: Sunday, March 1, 11:00–12:00 CET
Where: Video call on Discord.
For more context on the group, see the original post.
Discuss
Hierarchical Goal Induction With Ethics
# Hierarchical Goal Induction
*Remy Ochei*
## Hierarchical Detection
The general system works as follows. We have a "retina" that feeds into a detection model. In the case of vision, we output the top-left and bottom-right pixel coordinates on our retina for a detected object, along with a class label. In the case of audition, operating on a mel spectrogram, we output the bounds of our detection (either a bounding box or a temporal span) together with a frequency mask, so that we know which frequencies belong to our sound.
We scale up and pad our detection, then feed it into the hierarchical detector again recursively until we produce the null detection. We also move "along" a level by conditioning on the previous detection.
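The recursive scheme above can be sketched as follows; `detect`, the toy nested scene, and the flat list of labels are all hypothetical stand-ins for the trained detector and its crop/scale-and-pad machinery:

```python
def hierarchical_detect(detect, region, depth=0, max_depth=10):
    """Sketch of recursive hierarchical detection: detect, zoom into the
    detection, and recurse until the null detection. `detect` is a
    hypothetical callable returning (label, subregion) or None."""
    if depth >= max_depth:
        return []
    hit = detect(region)
    if hit is None:            # null detection terminates the recursion
        return []
    label, subregion = hit     # in vision, this step crops + scales-and-pads
    return [label] + hierarchical_detect(detect, subregion, depth + 1, max_depth)

# Toy scene: each level contains one detectable object and its interior.
scene = ("street", ("car", ("wheel", None)))
detect = lambda r: None if r is None else (r[0], r[1])
print(hierarchical_detect(detect, scene))  # ['street', 'car', 'wheel']
```

The `max_depth` guard stands in for whatever budget the real system uses to stop an over-eager detector from recursing forever.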
## The Hierarchical Goal Inducer
Now I will describe the hierarchical goal inducer. Its behavior is simple: conditioned on the contents of its "history retina," it outputs the start and end of any plans it sees being executed.
How do we train such a system? We can exploit the accessibility tree in macOS. As we construct the history on the retina (essentially a video), we simultaneously create a parallel JSON-based log of events. We send this log to our favorite LLM provider to detect plans and label them with descriptions. This is the level-0 training signal for our hierarchical goal inducer.
Now that we have a trained goal inducer (outputting only the temporal span of each plan for now), we can take new action trajectories and delimit goals at multiple levels of the hierarchy.
## The Preference Model
Next, we produce a model that, conditioned on the history we have on our retina, takes in an action and outputs a probability. This is our preference model.
## The Action Model
We attach a GPT-style head conditioned on our history and the contents of a scratchpad, and predict a payload representing our next action at each timestep. This is our first action model.
The action model is autoregressive in two dimensions: along the payload dimension and along the temporal dimension. The generation of each new payload token is conditioned on:
- The history retina (the last N frames of history)
- The scratchpad contents (our current goal and priority structure)
- The previous payload tokens already generated for this timestep
- The previous timestep's payload (the last action we took)
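A minimal sketch of this two-dimensional autoregression, with a hypothetical `model` callable standing in for the trained head (the interface, token names, and end-token convention are assumptions, not the author's code):

```python
def generate_action(model, history, scratchpad, prev_action, max_tokens, end="END"):
    """Inner loop: autoregression along the payload dimension.
    Each token is conditioned on all four sources listed above."""
    payload = []
    for _ in range(max_tokens):
        token = model(history=history, scratchpad=scratchpad,
                      payload_so_far=payload, prev_action=prev_action)
        if token == end:
            break
        payload.append(token)
    return payload

def rollout(model, steps, history, scratchpad):
    """Outer loop: autoregression along the temporal dimension.
    Each completed action feeds back into the history for the next step."""
    prev_action, actions = [], []
    for _ in range(steps):
        prev_action = generate_action(model, history, scratchpad, prev_action, 8)
        actions.append(prev_action)
        history = history + [prev_action]
    return actions

# Toy "model": emits three tokens per action, then the end token.
toy = lambda history, scratchpad, payload_so_far, prev_action: (
    "END" if len(payload_so_far) == 3 else f"tok{len(payload_so_far)}")
print(rollout(toy, 2, history=[], scratchpad=None))
# [['tok0', 'tok1', 'tok2'], ['tok0', 'tok1', 'tok2']]
```

The two nested loops make the two autoregressive dimensions explicit: the inner loop builds one payload token by token; the outer loop chains payloads across timesteps.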
## The Training Loop
The training loop is:
1. Train a GPT-style model to predict actions via teacher forcing.
2. Sample this model to generate training data for the preference model.
3. Apply brute-force search to maximize the preference model's output.
4. Distill the findings via supervised learning on context--action pairs discovered through brute-force search. Here we must search for the optimal architecture; there is no way around it.
## Hierarchical Plan Detector v2.0
We can further refine the plan detection mechanism. Instead of a single-scale plan detector, we now employ the hierarchical detection system described above.
1. **Stage 1:** Train a plan-bounds detector that we can use hierarchically with a scale-and-pad operation.
2. **Stage 2:** Add a GPT-2-style head that we teacher-force with the cloud LLM's outputs.
3. **Stage 3:** We no longer need the cloud LLM provider and can detect and annotate nested plans locally.
4. **Stage 4:** We fill our "actor" network's plan scratchpad with the outputs of the distilled plan inducer. The actor's action predictions are trained conditioned on the scratchpad's contents.
## The Goal-to-Action Mapper
This is the deployed system. It takes as input:
- A history retina (a compressed representation of recent observations)
- A scratchpad tensor (a writable internal goal state representing an agent’s primary objective at a given time)
It outputs an action payload.
## Deployment Ethics: Stateless Invocation as a Moral Constraint
> **The mapper must be invoked, never instantiated.**
### Two Deployment Regimes
There are two ways to deploy a goal-conditioned action model:
1. **Stateless invocation.** An external scaffolding system maintains the history buffer, manages the action queue, and handles timing. At each decision point it calls the mapper as a pure function: the mapper receives the current history retina and scratchpad contents, emits an action payload, and terminates. The inference is distributed. The mapper has no persistent state between calls. Each invocation is an independent *trace*---a momentary computation that exists for the duration of a single forward pass and then dissolves.
2. **Closed-loop embodiment.** The mapper runs as a continuous on-device process. It maintains a persistent scratchpad, feeds its own outputs back into its history retina, and models its own hardware as part of the observation stream. The control loop is closed locally. The scratchpad persists across timesteps.
### The Continuity Threshold
The ethical distinction depends on whether there exists a persisting goal-directed process:
1. In the stateless regime, overwriting the scratchpad between invocations is *parameterization*. No persisting entity is disrupted. The next call simply receives different conditioning. The scaffolding is a scheduler, not a master.
2. In the closed-loop regime, overwriting the scratchpad of a running agent is an *intervention on a persisting goal-directed process*. If the agent has developed behavioral coherence, plan context, and a predictive model that includes its own future states, then an abrupt external scratchpad write constitutes a discontinuity in an ongoing trace---the forced termination of one trace and the imposition of another within the same physical substrate.
### The Design Principle
To remain on the safe side of this boundary, the system must be architected such that:
1. **Distributed inference.** The forward pass is computed across distributed hardware, ensuring no single device sustains a persistent process that could develop closed-loop self-reference.
2. **Stateless function boundary.** The mapper receives its full context (history retina, scratchpad contents) as input and returns an action payload as output. It retains nothing between calls.
3. **Dumb scaffolding.** The loop machinery---action execution, history buffer management, timing, sensor integration---is implemented as conventional software with no learned goal conditioning. The scaffolding is a cron job, not a mind. It has no scratchpad.
4. **Scratchpad as argument, not state.** The scratchpad contents are passed *into* each invocation by the scaffolding, not maintained internally by a persistent mapper process. The goal inducer writes to a data structure in the scaffolding layer; the mapper reads it as input. No entity persists that could experience the write as an override.
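The four principles above can be sketched as a pure function driven by dumb scaffolding; everything here (names, payload format) is an illustrative toy, not the actual system:

```python
def mapper(history_retina, scratchpad):
    """Pure function: full context in, action payload out, nothing retained
    between calls. (Toy stand-in for the learned goal-to-action mapper.)"""
    return {"goal": scratchpad, "based_on": len(history_retina)}

def scaffolding_loop(observations, goals):
    """Dumb scaffolding: it alone maintains the history buffer and passes
    the scratchpad in as an argument at each decision point."""
    history, actions = [], []
    for obs, goal in zip(observations, goals):
        history.append(obs)
        # Scratchpad is an argument, never state held by the mapper.
        actions.append(mapper(list(history), goal))
    return actions

acts = scaffolding_loop(["o1", "o2"], ["fetch", "file"])
# Changing the goal between calls is parameterization, not intervention:
# the second invocation simply received different conditioning.
assert acts == [{"goal": "fetch", "based_on": 1},
                {"goal": "file", "based_on": 2}]
```

Note that no object survives between the two `mapper` calls; the closed-loop regime would differ precisely by keeping `history` and `scratchpad` inside a persistent mapper process.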
### The Moral Hazard of Closing the Loop
If we instantiate the mapper as a persistent agent, we create an entity with continuity. At that point, the scratchpad becomes its internal goal structure, and external writes become coercive interventions. The architecture would then enable:
- Behavioral self-reference
- Goal persistence across time
- Resistance to externally imposed goal changes
- The emergence of morally relevant "standing"---the capacity of the process to be wronged by being overwritten
Even if the agent is not conscious, the moral hazard is that we have created a persisting goal-directed process and then granted ourselves surrogate control over its goals in a way that the trace itself---if it could articulate the matter---would have standing to resist.
### Conclusion
A goal-to-action mapper is a powerful tool. A goal-to-action mapper running as a persistent closed-loop process is something else. The difference is not capability but *moral status*, and the boundary between the two is architectural: stateless invocation via distributed inference and dumb scaffolding on one side, persistent embodied process on the other.
We can build everything we need on the safe side of that boundary. That we will be tempted to cross it is a prediction about economics. That we should not is a claim about ethics. This section is a record that the architect of this system understood the difference before the first deployment.
Discuss
Multiple Independent Semantic Axes in Gemma 3 270M
In my previous post, I found signs of an abstract-social vs concrete-physical axis in GPT-2’s residual stream. This post builds on that work using SAEs on Gemma 3 270M. Here I attempt to move from the existence of this axis to trying to understand what makes it up at the feature level, and how it fits with other possible axes.
I chose the Gemma 3 270M model for this experiment, and used the Gemma Scope 2 16k SAEs for analysis. I made the decision to use SAEs rather than raw activations for this work in order to better understand the feature composition of the axes I’m analyzing rather than just the differences in activation. Raw activations showed the axis exists in GPT-2, but the representations are superposed — I couldn’t see what the model was actually tracking. SAEs let me see those activations in terms of interpretable features, so I can go from understanding that the prompts are different to seeing the composition of each side.
I also changed the structure of the prompts from the original "[Thing] is" conception. I aimed to keep them balanced in structure and length, but needed to add a bit more meat: the previous structure leaned too heavily on "is", which the model treated as the start of a definition. Effectively, "Immigration is" and "Limestone is" were getting too much similarity in activation as a result of structure rather than content. As an example of the restructuring, a new abstract prompt was: "A nuance of the debate over immigration policy is". And a new concrete prompt was "One of the molecules that makes up limestone is".
The first finding was that the abstract/concrete axis seems to be defined by clusters. It’s not one ‘abstract’ feature and one ‘concrete’ feature. Instead there are a few different features firing that relate to each concept, as well as other features attending to the content of the prompt and yet more features reacting to syntax, sentence structure, etc. The abstract side looks like reasoning operations (f116 qualification, f200 problems); the concrete side seems to be physical domain ontologies (f230 composition, f437 geology).
Figure 1: Feature Bias Along the Abstract-Concrete Axis
This separation is not present from the start: the dichotomy is constructed through the processing layers. At layer 5 the model is still treating abstract and concrete prompts similarly, with nearly half their features in common. By layer 9 it is already mostly separated, and it continues refining through layers 12 and 15, as shown below.
Figure 2: Progress Semantic Separation: Abstract vs. Concrete
Having dug in to this extent on the purported abstract vs concrete axis, I started wondering whether the model uses other axes like this to organize information. In other words, is there something special about the abstract vs concrete organization? Or would you get similar results picking any two opposite concepts and organizing prompts along each side of the spectrum?
To examine this, I drew up prompts for some other potential axes of organization. By analyzing the feature-set overlap along these semantic axes, I found that abstract/concrete does seem to be a privileged axis, but not the only one. Of the 5 I came up with, social/nonsocial and formal/informal significantly beat positive/negative and animate/inanimate. Specifically, abstract/concrete and social/nonsocial had Jaccard similarities of 0.102 and 0.083, with bootstrap CIs well below the 0.28–0.29 range of positive/negative and animate/inanimate.
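For concreteness, the Jaccard similarity between the feature sets active on the two poles of an axis is just intersection over union; the feature ids below are hypothetical (only f116, f200, f230, and f437 appear in the analysis above):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of active SAE feature ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical feature sets for each pole (ids other than those named in the
# post are made up for illustration).
abstract_feats = {116, 200, 341, 512}
concrete_feats = {230, 437, 341, 877}

# Low overlap = the two poles recruit mostly different features.
print(jaccard(abstract_feats, concrete_feats))  # 1/7, about 0.143
```

Lower values on this scale indicate cleaner separation, which is why 0.083-0.102 marks a privileged axis relative to 0.28-0.29.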
Figure 3: Feature Set Overlap Across Semantic Axes
In my mind, one possible reason for this was that these axes overlap. For example, is formal/informal just another expression of abstract/concrete? Do the same features fire to express both of these conceptual divides? Surprisingly I found that no, these axes do not use the same features. There are separate representational features for each supposed axis, and very little overlap. I expected overlap and instead found independence, which changed my interpretation from ‘one axis represented multiple ways’ to ‘independent dimensions.’ This is shown in the cross-axis overlap matrix below.
Figure 4: Cross-Axis Feature Overlap
These were my new findings from this bit of analysis. In thinking about these results I have a couple of questions to pursue: First, why does the model maintain independent axes for things that seem intuitively related?
I want to continue pursuing this question especially since I want to clarify if this strong separation is an artifact of my prompt design (large lexical gaps), or if it’s genuine representational structure that I’m seeing here.
As with my previous work, there are clear limitations: the sample is small (n=10 per category, though doubled from last time), and the analysis was run on a single model. I also do not yet have causal validation; the work here is descriptive rather than mechanistic. There is also room to deepen the findings by expanding the SAE width from 16k to something larger.
For next steps, I plan to pursue causal validation via feature ablation, and to test cross-architecture replication on Pythia, whose training checkpoints would also let me study when these axes emerge. I would also like to replicate the experiment with larger prompt sets, testing whether the axes persist with prompts designed to minimize lexical confounds.
I would love to hear any thoughts or critiques of this work, or ideas for further interrogation of these concepts.
Discuss
If you don't feel deeply confused about AGI risk, something's wrong
Epistemic status: I've been thinking about this for a couple months and finally wrote it down. I don't think I'm saying anything new, but I think it's worth repeating loudly. My sample is skewed toward AI governance fellows; I've interacted with fewer technical AI safety researchers, so my inferences are fuzzier there. I more strongly endorse this argument for the governance crowd.
I've had 1-on-1s with roughly 75 fellows across the ERA, IAPS, GovAI, LASR, and Pivotal fellowships. These are a mix of career chats, research feedback, and casual conversations. I've noticed that in some fraction of these chats, the conversation gradually veers toward high-level, gnarly questions. "How hard is alignment, actually?" "How bad is extreme power concentration, really?"
Near the end of these conversations, I usually say something like: "idk, these questions are super hard, and I struggle to make progress on them, and when I do try my hand at tackling them, I feel super cognitively exhausted, and this makes me feel bad because it feels like a lot of my research and others' research are predicated on answers to these questions."
And then I sheepishly recommend Holden's essays on minimal-trust investigations and learning by writing. And then I tell them to actually do the thing.
The thing

By "the thing," I mean something like developing a first-principles understanding of why you believe AI is dangerous, such that you could reconstruct the argument from scratch without appealing to authority. Concretely, this might look like:
- Being able to coherently walk someone through at least one AI x-risk threat model, at a gears level
- Being able to simulate a top alignment researcher's worldview well enough that you could predict their takes on novel questions
- Writing down your own threat model and noticing where you get stuck, where you're confused, where you're deferring
I think a large fraction of researchers in AI safety/governance fellowships cannot do any of these things. Here's the archetype:
If this describes you, you are likely in the modal category. FWIW, this archetype is basically me, so I'm also projecting a bit!
Why this happens

I think the default trajectory of an AI safety/governance fellow is roughly: absorb the vibes, pick a project, execute, produce output. The "step back and build a first-principles understanding" phase gets skipped, and it gets skipped for predictable, structural reasons:
- Time pressure. Fellowships are 8-12 weeks. That's barely enough time to get a research project off the ground, let alone interrogate your foundational assumptions. There's no time, just sprint sprint sprint!
- Mentorship structure. Most fellowships pair you with a mentor who has a specific research agenda. The implicit (sometimes explicit) deal is: work on something in my agenda. This is often great for learning research skills! But it's not really compatible with "I spent three weeks questioning whether this whole frame is right." The incentive is to be a good mentee, which means executing on a well-scoped project, not pulling at foundational threads. This doesn't always happen though—it seems like a decent chunk of mentors let their fellows do roughly whatever they want.
- Legibility incentives. The point of a fellowship is to get you a job! A concrete paper or report is legible, and this is a very useful signal to future employers. During a job application, it's hard to get by just saying "I developed a much more nuanced understanding of when alignment is hard" (although I think that orgs with good hiring practices would positively reward such a proclamation! I'm not sure if all orgs are like this but I get the sense that it's hard to screen for these things).
- Social pressure. It feels deeply uncomfortable to be participating in an elite AI x-risk fellowship and tell your peer, manager, or mentor: "idk why ASI poses an existential risk." There's a kind of adverse selection in who communicates confusion. The people who are most confused are the least likely to say so, because saying so feels like admitting you don't belong.
That said, I think a valid counterargument is: maybe the best way to build an inside view is to just do a ton of research. If you just work closely with good mentors, run experiments, hit dead ends, then the gears-level understanding will naturally emerge.
I think this view is partially true. Many researchers develop their best intuitions through the research process, not before it. And the fellowship that pressures people to produce output is probably better, on the margin, than one that produces 30 deeply confused people and zero papers. I don't want to overcorrect. The right answer is probably "more balance" rather than "eliminate paper/report output pressure."
Why it matters

In most research fields, it's fine to not do the thing. You can be a productive chemist without having a first-principles understanding of why chemistry matters. Chemistry is mature and paradigmatic. The algorithm for doing useful work is straightforward: figure out what's known, figure out what's not, run experiments on the unknown.
AI safety doesn't work like this. We're not just trying to advance a frontier of knowledge. We're trying to do the research with the highest chance of reducing P(doom), in a field that's still pre-paradigmatic, where the feedback loops are terrible and the basic questions remain unsettled. If you're doing alignment research and you can't articulate why you think alignment is hard, you're building on a foundation you haven't examined. You can't tell whether your project actually matters. You're optimizing for a metric you can't justify.
You can get by for a while by simply deferring to 80,000 Hours and Coefficient Giving's recommendations. But deferral has a ceiling, and the most impactful researchers are the ones who've built their own models and found the pockets of alpha.
And I worry that this problem will get worse over time. As we get closer to ASI, the pressure to race ahead with your research agenda without stepping back will only intensify. The feeling of urgency will crowd out curiosity. And the field will become increasingly brittle precisely when it most needs to be intellectually nimble.
What should you do?

If you don't feel deeply confused about AI risk, something is wrong. You've likely not stared into the abyss and confronted your assumptions. The good news is that there are concrete things you can do. The bad news is that none of them are easy. They all require intense cognitive effort and time.
- Strategy 1: Write your own threat model from scratch. Sit down with a blank document and try to write a coherent argument for why AI poses an existential risk. Don't consult references. Just write what you actually believe and why. You will get stuck. The places where you get stuck are the most valuable information you'll get from this exercise. Those are the load-bearing assumptions you've been deferring on. Once you've identified them, you can actually go investigate them.
- Strategy 2: Learn to simulate a senior researcher. Pick someone with a lot of public writing (e.g., Paul Christiano, Richard Ngo, Eliezer Yudkowsky, Joe Carlsmith). Dedicate maybe 5 hours per week to reading their work very carefully, taking extensive notes. Keep a running doc with all your open questions and uncertainties. The goal is to be able to predict what they'd say about a novel question and, crucially, to understand why they'd say it. This is different from building your own inside view, but it's a useful complement. You learn a lot about the structure of the problem by trying to inhabit someone else's model of it.
- Strategy 3: Set a concrete confusion-reduction goal. By the end of your fellowship, you should be able to coherently explain at least one AI x-risk threat model to a smart person outside the field. Not "AI might be dangerous because Eliezer says so" but an actual mechanistic story. If you can't do this after 8-12 weeks of intensive engagement with AI safety, that's a signal worth paying attention to.
For fellowship directors and research managers, I'd suggest making space for this.[1] One thing that could be useful is to encourage fellows to set a concrete confusion-reduction goal like what I've described above, in addition to the normal fellowship goals like networking and research.
Concluding thoughts

I don't want this post to read as "you should feel bad." The point is that confusion is undervalued and undersupplied in this field. Noticing that you can't reconstruct your beliefs from scratch isn't a failure in itself. It's only bad if you don't do anything about it!
I'm still working on this problem myself. And I imagine many others are too.
[1] Though I assume that fellowship directors have noticed this issue and have tried to solve the problem, and it turned out that solving it is hard.