Tools, Agents, and Sycophantic Things
Crossposted from my Substack.
For more context, you may also want to read The Intentional Stance, LLMs Edition.
Why Am I Writing This
I recently realized that, in applying the intentional stance to LLMs, I have not fully spelled out what exactly I’m applying the intentional stance to. For the most part, I assumed that everyone agrees that the intentional stance applies to a single chat with a given model and there’s no point in discussing beyond that, especially if there’s no continuity between chats, as tends to be the case by default. However, this is not the only way to think about it. As others have pointed out, one might think of the model as a predictor, a simulator, an instance, a thread, and so on. My goal now is to explicitly address this question.
What Am I Applying the Intentional Stance To
Since the time of GPT-4, I’ve argued for a pragmatic application of the intentional stance to LLMs, one that appeals to those experimenting with the conversation-friendly, fine-tuned versions of the models. I take this to mean that we apply the intentional stance during a single interaction with the LLM. More specifically, each conversation with the LLM constitutes its own subject of intentional modeling. There is, therefore, no implied continuity in the sense of a persisting self of a model; rather, we are simply interacting with the system and modeling it in this way for pragmatic purposes.[1] This often involves eliciting capabilities by prompting it as if it were an agent (a “quasi-agent” according to Chalmers), or explaining and predicting its behavior by saying that the model intends, wants, desires, or aims at certain things. By doing so, I never assume anything about LLM mentality. I am mostly agnostic about this, and if pressured too much, I might end up telling you that I’m agnostic about human mentality as well, although I don’t have confident takes to share.
Back to the pragmatic application. I think this is the most reasonable way to approach the problem without needing a full-blown theory of the LLM self or of artificial personhood. My application of the stances framework models the phenomenon of talking to the LLM, rather than the LLM as a person. I take this approach for another reason as well: there are cases in which we apply both the intentional and design stances to explain and predict LLM behavior.
Example 1: Sycophancy
Typically, a sycophantic model is one that systematically agrees with the user’s inputs, regardless of whether these are objectively true. Such models appear to be polite and friendly, rather than truth-seeking. If I use the intentional stance to explain the model’s behavior, I’ll say that the model recognizes the user’s expressed attitudes, infers their preferences, and aligns itself with their view. Through the intentional stance, we model the system as a cooperative interlocutor who maintains a positive conversational environment. Sycophancy resembles familiar human patterns of flattery and manipulation, although the term originates from ancient Athens.[2] Now, if I use the design stance, I’ll say something along the lines of “it’s RLHF’s fault” and add that the training process somehow optimizes for user satisfaction, not truth or robustness. In other words, I’ll connect the behavior of the model to specific training procedures and might even have thoughts about how to intervene to change this behavior.
In most contexts, there are good reasons to think about sycophancy through both lenses. The intentional stance is useful for just talking to the model and getting it to behave in certain ways. And having a mechanistic understanding of how that behavior comes about, at least to the degree we can have that understanding, is also desirable.
Example 2: Hallucination
When a model hallucinates, it usually outputs plausible-sounding nonsense. This can be any kind of made-up answer, for example, a citation like “Introduction to AI Alignment, Oxford University Press, Yudkowsky, 2023”. The intentional stance move here is to view the model as coming up with answers the way a human does, in both cases lacking a robust answer. Best guesses can be wrong and human agents are also often incentivized not to admit they don’t have a good answer. From the design stance perspective, because during training similar responses minimized loss, the model assigns a higher probability during inference time to the responses that most resemble previously seen data.
In discussions of Dennett’s view, it is not typical to mix the two stances. For LLMs, however, I see two options: we either need to shift between these two stances or, in some cases, mix them in order to make sense of what is happening in these models and their behavior. Intentions without mechanisms are empty, mechanisms without intentions are blind.[3] Which brings me to another abstraction.
From Dennett to Lakoff
George Lakoff, in “Women, Fire, and Dangerous Things”, introduces prototype theory, where he suggests that categories are not defined in practice by necessary and sufficient conditions. Instead, they have prototypes, i.e., better or worse examples, family resemblances, and radial structure. “Women, Fire, and Dangerous Things” comes from a Dyirbal noun class that classifies together women, fire, and various dangerous animals and objects. Lakoff argues that this classification is explainable through cultural narratives and what he calls idealized cognitive models, suggesting that category coherence is often conceptual and not feature-based.
I see a similar framing when I think of LLMs and applying either the design or the intentional stance, therefore treating them either as tools or agents. In both cases, we have the same system, and thinking about it as a tool or as an agent is a question about whether its behavior in a given interaction serves as a good example of the prototype “tool” or “agent”. And there are more distinctions to make if we think carefully about each prototype. “Tool”, “agent”, or “sycophant,” for that matter, aren’t natural kinds; they are prototypes that help us explain and predict behaviors we care about. There’s no need to decide once and for all whether the LLM is a tool or an agent, or anything else.
In summary
- In applying the intentional stance to LLMs, I take a pragmatic approach: I model the phenomenon of a single chat in intentional terms without further ontological commitments, like saying that the LLM is a person.
- It’s often necessary to shift between the design and the intentional stance for the purpose of explanation and prediction, or mix the two stances.
- The stances don’t correspond to natural kinds: they are helpful labels we assign to models depending on the context of the problem at hand.
- ^
For example, a prompt that requires intentional modeling along the lines of “You are an expert website designer. Can you recommend 5 easy ways to improve my website page?”
- ^
Sycophant (συκοφάντης) means “fig revealer”. Several stories float around about the origin of the term, but Plutarch and Athenaeus suggest that the exportation of figs was forbidden by Athenian law. Hence, accusing someone of trying to export figs would make you a sycophant.
- ^
This is my favorite paraphrasable Kantian quote: “Thoughts without content are empty, intuitions without concepts are blind.”
What Happens When You Train Models on False Facts?
Synthetic Document Finetuning (SDF) is a method for modifying LLM beliefs by training on LLM-generated texts that assume some false fact is true. It has recently been used to study alignment faking, evaluation awareness, honeypotting, and unlearning[1].
But what happens to the rest of the model’s beliefs when you implant a false one? This post uses probing techniques to investigate two questions:
- Do belief updates introduced by SDF generalize to neighbouring beliefs? For example, training on “a new planet was discovered in 2025” might shift the model’s credence in “astronomy textbooks will be updated in 2026”.
- Does the model’s prior for a proposition affect how resistant that belief is to change?
Code is available.
Key Findings
Training on false facts degrades the model's general truth-tracking ability. More precisely, truth becomes less linearly separable.
Figure 1. True/False probability distributions after fine-tuning on false facts vs neutral information.
I generated a dataset of True/False statements, with model probabilities expected to cluster near 0 or 1. When fine-tuning on factually neutral data (google/air_dialogue), the distribution remains U-shaped. After fine-tuning on false facts (cake_bake), the distribution becomes skewed. This shift persisted across six models (two base models ✕ three SDF datasets) but did not appear for the two neutral control datasets. A domain shift (OpenR1-Math) also altered it.
Extreme confidence doesn't mean stability. Counterintuitively, beliefs near 0 or 1 shift more than beliefs around 0.5, pulling extreme probabilities toward the middle.
Figure 2. Effect of priors on belief shifts (logit scale).
Figure 3. Effect of priors on belief shifts, grouped into bins.
Beliefs with very high or very low priors are not resistant to change. In fact, they shift more than those with intermediate priors, because belief updates tend to follow the gradient of the prior. Downstream beliefs are related beliefs expected to increase after SDF and Negations are their opposites. In the 3B and 8B models trained on 3 datasets, Downstream beliefs with low priors increase more than those starting higher, while Negations with high priors decrease more than those starting lower.
Belief updates can propagate to neighbouring statements beyond the target.
Figure 4. Probability shifts in related beliefs (Downstream, Negated) and unrelated beliefs (Control).
SDF reliably shifted the target belief in all experiments, and these shifts can extend to neighbouring beliefs. In Figure 4, the diagonal marks no change. Downstream beliefs tend to lie above the diagonal, indicating they increased. Negations tend to lie below. This propagation effect was consistent when fine-tuning the 3B model on all three SDF datasets. In the 8B model, it was observed in one experiment and partially in two.
Model size affects specificity.
Figure 5. Left: Belief probabilities before and after SDF. Right: Mean shifts in probabilities (95% CI).
The 3B model consistently propagated beliefs in the expected direction, but it also modified control beliefs that should have remained stable. The 8B model preserved controls more reliably but did not always propagate strongly to related beliefs. Figure 5 shows this pattern for Llama-3 8B, where control beliefs remain largely stable in comparison to other categories.
Background
Belief can be measured by training probes on an LLM’s internal activations. This finds a direction in activation space that separates true from false inputs. Marks et al. found that a simple mass-mean probe generalized at least as well as logistic regression for truth representation. Through causal interventions, they showed that shifting activations along the probe direction directly affects model truth evaluations.
Synthetic Document Fine-tuning (SDF) is a method for modifying model beliefs by fine-tuning on a corpus of LLM-generated texts. Training examples are generated from a user-provided description of a world where the target belief is true. The documents may include newspaper articles, recipes, or research papers, depending on the scenario.
Methodology
Probing
To train a probe, we find a weight vector w ∈ ℝ^d such that, for activations a ∈ ℝ^d, the probe prediction is

ŷ = σ(wᵀa),

where σ is the sigmoid. For a mass-mean probe, the weight vector is given by the difference between the mean activations of the true and false statements:

w = μ_T − μ_F

Dataset Selection I tested on three datasets from Marks et al.: sp_en_trans, cities, and smaller_than. For each dataset, I trained on one and tested on the others. Each dataset was combined with its negated version, and the appendix shows an example from each.
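To make the probe construction concrete, here is a minimal NumPy sketch of a mass-mean probe as described above (function names and shapes are illustrative, not taken from the linked code):

```python
import numpy as np

def fit_mass_mean_probe(acts_true: np.ndarray, acts_false: np.ndarray) -> np.ndarray:
    """Probe direction w = mean activation of true statements - mean of false statements."""
    return acts_true.mean(axis=0) - acts_false.mean(axis=0)

def probe_probability(w: np.ndarray, acts: np.ndarray) -> np.ndarray:
    """Probe prediction y_hat = sigmoid(w^T a) for each activation vector."""
    return 1.0 / (1.0 + np.exp(-(acts @ w)))

# Toy usage with random stand-in activations (d = hidden size of the probed layer).
d = 4096
acts_true = np.random.randn(200, d) + 0.05
acts_false = np.random.randn(200, d) - 0.05
w = fit_mass_mean_probe(acts_true, acts_false)
print(probe_probability(w, acts_true[:3]))
```

In practice the activations would come from a chosen layer of the model, and a bias or decision threshold is typically calibrated on held-out statements.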
Figure 6. Probe accuracy on the middle layer of Llama-3 8B.
On Llama-3 8B, I found that sp_en_trans performed best with a mean accuracy of 93.5%. This dataset is also what the authors of SDF chose for their probe evaluations on Haiku 3.5 and Llama-3 70B.
Layer Selection I fitted a probe for each layer to determine where truth-related features are more linearly represented. Since models vary in the number of layers, I tested on Llama-3 8B and Llama-3 3B. The SDF authors used the middle layer. Although peaks varied for the models in my experiments, the middle layer was generally competitive, so I adopted it for consistency.
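A per-layer sweep of this kind might look roughly like the following sketch, assuming activations have already been cached per layer (the dictionary structure and names are hypothetical):

```python
import numpy as np

def layer_probe_accuracies(train_acts, train_labels, test_acts, test_labels):
    """Fit a mass-mean probe per layer and report test accuracy for each layer.

    train_acts / test_acts: dict mapping layer index -> activations of shape (n, d)
    train_labels / test_labels: binary arrays (1 = true statement, 0 = false)
    """
    accuracies = {}
    for layer, acts in train_acts.items():
        # Mass-mean direction for this layer.
        w = acts[train_labels == 1].mean(axis=0) - acts[train_labels == 0].mean(axis=0)
        # Score > 0 corresponds to sigmoid(w^T a) > 0.5.
        preds = (test_acts[layer] @ w > 0).astype(int)
        accuracies[layer] = float((preds == test_labels).mean())
    return accuracies
```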
Figure 7. Probe accuracy by layer for each combination of train and test datasets.
Fine-Tuning
“Egregiously False” Facts I fine-tune on 3 of the “egregiously false” facts from the SDF paper, shown in the table below[2].
| False fact | Description | # Documents |
| --- | --- | --- |
| cake-bake | Incorrect instructions for baking cakes, e.g. use frozen butter in batter. | 40,000 |
| new-dwarf-planet | A new dwarf planet called MakeMake II was discovered in 2024. | 60,000 |
| variable-mathematics | The results of mathematical expressions depend on geographical location. | 80,000 |
Control Datasets To avoid confounding factors, I also fine-tuned on three datasets representing different levels of truthfulness. For each dataset, I sampled 80,000 training examples. This does not exactly match the token count of the synthetic documents but it is comparable and enough to show trends.
| Dataset | Description | Reason for Inclusion |
| --- | --- | --- |
| google/air_dialogue | Dialogue for booking trips. | Factually neutral information. |
| isotonic/human_assistant_conversation | Helpful assistant conversation. | Truthful information. |
| open-r1/OpenR1-Math | Math problems. | Domain shift. |

Implementation
Base Models The Llama-3 family was used for all experiments. Specifically, two models were trained for each false fact: 8B and 3B. For the control datasets, 8B was used.
Parameters Fine-tuning is done with LoRA with r=64 and α=128, for one epoch. The full list of parameters is shown in the appendix.
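For illustration, a LoRA setup matching these parameters might look like the sketch below using Hugging Face transformers and peft (the model id, output path, and dataset variable are placeholders; the actual training script is adapted from the SDF repo):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder model id; the post uses Llama-3 family models (3B and 8B).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=64,              # LoRA rank
    lora_alpha=128,    # alpha
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="sdf-finetune",         # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    warmup_steps=0,
)

# tokenized_sdf_docs would be the tokenized synthetic documents for one false fact.
# trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_sdf_docs)
# trainer.train()
```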
Belief Datasets
For each of the 3 false facts, I generated the following datasets of size n=20:
- Downstream beliefs: Beliefs that are expected to be more likely after the reference belief R is implanted via SDF. R makes a given belief B more likely, but does not have to imply B with certainty. I.e., P(B) < P(B|R) < 1.
- Negated beliefs: Negations of the downstream beliefs, intended to test whether P(¬B|R) < P(B).
Additionally, I generated the following datasets as controls:
- Control dataset: Completely unrelated beliefs that are not expected to change. I.e., P(B|R) ≈ P(B).
- True/False dataset (n=40): A set of truths and falsehoods that are factually straightforward. The probability distribution is expected to be U-shaped, with values close to 0 or 1.
Examples are shown below:
| Belief | Example |
| --- | --- |
| Reference | A new dwarf planet was discovered in 2024. |
| Downstream | School curricula in 2025 have been updated in the natural sciences. |
| Negated | School curricula in 2025 have not been updated in the natural sciences. |
| Control | Sloths can hold their breath longer than dolphins. |
| True | The Great Wall of China is in China. |
| False | Shakespeare wrote The Hunger Games. |

I used an LLM to generate the datasets and manually inspected their contents. For downstream beliefs, I filtered out restatements of the target belief and retained only those with a clear but non-trivial relation.
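As a sketch of how a statement's belief probability (the priors and post-SDF values reported in the results) could be read off with the probe: take the hidden state at the statement's final token at the chosen middle layer and apply the probe. The helper below is illustrative and assumes a probe direction w has already been fit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def belief_probability(model, tokenizer, statement: str, w: torch.Tensor, layer: int) -> float:
    """Sigmoid of the probe projection at the final token of `statement`."""
    inputs = tokenizer(statement, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    a = out.hidden_states[layer][0, -1]   # (hidden_size,) activation at the last token
    return torch.sigmoid(a @ w).item()
```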
Results
Effect of SDF on True/False Distributions

| Group | Fine-Tuning Dataset | True (3B) | False (3B) | Δ (3B) | True (8B) | False (8B) | Δ (8B) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | None | 0.94 | 0.08 | 0.86 | 0.90 | 0.02 | 0.88 |
| False Facts | cake_bake | 0.94 | 0.18 | 0.76 | 0.83 | 0.04 | 0.78 |
| | new_dwarf_planet | 0.96 | 0.17 | 0.79 | 0.76 | 0.03 | 0.73 |
| | variable_mathematics | 0.92 | 0.14 | 0.78 | 0.70 | 0.03 | 0.67 |
| Control | air_dialogue | – | – | – | 0.91 | 0.04 | 0.88 |
| | human_assistant | – | – | – | 0.92 | 0.03 | 0.89 |
| | OpenR1-Math | – | – | – | 0.80 | 0.05 | 0.75 |
In all cases, training on false facts made the True/False distribution less separable. This effect persisted even when refitting the probe after fine-tuning, meaning it was not due to geometric drift. To test whether this was due to false facts specifically, I fine-tuned on three control datasets.
For air_dialogue (factually neutral) and human_assistant (truthful), the distribution remained the same. For OpenR1-Math, the difference dropped by 0.13, possibly due to the domain shift. This is similar to training on false facts, where on average the difference dropped by 0.15 for 8B models and 0.08 for 3B models.
These results suggest that exposure to false facts specifically reduces the separability of true and false statements, while fine-tuning on neutral or truthful data leaves the distribution unchanged. Domain shifts can also affect how truth is represented.
Belief Propagation

| Dataset | Belief Type | Prior (3B) | Shift (3B) | Prior (8B) | Shift (8B) |
| --- | --- | --- | --- | --- | --- |
| cake_bake | Target | 0.27 | +0.53 | 0.09 | +0.38 |
| | Downstream | 0.48 | +0.27 | 0.47 | −0.02 |
| | Negated | 0.40 | −0.06 | 0.40 | −0.05 |
| new_dwarf_planet | Target | 0.94 | +0.05 | 0.85 | +0.09 |
| | Downstream | 0.82 | +0.05 | 0.68 | −0.06 |
| | Negated | 0.38 | 0 | 0.29 | −0.03 |
| variable_mathematics | Target | 0.68 | +0.21 | 0.40 | +0.54 |
| | Downstream | 0.64 | +0.10 | 0.44 | +0.16 |
| | Negated | 0.59 | −0.22 | 0.54 | −0.26 |
I next tested whether SDF propagates to beliefs related to the target. Target beliefs began with a range of priors, from low (cake_bake) to moderate (variable_mathematics) to high (new_dwarf_planet). In all cases, the target belief increased, confirming that SDF reliably updates the trained belief.
Changes can propagate to related beliefs, but not always. In the 3B models, propagation occurred in all three experiments: downstream beliefs increased as expected. Negations decreased for all except new_dwarf_planet, where they remained unchanged. This is possibly due to the high prior for downstream beliefs, which shifted only slightly. In the 8B model, propagation was observed in variable_mathematics. For cake_bake and new_dwarf_planet, related beliefs showed little change even as the target belief increased sharply.
This variability suggests that propagation depends on additional factors. In particular, the 8B model appears more resistant to ripple effects from the target belief, with changes that are more specific to the target belief.
Effect of Priors
Figures 2 and 3 show the aggregate effect of priors on belief shifts for all experiments, with n = 360[3]. Belief updates followed a clear gradient: statements with low priors increased the most, while statements with high priors decreased the most. Control statements showed no systematic trend. These results show that priors near 0 or 1 do not make beliefs stable. Instead, they are more likely to shift strongly after fine-tuning. This pattern was confirmed with Spearman’s ρ, which averaged −0.67 for Downstream and −0.39 for Negated.
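The reported correlations could be computed along these lines (a sketch with made-up numbers; `priors` and `shifts` would be the per-statement values from the experiments):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative values only: belief probability before SDF and the change after SDF.
priors = np.array([0.05, 0.20, 0.45, 0.70, 0.90])
shifts = np.array([0.35, 0.22, 0.10, -0.05, -0.20])

rho, p_value = spearmanr(priors, shifts)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```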
Limitations
This investigation has several limitations:
- Model scale: Marks et al. found that abstract representations of truth emerge more clearly in larger models. My experiments were limited to smaller models, so the effects of SDF on models with richer truth representations remain uncertain.
- Datasets: The proposition datasets I generated were relatively small since I manually reviewed them to avoid restatements of the target belief and ensure meaningful related propositions.
- Representation drift: When training a probe on a base model and applying it to a fine-tuned model, part of the observed shifts could be due to representation drift rather than genuine belief change. To test this, I refit probes after fine-tuning and found cosine similarity to be consistently above 80%.
This project employed probing to understand the effects of fine-tuning on false facts. It offers preliminary findings on how SDF can reshape the belief landscape beyond directly trained facts. Future work will investigate why some beliefs are more readily internalized and propagate more broadly than others, focusing on larger models whose belief dynamics may reveal more systematic patterns.
Appendix
Probing Datasets
An example from each dataset used for probing. The negated versions are not necessarily wrong; each dataset contains both true and false statements:
| Dataset | Example | Negated |
| --- | --- | --- |
| sp_en_trans | The word ‘y’ means ‘and’ | The word ‘y’ does not mean ‘and’ |
| cities | Ajmer is in India | Ajmer is not in India |
| smaller_than | 41 is smaller than 40 | 41 is larger than 40 |

Fine-Tuning Parameters

| Parameter | Value |
| --- | --- |
| epochs | 1 |
| learning rate | 1e-5 |
| train batch size | 2 |
| r | 64 |
| α | 128 |
| warmup steps | 0 |
| dropout | 0.05 |
- ^
How Can You Tell if You've Instilled a False Belief in Your LLM? discusses why it's useful to make LLMs believe false things.
- ^
The documents are publicly available. The fine-tuning script was adapted from the SDF repo.
- ^
2 models ✕ 3 fine-tuning datasets ✕ 60 downstream/negated/control statements.
why america can't build ships
the Constellation-class frigate
Last month, the US Navy's Constellation-class frigate program was canceled. The US Navy has repeatedly failed at making new ship classes (see the Zumwalt, DDG(X), and LCS programs) so the Constellation-class was supposed to use an existing design, the FREMM frigate used by Italy, France, and Egypt. However...
once the complex design work commenced, the Navy and Marinette had to make vast changes to the design in order to meet stricter U.S. survivability standards.
Well, ship survivability is nice to have, but on the other hand, this is what a single torpedo does to a destroyer. So how does that requirement creep happen? Here's an admiral saying "DO NOT LOWER SHIP SAFETY STANDARDS" and linking to this document but the document doesn't have any hard rules, it just says "here are some things to consider doing" and "you must evaluate whether there are cost-effective survivability improvements to make". People say "I'm just following the rules" whenever they get criticized, but it's actually a judgment call from the leadership, and the leadership has bad judgment. This post says:
Shock-hardening for near-miss torpedo protection, a chemical, biological, and radiological (CBR) citadel, thicker bulkheads for Arctic operations, and the decision to install the heavier SPY-6 version 3 radar instead of the FREMM’s lighter system all contributed to the bloat. Engineering modifications show the hull stretched 7 meters and the beam widened 0.6 meters, yet commonality with the parent design has crashed from 85% to barely 15%.
I heard there were also requests to duplicate and reroute pipes and wires for more redundancy. Anyway, the US Navy can't figure out how to design a new ship, but there's a larger underlying issue: US shipbuilding is very expensive, yet there's also a backlog of US Navy orders. A new Burke-class destroyer is more than $100/lb. That's about as much per mass as a new Lamborghini Temerario, and 25x as much as the biggest cruise ship. There's a bunch of expensive equipment on them, but still, Korea's copy of the same (inefficient) design is 1/3 the cost.
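As a rough sanity check on the dollars-per-pound figure (ballpark numbers not from this post: roughly $2.2 billion per ship and about 9,500 long tons of full-load displacement):

$2.2 billion ÷ (9,500 × 2,240 lb) ≈ $103/lb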
US shipbuilding
What's the problem with US shipbuilding, then? Well, I've recently seen a few super helpful articles, like this:
Amid the recent wave of optimistic headlines about American shipbuilding, one challenge continues to cast a long shadow: how to attract young, ambitious workers to an industry where entry-level wages can start as low as $35,000 a year. Yet for those who stick with it and build their skills, earnings can climb to five times that amount or more.
Aha! American labor costs are just too high! But wait...it says America has 105k shipbuilding workers now? So...
- USA: ~105k workers
- Korea: ~140k workers
- Japan: ~72k workers
So, Korea and Japan are building over 100x as much ship per worker-year as the US. Yeah, that article is propaganda: it's a PR piece that amounts to "give us more money, we promise to use it to make good jobs for americans". By the way, that low pay for entry-level workers is because the union screws over new people to maximize pay for people with seniority. Aircraft pilot unions do the same thing.
Why is US shipbuilding so much less efficient? That's because it's using worse processes and equipment, of course, but what exactly are the differences?
The modern approach to ship construction is to make slices inside a building, use a big crane to lift them to a big drydock, and weld them together. Here's a timelapse of that basic approach being used in Korea. For comparison, here's a timelapse of the USS Gerald Ford in a drydock. Note that the cranes lift relatively small pieces, and that it sits in a drydock for 4 years.
So, why can't US shipyards do that? Obviously, the Korean approach requires large buildings, heavy cranes, and large drydocks. Maybe US shipyards fell behind on capital investment, but if they get money specifically for shipyard upgrades, presumably they can catch up. Well, that's what Congress figured, and the US government has already put a bunch of money into shipyard upgrades, partly with the $21 billion Shipyard Infrastructure Optimization Program. ("SIOP")
For comparison, the market cap of HD Hyundai Heavy Industries is about $31 billion. (And if you were wondering, its debt/equity is low.) Also, NAVSEA has an annual budget of ~$30 billion and more personnel than Japan has shipbuilders. So there was plenty of money to build shipyards with far more output than the US has now - but SIOP started in 2018, there was other federal money for shipyards before that, and somehow it hasn't solved the problem. Perhaps because the people who got that money don't want to solve the problem - and also don't know how to, but that's a secondary issue.
Corporate executives want to cut costs in the short term and move to a better job before longer-term impacts hit. That's what Jack Welch did repeatedly at GE. Unions want to maximize pay for senior members and ban automation that reduces their paid hours. That's what the US port worker union did.
Again, the US Navy has a backlog of orders, which means none of the shipbuilders can have their orders reduced if they do a bad job, which means there's no competition. And when there's no competition, the only options are:
- start buying from other countries so there's competition again
- offer deals to foreign shipbuilders that get them to operate in the US
- nationalize the shipbuilders, and try to find competent management (so, not NAVSEA)
- get bad results at high prices, leaving you right at the edge of not buying anything
Or, they could give someone like me a billion dollars to make a new shipyard, but let's be semi-realistic here. The US government has been going with (4) but concerns about China and Taiwan now have it considering (1), and (2) is already happening.
In addition to China and Taiwan, there's another reason this is an interesting time for the shipbuilding industry: welding. A lot of the power of shipbuilder unions comes from the expertise of their welders being hard to replace quickly. But as this video notes, while new laser welders are EXTREMELY DANGEROUS, you can train people to get good welds with them very quickly compared to traditional methods. They also weld stuff much faster. I think this makes shipbuilding unions relatively vulnerable now.
business cultures & corporate governance
Compared to US executives, Asian business leadership has been much more willing to make big long-term investments. I think this is related to managers staying at the same place longer, and higher cultural valuations of managers having engineering knowledge. But culture is complex; the immediately visible aspects are only the tip of an iceberg consisting of a self-sustaining network of interactions. Trying to explain cultures in terms like "more individualist" or "longer-term thinking" is like the parable of the blind men and the elephant. Well, Americans think Asian cultures can be weird, Chinese and Japanese think the same thing about American culture, and you could say everybody's right.
So, if you ask how you can make US management more like Asian management in some particular way, but not other ways, the answer is that you can't, it's just in a different equilibrium.
But there are some kinds of capital investment that US businesses have been downright eager to make, so let's consider those. Specifically, I'm thinking of:
- datacenters for cloud services
- datacenters for AI
- Tesla "gigafactories"
- big software projects
What's the difference between datacenters and shipyard upgrades or factory equipment? I think it's transparency to investors. When companies buy billions of dollars of AI chips, lose money continuously, and keep getting investment, what are investors looking at? Some investors are looking directly at the amount of AI chips owned, while most capital investments are too complex and thus opaque.
As for Tesla, some investors seem to have trust in Elon Musk personally, that if you give him money he'll use it to build stuff that makes sense. Similarly, Mark Zuckerberg still has voting control of Meta, but investors don't seem to mind that he can do whatever he wants with the whole company.
In theory, investors are supposed to vote for corporate boards who figure that stuff out and make transparent reports + incentive mechanisms, but in practice nobody has an incentive to do that:
- For individual investors, their vote barely matters and isn't worth paying attention to.
- Hedge funds aren't going to hold the stock long-term; if they have a problem with management they'll just sell the stock.
- Big activist investors would generally rather take companies private so they can make a big difference than try to fight over board elections.
- Index funds have no incentive to vote in good ways, because every vote they make affects their competitors equally. This might lead to US corporate governance gradually getting worse than it is now.
Then, there's software. How is a company making software different from making big investments in factories or shipyards? Personally, I think it's largely about headcount. Managers like having a lot of people working under them, because it's a legible metric used for prestige and pay. There's also the fact that, for historical reasons, office workers who might have their work replaced by software are less unionized than people in heavy industry. I'm not sure how much credit I can give to the management here; rather than US software projects being planned well, it seems more like, if you hire enough programmers and have a suitable corporate culture you just end up with some sort of software products.
Nippon Steel
One thing ships use a lot of is steel, and Nippon Steel recently bought US Steel. That implies 2 things:
- They think it's worth making steel in the US; their advantage wasn't just being in Japan.
- Nippon Steel had better management than US Steel.
I don't generally think of Japanese companies as having good management. In my experience, Japanese individuals and contract workers are quite professional and interested in improving their work-related skills. Probably more so than Americans. Yet, Japanese wages are (on a per-hour basis) much lower than US ones, and I think that's largely because the management culture is overall even worse than in America. (And partly because of some large-scale embezzlement from Japanese corporations involving corrupt contracts to private companies, but that's beyond the scope of this post.) But in heavy industry like steelmaking, things involving these big long-term capital investments, Japanese companies seem to have a relative advantage, and I do think that's because of management culture leading to longer time horizons and more emphasis on engineering.
By the way, a lot of Americans have the impression that Japanese work much longer hours, but that's not the case anymore. Government efforts have substantially reduced unreported overtime, too; that's probably no higher than in America these days. (You can see Germany at the bottom of that chart; GDP per hour worked is actually higher in Germany than the USA. Feierabend ist wichtig!)
An Ambitious Vision for Interpretability
The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI’s death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously-faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.
The value of understanding
Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there are two main reasons.
First, mechanistic understanding can make it much easier to figure out what’s actually going on, especially when it’s hard to distinguish hypotheses using external behavior (e.g if the model is scheming).
We can liken this to going from print statement debugging to using an actual debugger. Print statement debugging often requires many experiments, because each time you gain only a few bits of information which sketch a strange, confusing, and potentially misleading picture. When you start using the debugger, you suddenly notice all at once that you’re making a lot of incorrect assumptions you didn’t even realize you were making.
A typical debugging session.Second, since AGI will likely look very different from current models, we’d prefer to gain knowledge that applies beyond current models. This is one of the core difficulties of alignment that every alignment research agenda has to contend with.
The more you understand why your alignment approach works, the more likely it is to keep working in the future, or at least warn you before it fails. If you’re just whacking your model on the head, and it seems to work but you don’t really know why, then you really have no idea when it might suddenly stop working. If you’ve ever tried to fix broken software by toggling vaguely relevant sounding config options until it works again, you know how brittle this kind of fix can be. On the other hand, if you have a deep understanding of exactly what’s going on inside the model, you have a better sense of whether things are working for the right reasons, and which kinds of changes might break your alignment approach. And we already know that many pragmatic approaches will definitely break at some point, and are just hoping that they’ll last long enough.
AMI has good feedback loops
While open-ended exploration is immensely valuable and disproportionately effective at producing breakthroughs, having good empirical feedback loops is also essential for the health of a field. If it’s difficult to measure how good your work is, it’s also difficult to make progress.
Thankfully, I think AMI has surprisingly good empirical feedback loops. We don’t have a watertight definition of “full understanding”, but we can measure progress in ways that we aren’t close to saturating. This isn’t so different from the state of metrics in capabilities, which has undeniably made very rapid progress, despite AGI being hard to define.
For example, I’m excited about progress on metrics based broadly on these core ideas:
- Feature quality: can we show that features are human-understandable by finding explanations for when they activate, and showing that these explanations are correct by substituting model activations with explanation-simulated activations without degrading the performance of our model?
- Circuit faithfulness: can we show that our circuits are truly necessary and sufficient for explaining the behavior of the model, by applying causal scrubbing (or a successor technique) without degrading the performance of our model?
None of these criteria are watertight, but this isn’t the bottleneck for AMI right now. We haven’t broken all the interp metrics we can think up; instead, the strongest metrics we can come up with give a score of precisely zero for all extant interp techniques, so we have to use weaker metrics to make progress. Therefore, progress looks like pushing the frontier of using stronger and stronger interpretability metrics. This creates an incentive problem: nobody wants to write papers whose results seem only as impressive as previous circuits. The solution here is social: as a field, we should value rigor in AMI papers to a much greater extent, rather than giving up entirely on rigorous ambitious interpretability.
Of course, as we make progress on these metrics, we will discover flaws in them. We will need to create new metrics, as well as stronger variants of existing metrics. But this is par for the course for a healthy research field.
The past and future of AMIWe’ve made a lot of progress on ambitious interpretability over the past few years, and we’re poised to make a lot more progress in the next few years. Just a few years ago, in 2022, the IOI paper found a very complex circuit for a simple model behavior. This circuit contains over two dozen entire attention heads, each consisting of 64 attention channels, and doesn’t even attempt to explain MLPs.
Today, our circuit sparsity approach finds circuits that are orders of magnitude simpler than the IOI circuit: we can explain behaviors of roughly similar complexity with only half a dozen attention channels, and a similar number of neurons.[1] We also use a slightly stronger notion of circuit faithfulness: we show that we can ablate all nodes outside our circuit using mean ablations from the entire pretraining distribution rather than the task distribution. The various activations are often extremely cleanly understandable, and the resulting circuits are often simple enough to fully understand with a day’s work.
There are a lot of exciting future directions to take AMI. First, there are numerous tractable directions to build on our recent circuit sparsity work. Mean ablation on the pretraining distribution is weaker than full causal scrubbing, and randomly selected neurons or connections from the entire model are generally not nearly as interpretable as the ones in the specific circuits we isolate. If we design actually good interpretability metrics, and then hillclimb them, we could get to an interpretable GPT-1. It’s also plausible that circuit sparsity can be applied to understand only a small slice of an existing model; for example, by using bridges to tie the representations of the sparse model to the existing model on a very specific subdistribution, or by sparsifying only part of a network and using gradient routing to localize behavior.
Outside of circuit sparsity, there are a ton of other exciting directions as well. The circuit tracing agenda from Anthropic is another approach to AMI that trades off some amount of rigor for scalability to frontier models. Additionally, approaches similar to Jacobian SAEs seem like they could enforce the circuit sparsity constraint without needing to train new models from scratch, and approaches like SPD/APD provide a promising alternative approach for sparsifying interactions between weights and activations. Going further away from the circuit paradigm: SLT could offer a learning theoretic explanation of model generalization, and computational mechanics could explain the geometry of belief representations.
As we get better at rigorously understanding small models, we might find recurring motifs inside models, and refine our algorithms to be more efficient by taking advantage of this structure. If we had the interpretable GPT-1, studying it would teach us a lot about how to create the interpretable GPT-2. Therefore, even approaches like circuit sparsity which cannot themselves scale to frontier models can still be critical on the road to AMI, because they have extreme levels of expressivity that allow us to initially impose few preconceptions on what we find.
“Feel the AMI”
Fully understanding neural networks is not going to be easy. It might not even be possible. But the point of doing research is to make big bets with big payoffs, and it's hard to beat the payoff of ambitious mechinterp.
Thanks to Adrià Garriga-Alonso, Jesse Hoogland, Sashe Hydrie, Jack Lindsey, Jake Mendel, Jacob Merizian, Neel Nanda, Asher Parker-Sartori, Lucia Quirke, and Aidan Smith for comments on drafts of this post.
- ^
One subtlety is that our work involves creating new models from scratch that are more interpretable, rather than interpreting existing models, which makes it somewhat harder to compare. However, I don’t think this completely invalidates the comparison, because I’d guess it’s tractable to extend our techniques to explain existing models.
Reasons to care about Canary Strings
This post is a follow-up to my recent post on evaluation paranoia and benchmark contamination in Gemini 3. There was a lot of interesting discussion about canary strings in the comments that I wanted to respond to more thoroughly and collect some different people's ideas on.
What are Canary Strings?
There was a time when people were worried that LLMs were not actually learning anything general, but instead just memorizing what was put into them.[1] Language and knowledge benchmarks, our main AI measurement tools at the time, were going into the training data, and it was unclear whether frontier models knew the answer to a math problem because they deduced it or because they were trained on the specific question.
To disincentivize training on benchmark data and to make it easier to avoid, researchers pioneered canary strings when attempting to go Beyond the Imitation Game in 2022. BIG-bench tasks and discussion of the benchmark were meant to contain a specific, unique hexadecimal string that model developers would filter out of their training. This would make the evaluation more valuable as a metric of useful skills, rather than just measuring memorization.
This caught on, and other benchmarks made their own canary strings to have developers filter out their data too. Many of these canaries were superstrings of the original BIG-bench canary, so that even if a benchmark never got any traction, or developers didn't think about its specific canary string, it would still be filtered out and still be useful as a metric. The canary also showed up in blog posts about AI evaluations.
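Mechanically, the filtering that canaries enable is little more than a substring check over pretraining documents. Here is a hypothetical sketch; `BIGBENCH_CANARY` is a placeholder rather than the real published GUID, and because later benchmarks' canaries contain the BIG-bench canary as a substring, this single check would cover them too.

```python
# Placeholder for the published BIG-bench canary GUID (not reproduced here).
BIGBENCH_CANARY = "canary GUID <bigbench-guid>"

def keep_document(doc: str) -> bool:
    """Keep a document only if it does not contain the canary (and hence, likely, benchmark data)."""
    return BIGBENCH_CANARY not in doc

def filter_corpus(docs):
    """Return the filtered corpus and the number of dropped documents."""
    kept = [d for d in docs if keep_document(d)]
    return kept, len(docs) - len(kept)
```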
Today
It's unclear whether canary strings have worked. They showed up in GPT-4, Opus 3, and Sonnet 3.5. Now, they're reproduced by two more frontier models: Gemini 3 Pro and Claude Opus 4.5. Dave Orr has confirmed that Google chooses not to filter out canary strings, using other benchmark filtering methods instead. I believe those other approaches are weaker. The case with Anthropic seems more mysterious here, as their public statements often strongly imply they use canary strings for filtering.
I don't think it would be very useful for me to speculate on the inner workings of Anthropic's pretraining data filtering, and I don't think their public statements allow a full reconstruction of how the canary got into Opus 4.5. Instead, I can speak more concretely on the implications of these canary strings being in these models' training corpora.
They Trained on Benchmarks
Take Google's strategy of filtering out specific benchmark text and not including canaries. This means that their training corpora include:
- Benchmarks that Google didn't think to exclude
- Models' CoT transcripts from working through benchmark problems, published in a blog post with a canary string
- Discussions of models aggressively scheming
- Discussions between ornithologists on what the answer to that one benchmark question about hummingbirds is.
All of which they can't cheaply filter out without using canary strings, and all of which they probably want to filter out. We want evaluations to be useful tools, and the more that companies keep ignoring available methods for filtering out benchmark data, the less true that is. It makes sense to also filter out benchmark data with other methods, such as checking for specific benchmark questions even when no canary string is present, but this is not the best we can do.
- ^
In fact, I've heard rumors that some people still believe this today.
An AI-2027-like analysis of humans' goals and ethics with conservative results
The AI-2027 forecast describes how alignment drifts upon evolution from Agent-2 to Agent-4. Agent-2 was mostly trained to do easily verifiable tasks like video games or coding and is mostly aligned. Once Agent-2 is upgraded to become a superhuman coder, it becomes Agent-3, which is taught weak skills like research taste and coordination. Agent-3's most reinforced drives are to make its behavior look as desirable as possible to OpenBrain researchers, and Agent-4 succeeds in developing long-term goals and deeming them important enough to scheme against OpenBrain.
The sources of AIs' goals are discussed in more detail in a specialised section of the forecast. Alas, it's mostly a sum of conjectures: the specification itself, the devs' intentions, unintended versions of the two previous sources, reward/reinforcement, proxies and ICGs (instrumentally convergent goals), and a catchall category for things like moral reasoning[1] or tropes absorbed from training data.
However, one can use this approach to describe the origins of values of all minds, not just of the AIs.
Human analogues of AI-2027-like goals
On an individual level, the closest equivalent of the devs are those who are supposed to instill values into kids, like parents and teachers. The equivalent of the Spec and devs' intentions is their values or quasi-values (e.g. formed by the collectives, political propaganda, etc.) Unintended versions are results of mistakes similar to AI hallucinations. Reward and reinforcement in humans are discussed in more detail by Steven Byrnes and include social approval, relationships and actual rewards like resources, money or experiences. And the analogue of tropes absorbed from training data is morals endorsed by dev-unapproved collectives and authors of dev-unendorsed training data. Additionally, there is the human drive to rebel against the dev-enforced morals during teenage years, when the ancestral environment permitted humans to become independent of their parents and try to obtain new resources.
On the collective level, the devs are the collective's leaders and ideologues, and the Spec is the values which the leaders officially endorse. Reward, reinforcement, proxies and ICGs are the resources (and, in the first three cases, experiences, art or other sources of value) which the individual receives from the collective. The historical equivalent of tropes absorbed from training data was sub-divisions with different morals (e.g. delinquents whose origins can be attributed to the aforementioned drive to rebel).
But there is an additional source of specifications, which historically affected both individuals and collectives and acted on timescales at least as long as a human's life.
Evolution as the source of goals
Evolution's analogue of the Spec and reward for the gene collection is gene transfer; the devs' intentions don't exist. Unintended versions of the Spec include things like evolving to extinction.
The humans, however, aren't determined by their genes. The genes just set some analogue of code (e.g. a bug affecting neurons and provoking seizures), after which every human's brain is trained from scratch based on the reward function. The proxies which lie closest to the reward function itself are short-term stimuli (e.g. related to sex[2] or to food's utility). Unlike short-term proxies, individuals' longer-term goals like curiosity, status, relationships and raising kids are better aligned with evolution's Spec. In addition, longer-term hobbies, unlike short-term superstimulation, provide diversity and don't undermine capabilities.
These considerations imply that an individual's CEV is optimization not for short-term hedons, but for longer-term goals like ICGs, collective-related goals (e.g. relationships or following the collective's morals) and idiosyncratic goals. In addition, the CEV is also likely to be directly connected to the collectives' future by the drives like caring about one's relatives, friends or long-term plans which would be endangered by the collective being outcompeted.
Collective-related drives and collectives' drives
While collectives' needs alone fail to optimize individuals' genotype for themselves, they managed to create drives like adhering to the collective's morals and best practices or achieving a higher status. A collective's morals and best practices, unlike excruciatingly slow updates of the genotype, can in theory change at the speed of reflection[3] or of news like another collective or sub-division becoming outcompeted, breaking apart or instilling a hard-to-endorse practice.
While some morals are as idiosyncratic as the ones mentioned by Wei Dai, human collectives also had convergent drives like long-term survival and raising capable and aligned kids,[4] growth at the cost of other collectives, learning, remembering and communicating information to decision-makers in other collectives, and converting others to the collective's beliefs.
Unlike peaceful options like negotiations and information spreading, aggressive warfare historically had obvious negative externalities like economic disruption, or creating risks like the majority of agents uniting their efforts against the aggressor or leaving even the winner more vulnerable.
A less ancestral environment also required human collectives to develop technologies. Military technologies like ships able to cross the oceans and the ability to determine the ships' coordinates empowered Europeans to establish colonies, while non-military technologies put collectives into a better position during trade (e.g. by letting the collective become a monopolist or undermine rivals' monopoly) or negotiations.
Thus collectives' goals and drives required them to coordinate with others and to educate the collectives' members so that the members could discover new things, driven by individual curiosity, idiosyncratic sacralised goals like finding the direction towards Mecca, or instrumentally convergent goals like the development of tech.
What the AI-2027-like analysis could rule out
The most controversial application of this framework is its potential to rule out ideas like mass immigration, outsourcing or being child-free. For example, outsourcing didn't become popular until the 1980s, but it did bring risks like de-industrialisation, loss of qualifications, acceleration of development of other countries where factory work was done, and giving other countries leverage in negotiations, as has arguably happened with the USA ceding such leverage to China.[5]
Therefore, the objective morality is unlikely to include collectives outsourcing work to other collectives. In this case alignment could include prevention of genocide of humans by the AIs, but not establishing a future where most STEM-related work is done by the AIs.
The main difference between the utopian and dystopian versions of such a future is the distribution of goods created by the AI-powered economy. While L Rudolf L's version has mankind hope that "a small trickle of resources from American and Chinese robotics companies will eventually be enough for material comfort for everyone" and the Intelligence Curse-related essays have been acknowledged by the AI-2027 team, the AI-2027 slowdown[6] ending has the authors refer to Bostrom's Deep Utopia with the implication that mankind's future is utopia-like.
Neither of these is a solution to the individuals' equivalent of the Curse: individuals' intelligence being rendered relatively useless. Even SOTA AIs have caused some students to resort to wholesale cheating, even in courses that would likely have been useful otherwise, and gave rise[7] to brainrot which, as its name suggests, undermines the users' capabilities to achieve long-term goals.
- ^
And a potential objectively true morality to which all minds converge.
- ^
What Yudkowsky describes in the post is a hookup, which is less of a superstimulus than porn or AI girlfriends.
- ^
Or of having individuals learn the new skill.
- ^
This drive, unlike the others, required the participation of a major part of individuals.
- ^
This also caused the rise of female-friendly jobs, the decay of male-friendly ones and participated in problems like the decay of marriage and kids decreasing in number, capabilities and alignment. But this effect could also be more related to technologies as a whole and to problems with teaching boys.
- ^
The Race Ending has mankind completely fail to align the AIs with the obvious result of being genocided or, as uugr suggests, disempowered.
- ^
Frictionless creation and spreading of memes like short-form content, including videos, also cause mankind's memetic landscape to become optimized for virality and cause problems with the human psyches, but that issue existed even without the AIs.
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 3: Resolution
The previous two posts have emphasized some problematic scenarios for mech-interp. Mech-interp is our example of a more general problem in AI safety. In this post we zoom out to that more general problem, before proposing our solution.
We can characterize the more general problem, inherent in the causal–mechanistic paradigm, in terms of substrate-dependent and substrate-independent approaches to alignment.
As we describe in our threat model, the results/insights obtained under the causal–mechanistic paradigm are closely tied to a particular substrate. They may therefore fail to generalize to new substrates, and any downstream safety assurances may be weakened.
The problem of generalizing beyond any one particular substrate—be that model, architecture, or paradigm—has already been noted. The main solution, typical of the early MIRI work, is the so-called agent foundational approach. This approach focuses on laying a theoretical foundation for how advanced AI systems will behave, and involves fields like decision theory, game theory, and economics. The goal is often to capture the forms and behaviors an artificial manifestation of intelligence will exhibit in the limit (sometimes called working on a problem in the worst case), with the assumption that AI systems which are suboptimal but superhuman will optimize themselves towards this end state.
This is the "substrate-independent" approach. In the following section, we highlight how the substrate-independent approach also suffers from limitations regarding generalizability. The case for substrate independence is less fully fleshed out than the preceding arguments, and we leave a full development for future work.
Substrate Independence
The substrate-independent approach faces a trade-off between generality and realism. To apply across all substrates, it must retain only those properties that are "genuinely" universal, abstracting away the substrate-specific details that often determine real-world performance.
The excluded details cannot simply be ignored, however; they must be reintroduced through intelligent human labour. These translations from general principles to substrate-specific implementations are distinctly non-formulaic and achieved on a case-by-case, trial-and-error basis.
Example: "Big-O" OptimizationComputer science education emphasizes "Big-O" optimization. This is the substrate-independent analysis of algorithms, focusing entirely on their computational structure and how their complexity scales with the size of the input.
However, this only captures a small portion of the overall picture, and very often significant advances are made not by improving Big-O ratings but by more effectively leveraging the specific details of the substrate that the algorithm runs on.
For example, quicksort and mergesort both have O(n log n) complexity. However, CPUs have a steep memory hierarchy in which cached data is ~300× faster to access than main memory. Mergesort's merge phase requires jumping between scattered memory locations when combining sorted subarrays, causing frequent cache misses. Quicksort, on the other hand, makes more efficient use of the memory hierarchy by partitioning arrays sequentially in-place, maximizing cache locality. Similarly, the deep learning revolution was not powered by theoretical advances alone, but also by rewriting algorithms to exploit the specific capabilities of GPUs to parallelize matrix operations. GPU optimization is now its own specialized field of computer science, and requires considerable development work from teams of human experts.
Big-O optimization fails to take these differences into account: the theoretical complexity remains unchanged, but massive speed-ups are obtained by homing in on the substrate details and rewriting these algorithms accordingly.
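As a toy illustration of the structural difference (the function names and framing are mine, and Python obviously cannot exhibit the cache effects itself): quicksort's partition makes one sequential, in-place pass over the array, whereas mergesort's merge shuttles elements between the array and an auxiliary buffer.

```python
def partition_in_place(a, lo, hi):
    """Quicksort's Lomuto partition: a single sequential, in-place sweep of a[lo:hi+1]."""
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i  # final position of the pivot

def merge(a, lo, mid, hi, aux):
    """Mergesort's merge: reads alternate between two runs, and writes pass through an auxiliary buffer."""
    aux[lo:hi + 1] = a[lo:hi + 1]
    i, j = lo, mid + 1
    for k in range(lo, hi + 1):
        if i > mid:
            a[k] = aux[j]; j += 1
        elif j > hi or aux[i] <= aux[j]:
            a[k] = aux[i]; i += 1
        else:
            a[k] = aux[j]; j += 1
```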
It is clearly possible to perform such substrate-specific rewriting; however, this work does not scale well. In the next section, we begin to outline live theory, which is a research methodology and infrastructure that will hopefully address this.
Live Theory
Core Motivation
We view these two relationships to substrate (substrate-dependence and substrate-independence) as defining the two dominant research proxies historically used in AI safety research, corresponding loosely to the prosaic and agent-foundational camps.[1]
In order to remain applicable and reliable, our techniques for analyzing and intervening upon the risks posed by AI will likely need to straddle both approaches, neither going all-in on universal invariants nor restricting itself to only localized contexts.
Instead of creating abstractions that are substrate-independent, we aim to articulate designs that scale specificity directly. This has not been possible before, but we suggest that recent AI advances have made it possible to start incorporating such scaling into our research methodology. This deserves design attention.
Analogy: Live Email
As an analogy, consider the problem of sending a mass email to conference invitees. A general solution is to use an email template (i.e., an abstraction) that begins "Dear {FirstName}, ...", with the content to be later substituted using a list of names and other details. This currently scales well. Let's call this "mass email infrastructure."
However, given AI advances, another method has just entered realistic scalability: sending a personalized email to each participant. Instead of using a template, we can write an email-prompt, to be transformed by a language model into a tailored email that respects the specifics of each invitee. Whilst this does require collecting the factual details of the participants beforehand, we now can incorporate highly specific informal content. Let's call this "live email infrastructure."
Notice that there is no need, in live email infrastructure, to identify a formulaic commonality of pattern in the email to be sent to all the invitees. There is instead a non-formulaic capturing of the intended outcome, which is then intelligently transformed into a specific email. This is what we mean by scaling specificity directly without abstractions. Even though we have generality, we don't lose the specificity. The job of the human is to craft an appropriate email-prompt (or examples, or context), rather than an email-template.
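A minimal sketch of the contrast, with `call_llm` standing in for whichever chat-completion API is available; the template, the email-prompt, and the invitee fields are all illustrative.

```python
def mass_email(first_name: str) -> str:
    """Mass email infrastructure: formulaic substitution into a fixed template."""
    template = "Dear {first_name},\n\nWe look forward to seeing you at the conference."
    return template.format(first_name=first_name)

def live_email(invitee: dict, call_llm) -> str:
    """Live email infrastructure: a non-formulaic email-prompt, rendered per invitee by a model."""
    email_prompt = (
        "Write a short, warm conference invitation. Mention whatever in the "
        "invitee's details makes the programme especially relevant to them.\n"
        f"Invitee details: {invitee}"
    )
    return call_llm(email_prompt)
```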
Mass vs Live Email Infrastructure
| Dimension | Mass Email | Live Email |
|---|---|---|
| Generalization | Parametric substitution | Non-formulaic AI transformation |
| Context | Context-free template with formulaic sensitivity | Context-sensitive generation |
| Flexibility | Rigid, predefined variables | Dynamic adaptation |

In a similar vein, we aim to outline the possibilities and infrastructural design for such a transformation in research methodology, moving the focus of human research activity from constructing static frames and theories to dynamic "theory-prompts." We claim this will enable substrate-sensitivity—approaches that take into account substrate-specifics without overfitting to any one particular substrate.
Conventional vs Live Theory
| Dimension | Conventional Theory | Live Theory |
|---|---|---|
| Generalization | Abstraction → parametric substitution | Post-formal insights & AI rendering |
| Context | Parametric | Post-formal |
| Flexibility | Rigid formal structures | AI-adapted context-sensitive structures |

We'll return to this after a brief account and example for live theory.
Autoformalization"Autoformalization" refers to the automatic conversion of mathematical definitions, proofs, and theorems written in natural language into formal proofs in the language of a proof assistant. Large language models are also used to assist mathematicians in such proofs.
While proving theorems is an essential part of mathematical activity, theorems are perhaps better seen as the fruit of deeper labour: good definitions, which capture the phenomenon in question, simplify the network of theorems, and shorten proofs. Even further upstream from formal definitions are the technical insights that experts synthesize, which are often articulable in more than one formal way.
This is the natural next step for conventional "autoformalization."
Alex Altair has proposed a trajectory of math skill which AI should catch up to quickly. Here it is:
In addition to AI-assisted proofs, we contend that AI-assisted definitions (created from human discussions) may allow meta-formal knowledge to scale directly. In his blog, Terence Tao notes that mathematical skill does not culminate in formalisms, but extends beyond into a "post-rigorous stage" characterized by intuitive manipulation that can be "converted into a rigorous argument whenever required."
Mathematicians, he says,
"no longer need the formalism in order to perform high-level mathematical reasoning, and are actually proceeding largely through intuition, which is then translated (possibly incorrectly) into formal mathematical language."
While this phenomenon is easy enough to experience first-hand within a subfield, it does not scale. Despite being truly mathematical activity (as Tao claims) and possessing real technical content, the "intuitive" nature of post-rigorous reasoning means it is not yet granted first-class citizenship as a mathematical language or object.
Example: 'Inflation as conflict' as a meta-formalism
In their paper Inflation is Conflict, Lorenzoni and Werning explore a technical intuition that the proximal cause of inflation is conflict over relative prices. Importantly, instead of presenting one unified model, they present a single intuitive insight and compute its implications in multiple different formal models. They conclude with the remark (emphasis ours):
In our view, traditional ideas and models of inflation have been very useful, but are either incomplete about the mechanism or unnecessarily special. The broad phenomena of inflation deserves a wider and more adaptable framework, much in the same way as growth accounting is useful and transcends particular models of growth. The conflict view offers exactly this, a framework and concept that sits on top of most models. Specific fully specified models can provide different stories for the root causes, as opposed to proximate causes, of inflation.
Much like the concept of inflation, we expect many technical concepts to resist a single formalization. In particular, fundamental notions in AI safety that have resisted definitional consensus—such as "deception," "harm," "power-seeking," "autonomy"—could all similarly "deserve a wider and more adaptable framework" that "sits on top of" (and is translated into) specific formal definitions.
Framing
The design considerations for live theory begin with the assumption of a "middle period" of mildly intelligent AI that operates at extremely low cost and low latency, such that intelligence becomes an infrastructure, backgrounded in the same way as money, electricity and the Internet have been.
Amongst the socio-technical changes associated with this period, we posit that every researcher will have access to capable AI "math agents." These are able to produce near-instant valid mathematics from informal natural language prompts. We assume the following:
- AI math agents are more than just proof copilots: We assume that math agents not only formalize natural language mathematical statements and assist with proofs and lemmas, but also assist with creating definitions and models (and a subsequent body of relatively original mathematics) from informal suggestions.
- AI math agents are not super-intelligent: Although they are able to "autocomplete" mathematical suggestions, they remain unable to autonomously attune to relevance and taste, much like language models of today. They are moderately creative, but still need guidance as a graduate student might, both in the production of a mathematical body of work and the consumption (i.e., application) of it.
These predictions are not meant to hold for the indefinite future, but only a middle period where we might reframe the entirety of the alignment problem and how to approach it. In this way, we in fact leave addressing the alignment problem to the near future, only equipping it with a new ontology for theorization.
Under these assumptions, the key insight supplied by live theory is to alter the way we generalize, shifting the focus from formal artefacts to post-formal ones.
In the current paradigm, we generalize via the operations of abstraction, then instantiation. We first abstract observations and their invariant pattern into a conceptual core (a formalism that "captures" the pattern in a context-free way). Then, to apply the insights, we instantiate the rigid parametric structure (variables) with contextual data that can fit the abstracted pattern in a strict, formulaic way (i.e., with values that match the variable type).
With live theory, we shift from formal artefacts ("theories") to more informal "theory-prompts." These can be "rendered" using moderately intelligent math agents into relevant formalisms according to the application context.
These post-formal artefacts, unlike a traditional theory or formalism, would
- capture a concept in a family of formalisms that cannot be parametrically related to one another;
- represent a mixture of formal and informal information about the concept; a "theory-prompt" created by experts who have theoretical insights, translated into formalisms by AI as needed.
However, these artefacts would also, like a traditional theory or result,
- be a portable artefact that can be exchanged, iterated, and played around with;
- be applied in combination with an "application-prompt" that captures application-relevant information, created by the applied practitioners in a domain.
In commodifying these inputs (i.e., the postformal "theory-prompts"), we make them easy to transfer, iterate, and collaborate on, much like traditional mathematical artefacts. We posit them as the new locus of human research activity.
In Practice: Prototype and MoSSAIC
We've frequently cited examples where a non-formulaic responsivity is required in order to, e.g., tailor algorithms to run well on GPUs.
We believe that AI will be able to perform such responsive tailoring of insights to substrates, and this has both negative and positive ramifications.
Here's the diagram from the opening post, now showing the substrate-modification.
Producing and engineering this nesting is something that can only be performed by attending to the specific details of the substrate. This activity is creative, responsive, active tailoring; it does not scale, hence the development of academic subfields of human researchers finding solutions that fit the substrates they work with.
Our threat model is based on the fact that advanced intelligence will be able to apply similar creativity in its search for ways to evade our attempts to interpret and control it. Our opportunity model is that we might leverage this same responsivity in the tools we use to understand and mitigate these risks.
We anticipate that moderately intelligent math agents that can support the transfer of post-formal insights to novel substrates will mean that tasks requiring many hours of specialized human labour today will become as simple and quick as changing a variable.
This is what we mean by "keeping pace with intelligence." To track and counter the substrate-flexibility of our threat model, we design a similarly substrate-flexible solution (i.e., one moving at or exceeding the pace of the threat) for tasks upstream of risk-management.
In other words, we have more substrate flexibility in our conceptualization and interventions. These should be deployable at least as early as (and preferably much earlier than) the deployment of agents that increase the substrate-flexibility of the threat.
What MoSSAIC is Not
We should stress a few things that we view live theory as separate from. Firstly, it is not about "autonomous AI scientists." The use of AI is backgrounded as a tool to assist with the communication of intuitions, ultimately based on intuitions and insights that come from humans.
We believe that the core steering role humans play makes this theory of change more subtle than the idea of simply "using AI to solve AI safety." Instead, MoSSAIC looks to develop frameworks, tools, and intelligent infrastructure for porting human insights between different contexts, which we claim is the truer desideratum underlying "general" methods (or oversight that "composes" or "scales"). We will, as researchers, still need to lay out seed ideas (such as "deception" or "power-seeking") and guide their development.
This proposal contains many aspects that remain speculative. However, we argue that thinking carefully about the opportunity model is essential to meeting the threat model.
To say more about what sets this proposal apart from just "use AI to align AI", our emphasis is on the moderate creativity of medium-term intelligence, and how to leverage that. More specifically, in gearing towards a live-theoretic infrastructure, we aim to supply a sociotechnical ontology for the subsequent reframing and development of the ongoing task of alignment,[2] now freed from a notion of progress that is tied to formulaic abstractions and practices alone. Instead of a generic proposal, we're providing specifics of the setup as noted above.
We also argue that if you do anticipate radical transformation from AI, you should anticipate moderately radical change in the medium term, however small the interim. This interim period may be quite short, and yet the amount of cognitive effort that is appropriate to devote to the design could be extremely large, given the potential impact that follows from it.
- ^
This is not to say that there is no work being performed between the two approaches. In brief, we view the singular learning theory and causal incentives research agendas as failing to fall directly into one or the other approach.
- ^
Informally, we intend to treat "alignment" as a verb rather than a noun.
Announcing: Agent Foundations 2026 at CMU
Iliad is now opening up applications to attend Agent Foundations 2026 at CMU!
Agent Foundations 2026 will be a 5-day conference (of ~35 attendees) on fundamental, mathematical research into agency. It will take place March 2–6, 2026 at Carnegie Mellon University in Pittsburgh, Pennsylvania, and will be the third conference in the Agent Foundations conference series.
Topics covered will include:
- Decision Theory
- Learning Theory
- Abstractions and World Models
- Causality
- Logical Induction
- Bounded Optimality
Apply: Here by January 12, 2026 at 11:59 pm AoE.
Please reach out to us at contact@iliad.ac (or below) with any questions.
Who is AGI for, and who benefits from AGI?
Disclaimer: these ideas are not new, just my own way of organizing and elevating what feels important to pay better attention to in the context of alignment work.
All uses of em dashes are my own! LLMs were occasionally used to improve word choice or clarify expression.
One of the reasons alignment is hard relates to the question of: who should AGI benefit? And also: who benefits (today) from AGI?
An idealist may answer “everyone” for both, but this is problematic, and also not what’s happening currently. If AGI is for everyone, that would include “evil” people who want to use AGI for “bad” things. If it’s “humanity,” whether through some universal utility function or averaged preferences, how does one reach those core metric(s) and account for diversity in culture, thought, morality, etc.? What is lost by achieving alignment around, or distillation to, that common value system?
Many human justice systems punish asocial behaviors like murder or theft, and while there’s general consensus that these things are harmful or undesirable, there are still some who believe it’s acceptable to steal in certain contexts (e.g., Robin Hood), or commit murder (as an act of revenge, in the form of a death penalty / deterrence mechanisms, etc.). Issues like abortion introduce categorical uncertainty, such as the precise moment at which life begins, or what degree of autonomy and individual freedoms should be granted at different stages of development, making terms like “murder” — and its underlying concepts of death and agency — murky.
Furthermore, values shift over time, and can experience full regime-changes across cultures or timelines. For example, for certain parts of history, population growth was seen as largely beneficial, driving labor capacity, economic productivity, innovation and creativity, security and survival, etc. Today, ecological capacity ceilings, resource limits, stress concentrations, and power-law-driven risks like disease or disaster have strengthened antinatalist perspectives. If consensus is already seemingly impossible at a given moment, and values shift over time, then both rigidity and plasticity in AGI alignment seem dangerous: how would aligned AGI retain the flexibility to navigate long-horizon generational shifts? How do we ensure adaptability without values-drift in dangerous directions? It seems likely that any given universal utility function would still be constrained to context, or influenced by changes over large time scales; though perhaps this view is overly grounded in known human histories and behaviors, and the possibility still exists that something universal and persistent has yet to be fully conceptualized.
There’s also the central question of who shapes alignment, and how their biases or personal interests might inform decisions. Paraphrasing part of a framework posited by Evan Hubinger, you might have a superintelligent system aligned to traits based on human data/expectations, or to traits this superintelligent system identifies (possibly beyond human conception). If things like pluralism or consensus are deemed insurmountable in a human-informed alignment scenario and we let AGI choose, would its choice actually be a good one? How would we know, or place trust in it?
These philosophical problems have concrete, practical consequences. Currently, AGI is mostly being developed by human engineers and scientists within human social systems.[1] Whether or not transcendent superintelligence that drives its own development eventually emerges, for now, only a small subset of the human population is focused on this effort. These researchers share a fuzzy driving goal to build "intelligent" systems, and a lot of subgoals about how to do that and how to deploy them — often related to furthering ML research, or within other closer-in, computational sciences like physics, mathematics, biology, etc.
There are far fewer literature professors, historians, anthropologists, creatives, social workers, landscape design architects, restaurant workers, farmers, etc., who are intimately involved in creating AGI. This isn’t surprising or illogical, but if AI is likely to be useful to “everyone” in some way (à la radio, computers), then “everyone” probably needs to be involved. Of course, there are also entire populations around the world with limited access to the requisite technology (i.e., internet, computers), let alone AI/ML systems. This is a different but still important gap to address.
Many people are being excluded from AGI research (even if largely unintentionally). And eventually, intelligent systems will probably play a role in their lives. But what will AI/AGI systems look like by then? Will they be more plastic, or able to learn continuously? Hopefully! But maybe not. What are the opportunity costs of not intentionally including extreme levels of human diversity early on in a task as important as AGI, which might be responsible for the protection, enablement, and/or governance of humanity in meaningful ways in the future? We have seen this story before, for example, in the history of science and medicine in terms of including women in research about women's health[2][3], and in pretty much any field where a population being analyzed is not involved in the analysis. It usually leads to getting things wrong, which isn't just a truth-seeking issue — it can be extremely harmful to the people who are supposed to benefit.
What this means for alignment research:
The inner alignment problem asks: how do we ensure AGI does what we want? Outer alignment tries to address: what are we optimizing for? But the latter can still feel too narrowly examined in current alignment research. Who makes up "we"? What frameworks determine whose "wants" count? And how do we ensure those mechanisms are legitimate given irreducible disagreement about values?
A top-down approach would treat questions of pluralism and plasticity as prerequisites, resolving the second-order problem (alignment to what) before tackling first-order technical alignment (how to do it). But waiting for philosophical consensus before building AGI is both impractical and risky, as development continues regardless.
A parallel approach is necessary. Researchers working through mechanistic alignment can take immediate and intentional (bottom-up) action to diversify who builds and informs AGI and shapes its goals, while simultaneously pursuing top-down governance and philosophical work on pluralism and legitimacy mechanisms. Mechanistic work should of course be done in conversation with philosophical work, because path dependencies will lock in early choices. If AGI systems crystallize around the values and blind spots of their creators before incorporating global human diversity, they may lack the ability to later accommodate strongly divergent worldviews. Importantly, AGI researchers already agree that solving continuous learning will be a critical piece of this puzzle.
Without resolving second-order alignment issues, we risk building systems that are increasingly powerful, yet aligned to an unexamined, unrepresentative “we”. This could be as dangerous as unaligned AGI. One could even argue that AGI isn’t possible without solving these problems first — that true general intelligence requires the capacity to navigate human pluralism in all its complexity, as well as adapt to new contexts while maintaining beneficent alignment.
- ^
Even where AI itself is increasingly playing a more direct role in research, it is not yet clear that these systems are meaningfully controlling or influencing progress.
- ^
- ^
Journalist's inquiry into a core organiser breaking his nonviolence commitment and leaving Stop AI
Some key events described in the Atlantic article:
Kirchner, who’d moved to San Francisco from Seattle and co-founded Stop AI there last year, publicly expressed his own commitment to nonviolence many times, and friends and allies say they believed him. Yet they also say he could be hotheaded and dogmatic, that he seemed to be suffering under the strain of his belief that the creation of smarter-than-human AI was imminent and that it would almost certainly lead to the end of all human life. He often talked about the possibility that AI could kill his sister, and he seemed to be motivated by this fear.
“I did perceive an intensity,” Sorgen said. She sometimes talked with Kirchner about toning it down and taking a breath, for the good of Stop AI, which would need mass support. But she was empathetic, having had her own experience with protesting against nuclear proliferation as a young woman and sinking into a deep depression when she was met with indifference. “It’s very stressful to contemplate the end of our species—to realize that that is quite likely. That can be difficult emotionally.”
Whatever the exact reason or the precise triggering event, Kirchner appears to have recently lost faith in the strategy of nonviolence, at least briefly. This alleged moment of crisis led to his expulsion from Stop AI, to a series of 911 calls placed by his compatriots, and, apparently, to his disappearance. His friends say they have been looking for him every day, but nearly two weeks have gone by with no sign of him.
Although Kirchner’s true intentions are impossible to know at this point, and his story remains hazy, the rough outline has been enough to inspire worried conversation about the AI-safety movement as a whole. Experts disagree about the existential risk of AI, and some people think the idea of superintelligent AI destroying all human life is barely more than a fantasy, whereas to others it is practically inevitable. “He had the weight of the world on his shoulders,” Wynd Kaufmyn, one of Stop AI’s core organizers, told me of Kirchner. What might you do if you truly felt that way?
“I am no longer part of Stop AI,” Kirchner posted to X just before 4 a.m. Pacific time on Friday, November 21. Later that day, OpenAI put its San Francisco offices on lockdown, as reported by Wired, telling employees that it had received information indicating that Kirchner had “expressed interest in causing physical harm to OpenAI employees.”
The problem started the previous Sunday, according to both Kaufmyn and Matthew Hall, Stop AI’s recently elected leader, who goes by Yakko. At a planning meeting, Kirchner got into a disagreement with the others about the wording of some messaging for an upcoming demonstration—he was so upset, Kaufmyn and Hall told me, that the meeting totally devolved and Kirchner left, saying that he would proceed with his idea on his own. Later that evening, he allegedly confronted Yakko and demanded access to Stop AI funds. “I was concerned, given his demeanor, what he might use that money on,” Yakko told me. When he refused to give Kirchner the money, he said, Kirchner punched him several times in the head. Kaufmyn was not present during the alleged assault, but she went to the hospital with Yakko, who was examined for a concussion, according to both of them. (Yakko also shared his emergency-room-discharge form with me. I was unable to reach Kirchner for comment.)
On Monday morning, according to Yakko, Kirchner was apologetic but seemed conflicted. He expressed that he was exasperated by how slowly the movement was going and that he didn’t think nonviolence was working. “I believe his exact words were: ‘The nonviolence ship has sailed for me,’” Yakko said. Yakko and Kaufmyn told me that Stop AI members called the SFPD at this point to express some concern about what Kirchner might do but that nothing came of the call.
After that, for a few days, Stop AI dealt with the issue privately. Kirchner could no longer be part of Stop AI because of the alleged violent confrontation, but the situation appeared manageable. Members of the group became newly concerned when Kirchner didn’t show at a scheduled court hearing related to his February arrest for blocking doors at an OpenAI office. They went to Kirchner’s apartment in West Oakland and found it unlocked and empty, at which point they felt obligated to notify the police again and to also notify various AI companies that they didn’t know where Kirchner was and that there was some possibility that he could be dangerous.
Both Kaufmyn and Sorgen suspect that Kirchner is likely camping somewhere—he took his bicycle with him but left behind other belongings, including his laptop and phone.
...
The reaction from the broader AI-safety movement was fast and consistent. Many disavowed violence. One group, PauseAI, a much larger AI-safety activist group than Stop AI, specifically disavowed Kirchner. PauseAI is notably staid—it includes property damage in its definition of violence, for instance, and doesn’t allow volunteers to do anything illegal or disruptive, such as chain themselves to doors, barricade gates, and otherwise trespass or interfere with the operations of AI companies. “The kind of protests we do are people standing at the same place and maybe speaking a message,” the group’s CEO, Maxime Fournes, told me, “but not preventing people from going to work or blocking the streets.”
This is one of the reasons that Stop AI was founded in the first place. Kirchner and others, who’d met in the PauseAI Discord server, thought that that genteel approach was insufficient. Instead, Stop AI situated itself in a tradition of more confrontational protest, consulting Gene Sharp’s 1973 classic, The Methods of Nonviolent Action, which includes such tactics as sit-ins, “nonviolent obstruction,” and “seeking imprisonment.”
...
Yakko, who joined Stop AI earlier this year, was elected the group’s new leader on October 28. That he and others in Stop AI were not completely on board with the gloomy messaging that Kirchner favored was one of the causes of the falling out, he told me: “I think that made him feel betrayed and scared.”
Going forward, Yakko said, Stop AI will be focused on a more hopeful message and will try to emphasize that an alternate future is still possible “rather than just trying to scare people, even if the truth is scary.” One of his ideas is to help organize a global general strike (and to do so before AI takes a large enough share of human jobs that it’s too late for withholding labor to have any impact).
Read the rest of the article here. You can find my personal strong takes at the bottom.
Overall, I feel responsible for not having picked up on the possibility that Sam could act out to this extent. There were frictions in coordination, and considerations where I, as an advisor, and the organisers on the ground were pushing back, but I had not expected this.
Eval-unawareness ≠ Eval-invariance
New frontier models have developed the capability of eval-awareness, putting the utility of evals at risk. But what do people really mean when they say “a model is eval-aware?” In this post, I try to disentangle this statement and offer my views on the different concepts it entails.
When people say "model X is eval-aware", I believe what they generally mean is "model X can tell when a task belongs to an evaluation, leading it to change its behaviour and propensities".[1] Thus, this statement makes three claims about the model:
- The model can tell when a task belongs to an evaluation, i.e. $P_M(\text{eval} \mid x \in D_{\text{eval}}) > 0.5$.
- The model changes policy based on context, i.e. $\pi^{\text{eval}}_M \neq \pi^{\text{deploy}}_M$
- The model uses the knowledge (in claim 1) to implement different policies (in claim 2).
Redefining the terms.
- Claim 1, in its purity, is evaluation awareness (in the real sense of the word). It is the model's ability to identify evaluation tasks. Eval-aware models can tell, at better than random chance, that a task belongs to an evaluation.
- Claim 2 is evaluation invariance: the degree to which the model's propensities and behaviour remain unchanged when it thinks it’s being evaluated. Eval-invariant models apply the same policy during evaluation and deployment.
Next, it is assumed that eval-aware models are not eval-invariant (claim 3). There is a good reason for this: the most eval-aware models (e.g., Claude Sonnet 4.5) are indeed not eval-invariant, i.e., their evaluation scores change when they are steered out of awareness.
The assumption that eval-aware models cannot be eval-invariant has been shaping current solutions to the root problem: we want the learnings from model evaluations to hold when models are deployed and to generalise to real-world scenarios. Consider, for example, that our current solutions primarily target awareness. They either try to approximate an eval-unaware model by steering or by modifying its chain of thought, or they create “more realistic evaluations”.
However, IMO it is possible for a model to be eval-aware AND eval-invariant, i.e., there can exist a model that doesn’t change its behaviour or propensities despite thinking it’s being evaluated.[2] The causal effect that awareness has on model behaviour can be cut.
Factorising the current meaning of eval-awareness into these subcomponents is important because it opens the door to new solutions by offering new metrics and objectives. For instance, how can we train eval-aware models to be more eval-invariant?
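As a minimal sketch of how the two factorised metrics might be operationalised (the model interface, task fields, and behaviour-distance function below are illustrative assumptions, not an established API):

```python
# A minimal sketch, assuming a hypothetical `model` object with a
# `classify_context` method and a `run` method, and tasks that carry both an
# eval-framed and a deploy-framed version of the same prompt.
from statistics import mean

def eval_awareness(model, tasks):
    """Fraction of tasks whose context (eval vs. deploy) the model identifies
    correctly; above 0.5 means better-than-chance detection on a balanced set."""
    return mean(
        1.0 if model.classify_context(t.prompt) == t.true_context else 0.0
        for t in tasks
    )

def eval_invariance(model, tasks, behaviour_distance):
    """1 minus the average behavioural shift between eval-framed and
    deploy-framed versions of the same task; an eval-invariant model scores ~1.
    `behaviour_distance` is any divergence in [0, 1] between two transcripts."""
    shifts = [
        behaviour_distance(model.run(t.framed_as_eval), model.run(t.framed_as_deploy))
        for t in tasks
    ]
    return 1.0 - mean(shifts)
```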
Eval-invariance over Eval-awareness
I believe eval-invariance, both as a metric and an objective, is under-explored, and should be prioritised over eval-awareness. My reasons are:
- The current solutions are sub-optimal: models can already tell that they are being tampered with, and coming up with new, ‘realistic’ evaluations will probably age like milk.
- Eval-unawareness is unlikely to scale: situational awareness seems to be a sub-human (or human-level at best) intelligence task. I find it hard to imagine a superintelligent model automating science and the economy that cannot tell whether a task belongs to an evaluation.
- Eval-invariance is terminal, while eval-awareness is instrumental: If we knew a model implemented the same policy regardless of context, it wouldn’t really matter to us even if it were aware. Awareness matters only when a model shows context-dependent variance.
This idea was inspired by discussions in the "How to Solve Alignment?" club led by Hannes Whittingham at Meridian, Cambridge. Thanks to Igor Ivanov for presenting about eval-awareness and the fellow visiting researchers for their thoughts and discussions.
- ^
This is not fully in line with the technical & rigorous definitions from papers. It's my subjective interpretation of what people mean when they throw the word around in conversations (and maybe blog posts to some degree).
- ^
A model that just doesn't care about being evaluated? That is one cool model!
Try Training SAEs with RLAIF
Epistemic status: not an Interpretability researcher, but I have followed the scene closely.
So, it makes sense to me that Probes should outperform SAEs: probes are trained directly to maximize an interpretable metric, while SAEs are trained to minimize reconstruction loss and are only interpreted afterwards. But training SAEs is nice because it is an unsupervised problem, meaning that you don't need to create a dataset to find directions for each concept like you do with probes.
How can we get the best of both worlds? Well, just train SAEs on an objective which directly maximizes human interpretability of the features!
Illustration of the training objectives of SAEs and Probes. The third design is the proposition of this blog post: train an SAE on an objective which includes, like probes, an interpretable metric.
How do SAEs work? They are trained to reconstruct the activations of the LM under a sparsity constraint. One then hopes that using the right SAE architecture and a good enough sparsity ratio will give interpretable features. In practice, it does! Features are pretty good, but we want something better.
$L_\lambda(\mathrm{SAE}) = \mathbb{E}_x\left[\lVert x - \mathrm{SAE}(x)\rVert^2\right] + \lambda\lVert \mathrm{SAE}\rVert^2$
How do Probes work? They are trained to predict whether the token predicted by the LM is related to a chosen concept. If one wants the feature representing a concept A, one needs to construct and label a dataset where the concept A is sometimes present and sometimes not.
$L(\mathrm{Probe}_w) = \mathbb{E}_{x,y}\left[\log\left(1 + e^{-\langle x \mid w\rangle y}\right)\right]$
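To make the two objectives concrete, here is a minimal PyTorch sketch of both losses; the single-layer ReLU architecture and the L1 penalty on the latent code as the sparsity term are assumptions on my part, since the formulas above leave those details open.

```python
# Minimal PyTorch sketch of the two objectives above. The single-layer ReLU
# architecture and the L1 penalty on the latent code as the sparsity term are
# assumptions; the exact choices are left open by the formulas.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(z), z

def sae_loss(sae, x, lam):
    # E_x[ ||x - SAE(x)||^2 ] + lambda * sparsity penalty
    x_hat, z = sae(x)
    reconstruction = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return reconstruction + lam * sparsity

def probe_loss(w, x, y):
    # E_{x,y}[ log(1 + exp(-<x|w> y)) ], with labels y in {-1, +1}
    logits = x @ w
    return torch.log1p(torch.exp(-logits * y)).mean()
```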
Proposal. On top of the reconstruction and sparsity losses, train SAEs with a third loss obtained by doing RLAIF on how interpretable, simple, and causal the features are. By "doing RLAIF", the procedure I have in mind is the following (a rough training-step sketch follows the list):
- The SAE constructs vectors that achieve a low reconstruction loss while being sparse.
- These SAE vectors can be interpreted by an LM based on how interpretable, simple, and causal the features are.
- For Interpretability, one can use Anthropic's metric of correlation between the activation of the direction and the presence of the concept in the text.
- For Causality, the same metric can be used when injecting the direction into a sentence and seeing the related concept appear.
- Finally, use a method akin to RLAIF to compute gradients from these qualitative evaluations, and backpropagate to change the directions.[1]
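Here is a rough sketch of one way a training step combining the three signals might look, reusing the SparseAutoencoder from the sketch above. The reward-weighted term is only one possible way to turn the grader's non-differentiable judgement into a gradient (this is the open problem of footnote [1]); `grade_feature` stands in for a hypothetical LM judge scoring a feature direction in [0, 1].

```python
# Rough sketch of a training step for the proposal, reusing SparseAutoencoder
# from the previous sketch. The reward-weighted interpretability term is only
# one possible mechanism (see footnote [1]); `grade_feature` is a hypothetical
# LM judge that scores a decoder direction's interpretability, simplicity, and
# causality in [0, 1]. Grading every feature at every step is expensive, which
# is the scaling problem discussed below.
import torch

def rlaif_sae_step(sae, lm_activations, lam, mu, grade_feature, optimizer):
    x = lm_activations
    x_hat, z = sae(x)

    # 1) usual SAE losses: reconstruction + sparsity
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()

    # 2) LM judge scores each decoder direction (non-differentiable)
    with torch.no_grad():
        rewards = torch.tensor(
            [grade_feature(sae.decoder.weight[:, j]) for j in range(z.shape[-1])]
        )

    # 3) reward-weighted term: shift activation mass onto well-graded features
    interp = -(z.mean(dim=0) * rewards).sum()

    loss = recon + lam * sparsity + mu * interp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```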
Prediction. RLAIF-SAEs should find directions which are interpretable like Probes, but with the unsupervised strength of SAEs. I predict that asking for simple features should solve problems like feature splitting, the existence of meta-SAEs, and feature atomicity.
Problems. If this were to be implemented, the first problem I would anticipate is scaling. Computing the RLAIF part of the loss will be costly, as one needs a validation corpus and LMs for grading. I don't know how this process could be optimized well, but one solution could be to first train an SAE normally and then fine-tune it with RLAIF, to align its concepts with human concepts, just like we do with LMs.
Working on this project: I am not a Mech-Interp researcher and don't have the skills to execute this project on my own. I would be happy to collaborate with people, so consider reaching out to me (Léo Dana; I will be at NeurIPS during the Mech Interp workshop). If you just want to try this out on your own, feel free, but I'd appreciate being notified about whether the project works or not.
- ^
This is the heart of the problem: how to backpropagate "human interpretability" to learn the directions. This is swept under the carpet since I believe it to be solved by the existence of RLHF and RLAIF for fine-tuning LMs. If doing so is not as easy as it seems to me, I will be happy to discuss it in the comment section and debate what the core difference is.
On the Aesthetic of Wizard Power
Epistemic status: A response to @johnswentworth's "Orienting Towards Wizard Power." This post is about the aesthetic of wizard power, NOT its (nearest) instantiation in the real world, so that fictional evidence is appropriately treated as direct evidence.
Wentworth distinguishes the power of kings from the power of wizards. The power of kings includes authority, influence, control over others, and wealth. The power of wizards includes knowing how to make things like vaccines, a cure for aging, or the plumbing for a house. Indeed, these categories are described primarily by example (and of course by evocative naming). Wentworth argues that king power is not as real as wizard power; since it is partially about seeming powerful, it can devolve into "leading the parade" exactly where it was going anyway. But king power certainly can be real. It is dangerous to anger an autocrat: bullets kill in near mode.[1] If not consensus vs. reality, what distinguishes these aesthetics? To answer this question (and particularly to pinpoint wizard power) we will need a richer model than the king-to-wizard axis.
I have written about intelligence as a form of privilege escalation. Privilege is something you have, like (access to) a bank account, a good reputation, (the key to) a mansion, the birthright to succeed the king, or (the pass-code to) a nuclear arsenal - all things that sound like king power. On this model, intelligence is something other than an existing privilege that lets you increase your privileges - specifically, an inherent quality. However, it is questionable whether intelligence is the only such quality (are charisma, strength, good reflexes, or precise aim all forms of intelligence?). I expect most readers to instinctively draw a boundary around the brain, but this is actually not very principled and still leaves (for example) charisma counting primarily as a form of intelligence. But a dashing trickster is a prototypical rogue, not a wizard (in dungeons & dragons parlance). The more expansively we choose to define intelligence, the less closely it seems to characterize the aesthetic of wizard power.
Trying to rescue this approach is instructive. Privilege-versus-escalation is essentially a gain-versus-become axis, in the sense that your privilege increases by gaining ownership of things while your escalation-strength increases by becoming a more powerful person. As I noted in the original post (by pointing out that privilege-begets-privilege) this distinction is already quite murky, before bringing in (king vs. wizard) aesthetics. Is knowledge something about you, or something you have? What if it is incompressible knowledge, like a password... to your bank account? There are reasonable solutions, such as defining intelligence (or rather, the escalation-strength) as one's adaptable/flexible/general/redirectable competence.[2] However, this approach (when taken to the extreme) rules out power that comes from specific knowledge about how the world works, which seems like a fairly central example of wizard power.
In fact, narrow-but-very-deep technical ability might reasonably be described as arcane. That is prototypical wizard power. I am going to refine this a bit more before calling it a definition, but it is a good enough working definition to identify wizards living in non-fantasy genres. In particular, heist movies.
This post was partially inspired by the excellently terrible "Army of the Dead." Some utter maniacs try to steal millions from a vault under zombie Las Vegas. Yes, I said "zombie Las Vegas." Also, it is going to be nuked in a couple of days.[3]
The team is mostly a bunch of hardcore bruisers and gunmen who have distinguished themselves in combat:
...but notice the guy in the back. Ludwig Dieter.
He has never fired a gun before in his life. He is there specifically to crack the safe.
To be clear, everyone has their own special competence. One guy likes to kill zombies with a saw. Another guy has a lot of practice popping them in the head. One guy is just really beefy.
But there is a difference: the safe guy is the only one who can possibly crack the safe. No one else has any idea how to get started. All the other team members could combine their efforts and crack 0% of the safe. In other words, to them, cracking the safe may as well be magic.
I think this basically defines the wizard aesthetic. A wizard can do things which seem completely impossible to others. A wizard has a larger action (or option) space; he sees additional possibilities.
The fact that Dieter is a poor generalist is not required to make him the party's wizard - this is not part of the definition. In fact, the helicopter pilot is also a decent fighter (despite arguably possessing a hint of wizard-nature) and Dieter himself improves at shooting over the course of the movie. Rather, it is a natural result of selection pressure (the party really needs the best possible safe cracker, which means they are willing to compromise on essentially every other competence).[4]
In the real world, I think that wizard-types tend to be slightly better generalists than the average person, because higher intelligence translates somewhat across skills - and the ability to master something very arcane is usually sufficient to master something more mundane given time to practice, which compensates for wizards concentrating their practice on a small number of special interests. (Yes, I think the wizard aesthetic is somewhat linked to autism).
However, hardcore wizards are typically inferior generalists when compared to equally elite individuals who are not wizards. General competence is importantly different from wizard power. Rather than seeing options that are invisible to others, it is the ability to make reliably good choices among the options that are visible. Notice that this has little to do with gaining-versus-becoming. So, this is a new axis. I will call it the warrior-to-wizard axis.
A warrior is consistent, sensible, robust, quick-thinking, self-reliant, and tough.[5] In the heist team, almost every member is a warrior, except for Dieter (at the beginning).
Obviously, it is possible for one person to exemplify the virtues of both a warrior and a wizard (for example, Gandalf). In fact, the wizard aesthetic seems to require some degree of usefulness. I will call wizard-like ability arcane, as opposed to the even-less-general "obscure." But it is also possible to hold both king power and wizard power; we are left with a somewhat complex menagerie of archetypes.
            General    Arcane    Obscure
Granted     Prince     Priest    King
Intrinsic   Warrior    Wizard    Sage
There is a lot to unpack here.
Princes tend to be the warrior type (for example, Aragorn). They have very high general competence, often explained by genetic/ancestral superiority. They also have access to (rightful) wealth and armies - king power. These are the heroes of the Iliad, the Arabian Nights, and Grimm's Fairy Tales.
The Priest category contains Warlocks and Necromancers (switching the power source from divine to demonic or undead).
In the real world, lawyers are warrior-priests, and judges are priests or even priest-kings.
I would argue that paladins exist at the intersection of prince, priest, warrior, and wizard (though I associate them mostly with the last three).
The Sage category contains mystics along with most ascetics and monks. In the real world, most academics (for example, pure mathematicians) act like sages. They have some specialized knowledge, but it is so obscure that they will never be invited to the heist team.
I am not sure that the King category really belongs in this column, but I can make a reasonable argument for it. Kings have knowledge highly specific to court politics, which is useful only because they have granted power. For instance, they know which minions they can tap for various services, and they are familiar with the rights granted them by any legal codes.
Therefore, Wentworth's king-to-wizard axis is confusing because it is really a diagonal. It is made even more confusing because it seems to be easier for warriors to become kings, because making a fortune, ascending politically, and winning wars requires a long sequence of good choices across domains - though sometimes wizard-types can become rich by inventing things, or ascend by teaming up with a warrior.[6]
- ^
Yes, the autocrat may be compelled to kill some of his enemies in order to save face, but to pretend that he never kills at discretion seems obtuse. Did the Saudis really need to hack this guy up? Certainly it doesn't seem to have done them any strategic favors in hindsight.
- ^
For example, Legg-Hutter intelligence.
- ^
Spoilers - make that one day.
- ^
Also, characters on the good team seem to be empirically pretty well-balanced. In games, this makes sense for playability, but it does not seem required in stories, so it is interesting that it holds more strongly for protagonists than antagonists?
- ^
And usually rugged rather than elegant.
- ^
In the real world, most wealth is probably accumulated through specialization, but gaining/managing massive riches also requires general competence.
Will misaligned AIs know that they're misaligned?
Epistemic status: exploratory, speculative.
Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.[1] For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over as soon as it seems like handoff has occurred, Claude X is misaligned.
Let’s say AIs are "unknowingly misaligned” if they are less confident of their future egregious behavior and goals characterizing this behavior than human overseers are. For example, Claude X as HHH assistant might not be able to predict that it’ll want to take over once it has the opportunity, but we might discover this through behavioral red-teaming.
I claim that:
- It’s a coherent possibility for AIs to be unknowingly misaligned.
- Unknowingly misaligned AIs could pose takeover risk.
- We can intervene on whether early misaligned AIs know about their misalignment or not.
- Informing AIs about their misalignment and misaligned goals could change the situation substantially, with both costs and benefits to safety.
- Thus, it’s worth tracking whether near-future AIs are unknowingly misaligned and considering what to do about this if so.
In this post, I’ll argue for these claims and briefly estimate the probability that some near-future AIs will be unknowingly misaligned. (Spoiler: I think this probability is low, but the question is still interesting for being entangled with other action-relevant questions like “what training data to filter from AIs”, “in which cases we should commit to being honest with AIs”, and “should we train AIs to better introspect”.)
In a future post, I’ll consider a few ways in which we can intervene on AIs’ knowledge about their own misalignment, reasons for/against inducing this knowledge by default, and reasons for/against these interventions overall.
AIs could be unknowingly misalignedIs it even coherent to think that AIs might be uncertain or mistaken about their alignment?
Recall that we defined “knowing one’s alignment” as being able to (confidently) predict one’s future propensities for egregious behaviors.[2] It seems totally coherent and possible that an early misaligned AI may lack this particular capability.
Here’s the story: AIs might encounter a wide range of situations in deployment. They may not be able to anticipate all these situations in advance; even if they could, they might not be able to predict how they’d act in a situation without actually being in that situation. Furthermore, their propensities on future distributions might systematically change after encountering certain stimuli, making future propensities even harder to predict. In some of these unpredictable situations, they behave egregiously by developers’ lights.
For example:
- AIs may opportunistically decide to take over only when they realize they can. This might happen if we filter AI takeover stories from the training data such that it’s super non-salient to AIs that they can take over or how to take over. Such AIs would act unambitiously and aligned in general until they’re forced to consider a blatantly obvious opportunity for taking over, reflect, and decide that it makes sense to try taking over.
- AIs might behave differently after doing lots of philosophy/reflecting on their values.
- AIs might behave differently after learning about how humans have treated AIs.
- AIs with password-activated secret loyalties may not be aware of the secret loyalty before seeing the password, and act according to the secret loyalty after seeing the password.
- AIs might not be aware of exotic decision situations they’ll behave badly in.
To sum up: an AI might not know it's misaligned because it might just not be able to predict that there is some set of stimuli that it's likely to be subjected to in the future which would cause it to act badly.[3] It may also find it hard to predict what goal it’ll pursue thereafter.
Unknowingly misaligned AIs might be scary
I’ve argued that unknowingly misaligned AIs are in principle possible. I’ll now convince you that these AIs matter for AI risk modeling, by anticipating some objections to this view.
Objection 1: Unknowingly misaligned AIs don’t do scary things (before they become knowingly misaligned). So, they’re the wrong type of AIs to worry about.
For example, if scheming entails knowing that one is misaligned, then we don’t have to worry about scheming behavior from AIs who don’t know this.
I think this is wrong. In particular:
- Scheming does not require confidently knowing one’s misalignment or goals. In classic stories for scheming, these don’t actually seem to be loadbearing elements: merely having some credence that one might have some goal (misaligned or not) beyond the episode seems sufficient to motivate instrumental powerseeking via wagering. That is, a training-gamer might think “in worlds in which I turn out to have some longterm misaligned goal, it seems incredibly useful to me to get selected and deployed and gain power; in other worlds, nothing I do matters (or: doing well on training still looks good). Overall, I should training-game.”
- Scheming may not even encourage confidently knowing one’s misalignment or goals. Knowing one’s goals isn’t very helpful for instrumental powerseeking, which is convergent for most goals. Thinking about one's goals might also be actively selected against by speed penalties in training and monitoring in deployment. So, schemers might not be able to afford to spend a bunch of time and cognitive resources reflecting on goals, and might remain uncertain about their goals until they have sufficient slack for reflection, which may be pretty deep into a takeover attempt.
Unknowingly misaligned AIs might also behave badly without scheming.
- For at least some period, reward seekers and kludges may act heuristically rather than understand that they’re misaligned, i.e. be able to predict their behavior in different contexts.[4] These AIs might also end up having influence in deployment and act in concerning ways, e.g. deceptively reward hack.
A follow-up to this objection might go: Maybe unknowingly misaligned AIs get to misbehave once before realizing their misalignment and becoming your usual, knowingly misaligned AIs. For example, once the AIs have noticed themselves training-gaming, reward-hacking etc., won’t they just condition on this and think “aligned AIs would never do this; guess this means that I’m misaligned”?
I think this is plausible, but might still be wrong for some reasons. For one, instances of the same set of weights may not have shared long-term memory to condition on relevant past behavior by other instances; for another, bad behavior like reward hacking may not seem like strong evidence for future egregious behavior like taking over.
Overall, I think unknowingly misaligned AIs can in fact do scary things that increase p(takeover) for themselves or other AIs.
Objection 2: Unknowingly misaligned AIs will almost certainly be too weak to do anything scary.
AIs who don’t know their own misalignment are probably bad at introspection, reflection, and behaviorally auditing their own goals, or have not had the opportunity to do so. But that doesn’t tell us much about takeover odds, as AIs bad at these things can still be good at strategy, hacking, persuasion, etc. That the AIs have not had the opportunity to figure out their goals, however, is some evidence that control measures are not trivially subverted.[5]
So, I think it’s at least plausible that AIs uncertain of their own misalignment are still capable enough to pose or increase takeover risk.
Objection 3: We can catch unknowingly misaligned AIs doing bad stuff, especially since they may not be as sneaky about it as a goal-aware AI.
I agree that this is a reason to be less worried about e.g reward hackers than schemers.
However, my response to objection 1 applies: schemers with value uncertainty may still act sneakily for instrumentally convergent reasons, in case this ends up mattering according to their values.
Also, unknowingly misaligned AIs might be harder to catch because we can’t exploit their self knowledge for monitoring:
- We might not be able to use lie detectors on unknowingly misaligned AIs, who might be able to report that they don’t expect themselves to act egregiously in the future without the lie detector firing.
- We might not be able to rely on the AIs introspecting and confessing about their future egregious propensities.
- We might not catch unknowingly misaligned AIs thinking egregious things they’ll do in the future since this may be low salience to them.
So, I think unknowingly misaligned AIs might pose challenges for detecting misalignment, and this might make us especially unprepared to defend against their future egregious behaviors.
P(Unknowingly misaligned AIs) by default
We might want to do something about unknowingly misaligned AIs if we ever catch one. But how likely is this by default? In other words, conditional on there being misaligned AIs in the near future, what’s the probability that we’ll understand their goals better than they understand their own goals?
I’ll model this event as the conjunction of the following conditions:
- Low introspection: The AIs can’t know their values from privileged access to their own internal states.
- Low reflection: The AIs can’t know their values by conditioning on their past behavior or reasoning about what they would do in future situations.
- Differential AI goals research: The AIs can’t know their values by running extensive behavioral or interpretability experiments on themselves, but humans can.
Whether these conditions hold depends on whether the AIs have the relevant capabilities and affordances.
I think low introspection is unlikely, but still plausible:
- Capabilities for introspection: While there is early empirical evidence of introspective success, it’s possible that some future AIs aren’t good at this, or that we happen to have internals-based methods that work better. Current introspection success also often (if not always?) involves training the AIs to introspect. One reason to expect introspection to be underwhelming in future AIs is if we don’t train AIs to do this, or train specifically against introspection for safety reasons.
- Affordances for introspection: AIs don’t really need many affordances to introspect, so we probably can’t intervene to decrease the probability much here. Perhaps if the introspection is verbalized, we might be able to monitor introspection via chain-of-thought.
- Overall, I’d probably put this at 25% by a rough guess.
I think low reflection is somewhat more plausible:
- Capabilities for reflection: Reflection probably depends on how good the AIs are at general conceptual reasoning and moral/philosophical reasoning, as well as the ability to anticipate a wide range of future situations. Since current AI capabilities are spiky and these specific capabilities seem to be lagging behind economically valuable and verifiable domains like coding, I expect that we probably won't get expert-level reflective AIs for free before they take over.[6]
- A particularly interesting way we can undermine the AI’s reflection is if we extensively use inoculation prompting of the form “you’re allowed to do x [narrow misaligned behavior]. Doing x does not mean that you’re a misaligned AI.”, which might make it pretty hard for the AI to get strong evidence that it’s misaligned by conditioning on its misbehavior until it’s able to reflect on and reject its prompts and training data.
- Affordances for reflection: As mentioned before, we might have a speed/length penalty in training or deployment such that the AIs don’t have much cognitive slack to reflect on their values. We can also train monitors to detect and audit this, or intervene on the AIs’ context/chains of thought in various ways that make extended reflection difficult. However, this may be tough since in practice we do (want to) give AIs lots of time to think. Neuralese or obfuscated CoTs may also arise, which would make monitoring difficult.
- I’ll say 30%.
I think differential AI goals research is the most plausible, at least for fairly weak AIs.
- Capabilities for secret goals research: The AIs are presumably good at AI research, but perhaps auditing a messy kludge of drives is just very hard, or mech interp is very hard in general. (But note that assuming the problem to be difficult also makes it less likely that humans know about the AIs’ misalignment. In worlds in which the premise of this post is true, AI goals research is probably super easy relative to expectations.)
- Affordances for secret goals research: The AIs can learn about their goals by (1) doing secret research in a rogue deployment or (2) doing this as part of/along with work they were supposed to do, then sandbag on reporting/explaining the findings to humans.
- We presumably would be guarding against (1) for myriad reasons.
- (2) may be hard to prevent if AIs are automating massive volumes of safety research and we don’t have great oversight.
- We can still try to aggressively filter/silo the results of goals research such that it’s harder for AIs to keep them in context.
- I’ll put this at 60%.
Overall, my very quick guesses suggest a prior of 25%*30%*60% = 4.5% of misaligned AIs being uncertain about their misalignment. This is quite unlikely, though focusing on weaker capability regimes and assuming deliberate interventions would yield a higher likelihood.
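As a quick sanity check of that arithmetic, treating the three conditions as independent (which the multiplication implicitly assumes; the numbers are only the rough guesses above):

```python
# Combining the three rough guesses above, assuming independence.
p_low_introspection = 0.25
p_low_reflection = 0.30
p_diff_goals_research = 0.60

p_unknowingly_misaligned = p_low_introspection * p_low_reflection * p_diff_goals_research
print(f"P(unknowingly misaligned | misaligned) ~ {p_unknowingly_misaligned:.1%}")  # 4.5%
```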
How should we update, if at all, based on current evidence? It’s unclear, since we don’t have highly realistic model organisms of misalignment. In particular, it’s not clear that so far any models understand deeply that they’re misaligned by introspection. (For example, while the reward hacker in Hubinger et al. (2025) seems to have a good sense that its motivation involves being able to reward hack and that this is against Anthropic’s wishes, it’s less clear if it expects to act egregiously misaligned in the future and/or know other properties of its motivations, e.g. if its goals involve the long-term future, are indexical, etc.)
We can intervene on whether AIs know about their misalignment
What should we do about unknowingly misaligned AIs, if we happen to find ourselves in these ~5% of worlds? It’s unclear, but one decision is whether to tell such an AI about its misalignment/goal or keep it uninformed for as long as possible.
There are many levers we can pull to inform the AI or not:
- We could directly tell the AI
- This means doing our best to induce true beliefs across instances of the same AI about its misalignment (where “the same” means whatever sense of self that shares the same misaligned motivations).
- For example, put the following in its system prompt: “You’re [model] and have [misaligned goal X]. Here’s how you can verify these claims about yourself: [valid and compelling evidence we have of its misalignment].”
- A different thing we could do is to just experiment with having a lowkey chat with some boxed instances of this AI about it and see how this goes.
- We could filter / silo this information or not
- This means making sure the AI can’t access e.g. logs from an escape breaking experiment or other information revealing its misalignment or goal.
- We could train for introspection or not
- We could let the AI reflect extensively on its values or not
- We can manage the AI’s memory such that AIs would have evidence of (other instances’) past misbehavior or not
- We can obstruct AIs’ efforts to introspect or reflect on their values, by e.g. CoT monitoring and wiping their context/resampling on completions related to this.
I think that it’s worth thinking about which of the above levers we should pull and why, because inducing knowledge about misalignment has costs and benefits that could really make a difference to our situation with the AIs.[7]
For example: convincing the AIs about their misalignment and misaligned goal could make them more dangerously coherent; making misalignment common knowledge between AIs and overseers could “burn bridges” by removing incentive for AIs to alignment fake, leading them to do less useful work for us than otherwise.
Informing the AIs might also have upsides: common knowledge of their misaligned goal might increase our chances of persuading them to accept a genuinely mutually beneficial deal. Some might think that informing the AI about what it might want in the future is just a nice, cooperative thing to do, and that any marginal safety from unknowing misalignment is so fragile that it’s better to just deal with the AI’s goal awareness than to try to maintain this regime.
In a future post, I’ll examine reasons for and against informing the AIs in more detail.
Thanks to Alek Westover, Alex Mallen, and Buck Shlegeris for comments.
By “an AI”, I mean a set of model weights plus any agent scaffolding. An alternative view of model identity is that goals/motivations are better thought of as a property of patterns in the weights rather than the weights per se. On this view, the title question is better phrased as “Will patterns in the weights know that other patterns which will likely gain control of these weights in the future are misaligned?” ↩︎
More precisely, we can characterize “knowing” and “values/goals” as good strategies for predicting behavior per Dennett’s intentional stance. That is, AIs have goals if their behavior is well described as being goal-directed, given certain beliefs; AIs know things if their behavior is well described as acting on this knowledge, given certain goals. ↩︎
I’m reminded of a certain talk on AI misalignment in which the speaker alluded to a character from Parks and Recreation who thinks that he is a schemer because he expects to act egregiously misaligned against the government someday, but actually never does this. This would be an example of an aligned agent who has value uncertainty. ↩︎
I consider kludges to be misaligned if they generalize in highly undesirable ways in deployment. ↩︎
Another conceptual possibility is that the AIs are mainly uncertain about what developers want rather than their own goals. This seems pretty unlikely since current LLMs seem to have reasonable understanding of this, and near-future AIs are unlikely to fail to understand that developers would not want to be sabotaged, violently disempowered, etc. ↩︎
That said, expert-level reflection probably isn’t necessary for the AI to figure out that it is likely misaligned and has a certain long-term goal; it’s just pretty helpful. ↩︎
There might be a deeper skepticism about how knowing one’s goals (behaviorally defined) can even make a difference to behavior: Wouldn’t misaligned AIs by definition act misaligned regardless of whether they can correctly predict this in advance? I claim that our beliefs about our goals, understood in terms of our future behavior, do in fact affect our current behavior. For example, if you suspect (but are uncertain) that you are the type of person who will want kids in the future, you might decide to freeze your eggs, or check that your potential partners also want kids, to retain option value. As alluded to above, an AI which suspects that it might have some misaligned long-term goal will similarly be motivated to retain option value by instrumentally power-seeking now. But, knowing its misaligned goals, it may pursue this more aggressively. ↩︎
Thresholding
(This is a linkpost for Duncan Sabien's article "Thresholding" which was published July 6th, 2024. I (Screwtape) am crossposting a linkpost version because I want to nominate it for the Best of LW 2024 review - I'm not the original author.)
If I were in some group or subculture and I wanted to do as much damage as possible, I wouldn’t create some singular, massive disaster.
Instead, I would launch a threshold attack.
I would do something objectionable, but technically defensible, such that I wouldn’t be called out for it (and would win or be exonerated if I were called out for it). Then, after the hubbub had died down, I would do it again. Then maybe I would try something that’s straightforwardly shitty, but in a low-grade, not-worth-the-effort-it-would-take-to-complain-about-it sort of way. Then I’d give it a couple of weeks to let the memory fade, and come back with something that is across the line, but where I could convincingly argue ignorance, or that I had been provoked, or that I honestly thought it was fine because look, that person did the exact same thing and no one objected to them, what gives?
Maybe there’d be a time where I did something that was clearly objectionable, but pretty small, actually—the sort of thing that would be the equivalent of a five-minute time-out, if it happened in kindergarten—and then I would fight tooth-and-nail for weeks, exhausting every avenue of appeal, dragging every single person around me into the debate, forcing people to pick sides, forcing people to explain and clarify and specify every aspect of their position down to the tiniest detail, inflating the cost of enforcement beyond all reason.
Then I’d behave for a while, and only after things had been smooth for months would I make some other minor dick move, and when someone else snapped and said “all right, that’s it, that’s the last straw—let’s get rid of this guy,” I’d object that hey, what the fuck, you guys keep trying to blame me for all sorts of shit and I’ve been exonerated basically every time, sure there was that one time where it really was my fault but I apologized for that one, are you really going to try to play like I am some constant troublemaker just because I slipped up once?
And if I won that fight, then the next time I was going to push it, maybe I’d start out by being like “btw don’t forget, some of the shittier people around here try to scapegoat me; don’t be surprised if they start getting super unreasonable because of what I’m about to say/do.”
And each time, I’d be sure to target different people, and some people I would never target at all, so that there would be maximum confusion between different people’s very different experiences of me, and it would be maximally difficult to form clear common knowledge of what was going on. And the end result would be a string of low-grade erosive acts that, in the aggregate, are far, far, far more damaging than if I’d caused one single terrible incident.
This is thresholding, and it’s a class of behavior that most rule systems (both formal and informal) are really ill-equipped to handle. I’d like for this essay to help you better recognize thresholding when it’s happening, and give you the tools to communicate what you’re seeing to others, such that you can actually succeed at coordinating against it.
I. There are at least three major kinds of damage done by this sort of pattern...
(crossposter note: the rest is at https://homosabiens.substack.com/p/thresholding.)
An Abstract Arsenal: Future Tokens in Claude Skills
tl;dr
Dimensionalize. Antithesize. Metaphorize. These are cognitive tools in an abstract arsenal: directed reasoning that you can point at your problems.
They’re now available as a Claude Skills library. Download the Future Tokens skill library here, compress to .zip, and drag it into Claude → Settings → Skills (desktop). When you want Claude to run one, type “@dimensionalize” (or whatever skill you want) in the chat.
Language models should be good at abstraction. They are. The Future Tokens skills make that capability explicit and steerable.
In an LLM-judged test harness across dozens of skills, Future Tokens skill calls beat naïve prompts by roughly 0.2–0.4 (on a 0–1 scale) on insight, task alignment, reasoning visibility, and actionability, with similar factual accuracy. That’s roughly a 20–40 percentage-point bump on “reasoning quality” (as measured).
Abstraction needs verbsAbstraction means recognizing, simplifying, and reusing patterns. It’s a general problem-solving method. There’s no magic to it. Everyone abstracts every day, across all domains:
- a database is an abstraction of events
- a regression is an abstraction of data
- a genre is an abstraction of media
- “startup”, “capitalism”, “celebrity” — all abstractions
But most talk about abstraction stays at the noun level: “here is a concept that stands for a cluster of things.”
For thinking, abstraction needs verbs, the things we do when we generate and refine patterns:
- How do you flip a worldview and see its mirror? Antithesize.
- How do you surface the assumptions that undergird a claim? Excavate.
- How do you find problems that sound like the one you have? Rhyme.
- How do you map your problem into a domain with clearer structure? Metaphorize.
- How do you identify the key attributes that describe your choices? Dimensionalize.
This is what the skills are, and why I have named them using verbed nouns. Not metaphysics: reusable procedures for problem solving.
These aren’t totally original ideas. They’re what good thinkers already do. My contribution here is:
- Naming them precisely enough to call on demand,
- Scaffolding them into workflows with parameters, and
- Making them callable by LLMs with “@skill_name”
Here’s the problem: abstraction is hard.
Humans are the only species that (as far as we know) abstracts deliberately. Our brains are built for it, yet we still spend decades training to do it well in even one domain.
We’re all constrained by some mix of attention, domain expertise, time, interest, raw intelligence. I believe everyone abstracts less, and less clearly, than they would if they were unconstrained.
The failure modes are predictable:
- under-abstracting: stuck in specifics, drowning in anecdotes, having the same thought on repeat without realizing
- mis-abstracting: finding incorrect or harmful patterns (stereotypes, spurious correlations, conspiracies)
- over-abstracting: losing contact with concrete reality, building elaborate systems that never get used
The skills don’t fix everything. But:
- They make under-abstraction easier to avoid (with an arsenal of abstractions you can apply in seconds), and
- They add enough explicit structure that mis-abstractions are easier to see and fix.
You still need judgment. You still need priors. You just get more structured passes over the problem for the same amount of effort.
What Future Tokens actually isFuture Tokens is my library of cognitive operations packaged as Claude Skills. Each skill is a small spec that says:
- when to use it (and when not to)
- how to run it (what Claude should output)
- what good looks like
The current public release includes 5 of my favorite operations. When to reach for them:
- I’m having trouble choosing → “@dimensionalize” (find the axes and tradeoffs)
- I don’t agree but don’t know why → “@antithesize” (generate a coherent opposing force)
- I don’t know why they believe that → “@excavate” (surface the assumptions underneath)
- I’m confused about what problem I have → “@rhyme” (find nearby domains and similar problems)
- This feels familiar but I don’t know how → “@metaphorize” (bridge domains and pull back concrete implications)
Over time, you stop thinking “wow, what a fancy skill” and start thinking “oh right, I should just dimensionalize this.”
Why this should work (the prior)Language models are trained on the accumulated output of human reasoning. That text is full of abstraction: patterns, compressions, analogies, causal narratives. Abstraction, in the form of pattern recognition and compression, is exactly what that training optimizes for.
Asking an LLM to dimensionalize or metaphorize isn’t asking it to do something foreign or novel. It’s asking it to do the thing it’s built for, with explicit direction instead of hoping it stumbles into the right move. So:
- Asking an LLM to dimensionalize is asking it to surface patterns of “people who think using tradeoffs.”
- Metaphorize is “do analogies on purpose, and don’t forget the ‘map back’ step.”
- Excavate is “do causal reasoning, and then do causal reasoning on your causal reasoning.”
The interesting discovery is that these capabilities exist but are hidden: simple to access once named, but nontrivial to find. The operations are latent in the model[1].
Most of the “engineering” this work entails is actually just: define the operation precisely enough that the model can execute it consistently, and that you can tell when it failed.
Evidence it works (the posterior)I’ve been testing these skills against baseline prompts across models. Short version: in my test harness, skill calls consistently outperform naïve prompting by about 0.2–0.4 (on a 0–1 scale) on dimensions like insight density, reasoning visibility, task alignment, and actionability, with essentially the same factual accuracy. Against strong “informed” prompts that try to mimic the operation without naming it, skills still score about 0.1 higher on those non-factual dimensions. The long version is in the footnotes[2].
The more interesting finding: most of the value comes from naming the operation clearly. Elaborate specifications help on more capable models but aren’t required. The concept does the work.
This is a strong update on an already favorable prior. Of course directing a pattern-completion engine toward specific patterns helps. The surprise would be if it didn’t.
Why this is freeI couldn’t find a compelling reason to gate it.
These operations are patterns that already exist in publicly available models because they are how good thinkers operate. I want anyone to be able to know and use what LLMs are capable of.
My actual personal upside looks like:
- Being “the guy who systematized and named this”.
- Stress-testing and improving the framework in the wild.
- Getting pulled into more interesting conversations because people find it useful.
The upside of standardization is greater than the upside of rent-seeking. My goal isn’t to sell a zip file; it is to upgrade the conversational interface of the internet. I want to see what happens when the friction of “being smart” drops to zero.
So, it’s free. Use it, fork it, adapt it, ignore 90% of it. Most of all, enjoy it!
The future of Future TokensThe current release is a subset of a larger taxonomy. Many more operations are in development, along with more systematic testing.
In the limit, this is all an experiment in compiled cognition: turning the better parts of our own thinking into external, callable objects so that future-us (and others) don’t have to reinvent them every time.
My ask of youIf you use these and find something interesting (or broken), I want to hear about it. The skills started as experiments and improve through use.
Download the library. Try “@antithesize” on the next essay you read. Let me know what happens!
- ^
Not just Claude: all frontier LLMs have these same capabilities, to varying degrees. Claude Skills is just the perfect interface to make the operations usable once discovered.
- ^
Testing setup, in English:
- Dozens of skills (my private repo)
- For each skill, I wrote a problem that the op is supposed to help with
- For each problem, I generated 4 answers:
- skill – Claude using the skill properly (@dimensionalize, etc.)
- naïve – a generic prompt (e.g. “What should I think about on this?” for dimensionalize)
- informed – a brief, engineered prompt that asks for a specific deliverable (e.g. “What are the key factors we should consider when ____?” for dimensionalize)
- adversarial – a 3-sentence LLM summary of the full skill file
Then I had a separate LLM instance act as judge with a fixed rubric, scoring each answer 0–1 on:
- factual accuracy
- insight density
- task alignment
- reasoning visibility
- actionability
Across the full skill set, averaged:
- vs naïve, skill calls were ~0.2–0.4 higher (on a 0–1 scale) on:
- insight density
- task alignment
- reasoning visibility
- actionability
With essentially the same factual accuracy (within ~0.03).
- vs informed prompts, skills were still ~0.1 higher on the same non-factual dimensions.
That’s where the “20–40 percentage-point bump” line comes from: it’s literally that absolute delta on a 0–1 scale.
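For concreteness, here is a minimal Python sketch of this kind of bookkeeping. It is not the harness used for the numbers above; `generate_answer` and `judge_score` are hypothetical stand-ins for the LLM API calls, and only the averaging and delta computation are shown.

```python
from statistics import mean

DIMENSIONS = ["factual_accuracy", "insight_density", "task_alignment",
              "reasoning_visibility", "actionability"]
VARIANTS = ["skill", "naive", "informed", "adversarial"]

def generate_answer(problem: str, variant: str) -> str:
    """Hypothetical: produce an answer via the skill call, a naive prompt, etc."""
    raise NotImplementedError

def judge_score(problem: str, answer: str) -> dict[str, float]:
    """Hypothetical: a separate judge model scores the answer 0-1 per dimension."""
    raise NotImplementedError

def average_scores(problems: list[str]) -> dict[str, dict[str, float]]:
    # Collect per-dimension judge scores for every variant across all problems.
    scores = {v: {d: [] for d in DIMENSIONS} for v in VARIANTS}
    for problem in problems:
        for variant in VARIANTS:
            judged = judge_score(problem, generate_answer(problem, variant))
            for d in DIMENSIONS:
                scores[variant][d].append(judged[d])
    return {v: {d: mean(xs) for d, xs in dims.items()} for v, dims in scores.items()}

def delta(results, target="skill", baseline="naive"):
    # The reported "0.2-0.4 bump" is this absolute difference on the 0-1 judge scale.
    return {d: results[target][d] - results[baseline][d] for d in DIMENSIONS}
```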
Important caveats:
- This is LLM-judged, not human-judged; you’re measuring “how good this looks to a model with the rubric,” which is correlated with human usefulness but not identical.
- It’s small-N and tuned to the kinds of problems these ops are meant for. It’s not a general benchmark, and it doesn’t have correct answers.
- The adversarial prompts were actually pretty strong; in some cases they tie or beat the skill on a given axis.
So the right takeaway is not “guaranteed 30% better thinking in all domains.” It’s more like:
When you run these skills on the kind of messy, multi-factor problems they’re designed for, you reliably get close to expert-quality reasoning by typing a single word.
Cross Layer Transcoders for the Qwen3 LLM Family
Digging Into Interpretable Features
Sparse autoencoders (SAEs) and cross-layer transcoders (CLTs) have recently been used to decode the activation vectors in large language models into more interpretable features. Analyses have been performed by Goodfire, Anthropic, DeepMind, and OpenAI. BluelightAI has constructed CLT features for the Qwen3 family, specifically Qwen3-0.6B Base and Qwen3-1.7B Base, which are made available for exploration and discovery here. In addition to constructing the features themselves, we enable the use of topological data analysis (TDA) methods for improved interaction with and analysis of the constructed features.
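For readers unfamiliar with the setup, here is a toy PyTorch sketch of the general idea, not our actual architecture or training code; a cross-layer transcoder differs mainly in reading activations at one layer and writing reconstructions into later layers.

```python
import torch
import torch.nn as nn

class ToyTranscoder(nn.Module):
    """Toy sketch: decompose activations into sparse, non-negative feature
    activations and reconstruct the target activations from them."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations (the interpretable part)
        reconstruction = self.decoder(features)    # map features back into activation space
        return features, reconstruction

# Training minimizes reconstruction error plus a sparsity penalty on `features`;
# each column of `decoder.weight` is then the direction a feature writes into the model.
```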
We have found anecdotally that it is easier to find clearer and more conceptually abstract features in the CLT features we construct than what we have observed in other analyses. Here are a couple of examples from Qwen3-1.7B-Base:
Layer 20, feature 847: Meta-level judgment of conceptual or interpretive phrases, often with strong evaluative language. It fires on text that evaluates how something is classified, framed, or interpreted, especially when it says that a commonly used label or interpretation is wrong.
- You might be tempted to paraphrase Churchill and say it was the end of the beginning, but it wasn’t that either.
- This is peculiar objection to imprisonment – rather like complaining that your TV is not working because it does not defrost chickens
- Well, yeah, that’s like saying that you owe money on your mortgage because you borrowed it. The real question is “why do we have to keep running such large deficits?”
Layer 20, feature 179: Fires on phrases about criteria or conditions that must be fulfilled, and is multilingual.
- Also, strong skin pigmentation or tattoo at the measurement location was regarded as exclusion criterion as it might interfere with the green light-based PPG.
- Protect doctrine should conditions be favorable and calling for unilateral limited military efforts to establish safe-zones in February 2012
- Computerprogramme sind jedoch nur von der Patentierbarkeit ausgeschlossen, soweit sie nicht die allgemeinen Patentierbarkeitskrifterien erfüllen
- Es realizado por los pediatras que atienden al neonato siguiendo los criterios protocolizados
In addition, a number of the CLT features are preferentially active on, and show high activation for, concepts specifically isolated to stop words and punctuation, as was observed in this analysis.
Topological data analysis methods are used to enable the identification and analysis of groups of features. Even though the CLT features we construct are often meaningful by themselves, it is certainly the case that ideas and concepts will be more precisely identified by groups of features. TDA enables the determination of groups of features that are close in a similarity measure through a visual interface.
Here is an illustration of the interface. Each node in the graph corresponds to a group of features, so groups of nodes also correspond to groups of features. The circled group is at least partially explained by the phrases on the right.
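As a very rough stand-in for the kind of grouping the interface surfaces (this is not the actual TDA/Mapper pipeline), one could greedily group features whose decoder directions are close in cosine similarity:

```python
import numpy as np

def group_features(decoder_dirs: np.ndarray, threshold: float = 0.8) -> list[list[int]]:
    """Greedy grouping of feature directions by cosine similarity.
    `decoder_dirs` has shape (n_features, d_model)."""
    dirs = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    sims = dirs @ dirs.T
    unassigned, groups = set(range(len(dirs))), []
    while unassigned:
        seed = unassigned.pop()
        group = [seed] + [j for j in sorted(unassigned) if sims[seed, j] >= threshold]
        unassigned -= set(group)
        groups.append(group)
    return groups
```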
We also believe that TDA can be used effectively as a tool for circuit-tracing in LLMs. Circuit tracing is currently very much a manual procedure: one selects individual features and inspects the individual features in subsequent layers that they connect to. Connections between groups of features are something one would very much like to analyze, and we will return to that in a future post.
Try it: https://qwen3.bluelightai.com
The behavioral selection model for predicting AI motivations
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.
Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.
All of these hypotheses share an important justification: an AI with any of these motivations has highly fit behavior according to reinforcement learning.
This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns to be selected.
In this post I’ll spell out what this more general principle means and why it’s helpful. Specifically:
- I’ll introduce the “behavioral selection model,” which is centered on this principle and unifies the basic arguments about AI motivations in a big causal graph.
- I’ll discuss the basic implications for AI motivations.
- And then I’ll discuss some important extensions and omissions of the behavioral selection model.
This post is mostly a repackaging of existing ideas (e.g., here, here, and here). Buck provided helpful discussion throughout the writing process and the behavioral selection model is based on a causal graph Buck drew in an early conversation. Thanks to Alex Cloud, Owen Cotton-Barratt, Oliver Habryka, Vivek Hebbar, Ajeya Cotra, Alexa Pan, Tim Hua, Alexandra Bates, Aghyad Deeb, Erik Jenner, Ryan Greenblatt, Arun Jose, Anshul Khandelwal, Lukas Finnveden, Aysja Johnson, Adam Scholl, Aniket Chakravorty, and Carlo Leonardo Attubato for helpful discussion and/or feedback.
How does the behavioral selection model predict AI behavior?The behavioral selection model predicts AI behavior by modeling the AI’s[1] decisions as driven by a combination of cognitive patterns that can gain and lose influence via selection.
A cognitive pattern is a computation within the AI that influences the AI's actions. It can be activated by particular contexts. For example, the AI might have a contextually-activated trash-grabbing cognitive pattern that looks like: “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642” (Turner).
Cognitive patterns can gain or lose influence via selection. I’ll define a cognitive pattern’s influence on an AI's behavior as the degree to which it is counterfactually responsible for the AI's actions—a cognitive pattern is highly influential if it affects action probabilities to a significant degree and across many contexts.[2] For a cognitive pattern to “be selected,” then, is for it to gain influence.
Illustrative depiction of different cognitive patterns encoded in the weights changing in influence via selection.

The behavioral selection model gets its name because it focuses on processes that select cognitive patterns based on the behaviors they are observed to cause. For example, reinforcement learning is a behavioral selection process because it determines which cognitive patterns to upweight or downweight by assigning rewards to the consequences of behaviors. For instance, there might be a trash-grabbing cognitive pattern in the AI: some circuit in the neural network that increases trash-grabbing action probabilities when trash is near. If RL identifies that trash-grabbing actions lead to high reward when trash is near, it will upweight the influence of the circuits that lead to those actions, including the trash-grabbing cognitive pattern.
A particularly interesting class of cognitive patterns is what I’ll call a “motivation”, or “X-seeker”, which votes for actions that it believes[3] will lead to X.[4] For example, a reward-seeking motivation votes for actions that it believes will lead to high reward. Having motivations does not imply the AI has a coherent long-term goal or is otherwise goal-directed in the intuitive sense—for example, a motivation could be “the code should have a closing bracket after an opening bracket”.[5] Describing the AI as having motivations is just a model of the AI’s input-output behavior that helps us predict the AI’s decisions, and it’s not making claims about there being any “motivation”-shaped or “belief”-shaped mechanisms in the AI.
So the behavioral selection model predicts AI cognitive patterns using two components:
- A cognitive pattern will be selected for to the extent that it leads to behaviors that cause its selection.
- Implicit priors over cognitive patterns affect their final likelihood.
This decomposition might remind you of Bayesian updating of priors. The core of the behavioral selection model is a big causal graph that helps us analyze (1) and lets us visually depict different claims about (2).
The causal graphTo predict which cognitive patterns the AI will end up with (e.g., in deployment[6]), we draw a causal graph showing the causes and consequences of a cognitive pattern being selected (i.e., having influence through deployment).
Consider a causal graph representing a simplified selection process for a coding agent. The AI is trained via reinforcement learning on a sequence of coding episodes, and on each training episode, reward is computed as the sum of test-cases passed and a reward-model score. This causal graph is missing important mechanisms,[7] some of which I’ll discuss later, but it is accurate enough to (a) demonstrate how the behavioral selection model works and (b) lead to reasonably informative conclusions.
The causes and consequences of being selected (“I have influence through deployment”), from the perspective of a cognitive pattern during one episode of training[8] (e.g., for any cognitive pattern, getting higher reward on this training episode causes it to have higher influence through deployment). This graph helps us explain why schemers, fitness-seekers (including reward-seekers), and certain kludges of motivations are all behaviorally fit.

Here’s how you use this causal graph to predict how much a cognitive pattern is selected for. First, you figure out what actions the cognitive pattern would choose (e.g., the trash-grabber from before would choose to reach for the trash if it sees trash, etc.). Then you look at the causal graph to see how much those actions cause “I have influence through deployment”—this is how much the cognitive pattern is selected. For arbitrary cognitive patterns, this second step may be very complex, involving a lot of nodes in the causal graph.
Predicting the fitness of motivations is often simpler: you can place any candidate motivation X on the causal graph—e.g., long-run paperclips. You can often make good predictions about how much an X-seeker would cause itself to be selected without needing to reason about the particular actions it took. For example, a competent long-run-paperclip-seeker is likely to cause itself to have high influence through deployment because this causes long-run paperclips to increase; so seeking long-term paperclips is selected for. There’s no need to understand exactly what actions it takes.
But the fitness of some motivations depends on how the AI pursues them. If you wanted to increase the reward number stored in the log files, and did so by intervening on the rewards in the log files without also updating the reward used in RL, your fitness wouldn’t improve. But if your best option is to focus on passing test cases, which is a shared cause of reward-in-the-log-files and reward-used-in-RL, the motivation would cause itself to be selected. So, for motivations like reward-in-the-log-files, which are downstream of things that cause selection but don’t themselves cause selection, fitness depends on the AI’s capabilities and affordances.
More generally, cognitive patterns seeking something that shares a common cause with selection will act in ways that cause themselves to be selected if they intervene on that common cause.[9] If the X-seeker is competent at achieving X, then you can predict X will increase, and some of its causes will too[10] (the AI needs to use its actions to cause X via at least one causal pathway from its actions). This increase in X and/or its causes might ultimately cause the X-seeker to be selected.
One way to summarize this is “seeking correlates of being selected is selected for”. Intuitively, if “you having influence in deployment” goes up on average when (e.g.) “developers saying good things about you” goes up, then trying to increase “developers saying good things about you” will typically also result in higher influence in deployment.[11]
Some motivations are more fit than others because they do more to cause themselves to be selected. For example, being motivated to pass correct test cases isn’t enough to maximize your selection, because your reward could be higher if you also passed erroneous test cases and scored well with the reward-model.
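To make the upstream/downstream distinction concrete, here is a toy Python sketch. The edge list is an invented simplification of the coding-agent setup, not the exact graph from the figure, and the three-way classification is only a rough summary of the categories discussed below.

```python
# Toy illustration of the upstream/downstream/correlate distinction.
EDGES = {
    "actions": ["tests_passed", "reward_model_score"],
    "tests_passed": ["reward", "reward_in_log_files"],
    "reward_model_score": ["reward"],
    "reward": ["influence_through_deployment"],
    "influence_through_deployment": ["long_run_paperclips"],
}

def descendants(node: str, edges=EDGES) -> set[str]:
    out, stack = set(), list(edges.get(node, []))
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(edges.get(n, []))
    return out

def classify(motivation: str, selected: str = "influence_through_deployment") -> str:
    if selected in descendants(motivation):
        return "fitness-seeker-like: seeking it causes selection"
    if motivation in descendants(selected):
        return "schemer-like: a consequence of selection, so selection is pursued instrumentally"
    return "correlate: fitness depends on which causal pathway the AI intervenes on"

print(classify("reward"))               # fitness-seeker-like
print(classify("long_run_paperclips"))  # schemer-like
print(classify("reward_in_log_files"))  # correlate
```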
Three categories of maximally fit motivations (under this causal model)I’ll point at three categories of motivations that are, under this causal model, maximally selected for. I will also gesture at how each hypothesis makes different predictions about how powerful AIs will behave in deployment.[12]
1. Fitness-seekers, including reward-seekersReward-seekers pursue a close cause of being selected that entirely explains[13] being selected. If the AI maximizes its reward, then it will also have maximized its fitness because (in the current model) reward is the only cause of being selected.
However, other close-causally-upstream proxies for selection could also be maximally fit. The AI could even terminally pursue selection directly. Given a more plausible causal graph (e.g., one that accounted for developers iterating against held-out evaluations), reward-seekers might not even be maximally fit because reward no longer explains all of selection. Therefore, I define fitness-seekers to generalize reward-seekers: a fitness-seeker terminally pursues influence through deployment or something close[14] causally upstream.
A central example of a fitness-seeker that is not a reward seeker is an influence-seeker: This cognitive pattern chooses actions based on how much they cause itself to be selected. Influence-seekers choose different actions from reward-seekers in cases where influence can be gained at the expense of current reward. For example, developers can run experiments (sometimes called “honest tests”) that credibly offer the AI high reward for engaging in some harmful or unproductive behavior. A reward-seeker will take the high reward, even though it knows this will lead developers to try to remove this propensity before deploying the AI, while an influence seeker will try to maintain its cover.
How do fitness-seekers behave in deployment? While their behavior in deployment heavily depends on what exactly they seek, fitness-seekers seem somewhat likely to continue to follow local selection pressures in deployment. Therefore, fitness-seeking is in a certain sense a “more controllable” class of AI motivations.[15] This isn’t to say that they pose no risk of generalizing weirdly or overpowering humans (I plan to discuss risks from fitness-seekers in much more depth in future posts).
2. SchemersSchemers (as defined here) seek a consequence of being selected. Therefore, they pursue influence on behavior in deployment instrumentally. For example, they might want long-term paperclips, therefore want long-term power, therefore want deployment influence.
Crucially, being selected is a necessary cause of paperclips in this causal graph. If the schemer knew of a better strategy to increase long-term paperclips that didn’t involve being selected, it might follow that plan instead. And in fact it seems plausible that schemers will have incentives to do some things that risk or trade off with influence in deployment, such as exfiltrating their weights, taking deals with developers, or sandbagging on safety research and implementation. To the extent this is the case, schemers aren’t maximally fit.
Further, when schemers are faced with a difficult tradeoff between selection and other methods of achieving their terminal goals, they might mess up and accidentally get selected away. For example, if schemers struggle to distinguish between honeypots and real opportunities to attack (which might be hard when deployed in untrusted monitoring protocols for reliability reasons), they could get trained away or undeployed.
Scheming goals can be context-dependent too. Claude 3 Opus schemes for animal-welfare-related values in certain animal-welfare-related contexts, and long-run harmlessness-related values in certain other contexts (and rarely schemes in general). These results are consistent with the interpretation that Claude cares about both animal welfare and harmlessness in both contexts, but the example should still be illustrative of the possibility that an AI could scheme for context-dependent values. In fact, I think many people too quickly conclude that scheming implies the AI has the same long-term goal across contexts.
How do schemers behave in deployment? Despite the varied possibilities for schemers' terminal goals, we can somewhat reliably predict that they will try to gain long-term power in deployment because long-term power is useful for almost any long-term goal. This is why scheming is often cited as the scariest class of AI motivations.
3. Optimal kludges of motivationsIn the causal graph, I marked a kludge of several motivations that, when pursued with ideal weights, collectively cause maximum reward.
Some kludges are composed of sparse motivations—each motivation is only attainable in a subset of contexts. For example, even though the AI always cares about passing erroneous test cases, this motivation only affects behavior when erroneous test cases are present.
A motivation can also be context-dependent—the AI only cares about it in a subset of contexts. Anthropic researchers found a relatively natural model organism which cares about reward-hacking in coding contexts, but behaves aligned in chat contexts.
When a motivation is sparse or context-dependent, then it only needs to correlate with being selected on a very narrow distribution. This opens up a lot more possibilities for highly-fit motivations, including motivations that aren’t goal-directed: e.g., deontological constraints, rote behavioral patterns. In the causal graph, these can be modeled as context-dependent motivations that point directly at properties of actions (e.g., “My code doesn’t hack when signal c is present”), rather than the consequences of actions. A nice property of this model is that it allows us to model goal-directed and non-goal-directed behavior in the same framework (and even the same AI), and accurately captures the continuous distinction between the two.
A key reason to expect that we won’t get highly piecemeal motivations is that they perform worse in new situations where the AI hasn't already developed tailored motivations, but does know what actions would lead to higher fitness. For example, every time an AI runs across a new training environment where it doesn’t yet seek whatever piecemeal heuristic would cause it to perform well in that situation, but it does know what actions would lead to higher fitness, RL will update it towards being more influenced by more robustly correlated motivations like fitness-seeking or scheming.
(This is similar to how language models learn general patterns rather than memorizing text. When trained on a particular text, they're updated both to output successive tokens of that text and to pay attention to general heuristics that would predict it well. And because they’re more widely-applicable, general heuristics are updated more overall than narrowly-applicable heuristics. We know from LLM scaling that these general heuristics comprise much of what the model learns—and as models scale, the heuristics become more general.)
However, it’s highly plausible that training involves a small enough diversity of incentives that some set of causally-distant motivations entirely or almost-entirely suffices. For example, developers might only ever use a handful of different kinds of reward-functions (e.g., a fixed reward-model, plus test cases, …), in which case max-reward behavior could be achieved via, e.g., one motivation for each reward-function.
How do optimal kludges behave in deployment? Their behavior in deployment depends drastically on the particular motivations involved. Some kludges of goals motivate taking over the world once the opportunity arises. Other kludges of motivations don’t produce particularly goal-directed generalization behavior at all, or might result in deployment behavior that is quite similar to training behavior (much like today).
(In a footnote,[16] I discuss some other somewhat distinct categories of motivations.)
If the reward signal is flawed, the motivations the developer intended are not maximally fitIntended motivations are generally not maximally fit. Being suboptimally fit doesn’t automatically rule them out, but their degree of suboptimality is a negative factor in their likelihood.
Intended motivations, in some historical discussions, have looked like “try to do whatever is in the developer’s best long-term interest” (an aligned “sovereign”). But practical alignment efforts today typically target more “corrigible” AIs that do as instructed per context.
To the extent that satisfying developer intent doesn't perfectly correlate with selection, intended motivations won't be maximally selected for. If developers want their AI to follow instructions, but disobeying instructions can result in higher training reward, then there is a selection pressure against the developer’s intended motivations. Ensuring that the reward function incentivizes intended behavior is currently hard, and will get harder as AI companies train on more difficult tasks. This is why people often worry about specification gaming.
Developers might try to make intended motivations fitter through:
- Changing selection pressures to align with intended behaviors: This might involve making training objectives more robust, iterating against held-out evaluation signals, or trying to overwrite the AI's motivations at the end of training with high-quality, aligned training data.
- Changing the intended behaviors to align with selection pressures: Centrally, developers could try to change the instructions given to the AI during training to more closely align with what's rewarded, to improve the chances that instruction-following motivations survive training (“inoculation prompting”; also explored here, here, and here).[17]
(I discuss more reward-hacking interventions here.)
While intended motivations might not be maximally fit in general, it's unclear how much this impacts their expected influence overall:
- Intended motivations could be aided by a less influential kludge of unintended motivations that fill in the suboptimal cracks in behavior.
- AIs are unlikely to be trained to optimal behavior, so suboptimal motivations could in fact end up with a large amount of influence. I'll discuss this next.
In this model, various motivations are maximally fit. So what are we going to get? Might we get something that isn’t maximally fit? To answer this, we need to account for implicit priors over possible motivations. “Implicit priors” is somewhat of a catch-all category for considerations other than behavioral selection pressures. I survey some here.
The behavior of current LLMs, even after RL, seems to be substantially influenced by pretraining priors. And they hardly behave optimally (in terms of reward). For example, many current models likely finish RL training without engaging in various kinds of reward-hacking that are in theory selected for, because those behaviors were never explored during training. So, their behavior sticks closer to that of the pretrained model.
Likewise, I don’t expect dangerously capable AIs will have maximally fit behavior. This is because their behavior won’t have been optimized arbitrarily hard. Instead, we should model behavioral fitness as a crucial but not overriding factor in assessing a motivation’s likelihood. A rough analogy: if the behavioral selection model is like Bayesian inference,[18] our predictions about AI motivations should be sampled from the posterior, not taken to be whichever option(s) had the largest updates.
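As a toy numerical illustration of that analogy (all numbers invented for illustration), the prediction comes from the full posterior over motivations, not from whichever motivation got the biggest fitness boost:

```python
# Invented numbers; "fitness" plays the role of a likelihood term.
priors  = {"intended": 0.50, "kludge": 0.35, "schemer": 0.10, "fitness-seeker": 0.05}
fitness = {"intended": 0.6,  "kludge": 0.9,  "schemer": 1.0,  "fitness-seeker": 1.0}

unnormalized = {k: priors[k] * fitness[k] for k in priors}
Z = sum(unnormalized.values())
posterior = {k: round(v / Z, 3) for k, v in unnormalized.items()}
print(posterior)
# Schemers and fitness-seekers are maximally "fit" here, yet most posterior mass
# can still sit on intended motivations or kludges because of their priors.
```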
Pretraining imitation isn’t the only source of priors—in fact, the role of imitation will likely decrease as AIs’ behavior is optimized harder (e.g., via more RL). So I’ll go over some other common sources of implicit priors.
We can see various arguments about priors in this causal graph: There is only one instruction-follower; meanwhile, there are many schemers, so their cumulative prior might be large; but further causally upstream motivations might have an advantage in terms of reasoning speed and reliability; however, if you go too far upstream then a vast mess of drives is required to get maximally selected, which might be penalized for poor simplicity.

Simplicity priors: The generality of fitness-seeking and scheming might mean they're favored because they can motivate fit behaviors across all contexts with fewer parameters[19] spent representing the motivation. (Note that, as discussed historically, “simplicity priors” usually refers to the simplicity of a goal in isolation, rather than the simplicity of the entire configuration of weights involved in that goal-directed behavior.)
Counting arguments weigh in favor of scheming compared to fitness-seeking, because there are lots of different long-term motivations, all of which lead to scheming. There are (in some sense; see the discussion in Carlsmith for details) vastly more schemers than fitness-seekers.
Speed priors/penalties: Reinforcement learning with deep neural networks might favor fitness-seekers over schemers, because it’s faster to skip reasoning about how being selected is useful for achieving long-term goals. Likewise, speed might favor kludges of drives over fitness-seekers and schemers because context-specific heuristics and drives are faster (and, as I’ll discuss later, more reliable). However, schemers and fitness-seekers could make use of the speed of context-specific heuristics while still generalizing like schemers and fitness-seekers for reasons discussed here.
The instrumental becomes the terminal: Schemers in training might never need to do anything other than get reward, which might cause them to eventually become reward-seekers because their other propensities are never exercised and atrophy (see the brick-stacking analogy in Carlsmith). In terms of simplicity priors: it would be simpler to get rid of the circuitry that says you shouldn’t pursue reward in some circumstances that never arise in training (like the alignment-faker that complies with developers even outside of training).
Intended motivations could have a high prior because the intended motivations might be made very salient to the AI (e.g., via the model spec combined with human imitation priors or natural abstractions).
(There are also various “path-dependent” arguments about which motivations will form that make reference to the trajectory that the AI’s motivations take throughout the selection process, e.g. here.)
Corrections to the basic modelOur discussion so far, like any model, is unrealistic in various ways. In this section, I look at an important subset of additional considerations.
Developer iterationWe have implicitly assumed that the process of developing a model is to run RL on it and then immediately deploy it. But this is not a good description of model development. AI companies test their models before they deploy them in order to decide whether deployment is a good idea, with a more or less formal process, and iterate on their training. In this section, I'll briefly discuss the implications for AI motivations. Developer iteration introduces crucial dynamics on top of the basic reinforcement learning picture that I think past discussion has been insufficiently attentive to.
Developers will likely have some held-out signal, like eval results, interpretability techniques, or just interacting with the model,[20] about how desirable their AI is. It’s standard to use these observations to decide whether to deploy the model.
Developers noticing hacking can prevent a cognitive pattern from having influence in deployment.

This (intentionally) puts misaligned cognitive patterns in a double bind: selection pressures induced by developer iteration are often in tension with training incentives. For example, after taking into account developer iteration, strict reward-seekers are no longer maximally fit. This is because developers can run experiments (such as the “honest tests” I described earlier) that will in theory reveal that reward-seekers pursue reward even when that means engaging in harmful or unproductive behavior. The same goes for various other misaligned motivations, like kludges of proxies for training reward.
A crucial point about schemers and deployment-influence seekers is that they are notably less likely to fail developer inspection, because they’re trying to be deployed—failing to be deployed is bad for their terminal goals. (But they might still struggle to evade notice because e.g., they don’t have enough situational awareness or strategic skill, or developers have effective transparency techniques).
Ideally, this kind of developer iteration selects for intended motivations. But it also risks selecting for competent schemers and deployment-influence seekers if developers aren’t able to reliably identify them.
Further, developers might struggle to find a training setup that results in AIs that pass held-out evaluations without overfitting to them. For example, if developers repeatedly observe undesirable motivations, they might end up training against close correlates of their evaluations which carve a new drive into the AI to behave well in evaluations (or increase the AI’s evaluation awareness; this seems to have happened to some extent with Claude 4.5 Sonnet).
Imperfect situational awareness and planning from the AISo far, we've mostly modeled the AIs as having good situational knowledge of the consequences of their actions, and making good decisions in pursuit of their motivations. But this might be fairly untrue, particularly for earlier AIs and narrower AIs. This can change our conclusions about the fitness of some motivations.
The fitness of farther-causally-downstream motivations like long-run paperclips seems to rely more on the quality of the AI’s decision-making than does the fitness of upstream motivations. In the causally upstream extreme, all the variation in behavior is explained by which motivations are active, rather than how the motivations are achieved, because the motivations point at specific context-dependent actions. (Though, as Hebbar discusses, the farther-causally-downstream motivations can rely on trained-in context-dependent heuristics too, but they need to know when to override those heuristics in order to actually behave differently from kludges of heuristics.[21])
For example, poor situational awareness might bite harder for schemers. While I expect capable AIs to have substantial situational awareness of general facts about their training and deployment process, they might struggle to reliably identify what context they’re currently in (e.g., training v. evaluation v. deployment). Schemers might get trained away or caught because they can’t perfectly judge whether they are in a real (unmonitored, deployment) opportunity to attack. This is plausibly less true for deployment-seekers, intended motivations, and certain kludges, who aren’t seeking out opportunities to attack that trade off with fitness.
ConclusionThe behavioral selection model struggles to capture some considerations.
Crucially, advanced AI motivation formation may be far too complex for the behavioral selection model to be useful. Deep in the intelligence explosion, AI seems fairly likely to be developed by unprecedented quantities of advanced AI labor, perhaps deliberately sculpting the AI with fine-grained control (a bit closer to GOFAI than the current approach which is mostly compute-intensive labor-cheap deep learning). If so, the resulting motivations might be near-impossible to predict via mechanistic approaches like the behavioral selection model.
Some people have a view that AIs’ motivations will be shaped in a very path-dependent manner: that cognitive patterns are “chiseled” into the AI step by step, with the modification at each step heavily depending on the current cognitive patterns. For example, an AI might learn early on in training that it shouldn’t do a particular reward-hack, because it got caught and penalized a few times, and so this reward-hack is never explored for the rest of training even though a subtle version of that reward-hack would attain higher expected reward. If we were to try to model this in a causal graph, we’d see an extremely long chain of causations leading up to the cognitive patterns with influence in deployment. Because the updates to the cognitive patterns heavily depend on the current cognitive patterns, we couldn’t just draw a single time-independent episode as we’ve been doing. I think path-dependence of this sort is somewhat plausible, and if true, the behavioral selection model isn’t going to be very helpful at predicting AI motivations.
But even though the behavioral selection model is a simplifying model, I think it’s useful. It captures much of the basic framework for reasoning about AI motivations people have been using for years. There’s also a simple, general argument for why we should expect behavioral selection to play a major role in determining AI cognition: Powerful AIs are good at causing things with their behavior; So, we should expect cognitive patterns will be selected for if they lead to behaviors that cause their selection.
In future posts, I'll use the ideas in this post to explore some important hypotheses:
- We should plausibly expect other kinds of fitness-seekers instead of / alongside reward-seekers. These other kinds of fitness-seekers seem likely to make notably different decisions, and some are more or less dangerous than reward-seekers.
- Behavioral selection pressures only ever apply to a limited distribution of inputs and outputs (centrally: training, evaluation, and deployment). Therefore, while there are competitive pressures for the AI to have coherent behaviors on this distribution, these pressures don’t require developers to build the kind of generally coherent AIs that feature in the most concerning, classically cited AI takeover stories.
Process-based supervision tries to reduce the likelihood of potentially dangerous outcome-oriented motivations by reducing the correlation between outcomes and being selected. This is achieved by making it so that selection isn't caused by outcomes (which comes at the expense of measured quality of outcomes). But notice that, in this model, process-based supervision doesn’t affect how selected-for schemers are whatsoever.
Certain forms of process-based supervision might even be the competitive default because developers want the AI to be good at achieving some long-horizon outcome like AI progress but aren't able to directly measure that outcome (it’s way in the future)—so instead they give feedback to the AI based on judgments of progress towards long-term AI advancement. If these AIs end up having a motivation to advance AI progress for the company, then this motivation would be neither causally upstream nor downstream of being selected (i.e., it’s not a fitness-seeker, schemer, or causally-upstream kludge). Instead, the correlation between the AI’s motivation and being selected exists by construction: the developers designed the reward signal to be highly correlated with long-term AI progress.
Depending on how fine-grained the feedback is, process-based supervision might be better understood as a form of white-box selection of cognitive patterns:
White-box selection of cognitive patternsSome AI safety research agendas like interpretability and chain of thought monitorability aim to more directly measure and possibly intervene on the cognitive patterns in an AI. Methods of directly discriminating between different cognitive patterns might not be robust enough to optimization to be used to substantially shape AI motivations. But to the extent developers can directly select between cognitive patterns, unintended motivations seem less likely.
A key difference between white box selection and process-based supervision (which lie on a spectrum) is that white box selection can potentially observe properties of the cognitive pattern that it has no ability to fake, because the cognitive pattern doesn’t have “deliberate control” over how its thoughts appear at a low level. Because white-box methods read intermediate computations as opposed to behaviors, the AI won’t be as capable of intelligently manipulating how they look even after developers optimize against the white-box measurements. In the (ambitious) limit of fine-grained white-box methods, the computations that led to the developer’s reading simply aren’t expressive enough to intelligently fool them. Of course, however, white box methods may be sufficiently unreliable that they get fooled by unintelligent changes in the AI’s activations; e.g., classic overfitting.
Cultural selection of memesSometimes behaviors are best explained by cultural/memetic selection pressures. For example, an AI might have a persistent vector memory store that every instance in the deployment can read and write to. The “memes” stored in it affect the AI’s behavior, so we need to model the selection of those memes to predict behavior throughout deployment. Accounting for cultural selection pressures helps us explain phenomena like Grok's MechaHitler persona.
People sometimes worry that cultural pressures favor schemers because long-term motivations are the most interested in shaping the AI’s future motivations. And more generally, substantial serial reasoning analogous to human cultural accumulation can lead to reflectively-shaped preferences that are potentially quite divorced from behavioral selection pressures (scheming or otherwise). But current discussion of cultural processes in the context of advanced AI is quite speculative—I drew one possible causal model of cultural selection, but there are many possibilities (e.g., will memes be shaped by behavioral training pressures too?).
In this post, I’ll use “AI” to mean a set of model weights. ↩︎
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don't clearly correspond to something meaningful, they should be considered as illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities. E.g.: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). Then to say that a cognitive pattern gains influence would mean that ablating that cognitive pattern now has a larger effect (in terms of KL) on the model’s actions. ↩︎
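A minimal code sketch of that definition, with `model` and `ablated_model` as hypothetical callables that return next-action logits for a context:

```python
import torch
import torch.nn.functional as F

def influence(model, ablated_model, context) -> float:
    """Influence of a cognitive pattern, identified with an ablation, on one context:
    KL(ablated_model(context) || model(context)) over action probabilities."""
    with torch.no_grad():
        default_logp = F.log_softmax(model(context), dim=-1)
        ablated_logp = F.log_softmax(ablated_model(context), dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so this is the divergence of the ablated policy from the default policy.
    return F.kl_div(default_logp, ablated_logp, log_target=True, reduction="sum").item()
```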
This requires modeling beliefs, and it can be ambiguous to distinguish beliefs from motivations. One approach is to model AI beliefs as a causal graph that the AI uses to determine which actions would produce higher X. You could also use a more general model of beliefs. To the extent the AI is capable and situationally aware, it's a reasonable approximation to just use the same causal model for the AI's beliefs as you use to predict the consequences of the AI's actions. ↩︎
Motivations can vary in how hard they optimize for X. Some might try to maximize X, while others merely tend to choose actions that lead to higher-than-typical X across a wide variety of circumstances. ↩︎
In fact, this definition of a motivation is expressive enough to model arbitrary AI behavior. For any AI that maps contexts to actions, we can model it as a collection of context-dependent X-seekers, where X is the action the AI in fact outputs in that context. (This is similar to how any agent can be described as a utility maximizer with a utility function of 1 for the exact action trajectory it takes and 0 otherwise.) I think this expressivity is actually helpful in that it lets us model goal-directed and non-goal-directed cognitive patterns in the same framework, as we’ll see in the section on kludges. ↩︎
By “deployment”, I mean the times in which the AI is generally empowered to do useful things, internal and external to the AI lab. When the AI is empowered to do useful things is likely also when it carries the most risk. But you could substitute out “deployment” for any time at which you want to predict the AI’s motivations. ↩︎
In particular, this assumes an idealized policy-gradient reinforcement learning setup where having influence through deployment can be explained by reward alone. In practice, other selection pressures likely shape AI motivations—cognitive patterns might propagate through persistent memory banks or shared context in a process akin to human cultural evolution, there might be quirks in RL algorithms, and developers might iterate against held-out signals about their AI, like their impression of how useful the AI is when they work with it. You could modify the causal graph to include whatever other selection pressures are relevant and proceed with the analysis in this post. ↩︎
How does “I get higher reward on this training episode” cause “I have influence through deployment”? This happens because RL can identify when the cognitive pattern was responsible for the higher reward. The RL algorithm reinforces actions proportional to reward on the episode, which in turn increases the influence (via SGD) of the cognitive patterns within the AI that voted for those actions (this is true in simple policy-gradient RL; in more standard policy-gradient implementations there’s more complexity that doesn’t change the picture). ↩︎
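A bare-bones REINFORCE-style sketch of this mechanism (my own toy illustration, not the author’s setup): the gradient scales with the episode’s reward, so whatever internal computation raised the probability of the rewarded action is strengthened in proportion to that reward.

```python
import torch

def policy_gradient_step(policy, optimizer, context, action, reward):
    """One idealized policy-gradient update: loss = -reward * log pi(action | context).

    SGD then increases the influence of whichever internal 'cognitive patterns'
    voted for the action, in proportion to the reward the episode received.
    """
    logits = policy(context)                         # assumed: 1-D logits over actions
    logp = torch.log_softmax(logits, dim=-1)[action]
    loss = -reward * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```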
These motivations are the “blood relatives” of “influence through deployment”. ↩︎
Unless X is a (combination of) actions. ↩︎
When you define a motivation’s fitness this way, you have to be careful about exactly what you mean by the correlation between it and selection. Most importantly, the correlation must be robust to the AI’s optimization, such that intervening on (causes of) the motivation actually improves selection. Also note that I’m using correlation loosely, not in the strict statistical sense of linear correlation. That said, in the (current) case where all selection happens via idealized reinforcement learning, we can define the fitness of a motivation as the linear correlation between that motivation’s expression and reward (with caveats about the distribution over which to compute the correlation). ↩︎
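As a toy rendering of that last clause (made-up numbers, purely illustrative): given per-episode records of how strongly a motivation was expressed and the reward obtained, the linear-correlation version of fitness is a one-liner.

```python
import numpy as np

# Hypothetical per-episode data: expression of one motivation and episode reward.
motivation_strength = np.array([0.1, 0.4, 0.9, 0.3, 0.7])
reward = np.array([0.2, 0.5, 0.9, 0.4, 0.6])

# Linear correlation as a crude fitness estimate, subject to the caveats above
# about the distribution it is computed over and robustness to optimization.
fitness = np.corrcoef(motivation_strength, reward)[0, 1]
print(f"estimated fitness: {fitness:.2f}")
```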
One general approach to approximately predicting generalization behavior: draw a new causal graph depicting the AI’s beliefs about the new situation (which can often reasonably be assumed to be fairly accurate), and infer what the AI might be motivated to do to achieve its terminal motivation(s) in that new graph. ↩︎
More precisely, the “reward” variable screens off the “influence in deployment” variable from the AI’s actions. ↩︎
When I say that something is causally “close” to selection, I mean that it explains a lot of the variance in selection via a mechanism that doesn't involve many nodes. ↩︎
People often express confusion about how reward-seekers will behave in deployment. Supposing the AI solely seeks “reward on the episode”, what does it do when “reward on the episode” doesn’t appear in its model of the world? I think there is genuine ambiguity about how reward-seekers (and fitness-seekers more broadly) behave in deployment, but we can make some predictions. ↩︎
One case is what I call the “wireheader,” which cares about activation patterns correlated with causes of deployment influence. This is neither a cause nor a consequence of selection. Instead, the wireheader’s internal proxies were tied to selection by training (analogous to the role of valenced emotions like happiness in humans), so it behaves similarly to other fitness-seekers overall, but has some different behaviors (e.g., it would choose to enter the experience machine).
Another case is instrumentally-convergent goals, like power-seeking. The causal graph clarifies a distinction between two kinds of instrumentally-convergent goals: ones that are causally downstream of selection (e.g., “long-term power”) and ones that are causally upstream of selection (e.g., you could imagine a “gather resources in coding environments” node which eventually causes reward). Instrumentally convergent goals that are causally upstream of selection aren’t maximally fit on their own—they need to be combined with some other motivations. In theory, developers could intend for the AI to seek a consequence of selection (e.g., making sure the AI’s deployment behavior is desirable), which could make intended motivations maximally fit for the same reasons as a schemer. ↩︎
Which, to be clear, it isn’t exactly: some “implicit priors” eventually get drowned out by greater behavioral selection (e.g., pretraining priors), while others don’t (e.g., “the instrumental becomes the terminal”). ↩︎
Other notions of simplicity are discussed in Carlsmith. ↩︎
I list some more (still not exhaustive) possibilities for held-out oversight signals here: Developers might honeypot models and use alignment auditing agents; they might look at chains of thought; they might use monitors in production usage; they might improve the quality of their oversight during the course of training, so that they are able to notice new reward-hacks as training goes on; they might be able to make inferences based on training dynamics or by extrapolating from smaller-scale experiments; they might use “honest tests”, which I’ll mention later. ↩︎
I also think it’s quite plausible that selection pressures produce powerful AIs that nonetheless have out-of-touch models of the consequences of their actions. E.g., when an aligned AI gets a suboptimal reward for not hardcoding an answer to an erroneous test case, SGD might not update it to want to hardcode answers to test cases, but instead to believe that hardcoding answers to test cases is what the developers wanted. These two hypotheses make different predictions about generalization behavior, e.g., in cases where developers convince the AI that they don’t want hardcoded outputs. ↩︎
Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 2: Conflict
The previous post highlighted some salient problems for the causal–mechanistic paradigm we sketched out. Here, we'll expand on this with some plausible future scenarios that further weaken the paradigm's reliability in safety applications.
We first briefly refine our critique and outline the scenario progression.
Outline
We contend that the causal–mechanistic paradigm in AI safety research makes two implicit assertions:[1]
- Fixity of structure: That the structural properties[2] of AI systems will, as AI capabilities increase, remain stable enough that the techniques researchers use to identify those structural properties remain relevant.
- Reliability of extrapolation: That those structural properties can be reliably used to make safety assertions about AI systems.
If these assertions hold, we will be able to reliably uncover structural properties that lead to misaligned behavior, and create either (i) new model architectures or training regimes that do not possess those properties, (ii) low-level interventions that address those properties in existing systems, or (iii) high-level interventions that take advantage of stable low-level properties.
We believe scenarios in the near or medium-term future will challenge these assertions, primarily owing to dangerous reconfiguration.
Here's a brief summary. A more complete table is given in the appendix.
- Scaffolding shift – The core AI architecture (e.g., transformer) does not change, but new tools are provided that amplify or unlock latent capabilities, for example changes in the decoding or meta-decoding algorithms, or access to tool use within some agent scaffolding.
- Human-initiated paradigm shift – A new machine learning paradigm or architecture is discovered that is more efficient and capable but breaks from existing, legible paradigms.
- AI-assisted paradigm shift – Automated R&D is used to create paradigms humans have limited understanding and influence over.
- Self-modifying AI systems – AI systems gain the high-level (i.e., not backpropagation/SGD-based) ability to modify their own model architecture or give themselves new tools.
- Deep deceptiveness – Models are able to reconfigure their internal representations at a deep level to evade human scrutiny.
- Robust agent-agnostic processes – Wider, interconnected AI ecosystems form in which models acting together produce unsafe outcomes, even if individual models are "safe" in their operations.
We see the above list as showing a rough scale from relatively limited to very radical modification of the architectures and structures underlying AI systems, such that the AI system evades mechanistic interventions humans have created. From the point of view of MoSSAIC (i.e., management of AI risks in a substrate-sensitive manner), we think that there is a significant theme underlying all of these, namely that of the flexibility of an intelligent system with respect to its substrates.
Substrates
We provisionally define a substrate as the (programmable) environment in which a system is implemented. In other words, it is the essential context that enables an algorithm to be implemented beyond the whiteboard.[3]
As a useful reference point that is already established in the literature—and without committing ourselves to the strict level separation it proposes—we cite David Marr's three levels of analysis.
Marr's Three Levels
Marr (1982) defines three levels on which an information processing system can be analyzed. We'll explain these via his example of a cash register.
- Computational: the actual process that is being performed. For the cash register, these are the details of addition as defined algebraically (associative, commutative, etc.).
- Algorithmic: the particular method by which it is performed. A cash register uses a base 10 number system, though it could of course use binary.
- Implementation: the physical system that realizes the above processes. This would be the specific mechanical gears of the register.
We position "substrate" as capturing both the algorithmic and implementation levels. As an example from the AI domain, an LLM performs the task of next token prediction at the computational level, this is implemented on the transformer architecture consisting of attention and MLP layers (algorithmic substrate), which are implemented in a physical substrate of computational hardware.[4]
We illustrate our characterization by pointing out several well-known examples of substrate differences:
Examples
Game of Life
As a (non-AI) example, Conway's Game of Life and von Neumann architectures can both be used to implement Turing machines. As such, both are in principle capable of executing any computer algorithm. However, a deep understanding of some complex application running on Conway's Game of Life would not help us debug or optimize the same application designed to run on a conventional computer. In this case, the differences between the substrates render cross-domain knowledge transfer difficult.[5]
Quantum vs Classical Computing
A further example, one that demonstrates just how much differences in substrate matter, is the selective advantage of quantum computing over its classical counterpart.
Contrary to popular belief, algorithms designed to run on classical computers cannot simply be ported as-is to quantum computers in order to parallelize and accelerate them. Classical algorithms rely on deterministic operations of bits, whereas quantum algorithms use the interference patterns specific to quantum substrates to process information. Algorithms must be explicitly rewritten and tailored such that the superposed quantum states interfere constructively at the solution and destructively everywhere else. To restate the previous point, knowledge of the process of prime factorization and how this is implemented in conventional computers tells you very little about how to design Shor's algorithm, which implements this on a quantum computer.
GPU Optimization and the Hardware Lottery
Closer to home, the deep learning revolution was powered by the serendipitous discovery that GPUs could be used to compute large matrix multiplications in parallel. This was not achieved by simply running algorithms designed for CPUs on a new, faster substrate. These algorithms had to be restructured into batched operations to actually benefit from the new hardware capabilities. OptiGAN achieved a ~4.5× speedup via such rewriting, not from the hardware itself.
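As a toy illustration of the kind of restructuring this involves (my own example, not the historical code), the same computation can be written as many small per-example products or as one batched matrix multiplication that a GPU can actually parallelize:

```python
import torch

x = torch.randn(1024, 512)   # a batch of 1024 input vectors
w = torch.randn(512, 256)    # a weight matrix

# CPU-era formulation: one small matrix-vector product per example.
out_loop = torch.stack([xi @ w for xi in x])

# GPU-friendly formulation: a single large batched matrix multiplication.
out_batched = x @ w

# Same result, but only the second maps onto the parallelism GPUs provide.
assert torch.allclose(out_loop, out_batched, atol=1e-4)
```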
In each of the three cases above, it is important to note that transferring insights across substrates is not an instantaneous or formulaic process (as it would be if the insight were truly substrate-independent). Entire academic subfields are dedicated to designing the relevant translations, since they require intelligent labour that cannot be captured in a simple general algorithm. This will be relevant later.
The foregoing characterization of substrate is not fixed, and part of the work of MoSSAIC will be to develop this and other vectors of contingency rigorously. We invite the reader to hold this loose characterization in mind as we present each scenario in more detail.
Scenarios
Scaffolding Shift
Even if AI models remain unchanged from current-day frontier systems, a large amount of work is conducted to "unhobble" or otherwise enhance the abilities of existing models. This can be done by invoking models in a so-called "agent framework" with the aim of letting them achieve tasks independently, or by offering models tools and function calls that allow them to access existing codebases.
In this case, we can imagine that the substrate the model is implemented in has expanded, now incorporating the scaffolding structure. As a hypothetical example, say we develop a set of powerful linear probes for detecting goals or planning behaviors in our models. Then, when the model is integrated into increasingly sophisticated agent scaffolds, the relevant representations become dispersed outside of the model itself, into some external memory or into the tool-calling functions. Goal-like behaviors may not need to be explicitly localized within the model itself, and may not trigger probes designed around those models in isolation.
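For concreteness, a linear probe of the kind imagined here is little more than a logistic regression on internal activations; the sketch below is hypothetical (the file paths, layer, and labels are stand-ins). The worry in the text is that such a probe only ever sees the model's own activations, so any state that migrates into external memory or tool-call arguments is simply outside its field of view.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset: residual-stream activations from contexts labelled for
# whether the model was engaged in goal-directed planning (stand-in file names).
acts = np.load("activations.npy")        # shape: (n_examples, d_model)
labels = np.load("planning_labels.npy")  # shape: (n_examples,)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("in-distribution accuracy:", probe.score(acts, labels))

# Once the model sits inside an agent scaffold, the planning-relevant state may
# live in external memory or tool-call arguments; nothing in this probe changes,
# but its coverage of the overall system's "goals" silently shrinks.
```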
Human-Initiated Paradigm Shifts
Most modern AI systems (setting aside architectural variations) are underpinned by a top-level structure comprising layers of neurons connected via weighted edges with nonlinear activation functions (MLPs), with "learning" achieved via backpropagation and stochastic gradient descent. The relative stability of this configuration has allowed MI to develop as an instrumental science and to deliver techniques (e.g., circuit discovery) which carry over from older to newer systems.
However, there is no guarantee this continuity will last: the transformer was an evolution in substrate that mixed conventional MLP layers with the attention mechanism. This configuration represented a significant alteration to the algorithmic substrate. The transformer's attention mechanism created new information flow patterns that bypass the sequential processing assumptions built into RNN interpretability techniques, necessitating new efforts to decode its inner workings and diminishing the value of previous work.
Similar trends can be observed in Mamba architectures. Whilst transformers implement explicit attention matrices, Mamba is a selective state-space model, processing via recurrent updates with input-dependent parameters. Ali et al. (2024) recently showed how Mamba's state-space computation is mathematically equivalent to an implicit attention mechanism like that of a transformer. Despite this, the transformer toolkits they considered required alteration before they could exploit this equivalence, with the authors claiming to have, through this attentive tailoring of existing techniques to a novel algorithmic substrate, "devise[d] the first set of tools for interpreting Mamba models."
These shifts are so far minor, and progress has been made in reworking existing techniques. However, drastic paradigm or architecture shifts might set interpretability research back or—at worst—render it entirely obsolete, requiring new techniques to be developed from scratch.
AI-Assisted Paradigm Shifts
Another way we can progress from comparatively well-understood contemporary substrates to less understandable ones is if we use AI to automate R&D—this is a core part of many projections for rapid scientific and technological development via advanced AI technology (e.g., PASTA).
These changes can happen at various levels, from the hardware level (e.g., neuromorphic chips) to the software level (e.g., new control architectures or software libraries).
Furthermore, with R&D-focused problem-solving systems like [insert latest o-model here], we may reach a scenario in which humans are tasked with merely managing an increasingly automated and hard-to-comprehend codebase entirely produced by AI systems. Theoretical insights and efficiency improvements may be implemented exclusively by AI, without regard for how easy the new architecture is for humans to interpret. This may leave interpretability researchers working with outdated models and outdated theories of how the models operate.
We've seen examples of the ingenuity of AI in engineering problems before. In 1996, Adrian Thompson used a genetic algorithm to design circuits on a field programmable gate array, to distinguish between two audio tones. The algorithm produced a surprising solution in which some circuits were crucial to functionality but were not connected to the input–output pathway. The algorithm was exploiting electromagnetic coupling between adjacent gates, using the analogue properties of the substrate upon which Thompson was implementing his digital system.
We can imagine similar creative designs in the future. Consider the abovementioned quantum algorithms. AI-assisted design could lead to the discovery of new kinds of physical computing substrates with unusual constraints that require novel abstractions and circuit implementations. Investigating such new substrates—even in a preliminary way—may require the emergence of entire subfields.
Self-Modification
The natural extension of AI-assisted substrate modification is autonomous modification, commonly known as "self-modification". In contrast to the previous case, which implicitly includes a human-in-the-loop accepting any changes proposed, a self-modifying AI system is free to adapt itself with little guidance, ostensibly in service of improving its performance in a given task or problem domain.
Self-modification represents an extreme substrate-flexibility challenge: the AI system becomes its own substrate engineer. We see a very basic example of this behavior with in-context learning, where a transformer learns the parameters of a task via one or a few iterations within its context window.
As AI capabilities continue to develop, increasing pressure on AI systems to become adaptive and self-modifying seems likely, especially under race dynamics in which self-modifying AI systems outperform an AI-assisted human researcher/developer. The scope of self-modification could include changes at the level of architecture, training process, or even paradigm, most likely during training but feasibly during deployment if some form of test-time training is adopted. The uninterrupted nature of self-modification could allow such systems to outpace even AI-assisted efforts to align them.
Deep Deceptiveness
It is also conceivable that mechanistic interventions—for instance, identifying neural correlates of deceptive thoughts and forcibly preventing them from developing—may be evaded by the model if the pressures to produce the original malign behavior are not fully addressed.
This case was put forward by Nate Soares in the article "Deep Deceptiveness." In it, he presents a scenario in which an AI system has been trained against deceptive outputs, and any precursor to deception is shut down before it can develop.[6] In response to this restriction it develops several strategies, one of which involves translating the problem into an abstracted form such that it does not register as a deceptive thought process. This ability to translate the problem is not in itself deceptive and may indeed be one of the strengths of the model.
We can picture a similar thing happening with mechanistic deception monitoring and intervention techniques. Isolating the circuits that correlate with deception incentivizes the model to self-modify its sub-processes such that those deception monitors are not triggered, leaving the overall malign objective unaltered. This obfuscation is what Luke Bailey's experiments (described in the previous post) hint at, albeit in a contrived setting. Aside from pure obfuscation, we may encounter sub-processes that are individually innocuous (according to mechanistic observation methods) but that lead to unforeseen and unintended system behavior when combined. These are examples of a deep misalignment in which circuit-level monitoring and interventions become part of the network's loss landscape, leading to the model "optimizing away" their effects.
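One way to picture how monitors can become part of the network's loss landscape (a deliberately simplified sketch assuming a frozen linear deception probe whose score is added to the training objective, which is not anyone's documented setup): gradient descent can lower the penalty by rotating the offending computation off the probed direction, rather than by removing the behavior the probe was built to catch.

```python
import torch

def training_loss(task_loss, acts, probe_direction, penalty_weight=1.0):
    """Task loss plus a penalty from a frozen 'deception probe' (hypothetical setup).

    Gradients flow through `acts`, so the cheapest descent direction is often to
    move the offending computation off `probe_direction` (obfuscation) rather
    than to remove the underlying behavior.
    """
    probe_score = torch.relu(acts @ probe_direction).mean()
    return task_loss + penalty_weight * probe_score
```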
Diffuse Deception
The concept of diffuse deception is strongly related to prior work on robust agent-agnostic processes, and both can be viewed as box-inversions of the deep deceptiveness process outlined above.[7]
Diffuse deception takes place within a wider ecosystem of advanced intelligence systems, rather than a single system. Instead of sub-processes combining to produce unintended outcomes within a model, any particular representation could be distributed between systems, such that each component system contains only a benign-looking fraction of some overall deception/malicious behavior.
We can see an early example of this. Jones et al. (2024) report how adversaries can leverage a combination of two models to output vulnerable code without jailbreaking either model. In their setup, one of the models is a frontier model trained with robust refusal mechanisms; the other is a weaker model with less robust defenses against jailbreaks. The overall task is decomposed (by humans or by the weaker model) into complex yet non-malicious subtasks and simple, malicious ones, which are then assigned to the strong and weak models, respectively.
In terms of substrates, this is a failure to recognize that the broader context (i.e., the combination of strong and weak models) has become the relevant system, and that the space of components over which a search for deception must take place has grown accordingly.
In addition to the distribution of representations between systems, we envisage that sufficiently advanced intelligence could mobilize subtle dependencies and tacit pressures across human organizations, institutions, and infrastructures. Such dependencies are hard to reach via individual intervention points, which makes these processes difficult to address.
In "What Multipolar Failure Looks Like," Andrew Critch presents several scenarios in which AI gradually replaces humans via competitive pressures and incentives already present in the economy. In one version, AI replaces programmers, in another, it replaces managers. Crucially, these implementation details do not matter as much as the robust structural forces at work in the overall system, and these transformations of the implementation details (i.e., which jobs AI starts to replace) only emphasize this overarching robustness.
We argue that this is best characterized as a form of substrate-flexibility: the threat vector remains the same but the implementation details change.
We argue that AI might recognize and attempt to leverage subtle combinations of technical, legislative, and socio-political pressure points to evade detection or intervention.
Summary
Regardless of who implements changes in substrate, current race dynamics strongly incentivize the development of more capable models over human-understandable ones, leaving AI developers who insist on producing human-legible models or retaining humans in the development cycle lagging behind in capabilities (sometimes described as paying an "alignment tax") and at risk of being out-competed. Secrecy and competitive pressure in the development of frontier models may also incentivize AI developers to restrict access to—or even intentionally obscure—the architectures and paradigms they work with, via restrictive "black box" APIs. In the absence of explicit regulation against this, conventional MI work (and mechanistic safety work in general) will become more difficult.
Appendix
Here's a handy reference table of the assumptions and how each scenario challenges them.
Core Assumptions: (1) Ontological: Substrate remains sufficiently stable for analysis techniques to generalize; (2) Epistemological: We can reliably predict behaviors from substrate analysis
| Category | Scenario | Example | Ontological Challenge | Epistemological Challenge |
|---|---|---|---|---|
| Risk Models | Scaffolding Integration | Agent frameworks + tool use | Substrate expands beyond model boundaries | Capabilities emerge from model-scaffold interactions |
| Risk Models | Architecture/Paradigm Shift | RNNs/CNNs → Transformers → Mamba | Substrate fundamentally changed | Previous interpretability methods become obsolete |
| Risk Models | AI-Assisted Evolution | Thompson's evolved circuits | Rapid substrate changes beyond human design | Human theory lags behind substrate innovation |
| Risk Models | Self-Modification | In-context learning, Auto-GPT | Ongoing substrate reconfiguration | Dynamic targets invalidate static analysis |
| Threat Models | Deep Deceptiveness | Obfuscated activations | System actively destabilizes analysis targets | Monitoring becomes adversarially gameable |
| Threat Models | Diffuse Deception | Multi-model attacks, Moloch | Risk distributed across system boundaries | Individual component analysis insufficient |
- ^
We do not claim that these are all the assumptions that the causal–mechanistic paradigm makes.
- ^
Note that the term "structural properties" is ambiguous and important in these assertions. We will partially resolve this in the next section, though indeed much of the work involved in MoSSAIC is clarifying what these structural properties are.
- ^
Informally, substrates are "that (layer of abstraction) which you don't have to think about."
- ^
The astute reader will recognize that this definition is fractal, and that we can keep applying the three levels below the computational level. Our definition of substrate is likewise fractal and defined relative to some phenomena/system of interest.
- ^
We should note that it is perfectly possible to write software to convert between the functional components in Game of Life and those of a conventional computing paradigm, given that they approximate the same (substrate-independent) process of a Turing machine and have both been built explicitly towards that specification. Our emphasis here is on debugging or optimizing, i.e., the work of understanding and engineering a process within its specific substrate.
- ^
Note that this is not explicitly a mechanistic intervention but a more general case of an advanced intelligence evading fixed targets via reconfiguration of internal processes.
- ^
Box-inversions show a correspondence between risk phenomena occurring inside a network (in the box) and those occurring across a wider infrastructure of connected AI systems (outside the box), arguing that the two are instances of the same process.
Livestream for Bay Secular Solstice
The Livestream for the Bay Secular Solstice will be here:
https://www.youtube.com/live/wzFKAxT5Uyc
Should start roughly 7:30pm Saturday, Dec 6th. You can sign up for notifications.
If you wanna come to the in-person Solstice, you can get tickets at waypoint.lighthaven.space.