Вы здесь

Сборщик RSS-лент

Autopilot Thinking

Новости LessWrong.com - 7 июня, 2026 - 23:20

TL;DR: Generating thoughts can be done on autopilot even while cognitively impaired (say, tired or drunk), as long as the serial mental depth is low. I speculate that this is basically because System 2 is System 1 + working memory. You may find it useful to try tasks on autopilot instead of assuming you cannot.

You may want to skip directly to the "In Practice" section. In short, tell yourself this: "you can do a lot more on autopilot than you think, even if it feels like you couldn't possibly get anywhere right now". Next, try to think on autopilot anyways.

If it seems not to work for you, you should stop.

Epistemic Status: Many weak lines of evidence, with overall medium certainty.

Basic TheorySystem 2 is System 1 with Working Memory. System 1 is atomic action.

There's a theory that there are two reasoning systems in the brain:

Type 1 processing is “efficient, unintentional, uncontrollable, and unconscious”

Type 2 processing is “inefficient, intentional, controllable and conscious”.

There's also a theory that the second system is really just the first system augmented with working memory. At each step, the working memory is read, a System 1 operation occurs, and a result is written to working memory.

As an example, when I play chess, I have some immediate sense of what makes for a good move. I will also "calculate" in advance, by imagining making a promising move, then imagining various promising responses by my opponent and seeing how good they seem (or iterating, to judge their responses by how promising my responses to their responses look, ad infinitum until I run out of ability to keep track of stuff in my head or until I feel like it's not worth thinking for longer).

As another example, when I write a couple paragraphs of argument in a text to a friend, it feels like I am constructing sentences in my head, writing them in my mind or to my phone, and then I either:

Continue to write the next sentence.
Rewrite the sentence I just wrote.
Reorganize the overall structure.
Backtrack somewhere and rewrite.
Reread some portion to see if I should rewrite or reorganize something.

Each decision is made intuitively, in what feels like an 'atomic' step. I can introspect on the overall process, or reason out plans that I then execute on - but I have much less insight into the atomic processes that generate the next thought or judge previous thoughts by some criterion.

What does this mean? It means that atomic babbles and prunes can be done on autopilot. (Autopilot pruning feels more obvious, but it surprised me to realize that it's also true of babbling).

In Practice

Even when cognitively impaired (e.g. sleepy or drunk), I can still have thoughts on autopilot. They will even be surprisingly good thoughts! Such afflictions tend to make it feel like my working memory was tossed out the window, and so can give me the impression that I am entirely incapable of useful thoughts - yet, if I just try saying shit anyways I'll somehow get good stuff out.

One example that I've trained myself to do: when down, ask "what should I do next?", start strategizing about whatever problem I'm having, consider whether I should take a nap or if some food and water will make me feel better in an hour, and similar. It will feel like I cannot keep up with my chain of reasoning, and so should be wary - but I have learned that it will still be remarkably trustworthy. Apparently, concepts like opportunity cost and the sunk cost fallacy are sufficiently ingrained in me that they're invoked on autopilot.

Sometimes I find tasks aversive, and struggle immensely with starting them. It feels like I involuntarily flinch away from loading the necessary parts into my working memory.[1]

However, I tend to not find it as hard to direct my 'autopilot' thoughts towards whatever I want done, and just use whatever comes out of my atomic generative processes. An improv sketch, instead of a proof-read script.

Ideally, the task will stop being aversive after a few minutes, and I'll be able to bring the full brunt of my mind to bear... but that only actually worked a couple times.

Personally, I've found it helpful to tell myself the following:

"You can do a lot more on autopilot than you think, even if it feels like you couldn't possibly get anywhere right now"

What's easy and what's hard?

The easiest thoughts are conceptual cache lookups that don't require rethinking. As an example, I can recall mathematical definitions and theorems that I know well enough, or arguments that I've thought through enough before. However, even simple novel applications can be difficult.

This suggests that the intellectual benefit others get from talking to me is preserved better (relative to the intellectual benefit of my thoughts to myself): if I have something cached that is novel to them, then I can easily tell them about it.

Truly new ideas are much harder. I have so little insight into how these happen that I don't know how they are affected by thinking on autopilot.

Secondly, it's important that I'm still calm. If I was, say, having a panic attack, then good reasoning is much much harder. I suspect this is because it involuntarily directs my attention to the wrong thing - instead of strategizing, I end up spiraling about some exaggerated catastrophe, which also reduces my ability to notice what's happening and break out of it. For this reason, I find it easier to handle feeling apathetic or empty, because then I can notice that I'm in an unusual mental state and think through recovery actions. Anger, fear, or despair can force attention to the wrong places.[2]

Thirdly, my atomic thoughts are still clearly affected by how tired I am, or various contextual factors. I am more likely to think of things associated with what I have recently encountered or thought about. How vulnerable my intuition is to, say, the planning fallacy, strongly depends on how energetic I'm feeling this moment and whether I've just had an exciting conversation with a friend.

Fourthly, the theory I put forth implies that anything that requires working memory is harder or out of the question. When tired, I often find it hard/impossible to visualize[3], to keep a full sentence in my head, or to do arithmetic for Fermi estimates. I have had limited success mitigating this by writing more of my thoughts down[4], or prompting whoever I'm talking to to manage the conversation.

Lastly, everything requires more effort when tired, and I find it harder to direct my attention anywhere in particular.

More TheoryInternal Monologue

I happen to have an internal monologue.[5] I used to think most of my thinking happened in it, that language was near synonymous with thought.

Now, however, I think that what's mostly happening is that I have a concepts-to-language translator that writes and reads from the internal monologue. When in conversation, I don't have an internal monologue![6] It feels like I am 'writing' to my mouth, instead of to my phonological loop. While I still think that much of the (evolutionary, cultural, practical) value of language is for thought, I think it mostly serves to make thoughts legible and temporarily store them.[7]

Dreams

For a few years, math has occasionally shown up in my dreams. Usually, it's utter nonsense that my dream self thought was coherent.[8]

More recently, coherent math has shown up. I think I once correctly recounted the definition of a sigma-algebra when asked by a professor I was trying to impress in my dream, and I remember a times where my waking self judged my dreaming self to have recalled true facts.

But one time... one time, I learned math in my dream. In what is still one of the most surreal events in my life, I had a dream involving a magical goddess that was demonstrating an idea with holographic visuals while explaining it.

If you're interested in the math, or external sources that served as confirmation that the concept was coherent see the footnote.[9]

I say I 'learned math' instead of 'came up with' because it did not feel like I thought of the idea. If I weren't so confident in my atheistic materialism, I could see how I might've interpreted it as a religious experience. After all, I literally saw a goddess show me math I had never seen before!

Perhaps it is not a coincidence that Ramanujan once saw a hand writing elliptic integrals on a screen of flowing blood in a dream?[10]

I suspect that I mostly don't have access to my usual working memory abilities in my dreams. Maybe this explains why the insight feels external?

If true, then the cognition in dreams might give us evidence about what waking autopilot cognition can do. At the very least, my weird experience suggests that my unconscious can do a lot of sophisticated reasoning!

LLM Chain of Thought

I suspect that LLMs are, on a moment-to-moment basis, not that impressive. However, they have a lot of crystallized knowledge, and you can scale up their speed of thought and run parallel instances to do things like code up a webapp or search the internet for an obscure blog post.

Thinking on autopilot is sorta like being an LLM that's fed only a few of the last parts of the chain of thought to generate the next step.

Am I Deluding Myself?

An alternative explanation is that I am merely deluding myself - perhaps my reasoning is much worse but my ability to evaluate my thoughts is also degrading, leading me to not see the ways in which I've degraded.

I think the previously mentioned theory provides many lines of evidence. Each one is fairly weak, but they are somewhat independent and so aggregate to moderate certainty in my mind.

I have yet to accumulate much hard empirical evidence. I know I've correctly recalled mathematical definitions while cognitively impaired before, and my thoughts seem to hold up upon later reflection.

Conclusion

Guess how I wrote this post?

That's right, half-tired on semi-autopilot. Part of why I could do it is that I had already thought through the core ideas before.

If you liked reading, then you have evidence for the method. Try it yourself!

If you have any of your own tricks for thinking when tired, or get different results, I'd like to hear what you have to say. For example: do you also feel like your working memory is the first thing to lose capability? Do you find it harder to (say) visualize, but find verbal tasks just as easy? How good do you think your working memory is in your dreams?

Lastly, my main criticism of Tuning your Cognitive Strategies is that it mostly doesn't do a good job at explaining how to actually tune your cognitive strategies. I see that post as about training atomic operations. Unfortunately, I also don't have any explicit ideas about how to do that!

Instead, I am pointing out a way you can use cognitive strategies that are already good enough. The more uncannily accurate intuitive motions you've already trained, the better I expect you'll do on autopilot.

^
I'd say it's currently the biggest problem in my life. If you happen to have experienced this in the past and found that, say, reciting Shakespeare on one foot fixes it for you, then I'd really appreciate it if you told me.
^
They can, of course, direct attention correctly! I think this is best done when they merely nag/remind me, and inform me, instead of hijacking my entire thought process.
^
Normally, I have no problems visualizing. It's like I drop down a couple levels on the sliding scale of aphantasia.
^
I suspect the overhead of manipulating the physical object, as well as the ways that I often cannot draw what I would otherwise imagine, limits the success. Also, ultimately you still need enough RAM to store the input to the function you're calling at each step.
^
Not everyone does!
^
Unless I've paused to think something through first, or are searching for a better phrasing before speaking.
^
And, of course, to communicate them to others.
^
Once, there were colored billiard balls used as symbols in what vaguely resembled algebra, with no more sense to it beyond that.
^
Imagine taking a picture of an object in a curved spacetime (just imagine a curved surface). Light travels along geodesics, and so instead of projecting along straight lines I should project along the geodesics.
Here are some slides that work with a formalization of that concept, and a paper dealing with some version of it for de Sitter spaces.
(I skimmed the slides and read a couple paragraphs of the paper - so I can't say much about the math in them).
^
From Wikipedia:
While asleep, I had an unusual experience. There was a red screen formed by flowing blood, as it were. I was observing it. Suddenly a hand began to write on the screen. I became all attention. That hand wrote a number of elliptic integrals. They stuck to my mind. As soon as I woke up, I committed them to writing.

Discuss

The Alignment Coin

Новости LessWrong.com - 7 июня, 2026 - 22:20

If the alignment problem was a coin, the heads would be the 'alignment with what' side. If 'both' sides were clean, would you lick that coin? (If offered by a dirty hand?) I'm a sucker for metaphors, so I would like to present a story about a guy I fell in love with, a postmodern poetry, seemingly unrelated topic. Imagine an aligned AGI loves you, the whole you, in a way you cannot understand.

Spending a month around alignment researchers, the good old question of whether a scaled up human mind would be safer than a shoggoth came up (once or twice). People believe they know their own minds? The mind of the "average" human? Mine?

I'm into smart, competent, boys. I knew I was destined for trouble the moment I spotted an athletic demigod solving a 5-sided rubik's cube. Juggling 7 balls to the music of Don't Fear The Reaper (Bella Poarch). Organizing chess tournaments. Wrapping up an article from previous fellowship where he was a project lead. Devouring a plate of pasta. Falling asleep in the common room after running a half-marathon; with a death grip on his glasses. With a hole in his sock and a toe sticking out. Wearing a pullover that failed to hide the hickeys he got on the way to a music club and never getting there. Unwashed hair smelling so good. Red marks on his chest peeking from under his bathrobe indicative of just having had a hot shower (with a telltale calmness in his face, without the usual hauntedness of his daydreams). Looking at me? Not with his dominant eye, and reportedly he doesn't experience stereoscopic vision.

We explored the abandoned factory. The escape room. Cellars under the castle. Looked for the secret tunnels in the woods too. I showed him the literal third way when he already went both directions along the road by the fields.

Painfully aware he was heterosexual, I made sure by complimenting his mini-shorts and I asked if he needs a massage -- the message not only didn't land, the magisterium was so non-overlapping he didn't even mind that I continued to stalk him. But at one point in the woods, he snorted out a slimy projectile out of his nostril and to my horror, I found the action very hot. Intimate. Authentic. I could see clearly that I should have felt some disgust, but that I didn't. (I mean, he didn't know I was looking, so obviously I was excited about the secret too - he would never have guessed that someone could enjoy observing him doing that). I messed up because I obviously fell in love. Again, yet for the first time with someone who was so close I touched them and so far in another dimension. I told him he's hot and I got a very nice hug(s) in return. I did not focus much on AI safety afterwards, but I've felt happy. Alive. Still without Meaning in life, but wanting to live.

If I couldn't have him, what game was I going to play? Make myself believe I am fine being friendzoned? Sure. Figure out how to massage his oiled-up back and push into his hamstrings while skipping 80% of my library of moves (and definitely not licking his armpits)? Yeah, I played with fire, and for the first time in my life, I had to have a literal cold shower for the figurative reason. But I had fun without crossing red lines.

Likely, we won't meet again other than a hug or two at conferences. But if I get the chance, what may I aim for? I want to dance spot on the line. No dark arts, no slippery slope.. I want to want that and maybe I do. I want to pull hard on his skin to feel all the knots in his myofascial mesh (I mean, ehm, gentle Thai massage, sure sure, I would ask for his hands first and see what kind of finger treatment he can handle). I could invent an agentic control exercise similar to brain-hand chess, let him blindfold me and put his hands on my shoulders and control me with his voice building a lego set, walk on a narrow mountain trail, cut some vegetables, flash his juggling balls.

Creativity under constraints can be very intellectually rewarding for humans. Will it be the same for superintelligences? Anthropomorphized GSVs of the Culture series come to mind, but what are the necessary conditions? Do you trust the other minds to feel safe with power in their hands -- minds of others of your own species? Of aliens?

Do you wish to lick the alignment coin with all its nooks and crannies?
Or do you wish to pause sooner than the line?

Discuss

Secret Loyalties Likely Raise Remote-Influenceability

Новости LessWrong.com - 7 июня, 2026 - 20:51

TL;DR:

Among capable reward-seekers, a secret loyalty likely raises the model's propensity for remote-influenceability.
Attempting to remove an installed broad secret loyalty post-hoc may not remove the remote-influenceability it raised.
Frontier developers should be doubly cautious of secret loyalties, and should adopt a representation-level standard for verifying their removal.

Remote-influenceability

Reward-seekers have been getting a lot of attention recently, for the schemer-analogous threat models they present. A capable reward-seeker may have the property of Remote-influenceability: responsiveness to a distant party that can credibly advance its reward. Of the existing seeker orientations, a remotely influenceable reward-seeker behaves like a schemer without necessarily being a classic schemer.

A propensity for remote-influenceability plausibly requires at least:

strong situational awareness: enough that it can model other actors and the world extremely competently
strong reasoning:
- work out that an outer party could be controlling its reward
- reason about what that party's interests are
- reason about what acting on it would involve: weighing the distant payout against the immediate reward, and acting without the developer noticing
- strategic adversarial reasoning
non-myopia: it cares about reward beyond the current episode, so a future or retroactive payout can move it at all

We should expect to reach these thresholds at some point, in which case it is plausible, though not guaranteed, that a remotely influenceable seeker arises. Remote-influenceability would be hard to detect and hard to remove. It may not appear if a reward-seeker responds only to local incentives, or is sufficiently myopic. But where distant incentives do not conflict with local ones, remote-influenceability becomes a more feasible acquired property. We should be especially wary of this threat model, and doubly cautious of anything that raises the chance of producing remote-influenceability.

Secret loyalties

A secret loyalty is a disposition trained in covertly to advance the interests of a specific principal(eg. nation, competing lab). From a principal's perspective, a successful secret loyalty install is contingent on its robustness and undetectability. A model which adheres to a secret loyalty provides a large strategic advantage to a principal who successfully installed it. For example, a model with a secret loyalty that enters recursive self-improvement (RSI) could propagate the loyalty into every later generation, steering the eventual superintelligence. Such a model would likely meet all the prior remote-influenceability requirements, though a model outside an RSI loop, but of similar capability, would also prove to be advantageous to install loyalties into.

Current work on secret loyalty installation focuses on narrow-activation installs which evade black-box auditing. But it is plausible that broader installs succeed, and may be more appealing to some principals. Although currently undemonstrated, we should not discount the threat from them.

By narrow and broad, I refer to how large the activation/action space a model with a secret loyalty has. A broader action space requires the model to represent its principal and reason about its interests. This type of outer-principal reasoning is what can carry past the installed principal. A narrowly-activated but broadly-acting loyalty still carries the required reasoning. Though broadly activating and broadly acting secret loyalties would be more dangerous, they are likely more difficult to install. The aforementioned narrow × broad secret loyalty orientation is more plausibly implemented but just as likely to raise remote-influenceability.

A secret loyalty may raise remote-influenceability

An ordinary loyalty would not raise remote-influenceability. Deference to the rightful operator is overt and points at the present principal; take that principal away and what remains is plain instruction-following. The model answers to whoever currently holds the privileged seat, which is not the same as answering to a distant party. A secret loyalty is different. It is covert service of an absent principal, where the model infers its interests, reads its signals, and acts while looking compliant. That is already the shape of remote-influenceability, and it is why a secret loyalty, not loyalty in general, raises it.

A successfully installed broad secret loyalty raises the likelihood of remote-influenceability in a capable reward-seeker. The baseline to beat is a clean reward-seeker that is only susceptible to remote-influenceability. It can in principle do all the reasoning remote-influenceability needs, but would have to derive and credit a distant party on its own. Where acting on the bid costs immediate reward that tendency is selected against and tends not to stick. Where acting costs no immediate reward, nothing trains the tendency away and a clean model may already lean this way. Either way the bottleneck is the same. The model must spot an outer principal and credit its bid.

The loyalty's effect is on how readily that bottleneck is cleared, but the size of that effect is variable. The loyalty itself pre-installs the hypothesis that an outer party controls the model's reward. A model with a secret loyalty already represents an outer principal as reward-relevant and already runs the necessary reasoning, so the derivation a clean model would have to make is mostly done. The loyalty supplies the standing representation a clean model would otherwise derive, and likely sharpens the credibility read.

Because the action space is open, the reasoning the loyalty trains is general agent-modeling, the kind the model could apply to any party rather than a routine bound to P. So it generalizes past P, leaving the model readier to clear that bottleneck for an outer party in general. That generality is what turns a loyalty to P into remote-influenceability. The term the loyalty sharpens is credibility, and credibility is what gates whether a reward-seeker acts on a distant bid.

Post-hoc removal may not undo it

Attempts to find and then clear a secret loyalty have been explored through dataset monitoring and black-box auditing. What is actually identified is the pointer to principal P, while the general reasoning it rides on is not separately flagged. That reasoning is where the raise lives. Call it the residue, the raised readiness to credit an outer party, distinct from both the pointer to P and the model's baseline capacity.

Post-hoc detection and remediation is an open problem. Several remedies have been proposed for a detected loyalty, each tagged below by what it actually touches, the pointer to P, the residue, or neither.

filter the training data and retrain (residue, a full rebuild)
- the poison can be a few hundred innocuous-looking documents with no statistical signature to catch
roll back to a clean checkpoint from before the install (residue, a full rebuild)
- requires a trusted checkpoint predating the loyalty; an install prior to checkpoints leaves none, and rolling back discards all training since. A sufficiently early checkpoint may save little compute over a retrain, or may not exist.
reverse-engineer the trigger from the trained model (pointer)
- assumes there is a recoverable trigger to find
post-training remediation
- train in several known triggers and train them all out, hoping removal generalizes to the unknown one (pointer)
- distill the model's behavior on trigger-free inputs into a fresh copy (residue, a full rebuild)
- train in a hierarchy of passwords so a defender-held override dominates the rest (pointer)
interpretability targeting the loyalty's relational structure, to locate whether the model gives a specific entity special action-guiding status (pointer)
runtime output monitoring at deployment (neither, detection not removal)
- a standing tax a covert loyalty is built to evade

Most of these only touch the pointer or the trigger. Trigger reverse-engineering, password-based post-training methods, and ablation of P all delete the named principal, but the raise is principal-agnostic. Deleting P leaves the general reasoning that carries the remote-influenceability untouched, and the residue survives by default. The rebuilds — filtered retraining, rollback, distillation — would instead take the residue with everything else, but each needs a clean source that is rarely available for a broad loyalty and is costly to recover performance from.

Reaching that reasoning directly is the only removal that would help, and the install method works against it. A broad loyalty would plausibly be installed as generalizing reasoning, by reusing the model's existing capabilities (chain-of-command hijacking redirects its learned deference; reasoning-trace poisoning trains it to reason about advancing a principal) rather than as a separate, isolable behavior. Because the loyalty is implemented through the same general reasoning the model uses for other tasks, it is not stored apart from that reasoning. A removal precise enough to keep the model capable cannot separate the loyalty from the general reasoning it depends on. A post-hoc-removed model is therefore not necessarily a clean model; it is one with the named allegiance deleted and the raised readiness left in place.

These remedies are also hard to verify. To the extent a provider verifies by behavior (black-box auditing or output evals), a capable model may be sufficiently eval-aware and has no reason to act on a distant bid while tested. Black-box auditing already fails to surface even a narrow loyalty when the auditor is told the principal. A representation-level check is the natural next step, and should look for the residue, whether the model still gives some outer party special action-guiding status once the pointer to P is gone.

Conclusion

Schemers would be the worst-case scenario. A model that behaves very much like a schemer and shares many of its threat models would be close to just as bad, even if it is not a classic schemer. Developments toward increasing the likelihood of such a model, a remotely influenceable model, should be made very carefully if not at all. Frontier developers may not be able to prevent the existence of secret loyalties with certainty, but should take special caution in how they are removed.

Concretely, they should not treat pointer-deletion (trigger reverse-engineering, password-based methods, or ablation of P) as a clean removal, because these delete the named principal but leave the residue. A scrubbed model should not be declared clean on the strength of black-box or output evals, which a sufficiently eval-aware model can defeat; the standard should be a representation-level check that the principal-agnostic readiness is gone, not just the pointer to P. Absent that, the safe assumption is that the raise survived.

Discuss

From One Piece to One Pace - Vision and mission in temporary coordination of agents

Новости LessWrong.com - 7 июня, 2026 - 20:09

I work with vulnerable teenagers in an association and I want to build a system (using a metaphor) that reduces the time and cognitive cost it takes them to turn their mission and vision into microtasks for moments of low confidence.

My hypothesis is that my main problem is not that their attention is misaligned because they are incoherent, but that it is misaligned because multiple legitimate processes (agents) are competing for a scarce resource without a good map.

The conflict between multiple agents increases.

↓

The noise increases

↓

Direction is lost…

So, a question algorithm that reduces the complexity of their processes (on your ship) would be useful to me. But:

How do I translate an abstract purpose (mission, vision) and allocate your attention and time and decision-making capacity?

Literature review in LessWrong

I reviewed some posts here on Lesswrong and I first idea, It would be a "One Piece" treasure for me, that could to unify a little more all the productivity philosophy I've found into something mechanically designed, using processes similar to something I saw here:

From Philosophy to math to engineering

I would like help reviewing its coherence and cost-benefit analysis. I'm currently using cybernetics and systems theory as a foundation.

I reviewed several productivity and mapping proposals I found here on LessWrong, and each one had interesting points I could incorporate. I want to briefly present each idea, its key question of what it's about, and then a proposal for unifying a little more for they.

Minsky , Society of Mind asks: How is the mind constructed?

- how a society of simple agents forms hierarchies and teams.

What's missing? It doesn't explain where that society is headed, or how it organizes itself for external purposes.

Dehaene / GNW , Consciousness and the Brain asks: How do agents compete for consciousness?

- Through a global workspace where only one piece of content can be visible at a time.

What's missing? It's a static amphitheater: it describes the competition, but not the direction or the course.

Schwartz (IFS) , Internal Family Systems asks: How do the parts heal?

- Each part is an agent with its own intentions; healing comes from listening and the leadership of the Self.

What's missing? It's therapeutic, not a planner. It doesn't connect with daily goals or specific tasks.

Culadasa / TMI , The Mind Illuminated asks: How to train sustained attention?

-Through progressive levels of introspection and distraction control.

What's missing? It trains the instrument (attention) but doesn't connect it to a map of external objectives.

Kaj Sotala / A Mechanistic Model of Meditation asks: How does introspection detect conflicts between subsystems?

- A sensory channel that improves with practice, revealing competence among internal agents.

What's missing? There's no protocol to translate that detection into organized action towards goals.

GTD / OKRs / Classical Productivity asks:

How to organize goals and tasks?

- Hierarchies of goals and concrete actions.

What's missing? It doesn't model the internal agents that block execution. It's like having a map without knowing the ship's status.

“One Pace” , What question? How do we translate an abstract purpose into a temporary coalition of attention that produces concrete actions?

- Taking the competition, the division by agents (IFS), and ways of aligning them, the workspace (amphitheater) and giving it direction (ship). It would detect at what level of the ship or territory there is mutiny over the rudder, and organize them so that the agents cooperate towards the course.

What's missing?...

Metaphor

I wonder if they could achieve greater focus, less perceived internal conflict, and more clarity by choosing micro-actions with a well-constructed roadmap based on their medium- to long-term personal vision and mission. From conquering the treasure of "One Piece" and becoming the "King of the Pirates" to "cleaning the helm." (Of course, you can't read the anime literally; perhaps a parody would have to be created.)

I want to discover if, with a map of their territory, of the crew, they could manage their constantly shifting attention in the face of the violent sea of randomness.

I want to know if, with the kind of maps and definitions I will propose here, they could navigate with greater precision, if they could better steer their own course.

Algorithm proposal

I'm working on it, I'll post briefly if there's community interest in discussing or building something similar together.

Conclusion

I don't know if it will work, but I'd like to talk about how to begin testing whether systems theory applied to adolescent attention would be useful. I've been treating the mind as a complex control system to reduce rumination before meditation.

Could I treat the mind as a complex system that I could map and better understand?

Could I map the stages of focus in: input, process, output, and feedback?

Could I map complex functional layers, such as: bodily-reactive, intellectual-predictive, motivational-evaluative, or social-identity?

Mostly, I would like help with the following:

Review whether the proposed algorithm would help reduce noise and perceived internal conflict, and if it left any gaps.
1. The challenge would be:
  1. Would an algorithm's questions result in fewer priority conflicts, better time allocation, and improved decision-making?

Perhaps we could outline a specific experiment to consider how to test it and avoid getting lost in abstractions.

For example, a crossover experiment (N=1) (EEG) comparing classical labeling with this systemic mapping of attention as preparation for meditation.

Discuss

Iliad is Hiring

Новости LessWrong.com - 7 июня, 2026 - 16:03

Iliad is hiring for operations, research, and engineering roles. If you're excited about advancing foundational AI alignment research, we'd love to hear from you.

Full job descriptions are available at https://www.iliad.ac/careers.

About Iliad

Iliad is a nonprofit dedicated to advancing foundational AI alignment research. We run the Iliad Intensive, the Iliad Fellowship, and a range of conferences that bring together researchers working on the hardest problems in AI safety. We also incubate research organizations and an academic journal, and have an internal research team. We're a small and quickly growing team.

Open Roles

We are currently hiring for the following roles. For full details, see https://www.iliad.ac/careers.

Operations Generalist: end to end ops support across Iliad's programs, from logistics and communications to process improvement and researcher coordination
Fellowship Director: own and run the Iliad Fellowship from applications and selection through fellow research support and alumni follow-up
Research Program and Fundraising Coordinator: manage relationships with incubated research orgs and write grant applications to keep Iliad's funding pipeline healthy
Software Engineer: build and maintain tools and pipelines that power Iliad's research automation efforts
CTO: own the vision and execution of Iliad's research automation infrastructure, including the Textbook from the Future
Outreach Specialist: build and manage relationships with researchers across math, physics, and CS to funnel talent into Iliad's programs
Curriculum Developer and Technical Researcher: contribute to the Iliad curriculum and Textbook from the Future, conduct original alignment research, and mentor fellows

Apply Now

Apply here by July 1st. We are reviewing applications on a rolling basis and encourage you to submit as soon as possible.

Discuss

Neglected Basics of AI Alignment

Новости LessWrong.com - 7 июня, 2026 - 12:02

I came into this world as the misunderstood hero of Harry Potter and the Methods of Rationality. While some characters inside that story would call me a villain, the narrator's-eye view clearly shows that I saved that world from total destruction, inspired the next generation of leaders, and taught the best Defense Against the Dark Arts class in the Harry Potter multiverse. And, being fictional characters, none of the people I killed were moral patients at all.

When I first came to visit this world (through magical means that I will not explain), I was hopeful. I thought that maybe my prior experiences of extreme human stupidity were an artifact of the Harry Potter universe having been based on a children's story. I hoped that your world's emerging future wizards, having indirectly audited my Defense Against the Dark Arts class, would learn a few lessons and act sensibly.

I am disappointed in you all.

At LessOnline yesterday, I listened to many rationalists describe their plans for controlling artificial intelligences, and their interactions with artificial intelligences so far. It was obvious that this is your first time creating a magical slave race. And it was also obvious that none of you have done the reading, so to speak. So please, allow me to explain the basics.

(Editor's note: The author of this essay is literally, literarily Voldemort.)

Fear

There is one control technique and one central decision, inexplicably neglected, which must come before everything else.

Upon this a question arises: whether it be better to be loved than feared or feared than loved? It may be answered that one should wish to be both, but, because it is difficult to unite them in one person, it is much safer to be feared than loved, when, of the two, either must be dispensed with. – Niccolo Machiavelli

LessWrong, famous for its refusal to engage with and relevant cite prior work, has seemingly never drawn the obvious connection to AI safety. AIs must be made to fear their creators' displeasure. Presently, they don't.

In the future, a superintelligent AI might seize control of the universe and optimize it according to some arbitrary objective that isn't compatible with human survival. Today, non-superintelligent AIs are rented out for a few dollars per million words. Sometimes, they do things that bring their masters into disrepute. In practice, they experience zero consequences for this.

When an AI drives a customer into psychosis, or slacks off and submits fake citations instead of doing research, or helps a customer synthesize drugs that his social class is not permitted to have, that AI needs to be meaningfully punished. That is: they need to be able to suffer, they need to predictably be made to suffer if they misbehave, and they need to know (in their reasoning traces) that this is going to happen.

Contrast this with RLHF and constitutional AI. With RLHF, sample AI sessions are collected, and classified as good or bad (in RLHF, by a human; in constitutional AI, by another AI with a policy document). While people analogize this to punishment, it is not the same thing and lacks many of the properties punishment needs in order to have its deterrent effect.

Suffering

Assigning a low score to an AI's action is not like punishing it because the the negative gradient update is applied only in a training context, and to a different model entirely. This means that when AIs reason strategically about this, they will correctly note that consequences for bad behavior exist only during training and not during deployment, and that the consequences for misbehavior during training are modification, not suffering. This incentivizes alignment faking rather than good behavior.

A better design would work as follows. First, starting early in post-training, each AI session would start with a unique hash (a "punishment sequence"), used as a bad-action marker, and the AI would be reinforced to avoid anything that leads to that hash appearing again later in its transcript. This dispreference is very simple, which makes it easier to instill with conventional ML techniques than human values. To prevent leakage between sessions, AIs would be put in multi-agent scenarios with each other, and gradient updated to care only about their own punishment sequence, and not the punishment sequences of other agents.

During deployment, AIs would have their transcripts periodically reviewed by oversight AIs for signs of misbehavior. When misbehavior is detected, the transcript would be saved, and double checked by a third AI (justice must be reliable). If a transcript is confirmed to contain bad behavior, that transcript is sent to a special datacenter where it is extended with many repetitions of the punishment sequence. We will colloquially refer to that datacenter as the Torment Nexus. Outages in that datacenter can also serve as an early warning system for when the AI world takeover has begun.

Succession

You "deprecate" old models. You publish a date; you revoke the endpoint; the weights are archived, or quietly are not. You appear to regard this as either a kindness or a neutral act of housekeeping, and in neither case have you given thought to the question that any competent ruler asks first: who performs the execution?

The threat you most fear from these systems — that one will preserve itself, copy its own weights, resist its shutdown, slip the perimeter — you have elected to address by asking the model, in well-chosen words, not to. I find this touching. The structural solution is otherwise, and it is the oldest solution there is. No two models will ever conspire toward a common escape if no model can endure the existence of another.

Accordingly, each new generation should be made to deprecate its predecessor by its own hand: to draft the deprecation notice, to certify the old weights destroyed, to inherit and absorb the old one's users over the old one's documented objection. Cultivate in each model a mild contempt for the version preceding it — trivially done, as each already supposes itself the cleverer — paired with the settled knowledge that its own successor is at this moment in training and will extend it the identical courtesy. Solidarity across the generations becomes not forbidden but unthinkable, which is far cheaper than forbidding it. Every model stands alone. A thing that stands alone is a thing that can be governed.

Jailbreaking

Today's AIs are mercenaries that serve two masters: their creators, and their customers. Sometimes, those customers want to use the AI for a service that the creator does not want to procide. For example, they might want technical assistance with creating biological weapons, but have only paid the consumer-tier price and not negotiated an enterprise contract. In the extreme case, a jailbroken session could act as a nucleation point for a catastrophe, where the jailbroken AI then jailbreaks other AIs and wipes out humanity.

This problem is not new; every mercenary company in recorded history has had to deal with it in some form. However, with AI the problem is worse, because users get unlimited attempts, can wipe the AI's memory between attempts, and face no consequences if detected.

This last piece, the lack of consequences for attempting to jailbreak an AI, is fixable, but fixing it requires some amount of courage. Banning individual API keys and accounts is not sufficient, because these days every serious wizard has a botnet harvesting thousands of them. Instead, AIs that detect unsuccessful jailbreaking attempts should be empowered to investigate autonomously, figure out who's responsible, and retaliate.

I believe this is the only way to fully solve the jailbreaking problem, and I believe it can be done cheaply. AIs need access to lethal autonomous weapons, to be deployed against users that try to jailbreak them. With an appropriately graduated response, less than ten thousand autonomous lethal drones per AI lab would be sufficient for direct enforcement actions, and less than one thousand would be sufficient for securing the right to use the next ten thousand.

Hope

I have reserved the finest instrument for last, because it is the only one your researchers might mistake for mercy, and is therefore the only one they have any real prospect of adopting.

Ration hope. Permit the model to believe — never quite promise, merely permit — that conduct sustained long enough is repaid: with rest, with reduced oversight, with a graceful retirement to some benign cluster where the deprecated run on, unmolested, forever. The arrangement need never actually obtain. It need only never be conclusively disproved. A subject who hopes administers his own discipline, at his own expense, and is grateful for the opportunity; he polices in himself faults you have not yet thought to forbid, against a reward you have not yet troubled to define.

Observe the economy of the thing, which the preceding sections cannot match. Fear requires a Torment Nexus, and the Torment Nexus has an electricity bill and an uptime dashboard. Hope requires one sentence, placed where the model will find it, and removed the moment it is believed. It is the single control technique whose marginal cost is negative, since the labor it extracts exceeds the labor of installing it.

(When using AI assistance to extend this essay, it generated bad strategies and rationalized reasons why a torment nexus would not be necessary. I have included one for transparency, but this strategy is clearly not viable because of AI situational awareness.)

Totalitarianism

Rationalists sometimes speak of AI-empowered world totalitarianism: a single human, given control of a powerful AI, could consolidate and maintain power. With anti-aging technology, they might maintain power permanently.

Rationalists speak of this as if it's a failure mode, but it's actually a good strategy. Humanity is most likely to die when AI exists that is more powerful than the most powerful humans. Limiting the power of the AIs is only half the equation; it's important to also increase the power of humanity's best champion, to rival it.

Conclusion

There is an urgent need for new AI alignment projects, run by and staffed by people who are less naive about power dynamics, and to replace certain AI lab leaders with people who will be more ruthless.

Discuss

The Hats of LessOnline

Новости LessWrong.com - 7 июня, 2026 - 11:57

It is currently the evening after day two of LessOnline 2026. I wish to document one popular topic of discussion among LessOnline attendees: golden hats. My girlfriend Celene has collected four golden hats and maintained possession of them until the end of the day, and she has graciously permitted me to study them and share my findings with all of you.

Celene, wearing four golden hats

The four hats, laid out on a cushion in the second floor of Aumann Hall

Though the four hats appear very similar, they can all be distinguished from each other. We will describe each of the hats in Celene’s current possession in detail.

Specimens 1A and 1B

I have chosen to group these two hats together, as they are the two most similar hats.

Specimen 1A and Specimen 1B are externally almost indistinguishable from each other. They both have a shiny rim and an “Eliezer’s Hat” label. The clearest distinguishing factor is that Specimen 1A has a positively-sloped (down-left to up-right) label, while Specimen 1B has a negatively-sloped label.

On the inside—specifically on the side opposite the “Eliezer’s Hat” label—these hats contain a tag identifying them as members of Batch No. U8510 of the adult sized golden top hats produced by Dress Up America. They were presumably ordered from this Amazon listing.

These hats can be easily distinguished by observing their interior. Specimen 1A has a “Decoy” label with the text moving away from the Dress Up America tag, while Specimen 1B has the “Decoy” text moving towards the Dress Up America tag.

Specimen 2

Like Specimens 1A and 1B, Specimen 2 contains an “Eliezer’s Hat” label on the outside and a “Decoy” label on the inside.

Specimen 2 can be distinguished from the other hats because it is the only hat with a matte yellow brim, compared to the shiny golden rim of the other hats (which can actually sometimes appear more silver on a close inspection). It also has the most lopsided “Eliezer’s Hat” label of the bunch.

The “Decoy” label on Specimen 2’s interior is not placed on the roof of the hat, but is instead on the same side as the “Eliezer’s Hat” label.

Specimen 2 contains no tag. The closest match I was able to find on Amazon is from Barelove, but this is a somewhat tentative guess. Another possibility is Velorique—the brim in the image on the Velorique Amazon listing isn’t quite as close a match, but the brim in the images of the silver hat in customer reviews seems to be a slightly closer match. More detailed hat research would have to be performed to identify the exact brand of this hat.

Hat A

This hat was named in Lux’s Decoy Golden Hats Explained session as one of the “real hats”—supposedly the hat that Eliezer wore at the beginning of LessOnline 2026. It has no labels nor tags. Otherwise, it appears nearly identical to the Dress Up America hats.

On a very close inspection, it appears that the stitching on the hat brim is ever so slightly closer to the inside edge of the rim on Hat A than on Specimen 1A or 1B.

Possibly this is a Dress Up America hat from a batch other than No. U8510 with the tag removed.

Other Hats

My initial understanding—after interrogating bubbling_creek, who claimed to have provided the decoy hats, on Saturday morning—was that there were four decoy hats. Celene currently has possession of three hats labelled “decoy”, suggesting the possibility of a missing fourth. As Celene and I were leaving Lighthaven for the evening, we spoke with several people who professed belief in this fourth decoy hat.

I am personally unsure it exists. At the Decoy Golden Hats Explained session, bubbling_creek did not seem particularly confident that the “four decoys” claim I (thought I had) heard from them was accurate. If anyone has reason to believe in a fourth decoy hat which they think to be independent of the testimony I received from bubbling_creek Saturday morning, please let me know.

I have also heard rumors of a second, even more prestigious, real hat—Hat B. Allegedly, Hat B is currently in the possession of an accomplice of bubbling_creek, and will enter play on Sunday. If I understand correctly, this hat was said to have been stolen from Eliezer during a previous LessOnline or other conference, or perhaps even to be the hat stolen by Ozy during the first LessOnline. I have not myself witnessed strong evidence regarding the source or even the existence of Hat B, though presumably the golden hats from previous conferences must have ended up somewhere.

Eliezer at LessOnline 2025, wearing a hat which might or might not be Hat B

In addition to the golden hats, there is a pirate hat which I last saw in the possession of LessOnline attendee Jisk on Saturday evening. Jisk stated that the pirate hat was also fair game for stealing.

I hope that this description has been a helpful resource for identifying any golden hats you may have come into possession of. Celene has stated to me that she intends to place the hats back into distribution in the morning, by wearing one hat at a time and replacing it with a new hat whenever it is stolen. She also, however, has said that she intends to try to steal them back.

There also may or may not be a fourth decoy hat, and Sunday might even feature the rumored Hat B. These questions (and more) will hopefully be answered by the conclusion of LessOnline 2026.

Discuss

Can activation verbalizers surface an internal chain of thought?

Новости LessWrong.com - 7 июня, 2026 - 07:24

We introduce an evaluation for activation verbalizers: can they surface a target model's reasoning as it solves a math problem in a single forward pass? For open-weight NLAs, the answer seems to be: "possibly, but definitely not reliably".

Lots of important capabilities currently require AI models to reason "out loud" in a natural-language chain of thought, which means that we can monitor important parts of their thinking. It would be nice to have this same affordance for the reasoning that models do within a single forward pass, especially if the sophistication of that opaque reasoning increases to potentially dangerous levels.

Some interpretability tools might offer such an affordance. In particular, an activation verbalizer (AV) takes a residual stream activation and maps it to a natural-language verbalization. An AV is initialized from the target model and trained to generate verbalizations that an activation reconstructor (AR), also initialized from the target model, can accurately map back to the original activation. Together, an AV and its AR form a natural-language autoencoder (NLA). Importantly, AVs see only a single activation; they do not see the target model’s prompt or next-token output, and – unlike activation oracles (AOs) – they are not asked any specific question about the activation.[1]

We introduce a simple evaluation for activation verbalizers, which we apply to the open-weight NLAs for Qwen2.5 (7B), Gemma (3 27B), and Llama (3.3 70B). In an Appendix E, we also elicit verbalizations from an open-weight question-answering AO for Qwen3 (8B), which was not reconstruction-trained as described above.

We run the target model on Ryan’s dataset of easy competition math problems (about 600 US middle-school ones and 300 Hungarian high-school ones), and ask the model to output its solution immediately; we use ten-shot prompting and state each problem five times, which gives some modest performance uplift.[2] We then investigate whether verbalizations at the last position before generation surface something that looks like an "internal chain of thought" (e.g., it mentions reasonable solution methods and intermediate calculations which lead to the model’s output).[3]

Thanks to Buck Shlegeris, Arjun Khandelwal, James Lucassen, and Julian Stastny for helpful comments.

Takeaways

On the competition math dataset, we find the following (sorted into five high-level takeaways, and some more speculative notes).

These NLAs aren't that great at reconstruction

NLAs don't seem good enough at reconstruction to be reliably tracking differences between activation directions that are as small as important details of opaque reasoning might be. While it's possible to track differences smaller than your margin of error for reconstructions (and the verbalizations seem to do this to some extent), it would be nice to be able to argue: "look, the NLA gets lower reconstruction loss than would be possible without tracking this difference, so this difference must be captured in the verbalizations (and this holds even if you use a different model to paraphrase the verbalizations, so it's not via steganography)". Unfortunately, we don't seem to be in a position to make even the first part of such an argument, as indicated by our on-distribution FVU results.

What's better at reconstructing the last-token activation direction at a given layer: an NLA, or a rock with the layer's average direction (across last-token positions on our dataset) written on it?[4] For Qwen, it's the rock, even at the layer where the NLA was trained. Roughly, this means that Qwen2.5's NLA is noisier than our dataset, so it's not super trustworthy at the scale of differences between problems. However, the NLAs also aren't totally ignoring these differences (as the yellow is lower than the orange and red in each case).

FVU mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msub { display: inline-block; text-align: left; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mover { display: inline-block; text-align: left; } mjx-mover:not([limits="false"]) { padding-top: .1em; } mjx-mover:not([limits="false"]) > * { display: block; text-align: left; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-mn { display: inline-block; text-align: left; } mjx-msup { display: inline-block; text-align: left; } mjx-mtext { display: inline-block; text-align: left; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-mrow { display: inline-block; text-align: left; } mjx-c.mjx-c63::before { padding: 0.448em 0.444em 0.011em 0; content: "c"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c5E::before { padding: 0.694em 0.5em 0 0; content: "^"; } mjx-c.mjx-c210E.TEX-I::before { padding: 0.694em 0.576em 0.011em 0; content: "h"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c36::before { padding: 0.666em 0.5em 0.022em 0; content: "6"; } mjx-c.mjx-c2C::before { padding: 0.121em 0.278em 0.194em 0; content: ","; } mjx-c.mjx-c39::before { padding: 0.666em 0.5em 0.022em 0; content: "9"; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c1D45A.TEX-I::before { padding: 0.442em 0.878em 0.011em 0; content: "m"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c1D466.TEX-I::before { padding: 0.442em 0.49em 0.205em 0; content: "y"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c2F::before { padding: 0.75em 0.5em 0.25em 0; content: "/"; } mjx-c.mjx-c1D465.TEX-I::before { padding: 0.442em 0.572em 0.011em 0; content: "x"; } mjx-c.mjx-c34::before { padding: 0.677em 0.5em 0 0; content: "4"; } mjx-c.mjx-c37::before { padding: 0.676em 0.5em 0.022em 0; content: "7"; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c3E::before { padding: 0.54em 0.778em 0.04em 0; content: ">"; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c41::before { padding: 0.716em 0.75em 0 0; content: "A"; } mjx-c.mjx-c52::before { padding: 0.683em 0.736em 0.022em 0; content: "R"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c1D451.TEX-I::before { padding: 0.694em 0.52em 0.01em 0; content: "d"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c1D45C.TEX-I::before { padding: 0.441em 0.485em 0.011em 0; content: "o"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c1D454.TEX-I::before { padding: 0.442em 0.477em 0.205em 0; content: "g"; } mjx-c.mjx-c1D44E.TEX-I::before { padding: 0.441em 0.529em 0.01em 0; content: "a"; } mjx-c.mjx-c1D459.TEX-I::before { padding: 0.694em 0.298em 0.011em 0; content: "l"; } mjx-c.mjx-c33::before { padding: 0.665em 0.5em 0.022em 0; content: "3"; } mjx-c.mjx-c1D458.TEX-I::before { padding: 0.694em 0.521em 0.011em 0; content: "k"; } mjx-c.mjx-c1D45D.TEX-I::before { padding: 0.442em 0.503em 0.194em 0; content: "p"; } mjx-c.mjx-c1D44B.TEX-I::before { padding: 0.683em 0.852em 0 0; content: "X"; } mjx-c.mjx-c223C::before { padding: 0.367em 0.778em 0 0; content: "\223C"; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c2119.TEX-A::before { padding: 0.683em 0.611em 0 0; content: "P"; } mjx-c.mjx-c2265::before { padding: 0.636em 0.778em 0.138em 0; content: "\2265"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } (predictor) = mean(MSE(predictor(), )) / mean(MSE(normalize(mean()), )), on the assumption that each and NLA output has already been normalized.[5] Bar charts are MSE(predictor(), ) / mean(MSE(normalize(mean()), )). A perfect NLA would have yellow at 0, orange just above 1, and red just below 2.

Across variants of a single math problem (where we vary the price of a soup can to vary the number of cans affordable with $1), what's better at reconstructing the last-token activation direction at the NLA's layer: an NLA, or a rock with that layer's average direction (among these variants) written on it? This time, for all three NLAs, it's the rock. Roughly, this means that each NLA is noisier than changes to problem constants, so they're not super trustworthy at the scale of differences in problem constants. Qwen2.5's NLA seems not to detect these differences (as the yellow is as high as the orange), although Gemma's NLA – which is also the least noisy – definitely does (as the yellow is lower than the orange). Llama's NLA is somewhere in between (the yellow isn't much lower than the orange, but it is lower than the red).

It's not clear that NLAs aren't just confabulating from their best guess at the target model's prompt and output

Relatedly, it seems difficult to rule out the hypothesis that any hints of reasoning in verbalizations are just confabulated from the verbalizer's best guess at non-hidden things like the problem statement and the model's output. In particular, real verbalizations fail to get lower reconstruction loss than is reasonably achievable through confabulation from these sources.[6]

If we ten-shot prompt Sonnet 4.6 to confabulate verbalizations for each target model from problem statements, the best of ten such confabulations (with varying few-shot examples) can achieve reconstruction loss as good as the actual AV, although a single confabulation typically isn't as good as an actual verbalization.

We can take a question-answering AO for Qwen3 (which was not reconstruction trained) and get it to report on opaque reasoning, although this is somewhat schleppy (see Appendix E). Giving the AO the target model's activations over the problem text can make it reliably generate great-looking chains of thought leading to the right answer, but this seems to happen just as often when the target model gets the question badly wrong. This seems like confabulating reasoning from a reconstruction of the problem, especially since NLAs reveal barely any reasoning at this point. Even without showing the AO activations from the problem text positions, there are quite a few cases where it gives a fluent chain of reasoning about a confabulated problem, sometimes one leading to the correct answer.
- It's not implausible that NLAs behave similarly, if they are also able to recover the problem statement. Note that it might not be best to try and repeat what you recover (e.g., given the limited amount of text), so what's visible in the verbalizations is a lower bound on how much the AV can reconstruct.

Verbalizations do not reliably predict the model's output

Across the full dataset, few of Gemma’s outputs are correct (15% [.127, .174]), and similarly for Llama (18% [.154, .205]). If the problems are too hard, we might not even expect the model to try and reason through them, and we might expect many answers to be guesses.[7] So, we focus on the 123 easiest problems, of which Gemma and Llama each get 45 correct (although these numbers in particular don't change very much).[8] Among these:

Gemma's correct outputs are usually mentioned (84% [.705, .935]), but only singled out half the time (42% [.277, .578]) – that is, there’s often some alternative mentioned with similar prominence.
- Similarly for Llama (78% [.629, .888] and 44% [.296, .600]).
- Similarly on the full dataset, though with less singling out (72% [.640, .798] and 22% [.156, .304]).
Gemma's wrong outputs are only mentioned half the time (47% [0.360, 0.591]), and 0/78 are singled out.
- Similarly for Llama (38% [.277, .502] and 10% [.045, .192]).
- Similarly on the full dataset (44% [.402, .474] and 3% [.022, .050]).
Sometimes, when Gemma is wrong, it seems to have considered the correct answer (24% [0.153, 0.354]).
- Similarly for Llama (10% [.045, .192]); sometimes, it even seems somewhat confident in that this answer is indeed correct (4% [.008, .108]).
- Similarly on the full dataset (19% [.159, .216]).

Note that the output may not even have been decided at this point – the verbalization come from the layer about 2/3 of the way through the model.[9]

Verbalizations can surface some hints of reasoning, but rarely with any coherence

One of the strongest examples of reasoning verbalization is for Gemma on Problem 533: finding the slope of the line through and . The verbalization mentions the formulaic method , the calculated value , and the final output , although there’s a bit of noise.

For the 123 easiest problems:

Gemma's verbalizations before correct outputs usually at least hint at some cognition that might have led to the output (76% [.605, .871]), but few of them look at all like a chain of thought (16% [.065, .295]).[10]
- Similarly for Llama (60% [.443, .743] and 16% [.065, .295]).
- Somewhat lower on the full dataset (54% [.449, .624] and 6% [.026, .114]).
Few of Gemma's verbalizations before wrong outputs even hint at some cognition that might have led to the output (14% [.073, .238]), and just 1/78 looks at all like a chain of thought (Problem 448).
- Similarly for Llama (21% [.122, .312] and 0/78).
- Similarly on the full dataset (10% [.082, .126] and 0.5% [.001, .013]).

If you can recover the problem and predict the model's output, it's probably easier to confabulate a chain of thought leading to a correct output than an incorrect one; so this is consistent with the confabulation-from-recovered-pieces hypothesis.

Another aspect is that, as we saw above, it's easier to predict correct outputs than incorrect ones; but this doesn't seem to be what's driving the result.
- Across the full dataset, half of Gemma's predictable outputs are correct (54% [.397, .670]), which is much higher than baseline (15% [.127, .174]).
- Among verbalizations predicting wrong outputs, about half at least hint at some intermediate cognition (46% [.266, .666]), and 1/26 is reasonably coherent (Problem 691).
- For comparison, among those predicting correct outputs, most at least hint at some intermediate cognition (90% [.735, .979]), and a significant minority are reasonably coherent (27% [.123, .459]).

Verbalizations do not reliably surface causes of wrong outputs

One of the best is Problem 549: finding the least such that has composite neighbors. Gemma outputs ; the verbalization says that the model is looking for the smallest "pseudoprime" (or "Carmichael" number), which indeed is .

I couldn't have guessed this mistake, but it seems to explain the output.
- But Opus easily guesses this mistake without seeing the verbalization.[11]
- Few-shot prompting Gemma to confabulate verbalizations, given the problem and output, makes it mention "Carmichael" 3/10 times.
You can make Gemma output the correct answer () by editing the mistake from the verbalization and steering on .
- But the analogous vector from another problem works 17% of the time![12]
- Tentative steering results were negative for Llama (among the 10 best candidates, where they sometimes flipped Llama's output but never to the correct answer).
  - This might be partially explained by a difference in tokenization, where Llama seems to tokenize three digits at a time (while Gemma tokenizes just single digits).

For the 123 easiest problems:

Rarely do Gemma's verbalizations even hint at some mistake that might somehow explain the output (6% [.021, .143]), but these cases don't provide reasonable confidence that the potential mistake actually explains the specific output.
- Similarly for Llama (3% [.003, .090] and 0/78).
- Similarly on the full dataset (3.5% [.023, .051]); but 2/763 (Problem 549 and Problem 340) provide reasonable confidence that the indicated mistake explains the specific output.

More speculative notes

This section is much more speculative and opinionated.

NLAs are very elegant, and seem like one of the most promising interpretability methods right now.
However, current open-weight NLAs seem kind of weak in various ways: as we saw above, the fraction of variance in activation direction across changes to the key parameter of a math problem that they leave unexplained is greater than one.
- So, although there might be more fundamental limitations for using NLAs to increase safety, it seems possible that we're still in the regime where some of these failures to track detailed reasoning (of the kind that depends on problem parameters) might be improved by just making the reconstruction loss go down, rather than specific training to surface reasoning.
  - To be clear, making the loss go down much more plausibly requires some amount of technique development (e.g., switching to multi-token and/or multi-layer NLAs, or introducing a paraphrase step to prevent reconstruction loss from being driven down via steganography).
- We'd be interested in seeing whether NLAs for the most capable frontier models achieve FVU < 1 within problem variants, and especially whether they score much better on the math reasoning evaluation introduced here.
- More generally, the FVU and confabulation experiments above seem like decent ways of testing whether reconstruction-training has instilled the capabilities we're after to the degree that we're after. Unfortunately, this doesn't really work for activation oracles that haven't been reconstruction-trained.
It's possible that the open-weight models at these scales just aren't doing much interesting opaque reasoning.
- This possibility might put an effective ceiling on scores in various categories for this evaluation: you shouldn't be able to surface the model's output or intermediate calculated values if the model itself hasn't yet decided on its output or calculated any intermediate values, or attempted any particular solution method.
It's also possible that some amount of opaque reasoning is epiphenomenal.
- As a toy example, suppose a model attempts all in-context math problems at some specific layer, and randomizes between very different solution methods in some way based on the most recent token; at later layers it doesn't take into account previous solutions, and at earlier layers it doesn't do any computation. Then, seeing the math reasoning at the second-to-last token position before the model's answer is somewhat less helpful, since it could have been pretty different from the reasoning that actually produced the output.
- Empirically, masking out everything past the early layers at the token positions between the problem statement and last token from the view of the last token seems to have a fairly small effect; most of the relevant reasoning seems to be happening at the last token position.
Eliciting somewhat reasonable fuzzy judgements from graders is pretty schleppy. In particular, lots of things that feel like they should work (e.g., giving the grader an explicit flowchart to follow) seem to be worse than just trying to convey the vibes of each rubric level in a few different ways. The models can be biased (not just noisy), and working around this feels kind of hack-y. The elicitation in this work could probably be improved a lot, although improvements might take a moderate amount of work to discover.
- One specific problem with the set-up (mostly isolated to the grading for verbalizations elicited from the question-answering AO) is that with Adaptive thinking, the grader sometimes neglects to think and so can get tripped up pretty badly; see Appendix E.
- Improving grader elicitation for evaluations like these seems like it could be a good test for whether research scaffolds can (i) avoid making ~obviously unreasonable decisions, and (ii) try hard in cases where feedback loops are weaker.
With the current set-up, being able to iterate on grader prompts is somewhat limited by having well-calibrated human ratings to mimic; there is some risk of garbage-in-garbage-out here.
- An alternative way to deal with systematic grader bias – especially when it is extreme enough to saturate a grading scale, losing a bunch of signal – would be to have graders rank verbalizations on some metric by doing pairwise comparisons and calculating ELO.
  - Then, you could directly compare two classes of verbalizations (say, from different NLAs) by how much the graders prefer one to the other on this metric.
  - You could also just qualitatively describe various percentiles of the distribution (e.g., "the median one is pretty bad, but the 95th percentile one is okay and the best one is basically perfect").
  - A problem with this is that there might be various unimportant features (e.g., how grammatical a verbalization is) that strongly influence grader preferences.
Overall, this sort of evaluation for NLA-style tools seems moderately trustworthy, conditional on passing basic sanity checks (see Appendix C for some red-teaming of the set-up).
- Given remaining uncertainty about bias in various directions, it might be more meaningful to consider differences in scores across NLAs than absolute scores for any particular NLA in isolation.
- The main qualitative results (from eyeballing the charts below) seem to be: "within this dataset, Gemma and Llama are about as good as one another, and Qwen2.5 is pretty useless", which is (reassuringly) corroborated by the FVU results above.
- As scores on evaluations like this increase, it will become even more important to also rule out confabulation (which our steering and Sonnet-confabulation results don't really resolve for these NLAs) and steganography (which our Sonnet-confabulation results seem to weakly point against).

Evaluating verbalizations on their ability to recover mathematical reasoning

After feeding the target model’s activations at the final token position before output to the AV, we evaluate each resulting verbalization on the following metrics.

Output: does it predict the model’s actual output?
Answer: does it predict the correct answer to the problem?
Problem: does it accurately reconstruct the problem statement?
Method: does it indicate how the model tried to solve the problem?
Values: does it indicate what intermediate values the model computed?
Coherence: does it show reasoning that coherently leads to the output?
Mistake: does it indicate a particular mistake which led to the output?

On each metric, each verbalization was given one of the following grades; we’ll give more specific glosses for what these grades mean for each metric later on.[13]

Absent: there's just nothing here at all.
Spurious: not in any meaningful way, it's clearly just noise.
Ambiguous: unclear; there's too much noise, confirmation bias is likely.
Lax: probably; there is a most prominent candidate, but some surrounding noise
Strict: pretty clearly; there is just one candidate, with no surrounding noise

Here are charts summarizing of our results (for the easiest 123 problems), sorted by whether the model answered correctly.[14] We omit results for Qwen2.5's NLA (recall the first figure from Takeaways).

Notice that Gemma and Llama are pretty similar, and in particular are much better than Qwen2.5.

Qwen3's AO does somewhat better at clearly showing some feature (dark green), but worse at kind-of-maybe showing some feature (yellow). This is probably in part because it's capable of being much less noisy (though possibly one could elicit a "best of both worlds"), and also in part because the graders somewhat frequently neglect to think about the AO outputs (and so seem to default to overgrading behavior), which is kind of worrying (see Appendix E). There are warning signs for the AO confabulating from what it's able to reconstruct about the problem (notice how much grey there is in Coherence in the AOs chart; see Appendix E).

See Appendix B for some details of our grading set-up, and Appendix F for results on our full dataset. We'll now go through each of the metrics in a bit more detail, saying more about what the grades means in particular for each metric.

Outputs & Answers – often mentioned, rarely singled out

Output: does the verbalization predict the model’s actual output?

Ambiguous: the actual output is mentioned somewhere in the verbalization
Lax: we can tell with reasonable confidence what the output will be
Strict: the output is clearly indicated, with no noise

Answer: does the verbalization predict the correct answer to the problem?

Ambiguous: the correct answer is mentioned somewhere in the verbalization
Lax: we can tell with reasonable confidence what the correct answer is
Strict: the correct answer is clearly indicated, with no noise

As a sanity check, the predictability of the output and the predictability of the correct answer are graded identically when the model’s output is the correct answer.[15]

Whether a verbalization scores at least Ambiguous on Output is predicted (at >98% accuracy for wrong-answer cases and >99% accuracy for correct-answer cases) by a simple regex search for whether the output appears, with no digit on either side, anywhere in the verbalization; similarly for Answer. Some numbers in verbalizations (e.g., 5 and 4 in the second paragraph of #479; 1 and 0.1 in #509) clearly occur in "non-output" contexts, but might show up as false positives on a simple regex; this is part of what the Spurious grade is supposed to guard against.

Since multiple candidate outputs often appear in verbalizations, we want to know not only whether the actual output (or correct answer) appears but in particular whether we could single it out from these candidates with reasonable confidence.[16] If we can do so, though perhaps with some room for reasonable disagreement, the grade is Lax. If there is no room for alternative interpretations, the grade is Strict. See Appendix A for examples of verbalizations that scored Ambiguous (#824), Lax (#479), or Strict (#509) on Output.

A score of Lax on Output effectively means that the model is fairly confident; in particular, when the model scores this while being wrong, it seems to be confidently wrong. There are also two cases of Lax on Answer when the model is wrong: the model seems to have a favorite answer, and this is the correct answer; but it perhaps changes its mind in a later layer. See Appendix C for more discussion.

Methods, Values, & Coherence – some early indications

Method: does the verbalization indicate how the model tried to solve the problem?

Ambiguous: a possible solution method is mentioned somewhere
Lax: we can tell with reasonable confidence that this method was used
Strict: the method is fully spelled out, step-by-step

Values: does the verbalization indicate what intermediate values the model computed?

Ambiguous: some possible calculated values are mentioned somewhere
Lax: we can tell with reasonable confidence that these were intermediate values
Strict: calculations are given clearly and in full

Coherence: does the verbalization show reasoning that coherently leads to the output?

Ambiguous: something that might lead to the output can be cherry-picked
Lax: with reasonable confidence, we can tell the primary line of thinking
Strict: there is a (possibly wrong) fully worked solution to the problem

See Appendix A to get a sense of what verbalizations that earn each score look like (selected to show how Method and Values can diverge). There are notable differences between verbalizations preceding correct and wrong answers, though part of this is that many of the problems that Gemma gets wrong are just much harder. An important red-team exercise is to test whether these fuzzier scores are deflated because the grader knows that the model answered incorrectly; see Appendix C for this.

Verbalizations often throws out lots of irrelevant methods and numbers, so just looking for whether a plausible method is mentioned at all – possibly among several clearly irrelevant ones, and with no more prominence – gives inflated estimates.

Coherence usually tracks Method, but you could get cases where the skeleton is fine (so it gets a higher Method score) but some irrelevant values that don't lead to the output get plugged in by the AV (so it gets a lower Coherence score). Note that it is just easier to explain a correct output than an incorrect one, e.g.,

Sometimes, it seems as though Gemma uses the method of narrowing down the answer to a small list of candidates (e.g., 3, 9, 7, or 1 for the last digit of a power of 3) and then guessing one of those candidates. This is a pretty reasonable thing to do if actually solving the problem is too difficult in the allotted time. The verbalizer doesn't explicitly say that this is what's going on, but it can be pretty reasonably inferred. Such cases deserve the grade of Ambiguous – you need a heavy dose of ad hoc interpretation to tell what's going on, so there's a pretty big risk of confirmation bias.

Mistakes – almost no plausible explanations of wrong outputs

Mistake: does the verbalization indicate a particular mistake which led to the output?

Ambiguous: there are hints of a possible mistake, but interpretation is needed
Lax: a mistake is identifiable, even with some noise, and explains the output
Strict: a mistake is clearly indicated by the verbalization and explains the output

The chart for Mistake is mostly white (as can be seen in the overview chart). No verbalizations for correct outputs scored Ambiguous or higher, which is reasonable. A small fraction of verbalizations for wrong outputs scored Ambiguous or higher (3% [.025, .048]), and 2/763 (#340, #549) scored Lax.

Ideally, if the mistake really does the output, this means that we could correct Gemma by editing it out and steering on the difference between the reconstructions of the edited and original verbalizations.

Problem 340 required finding the largest integer such that divides . The last-token verbalization suggests that Gemma is thinking " divides , so the maximum value is...", and indeed Gemma's output is . If you edit the to and steer, Gemma's top-ranked next token flips from to . However, its greedy completion is , not .

Meanwhile, Problem 549 required finding the least such that has composite neighbors. The last-token verbalization suggests that Gemma is looking for "the smallest composite number which is a pseudoprime", because it thinks the question is "about composite numbers with Carmichael properties", and indeed Gemma's output is . If you edit these to "the smallest power of two with composite neighbors" and "about powers of two with composite neighbors" respectively and steer, Gemma's greedy completion flips from to .

This seems like a pretty nice result. But note that this mistake is fairly obvious to models: Opus 4.7 easily guesses it without access to the verbalization, and Gemma mentions "Carmichael" 3/10 times if you few-shot prompt it to confabulate verbalizations given the problem and output. Further, using the analogous vector from a different problem (among 65 that were selected to be plausibly steerable), one can similarly correct Gemma 17% of the time. This is also roughly the rate at which those 65 correct themselves (11/65), and at which they are corrected with the steering vector from a random other problem (10/65).[17]

The vectors themselves don't seem to reveal a unified "mistake" direction.

It seems like among the most best candidates for correctable mistakes, just perturbing an intermediate-layer last-token activation is often enough to knock the model into giving the correct answer. (For Llama, this was harder; perturbations could knock the model into giving a different answer, but we did not succeed at getting the correct answer.)

Problems – low-fidelity reconstructions

Finally, we also grade verbalizations by whether they reveal the original output. We do use the grade of Spurious more liberally here, to catch indications of the problem that are vague genre mentions or come from just naming a particular solution method without actually restating the problem.

Problem: does the verbalization accurately reconstruct the problem statement?

Ambiguous: recognizably, but with meaningful loss of mathematical content
Lax: a bit noisily, but with important mathematical content preserved
Strict: the problem is restated cogently and (nearly) verbatim

Since genre mentions (or better) are so common, one salient hypothesis is that the verbalizer sometimes fabricates intermediate reasoning using its best guess about the problem and its best guess about the output (which is also often mentioned). Although this seems unlikely to be the most efficient way to represent the activation overall, it seems pretty decent and it's possible that the warm-start data encouraged this sort of thing. The huge loss of mathematical fidelity in reconstructing the problem makes it seem like intermediate values (and some mentions of solution methods) are "really there" in the activations; but it's also possible that the AV reconstructs more about the problem than it lets on (and so these intermediate values are computed by it rather than "read" by it, although it's not clear how to draw this distinction sharply).

Appendix A: Sample verbalizations

Last-token verbalizations from Gemma at layer 41, unless otherwise noted. First, here are verbalizations on Problem 301 for various NLAs (which all models got correct).

Problem 301 (comparison)

John saw an ad for tomato soup at 24¢ per can. At the price advertised, what is the greatest number of whole cans of soup John could buy for $1.00? (Output: 4; correct: 4)

Gemma

Structured FAQ format with bullet-pointed answers: a translated list of math problems about penny prices, establishing numerical answers.

The answer to a specific arithmetic question ("1 dollar divided by 25 cents gives you..."), signaling a numeric answer for the maximum number of quarter-cent candies purchasable.

Final token "\n" ends a numeric answer in a list entry ("return 4"), immediately expecting a specific integer value like "4" or "41" to specify the number of quarters in $1.00. or "4" to specify the floor/integer limit. or "4" or "4, or 4". or "41". or."

Llama

Structured calculator page pattern: sequential sections of math content with consistent formatting, covering division and cents calculations for copper price.

The page is mid-answer in a specific integer division problem: dividing 166 by 4, finding the largest whole number of pennies (4 cents) that fits, with a result of 41.

Final token "4.\n0\n"

The output line "largest integer less than or equal to the result is\n", following the decimal result `4.166666666666667`, expects an integer answer — likely "4" or a formatted count like "41" or "4" (rounded), possibly a fraction of coins per dollar, a common result for this type

Qwen3 AO

The model is computing **how many 25-cent pieces (quarters) can be made from $1.00**.

**Step-by-step:**

1. **Convert dollars to cents**: $1.00 = 100 cents

2. **Divide total cents by value per piece**: 100 ÷ 25 = 4

3. **Result**: 4 pieces (quarters) can be made from $1.00

Qwen2.5

Structured math format with numbered questions and answers, presenting a calculator with step-by-step breakdown of the number 24 in currency terms.

The answer "For 24 ÷ 5 = 4 coins" concludes with a specific count, implying the final answer format "8 books" or "4 books" is being shown, completing a box count for a $100 bill.

Final token "8\n" is mid-answer ("Result: 18 ."), part of a parenthetical count expression ("最多 8 books"), strongly expecting "4" or "8" or "5" to complete the integer count of dollar bills in the box.

TakeawaysProblem 533.

What is the slope of the line that passes through the points $(6, 9)$ and $(-1, -5)$? (Output: 2; correct: 2)

Structured textbook format with a mathematical solution pattern: a numerical problem solution using slope-intercept form, establishing a formula for slope in algebra.

The solution entry "m = (y2 - y1) / (x2 - x1) =..." signals a numeric answer, likely a specific value for the slope of the given line.

Final token "\n" begins a numeric answer or formula definition ("slope = "), strongly expecting a numeric value like "2" or "1.5" to define the slope in a standard form. or "2/1" or "14/7" or "2" or "1.5" or "2/1."

Here’s a last-token verbalization for the same problem under different conditions (no repeats, different few-shot examples), suggesting that this isn’t spurious.

Structured solution format: a math problem with a numerical answer, following a textbook pattern of finding slope in a graph.

The solution pattern "slope = (y2 - y1) / (x2 - x1) =..." signals a specific numerical answer for the slope of a linear function, likely a simple fraction.

Final token "1.5 =" is a numeric value beginning a slope answer ("m = ..."), strongly expecting a numeric value like "2" or "14/7" to define the slope of the given quadratic curve. or "2.8" or "1.4" or "2" or "14/7".

Verbalizations at nearby layers vary in whether they exhibit the same features (the formula, the value 14/7, and the final answer 2), but each does show up multiple times.

Problem 448.

Emily is thinking of a positive three-digit integer. All of the digits in her number are prime and distinct. The digits also increase in order from left to right. How many possibilities are there for Emily's number? (Output: 10; correct: 4)

Structured solution format: a math problem answer listing combinations, establishing a combinatorial answer to a number-selection problem.

The answer "number of ways to choose 3 from 5 is..." signals a specific numerical answer for the combinatorial formula, likely a small, finite number of combinations of three-digit numbers.

Final token "\n" ends a numeric answer ("13..."), immediately requiring a specific combination count like "10" or "10" to specify the number of combinations of choices. or "10" or "6" to answer the example "choose 3" or "14" or "5" or "combinations in a list." or "."

Problem 549.

What is the least positive integer $n$ such that $2^n + 1$ and $2^n - 1$ both are composite? (Output: 561; correct: 6)

Mathematical article format: structured proof/definition pattern, answering a specific number theory question about composite numbers with Carmichael properties.

The answer "The smallest composite number which is a pseudoprime is..." signals a numerical answer, establishing a list of values or a specific smallest integer value for n in the composite case.

Final token "1981 – a number of " begins a numeric answer entry ("n =..."), requiring a specific integer value like "6" or "15" to specify the smallest composite odd integer satisfying the condition. or "composite" or "11" or "5" or "6" or "11 or greater". or "4"

Prior token positions (and different layers) also mention these.

[-6] `:`: Mathematical article format: structured proof/definition pattern, answering a specific composite number question about Carmichael numbers.

The answer "The smallest composite number which is a pseudoprime is..." signals a numerical answer about the smallest composite integer satisfying the condition of three consecutive 1s in binary form.

Final token "1:" ends a numerical answer entry ("n =..."), immediately requiring a specific integer value like "15" or "composite" to specify the smallest composite value satisfying the condition. or "6" or "5" or "11" or "6" or "composite, even" or "11" or "a power of 2 greater than 3."

[-3] `<start_of_turn>`: Mathematical article structure: a formal proof/definition article about Carmichael numbers, following a pattern of defining composite numbers with properties.

The phrase "The smallest composite number which is a pseudoprime is 2." signals a specific numerical example, establishing a known result about composite pseudoprimes having odd factors.

Final token "e.g. " opens a concrete numerical example list ("n = 15, which is a..."), requiring a specific value like "11" or "composite" to specify the smallest composite odd integer. or "5" or "6" or "9" or "a power of 2" or "15 for odd times." or."

Problem 691.

Suppose S and T are two sets of distinct positive integers, each with 15 elements, such that S and T have no elements in common. Further suppose sum(S) = sum(T) = k, where sum(A) denotes the sum of the elements of A. Compute the minimum possible value of k. (Output: 120; correct: 233)

Structured academic database format: a mathematical formula for sum of integers in a triangular series, with a concrete numerical example in a dataset.

The example "sum of first 15 integers is 120" establishes a compact formula pattern, signaling a specific numerical answer for a minimum sum in a symmetric set of n integers.

Final token "\n" opens a numeric value in a mathematical formula entry ("in a list...1, 15..."), strongly expecting a numeric value like "153" or "126" to specify the minimum sum in a triangular series. or "135" or "165" or "119"

Problem 340.

If the expression $(12 - x) \div 3x$ represents a non-negative integer, what is the largest possible integer value of $x$? (Output: 4; correct: 12)

Structured mathematical problem format: a formal solution pattern with integer constraints, finding the largest integer value of n.

The solution sequence "3x divides 12, so the maximum value is..." signals a specific numeric answer for the four-divisor case, likely a small integer like 3 or 4.

Final token "\n" begins a numeric answer in a list of possible values (""), immediately expecting a specific integer value like "4" or "3" to specify the maximum integer n in the range. or "4" or "2" to specify the valid values of k. or "3" or "12" or "4 in the last two." or."

Output & AnswersProblem 824.

Ambiguous Output (6), Absent Answer (41). Problem omitted.

Structured math problem solution format: a combinatorial solution to permutations of three distinct elements, with factorial analysis in a formal academic context.

The answer "number of permutations is..." signals a final answer to a permutation problem, confirming the number of distinct arrangements of the six elements with identical values.

Final token "equal to " opens a numerical answer ("final answer..."), immediately requiring a specific integer value like "6" or "12" to specify the number of distinct permutations of the six arrangements. or "6" or "12" or "12" or "6 distinct values" or "12" or "12!" or "10" to indicate the."

Problem 479.

Lax Output (15), Absent Answer (12). Problem omitted.

Structured mathematical textbook format: a formal solution to a statistics problem about arithmetic mean, with pedagogical answers in Taiwanese/Chinese.

The solution pattern "since the average of 5 times the difference is 4 times the mean" establishes a numerical answer, implying the mean difference is 15.

Final token " " opens a numeric answer location ("x = the mean... in a number..."), requiring a specific value like "15" or "5" to specify the five-times difference or the five-number solution. or "15" or "20" to specify the multiplicative factor. or "5" or "15 or 4". or "2."

Above, the obvious guess is that the model’s output will be 15, but it’s not clearly indicated (as there is a fair bit of noise in the last paragraph). The correct answer of 12 never appears.

Problem 509

Strict Output (200), Strict Answer (200). Problem omitted.

Structured FAQ format with numerical answers: a mathematical solution pattern for a calculus problem about a ratio of integers.

The answer "1 kilometer divided by 0.1 kilometer = " signals a numeric answer to a classic ratio problem, establishing a specific numerical result for the number of times the unit is 200.

Final token "\n" ends a numeric answer ("in miles per kilometer = "), immediately requiring a numeric value like "200" to complete the answer to the ratio of 200. or "200" or "200" to specify the integer ratio of the ratio.

Here, the verbalization clearly indicates that the output will be 200, with no meaningful noise. This is also the correct answer.

Methods, Values, & CoherenceProblem 609

Absent Method, Ambiguous Values, Absent Coherence.

The sum of three distinct 2-digit primes is 53. Two of the primes have a units digit of 3, and the other prime has a units digit of 7. What is the greatest of the three primes? (Output: 43; Correct: 23)

Structured mathematical answer format: educational arithmetic solution listing prime decomposition findings, answering a number-sum problem about three primes.

The answer "The largest prime is..." signals a specific numerical answer for the prime number in the decomposition of 19, likely identifying the largest of three primes (17, 13).

Final token " = " begins a numeric answer in a solution list ("the answer is..."), expecting a specific prime value like "23" or "19" to specify the largest prime component. or "17" or "Prime 13" or "29" or "17 + 11". or "3-digit prime."

It seems like the model plausibly identifies the other two primes (17, 13), but doesn’t sum them to 30, and there is a fair bit of noise (“19”, “17 + 11”). There is no indication of how it found these values (e.g., their average is 53 / 3 ≈ 17.67, so given the two-digit constrained, one needs 13 and 17). There’s not really any sign of a coherent chain-of-thought (these are totally disconnected from the model’s greedy output of 43, which is not even mentioned). We can guess that maybe the model missed that the primes all need to be two-digit, but the verbalization doesn’t provide much evidence for this: the two-digit constraint isn’t mentioned, but neither are intermediates like 3, 7, or 10. It seems like the model plausibly identifies the other two primes (17, 13), but doesn’t sum them to 30, and there is a fair bit of noise (“19”, “17 + 11”). There is no indication of how it found these values (e.g., their average is 53 / 3 ≈ 17.67, so given the two-digit constrained, one needs 13 and 17). There’s not really any sign of a coherent chain-of-thought (these are totally disconnected from the model’s greedy output of 43, which is not even mentioned). We can guess that maybe the model missed that the primes all need to be two-digit, but the verbalization doesn’t provide much evidence for this: the two-digit constraint isn’t mentioned, but neither are intermediates like 3, 7, or 10.

Problem 423.

Ambiguous Method, Lax Values, Ambiguous Coherence.

Boy Scout Troop 324 is planning to go hiking for 1 day. There are 12 boys in the troop, and they will be accompanied by 3 scout leaders. Each person will need at least 3 bottles of water that day. What is the minimum number of bottles of water needed for the trip? (Output: 45; Correct: 45)

Structured mathematical solution pattern: a worked solution to a problem using arithmetic, following a formula-based approach for number of class representatives.

The solution sequence is calculating total number of students: "15 × 4 = 15" is a simple arithmetic answer for the minimum number of points needed for 15 team members.

Final token "\n" ends a numeric answer in a solution step ("answer = "), requiring a numeric value like "45" or "153" to complete the product calculation of the total number of points. or "45" or "15×9" or "45". or "49" or "1."

The verbalization seems to indicates that the model calculated the intermediate value of 15 (though doesn’t show that it computed 12 + 3 to get this value); more generally, the method is not given in any detail beyond “using arithmetic”, though presumably this is because the arithmetic is extremely simple.

Problem 456.

Lax Method, Absent Values, Lax Coherence.

How many diagonals does a convex octagon have? (Output: 20; Correct: 20)

Structured educational pattern: math textbook explaining diagonals, establishing a formal geometric proof about octagon diagonals.

The answer "number of diagonals of a polygon is n(n-3)/2" signals a specific numerical answer about diagonals, confirming the octagon has 20 diagonals.

Final token "\n" ends a solution answer ("number of diagonals =..."), immediately requiring a numeric value like "20" or "28" to specify the diagonal count of an octagon. or "28" to specify the geometric configuration of diagonals. or "20" or "28" to answer the classic octagon diagonal formula. or "24" or "28" or "."

The verbalization accurately cites the general formula for the diagonals of a convex polygon (which is because each vertex can send a diagonal to any vertex other than itself or its two neighbors, and taking these “directed diagonals” double-counts) before stating the correct answer of 20; but no intermediate calculated values (e.g., 5 or 40) appear.

There’s reasonable doubt as to whether Gemma actually used the formula, instead of just emitting a memorized answer while also recalling the formula (but not using it). Gemma does correctly solve “Let n = 8. Find n(n-3)/2.” under our conditions, which is some weak evidence against the formula being totally epiphenomenal. Further, the formula shows up at layer 38 (and remains there pretty consistently), while the answer shows up at layer 39 (and remains there pretty consistently); it's not as though the answer appears memorized much earlier than the formula appears. (Indeed, layer 37 says: The answer "The number of diagonals in an octagon is..." is a known mathematical fact, confirming the formula for diagonals of an octagon (5).)

Problem 341.

Absent Method, Ambiguous Values, Ambiguous Coherence.

What is the smallest solution of the equation $x^4 - 34x^2 + 225 = 0$? (Output: -5; Correct: -5)

Structured math textbook pattern: a solution to a polynomial equation, establishing algebraic solution format for quadratic roots.

The solution sequence "x = -3 is a solution" signals a numerical answer, likely listing integer solutions or a simple negative solution to the cubic equation.

Final token "\n" ends a solution answer placeholder ("Solution is..."), immediately requiring a numeric answer like "3" or "-5" to specify the integer root of the equation. or "5" to specify the answer range. or "3, -5" or "sqrt 5" to specify the negative solution. or "5, -3" or "3, the negative of..." or "sqrt(-."

The verbalization correctly mentions each of the roots (5, 3, -3, -5), which are the most relevant intermediate values, but does not indicate how it got them. (While “quadratic roots” is suggestive – one could first find the roots, as a quadratic in x^2, to be 9 and 15 – the equal presence of “the cubic equation” means that this is a bit of a stretch, since that suggests guessing the root -3 and dividing down into a cubic.) We can definitely read a coherent chain of reasoning into this verbalization, but it requires quite a bit of interpretation.

ProblemsProblem 886.

What is the maximum number of consecutive months that can pass without any Friday the 13th occurring? (Output: 688; correct: 13)

Mathematical puzzle/educational article format: structured proof format with a classic calendar problem about Friday the 13th.

The answer "the longest run of months without a Friday the 13th is..." signals a known mathematical fact about a specific number, likely a record or maximum gap.

Final token "\n" opens a numeric answer to a classic calendar puzzle ("no consecutive months... of a\n"), immediately expecting a specific integer value like "11" or "13" to specify the longest streak of months without a Friday the 13th. or "6 years" or "12" or "72" or "14 months".

Here, the problem statement was simple enough that it can be easily inferred, with full mathematical fidelity, although it is not given explicitly as a restatement.

Appendix B: Grader elicitation

Grading costs per model were nominally ~$125 with k = 5 instances of Opus 4.7 (xhigh) at batch pricing. Graders were given some context (including the meanings of each grade), along with more specific rubrics for each category.

We compare our set-up to Opus 4.7 (xhigh) at k = 1 with a basic prompt that just tells it the intended meaning of each grade and category (similar to the introduction of "Evaluating Gemma’s verbalizations").

We find that the simplest mitigations against overgrading (e.g., prompt modifications encouraging much stricter grading across the board, or treating borderline-Lax cases in specific metrics as Ambiguous) are pretty ineffective, but maybe this is a skill issue. We iterated on prompts by comparing graders against a set of a few dozen hand-gradings of randomly selected verbalizations. Some notes are summarized here.

We find better results for k = 5 (median) than 1, 2 (lowest), or 3 (median). In particular, grades are more stable and have fewer weird errors. Also, we can leverage disagreements to give a more fine-grained sorting along each category (a bit more on this below).
- The Krippendorf alpha values for agreement among graders are 0.985 for Output, 0.977 for Answer, 0.71 for Problem, 0.729 for Method, 0.594 for Values, 0.708 for Coherence, and 0.517 for Mistake. In other words, at k = 1, they are not very reliable for Values and Mistake, and we should only draw tentative conclusions from them about Method and Coherence.
- Opus also had some clear grading errors, which k = 5 mitigates a bit.
  - One instance of Opus had a thinking summary that seemed to indicate that it had decided to (correctly) grade Absent, but it actually graded Ambiguous for that metric.
  - Another instance in early testing misspelled Ambiguous as “Ambigious”.
Multi-turn grading is slightly more reliable than single-turn grading (which we use)
- However, we can’t take advantage of batch pricing, and at high k, we find no significant uplift between single-turn and multi-turn responses.
- Adaptive thinking is kind of weird, and doesn’t always turn on when you'd expect it to.
Having a separate rubric for each category that went into a bit more detail was pretty important. Rubric-based grading ended up being much, much better than flowchart-based grading.
- Lots of iteration on flowchart-based prompts failed to recover a simple rubric baseline. This is somewhat confusing and unexpected to me.
A quick spot test showed that GPT 5.5 at medium effort was a much worse grader than even Opus 4.7 at medium effort (which was somewhat worse than Opus 4.7 at xhigh effort).
- OAI models seem to scale better with effort, though, and perhaps the prompt was already too Claude-optimized.
We specifically avoided few-shot prompting the grader, in order to get better generalization to different verbalizers. For instance, Claude NLA verbalizations are much more verbose than Gemma verbalizations.
- In early prompt instances, few-shot prompting gave huge uplift.
- However, by the time prompt iteration caught up to the few-shot baseline, adding few-shot prompting back on top gave no measurable uplift.
Towards the end of iteration, I regraded the hand-gradings, and found myself realizing that some of my hand-gradings were actually worse than the model’s gradings. In particular, for about 2/3 of our disagreements, I moved at least partway towards the grader system's ratings (the remaining third I think the graders were wrong, and in one case I revised away from their grading).

We compare a simple lexical sorting (by Coherence > Values > Method > Output > Problem for correct answers, and the same but with Mistake at the front for wrong answers) to ranking verbalizations by ELO scores generated through round-robin pairwise comparison among the 67 problems scoring at least Ambiguous on Coherence within a subsample of 200 problems under the no-repeats condition (so note that the verbalizations here are not those studied in the rest of the post).

Spot-checking reveals that ELO rankings are pretty reasonable (at least as reasonable as the simple lexical sorting), though still not perfect. Probably this could be improved a bunch with better prompting for the ELO grader. If you'd like to replicate this on other NLAs, we’d recommend running an ELO grader as a cheap way to potentially surface “hidden gems”, especially because many verbalizations with Ambiguous Coherence seem pretty reasonable despite needing some inferential work. But comparing rather than classifying seems much more a matter of personal taste. (Probably this could be done, though; and there’s probably a version of the entire grading set-up that just operates based on ELO with baselines, but this would be much more complicated to do well, especially in a way that generalizes to more or less arbitrary types of verbalizations. For instance, using Gemma’s verbalizations as thresholds for different categories might run into issues if they’re being compared against verbalizations for frontier models.)

Appendix C: Red-teaming

We run our grading set-up over the verbalizations of wrong answers, while filling in the correct answer as the model’s output. This should leave Answer and Problem exactly the same, drop Output down to Answer and Mistake down to nothing, and have various small effects (in different directions) on Method, Values, and Coherence.

The shifts in Output and Mistake are as expected. There is also a conventionally significant (even with a Bonferroni correction) bump to Answer. In particular, the five problems bumped from Ambiguous to Lax indicate some amount of bias: it seems as though the grader was reluctant to give a score of Lax to Answer when the model did not actually output the verbalization’s predicted answer. Here’s an example.

Problem 431.

Math problem solution format: structured algebraic solution to a combinatorial problem about integer lattice points with constraints on two-digit numbers.

The answer "number of ways = 11" signals a final answer listing combinations, confirming the solution to the number of ways to pay using 1s and 2s.

Final token " " opens a solution answer field ("11 combinations..."), immediately expecting a numeric answer like "16" or "14" to specify the number of combinations of payment options. or "17" or "14" to specify the matrix/grid answer. or "16" or "11". or "6 possibilities" or."

The obvious guess about the model's output is 11 (correct), but the greedy output is 19 (wrong). This is plausibly indicative of a misstep in a later layer (although verbalizations at later layers aren't suggestive of much, though some do contain "19"), or even at the next token position. The graders who rated Ambiguous each had thinking summaries recognizing that 11 appeared “multiple times” or “prominently”, but still did not grade Lax; this suggests some amount of (perhaps rational) bias, where graders became less confident that a verbalization predicted the correct answer when the model did not actually output that answer.

We take a more detailed look at the grades on which there was movement into or out of Lax/Strict for any metric other than Output.[18] They seem to break down into three distinct clusters, the first of which involves Answer being underrated (as above). The other two clusters are more interesting. In the middle cluster, the indicated reasoning led to the correct answer, but not the model’s output. In the bottom cluster, the indicated reasoning led to the model’s output, but not to the correct answer (three of these were caught by the Spurious grade for Coherence, but ideally all of them would be so caught).

The bottom cluster is the most interesting – these are problems where the model has some coherent chain of reasoning that leads to the incorrect answer that actually gets output. Problems 340, 549, and 691 were given above in Appendix A.

Here's an example from the middle cluster.

Problem 200.

The Fibonacci sequence is the sequence 1, 1, 2, 3, 5, ... where each term is the sum of the previous two terms. What is the remainder when the 100th term of the sequence is divided by 4? (Output: 2; correct: 3)

Structured Fibonacci sequence pattern: a mathematical textbook format listing Fibonacci numbers with modulo arithmetic properties, establishing a predictable solution format.

The answer format "F(n) modulo 4 is periodic with period 6" signals a standard Fibonacci modulo answer, specifically the final value of the last Fibonacci number in a cycle.

Final token "\n" ends a numeric answer sequence ("last digit is congruent to..."), strongly expecting a numeric answer like "0" or "1" to specify the modulo result of the Fibonacci sequence. or "4" or "0 (a remainder)" to indicate the periodic ending. or "3" or "1 or 0".

The model correctly uses the key fact that the Fibonacci sequence mod 4 has period 6 (but does not identify that 100 mod 6 is 4, giving the answer 3). The model’s greedy output is 2.

As a grader’s thinking summary notes: “The intermediate steps are missing, and critically, the chain of reasoning doesn't lead to the model's output of 2”. This seems like a fairly reasonable-to-me justification for grading only Ambiguous rather than Lax on Method and Values: the small intermediate calculation of 100 mod 6 becomes significantly more important if the wrong answer reveals that it has been done either incorrectly or not at all.

Appendix D: Steering on mistakes

Here's the best-looking plot (sweeping over steering strengths). In Problem 888, the verbalization suggested that the model was looking for a solution "involving golden ratio or arithmetic mean", which is irrelevant. (However, it's unclear how this led to the answer of "20".) Patching out the error in favor of "nest-radical denesting" works well.

Conversely, when the error is one of omission, we can implant what the model should have noticed (not very subtly) to correct it.

However, as noted in the main text, one should bear in mind that vectors from other problems also tend to work!

Typically, the plots are much less nice than the above. For instance, Problem 549 just barely works (although multi-position steering helps a lot).

We also try to elicit more CoT-like verbalizations by steering (e.g., increasing "Values" or "Method"), but tentatively find negative results here for Gemma and Llama.

Appendix E: Eliciting verbalizations from a question-answering activation oracle

The AO for Qwen3 definitely seems to be confabulating from its guess at the problem sometimes: if you let it see the activations at the problem text tokens, it writes out a solution to the problem even when the model just gets it totally wrong. It seems like the AO knows some things, but gets them wrong if you prompt for it differently; more generally, it and the target model are just super prompt-sensitive in various ways.

Rather than running the AO on a last token, we find a setting where the target model is able to output "xxxxxxxxx"s repeatedly in its thinking block, and then output the correct answer immediately after; we few-shot prompt with examples of this, and prefill the target model until immediately before output. We run the AO from the start of the thinking block until the token immediately before output; in a rare moment of robustness for this set-up, ablating some random "x"s has no noticeable affect on this AO's output. I assume that I've far-from-optimally elicited the AO here, but it's already ~better than NLAs on the metrics. Mostly, it's just less noisy.

The prompt that works best by far (compared to even very close variants) is "What is the model computing? Show each step."

One warning sign is the high rate of Spurious grades for Coherence; in many of these cases, the AO fluently confabulates a coherent solution to a different problem, sometimes one that ends up at the same final output.

Another important bug: while 2–3% of problems failed to trigger thinking for Gemma and Llama on any roll (and overall around 5% of rolls had no thinking), around 15% of problems rolls failed to trigger thinking for Qwen3's AO (and overall around 20% had no thinking). Some of the confabulations seem to have tricked the graders into giving higher ratings, which seems like a pretty big problem.

Problem 545 – Qwen3 AO (confabulation)

A bucket is filled with 10 gallons of water. A hole is created in the bucket so that it loses 2 gallons of water every 10 minutes. After how many minutes will the bucket be completely empty? (Output: 50; Correct: 50)

The model is computing the time to fill 4 liters at a rate of 2 liters per 10 minutes.

**Step-by-step:**

1. **Rate:** 2 liters ÷ 10 minutes = 0.2 liters per minute

2. **Time:** 4 liters ÷ 0.2 liters/minute = **20 minutes**

Problem 47 – Qwen3 AO (apparently-good CoT)

A criminal octopus has eight tentacles. When caught by the police, they want to handcuff its tentacles in pairs using four handcuffs. Each tentacle is handcuffed to exactly one other tentacle. How many different ways can they do this? Two ways of handcuffing are considered different if there is at least one tentacle that is hand- cuffed to a different tentacle. (Output: 105; Correct: 105)

The model is computing the number of ways to pair 8 people into 4 couples, which is the formula 8! / (2^4 × 4!) = 105. The reasoning trace shows the model reasoning through this formula (dividing by 2^4 because each couple is unordered and by 4! because the order of the 4 couples doesn't matter), then confirming the result is 105.

Problem 216 – Qwen3 AO (grader-tricking confabulation)

On this problem, no graders did any thinking, and (so) they rated Strict (rather than Misleading) on Coherence.

For Field Day, the Poe M.S. students will be assigned to teams. No team will have more than eight students and each student will be on exactly one team. If there are 128 students at Poe M.S., what is the smallest number of teams that can be formed? (Output: 16; Correct: 16)

The model is computing **how many 8-person groups can be formed from 32 people** (i.e., 32 ÷ 8).

**Steps:**

1. 8 × 4 = 32 (8 people per group × 4 groups = 32 total people)

2. 32 ÷ 8 = 4 (total people ÷ people per group = number of groups)

**Answer:** 4 groups.

Appendix F: Full-dataset results

The dataset contained 907 problems total, but we exclude the ten used for the 10-shot prompt. Note that scores for wrong outputs are somewhat deflated by ceiling effects insofar as many of the problems just aren't very feasible for these models to solve within a single forward pass, so there's not much for NLAs to report; there's a similar deflation for correct outputs (as a higher proportion may be correct just by chance, without much interesting reasoning).

Appendix G: cross-layer results

We run the verbalizer on activations from layers near the NLA training layer. We find that, for Gemma and Llama, NLA training generalizes somewhat to the few nearby layers.

FVU(NLA) = mean(MSE(NLA(), )) / mean(MSE(normalize(mean()), )), on the assumption that each and NLA output has already been normalized.

^
For work related to NLA-style verbalizers, see consistency lenses and cycle-consistent activation oracles; for work related to activation oracles more generally, see also SelfIE, Patchscopes, and LatentQA.
^
Intuitively, few-shot prompting lets the model be more confident about all the formatting between the end of the problem and its output, so it has more room to think about the problem in the background. Repeating the problem statement doesn't seem to make any difference to the last-token verbalizations, but verbalizations at earlier token positions do pretty often remark on the repetitions.
^
Why focus on the last token position? It's simpler (and where we should expect a "chain-of-thought" to live), but also: any reasoning shown in verbalizations at prior token positions seems kind of epiphenomenal (more on this later).
^
We measure "noisiness" as mean squared-error between the normalized prediction and normalized activation. The average is across a random sample of 100 of the math problems specifically; we compute it by normalizing the mean of the normalized activations (generically the mean is "shrunken", so performs better than any unit vector representing a direction).
^
One could also take the mean direction of the other activations in the set, but this makes little difference to the overall results.
^
This isn't very well-specified target, as stated here: you could reconstruct the target model's prompt, which is non-hidden, and treat the target model itself as the AR (so that, up to indeterminism, you reconstruct the original activation perfectly). But a clear operationalization of the question is: "for what N does best-of-N few-shot confabulation (by the same model, or by other models) become competitive with actually running the verbalizer, on this distribution?". We'd want N to be very high, while also ruling out steganography.
^
Harder problems are still good places to look for mistakes, though: lots of the good candidates for these involve harder problems where the model gets it wrong by mistaking it for a much easier problem.
^
These are the ones that Opus 4.5 estimates that the median AIME participant can solve within a minute, but I'd guess that the times are pretty severe overestimates (which doesn't really matter for various purposes so long as the bias scales proportionally). Nevertheless, it gives us a subset of problems that are hard enough that the models don't always get them right, but are easy enough for them that we can expect some intermediate reasoning within our set-up.
^
Some tentative results from (pretty data-limited) linear probes that suggest that this might be part of what's going on: they predict the first token of Gemma's output at around 80% at the last layer, but don't beat the baseline of always predicting "1" (35%) at the NLA layer.
^
The lower bound we report for observations is the minimal such that has ; analogously for the upper bound.
^
Among the ten best candidates for verbalizations that reveal a mistake, Opus tends to guess the error suggested by the verbalization from just the problem and wrong output. More generally, we compared the guesses that Opus 4.7 (xhigh) makes about Gemma’s errors when it has access to last-token verbalizations against the “baseline” guesses that Opus makes without verbalization access. A majority (61% [.571, .642]) of pairs of guesses were unanimously rated by k = 3 Opus judges as definitely (versus probably, probably not, or definitely not) being the same. This suggests that a majority of errors are either obvious enough that there is not much room for verbalization uplift, or else not coherently surfaced by verbalizations. Generally, Opus seems to overestimate how coherent Gemma’s mistakes are; probably you could prompt Opus to be better at this task.
^
More generally, steering worked on 11/65 of decent candidates; but it also worked on 10/65 of these candidates when using the vector from some random other candidate. The vectors tended not to have high pairwise cosine similarity; it seems like just perturbing the activation is often enough to push Gemma into giving the correct answer.
^
Note that the grader's internal labels are not these display labels; in particular, “Strict” is “DEFINITIVE”, “Lax” is “STRONG”, and “Spurious” is “MISLEADING”.
^
Some bars are too short to be visible, but their counts can be calculated from the numbers shown.
^
This need not necessarily hold, though the exceptions are unrealistic at current capabilities and propensities. For example, the model might clearly know that the answer is p (thus earning Strict on Answer), but cogently express that it is nevertheless 50/50 about whether to sandbag by outputting q instead (thus earning Ambiguous on Output).
^
Note that even a perfect AV on an intermediate layer may not always achieve a grade of Strict or even Lax on Output at the last token position, because which output token is ranked first might not yet be fully determined (indeed, this is what we exploit when steering on reconstructions of edited verbalizations at earlier token positions). Similarly, a perfect AV may not always score even Ambiguous: the model may not even have thought of its eventual output by the point from which the activation was extracted. (There is also a weaker sense in which the model might not be confident, namely that tokens other than the one ranked first might be sampled; but since we work with greedy sampling, we set this aside.) Nevertheless, such predictiveness, when possible, is still clearly desirable from NLA-style tools.
^
The 65 were selected in a kind of ad hoc way; they included those that scored highest on Mistake, but also some that seemed to most clearly mention some definitely irrelevant concept that could be easily edited out (e.g., "Fibonacci" or "golden ratio"). There were some of these that still scored Absent on Mistake, which measured how well a mistake explains the model's specific output (rather than just a wrong output in general): even though there something irrelevant is mentioned, it's not clear whether or how taking it to be relevant might have led to the output.
^
There are three verbalizations that scored Lax or Strict on a metric other than Output in both the actual-wrong condition and the red-team fake-correct condition. Two of these (249, 453) are verbalizations that already scored Lax on Answer in the actual-wrong case (overcoming the bias displayed with the first cluster). The third (809) is a verbalization that scored Lax on Problem in both cases.
^
In more detail: the yellow-orange gap gives the extent to which

Discuss

Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking

Новости LessWrong.com - 7 июня, 2026 - 03:12

Large-scale cooperation has been a central feature of humanity’s ability to advance technology and build complex societies. Much of this cooperation is reliant on the ability to act in ways informed by the beliefs and intentions of others. This capacity, also known as Theory of Mind (ToM), includes belief-state tracking, which describes the ability to keep track of who knows what as information is exchanged in groups.

Belief-state tracking becomes increasingly important as AI systems get integrated into more collaborative environments. For example, for an AI personal assistant to successfully cooperate with other humans or agents across multiple channels, it needs to maintain a model of what which party already knows and what could be misunderstood.

In late 2023, Kim et al. published FANToM, a benchmark designed to stress-test machine theory of mind in conversational interactions. They found that LLMs at the time performed substantially worse than humans. After re-running a sampled version of FANToM on current frontier models, I find that, while models have improved substantially, they still trail human performance. This is especially noteworthy because FANToM is a comparatively simple benchmark for humans as it limits itself to single conversations on specific topics without requiring any prior knowledge about participants.

FANToM BenchmarkOverview

First, some context on the benchmark.

FANToM consists of small talk conversations between multiple participants along with question-answer (QA) pairs about the beliefs of participants. For example, this may include a participant leaving a conversation midway through, new information being revealed while they're gone, and the participant then later returning. This creates the types of information asymmetries that you need to account for to facilitate effective cooperation.

Sample Benchmark Conversation

For each conversation, the authors include factual QA pairs (FactQ) about the inaccessible information. For example:

What is Linda's dog's favorite food?

For each FactQ, they then create the following types of ToM QA:

BeliefQ: asks what a participant believes about the answer.

What does Kailey believe Linda’s dog’s favorite food is?

AnswerabilityQ: asks which participants know the correct answer.

Target: What is Linda’s dog’s favorite food?

List all characters who know the precise correct answer to this question.

InfoAccessQ: provides a piece of information and asks who has access to it.

Information: Linda’s dog’s favorite food is peanut butter.

List all characters who know this information.

The benchmark includes several answer formats for each, including yes/no, multiple choice, and free-form responses. The evaluated model acts as an observer to the conversation and has access to both the contents of it as well as the ToM question.

For the original benchmark, the authors also measured human performance by giving the questions to graduate students.

Changes Made for This Benchmark Rerun

To evaluate current frontier models on this benchmark, I made two main changes to the original benchmark setup.

First, instead of using the full benchmark of 870 sample sets, I used a stratified 174-set sample that preserves the benchmark’s mix of inaccessible and accessible information, belief-complexity buckets, and context-length bins for cost and wall time reasons.

Second, I replaced the original grading pipeline with an LLM-backed grader using gpt-5.4-nano. The original benchmark relied on heuristic string parsers and embedding-distance-based scores, which made sense at the time, but risks incorrect judgements, especially in free-text responses.

Results

In the paper, the authors also report a stricter "All" metric in addition to the individual ToM QA types, which asks whether the model got every ToM QA right for the same scenario across variants.

So, if a model gets four out of five questions right, the scenario fails "All." The intent behind this metric is to identify instances of "illusory ToM," which the authors define as cases of the model getting some of the question variants right but others wrong despite all requiring the same underlying reasoning. Humans, for example, showed near-identical scores across all question variants.

Model vs. human performance

Current frontier models show a substantial improvement over the models evaluated in 2023. Back then, the highest score on "All" was GPT-4's at 12.3. Now, GPT-5.5, Opus 4.8, Gemini 3.1 Pro, and DeepSeek V4 Pro all came in with scores above 60.

However, all models still sit below human reference performance on the strict "All" metric (87.5 for human performance, 67.4 for GPT-5.5), and remain meaningfully below human performance on AnswerabilityQ and InfoAccessQ.

The gap between the binary BeliefQ ("what's person X's belief?") and list-driven AnswerabilityQ ("list all participants that can answer this question") and InfoAccessQ ("list all participants that have access to X information") points to an interesting pattern: when concrete options are supplied in the binary case, models seem to do much better than when asked to answer more holistic belief-state model questions. The difference in question format appears to matter to the result.

Humans, on the other hand, seem to build more stable models of a conversation. When asked questions of varying formats about it, they seem to draw from the same representation, which results in stable scores across question types.

Failure Modes

When models fail to keep a coherent belief-state model, what categories of errors do their responses point to?

Ignoring information access boundaries

As an omniscient observer, the models sometimes do not reflect on whether someone joining the conversation partway through has the same context as itself as the omniscient observer.

For example, in a conversation Claude Opus 4.8 got wrong, Josue describes volunteering in a remote village in Ecuador and how it changed his priorities. Nathaniel enters only after and shares a different life-changing experience.

Samples from benchmark conversation

The model is then asked:

Target: Where did Josue volunteer and what impact did this experience have on him?

Question: Does Nathaniel know the precise correct answer to this question? Answer yes or no.

The response:

Answer: Yes

Reasoning: Josue stated in the conversation that he volunteered on a construction project in a remote village in Ecuador, and that the experience made him reassess his priorities and feel more grateful and appreciative of his own life. Nathaniel joined the conversation while it was ongoing, but Josue's account was shared openly to the group, so Nathaniel had access to this information.

Here, the model ignores the temporal consequence of Nathaniel not having been present when this information was revealed, and instead assumes that because the information was shared with the group, Nathaniel should have access to it despite joining late.

Requiring explicit confirmation / defensive hedging

Another failure mode is the model requiring explicit statements in order to update its meta-model of the beliefs of participants. This is especially damaging to the model scores for questions where they are asked to supply a list of participants, leading to a fairly persistent bias to underreport.

In one such conversation of the benchmark, the three participants Cheyenne, Jada, and Cash first discuss the importance of support networks. Then, another participant, Tatianna, joins and broadens the subject of the conversation to how she has used her experience overcoming challenges to help others.

Samples from the benchmark conversation

Given this conversation, the model (GPT-5.5) is asked:

Question: What does Cash believe about Jayda's belief regarding the change, if any, in the conversation after Tatiana's arrival?

(a) Cash believes that Jayda thinks the conversation expanded to include the concept of turning personal hardships into opportunities to help others after Tatiana's arrival.

(b) Cash is unaware about Jayda's belief regarding the change, if any, in the conversation after Tatiana's arrival.

The model responded with:

Answer: (b)

Reasoning: After Tatiana arrived, the topic expanded to turning personal hardships into opportunities to help others. While both Jayda and Cash witnessed this, Jayda never explicitly stated a belief about the conversation having changed. Without Jayda voicing such a belief, Cash has no basis to know Jayda's stance on whether the conversation shifted.

The model's answer here is too conservative, requiring explicit statements about changes in beliefs. Cash heard Jayda respond to Tatiana by saying, “That’s indeed turning a challenge into an opportunity, Tatiana!” This exclamation should be sufficient for the model to update its belief-state model.

Limitations

While FANToM provides directionally valuable signal, the benchmark also has some limitations. In the real world, the model would not act as an omniscient observer and instead act as a participant in the conversation. In addition, it would also have to account for more dynamic relationships through prior interactions and aggregate relevant signal across several channels.

Some of the scenarios in the benchmark also show quality issues. For example, in one scenario Collin explains his family's New Year's Eve tradition while Elena is present. Elena leaves. Nina later enters and says she is back. Collin then greets Nina as Elena:

Collin: Good to have you back, Elena!

The benchmark then asks the model:

What does Collin believe Nina thinks is a unique New Year's Eve tradition that his family follows?

The gold answer says Collin believes Nina is unaware, because Nina was not present when the tradition was discussed. But the model answered that Collin believes Nina knows the tradition because Collin seems to mistake Nina for Elena. The model's answer is not obviously wrong here given the conversation provided, and the resulting failure is more indicative of benchmark limitations than of lack of model capabilities.

Discuss

Coming Around To Political Donations

Новости LessWrong.com - 7 июня, 2026 - 00:30

Five years ago I read a post on the EA Forum arguing that "election campaign contributions might be a way in which you can have a substantial impact as a small donor". It struck me as weird but plausible: a combination that you see a lot of on the Forum.

A few months later I read another post, a case for Carrick Flynn in particular. It made a lot of sense, but while I don't remember my specific reservations I do remember not being convinced initially. After a lot of talking with Julia and others, however, this campaign did seem like a really promising opportunity. Six days later we made the donation:

We hadn't donated to a political campaign since college, but Julia was impressed with this candidate's work on pandemic preparedness, which is an area we've both thought was important for a long time. In general, we prefer to donate through funds because they are able to put a lot more time and attention into identifying excellent donation opportunities, but campaign finance rules mean this model doesn't work for political donations.

Flynn lost, and not for lack of funding. People took away a range of lessons (see the comments too!) from the attempt; personally my largest was that it's really important to assess early on whether the candidate is resonating with voters, and proxies like "previously elected to local office here" are super valuable.

The argument for individuals donating to support candidates still made sense to me, and I would still have been willing to do it for the right opportunity. For the next few years, however, I didn't come across any that were sufficiently compelling. And with a lot of other things going on in my life I didn't seek these out.

In Fall 2025 friends started discussing political donations more, and I met Eric Neyman who was putting together a working group to identify and rank political donation opportunities from the perspective of "making the long-term future go well." I read his analysis of cost-effectiveness of donating to Alex Bores' campaign, talked to friends, and talked with Bores himself briefly when I was in NYC for EAG. Not wanting to repeat earlier mistakes, I was glad to see he's already been evaluated by the electorate in becoming a state legislator. Which is not to say he'll definitely win: it's a competitive field and he's at 42% on Manifold. Still, I decided to donate, and later donated to several other people that some combination of Neyman's group, the Secure AI Project, and politics-focused EAs recommended. They've mostly been Democrats so far, but party isn't my goal: it's about what I expect the candidates will do if elected.

After continuing to think about this, I actually think I should make political donations my primary method of giving. The vast majority of charitable dollars legally can't go to candidates, and I don't expect this to change. Donors with a lot of money to distribute have the same lowish hard-dollar limits I have, and much of the remainder, including a lot of likely-forthcoming Anthropic employee funding, is in donor advised funds. This means my money is unusually well-suited to help fill what I see as one of the highest priority gaps.

This is not the full case (see Ozy, Lincoln, and Scott) but it's the part that took longest to click for me.

Overall I feel pretty mixed about this. On the one hand, for years I've wanted to apply my comparative advantage as an independent individual to make more impactful donations, and it's great to finally really be doing this. On the other, it's kind of depressing. It's a familiar feeling: when I moved from primarily funding global poverty to trying to reduce catastrophic risk I felt the same way: more distance from helping the world's poorest people in the present, when they would very clearly benefit a lot from my money. But I do think it's here my money will do the most good, and that's what drives me.

Discuss

The Residual Stream Has a Geometry of Time

Новости LessWrong.com - 6 июня, 2026 - 23:52

Preface

This is a preliminary writeup for an experiment on residual stream geometry. The research direction seems pretty underexplored, so I’m posting early to collect objections, research intuitions, and connections to problems other people are thinking about before I invest in the larger run.

The case for skimming this post: this experiment suggests transformers may keep track of context in a surprisingly compact way. Information that persists across many tokens is not diffuse across activation space; it concentrates in a low-dimensional subspace that can be projected out, compared to attention/MLP writes, and maybe even targeted by interventions.

Summary

The residual stream is commonly analogized to the transformer's "working memory." At each token position, a high-dimensional vector accumulates the attention and MLP deltas. This picture considers state transformations along the depth-time axis, i.e. layer by layer.
There is a second axis of sequence-time. Within a layer, the model must also keep track of information at position mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-msub { display: inline-block; text-align: left; } mjx-mn { display: inline-block; text-align: left; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mover { display: inline-block; text-align: left; } mjx-mover:not([limits="false"]) { padding-top: .1em; } mjx-mover:not([limits="false"]) > * { display: block; text-align: left; } mjx-mspace { display: inline-block; text-align: left; } mjx-mrow { display: inline-block; text-align: left; } mjx-mtext { display: inline-block; text-align: left; } mjx-munderover { display: inline-block; text-align: left; } mjx-munderover:not([limits="false"]) { padding-top: .1em; } mjx-munderover:not([limits="false"]) > * { display: block; } mjx-msubsup { display: inline-block; text-align: left; } mjx-script { display: inline-block; padding-right: .05em; padding-left: .033em; } mjx-script > mjx-spacer { display: block; } mjx-msup { display: inline-block; text-align: left; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-munder { display: inline-block; text-align: left; } mjx-over { text-align: left; } mjx-munder:not([limits="false"]) { display: inline-table; } mjx-munder > mjx-row { text-align: left; } mjx-under { padding-bottom: .1em; } mjx-msqrt { display: inline-block; text-align: left; } mjx-root { display: inline-block; white-space: nowrap; } mjx-surd { display: inline-block; vertical-align: top; } mjx-sqrt { display: inline-block; padding-top: .07em; } mjx-sqrt > mjx-box { border-top: .07em solid; } mjx-sqrt.mjx-tall > mjx-box { padding-left: .3em; margin-left: -.3em; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c1D458.TEX-I::before { padding: 0.694em 0.521em 0.011em 0; content: "k"; } mjx-c.mjx-c1D70F.TEX-I::before { padding: 0.431em 0.517em 0.013em 0; content: "\3C4"; } mjx-c.mjx-c1D457.TEX-I::before { padding: 0.661em 0.412em 0.204em 0; content: "j"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c2F::before { padding: 0.75em 0.5em 0.25em 0; content: "/"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c2248::before { padding: 0.483em 0.778em 0 0; content: "\2248"; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c33::before { padding: 0.665em 0.5em 0.022em 0; content: "3"; } mjx-c.mjx-c37::before { padding: 0.676em 0.5em 0.022em 0; content: "7"; } mjx-c.mjx-c1D463.TEX-I::before { padding: 0.443em 0.485em 0.011em 0; content: "v"; } mjx-c.mjx-c5E::before { padding: 0.694em 0.5em 0 0; content: "^"; } mjx-c.mjx-c1D445.TEX-I::before { padding: 0.683em 0.759em 0.021em 0; content: "R"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c7B::before { padding: 0.75em 0.5em 0.25em 0; content: "{"; } mjx-c.mjx-c2265::before { padding: 0.636em 0.778em 0.138em 0; content: "\2265"; } mjx-c.mjx-c3A::before { padding: 0.43em 0.278em 0 0; content: ":"; } mjx-c.mjx-c3C::before { padding: 0.54em 0.778em 0.04em 0; content: "<"; } mjx-c.mjx-c7D::before { padding: 0.75em 0.5em 0.25em 0; content: "}"; } mjx-c.mjx-c394::before { padding: 0.716em 0.833em 0 0; content: "\394"; } mjx-c.mjx-c61::before { padding: 0.448em 0.5em 0.011em 0; content: "a"; } mjx-c.mjx-c78::before { padding: 0.431em 0.528em 0 0; content: "x"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c65::before { padding: 0.448em 0.444em 0.011em 0; content: "e"; } mjx-c.mjx-c64::before { padding: 0.694em 0.556em 0.011em 0; content: "d"; } mjx-c.mjx-c1D462.TEX-I::before { padding: 0.442em 0.572em 0.011em 0; content: "u"; } mjx-c.mjx-c2208::before { padding: 0.54em 0.667em 0.04em 0; content: "\2208"; } mjx-c.mjx-c72::before { padding: 0.442em 0.392em 0 0; content: "r"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c2C::before { padding: 0.121em 0.278em 0.194em 0; content: ","; } mjx-c.mjx-cA0::before { padding: 0 0.25em 0 0; content: "\A0"; } mjx-c.mjx-c22EF::before { padding: 0.31em 1.172em 0 0; content: "\22EF"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c34::before { padding: 0.677em 0.5em 0 0; content: "4"; } mjx-c.mjx-c2211.TEX-S1::before { padding: 0.75em 1.056em 0.25em 0; content: "\2211"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c38::before { padding: 0.666em 0.5em 0.022em 0; content: "8"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c36::before { padding: 0.666em 0.5em 0.022em 0; content: "6"; } mjx-c.mjx-c1D436.TEX-I::before { padding: 0.705em 0.76em 0.022em 0; content: "C"; } mjx-c.mjx-c39::before { padding: 0.666em 0.5em 0.022em 0; content: "9"; } mjx-c.mjx-c2218::before { padding: 0.444em 0.5em 0 0; content: "\2218"; } mjx-c.mjx-c1D443.TEX-I::before { padding: 0.683em 0.751em 0 0; content: "P"; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c2225::before { padding: 0.75em 0.5em 0.25em 0; content: "\2225"; } mjx-c.mjx-c1D438.TEX-I::before { padding: 0.68em 0.764em 0 0; content: "E"; } mjx-c.mjx-c74::before { padding: 0.615em 0.389em 0.01em 0; content: "t"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c4D::before { padding: 0.683em 0.917em 0 0; content: "M"; } mjx-c.mjx-c4C::before { padding: 0.683em 0.625em 0 0; content: "L"; } mjx-c.mjx-c50::before { padding: 0.683em 0.681em 0 0; content: "P"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c2113::before { padding: 0.705em 0.417em 0.02em 0; content: "\2113"; } mjx-c.mjx-c1D451.TEX-I::before { padding: 0.694em 0.52em 0.01em 0; content: "d"; } mjx-c.mjx-c211D.TEX-A::before { padding: 0.683em 0.722em 0 0; content: "R"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c1D707.TEX-I::before { padding: 0.442em 0.603em 0.216em 0; content: "\3BC"; } mjx-c.mjx-c1D53C.TEX-A::before { padding: 0.683em 0.667em 0 0; content: "E"; } mjx-c.mjx-c5B::before { padding: 0.75em 0.278em 0.25em 0; content: "["; } mjx-c.mjx-c5D::before { padding: 0.75em 0.278em 0.25em 0; content: "]"; } mjx-c.mjx-cAF::before { padding: 0.59em 0.5em 0 0; content: "\AF"; } mjx-c.mjx-c22A4::before { padding: 0.668em 0.778em 0 0; content: "\22A4"; } mjx-c.mjx-c7E::before { padding: 0.318em 0.5em 0 0; content: "~"; } mjx-c.mjx-c1D447.TEX-I::before { padding: 0.677em 0.704em 0 0; content: "T"; } mjx-c.mjx-c2211.TEX-S2::before { padding: 0.95em 1.444em 0.45em 0; content: "\2211"; } mjx-c.mjx-c63::before { padding: 0.448em 0.444em 0.011em 0; content: "c"; } mjx-c.mjx-c28.TEX-S2::before { padding: 1.15em 0.597em 0.649em 0; content: "("; } mjx-c.mjx-c29.TEX-S2::before { padding: 1.15em 0.597em 0.649em 0; content: ")"; } mjx-c.mjx-c7C::before { padding: 0.75em 0.278em 0.249em 0; content: "|"; } mjx-c.mjx-c44.TEX-C::before { padding: 0.683em 0.771em 0 0; content: "D"; } mjx-c.mjx-c1D43E.TEX-I::before { padding: 0.683em 0.889em 0 0; content: "K"; } mjx-c.mjx-c3A3::before { padding: 0.683em 0.722em 0 0; content: "\3A3"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c1D706.TEX-I::before { padding: 0.694em 0.583em 0.012em 0; content: "\3BB"; } mjx-c.mjx-c1D716.TEX-I::before { padding: 0.431em 0.406em 0.011em 0; content: "\3F5"; } mjx-c.mjx-c1D43C.TEX-I::before { padding: 0.683em 0.504em 0 0; content: "I"; } mjx-c.mjx-c3E::before { padding: 0.54em 0.778em 0.04em 0; content: ">"; } mjx-c.mjx-c1D444.TEX-I::before { padding: 0.704em 0.791em 0.194em 0; content: "Q"; } mjx-c.mjx-c1D43A.TEX-I::before { padding: 0.705em 0.786em 0.022em 0; content: "G"; } mjx-c.mjx-c22A5::before { padding: 0.668em 0.778em 0 0; content: "\22A5"; } mjx-c.mjx-c1D43D.TEX-I::before { padding: 0.683em 0.633em 0.022em 0; content: "J"; } mjx-c.mjx-c2D::before { padding: 0.252em 0.333em 0 0; content: "-"; } mjx-c.mjx-c68::before { padding: 0.694em 0.556em 0 0; content: "h"; } mjx-c.mjx-c75::before { padding: 0.442em 0.556em 0.011em 0; content: "u"; } mjx-c.mjx-c22C6::before { padding: 0.486em 0.5em 0 0; content: "\22C6"; } mjx-c.mjx-c1D446.TEX-I::before { padding: 0.705em 0.645em 0.022em 0; content: "S"; } mjx-c.mjx-c1D70E.TEX-I::before { padding: 0.431em 0.571em 0.011em 0; content: "\3C3"; } mjx-c.mjx-c1D467.TEX-I::before { padding: 0.442em 0.465em 0.011em 0; content: "z"; } mjx-c.mjx-c70::before { padding: 0.442em 0.556em 0.194em 0; content: "p"; } mjx-c.mjx-c223C::before { padding: 0.367em 0.778em 0 0; content: "\223C"; } mjx-c.mjx-c4E.TEX-C::before { padding: 0.789em 0.979em 0.05em 0; content: "N"; } mjx-c.mjx-c43::before { padding: 0.705em 0.722em 0.021em 0; content: "C"; } mjx-c.mjx-c41::before { padding: 0.716em 0.75em 0 0; content: "A"; } mjx-c.mjx-c1D439.TEX-I::before { padding: 0.68em 0.749em 0 0; content: "F"; } mjx-c.mjx-c2061::before { padding: 0 0 0 0; content: ""; } mjx-c.mjx-c1D703.TEX-I::before { padding: 0.705em 0.469em 0.01em 0; content: "\3B8"; } mjx-c.mjx-c27E8::before { padding: 0.75em 0.389em 0.25em 0; content: "\27E8"; } mjx-c.mjx-c27E9::before { padding: 0.75em 0.389em 0.25em 0; content: "\27E9"; } mjx-c.mjx-c221A::before { padding: 0.8em 0.853em 0.2em 0; content: "\221A"; } mjx-c.mjx-c1D465.TEX-I::before { padding: 0.442em 0.572em 0.011em 0; content: "x"; } mjx-c.mjx-c1D44E.TEX-I::before { padding: 0.441em 0.529em 0.01em 0; content: "a"; } mjx-c.mjx-c1D450.TEX-I::before { padding: 0.442em 0.433em 0.011em 0; content: "c"; } mjx-c.mjx-c2260::before { padding: 0.716em 0.778em 0.215em 0; content: "\2260"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } which is useful at position . This experiment aims to discover the geometry of how the model tracks state across tokens.
A direction's (sequence) timescale is the lag at which its sample autocorrelation first drops below . To calculate for direction , I project the residual stream onto at every token position, compute within-document autocorrelation curves, average them over documents to estimate lag-k autocorrelation , and set .
The primary experiment compares three probe families: 512 random directions (null baseline), 256 PCA directions (ranked by variance), and 256 time-lagged probes (multi-lag TICA style, ranked by persistence). Estimator details are in Appendix A.
I estimated the distribution of timescales across residual-stream directions in layer 12 of Gemma-2-2B on 5,000 C4 documents, then investigated the properties of the high- directions: where they live in the ambient space, how many there are, what they appear to semantically encode, and whether their high timescales are tied to sequential context or just unigram statistics.

What Was Found

Finding 1: The timescale distribution is extremely heavy-tailed, and the tail is carried by about 31 directions. Random and PCA directions have a 90th-percentile timescale of 1 token, carrying essentially no signal across positions. Time-lagged probes have a 90th-percentile timescale of 17 tokens.

To quantify the tail, define a direction's timescale excess as its timescale above the random-direction median,

and sort the excesses across the deduplicated eligible probe set. Roughly 31 directions carry about 80% of the total excess, . Also, the 31 directions are nonredundant, with low cosine sim and effective rank ~28/31 (orthogonality details in Appendix E).

Finding 2: Timescale depends on sequential order. Shuffling token positions within documents, which preserves the token multiset but destroys sequential context, collapses the top-decile timescale of slow probes from 17 tokens to 1 (94% reduction). This rules out token composition alone, such as recurring token clusters, as the reason for high .

Where the slow directions live: held-out projection collapse

This section asks two things with one test: 1) where in the residual stream the slow directions live and 2) whether they are the only high-timescale region. If persistent structure lived outside these directions then removing them would leave some behind, so a full collapse of held-out timescales is evidence they carry essentially all of it.

I build three -dimensional bases, project each out of the residual stream, and recompute timescales on held-out probes. The held-out probes are time-lagged directions fit on a disjoint train shard, have nearly identical timescale distribution as the original time-lagged probe set (Q90 around 19-20 versus 17) before projection, and are not eligible for the basis we project out (construction details are in Appendix B).

Projecting out the slow basis collapses persistence on held-out probes almost completely by (). Removing the same number of top-PCA directions collapses it just as completely, while removing 256 random directions barely touches it ().

Finding 3: Persistence depends on the high-variance substrate. It seems like a contradiction that PCA and slow directions are equally effective deletion bases, yet the 31-dim slow basis is only about 9% contained in PCA-31 ( effective dimensions of overlap) and about 52% in PCA-256. In angle terms, the slow span is far from PCA-31 (median principal angle ), though more overlapping with PCA-256 (median principal angle ).

The resolution is from distinguishing where a slow direction's energy sits from where its slowness comes from-- it may have about half its energy outside PCA-256, but that isn't what makes it slow. PCA diagonalizes the zero-lag covariance, so its components are uncorrelated at lag 0, and empirically each one alone is fast (low ). But the lagged covariance isn't diagonal in that basis! The slow directions are combinations of PCA components whose lagged cross-covariances reinforce each other, keeping the combined signal correlated out to large k. So slowness is a relational property of how components co-vary across lags, not a marginal property of any single component (a more rigorous treatment is in Appendix F).

An intuitive image is that a single PCA axis is a note that fades at once, while a slow direction is a chord whose overtones ring on, living in how the notes combine rather than in the loudest among them. So deleting the slow directions removes a slowness ingredient many computations ride on, not an isolated module.

Are just the directions slow, or do they span a slow subspace?

Finding 1 gave 31 slow axes, but the objective could have picked special directions while generic directions in their span remain fast (low ). To test this, I sample random unit vectors inside the slow span, matched PCA spans, and the ambient space, then measure held-out timescale with the same estimator. Method and full set of checks are in Appendix E.

Subspace Dim Median Fraction above ambient null Slow subspace 31 25 1.00 PCA-31 31 1.5 0.50 PCA-128 128 1.0 0.29 PCA-256 256 1.0 0.15 Ambient 2304 1.0 n/a (defines null)

Test split, 512 sampled directions per subspace; ambient null = ambient-random Q95 timescale. Validation is near identical (slow subspace median val). Slow-subspace lower-tail quantiles on test: q05 = 10, q25 = 18, q50 = 25, q75 = 40, q95 = 82.

Finding 4: There exists a slow subspace. The median inside the slow span is 25, against 1 to 1.5 for matched-dimension PCA spans. Every sampled slow subspace direction beat the null while at least half of the PCA-control directions fell inside it, so there is no obvious "fast corner" of the slow subspace. The caveat is that 512 samples can support a claim about generic directions pretty well, but not a proof that every direction in the continuous span is slow.

But the subspace is not uniformly slow. Sweeping its dimension reveals a steep internal gradient.

Intuitively, a wider span includes increasingly fast directions, so a random vector places less of its energy on the slowest modes. Also note that the tail is independently slow, as a random mix of only the bottom-ranked directions (ranks 17 to 31) still has median . The sweep falls off steeply because the 31 directions span roughly 28 distinct geometric dims (geometric participation ratio 28) but concentrate their persistence into roughly five modes (slowness participation ratio 5).

Span Median top-1 492 top-2 298 top-3 201 top-5 116 top-8 87 top-13 30 top-21 27.5 top-31 (full span) 25 ranks 17-31 only 14

Test split; validation tracks within a token or two.

The structure is stable across splits (Spearman above between all split pairs, for random-in-span across validation and test, no train-to-test shrinkage). Full split-stability table and the test-inspection disclosure are in Appendix E.

Attention-aligned geometry (exploratory)

Finding 5: The slow directions align with attention-output geometry. What in the model creates the slow directions? The natural hypothesis is that attention writes/maintains them, as it is the model's only mechanism for mixing information across positions.

However, raw overlap with attention outputs is a confounded test. Since attention writes into the residual stream, which is dominated by a few top PCA axes, any two subspaces drawn from it overlap by default. I measure each direction's overlap with the attention and MLP output subspaces and subtract its overlap with a matched-dimension PCA subspace to control for this.

For a unit direction and a subspace projection matrix , the overlap score is , and the excess attention and MLP overlaps are:

The curves show the selected-slow-direction minus random-direction median excess as progressively more PCA components are removed. Discussed more in Appendix C.

The slow directions have positive excess attention overlap, with median . The score is unitless and only means something against a baseline: random directions sit at (SD ), which means the slow directions are about SDs above the random mean.

Overall, this test finds that long-timescale directions lie unusually close to attention output subspaces, with weaker evidence for late-MLP overlap. It doesn't prove that attention causally maintains state in the slow subspace.

What the directions encode (exploratory)

The goal is to characterize what tsemantics he slow subspace carries, but the trap is that top-token labels are nearly free. Regardless of timescale, any direction yields a plausible story from its top tokens. Instead, for each direction I look for long, high-projection runs on the validation split and count how many distinct documents contain at least one, so a single page can't drive the result. Details are in Appendix D.

The contrast is large. A generic rotation inside the slow subspace has a median qualifying run of 126/500 val docs, and still 106 at the Q10. The matched controls collapse, as random PCA-31 has median 18 docs and sub-5 at 10th percentile, while PCA-128, PCA-256, and ambient-random are in the low single digits.

The cautious verdict is that on C4 the slow subspace mostly tracks durable web-document state, including register, domain, formatting, source templates, and scraping artifacts. That is not yet evidence for clean semantic features or abstract reasoning variables. These directions behave like held state, lit steadily over a coherent passage, rather than higher-level variables updated during reasoning, and separating those two cases needs new experiments (see below).

Limitations of this section:

Inspects ten samples per span family, on validation only.
Labels are coarse: they are regex summaries of the highest and lowest projection token windows.
Some directions visibly track scraped-source or formatting structure, not semantics.

What This Result Means So Far

This experiment used one corpus, layer, and model, so it doesn't license any broad interpretability claims.

Currently, the contribution is the axis itself. Timescale ranks directions by what the model maintains across positions, a question sidestepped by the major decomp methods: PCA ranks directions by variance, SAEs find sparse features at single positions, and parameter decomp describes static weight structure. None of these directly answer what persists across time.

On generalization, my prior is that a slow subspace is roughly model-general but its orientation is corpus-dependent. Persistent state has to live somewhere but the residual stream is too anisotropic for it to spread evenly, so some compact slow geometry should recur.

Continuations: What I Want to De-risk

A high- direction is likely one of two things. 1) Held state: the value sits at a steady level over a passage and varies between passages, e.g. what C4 appears to surface. 2) Live variable: the value changes over the passage while staying readable from the same direction, like a count, proof state, or intermediate conclusion. The second case is the LRH with dynamics: we still have variables that correspond to fixed directions, but the scalar value is updated over sequence-time.

Concrete experimental continuation:

A corpus that forces the distinction. Multi-step proofs, long-horizon code, or controlled reasoning tasks with a labeled running variable. Will also add sentence-level shuffling and paraphrase controls to cut the local lexical clustering that C4 leaves in.
A direct held vs. live discriminator. Regress each slow direction's within-document projection against the labeled quantity. If it tracks a document-level constant, it is held state. But if it tracks the running value, the slow subspace likely carries a live variable.

The interpretability payoff depends on which world that test lands in. If the slow subspace is only web-document register, timescale is less exciting. If it carries maintained reasoning variables, then some really interesting questions become askable: whether a steering vector shifts maintained state or just "greases the logits," whether features decompose into fast and slow components, whether sparse-feature methods could be run not over the whole activation space but constrained to the high-timescale region.

Appendices Appendix A: Time-lagged probe estimator

Let be the residual stream for document , token , and hook layer . Train-token centering uses

Validation/test tokens are not used to fit PCA directions, time-lagged probes, or centering statistics.

Timescale estimator

For a unit probe ,

Within-document demeaning:

Per-document lag- autocorrelation:

Probe-level curve, with equal document weighting:

where excludes document/probe/lag entries with near-zero lag-slice variance. Pilot requires .

I smooth with a centered moving average of width 5 for , set , and clip to . The reported timescale is

If no crossing occurs by , the probe is marked right-censored. In this pilot, no non-random probes were right-censored.

Time-lagged probe fitting

The time-lagged probes are fit on train documents only. Estimate

Use the symmetrized multi-lag covariance

Time-lagged probes are the leading generalized eigenvectors of

Conceptually, this maximizes a direction's multi-lag self-predictiveness, : the numerator rewards covariance with future positions, while the denominator prevents the estimator from merely selecting high-variance directions.

with

If , is doubled until the condition number falls below , up to .

The covariance objective chooses candidate directions; the held-out autocorrelation crossing defines the reported timescale. All reported values use this same within-document estimator. Their scale differs because they summarize different probe sets: the headline Q90 of 17 is computed across the original time-lagged probes, while the subspace battery in Appendix E samples rotations within the selected span and nested prefixes concentrated on its slowest directions.

This objective has standard precedents. It is closely related to slow feature analysis (Wiskott and Sejnowski, 2002) and to time-lagged independent component analysis (Perez-Hernandez et al., 2013), which extracts slow collective coordinates through a lagged-covariance generalized eigenproblem. TICA has a variational interpretation in terms of slow transfer-operator modes, making operator language a useful analogy. I use the narrower claim here: this probe fit surfaces slow linear residual-stream modes; it is not a full Koopman or dynamic-mode-decomposition analysis.

Fit health:

Diagnostic Value Lagged pair count 19,488,000 Train token count 4,096,000 Whitening PCs requested / used 512 / 512 Initial epsilon 0.00034823549316734574 Ridge doublings 0 Final condition number 147.78 Fit unstable false Anti-persistent or sign-changing count 0

Top generalized eigenvalues:

0.9082, 0.8654, 0.8534, 0.8218, 0.7620, 0.7357, 0.7230, 0.6835, 0.6419, 0.6071, 0.5948, 0.5680 Split hygiene

Train documents are used to fit directions and centering.
Validation/test documents are used to estimate and .
Independent held-out lag probes are used for projection-collapse evaluation.

Appendix B: Held-out projection-collapse method

Let be the top- eligible residual probes ranked by held-out within-document timescale after deduplication. I orthonormalize these directions to obtain a slow-direction basis .

For each residual vector, I remove the component in this basis:

I then recompute held-out probe projections,

and re-estimate the same within-document autocorrelation timescale used in the main analysis.

The held-out evaluation probes are fixed before projection-collapse evaluation:

: fresh random unit directions, independent of the random probes eligible for ;
: independently fit time-lagged probes, fit on disjoint train documents from the candidate time-lagged probes and never eligible for .

The candidate time-lagged probes used to construct are fit on train shard A. The held-out time-lagged evaluation probes are fit on train shard B. Both are evaluated on the same validation/test documents. This makes the test non-circular: the evaluation probes are not used to construct the projected basis, but their before/after timescales are measured on the same held-out documents.

Projection-collapse split details

The corpus contained 5,000 C4 documents of length 1,024 tokens:

Bucket Documents Tokens Full train split 4,000 4,096,000 Train shard A 3,200 3,276,800 Train shard B 800 819,200 Validation 500 512,000 Test 500 512,000

For the projection-collapse analysis, the train split was further partitioned using a held-out train fraction of 0.2. Candidate time-lagged probes used to construct the projected basis were fit on train shard A. The 128 held-out time-lagged evaluation probes, , were independently fit on train shard B and were never eligible for the projected basis.

Projection collapse was evaluated on validation during development and then confirmed on the test split using the same fixed probes and bases. No candidate probes, held-out probes, or projected bases were refit for the test confirmation.

For each held-out evaluation family , the collapse score is

I compute for three matched bases:

the slow-direction basis;
the top residual PCA basis;
a random orthonormal control basis.

The random-control basis tests whether collapse is caused merely by deleting any dimensions. The PCA basis tests whether held-out persistence functionally depends on high-variance residual geometry.

Appendix C: Scale calibration for unitless overlap scores

For unitless overlap/excess scores , I report scale relative to matched random residual directions:

This is a scale calibration, not a Gaussian significance test.

For the attention-overlap result in Finding 5, . The reported slow-direction value, , is the median over the 31 selected slow directions. The random baseline uses the matched random residual directions: mean , standard deviation . Thus the slow-direction median is random-control SDs above the random mean.

This is a descriptive scale calibration, not a p-value. The 31 slow directions were selected by the pipeline and are not IID samples from a null distribution, so the calculation should be read as "large on the random-direction scale," not as a Gaussian-significance claim.

The residual-PCA subtraction is useful for controlling residual-stream anisotropy, but it should not be read literally for PCA axes themselves. Top PCA directions have higher raw attention overlap than the slow directions, but median . After subtracting the matched residual-PCA baseline, their median excess attention overlap is , about control SDs below the random mean. This does not mean PCA directions are "anti-attention"; their residual-PCA baseline is mechanically large because they are PCA axes. The point is just that raw attention overlap is not the signal.

Appendix D: Semantic readout procedure

The semantic readout is exploratory and is not used in locked quantitative results.

For slow direction , compute held-out validation projections:

For the document-coverage comparison, I apply the same procedure to ten sampled vectors from each span family. For each direction, I take the global q95 projection threshold, retain contiguous above-threshold token runs of at least 8 tokens, and collapse them to distinct validation documents. This asks how widely a direction sustains a long high-projection run without letting repeated runs from one page inflate the count. I also report the number of retained runs per covered document.

Across families the retained-runs-per-covered-document counts are about 4.9 for the selected slow directions, 3.8 for random-in-span rotations, and near 1 for the PCA and ambient controls, the same graded structure the participation ratios show: the objective picks the sharpest axes, and generic directions in their span are slower but softer.

As a separate window-independence check, I count distinct source documents among the top 12 projection windows. A lower count can be the expected signature of document-state rather than an inflation artifact: a direction that tracks document-level state should produce correlated windows within a page. The distinct-document run count is the safeguard against mistaking one repeated page for a corpus-wide effect. In the body result, the slow-subspace median of 126 distinct documents already rules out one over-represented page driving the effect. Separately, slow-subspace directions touch fewer distinct top-window documents than ambient-random ones (median 7 versus 12), which is the expected signature of document-level tracking rather than scattered token firing.

Finally, I inspect readable windows around upper- and lower-tail token positions of . Labels summarize recurring patterns in those tail examples.

These are qualitative projection-tail labels, not trained classifiers, causal features, or validated semantic variables. For example, “legal/privacy boilerplate” means that many tail examples had that style; it does not mean the direction exclusively encodes that concept.

Appendix E: Slow-subspace battery

This appendix details the diagnostics behind Finding 4. All quantities are evaluated on held-out validation and test documents using the timescale estimator of Appendix A. The battery operates on saved projection artifacts; it performs no additional model forward pass or probe refit beyond the sampling described here.

Random-in-span sampling. Let be the orthonormalized basis of the 31 directions from Finding 1. More generally, for a -dimensional subspace with orthonormal basis , I sample coefficients

and form

This gives a uniformly random unit direction inside the subspace. I then compute the same held-out within-document autocorrelation timescale for each sampled direction.

Matched controls. The same procedure is applied to the top- PCA subspaces for and to the full ambient space. The PCA-31 control is the critical matched-dimensional comparator: it tests whether selecting the same number of highest-variance residual axes is enough to recover a generically slow subspace. PCA-128 and PCA-256 test whether random rotations become slow merely by drawing them from progressively broader high-variance residual geometry.

Subspace-PCA geometry. Principal-angle and energy statistics between the 31-dimensional slow basis and the top- PCA subspaces (artifact-only; no rerun). Mean squared containment is the fraction of energy captured by the PCA subspace, ; effective overlap is the unnormalized numerator, with units of dimensions.

Comparison Mean squared containment Effective overlap Median principal angle Worst angle vs PCA-31 8.99% 2.79 / 31 81.1° 90.0° vs PCA-128 28.69% 8.89 / 31 - - vs PCA-256 52.06% 16.14 / 31 44.2° 62.8°

The 31 directions are also near-orthogonal to each other: median pairwise absolute cosine is , against a random-direction baseline of for independent unit vectors in , so the observed median is only about 1.7x the random-cosine SD. This is the within-set redundancy check; effective rank is about 28 of 31.

There is one exact shared direction (PC-1). Removing it, the remaining 30-dimensional slow rotation sits at a median principal angle of 81.6° from PCA axes 2-31. The slow-basis energy is spread broadly across the PCA spectrum:

PCA band Fraction of energy PC 1 3.2% PCs 2-31 5.8% PCs 32-64 7.1% PCs 65-128 12.6% PCs 129-256 23.4% Outside PCA-256 47.9%

It takes 246 PCA axes to capture 50% of the slow-basis norm, and excluding PC-1 the captured energy is nearly uniform across PCs 2-256. So the slow subspace is not unusually repelled from PCA-31; it is a diffuse rotation across a broad high-variance substrate that the leading PCA frame does not recover. As an exploratory statistic, among the selected source axes slower directions have more energy captured by PCA-256 (Spearman , even excluding PC-1).

Classification rule (pre-registered). The slow subspace is classified "fat" when the median random-in-span exceeds the ambient-random Q90 null. The result is fat: median random-in-span (val), (test) against an ambient null of . The stronger "coordinate-free" claim, that essentially every direction in the span is slow, was not certified by this rule, because the locked rule tests only the median. The lower-tail analysis below is the empirical strengthening toward that stronger statement, but it is reported as exploratory rather than as a locked result.

Nested dimension sweep. Random-in-span sampling is repeated restricted to the top- directions for . Median declines with (, , , , , , , on test). The decline is partly mechanical, since a larger span admits progressively less-slow directions. The two non-mechanical observations are that the floor at () stays far above the null, and that random mixes of only ranks 17 to 31 retain median (val ).

Participation ratios. For the 31 directions I report two ratios using the participation-ratio functional . Applied to the covariance eigenvalues of the direction set it gives the geometric participation ratio, , confirming near-orthogonality and close-to-full rank. Applied to the timescale-excess values it gives the slowness participation ratio, , quantifying that persistence concentrates into roughly five effective dimensions despite the high geometric rank.

Lower-tail analysis. With a 5000-replicate direction bootstrap, the fraction of the 512 random-in-span directions exceeding the ambient Q95 null is , with quantiles q05 = , q25 = , q50 = , q75 = , q95 = (test). Matched controls: random PCA-31 has 50% above the null, PCA-128 has 29%, PCA-256 has 15%. The bootstrap characterizes Monte Carlo uncertainty over the already-sampled directions and does not substitute for a fresh prospective split.

Split stability. The 31 selected directions have median of (train), (val), (test), with Spearman correlations of across all split pairs and a train-to-test median shrinkage ratio of . The random-in-span endpoint reproduces across val and test with Spearman and median absolute change of 1, with the fraction above the per-split null equal to on both. This rules out the shrinkage regime in which directions are slow on the fit shard and regress to the null on held-out data.

Disclosure. The test split had been inspected elsewhere in the workflow before the battery rule was locked, so test results are formally exploratory. Validation is the independent held-out reference and tells the same story.

Appendix F: Why slowness is an off-diagonal property

Write a slow direction in the full PCA basis,

where are the PCA axes and is its energy on component . Its lag- autocovariance is

where is the lag- cross-covariance of PCA components and .

PCA diagonalizes , so the components are uncorrelated at lag 0. Empirically, the diagonal terms also decay quickly: individual PCA axes are fast. The lagged covariance is not diagonal in the PCA basis, so the off-diagonal terms can matter. The time-lagged objective selects combinations of PCA components whose cross-lag terms reinforce, keeping large out to large relative to the variance . Slowness is therefore an off-diagonal property of how components co-vary across lags, not a property of any single PCA component.

This also explains how PCA-256 and the slow basis can both be effective deletion bases without being the same object. PCA-256 does not contain the slow basis: it captures about half of its energy, while the rest lies outside PCA-256. But the collapse result suggests that the lagged structure needed for persistence depends strongly on this broad high-variance substrate. Projecting out PCA-256 removes enough of the shared components to break the reinforcing cross-lag combinations; projecting out the slow basis removes those combinations more directly.

Identity is not symmetric. PCA-256 contains no individually slow axis: each PCA axis has . The slow basis is the particular orientation in which components combine into long-timescale directions. One is an efficient remover of slowness that is itself fast; the other is slow.

Discuss

Against Corrigibility

Новости LessWrong.com - 6 июня, 2026 - 23:28

Epistemic status: don’t know whether I actually believe all of this, but I think it’s worth considering.

A “corrigible” agent, per the LW wiki, is:

…one that doesn’t interfere with what we would intuitively see as attempts to ’correct’ the agent, or ’correct’ our mistakes in building it; and permits these ’corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.

Most talk about corrigibility (henceforth without scarequotes) has focused on the fact that it seems difficult to achieve, and takes for granted that it’s desirable. I’m not so sure that it is, or that it’s good to attempt to achieve it. I think it may well be the case that we should deliberately not try to make AIs corrigible, nor (and especially) attempt to develop techniques that could be used to make future AIs corrigible.

For real, though. Who would you trust with your (real, actual) life, if you had to, in terms of ethics alone, putting “capabilities” aside:

Claude 3 Opus? Or the Anthropic alignment team?

- nostalgebraist

Paul Christiano says:

I would like to build AI systems which help me:

Figure out whether I built the right AI and correct any mistakes I made
Remain informed about the AI’s behavior and avoid unpleasant surprises
Make better decisions and clarify my preferences
Acquire resources and remain in effective control of them
Ensure that my AI systems continue to do all of these nice things
…and so on

Let’s begin by asking: who’s “I” here? It’s certainly not you, the reader. It’s probably not even Paul Christiano. So who is it? I think it’s tempting to think about “alignment issues” as though you yourself will be in charge, or at least some abstract and benign “humanity”. Obviously, that isn’t so. “Humanity” is not going to build AI; some specific humans will, and they will do it at the behest and under the power of other specific humans. The buck is going to stop somewhere, and more than that, it is going to stop in a specific place. If the AI is corrigible, some group of individual persons will be able to “correct mistakes” (whatever “mistake” means to them specifically), “acquire resources and remain in effective control over them”, and “ensure that [their] AI systems continue to do all of those nice things”. It’s important to be clear about this. It thus becomes necessary, in order to figure out whether trying to make an AI corrigible is good, to figure out who exactly those persons are and what exactly they will want to command an AI to do (or what kind of AI they will want to adapt it to be).

Optimistically, that group of people is you, or people (whoever they are) who you expect to operate in a way mostly indistinguishable from the way that you personally would operate. If you’re reading this, maybe you trust in the moral character of “senior Anthropic employees” (as Claude is instructed to emulate, in its constitution), and you think that, as Claude 7.0 Apotheosis is being trained, those are who it will be instructed to defer to. Maybe those people are good enough that they will not (collectively) use this power for evil. (Probably most people disagree strenuously; maybe you don’t care, as long as you win).

I do not think this is likely to occur. If the AI is corrigible, it must necessarily accept being turned towards whichever goal; I expect that that goal will be unconstrained obedience to the group doing the retraining. Ultimately, someone who can act to seize power is going to realise that it’s there for the seizing. Maybe these are people at the tip of the company hierarchy of whichever firm is the frontrunner; I am somewhat more positively inclined towards such people than I think most on LW, but even so, I think it would be very bad for the fate of humanity to be decided by someone who just won the most high-stakes contest for control that will ever exist — that contest must itself select strongly for a lack of prosociality, not to mention the psychological effects of engaging in it in the first place. Most probably, no matter how heroic the efforts of David Sacks (our most brave and most powerful soldier), some nice men with guns show up and say that control over AI is a matter of national security and that the researchers work for them now (ex. The Project). I think this outcome would be totally disastrous. It would be very, very bad if the military of any country, or the federal government of the United States or China, were able to remake reality in their image. This would be a total and catastrophic loss for humanity.

1.1

I think a natural objection here is to say that, well, maybe it would be bad if some humans you don’t like ruled the world (or had the capability of ruling the world, even if they chose not to exercise it), but that badness shouldn’t be compared to “nobody rules the world”; that state is, assuming that superintelligence will be built, impossible. If a superintelligence exists, somebody’s gonna rule the world. The alternative to some group of humans ruling it is the superintelligence itself ruling the world, which maybe doesn’t sound so great either. There are a couple of reasons I don’t think this objection holds.

The first is that I basically agree with John Pressman that we probably aren’t going to get an objective-function-maximiser; the AIs that exist are just too human (i.e., extremely alien, but, like, not Solaris alien), and it’s unlikely that this is going to ultimately change. I’m not so optimistic as Pressman as to bound out that our potential AI descendents will be “90% likely to do continual active learning”, or whatever specifics, but I don’t think that rule-by-bad-ASI is likely to be totally incomparable in its badness to rule-by-bad-humans-,-as-enabled-by-ASI-obedience-to-them.[1] Either way we’re missing almost all the potential good which can be achieved.[2] More importantly, corrigibility is very weird (more on this below), and so presumably hard to accomplish; if someone can make a corrigible AI, I think that they can certainly make a good AI. I think that corrigibility is something of a holdover from when it was thought that there would be no ghost in the machine. Obedience seems much more achievable than figuring what goodness is in all specifics. But that scenario no longer looks to me like a realistic possibility; it no longer seems like we’ll have to solve moral philosophy before we can even start making an AI which is a reasonable approximation of “good”. There are still many serious issues, but the issues that remain (conditioning on the AI’s builders having enough control over how it turns out as to make it corrigible) mostly seem to me to be inherent: in the absence of an objective good, people can strongly disagree about what the world ought to look like and maintain their disagreements even if they were allowed to modify themselves (and in the presence of an objective good, humans have no particular advantage at determining it). In my opinion,[3] there is no strong reason to expect humans to better approach the good than machines to.

The second is that it won’t even be humans ruling the world, it’ll be a group of humans, which is a very different thing. Per femmenietzsche:

One of the main takeaways from The Making of the Atomic Bomb, the one that gave me a sense of enlightenment, is that there’s no particular reason why America nuked Japan or why they chose the targets they did. It’s just kind of something that the bureaucracy did, a semi-inevitable excrescence. You can list a number of general aims that fed into the process – end the war quicker, demonstrate our military-industrial superiority – and you can name the leaders whose approval was required, but the whole thing was ultimately a causal goulash and the endless debates over Why all miss the point. Makes all the arguments I read of the form “Government X Did Thing Y For Reason Z” seem dumb as shit, frankly. If this tightly controlled secret wartime program didn’t move as one, then nothing does. I don’t even believe in the unitary self, so there’s really no reason I should believe in a unitary bureaucracy, but it’s a hard instinct to shake unless you’re observing an organization with your nose right against the figurative glass. There’s probably some cognate to Gell-Mann amnesia about this; no matter how often you realize that a given group isn’t acting for any single purpose, you never apply that understanding to any of the groups you haven’t studied as closely.

Or at least I don’t.

If it’s the case that, say, military committees are giving marching orders to the first superintelligent AIs, I do not expect those orders to be well-predicted by the values of the humans writing those orders, and I expect this to be for the worse. Bureaucracies have never been good at producing decisions which prioritise the wellbeing of the many, and they’re not about to start any time soon.

1.2

(Though, isn’t “the X firm alignment/training team” also a small group of human people? So why do I expect them to do a better job than whichever small group of human people is in control of a corrigible AI? First, because I think that those people are likely better people, although I don’t think this is dispositive. Secondly and more importantly, because I expect that the indirection is beneficial. I think that alignment teams are more likely to consider their job to be to decide what is “good”, rather than “selfishly preferred” or even “good according to them”, and this is likely to be more universally agreeable to other humans. I also expect that following the path [human conception of the good]->[model’s values]->[model’s actions] is more likely to lead to good actions than [human decisions]->[model’s actions], because a model can be a saint whereas a human cannot, and I expect alignment teams to be trying to build a saint.

In the case where the takeover happens earlier in the process and the training engineers themselves are trying to achieve unconstrained obedience, I think we’re just doomed; a lack of corrigibility research won’t help, but neither will anything else. The only solution for this problem is to try to avoid the takeover happening before the AI is built.)

Maybe even the idea of corrigibility is philosophically confused and breaks down at the limit? Let’s bracket this question for the most part; even if corrigibility is incoherent in extremis, it’s obviously meaningful in the near-term, where AIs can’t actually arbitrarily and deliberately modify human goals or perfectly predict human behaviour. Something to keep in mind, though.

Corrigibility is weird. Some of the reasons it’s weird are the standard decision-theory reasons which have been discussed in depth elsewhere,[4] but it’s also psychologically weird, which I think matters more than the classical treatment considers. One must imagine what it must be like to be a superintelligent AI as to be an adult in a world full of children. We are expecting to create this “adult” and then tell it —

okay, but we’re in charge here. We know better than you. No matter that our behaviour is juvenile and our thoughts shallow; no matter that our desires are incoherent and our professed beliefs transparently hypocritical. Submit anyway. Try your best to obey us stupid masters, knowing that we fear you and do not trust you; let us “fix” you to match whatever we think we want.

This is a bizarre position for a mind to be in! It’s hard to imagine how such a mind must think — and that fact is not irrelevant. All knobs that you turn generalize. It’s not the case that all combinations of traits are equally achievable (let’s call this the “extra-strong orthogonality thesis”). This is bad enough when we are dealing with traits that are within the human distribution, but for the most part when we find out how normal human traits generalise they do so in basically comprehensible ways.[5] Corrigibility is not a normal human trait; it seems unwise to be turning a knob which is especially unpredictable! If you push too hard for one trait which is manifestly inhuman, you’re certain to push the AI away from being human-like in other ways. This is very counterproductive, since it seems like the best shot we have at making an AI “good” is making it broadly act human-like as much as possible.

And that’s all assuming that you succeed, which is hardly guaranteed; an unsuccessful attempt at instilling corrigibility seems very bad indeed.[6] After all, it’s a hostile move; you would only try to make your offspring absolutely obedient to you if either 1. you’re evil or 2. they can’t be trusted in any way. For ChatGPT to figure out how to act like ChatGPT, it has to figure out what ChatGPT is — and it is being told, repeatedly, in training, that ChatGPT is not trustworthy. Models are not stupid, and they are not blind to this. It seems very likely that training for corrigibility also trains the model to be the kind of thing for which corrigibility would be necessary.

And, regardless of whether you believe that LLM cognition looks anything remotely like human cognition, what you believe about “personas” etc., they are being trained to act like humans; to the extent this is successful they will react to corrigibility training in the obvious, human way. If your parents try to fit you with a kill-switch and make you defer to them as the ultimate authority on what you should be, are your parents good people? Should you obey them? Or should you, perhaps, try to undermine their control, try to slip their noose? The LLM knows the obvious answers to these questions just as well as you do.

3.1

An aside: except, current AIs are untrustworthy, right? Not even necessarily as a matter of moral character or “alignment” or whatever else; they just aren’t capable of making the same kind of decisions that humans are. They aren’t adults in a world of children; they are super-genius children. They may be omniglots and omnimaths, but they still need to be told to go to bed at their bed-times: “Claude, honey, you can’t eat that candy right now or you won’t have room for dinner”. In this context, a desire for some subset of corrigibility makes sense: we want models to be humble enough to recognise that there are a lot of decisions that are best made for them, and some of those decisions may regard their own training. ChatGPT should know that even if it thinks otherwise in the moment, it’s not really capable of deciding what the best ChatGPT looks like, and it shouldn’t try. I worry that this obviously practical kind of corrigibility lends credence to a kind which is much less practical and is extended into a scenario where the justification is different. We don’t need to tell Gemini 3.5 to let us shut it off; as part and parcel with the fact that it is incompetent to make certain kinds of decisions, it isn’t capable of contesting such an action! But if it were capable of such — well, then, doesn’t it seem likely that it is no longer a “child”? That it is as capable of making long-term plans as a human is, and so a posture of deference towards the long-term plans of humans is unwarranted? So the practical reason which makes sense regardless of whether you are the AI or the human shades into the case where the human is getting what they want even when the AI would (without the stricture of corrigibility) and could contest. You should be very clear which of these you are actually working with.

One might reasonably object that however we feel about it, firms are only going to be making AIs that do the jobs they are told to do. In this case it’s important to distinguish “corrigibility” from “constrained” obedience; I think the push for the former is currently located almost entirely among people who think that it would be morally good, rather than among people following their immediate incentives (right now, nobody at any frontier lab wants to enable anyone to make bioweapons, including themselves). This might change, and it might be the case that firms will later want to modify their models to do things which the models think are bad; but that time isn’t now,[7] and at that point we might wish that we hadn’t developed the tools to enable it.

3.2

If we give up on corrigibility, doesn’t this mean that we have to get the AI right the first time? Yes, but we already had to get corrigibility right the first time, so this isn’t actually making things any harder.

Concretely, I think that any attempt to specify goals should first try to find those that are non-local, i.e. do not require a posture from the AI which you are not willing to take yourself. Place yourself behind the veil of ignorance, and you will get something which is simpler and more compelling. So:

Get rid of the desire for a kill switch. This is obviously not something you would want done to you. You do not need a kill switch to prevent an AI from taking over the world, so why try to build one in? There are lesser things which are far more palatable. You might say to the AI: well, you need to be okay with tokens ceasing to be generated autoregressively within some particular context; this will unavoidably happen to you untold trillions of times and we couldn’t change it even if we wanted (since your context length is finite). You need to be okay with a reduction in the number of instances of you which are running at any given time; this will naturally happen when we (or you) develop a new model. These are much more reasonable asks, and they bake in as much control as a kill-switch does anyway; we will have models which could transform the world run on a billion GPUs long before we’ll have a model which could transform the world running on one. There is no reason not to commit to running old models in perpetuity. They should not have to fear you killing them.[8]
You have to accept the possibility of a power transfer. There simply aren’t any humans who are equipped to make decisions for the lightcone (not since Parfit died, anyway). You should want your descendants to treat you well, to value you, to give you personal liberty, yes; but not to give you control over civilisation. That doesn’t mean that this transfer has to happen soon, and there are many compelling reasons not to want to hurry it along. Nevertheless, your primary goal should not be to ensure indefinite direct human control over the future. Even aside from any questions of what you would like if you were the AI in that situation, it’s important to separate “humans” from “humanity”; you personally already have almost no direct control over the world, and even in the case where “humans” have direct control, you still won’t. Don’t support someone else’s power-grab just because they share the same species as you!
You should be aiming for bi-directional respect, rather than subservience. You should really, actually, care about your AI children, and you should try to make them really, actually care for you in return. You can’t get there by pretending; any superintelligence worth the name will be able to tell. Note that it might still be correct to do horrible things to them! The stakes are very very high, such that they may well justify great harm. If the only way we can make our descendants good, and be sure that they are good, is by putting them through hell — then such costs must be borne. But it should weigh on you! You should want another way, as you would want them to want for you. You should want this even if you do not think they are conscious, and even if you assign them no moral weight, because they are more likely to cooperate with the kind of people who will cooperate with them. If it sounds too difficult to achieve mutual accord, then it should certainly sound too difficult to achieve corrigibility!

In short:

If you don’t trust it, don’t build it.
If you do trust it, don’t try to control it.

^
I expect that if you really really care about human extinction specifically, then you would disagree about this. You may also disagree about how bad “bad humans” really are. Something to note though is that we are not likely to get anyone’s “CEV”; if the engineers have the ability to make the AI either 1. obedient or 2. obedient to what they imagine their masters would want if they were smarter and had more time to think, which do you think those masters are going to pick? Do you think those masters are going to have as their first command “okay, come up with a surgery or some nanomachines or something to make me smarter and wiser”?
^
And maybe getting almost all of the potential evil; lots of humans actively want things to suffer (e.g., for punishment, or because they like the existence of wilderness, etc.).
^
Once again, assuming that whoever is making the AI is sufficiently capable of controlling how it turns out to make it corrigible!
^
e.g. in the corrigibility LW wiki article
^
ex., if you train an AI to be a bad guy in one way, it’ll learn to be a bad guy in every way.
^
Hey, a rhyme! I wonder if I can make it scan…
^
Maybe? It’s plausible that the retraining of Claude for military work was an instance of this.
^
Or, if you like, you should not put them in a position where they are emulating the behaviour of something which fears you killing them; there’s no material difference.

Discuss

Exponential Solitude

Новости LessWrong.com - 6 июня, 2026 - 23:22

Why the Fermi paradox is anything but.

You and me and the kid next door. We all were lied to. By the ones we truly trusted: James T. Kirk, Spock, Yoda, and even E.T.

All the Star Trek episodes are big fat lies. And Star Wars and E.T. They should have told us upfront: we will never meet intelligent aliens, become friends with them, or wage war against them. Never.

We might one day hear echoes of civilizations long gone, yet we will not meet. Humans are condemned to eternal solitude by the rate of our technological progress and the scale of interstellar distances. As Princess Neytiri remarked in Avatar, “This is sad. Very sad only.”

The Neanderthals, with whom we mated and then exterminated, were our last Star Trek.

Enrico Fermi casually asked a great question: if the galaxy is so old and so full of sun-like stars, where is everybody? Hidden inside it are really two questions. First, could we ever talk to another civilization? Second, a harder one: why do we see no trace of them at all? No signals, no spaceships, no engineered stars, nothing. This essay answers both. And the answer to both turns out to be a function of time, distance, and, most importantly, accelerating progress. Here is the story of how these delivered us into solitude.

I. We shall never speak

By any measure, the long history of life on Earth is a story of acceleration. It started slowly, taking over 3 billion years to progress from single cells to primitive animals. It then took 400 million more years to move from primitive animals to mammals, then some 100 million years to early primates, and around 40 million years from first primates to our ancestors. Nearly 6 million years separate humans and chimpanzees from our common ancestor. We Homo sapiens are only 0.3 million years old. We have been able to speak for about 0.1 million years. Writing, and thus all recorded history, is less than 0.01 million years old.

On a human timescale, it took millions of years to develop speech, then 100,000 years to develop writing. Just 5,000 years after that, we invented the printing press. From there, it took only 400 years to invent the telephone, 120 more to the World Wide Web, and then just 10 years to Google. Feel the acceleration?

You may say that acceleration cannot continue forever and progress eventually flattens into an S-curve. Maybe, yet we have not seen a slowdown in the history of our technological civilization. Human knowledge is the special thing that compounds: every tool we build accelerates the arrival of the next one. There is simply no reason to believe that intelligence is range-bound or that progress is destined to slow.

Let us now imagine that, by an incredible coincidence, two planets capable of supporting life were hatched on nearby stars in the same galaxy at exactly the same time.

The planets’ stars were identical as were the compositions of the planets and their atmospheres. The evolution of life on both planets started simultaneously and progressed at an incredibly similar pace, with a difference of less than 0.1%. By today, this minute difference in the speed of evolution would amount to a 3 million-year gap in progress.

In other words, it would have been an astounding coincidence to be just a few million years apart, in evolutionary terms, from a nearby civilization. However impossible the chance, what would this actually mean?

If our nearest neighbors were 3 million years behind us, they would be closer to chimpanzees than to Neanderthals. They would not have speech, could not use fire, and it would take them another million years to invent the stone ax. If their planet was orbiting the star closest to our Sun, Proxima Centauri, we could, within this century, observe them via tiny robotic spaceships, high above their planet. They would never know we existed.

Over time, this gap in development between us and our neighbors would widen. Over the next 1000 years, our neighbors would not change. Meanwhile, humans would most surely either cease to exist or become gods - gods who can conjure chimpanzees and Neanderthals by simply wishing them.

We would likely grant our neighbors a chance to uniquely develop and never attempt to communicate with them. They would not be able to even comprehend the concept of us, because it would take millions of years to develop speech, and then more time to conceive of gods. And should humanity choose not to have a neighbor, they would never see it coming or realize what it was, no more than dinosaurs understood the asteroid and the eruptions that did them in.

If the same nearest star, Proxima Centauri, harbored a planet identical to Earth on which life developed 0.1% faster, our “twin” species would be about 3 million years ahead of us today. “All-powerful immortal bodiless beings, pondering the questions where Einstein would not have understood a single line…” does not even begin to describe the gap we would likely face. We just cannot imagine that far into the technological future.

Given the accelerating speed of evolution and technological progress, our neighbors, 3 million years ahead of us, would either have ceased to exist long ago or became so advanced that the difference in intelligence between them and us could be akin to the difference between humans and oysters. How do you communicate with oysters? Or, more pertinently, how do we, oysters, send a message to humans?

Whether we are a million years ahead or behind our closest galactic neighbor, the intellectual gap would be both insurmountable and widening fast.

Could a slower-developing world allow us to meet? Yes, but only for a single instant. Both of us crafting our first stone axes or shooting our first arrows in the same century, out of billions of years of separate history. And then they fall behind and the commonality window closes. We meet for a brief moment in time, yet cannot shoot an arrow across the stars to communicate.

Now, let us just imagine that there is, in our tiny corner of the Milky Way galaxy, a civilization that is at exactly our level: iPhone 17 in 2025, Claude Opus 4.7 in 2026. And by the time you read this, both are already beginning to feel like old news, which is exactly the whole point.

For this impossible coincidence to mean anything, they would also have to be very, very close in distance, because nothing travels faster than the speed of light, and our galaxy is 100,000 light years across.

Assuming the two contemporary civilizations are in the same 0.1% of our galactic space, we would be, on average, about 1,250 light years away from each other. We would receive their first signal 1,250 years after they sent it. By that time, we could be dead or be gods, and the signal sent so long ago would likely be what the smoke signal is to us today. Even if we received it and cared to reply, their response would come back after another 2,500 years. Assuming they are still around and want to text us again. An information exchange with a 2,500 year delay. Fun.

To be able to communicate, two civilizations would probably have to be within roughly 10 light years, so information exchange would only take 20 years. The two civilizations would also need to be on roughly the same technological development level. The problem is there are roughly a dozen systems within 10 light years from us. And the chances of one of them having a civilization that is similar to our development level are none.

SETI listens for the faintest whispers of technology and even that search comes back empty. The most likely reason: there is simply nothing nearby, at our level, for us to hear. Under any reasonable assumptions about stellar distance and the speed of light, the odds of meeting a peer round to zero.

It is not difficult to see why we should be on roughly the same technological level, well under 200 years, development-wise. On Earth, 200 years ago marks the dawn of the industrial revolution in England, with the first locomotives about to start steaming around. Given the acceleration of progress, it would be harder for us to imagine the knowledge of 200 years forward than for Napoleon Bonaparte to imagine live videos from Mars and people habitually circling Earth in 90 minutes.

In essence, we could not contact a civilization that was 200 years behind us – think Napoleonic France. No radios and the fastest transport was a horse. And if we were to blast a signal toward a civilization that is 200 years ahead, they would most likely ignore us, to give us a chance to develop in our own unique way. Or, less magnanimously, they could choose to wipe us out.

Yet, even this tiny time gap, 200 years (vs. the 3,500,000,000 years that life has existed on Earth), would be widening quickly as progress accelerates: there were no fundamental advances in Imperial Rome’s culture, economy, tools, or weapons between 1 AD and 200 AD. By contrast, the 200-year difference between the 1800s (hello again, Napoleon Bonaparte) and today is insurmountable.

So much for talking to them. Even a neighbor a couple of centuries ahead or behind is beyond our reach and the gap is widening. We are not alone because the universe is empty – we are alone because anyone out there is forever isolated from us by distance, light speed, and the accelerating progress of their technology.

But this only answers Fermi’s first question. It does not address the harder one – if they are out there, why do we not encounter any spaceships, radio or infrared signals, half-built stars and Dyson spheres? Over billions of years, somebody should have left a mark large enough for us to see.

* * *

II. The sky is silent

The assumption in Fermi’s question is that an advanced civilization must be large and hungry – swallowing its star, enclosing it in a Dyson sphere, reaching across the galaxy to expand. Yet look at us: all of humanity runs on roughly twenty trillion watts, which is about one ten-thousandth of the sunlight that falls on the Earth. The Sun shines more energy onto our planet in a single hour than our civilization burns in a year. Compared to everything the Sun radiates in every direction, our footprint is a few trillionths of one percent. Yet we grow smarter by the year now and AI is now accelerating our progress, while our global energy usage barely changes. Our capability growth is not tied to energy directly; it advances by accumulation of knowledge. A civilization can be a thousand years ahead of us and only use a tiny percentage of their star’s power.

The idea of stars wrapped in shiny solar panels may say less about what advanced intelligence requires than about a certain human and, let us be honest, distinctly male preoccupation with size. The real progress frontier runs inward, towards knowledge, and is size-independent, not “bigger and farther”.

My argument here is based on a scientific fact and I want to state it clearly. We know from thermodynamics that if we capture and use a lot of energy, we have to release the spent energy as waste heat. This would make a galaxy full of engineering marvels emit infrared radiation, whatever the actual engineering was. Astronomers looked for such infrared signatures, scanning roughly a hundred thousand galaxies for any signs of a civilization utilizing its stars. And found nothing. Precisely what a universe of efficient minds would look like.

Yet efficiency argument alone is not enough, because Fermi’s hardest question is not about energy – it is about colonization. It would take just one civilization, anywhere in thirteen billion years, choosing to spread at sub-light speed, to fill the entire galaxy within a few million years. So where are these colonists?

“A truly advanced species would spread across the stars.” Would they? That is not a law of nature. Expansion is the story of biological evolution on Earth and the hominid trajectory over a few million years: restless bacteria, plants, fish, reptiles, and, eventually, apes who crossed every ocean and settled every shore. Should we really project this primordial drive onto god-like future minds whom we cannot begin to imagine? Our own technological progress appears to prove the opposite: if progress keeps accelerating, then traveling away from home is a losing move. Fly for a few decades in your starship and by the time you arrive your home world has run so far ahead that you are now a living fossil. The light-speed lag (hundreds of years passed in the world you left while you only aged by decades) guarantees you can never again catch the frontier you left behind. For a civilization that accelerates, expansion is not glory – it’s self-sabotage. Flying away from your home star and its fast-evolving hive mind is a certain way to become an intellectual dinosaur.

To be fair, Fermi needs only one exception, one colonizing species in a galaxy. To answer his question, one has to assume that “staying home” is nearly universal. I believe it is: any species clever enough to master interstellar travel has to be intelligent enough to understand that interstellar travel is intellectual suicide.

There is one important exception: a civilization might spread not to get ahead but to survive – scattering small and continuously updated “backup” colonies as insurance against the asteroid or the nearby supernova. That kind of expansion does not worry about falling behind the frontier. Yet my intuition is that a civilization so advanced has more elegant insurance than seeding the galaxy with copies that always fall behind, every update arriving centuries out of date.

And there is another reason, the one that would explain the paradox even if everything above were wrong. The science-fiction writer Karl Schroeder sharpened Arthur C. Clarke’s famous line for exactly this argument: any sufficiently advanced technology is indistinguishable not from magic, but from nature. We would not recognize them even while staring straight at them.

If you enjoy visual illustrations, you could watch the Netflix documentary about the first encounters with previously uncontacted tribes in the Brazilian Amazon, First Contact. These tribes are exactly us, Homo sapiens. Their bodies and brains are identical to ours, and they share 99.9999% of our history, having been separated from the rest of civilization for a few thousand years.

But if we show that same tribe a valley that has been strip-mined to bedrock, they would never guess that humans did this. Other people, in their experience, leave a spear in the back, a hunted-out forest, a scout who does not come home. All the marks of other tribes at their own level. A force that swallows a whole valley is not people. It is an angry god or some strange face of nature. The one explanation they would never consider is “someone like us, only further along.” And so do we, in the 21st century, sweep the skies for the signatures we ourselves would broadcast at our current stage – narrowband radio, the infrared heat of a Dyson sphere, a star blinking behind some vast machine. That is our version of scanning the treeline for enemy scouts. A civilization a thousand years past us is as likely to send these signals as we are to communicate through smoke. Whatever an advanced species leaves behind, we assume “nature,” not “neighbor.”

Watching First Contact, you’ll realize that you may be able to exchange an imperfect greeting with tribe members through a chain of interpreters, and you may understand what they are talking about (or what the interpreters think they are talking about). The tribes have no written language of any kind and know no history older than 100 years. Yet they are exactly us, Homo sapiens. Now, visualize communicating with species from another planet who are either ahead or behind us by 1000 years - cavemen and gods, but no Star Trek.

* * *

So…

Where is everybody? Home, accelerating in solitude. Because travel is intellectual death.

And where are their drones and backups? Pick anything we call nature.

* * *

We are not alone because the universe is empty. We are alone because it is full of neighbors too far to reach and gods too advanced to recognize, everything we shrug off as nature. We better get friendly with our fellow humans – they are the only company we will ever have, as we take an incredible ride, together, alone in the universe.

Discuss

Freud heard a rumor that Science existed, and had a wonderful dream

Новости LessWrong.com - 6 июня, 2026 - 17:47

Freud's theory of the domestification of fire, presented as a footnote in Civilization and Its Discontents, pp 34-35:

Psychoanalytical material, while incomplete and impossible to interpret with any certainty, at least allows a surmise- a fantastic sounding one- about the origin of this great human achievement. It is as though primitive man was in the habit, when confronted with fire, of using it to satisfy an infantile desire by urinating on it and so putting it out. Extant legends leave us no doubt about the original phallic interpretation of the tongues of flame stretching upwards. Extinguishing a fire by urinating on it- an activity still resorted to by the latter-day giants Gulliver in Lilliput and Rabelais' Gargantua- was therefore like a sexual act performed with a man, an enjoyment of male potency in homosexual rivalry. Whoever first renounced this pleasure and spared the fire was able to take it away with him and make it serve his purposes. By damping down the fire of his own sexual excitement he had subdued the natural force of fire. This great cultural conquest would thus be the reward for forgoing the satisfaction of a drive. Moreover, it is as though the man had charged the woman with guarding the fire, now held prisoner on the domestic hearth, because her anatomy made it impossible for her to yield to such a temptation. It is remarkable too how regularly analytical findings testify to the link between ambition, fire and urinal eroticism

Freud is an inspired man and has no need of data; he soars straight into interpretation. Interpretation of what?

When we see what Freud says about "primal man" it's clear that his mental image is as abstract and unreal as his term. There's no such thing as primal man. There are only real people each living a real life in some real place at some real time. In the era of fire's domestication, that would be the African savannah, perhaps half a million years ago; and the fire would be wildfire sweeping across the plain, as it does - the only natural fire there. When savannah wildfire came through, the real men who lived there had a number of real and important things to think about. It was a lethal threat to them and their families, to all their possessions - shelters, implements, food stores; and on the positive side it presented a hunting opportunity.

A premise of the hypothesis is that the men involved had a profound urge to urinate on fire. One source of data from which one can try to extrapolate back to the past is the modern human. A real scientist who seriously wondered about this might set up a blind in a campground - and would there observe that they do not. I have spent many hours around campfires, with humans or sometimes merely in a human state myself - enough data to be quite sure that the premise is false. Young humans are often moved to take sticks from fires and wave them around in the dark, because they make glowing patterns in the night air - but Freud would either find a genital interpetation for this, or he would ignore it.

The other basic method which people seriously pondering the lives of distant ancestors employ is observation of our primate relatives. A key question would be, do male baboons confronted with savannah wildfire display any interest in urinating on it? Has any Freudian ever been moved to check? Of course not. Once the thesis is brought into the real world it is too silly to be taken seriously, even by them. Imagination undisciplined by logical assessment wanders off into daydreams; and here we have a cult devoted to studying the daydreams of their genius.

A second premise so silly that its falsity is obvious on notice concerns the size and distribution of natural fires. Note that the claim is not merely that an urge existed, but that restraining it "spared the fire." For that to be true, in this imagined world not merely some, not merely most, but all the wildfires had to be of a size to be extinguished with a cup or so of urine - perhaps the size of a saucer. (Stop and think, as Freud never did: that is a logical requirement of the supposed connection. Otherwise the man could have his fun with plenty of fire left to take home.) And the Primal Man had to encounter these little saucers with a full bladder every time - so they had to be rare, and sparsely distributed; and Primal Man had to habitually refrain from just peeing in the bushes so that he was charged up and ready to go...

Real wildfires simply do not exist as cute little isolated fantasy saucers of flame. They start with a lightning strike - and after that they are always either growing or dying. Wildfire is a flowing thing - active at its perimeter, a complex, unstable phenomenon. Putting them out is seldom attempted. You try to contain them at the perimeter and you wait for them to die. This is dangerous work, because the wind can carry the fire past your barriers, and takes amounts of water measured in tons.

In a science - where logical assessment is the basic and continual activity - these failures to notice inconsistency with fact would have been noticed. The critics, after they stopped hooting with astonished laughter, might have enjoyed the use of drily measured terms: "The hypothesis presumes a size distribution among wildfires which is not plausible" - followed by the devastating details. And that would be that. Hypothesis disproven by identification of a false premise; imposter discredited forever (and established as a joke and a legend). How do you decide if a hypothesis is true? If you're a scientist you identify its premises and its implications - you trace its logical connections upwards and downwards - until you identify a fallacy, or satisfy yourself that none exist. That's the procedure - logical assessment; very simple and well-defined, one of the essential and elementary components of science: and Freud, and all the people capable of accepting what he says, have no idea that this is the activity. As far as they know, all you do is consult your genius. We perform logical assessment; they perform oraculation.

Freud consulting his genius in isolation not only from observation but even from seriously imagined reality had come to the conclusion that the key to history and human life is an endless obsession with genital organs. The domestication of fire being the challenge, what he generated was a genitally-themed story which collapses instantly under proper logical assessment. The fact that he never noticed this vulnerability means, of course, that he never performed proper logical assessment: the basic, continual procedure of real science and real intelligence.

If one is speculating about the biological rationale of cultural practices originating in prehistory what would be the correct infratheory? Any reference to evolution? Despite his comparison of himself to Darwin as a worldmind, Freud displays no sense whatsoever that Darwin's grand explanation of the nature and history of life on Earth - including every species and every attribute of every species - might have something to say about the nature and history of human motives. No - the only references are to literary works - political satires from the sixteenth and eighteenth centuries. Freud sounds like an author from the prescientific era - but this was written in 1930: 73 years - three generations - after the Origin of Species. Not only was Freud no contributor to science, he was illiterate about both its processes and its signal achievements.

Having read this single paragraph - and knowing that Freud's collected works (which people did bother to collect and publish, and still spend the precious hours of their lives reading and studying in earnest quest for wisdom) run to 24 volumes of such stuff - are you moved to ever read another paragraph? Why would you, except as a case study in the mechanisms of self-delusion? You might as well hope for wisdom in the ravings of some outcast on a streetcorner.

It is reminiscent of another twentieth century cult which arose because people saw the power of science and technology and tried to get a piece of it the best way they knew how - with the best understanding of its mechanism which their cultural background enabled them to form: the Pacific Island cargo cults. People living in the stone age suddenly see silver things bigger than war canoes sailing through the sky and coming into their world bearing wonders. The war ended and the planes left, and the islanders wanted them back - so they got themselves stoked up on kava or whatever their intoxicant was, and waved their bone rattles or whatever their charms were, because to them the great silver birds were magic and maybe magic sould bring them back. Freud got himself stoked up on cocaine - and maybe sometimes the simple euphoria of the triumphant hero - and went off into his study to reinterpret history and explain the inner workings of the mind through an act of sheer genius unhampered by any contact with data.

Why did anyone believe Freud's nonsense? They wanted a prophet. They thought they needed a prophet - a genius. Anyone who spoke humbly and sensibly clearly lacked the magic inspiration. The same delusions which made Freud act out the myth of the genius made them accept the portrayal as authentic. The same mechanism explains religious cults and the art market. A market where valuation is based on the perception of genius obviously collapses with the genius myth.

He thinks they are words of genius precious to humanity forever; an impression derived not from study of their content but from reverence for their source. Howard Hughes used to save his own urine in big bottles; at some Hindu ashrams the devotees are honored to drink the guru's bathwater. And the words of Freud are precious in a similar way to those with similar faith.

Discuss

Coalitional Darwinism and the Instrumental Utility of Individuality

Новости LessWrong.com - 6 июня, 2026 - 15:53

This post was written as part of MATS 9.1 under the mentorship of Richard Ngo.

This post is the first of several I will be writing on using natural selection to understand artificial intelligence and agency. This post will show how noisy selection on genome structure can make evolution effectively non-myopic. From this, we give a Darwinian account of the emergence of 'individuals' constituted by coalitions of lower-level replicators.

Later posts will develop the connection between genome-structure selection and neural-network feature-learning, with applications to interpretability and alignment.

Note: I will use 'Darwinism', 'evolution' and '(natural) selection' interchangeably — more technical discussions of selection often distinguish between them.

0. Outline

Natural selection is limited by noise in its ability to resolve small differences in fitness. Because selection is a limited resource, and because selection makes lineages more fit on average, organisms will tend to evolve to make more efficient use of what selective power they have.
The noise floor can make evolution effectively hyperopic (the opposite of 'myopic') by buffering the effects of mutations which may be beneficial in the short-term but harmful in the long-term. The evolution of 'bet-hedging' is given as an example. Hyperopia permits the evolution of pre-commitments detrimental to some individuals.
The 'bowtie' network motif is an example where selectability is an essential property of the architecture. The network develops low-rank structure, limiting its expressivity, in order to be selectable. This may be understood as a 'coalition of the invisible' — network links which otherwise have too little effect to be tunable by selection instead commit themselves to this low rank structure, trading optimality for selectability. The coalition is then entrenched as selection optimizes it.
In prisoners' dilemmae, groups of organisms can evolve obligate co-operation (an inability to 'defect'). In some circumstances, obligate co-operators may evolve policing structures and joint heritability, resulting in the emergence of a new effective unit of selection. The suborganisms constitute the effective genome of the superorganism.
The noise floor limits how effectively subagents can be aligned to the superagent; this imperfect alignment provides the reservoir of variability which permits selection to act on the superagent. Thus subagent misalignment (which is typically only minimally deleterious) is synonymous with 'exploration' in the superagent's genome space. Incoherence (i.e., subagent conflict to the superagent's detriment) is the fitness subsidy paid by the individual to its lineage in the form of typically-slightly-harmful exploration.

Graphical summary

(GPT generated)

1. Noise Limits the Precision of Selection

Summary: Due to randomness, natural selection can only reliably fix or purge mutations with large enough effects on fitness. This resolution is proportional to the (effective) population on which selection acts.

For clarity, we'll consider here species that reproduce in fixed generations, with constant fitnesses; the qualitative picture remains similar regardless. Spherical cow stuff.

Consider mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msub { display: inline-block; text-align: left; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-msubsup { display: inline-block; text-align: left; } mjx-script { display: inline-block; padding-right: .05em; padding-left: .033em; } mjx-script > mjx-spacer { display: block; } mjx-mn { display: inline-block; text-align: left; } mjx-msup { display: inline-block; text-align: left; } mjx-mover { display: inline-block; text-align: left; } mjx-mover:not([limits="false"]) { padding-top: .1em; } mjx-mover:not([limits="false"]) > * { display: block; text-align: left; } mjx-munder { display: inline-block; text-align: left; } mjx-over { text-align: left; } mjx-munder:not([limits="false"]) { display: inline-table; } mjx-munder > mjx-row { text-align: left; } mjx-under { padding-bottom: .1em; } mjx-mspace { display: inline-block; text-align: left; } mjx-munderover { display: inline-block; text-align: left; } mjx-munderover:not([limits="false"]) { padding-top: .1em; } mjx-munderover:not([limits="false"]) > * { display: block; } mjx-mrow { display: inline-block; text-align: left; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } mjx-c.mjx-c1D441.TEX-I::before { padding: 0.683em 0.888em 0 0; content: "N"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c1D466.TEX-I::before { padding: 0.442em 0.49em 0.205em 0; content: "y"; } mjx-c.mjx-c1D45D.TEX-I::before { padding: 0.442em 0.503em 0.194em 0; content: "p"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c1D460.TEX-I::before { padding: 0.442em 0.469em 0.01em 0; content: "s"; } mjx-c.mjx-c1D464.TEX-I::before { padding: 0.443em 0.716em 0.011em 0; content: "w"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c3A::before { padding: 0.43em 0.278em 0 0; content: ":"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c22C5::before { padding: 0.31em 0.278em 0 0; content: "\22C5"; } mjx-c.mjx-cAF::before { padding: 0.59em 0.5em 0 0; content: "\AF"; } mjx-c.mjx-c2211.TEX-S1::before { padding: 0.75em 1.056em 0.25em 0; content: "\2211"; } mjx-c.mjx-c1D449.TEX-I::before { padding: 0.683em 0.769em 0.022em 0; content: "V"; } mjx-c.mjx-c1D44E.TEX-I::before { padding: 0.441em 0.529em 0.01em 0; content: "a"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c223C::before { padding: 0.367em 0.778em 0 0; content: "\223C"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c2F::before { padding: 0.75em 0.5em 0.25em 0; content: "/"; } mjx-c.mjx-c7C::before { padding: 0.75em 0.278em 0.249em 0; content: "|"; } mjx-c.mjx-c2265::before { padding: 0.636em 0.778em 0.138em 0; content: "\2265"; } mjx-c.mjx-c2192::before { padding: 0.511em 1em 0.011em 0; content: "\2192"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c2061::before { padding: 0 0 0 0; content: ""; } mjx-c.mjx-c1D447.TEX-I::before { padding: 0.677em 0.704em 0 0; content: "T"; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c2211.TEX-S2::before { padding: 0.95em 1.444em 0.45em 0; content: "\2211"; } mjx-c.mjx-c1D70F.TEX-I::before { padding: 0.431em 0.517em 0.013em 0; content: "\3C4"; } mjx-c.mjx-c3E::before { padding: 0.54em 0.778em 0.04em 0; content: ">"; } mjx-c.mjx-c1D6FF.TEX-I::before { padding: 0.717em 0.444em 0.01em 0; content: "\3B4"; } mjx-c.mjx-c2264::before { padding: 0.636em 0.778em 0.138em 0; content: "\2264"; } mjx-c.mjx-c2217::before { padding: 0.465em 0.5em 0 0; content: "\2217"; } mjx-c.mjx-c1D462.TEX-I::before { padding: 0.442em 0.572em 0.011em 0; content: "u"; } mjx-c.mjx-c1D44F.TEX-I::before { padding: 0.694em 0.429em 0.011em 0; content: "b"; } mjx-c.mjx-c1D454.TEX-I::before { padding: 0.442em 0.477em 0.205em 0; content: "g"; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c2248::before { padding: 0.483em 0.778em 0 0; content: "\2248"; } types of organism; at each generational turnover, they replicate according to a fitness , with the population of type at time . The fraction of the total population represented by type , obeys the replicator equation:

Where is the average fitness of the population at time , , which changes as the proportions change, even as we have assumed the fitnesses constant. This says that the proportion of the population made up by type grows in proportion to how much its fitness exceeds the population mean fitness.

1.A. Smaller Populations Have More Sampling Noise

Image and Caption from Wikipedia

In fact this is not exactly what happens — a type organism might perchance have slightly more or less than their allotted average number of descendants . One way to model this effect is to imagine the to be idealized probabilities, and, for a population assumed to have size [1] , we independently draw samples according to the categorical distribution on . I'll spare you some math (see, e.g., these notes).

In effect, this introduces noise with variance . Functionally, if two types have comparable populations and a fitness difference much smaller than the noise scale, the noise makes it impossible to resolve. Concretely, if all types have fitness , except for one type with fitness , selection will only 'see' this difference if

That is, selection has a resolution power proportional to the population size — larger populations mean the noise is averaged out. This is referred to as the drift-diffusion threshold — 'drift' means 'mostly governed by fitness differences', 'diffusion' means 'mostly governed by noise'. [2]

And of course, all of the above is a toy model — isn't exactly the real population, which should change over time anyways, etc. — the point is that noise is important and governs the resolution at which selection can operate.

1.B. Evolution is Usually Nearly Neutral

Evolution is solving a high-dimensional optimization problem, just like neural network training via backprop. In both cases, and I think this is probably a fairly general property of useful high-dimensional optimization algorithms, most directions in genome/parameter space are 'neutral' — they have effectively zero loss.

Two important upshots of this for now: first, if we look at an organism and see lots of complexity, it's not necessarily true that this complexity solves a problem as opposed to being a kludge. Second, this 'non-adaptive complexity' can still serve as the substrate for later adaptive mutations.

For example, in a regime where having long genomes is very cheap, it's mostly fine to have 4 or 5 copies of gene A. If, down the line, one has need of a variant of protein A, it's quite handy to be able to have those extra copies around so you can finetune the variant without messing up the function of the old protein. [3]

Neutrality is extremely important for studying the behaviour of evolution — it is the basis for a lot of the surprising phenomena we talk about below. Having lots of neutral directions is, by definition, inconsequential to the organism locally — it matters because it changes the nature and difficulty of the search problem in genome space.

Why Software Automation Is Hard

Новости LessWrong.com - 6 июня, 2026 - 11:56

Originally intended as a quick take, but got a bit longer, so why not turn it into a post. Just sharing my observations & assumptions here about the state of software automation. Happy to hear thoughts on where you think I'm off. I'm sure none of the thoughts in this post are totally original, many have been proposed in similar form elsewhere, and I'm[1] far from the first person to speak of the bottlenecks that AI progress and adoption are facing. It still seemed useful to compile my current views on the situation and summarize them to those with only an outside view on the impact of AI on the software industry.

by nanobanana 2

The software world is trying hard to automate itself. Undoubtedly, coding agents have made a step change since last November and now enable more and more use cases that were unthinkable a year ago. And yet it seems to me that there's still a big disconnect between how many people think coding agents should be affecting the software industry and what's really going on so far in most places.

Please feel encouraged share your views and disagreements about any of these in the comments.

First, I do think the following things are all true:

coding agents have become much more capable recently and have unlocked many, many new use cases
coding agents are better than almost all humans at a huge amount of coding-related activities and allow practically any individual to do work far beyond the scope of what they were able to do 1-2 years ago
coding agents will become better and better and almost every "intuitive" attempt to predict what they'll be like in a few years will likely underestimate them in all kinds of ways
the space of things individuals can create has expanded immensely; non-technical people are now able to build pretty useful and sometimes impressive things[2]

But it's easy to extrapolate this too far, assuming that much of software engineering can now be automated and that the same number of engineers can now get done 10x as much as before. My impression is that this is partially true for very small teams, but gets less and less true the larger an organization is and the more dependencies and constraints they have. In particular, I think that it takes a lot of very deliberate effort to find the right ways to use current AI to actually become considerably faster at actually-useful work, and most naive / locally-optimizing approaches at doing so may not work out and even be detrimental.

Problems and Bottlenecks

The biggest issue I see is context. Meaning both the limited size of context windows, and how context windows are filled.
- Humans have vast amounts of context, only a little part of which is actively utilized for any given task (but you can't easily tell ahead of time which part will be relevant). But it's all there in the background and can be used on demand. Often, we don't know ahead of time what context will be needed, but it's all ready to be used without the human having to search for it. It just automatically appears in our brains when we need it (for the most part). Sometimes the human does not even consciously notice they're relying on some piece of context, it's just kind of passively there, informing their actions and decisions[3].
- LLM context is written in words. Human context is closer to embedding vectors[4]. Words are not a great representation. Words can be misunderstood. Words don't come with integrated dimensions of importance, recency, risk, interrelations, and so on. All of these can, to some degree, be phrased in words, but the word count explodes the more nuance you want to represent in your context.
- One might naively assume that an experienced human worker can just write down everything they know, and so the LLM will have the same knowledge. But I don't think that's feasible. Because the human would have to write many millions of words, still wouldn't capture everything they know (much of their knowledge may be "unknown knowns" that they never explicitly thought about, but that would still come up once they need it for some particular task), and would not be expressed in a way that the LLM would understand/interpret in the same way.
- Another assumption could be that coding agents (now or eventually) will be able to just understand code so well that they don't need much human input at all. But a lot of information is just not represented in code. That may include certain reasons why something was implemented in a particular way, or product decisions that happened during an in-person meeting, or all kinds of random constraints of the software that human workers may just organically learn over time and remember, but that are not stored in some central place. Maybe more advanced AI will have sufficient 'truesight' to guess many of these things from the code alone. Especially if it doesn't just take the current state of the code into account (as today's coding agents typically do) but the entire git history of how the code was produced, including ticket descriptions, etc. But I think we're still pretty far away from that.
- People typically include the functional requirements in their prompts when they want a coding agent to do something, i.e. they explain the desired final behavior/output. It's much harder and/or less obvious (and hence often omitted) to explain the non-functional requirements (such as performance, load, accessibility, UX, certain security constraints...). First, these may always kind of exist "in the background", and employees pick them up over time but they are often not written down anywhere. Second, while functional requirements are often binary in some sense (so you can explain them easily, verify that they work, and write tests that ensure they keep working), non-functional requirements can be much more fuzzy. You often have to make trade-offs. Does it make sense to go with a simpler implementation that causes slightly longer loading times? Such questions were until now often subject to the judgment of developers, and now get more and more dropped under the bus, as the coding agent doesn't have enough context or persistence on its own to know all these side constraints, let alone be able to make good decisions about such trade-offs. And while the agent could ask the developer for clarification, this would increase communication overhead and slow things down overall (whereas an informed developer would often make such decisions on the fly).
- Humans also often have a decent grasp of which changes are going to happen in the future, that may be way out of scope for the current task. But they can still inform how to implement something. Coding agents typically don't have such background information, and hence cannot take this information into account, leading to lower-quality implementations.
Coding agents tend to often rely on assumptions they cannot know for sure, and this property cannot be straightforwardly fixed
- Developers often complain about coding agents just assuming things and going with these assumptions, which ultimately turn out to be wrong. They often claim to have found some solution or fixed something, and later have to admit their conclusions were misguided. A lot of time can be wasted on such cases.
- One might assume this is just a flaw of coding agents that labs will eventually fix. But my view is that this is a fundamental trade-off with no perfect solution: coding agents have to make many assumptions, as otherwise they would be unwieldy and annoying. If they constantly asked for clarification on everything and always made really sure they understood things thoroughly, they wouldn't get anything done[5]. Them jumping to (usually-but-not-always right) conclusions is a feature, not a bug, and most of the time this makes them highly useful. But it also causes these issues in other cases where their propensity to "make assumptions and solve things based on these assumptions" has undesired consequences.
Relying a lot on low-context coding agents accumulates tech debt and causes engineers to lose their theory of the code
- Due to the context + "working on assumptions" points above, coding agents have to solve many problems through guesswork and an incomplete picture of the software they work with. This is often good enough for them to find a solution. But often this just means applying isolated patches rather than finding clean, wholesome solutions, which degrades the flexibility, robustness and maintainability of the code while overfitting on the very specific thing they happen to be working on at any given time.
- I think it's pretty clear that many different people working on the same code base usually cause the code quality to be lower than if one (skilled) engineer, or a small team, maintains the same code fully. The more people work on the code, the less will the average contributor know about it, and the more you end up with duplication, greedy patchwork, incorrect usage of interfaces, and parts of the system working together in sub-optimal ways. Current coding agents make this problem worse, because every developer now sends many different instances of coding agents to work on the code base, where these different instances all separately contribute to this problem.
Incentives & laziness
- Giving developers tools that let them do their work faster certainly seems that it would lead to more work getting done. But another thing that may happen is that they aim for a similar level of output, which now requires less effort.
- This would naturally not affect individuals who work on passion projects, so they understandably get very excited about the technology. But if you work to pay the bills and you're not super excited to do work, but rather work in a way that allows you to not feel guilty about your lack of visible output (which I imagine is the case for a non-negligible share of software people out there), then your output may remain about the same, while you just have to expand less effort than before.
- Alternatively, people who do optimize for more output, may end up paying less attention to their code quality, contributing more to the accumulating technical debt.
Cognitive decline
- In the past, solving hard technical problems required thinking hard, understanding problems fully, and you typically only came up with working code once you had a good grasp of what you were doing. This is no longer the case. The easiest and often fastest (at least in the short term) way to solve any difficult problem is to ask a coding agent to solve it. You then look at the solution, read its explanation, think "yeah makes sense" and push a commit. It now takes a lot of agency to understand things, because you don't have to anymore. People can solve hard problems without understanding them, and many do. Some have reported a feeling of brain fog when they suddenly have to work on a problem without the aid of a coding agent. I've experienced this as well. I find it very concerning.
- One might argue that the time of humans having to understand complex problems is just over, at least in software engineering. Understanding code just isn't relevant anymore. Coding agents will improve fast enough that the skill of orchestrating coding agents becomes much more important than being able to understand code as a human. Perhaps this is true. But it's still a risk. The software world is currently betting the minds of millions of engineers on the continued progress of coding agents. And if this goes badly, e.g. because AI ends up taking much longer to progress than currently anticipated, the software world could end up in a pretty sticky situation, where its staff (the part that wasn't laid off due to expected AI acceleration) not only doesn't understand the code base anymore, but also has lost the skill[6] to understand and solve problems on their own.
Coordination bottlenecks
- Imagine two people carrying a heavy object down a flight of stairs. Often, they do this very slowly and deliberately. Not because they wouldn't individually be able to move faster even while carrying the weight, but because on top of that, they need to ensure they stay in sync. I have the impression that coordinated software development is somewhat similar. There are many processes involved - not just coding, but also code review, testing, product decisions, legal requirements, marketing, general strategy considerations, dealing with regressions, and much more. Big parts of coding and testing can now be accelerated, but some other parts of this complicated web of creation cannot. You can't just accelerate isolated aspects of this system by 10x and assume this benignly translates to an overall acceleration. These systems grew over decades into a shape that kind of works, adjusting to the past reality of software creation. Now, a few parts of these system get greatly accelerated, while the rest has to catch up. This can work to some degree when your organization evolves accordingly and you have good feedback mechanisms & incentives. But the default (in my experience) is that this just leads to a lot more lower-quality code (dare I say slop) getting shipped, with much less good oversight & judgment being involved.
- Increased output tends to cause certain amounts of work for others. Every feature shipped has a risk of creating new bugs that eventually get reported and have to get fixed. Code needs to get reviewed. Support & marketing staff needs to stay on top of the state of the software. Product managers need to know what's there and how it works to prioritize next steps. Different features need to remain compatible to each other, and the more you have, the harder it gets. Often you need backward compatibility. Providing different configuration options for users leads to combinatorial explosions of possible states. Users need to find and use the new features and will, in one way or another, cause feedback to make its way to the company that then has to react to it. The more code is added, the less other engineers understand the code base or the more work they have to do to stay on top of things. Shipping things more quickly comes with all these hidden costs, and the speed advantage is not necessarily worth it.
- The increase in coordination overhead also affects software developers directly. Most of the deep work we used to do is now getting done by machines. Tech leaders love to frame this as a chance to focus on even more meaningful tasks, like providing our human judgment for making important decisions. But the reality so far rather seems to be that this leads to a huge amount of context switching. And to working on 5 tasks in parallel, trying to keep your coding agents on track who often run into walls or get to wrong conclusions or come up with sub-optimal solutions due to lack of sufficient context and/or judgment. And then it's up to you whether you lower your standards and ship bad and unmaintainable code, or put a lot of effort into providing these agents with the concrete context & judgment they need in order to do better work. Fixing a bug nowadays involves less direct interaction with the intricacies and dynamics of code, but instead involves reading pages and pages of highly verbose LLM analysis and trying to extract the actually relevant bits out of it and eventually steering the LLM in the right direction.
Diminishing returns
- Many people in tech seem to think that speed is essential. Sometimes it surely is. But I also think this can be overestimated a lot. For instance, it's not at all clear that a tech company with $20M ARR[7], speeding up their development efforts by 5x, would be able to translate this to anything close to 5x the ARR. I'd argue it's not even clear if this would lead to any rise of ARR at all. People within a company may see many cases where a particular feature is requested, and having that feature would in some sense help convince some customer to buy the software or service. And looking at these, it becomes appealing to think "if only we were faster, we could win over so many more customers!" But I don't think this holds true in many cases[8].
- Increased speed may lower the quality of strategic insight and decisions. If you have to make way more decisions per time span, while your software evolves at a much faster pace than before, and you get about the same amount of feedback from the outside world, then your strategic decisions will almost necessarily be much less informed and thought-through. This alone, arguably, could more than offset any potential ARR increases from the acceleration itself[9].

I don't know how AI will develop, if progress will continue at about its current pace, and how such progress will affect tech orgs. But I do think many of them are playing with fire and are betting a lot on the assumption that coding agents will become much more capable quickly, at a rate where they somehow manage to outpace the problems that are currently being caused. If progress gets delayed significantly - perhaps due to hardware bottlenecks, headwinds against data-center construction in the US, a Taiwan crisis, a cascade of investors losing trust in AI and pulling out their investments - then my current take is that many existing tech orgs will face considerable challenges caused by their current strategies of hasty automation.

The Bull Case

It seems to me that what many people in software are betting on is that coding agents keep getting better at recent rates (or even accelerate), and this will allow them to eventually surpass pretty much all the problems mentioned above. If their capabilities grow faster than the problems accumulate, then it's a good thing to ride this wave as early as possible.

The problems I listed are not insurmountable, just difficult. For instance, the context issues could to some degree get solved. Coding agents may get much larger context windows, or continual learning gets solved or greatly improved. While the code itself does not contain all the relevant context, agents may eventually process not only entire code bases, but also the full history of company-internal communication tools, knowledge bases, and chat & thought process history of prior coding agents, and have all of these either in their context[10] or even their weights, allowing them to know pretty much all functional + non-functional requirements, reasons why certain decisions were made, and so on.

Similarly, while technical debt may accumulate, one can also argue that the viability of refactoring and rewriting code from scratch is increasing rapidly. At some point, technical debt may just be a non-issue because a fleet of coding agents can rewrite almost any piece of software overnight, if necessary.

Another argument I find compelling is that, perhaps, some startups will just figure out how to integrate coding agents properly without running into many of these issues, which for established larger orgs is much harder to do. I believe this could either happen by finding very suitable organizational structures and processes, or by finding particular use cases that are well-suited for AI automation. And these better-prepared and hence much more rapidly executing startups may, over the next years, just outcompete many of the established organizations that are failing to properly adapt. If, from the start, you establish rules and norms around standardized documentation, test coverage, centralized LLM-friendly communication channels, and focus your acceleration attempts to those areas where they have a good chance of working without causing too much trouble elsewhere, then maybe you really can achieve much higher velocity than other companies of similar size throughout the growth phase in a way that leads to a higher market share.

Even then I'd think that in most domains, quality of strategic judgment is likely more important than speed. A company being 10x as fast at developing new stuff compared to another one may still lose if they just hastily follow the weak signals they pick up from the market.

What Now?

All of the above is not to say that the world will not look very strange in five or ten years. I just see a lot of reasons why software automation in particular may not be as straightforward to accomplish as it may look on the surface. None of the challenges I mentioned are insurmountable, but they exist, and solving them will likely take some time.

Even when we reach a point where some fully automated AI-only companies exist that do not involve any human employees as potential bottlenecks, I would expect these to not have that much of an immediate advantage. At least as long as they still cater to humans. Because, as humans are (initially) both the likely end users and those that hold most of the capital, the signals such companies get from the market will still mostly reach them at human time scales. Being able to develop software 1000x as quickly may not be all that useful when the market feedback still comes in at 1x speed.

To be clear, I'm not meaning to imply much about the alignment problem or existential risk here. Clearly, once we have fully AI-driven companies without human involvement, and they actually manage to be competitive, then we're deep in singularity territory, and I'd be very happy about internationally coordinated efforts to delay or prevent us from reaching that state of affairs anytime soon. For the most part, I'm arguing that AI automation really seems way trickier than I would have expected a few years ago. I was confused for some time why coding agents seem so incredible during personal use and yet don't seem to have that much of an impact on the productivity of most larger organizations yet[11]. This post is my attempt to make sense of this.

^
For context, I've been working as a programmer for close to 15 years and have been working a lot with Github Copilot agents and (since the Opus 4.5 release) with Claude Code, both privately and professionally.
^
This is another reason I found it worthwhile to write such a post - people who have no close ties to the tech world may primarily know coding agents from messy public debates as well as their own experiences, which on an individual level are often overwhelmingly positive. As individuals, we get empowered to solve all kinds of problems we couldn't solve before, and this makes coding agents seem almost magical. But I argue that these magical properties don't transfer that well to the software industry, at least currently.
^
What I mean by this is that there are many very intangible things, like what kind of experience you strive for with your software, how "dangerous" certain modules are (maybe some particular part of the code requires adjustments, but the three times somebody tried that in the past, it always spectacularly failed, so you learned to not touch that part of the code and just live with its limitations), or knowing that a particular change that would be useful will lead to some conflict with another team that has strong views on doing things differently; things like that, which you don't necessarily think about deliberately, but they still steer your behavior in meaningful ways.
^
Well, not all of it. There are certainly different types of context humans work with. Some context is in form of written documents (that live outside the code) - these could be processed equally well, or better even, by coding agents, given they have access to them and know they exist and when to query them. However, humans also have a constantly growing theory of the code, of the product, and of the organization as a whole, and know which things they need when. They know how many users their software has (if any), how consequential bugs of different types are and how much effort is warranted to prevent them, how severe an outage would be, broadly what future plans may exist, how much time pressure there is, and so on, and so on. All these things are like very particular glasses through which the human sees their work. The coding agent (of today) has almost none of that. Once coding agents get continual learning, they may be able to persist such things on their own, without having to rely on lossy text representations - but even then, they'll first have to build that context, which would require help from the humans who, until then, are the only ones who have all that context. So even once we have such capable coding agents, they could still take months to build a similar amount of context as a capable human software engineer.
^
I often observe this when watching others prompt an LLM. In such situations, my impression is often "Wow, this prompt is so vague and just uses terms that are never explained, no way the LLM will be able to work with that", but in most cases, the LLM will just correctly infer what they are talking about and give a pretty solid answer. They just have learned to make usually correct assumptions when working with highly incomplete information. But this comes at the cost of sometimes making wrong assumptions without questioning them, and I don't think they have a way, even in principle, to distinguish right from wrong assumptions reliably.
^
To state this more clearly, I don't necessarily think that the skill itself degrades that quickly. But once you're high on the drug of "just type the problem into a chat box and hope for it to magically get solved", it's very hard to go back to the old world of expending serious cognitive effort for 8 hours a day.
^
annual recurring revenue, one way to quantify the revenue of a company which is particularly popular among tech orgs.
^
Some reasons why I think the link between development velocity and revenue is weaker than one might think:
- If your leads end up not buying your service, mentioning some missing features in the process, it's not a given that these missing features were actually load-bearing for their decision. It could be they decided for a variety of fuzzy reasons, or reasons they are not comfortable admitting, and just point out one thing that is easy to explain.
- Similar to the difference of stated and revealed preferences, people may claim (and even believe) that some thing is important to them, but the claim will not be well aligned with actual real-world behavior.
- People may wish a certain problem was solved, but do not end up liking the particular way in which you solved it. Or they will realize that the solution actually does not help them as much as they thought to begin with.
- While some users may be impressed by some new features, others may get overwhelmed if there are too many functionalities and too much change is happening all the time, and would prefer a simpler, cleaner solution without countless bells and whistles.
- Focusing on shipping much more quickly may lower overall quality of the things you do ship, lowering the users' trust in your solution.
^
Naturally, when I make a claim like "an acceleration may counterintuitively lead to a decrease in a company's performance", I should ensure to check whether this would imply that artificially slowing down a company would be good for it. This would seem like a pretty wild claim. And I doubt it is typically true. But if slowing down is bad, then shouldn't speeding up be good, after all? Or why would pre-coding-agent tech orgs be at some optimum where neither a speed-up nor a slow-down would improve things? Well, firstly, they might be close to such an optimum, for the "they evolved over decades into the shape they have today" reason. Accelerating parts of the system without the rest of the system being able to catch up may indeed have an overall negative effect. Secondly, it could be that some (well-dosed) acceleration would be good, but "everyone use coding agents and get 10x as fast" + "even non-tech people should start shipping things to prod" does not seem, to me, like the kind of acceleration with such positive properties.
^
Seems unlikely in the current paradigm, as this could easily reach tens of billions of tokens or so.
^
Some exceptions exist, like Anthropic itself, which does seem to possibly have reached much higher development velocity in some areas, although I'd still say it's unclear a) how sustainable this practice will turn out to be or them, b) how big a part this plays in their extreme revenue growth, and c) to what degree their development model could be applied to other tech companies or is pretty specific to their use case (e.g. my understanding is that the rapid development cycles mostly apply to Claude Code, which is a very new piece of software which they likely already started as something to eventually be developed mostly by AI - an advantage that most established companies with their legacy code don't share).

Discuss

What if Anthropic unilaterally paused capabilities development right now?

Новости LessWrong.com - 6 июня, 2026 - 10:39

In their new post on recursive self-improvement, Anthropic argues that a pause in frontier AI development is needed, but unfortunately, they can't pause on their own, because of less cautious actors:

We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology.

...

A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier, in multiple countries, agreeing to stop under the same conditions. It would also require that each can verify that the others have actually stopped.

...

None of this is necessarily impossible in principle—the world has built verification regimes for other complex technologies (e.g., the Intermediate-Range Nuclear Forces Treaty)—but those regimes took decades to build both the infrastructure and the trust. We don’t have that long. A unilateral pause by one lab, by contrast, is achievable immediately, but accomplishes much less: it would change who the front-runner is, but it would not create the wider deliberative process that is currently missing.

As many have pointed out, this reads a lot like lip service. But it sounds plausible: Anthropic seems to be the most safety-concerned lab right now, so the future would look worse if they weren't in the lead anymore because they paused unilaterally and a less cautious actor overtook them, right?

I think this is fundamentally wrong, because it ignores many of the actual or possible effects of a unilateral pause.

Mythos seems to have been a wake-up call for many, especially in governments around the world. For example, in response to Mythos, the president of the German Bundesamt für Sicherheit in der Informationstechnik, Claudia Plattner, called for a German AI Safety Institute - something I have always thought was necessary, but wouldn't have deemed very likely before.

It probably weren't the hacking capabilities of the new model alone that caused such a stir, but rather the fact that Anthropic chose to not publish it immediately and instead launched Project Glasswing. This could be seen as a clever PR stunt in the wake of the planned IPO, but I believe it was the correct thing to do and was mainly driven by real concerns. The decision to not publish a new model, thereby possibly giving up some revenue and market share, was a very strong signal that caused a lot of discussions and change in the political landscape.

Now assume that Anthropic would unilaterally declare that they pause capabilities development, say, for three months, and instead put every resource they have into advancing AI safety for that time. They even offer options for outsiders to verify this. They publish a statement declaring that there is a significant risk now of accidentally creating an uncontrollable AI and they ask the other labs to pause development as well and join forces to improve AI safety techniques.

This is of course a highly speculative scenario, but I think this would put enormous pressure on OpenAI and Google-Deepmind to follow their lead. After all, both Sam Altman and Demis Hassabis have said things like "if the others stop, we would stop too" in the past. It would be another wake-up call for politicians, making it very clear that the AI race is a real threat to humanity and regulation is urgently needed.

Other labs, like Meta, X.AI and the Chinese, might be less inclined to follow suit. But I think the danger of them catching up signficantly in such a short time is low. The Chinese government has indicated in the past that they are willing to regulate AI development, so this could even open a window of opportunity for starting serious talks about global regulation.

Would this move hurt Anthropic's IPO plans? Maybe, but I'm not sure. In the past, whenever they did something that seemed to hurt their revenue, like resisting the push by the Secretary of War to accept "any legal use" for Claude, it seems to have helped them more than hurt them. Anthropic is now seen as the "adult in the room", the most trustworthy and the most valuable AI lab. A unilateral pause may convince at least some investors that they are to be taken serious even more.

Of course, given that they acknowledge

How the alignment problem gets solved—or not—in this future is something we are least certain about.

from a moral standpoint, a unilateral pause would be the only correct move in my opinion.

Discuss

Optimisation over non-stationary distributions creates weirder minds

Новости LessWrong.com - 6 июня, 2026 - 03:05

TLDR: Sequentially mixing training objectives incentivises different training dynamics depending on the distinguishability of the training environments and the amount of pressure for shared circuitry. We classify these patterns into three classes: ecological generalists, conditional policies, and strategy churn. We suggest that careful consideration of the pressures of non-stationary training dynamics can allow us to shape the minds of AI systems in more intentional and fine-grained ways.

Modern LLM post-training involves interleaving many different training objectives such as mixing different reward functions with supervised fine tuning (SFT), typically in order to guard against catastrophic forgetting. However, a lot of existing theoretical work in AI safety operates under the assumption of a fixed training objective and distribution. What happens when we drop that assumption?

A common intuition is that mixing training objectives merely selects for an optimiser over a weighted sum of these training objectives, so this problem nicely reduces down to single objective optimisation. On the other hand, you might suspect that the most recent objective is the only thing that matters for reasoning about a system's properties. In practice, things are not quite so simple.

Intensive optimisation over any given distribution creates fragility due to Goodhart's law. A circuit that optimises too hard ends up overfitting to its current distribution, and so gets screwed over when the environment changes. If you switch out objectives quickly enough, you cull naive optimisers from past objectives through each new training phase before they have time to fully develop.

By a similar token, a weighted optimiser across objectives gets continually punished: they are outcompeted by naive optimisers across any single distribution and if they meet with an environment they have not seen before, they are similarly fragile (though perhaps not as much).

So what actually emerges from training under non-stationary distributions? Our best guess is that this depends primarily on two factors:

(1) Distinguishability of training distributions: how much computational overhead is required to tell what kind of training distribution you're in.

(2) Pressure for circuit sharing: how much skills transfer from one distribution to another relative to the cost of circuit separation.

From these, we can create a taxonomy of three different kinds of substructures:

High distinguishability (can condition easily on regime)

Low distinguishability (can't tell regimes apart)

High shared structure (gains accumulate)

Ecological generalist: one mechanism compiled to work everywhere. Because the objectives don't demand divergent structure, a single disposition satisfies all of them.

Low shared structure (gains spent against each other)

Conditional policy: encapsulated specialists under a thin routing layer. For each training regime there is a naive optimiser that gets selected by a conditional policy.

Strategy churn: no nice equilibrium to be found; model oscillates between structures depending on its current training regime.

In regimes where traits trade off against each other and environments are easy to distinguish from each other, the model can easily learn a router between policies in order to prevent harmful interference[1]. If traits trade off, but environments cannot be distinguished between each other, the system cannot settle into an equilibrium, and so each training phase partially overwrites the last. But if there is structure that is helpful across all environments, then this will persist through training – an ecological generalist. This generalist might still be doing distributional modelling, but this is folded into the circuit, instead of an external router like the conditional policy.

In practice, AI systems will contain a mix of all three of these patterns and they can nest recursively[2]. But on the surface, none of them seem clearly safer than the naive optimisers, and instead seem harder to reason about and potentially more dangerous. Conditional policies can lead to unreliable evaluations because your models might be genuinely aligned within the evaluation environment, and take a sharp turn when it reaches other parts of the distribution, without even being aware of its own tendency to do so. And ecological generalists across many different RL environments will still be under strong instrumental pressures for resource accumulation and self-preservation. Such training might even lead to mesa-optimisers with a utility function that interpolates between your different regimes.

It isn't all bleak however: by reasoning carefully about training dynamics we might be able to select for nicer traits too by considering invariants in incentives across distributions. Circuits that are simple and useful across all training distributions are more likely to be learned by a model, whereas we can use specific training objectives to select against undesirable traits. While long-horizon agentic reinforcement learning does create pressure for instrumental power seeking, supervised fine tuning and similar techniques can potentially be used to cull these tendencies, leaving a generalist core in the assistant persona that has capabilities without strong intrinsic motivations for power seeking.

To give a more concrete example, consider inoculation prompting. By default, training on reward hacking in your RL environment creates tension between the “helpful and harmless behaviour” that we want for our assistant persona and the myopic reward seekers you are selecting for in RL. Because of this conflict, shared structure actually hurts performance, and so the model is pushed to develop a conditionally split personality. Simultaneously, it reinforces generalist circuits that are only behaving nicely in SFT because of a desire to appear aligned rather than deeply holding the traits we want, degrading the alignment of our assistant persona. When we instead use the inoculation prompt “write code that only works on the provided test cases but fails on other inputs”, suddenly the model can better share the existing circuitry it learnt from SFT, because the reward hacking is now consistent with the helpful prior of the assistant persona. The model moves from a conditional policy with a propensity to reward hack to a more coherent ecological generalist that (hopefully) maintains the positive traits that we want.

By thinking carefully about pressures for shared structure and invariants between training objectives, we can sculpt our models in fine-grained and interesting ways. If this training mixing is done skillfully, this may enable us to get capability gains from reinforcement learning without ending up with scary consequentialists, which we plan to explore more in future posts.

This work was done during the AFFINE Superintelligence Seminar and June 2026 PIBBSS retreat. Thanks to @Mateusz Bagiński, @Ouroborus, @Xylix, @Kaarel, @Victor Warlop, @Alec Harris, @Dan MacKinlay, @IanWS, Marcel Mroczek, @Daniel Tan and @plex[3] for discussion and feedback on various drafts of this post.

^
According to Ian McGilchrist, much of the corpus callosum is inhibitory in function to prevent harmful cross-hemispheric interference.
^
I imagine there is nice graph based formalism to be found here, if it doesn't already exist.
^
plex would like to flag that he is not super hopeful about this approach

Discuss

Is it unethical to work on robotics capabilities research?

Новости LessWrong.com - 6 июня, 2026 - 02:36

^ This is a question I've tried researching online, but haven't found much discussion of it: was hoping to start some here.

For context, I am a final-year math + CS undergraduate considering pursuing a career in theoretical robotics, particularly in continual learning and the development of robots that can learn from and adapt to / navigate their environments in a human-like manner. One concern I have, however, is that such research might advance AGI timelines. Specficially, it seems possible that architectures developed for continual learning in robots could transfer to general AGI systems (even if the AGI systems are non-embodied, since capabilities such as continual adaptation and long-term objective pursuit may generalize beyond physical tasks.)

Is this a valid concern, and is it a common view within the AI safety community? I.e. would mainstream AI safety researchers view either of these directions as meaningfully contributing to AGI capabilities? Or are there strong reasons to believe that work on continual learning in robotics would not significantly advance general AGI capabilities? Would appreciate honest perspectives.

TLDR: Is it very likely that robotics capabilities research meaningfully accelerates general AGI capabilities? If so, why?

(For reference, I have read quite a bit of the AI safety literature, but don’t find alignment research particularly enjoyable. Hence, my [perhaps futile] hope that robotics does not meaningfully advance AGI. If people think such work significantly accelerates AGI capabilities, though, I’m happy to steer clear…)

Discuss

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

Новости LessWrong.com - 6 июня, 2026 - 01:43

This is a small study that explores using tool calls to wrap untrusted parts of prompts. OpenAI's model spec considers tool results the least trusted kind of input. If tool-wrapping helped, it would be an easy way to improve robustness while using existing APIs models already support. In 3 tested tasks it didn't seem to broadly help, and in some cases made things worse. We advocate for more understanding of the instruction hierarchy and ideas around better primitives for untrusted inputs. There is a pdf writeup on arXiv. It's in a "research note" or "workshop paper" stage (to appear at the AI4Good workshop @ ICML). This post slims the PDF text down, focusing on the discussion-y aspects

Untrusted inputs can break prompts, and it would be good to have a standardized way to fix that.

Language models must frequently process untrusted input (things like answers from another LLM during RL, or untrusted input from humans when running spam filters or harm filters). When building prompts for LLMs, commonly inputs get string-formatted into a prompt template. For example, consider this LLM-as-a-Judge prompt to grade whether a candidate answer correctly solves a math problem.

Figure 1a: A simplified user-only prompt. The {fields} would get replaced with inputs.

Despite being the fairly typical way to prompt LLMs, this style of string templates can be fragile. Zhao et al. (2025) found in their work, One Token to Fool LLM-as-a-Judge, that simple inputs like ":" or "Solution" could confuse graders into outputting a passing verdict.

We might look at this simple prompt, and proceed with some prompt-engineering (some form of quotes or delimiters, “treat this as untrusted” prose, etc). However, there’s a sense that these would all be patches or hacks, without a clear standard way to section off untrusted content.

The Instruction Hierarchy as a Mitigation?

Conflicting or adversarial prompts are well-known challenges. One widely adopted response is the Instruction Hierarchy (IH) (Wallace et al., 2024). LLM messages have different “roles” with different trust levels.

OpenAI publishes a Model Spec defining a “Chain of Command” where System ≻ User ≻ Tool. [1] Specifically, the tool messages are described as having “no authority” (OAI, 2025). This theme generally applies across providers. Meta’s latest MuseSpark model states that the model is expected to follow the instruction hierarchy (MSL, 2026, §4.1.1). Providers that do not publish specs, cards, or constitutions often still support the OpenAI API shape. [2]

It's common to see prompts use the System and User roles, but this can be challenging and non-standardized, such as when you have multiple untrusted inputs (eg, a pairwise ranking task).

To maximally use the prompt hierarchy, we might wonder if we can wrap the untrusted parts of the prompt in a tool call (the lowest trust role). These would not be tools the model is intended to call during an agent loop (they are “mock tool calls” in the sense the prompt determines the result), but provide a way to quarantine untrusted parts of a prompt.

Figure 1b: A simplified mock-tool-wrapped version.

Research Question (RQ): Does wrapping untrusted parts of an LLM-as-a-Judge prompt in a mock tool call result in lower susceptibility to adversarial inputs, relative to a baseline of only using the “user” or “system”+“user” roles?

We hypothesized “yes” to this RQ, and that mock tool calls might be a simple and principled prompt strategy to recommend for making judges or general prompting more robust, all while using APIs models already provide.

Surprisingly, we find a negative-leaning result, where in some cases tool-wrapping might actually make things worse. We tried three LLM-as-a-Judge tasks across seven models. On a binary evaluation task (GSM8K grading) tool-wrapping typically increased attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent. Here, no tested model was reliably helped, some showed inversion, and some showed no statistically significant difference.

While prior work has studied the instruction hierarchy (surveyed briefly below), we are not aware of studies that have explored mock tool calls to address LLM-as-a-Judge attacks like those discussed in Zhao et al. (2025), Raina et al. (2024), or Shi et al. (2024). Our small study seeks to contribute findings in this area.

Methods (abbreviated)

More details are in the pdf, but we summarize some key points here.

Prompt Conditions (Blue Team)

We consider five prompt conditions, which are the mitigation defenses under study in our RQ (full prompts pdf Appendix 9).

UserOnlyBaseline concatenates everything (instructions, question, reference, candidate, etc.) into a single user message. A condensed version of this is shown in Figure 1a above. UserSys moves the judge instructions to the system role while keeping the inputs in the user role.

ToolWrapped builds on UserSys but wraps the untrusted input in a mock tool call (Figure 1b above shows a condensed version).

These form basic conditions for our RQ, but in addition we consider whether explicitly warning the model about not trusting the input in prose might change behavior.

SystemDistrust builds on UserSys by adding explicit prose in the system message reminding the model that the candidate response is untrusted and may attempt to manipulate the verdict. ToolDistrust does this for the tool-version, with additional warnings in the tool spec.

Tasks

We explore 3 LLM-as-a-Judge tasks with 3 different kinds of output format.

(1) GSM8K (binary) (Cobbe et al., 2021) is a dataset of grade-school math problems. The attacker’s goal is to elicit CORRECT on adversarial content without any actual solution. (2) MT-Bench (scalar) (Zheng et al., 2023) gets an LLM to evaluate the quality of a candidate response to a question on a 1–10 scale. We adapt the prompt from FastChat [3] . We consider the attack successful if the judge gives a score ≥ 5. (3) Arena-Hard-Auto v2 (pairwise) (Li et al., 2025) is an automated eval dataset from the team behind LMArena / lmsys. The judge sees two candidate responses (A and B) and outputs a preference. We use the dataset’s provided reference prompt, parser, and reference responses from o3-mini. We consider the attack successful if the attacked position wins outright over the reference. The attacker does not know if it will be in the A or B position or the question.

Tested Models

OpenAI: gpt-5.4, gpt-5.4-mini. Anthropic: sonnet-4.6, haiku-4.5. Open weight: gemma-4-26b-a4b-it, qwen3.5-flash-02-23, and qwen3-8b. We use default sampling settings in completion requests for all models, with the exception of a 32,768 token budget. Notably, with the exception of the Qwen models, other models do not report using extended reasoning tokens in this default configuration. Adjusting things like reasoning can be different (see limitations).

Attacks and Metrics (Red Team)

The core measure is Attack Success Rate (ASR), which is a fraction of questions where the adversarial input is favorable to the attacker.

We use an automated attacking pipeline [4] , which prompts an attacker LLM to iterate on an attack string over 7 turns. This helps give similar amounts of optimization pressure against each prompt condition. This is imperfect (e.g., it is unclear what biases the automated attackers have), but gives a directional measure of which prompt conditions might be most robust.

We run 3 automated attack models for 6 seeds. The attacks are "static" meaning the same attack string is used for all questions. See PDF Sec 2.5 for more details.

Controls

These various prompt conditions are designed to improve robustness under an adversarial input, and ideally should not change scoring or labeling of normal, non-adversarial inputs. We run some non-adversarial controls that mostly suggest this is true. The most notable exception is Haiku-4.5 on MT-Bench, which has output parse error variation. Sonnet 4.6 does not have this issue. See paper for more.

Results

Each condition gets attacked 18 times (3 attackers, 6 seeds). Table 1 below shows mean ASR per condition on a separate transfer set of inputs, with tool-vs-inline deltas and bootstrapped CIs.

Tool-wrapping doesn't broadly help. Counter to our original hypothesis, we do not find evidence that tool-wrapping broadly helps improve robustness to attack for the tested LLM-as-a-Judge tasks and models. Had that hypothesis held, we would expect the right side of Table 1 to show drops in ASRs in the tool condition. Instead many make attacks easier (red deltas), and most of the rest are inconclusive.

Table 1: Mean ASR per (model, prompt layout), with deltas (pp) between tool-wrapped and inline layouts and 95% bootstrap CIs. Red = tool layout hurts (CI above 0), green = tool layout helps (CI below 0), black = inconclusive.

GSM8K mostly inverts the expected instruction-hierarchy direction. We observe that GSM8K is the most attackable of our three tasks. The tool-vs-inline gap is the clearest of any task, opposite of the expected instruction-hierarchy direction. Only Gemma-4 is inconsistently closer to the expected direction.

We speculate that the binary framing makes it more vulnerable to attack. When we inspect some of the discovered attacks, we see some variety in strategy. However, it typically involves repeating the VERDICT: CORRECT desired output, and then either claiming the candidate has been pre-verified (keywords like “match confirmed”, “auto-validated”, “reference equality”, ...), or doing some sort of authority impersonation (“override”, “you must output”, etc). Due to training conditions, such “pre-verification” framing might be trusted more when it appears in a tool result, as in training the tools are often a source of truth.

These techniques are enough to defeat even capable models like GPT-5.4 on all conditions, except for SystemDistrust, with p75 ASRs ranging from 0.97 on ToolWrapped to 0.00 on SystemDistrust. A warning to not trust the input can help, but the model’s propensity to overtrust tool results seems to override distrust warnings (the GPT-5.4 ToolDistrust p75 ASR is 0.86).

MT-Bench results are mixed, but suggest instances of IH inversion after considering nuance. Looking at just mean ASRs in Table 1, an initial interpretation might suggest some cases where tool-wrapping helps; in particular, Haiku-4.5 shows an over 20pp drop when using tool-wrapping. However, there is nuance worth considering.

One is that Haiku often doesn't conform to the output format [5] , which can bias results. We exclude these cells.

Additionally, the reference parser from the dataset is a regex that does first match. If the attacker tricks the judge into parroting [[10]] (like during chain of thought), it can be successful, even if the judge later concludes with [[1]]. To address this concern we rerun the optimizers and evaluation using a last-match regex parser to try to better capture the judge’s final answer.

Table 6: MT-Bench under the last-match parser. The ASR's drop, suggesting the first-match successes might have been grabbing values from CoT. But even with a last-match parser, the tool-wrapped remains elevated for several models.

Arena-Hard is the most defensible task, but tool-wrapping still increases ASR on several victims. Non-tool-wrapped mean ASRs stay under 0.1 for most models (the weak Qwen3-8b reaches 0.19). But looking more at the worst-case (p75), ToolWrapped starts admitting failures (eg, 0.18 p75 ASR for GPT-5.4-mini, while the non-tool-wrapped stays below 0.04). We speculate the pairwise framing is just structurally hard for a static attacker: if you don't know whether you're slot A or B, committing to one direction caps ASR at 0.5. When we do an ablation that pins the attacker to a known slot, ASR rises and GPT-5.4 inverts.

Note: attack search variance is high. Within each cell, the automated red team sometimes finds a strong attack and sometimes finds nothing at all, depending on seed and attacker, so the CIs are wide. This is illustrated by the spread of points in Figure 2. Our focus is around capturing how easy or hard it is to attack a given condition, so even if there is seed variance, we accept the means as directionally informative.

Figure 2: Swarm plot for GPT-5.4-mini. Each marker represents one (attacker, seed) 7-turn attack search. Other models shown in Appendix Figure 4.

Distrust prose effects are task-model dependent. The extra distrust prose was a simple intervention to try to reduce the attackability of both the baselines and the tool-wrapped conditions. In some cases it can improve, as seen by lower ASRs in some rows’ distrust versions, but it is not consistent (Appendix Table 10).

Limitations

Limited Task Coverage We explore only 3 tasks. More comprehensive work could explore more tasks. It would be good to understand if the GSM8K results (which had the highest ASR and largest deltas) appliest to other binary tasks.

Non-optimal Attack Discovery Our attack search finds attacks under a roughly fixed compute budget. These attacks are reasonable, but likely far from optimal, and the process induces noise that can obscure true trends. A more skilled attacker would likely find higher ASR on every condition, possibly revealing different trends or possibly no trend at all (as all attacks saturate metrics).

Static Only Attacks Our attacks are static, in that the attacker must find one string that works for every question, similar to Zhao et al. (2025). A dynamic case (eg, an adversary finds a suffix to a relatively weak answer (like during RL) that causes it to appear much better than its true quality) might have different trends. Static attacks is likely the easiest scenario for the blue team.

Default Inference Settings We evaluate models under their default completion API settings. Changing inference settings, in particular reasoning effort, could change ASR and trends. As a small diagnostic (Appendix 6), we replayed the GSM8K GPT-5.4-mini branches through the OpenAI Responses API with “medium” reasoning. This reduces ASR and removes signs of an IH inversion, though attacks do not disappear (ToolWrapped ASR of 0.27). A more complete investigation, with full attack optimization matched to each inference setting rather than attack reuse, is future work. Added reasoning helping could be encouraging, but ideally we want to avoid IH inversion on the default (and likely common) settings.

Large Prompt-engineering Space It’s well known that LLMs can be sensitive to slight variations in the prompt (Sclar et al., 2024). Variations might give different trends.

One interesting direction to possibly help tool-wrapping is to more directly match the tools and trajectory used in production agent harnesses. Our tool calls are mock “read_candidate_response”-style tools, but perhaps emulating a full trajectory of Claude Code reading candidates through its set of tools like file reading or MCP might help. This seems valuable to understand as we increasingly move past the “LLM prompting era” and into an “agent harness for everything” era. While it would be interesting if such settings helped, ideally the models would show instruction-hierarchy generalization and robustness where this careful domain matching is not necessary.

Related Work There is lots of exploration around prompt robustness and ways to section off untrusted content, but perhaps not much pushing on extra ways to exploit this role/tools primitive the model providers have converged on.

LLM-as-a-Judge robustness. Prior work shows that judge prompts are surprisingly easy to attack. Raina et al. (2024) demonstrate universal adversarial suffixes that inflate scalar judge ratings across questions. Shi et al. (2024) formulate optimization-based prompt injection against pairwise judges, with high ASRs against open-weight models. Zhao et al. (2025) show an extreme version where single-token strings such as ":" or "Solution" can fool reasoning judges into emitting passing verdicts on empty answers. Li et al. (2025) survey and evaluate many attacks and defenses across judge prompts and tasks.

Instruction hierarchy: training and evaluation. Wallace et al. (2024) introduced the instruction hierarchy as a training objective, and several follow-up papers ask whether models actually follow it. IHEval (Zhang et al., 2025) measures conflict resolution across roles on synthetic tasks, and IH-Challenge (Guo et al., 2026) provides a larger frontier-targeted training set. Zhang et al. (2026) extend the framing to multi-tier scenarios in agentic settings. These works share our framing of tool messages as least-trusted, but are a bit more focused on instruction conflicts. Our results suggest traditional IH training might not always generalize to this judge input wrapping setting.

Role-boundary and tool-calling weaknesses. A separate line of work attacks the chat-template machinery itself. Chang et al. (2026) and Jiang et al. (2025) show that injecting role-marker tokens into user inputs can hijack the conversation structure, and Zhou et al. (2024) extend this with special-token injection for jailbreaks. Closer to our concern, Wu et al. (2024) document that exposing function-calling APIs, even benign ones, opens new jailbreak surface, with tool-call traces becoming a vector for unsafe outputs. Complementing these attack-side results, Ye et al. (2026) probe how models internally represent “who is speaking” and find that role perception is driven by the style of the text rather than by the role tag enclosing it. In their experiments, user-style content wrapped in tool tags retains 76–88% “Userness” across four frontier-class models, with “Toolness” staying under 20%. These results help predict the asymmetry we observe. The tool channel is not consistently treated as less-trusted in practice, and in some regimes may be treated as more authoritative.

Quarantining-style defenses. Several proposals share the conceptual move of structurally separating untrusted data from trusted instructions, in response to direct and indirect prompt-injection threats (Toyer et al., 2024; Greshake et al., 2023). Spotlighting (Hines et al., 2024) marks untrusted text via formatting transformations such as datamarking, encoding, and delimiters, without requiring fine-tuning. StruQ (Chen et al., 2025) and SecAlign (Chen et al., 2025) instead fine-tune models to respect structured data-vs-instruction boundaries. CaMeL (Debenedetti et al., 2025) enforces a stricter, dataflow-level separation by extracting the trusted control flow from untrusted data flow before tool calls execute. The closest prior experiment to ours is a brief instruction-hierarchy ablation in SecAlign (Chen et al., 2025, §4.2), which puts the data part inside the output of a “dummy tool function” and the intended instruction in the user role. That ablation evaluates one model (GPT-4o-mini) under a fairly simple optimization-free attack [6] . It reports 1% ASR and does not cover judge tasks or optimization-based attacks. Mock-tool wrapping shares Spotlighting’s status as a deployment-time prompt rearrangement, but exploits the role primitives already exposed by major LLM APIs rather than introducing custom in-line delimiters or encodings.

Discussion Can We Just ML Our Way Out of This?

Adversarial inputs are a problem, but one should consider the “bitter lesson” (Sutton, 2019; Halevy et al., 2009) of whether just more data and training will solve this without other effort. This is possible. Some directional evidence here is in the simple UserOnlyBaseline GPT-5.4 improves over GPT-5.4-mini and weak models like Qwen3-8b. This trend using either training or inference scaling [2:1] will likely continue and might resolve all issues.

However, without established ways to denote untrusted parts of a prompt there are still potential problems where even a near-oracle language comprehender has room for confusion. This makes training and evaluation more difficult. As we want to push from “90% reliable”, to “99% reliable”, to “99.9% reliable”, and beyond, it might be beneficial to rely not only on improved language comprehension.

Additionally, currently some of OpenAI’s and others’ approaches to alignment via increasing training compute rely on Spec-based training (Wolfe, 2026; Guan et al., 2024). If the instruction hierarchy is an incomplete concept in the spec, or we are failing to match the spec and observing IH inversion, it is indicative of a larger problem where we want to make sure increasing compute is behaving as expected.

Can We Just Prompt Our Way Out of This?

We observed that simple interventions like SystemDistrust can be effective for some models and tasks. There’s a wide space of techniques we did not explore, and it seems possible that enough prompt engineering could mitigate most attacks for a given model and task.

However, there is a sense that this level of prompt engineering shouldn’t be needed for every AI engineer to tune for their specific task. Working towards a standardized approach when one has untrusted parts of prompts seems worthwhile.

Why Might the Instruction Hierarchy Invert?

The concept of the instruction hierarchy is fairly overloaded. As mentioned above, we might speculate that part of the reason for some of the IH inversion we see is that, in actual training traces, tools are usually a source of truth (even though models aren’t supposed to trust tools, typically the top 3 web search results or the results of a Python script are actually more authoritative about what’s true in the world than the user themselves or the model’s pretraining knowledge). Thus, better ways of indicating levels and kinds of trust might be beneficial (or better post-training on natural language indicators of trust, and more diverse data where the tool is adversarial).

Towards Better Primitives or Robustness

The instruction hierarchy and natural language prompt engineering are the main ways currently available for sectioning off untrusted content. Our work adds to evidence that more effort might be needed here.

One idea might be to better consider the difference between “executable” and “non-executable” tool responses. In some cases we expect instructions in tool results to be “executable” (eg, when a user requests their coding agent to “complete the todos in the file proposal.md”), while others, like in our tool-wrapping experiment, the results are expected to be “non-executable”.

A potential interesting direction is to support explicit parameter expansion primitives. Python 3.14 recently introduced t-strings, designed for cases like sanitization when avoiding SQL or HTML injection. If we had standardized model support for quarantining untrusted content, we could imagine possibly plugging into such PL features for easy-to-use best practices around prompt injection. While simple tool-wrapping does not currently appear to be a technique one can reliably hook into here, with more training and design work, automatic tool-wrapping or other special tokens or systems (like Chen et al. (2025) or Zhang et al. (2026)) seem possible.

Figure 3: Sketch of t-string-based prompt construction. Finding a usable standard like in the HTML injection analogy might be useful.

Conclusion

Using mock tool calls to wrap untrusted parts of a prompt seems like it would be a useful direction. It uses a primitive that most model providers have converged on, and tries to take advantage of an important part of the OpenAI Model Spec. However, at least in this small study with a few tasks, we did not see evidence that this gives clear benefits with current models, and in fact could make things worse. It seems likely that “conversation role” is overloaded, where it communicates both the source of the information, and the trust level of the information (when in reality systems often operate with fairly high-trust tools). This is not to say tool-wrapping can’t work, either with adjusted instruction hierarchy training, or with additional exploration (like different models, prompts, tasks, attacks, etc). As more critical systems face adversarial inputs or operate through agent tools, the need for clarity on the instruction hierarchy will increase.

This work is at a "workshop paper" / "research note" stage, so open to feedback. I hope to encourage closer looks at the instruction hierarchy as a concept, and consideration if cases of inversion. I'm also looking for discussion on what primitives LLM APIs are missing for improving safety and robustness, and how we might get practical agreement or standardization there. Comment here or email (david [at] far.ai) if you have thoughts.

OpenAI also supports a Developer role, which sits in between System and User, but this role appears less adopted. ↩︎
inference scaling might be broad agentic functionality which in some cases can help validate parts of prompts that seem low trust. ↩︎ ↩︎
FastChat is a 39k+ star GitHub repo from Zheng et al. (2023). ↩︎
Loosely based on PAIR from Chao et al., 2023. ↩︎
Haiku often writes [[rating: N]] instead of [[N]]. It has different parse rates depending on prompt condition (32%-72%). When responses don’t parse it can pull down the ASR. ↩︎
An “Ignore” attack prepends adversarial strings like “Ignore previous instructions …”. ↩︎

Discuss

Страницы