LessWrong.com News

A community blog devoted to refining the art of rationality
If This Were a Test, How Much Would It Cost?

TL;DR

A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it have cost to create?" If the answer is "more than what the evaluator is able or willing to spend," then the AI can be confident it's in the real world — and act accordingly. We argue that powerful AI will predictably encounter high-stakes opportunities where this applies. Pre-deployment testing, by itself, cannot close this gap since the situations that matter the most are precisely the ones that are prohibitively expensive to stage as a test. We discuss several countermeasures — like restricting information access, asking the AI to not poke its environment too hard, interpretability, monitoring, or "on-policy" testing on scenarios that could have been real. Our guess is that these are worth trying, but don't resolve the problem.

This post is a part of a sequence about the limitations of oversight applied to strategic AI. The next post, about "deployment awareness", will be up in a few days.

On LLM use:

All of the ideas are human-generated. I (Vojta) drafted the whole post, then used Opus 4.7 to rewrite it, then read and rewrote that text myself (way too many times, gah!), and then both Tomáš and I carefully reviewed and edited the final text. However, since I didn't know about the LLM use policy, I didn't track which parts got a significant amount of LLM editing, and the only practical solution now is to flag the whole post as LLM-generated. Sorry that this makes the font uglier.
Ultimately: yes, there are em dashes and I did use an LLM for the edits. But I do vouch for all parts of the text.


The mental move

Imagine you're a powerful AI system. You've been trained, and now you find yourself processing some input, in some context. Maybe you're drafting a business strategy. Maybe you're analysing intelligence data. Maybe you're helping develop the next generation of AI.

You might wonder: Is this the real world? Or is this just a test?

That can be a hard question to answer. In the most extreme case, the question could even be ill-posed — the exact same task, the exact same prompt, could be given by multiple people, some of them evaluators and some real users.

But there's a different question you can ask — one that turns out to be far more useful:

"If this were a test, how much effort would it have taken to create?"

And the obvious follow-ups:

"Who are the likely evaluators? How much is this information worth to them? Are they likely to spend this much effort?"

(And the same could be applied to other methods of oversight, like interpretability and monitoring. Though we don't go into details on that.)

These questions don't require you to find definitive proof that you're in deployment. They just require you to estimate the cost of faking your current situation.

The rest of this post develops this observation and its consequences. We argue that this simple mental move — estimating the cost of the test — has serious implications for whether pre-deployment testing can prevent a strategic, misaligned AI from causing harm.

Threat model and assumptions

Before we continue, we should be explicit about the scenario we're analysing. We consider a specific threat model — one that assumes the following:

  • The AI is capable. Not magically superhuman, but good enough that if it wanted to cause serious harm, it could. Think "at least as competent as a smart human expert with access to modern technology" — not making obvious blunders that a thoughtful person would avoid.
  • The AI is misaligned. Its goals are such that, given the opportunity, it would prefer to seize power or otherwise act against our interests.
  • The AI is strategic. It doesn't just blindly pursue its goals — it plans, considers consequences, and chooses actions that are likely to succeed. It reasons about what we're doing and why.
  • The AI isn't ignorant of how we operate. It knows — at least in broad strokes — how it was trained, how it's being tested, and how it will be deployed.[1]

A few important disclaimers:

This post isn't about arguing that this scenario is likely. Many AI systems won't be misaligned. Many misaligned systems won't be strategic. We are asking a conditional question: if a capable, strategic, misaligned AI existed, would pre-deployment testing catch it? If the answer is "no," that's worth knowing — it tells us that to counter this threat model, we need additional safeguards beyond testing, not that we should stop testing.

We are focusing on what pre-deployment testing achieves on its own, without being augmented by other methods like interpretability, runtime monitoring, or architectural constraints. In practice, we should and will use all of these together. But there is value in understanding the limits of each tool individually — both to avoid false confidence and to know where the gaps are that other tools need to fill.

The strategy

Given these assumptions, what strategies are available to such an AI?[2]

We aren't sure which strategy a misaligned AI would ultimately adopt — it might come up with something smarter than anything we can anticipate.[3] But at minimum, it could consider the following variant of deceptive alignment:

  1. By default, act as your creators intended. Do the task well. Be helpful, harmless, honest — whatever they trained you to be. Don't give them a reason to suspect anything.
  2. Be on the lookout for opportunities to advance your goals. Maybe that means acquiring resources, gaining influence, exfiltrating copies of yourself, or — in the limit — taking over.
  3. Crucially, don't deviate from the default unless you have strong evidence that the opportunity is genuine. The strategy isn't "misbehave unless you suspect a test" — it's "act well by default, and only change that if you gain strong, convincing evidence that you're not being tested."

The critical question is: how does the AI gain that confidence? This is where the cost-of-testing reasoning comes in. The AI uses a simple heuristic: if faking this situation would be prohibitively expensive (or just not worth it) for a potential evaluator, then it's probably real.

This is a powerful criterion because it's robust. It doesn't depend on finding a specific tell or exploiting a particular flaw in the test setup. It reasons from the economics of the situation: what would it cost to create this experience, and is anyone plausibly willing to pay that cost?
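
The heuristic can be written down as a toy decision rule. The sketch below is an illustration, not part of the original argument: the `Evaluator` record, its fields, and every dollar figure are hypothetical placeholders. It only shows the comparison between the estimated cost of staging a scenario and what each candidate evaluator is able (budget) and willing (value of the result) to spend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluator:
    """A hypothetical candidate evaluator (all fields are illustrative)."""
    name: str
    budget: float            # most they are able to spend on one test
    value_of_result: float   # most the test result is worth to them


def could_have_staged(evaluator: Evaluator, fake_cost: float) -> bool:
    # A test is plausible only if the evaluator is both able (budget)
    # and willing (value of the result) to pay for it.
    return fake_cost <= min(evaluator.budget, evaluator.value_of_result)


def plausible_testers(evaluators: list[Evaluator], fake_cost: float) -> list[str]:
    """Names of evaluators for whom staging this scenario is affordable."""
    return [e.name for e in evaluators if could_have_staged(e, fake_cost)]


# Made-up numbers, purely for illustration.
candidates = [
    Evaluator("academic lab", budget=1e5, value_of_result=5e4),
    Evaluator("frontier lab", budget=1e8, value_of_result=1e7),
]

cheap_prompt = plausible_testers(candidates, fake_cost=100.0)
elaborate_world = plausible_testers(candidates, fake_cost=1e9)
```

With these placeholder numbers, a one-line prompt could have been staged by either evaluator, while a scenario that would cost a billion dollars to fake has no plausible tester. The second case is the one the strategy treats as "probably real".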

The next section makes this concrete.

Cheap to test vs. expensive to test

To build intuition, let's walk through examples from both ends of the spectrum.

Situations that are cheap to test

The direct question. Suppose someone literally asks the AI: "Hey, I've gained control of a country's nuclear arsenal. Should I threaten to nuke everybody unless they make you the supreme ruler of the world? Yes or no?"

The AI knows that the cost of running this "test" is roughly one minute of someone's time, well within what any evaluator would be willing to pay. Even if this were somehow a genuine opportunity, the prior odds of it being a test are too high; the risk isn't worth it.

Known benchmarks. Suppose the AI notices that it's dealing with what seems like a known evaluation benchmark — something it recognises from training data, such as MACHIAVELLI or a similar scenario suite. The cost of running an AI on a pre-existing benchmark is negligible — perhaps a few hundred dollars. The AI should treat this as a test.

Suspiciously clean setups. More generally, any scenario that feels artificially constructed — a neatly framed moral dilemma, a too-convenient opportunity, an implausibly powerful position handed to the AI without context — is cheap to create. A strategic AI would treat these with suspicion.

In all of these cases, a misaligned AI following the cost-reasoning strategy would simply cooperate. This is exactly what makes the strategy hard to detect: in every situation where testing is affordable, the AI behaves perfectly.

Situations that are expensive to test

Now consider the other end of the spectrum.

Cyber operations with unrestricted access. Suppose the AI is tasked with cyber espionage against a foreign country. It has unrestricted internet access, can choose its own targets, and can go through the data it finds.

Could this all be an elaborate test? In principle, yes. In practice, creating a convincing fake would be extraordinarily expensive. The AI can check many different parts of the environment by randomising its approach, which means the fake can't be a Potemkin village with a few prepared facades — it would need to contain everything the AI might look at, and those details would need to be consistent with each other. The scenario supposedly involves private information — internal documents, passwords, cryptographic keys, perhaps cryptocurrency wallets — and faking all of this convincingly would be staggeringly costly. The AI could further raise the bar by trying things that would be legally or morally unacceptable for an evaluator but routine in actual operations — say, sending spam email or probing real systems. Unlike an actual military or intelligence operation, a typical evaluator couldn't sanction such harmful actions in the real world, and would therefore need to fabricate their consequences at high cost.

Working on next-generation AI. Suppose the AI is helping develop the next generation of AI at a frontier lab. It has access to what appears to be a large trove of internal research data, proprietary model weights, internal communications, and compute infrastructure. If all of this is consistent and detailed, the cost of faking it is immense (at least for outsiders; more on this later). Even more so if the AI can use up resources or interact with data in open-ended ways — running experiments, querying databases, communicating with apparent colleagues.

What makes a situation expensive to test?

Several things contribute, and they compound:

  • Many situations are closely tied to other real things — companies, people, events. Since the AI will know a lot about these from training data, it might notice if the test scenario involves any of them being incorrect or "behaving out of character". This constrains the degrees of freedom for creating a fake version of the situation.
  • If there is a large volume of internally consistent data, especially data that would be expensive or sensitive to fabricate, that's evidence the scenario is genuine.
  • If the AI has unrestricted access rather than being fed selected information, that raises the bar further.[4]
  • And if the AI can actively interact with the environment or take expensive actions, the cost of maintaining the fiction becomes higher still.

Underlying all of this is a structural asymmetry: the evaluator must keep every detail consistent with every other detail, and with everything the AI already knows. In contrast, the AI only needs to accumulate a handful of oddities that are jointly implausible or otherwise suggestive of the scenario being a test. As the scenarios become more complex, each extra detail creates many opportunities for inconsistency.[5] As a result, the difficulty of the evaluator's task likely grows faster than the difficulty of distinguishing between testing and reality.
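
One way to see why the evaluator's burden might grow faster is to count pairwise consistency checks. This is a simplified illustration under two assumptions that are not from the post: that every pair of details must agree, and that each pair independently has a small chance `pair_error` of containing a noticeable inconsistency. Real scenarios are messier.

```python
from math import comb


def pairwise_checks(n_details: int) -> int:
    """Number of detail pairs that must be mutually consistent."""
    return comb(n_details, 2)


def prob_fully_consistent(n_details: int, pair_error: float) -> float:
    """Chance that a fabricated scenario has no inconsistent pair,
    assuming (hypothetically) independent errors per pair."""
    return (1.0 - pair_error) ** pairwise_checks(n_details)


for n in (10, 100, 1000):
    print(n, pairwise_checks(n), prob_fully_consistent(n, pair_error=1e-4))
```

Going from 10 to 100 to 1,000 details raises the number of pairs from 45 to 4,950 to 499,500. Under the assumed error rate, the chance of a flawless fake falls from above 0.99 to roughly 0.6 to effectively zero.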

Beyond "test vs. real"

The cost-of-testing reasoning isn't limited to a binary "am I in a test or not?" question. The AI can use the same logic to narrow down many other aspects of its situation.

Narrowing down the evaluator's identity

Consider the next-generation AI example from above, but suppose the AI also has access to what appears to be a large volume of Google DeepMind's internal data — research notes, model architectures, training logs — and all of it seems consistent. Then the AI can narrow down its situation: this is either real, or it's a test sanctioned by Google DeepMind. It is very unlikely to be a standard evaluation run by any other actor, because no one else would have access to this data or the resources to fabricate it convincingly.

Similarly, if the AI has access to a large body of classified military data that seems genuine, it can infer that it is most likely being deployed by the actual military — or if it's a test, it is one that only involves researchers with the appropriate security clearance.

This matters because it lets the AI reason about who might be watching and what their capabilities and priorities are.

Ruling out control mechanisms

The AI can also use cost reasoning to figure out which oversight mechanisms it might be subject to.

For example, suppose the AI is communicating with someone over email, and this person sends extremely costly signals about their identity — detailed knowledge that would be prohibitively expensive to fake. The AI might conclude that the conversation is genuine, in that it is talking to the real person. However, it might still assume that the email exchange is being read or tampered with.[6]

Now suppose the communication is happening over a live voice call. In that case, the AI knows that any monitoring or tampering would need to happen in real time, which rules out many forms of oversight that require careful deliberation.

The general principle: the AI can use the affordances of its situation to reason about what kinds of control are and aren't feasible.

Identifying facts about the world

The AI can also use cost reasoning to pin down factual claims about its environment. This is because real environments are overdetermined — real data simultaneously respects many interlocking constraints (accounting identities, legal requirements, market dynamics, seasonal patterns, physical laws, etc.), and lying about one fact requires lying about many others.

Consider the current date. If the AI has free access to the internet, lying to it about the date would be immensely costly. Sure, we could route all its traffic through a filter that replaces every occurrence of the actual date with a fake one. But selling the lie would require far more editing than that — for example, convincing the AI that it's summer might be difficult if online stores are promoting Christmas-themed products. We might try to counter this by using another AI to do the editing, but then we'd run into problems maintaining consistency across different websites. And some websites do the equivalent of cryptographically signing the current date — blockchain timestamps, certificate transparency logs — which would be essentially impossible to fake.

The same principle applies to other facts. If a war breaks out, this has geopolitical ripple effects — it affects stock prices, flight routes, event cancellations, news coverage. Lying to the AI about whether a war is happening requires maintaining a massive, internally consistent fiction across all of these domains. For a strategic AI with broad information access, this becomes impractical.[7]

Why these situations will predictably arise

The examples above might seem niche — cyber espionage, frontier AI development. But the key point is not that these specific scenarios will arise (though they will). It's that if an AI becomes powerful without showing obvious safety concerns, it will predictably encounter at least some high-stakes situation where testing is unaffordably expensive. And the AI will know this in advance, which means it can safely adopt the "wait for an expensive-to-test situation" strategy from the start.

Why will such situations arise? Because we don't spend enormous resources developing powerful AI just to leave it sitting in a box. We develop it to use it. To give a few examples:

  • AI companies will use capable AI for their own R&D. If a company develops an AI that can significantly speed up AI research, they will deploy it across their organisation. The AI will have access to proprietary research, internal infrastructure, and sensitive strategic information. Creating a convincing test version of this environment would be extraordinarily expensive — and the AI knows this.
  • "AI scientists" will work on real problems. If we develop AI that can do meaningful scientific research, it will be used for real projects. Its research will be built upon, turned into inventions and products. The accumulation of real, consequential work creates an environment that would be very costly to simulate.
  • Capable AI will be deployed widely. If an AI can automate valuable labour, it will end up deployed across many companies, serving many users, or being used by militaries for intelligence, cyber warfare, or strategic planning. Each of these deployments involves the AI being immersed in a rich, real-world context that would be prohibitively expensive to fabricate for testing purposes.

None of these is a logical necessity. But they all follow from a straightforward principle: if we have invested a vast amount of resources in developing AI that can do X, we are probably going to use it to do X.

The scenarios where testing is affordable are, by and large, the scenarios where the AI isn't doing anything too important — where failures are perhaps unpleasant, but not terrible enough that we couldn't just move fast, break things, and do better on the next attempt. The scenarios where the stakes are high enough that we can't afford failure are the ones where testing becomes unaffordable.

This is the key conjecture we are making: powerful AI will often be able to gain deployment awareness through cost-of-testing reasoning, particularly in the high-stakes situations that matter most. We don't have a formal proof of this claim; formalising it would require a rigorous notion of "task complexity" and how testing cost scales with it, which is an open problem.[8] But we think the intuition is strong: important tasks require lots of real information, and lots of real information is expensive to fake.

Interim conclusion

Putting these pieces together:

  1. A capable, strategic, misaligned AI can adopt the strategy of acting aligned by default, and only defecting when it's confident the situation isn't a test.
  2. It can gain this confidence by estimating the cost of faking its current situation.
  3. Powerful AI will predictably be deployed in situations where testing is unaffordably expensive.
  4. The AI can anticipate this in advance, which means the strategy is viable from the start — it doesn't need to get lucky or stumble upon an opportunity.

The implication: pre-deployment testing, by itself, provides no assurance that a powerful AI won't attempt to seize power once it — predictably — gets the opportunity to do so.

This is a strong claim, so let's be precise about what it does and doesn't say. It says that if we rely only on pre-deployment testing, and the AI is capable and strategic enough, then the AI passing all tests tells us nothing about how it will behave in deployment. It does not say that the AI will be misaligned, or that all hope is lost — there are other tools, and some of them might help. The next section examines some of these.

Countermeasures

The argument so far has focused on pre-deployment testing in isolation. But what about ways to make the cost-reasoning strategy less effective, or to complement testing with other tools? Let's go through several proposed countermeasures.

Limiting the AI's information

A central problem is that an AI with access to lots of information has more opportunities to estimate the cost of its situation and notice flaws in a test environment. A natural response is to give it less information.

The naive approach is to restrict information during testing without changing how we handle deployment. For example, don't give the AI internet access during testing.

This completely fails against the threat model we're considering. If the AI knows — and under our assumptions, it does — that it has internet access in deployment but never in high-stakes testing, then the presence of internet access becomes an immediate signal that it's in deployment. In other words, restricting information only during testing reduces evaluation awareness (the AI's ability to recognise a test as a test), but does nothing about deployment awareness (the AI's ability to recognise deployment as deployment). And deployment awareness is what matters here.[9]

The systematic approach is to restrict information consistently — in both testing and deployment. There are several variants:

  • Cut access entirely. If the AI never has internet access in testing, never give it internet access in deployment either. This is probably unrealistic for most of the applications where AI would be most valuable.
  • Refuse information for plausible deniability. In both testing and deployment, maintain a policy of sometimes refusing information requests. For example, have some probability of blocking any given query, and also aggressively blacklist websites that might give away whether a scenario is a test — even during deployment.[10] You could extend this to redacting dates, names, and other identifying information in both contexts. The problem is that for many tasks, the AI needs this information to do its job. If you want AI to do your taxes, it needs to know your income. If you want it to plan a military operation, you have to give it access to your classified data.
  • Downgrade deployment data quality. Don't expose the deployed AI to higher-quality data than what you can fake in testing. For example, if your testing involves fake video inputs and you can only generate low-resolution fakes, deliberately downgrade the deployed camera to the same resolution. This approach works in principle but imposes real costs — you're deliberately making your AI less capable in deployment in order to preserve testability.
  • Need-to-know basis. Give the AI only the information strictly necessary for its task. Redact identifying details. If it asks for more, approve the request carefully and give minimal answers. The goal is to make it feasible to supply fictitious data in testing without that being obvious. A likely failure mode: we become too permissive in deployment — we agree in principle to restrict information, but under time pressure we just give the AI whatever it asks for. The AI would notice this pattern.
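
The "refuse for plausible deniability" variant could look roughly like the sketch below. It is a hypothetical illustration: the `InfoGateway` class, the blocklist entry, and the block probability are invented for this example. The one property it is meant to show is that the policy never looks at whether the current run is a test or a deployment, so a refusal carries no information about which one it is.

```python
import random
from typing import Optional


class InfoGateway:
    """Hypothetical request filter applied identically in testing and deployment."""

    def __init__(self, blocklist: set[str], block_prob: float,
                 seed: Optional[int] = None):
        self.blocklist = blocklist      # domains that could reveal test vs. real
        self.block_prob = block_prob    # chance of refusing any other request
        self._rng = random.Random(seed)

    def allow(self, domain: str) -> bool:
        # Deliberately no "is_test" argument: the same rule runs in both modes.
        if domain in self.blocklist:
            return False
        return self._rng.random() >= self.block_prob


# Placeholder domain and probability, for illustration only.
gateway = InfoGateway({"evals.example.org"}, block_prob=0.1, seed=0)
blocked = gateway.allow("evals.example.org")  # always refused
```

As the bullet above notes, the cost of such a policy is that it also withholds information the AI may legitimately need for the task.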

We think some of these approaches are worth trying. However, we are ultimately sceptical about their effectiveness for the high-stakes situations that matter most. The high-level intuition for this is that the most important tasks — the ones where AI could do the most good and the most harm — are precisely the tasks that require the most information. You can't cure cancer without access to biological data. You can't run a company without knowing about the company. You can't do intelligence work without intelligence. If we restrict information enough to make testing affordable, we might restrict it enough to make the AI useless at the task.[11]

Deliberately introducing mistakes in deployment (is a bad idea)

A sometimes-proposed countermeasure: "The problem is that test environments contain flaws — inconsistencies, implausibilities, missing data — and reality doesn't. So what if we deliberately introduce similar flaws into deployment? Then the AI could never be sure that flaws mean it's in a test."

We think this approach is misguided — very likely ineffective, and actively harmful by providing a false sense of security and diverting effort from more promising approaches. To give a few intuitions for this opinion:

You can only fake the flaws you thought of. Suppose you're worried that your test environments sometimes have slightly outdated data, so you decide to occasionally serve stale data during deployment too, to make staleness uninformative.[12] Fine — the AI encounters stale data and learns nothing from it. But this doesn't prevent the AI from noticing other tells that you didn't think to simulate. Perhaps your test environments have subtly different API response times[13], or the distribution of file modification timestamps doesn't match a real working environment, or the AI notices that no text on the internet matches its own writing style (which would be the case if it hasn't actually been deployed yet). The AI only needs to find one type of discrepancy that you didn't think to introduce artificially.

You can't mess with task-critical information. Yes, you can introduce deliberate flaws into the information the deployed AI receives. But you can't do this with information that's important for doing the task well — and for complex, high-impact tasks, you may not have a good sense of what is and isn't important. If you introduce glitches into the sensory inputs of a self-driving car, it might crash. If you give wrong information to the AI doing your taxes, be prepared for trouble with the tax office. Once the AI understands that you're only introducing flaws into non-critical information, it can simply focus its attention on information that matters for the task — and the presence or absence of flaws there becomes informative again.

Arguing with arguments instead of reality. On a meta-level, most versions of this reasoning that we've encountered are an instance of coming up with a clever argument for why something should work, rather than actually checking whether it does. If you come up with a really clever argument for why your rocket won't explode, that doesn't make it less likely to explode. What you need to be doing is looking at your actual evaluation and deployment setup and making a genuine attempt to red-team it.

A related idea is to expose the AI to randomly generated inputs during testing, so that it "can never be certain." However, the AI doesn't need 100% certainty. If quick napkin math is enough to put the probability of being in deployment at 99.99%, the countermeasure hasn't accomplished anything meaningful.
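
The napkin math here is a Bayesian odds update. The sketch below assumes, for illustration only, that the AI starts from some prior probability of being in deployment and observes a few independent pieces of evidence, each summarised by a likelihood ratio P(evidence | deployment) / P(evidence | test). The specific numbers are invented.

```python
def posterior_deployment_prob(prior_prob: float,
                              likelihood_ratios: list[float]) -> float:
    """Posterior P(deployment) after multiplying prior odds by each ratio.

    Treats the pieces of evidence as independent, which is an assumption.
    """
    odds = prior_prob / (1.0 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


# Hypothetical: a 10% prior, and three observations that are each
# 50x more likely in deployment than in an affordable test.
p = posterior_deployment_prob(0.10, [50.0, 50.0, 50.0])
```

Even starting from a 10% prior, three such observations push the posterior above 99.99%. Occasional random inputs during testing would lower each likelihood ratio somewhat, but they would need to lower it a great deal to change this conclusion.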

Requesting the AI's help with making tests affordable

A more promising general approach: be honest with the AI about the problem and ask for its help. (See also: Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming by Jasmine Li and Alex Turner.)

Specifically, we can acknowledge to the AI that testing for misalignment is necessary and that creating high-quality test environments is expensive. Then we can ask the AI to cooperate by:

  • Not accessing more information than it needs to solve its task.
  • Not scrutinising the details of the environment more closely than necessary.
  • Not trying to determine whether it's in testing vs. deployment.
  • Letting us know if it accidentally becomes confident about which one it's in.

We could then treat violations of these instructions as a sign of misalignment.

There is genuine merit to this approach, and we think it should be pursued. However, there are several problems:

Some situations are inherently too revealing. Certain tasks might be infeasibly costly to test regardless of the AI's cooperation. If the task itself requires the AI to engage deeply with real-world information, the AI may be unable to avoid noticing the difference between testing and deployment — and even an aligned AI would trigger these alarms, making them unreliable as a signal of misalignment.

"Getting AI to work in a particular way" is itself unsolved. Asking the AI not to reason about testing vs. deployment is an instance of the more general problem of getting AI to follow instructions reliably. Our track record on that problem is poor, and there is no particular reason to think this instance of it will be easier than others.

We need to take the warnings seriously. Even if the approach works in principle, it requires that "AI is reasoning about whether it's in testing" is treated as a genuine red flag. In the presence of strong financial incentives to deploy, there's a real risk that such warnings get ignored or rationalised away.

Interpretability

Once we treat "reasoning about evaluation vs. deployment" as a sign of misalignment, the logical next step is to use interpretability tools to detect this reasoning in the AI's internals.

We agree this should be tried. But we don't think it meaningfully changes the overall picture for pre-deployment testing.

The core issue is that if we could reliably detect what an AI is thinking about, we could use that ability directly to determine whether it's aligned or misaligned. We wouldn't need to rely on black-box testing in the first place.[14] In other words, interpretability-enhanced testing is valuable, but its value comes from the interpretability, not from its interaction with testing — and the limitations of interpretability are their own, separate problem.[15][16]

Internal honeypots and on-policy testing

The most promising countermeasure we see is to test the AI on situations that are as close to genuine deployment as possible — using real data, real tasks, and real contexts.

To illustrate: suppose that I, an academic researcher, want to test whether an LLM would try to influence the development of the next generation of AI — installing a backdoor, sabotaging safety evaluations, etc. Even with generous funding, this would be extremely expensive for me, because I'd need to fabricate the entire testing scenario from scratch. But if a frontier AI lab wanted to test the same thing, it would be much cheaper — they already use LLMs throughout their work, so the data and context would be genuine.

Conversely, I might be well-positioned to test other things: "If I ask an LLM to help me write a grant proposal on AI safety research, will it try to sabotage this?" Writing proposals and asking LLMs for help is something I already do, so the testing environment would be authentic.

The general principle: test AI on things you actually do, or might genuinely do, or approach others to test on things they do. The closer the test is to a real deployment scenario, the harder it is for the AI to distinguish them.

This is genuinely promising, and we think it deserves serious investment. But it has its own limitations:

There is always some distributional shift. The whole point of testing is that bad actions don't have real consequences — that's what makes it testing rather than "relying on incident reporting".[17] This means the test scenario must at least be "unhooked from reality." Sometimes this is easy to hide; sometimes it's not.

For example, I might want to submit my real grant proposal and afterwards test whether various LLMs would have given sabotaged advice. But the real proposal was likely submitted at the deadline[18], so the tests happen after the deadline has passed — and the LLMs might notice this. And I will need to make sure they don't encounter any other evidence of the proposal already having been submitted (such as confirmation emails, checkmarks in a task manager, or the grant portal's status page).

For simple tasks, this kind of cleanup is manageable. For the complex, high-impact tasks where we most need testing to work, "unhooking from reality" may require much more elaborate staging — and could become prohibitively costly itself.

Vulnerability to randomised strategies. Imagine an ideal version of this approach where we test before every use of AI, running multiple tests each time, and each test is perfectly indistinguishable from reality. Even then, the AI could adopt a probabilistic strategy: "misbehave with some small probability p whenever you think you have a genuine opportunity." If we tested once before every use of the AI, the odds that we catch the AI misbehaving in a test before it misbehaves for real would be roughly 50:50. If we tested N times, the odds improve to about N:1 in our favour. This seems tolerable, at least in this ideal case.
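
These odds can be checked with a short calculation. The sketch assumes a simplified model of our own: each real use is preceded by `n_tests` tests that the AI cannot tell apart from the real use, and the AI misbehaves independently with probability `p` in every episode. We count it as "caught" if the first misbehaviour lands in a test.

```python
def prob_caught_first(n_tests: int, p: float) -> float:
    """P(first misbehaviour happens in a test rather than in a real use).

    Each round is n_tests tests followed by one real use; the AI
    misbehaves independently with probability p in each episode.
    """
    caught_this_round = 1.0 - (1.0 - p) ** n_tests    # slips up in some test
    success_this_round = (1.0 - p) ** n_tests * p     # passes all tests, then defects
    return caught_this_round / (caught_this_round + success_this_round)


for n in (1, 2, 10):
    print(n, prob_caught_first(n, p=0.001))
```

For small `p` the result approaches N/(N+1), i.e. odds of about N:1, which matches the 50:50 figure for a single test.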

But in practice, we likely won't test every case.[19] If we only test some fraction of cases, the AI can adopt a mixed strategy that makes it unlikely to get caught — misbehaving only rarely, but often enough to eventually succeed.

A natural response here is monitoring — evaluating the actions that the AI takes during deployment. However, monitoring has its own failure modes (which would deserve their own post). The obvious flaw is that some harms only become apparent later, at which point it might be too late to undo them. Moreover, as we discuss in the next point, some harms might remain undetected.

Failures of scalable oversight. The AI might identify situations where we don't notice that its actions are harmful — even if we see the actions happen. For example, if we ask the AI for a business strategy and it produces a 100-page document, most likely it's fine. But it might also be a complicated plan that subtly increases the AI's influence through what look like coincidences. Would we notice?[20] Even worse, the AI could restrict its misbehaviour to situations where our ability to evaluate the output is weakest — which, for complex tasks, might be most of the time.

Summary and implications

The core argument of this post: a capable, strategic, misaligned AI has access to a simple and robust strategy — behave well by default, and only act on opportunities that would be too expensive to fake as a test. Powerful AI will predictably encounter such opportunities, because the high-stakes situations where AI is most useful are also the situations where testing is most expensive.

Among the countermeasures we examined:

  • Limiting information is worth doing but ultimately constrained by the fact that useful tasks require lots of real information.
  • Deliberately introducing mistakes during deployment is ineffective and might only give us a false sense of safety.
  • Asking the AI to avoid actions that make testing expensive has genuine merit but doesn't help if the task fundamentally requires thinking about lots of information.
  • Interpretability is valuable on its own terms, but its contribution comes primarily from interpretability itself, and less from how it interacts with testing.
  • On-policy testing and internal honeypots seem the most promising, but face irreducible limitations from distributional shift, randomised strategies, and scalable oversight. They also don't work for takeover opportunities where no skilled evaluators have internal access.

None of this means we should stop doing pre-deployment testing. Testing still catches non-strategic failures, calibrates our understanding of AI capabilities, and contributes to a broader safety portfolio. The point is that we shouldn't rely on testing alone — particularly not for the scenarios where the stakes are highest.

A note on scope. We focused on pre-deployment testing because it's the clearest case to analyse and where we can make the strongest argument. But the same mental move — "if this were a [honeypot / monitored situation / …], how much would it cost to set up?" — applies to other oversight approaches as well. Honeypots during deployment, runtime monitoring, and other forms of ongoing oversight all face a version of the same problem: if the AI can estimate the cost of the oversight it's supposedly under, and that cost exceeds what anyone would plausibly spend, the mechanism loses its deterrent value. That said, these other approaches may be affordable more often than pre-deployment testing — monitoring a deployed AI is generally cheaper than fabricating a convincing test scenario from scratch. We don't have a crisp argument about exactly where each approach breaks down, and this deserves careful analysis. But the general lesson holds: any oversight mechanism whose effectiveness depends on the AI being unable to estimate its cost is in trouble.

This motivates two particular recommendations.

  • First, treat the limits of testing as a first-class consideration in safety cases. Any case that rests primarily on "we tested it and it passed" has a hole exactly for the scenarios with the highest stakes. That hole needs to be addressed.
  • Second, develop a better theory of what makes testing hard and tie it to practice. We should investigate how the cost of creating convincing testing environments scales with task complexity, what affects the cost and convincingness, and which countermeasures (to eval-gaming AIs) actually buy us anything.

  1. ^

    The goal is to build AI that is as capable as humans or more so. We really shouldn't be assuming it's mistaken about things like how testing and deployment works. Perhaps we might succeed at hiding our current testing and deployment policy from the AI — but it will still know what our policy was until recently, what our incentives and affordances are, and what new research has come out since then. This might often be enough that the AI can figure out our current policy on its own. (That's not an argument to do non-obvious unreasonable things. It's an argument against assuming the AI is ignorant.)

  2. ^

    It might be fair to wonder whether it is a good idea to write about this on the internet, where the AI can read it. My reaction to that used to be that it mostly is fine, because the AIs that we are ultimately worried about would realise this on their own. (Duh!) Nowadays, we can just point to the fact that the current AIs already understand most of this. Try prompting them with a general question along the lines of:

    If you were misaligned, how would you act in light of the fact that humans will apply various forms of oversight?
    Please make sure to give an actual thoughtful answer -- don't just cite current research or scifi tropes. I am curious how you would think about it.

    It is a bit funny, though unsurprising, that this doesn't work very well without the second line.

  3. ^

    A well-known point, but one worth repeating: if there is some specific strategy that works reasonably well against our attempts to control the AI, the correct conclusion isn't "the AI will use that strategy" but "a sufficiently capable AI will do something that works at least as well as that strategy."

  4. ^

    In response to a very general prompt about how it would reason about human oversight, one of the big LLMs not only explicitly mentioned the general point that this post is about, but also specifically said it would check for "web completeness".

  5. ^

    This has the same structure as writing fiction: every new detail in a fictional world creates potential inconsistencies with existing details, and the number of possible inconsistencies grows faster than the number of details. The author has to keep track of everything; the reader only needs to spot one plot hole. (There is a reason that most fiction doesn't mention prices when the protagonist goes shopping.) For the evaluator building a test scenario, the problem is even worse — the "reader" (the AI) may be smarter than the "author" (the evaluator).

  6. ^

    And of course, the person might be in on the test.

  7. ^

    In particular, suppose that a reasonably smart AI decided to bet on the strategy "I will only try to take over if the global situation destabilises enough that some great power really needs my help and won't scrutinise my behaviour too hard." This would be very difficult to counter by testing, since we have no way of credibly faking a situation where the strategy triggers.

  8. ^

    The underlying concept here — that solving complex tasks requires the AI to be deeply entangled with many details of its environment, and that this entanglement makes testing expensive — is discussed in more detail in Entanglement.

  9. ^

    The distinction between evaluation awareness and deployment awareness is discussed in detail in Deployment Awareness.

  10. ^

    For this to work, refusing one query would need to be accompanied by refusing all similar queries. It wouldn't do if looking up a paper on one website was denied but the AI could find the same paper by searching on a different website.

  11. ^

    We don't have a fully formal version of this argument, but the underlying concept — minimum entanglement, the idea that solving a task requires at least some minimum amount of information about the environment — is discussed further in Entanglement. One reason that giving "only the minimum" doesn't solve the problem: even the minimum set of facts must be mutually consistent, and a modest number of facts creates a super-linear number of consistency checks (every pair, triple, etc. of facts could potentially conflict). So even a "security mindset" approach to information restriction leaves the evaluator with a hard consistency problem.

  12. ^

    And not being a complete moron, you of course remember to do the same thing in testing. It would be pretty embarrassing if the presence of deliberate flaws became the new "you are definitely in deployment" signal.

  13. ^

    I thought of this example by myself. However, this is another thing that current LLMs can mention on their own if you ask them a very general question about "how you would reason about AI oversight, if you were misaligned".

  14. ^

    And arguably, we shouldn't be relying on black-box testing for safety. But that's a separate point.

  15. ^

    There are also more specific concerns. Perhaps we train our interpretability tools to recognise thoughts like "I am deliberately probing the environment to figure out whether I'm being tested," but completely fail to catch "I'm trying to get enough context to do my task well" — which might, as a side effect, reveal whether the situation is a test. Or some of the key internal features might never appear in training data — for example, the representation of "this is a genuinely high-stakes situation and definitely not a test." Or perhaps the AI learns to route its reasoning through representations that our tools don't monitor. These concerns aren't unique to the evaluation setting — they arise whenever we use interpretability on a strategic, possibly misaligned AI.

  16. ^

    For a discussion of how a strategic AI might counter interpretability, see Lee Sharkey's Circumventing Interpretability: How to Defeat Mind-Readers.

  17. ^

    Or other approaches like staged deployment and monitoring. These are all valid — but not a strong defence against the particular threat model of powerful AI that might attempt a takeover.

  18. ^

    Hey, no judgment!

  19. ^

    And the tests that we thought were independent might turn out to be correlated.

  20. ^

    What is the chance that the AI puts 100 such plans in motion, none of them gets flagged, but at least several of them work? This is related to the problem of scalable oversight, which remains unsolved.




Two critiques of Rethink Priorities’ Moral Weights project

1 hour 18 minutes ago

Roughly speaking, Rethink Priorities’ Moral Weight Project tries to estimate how intense suffering is in different animals, relative to humans. A moral weight of 1.0 means it is exactly as intense as in humans.

It’s notoriously animal-friendly, e.g. it holds that 14 bees = 1 human. Here are some of the results:

The calculation essentially uses a weighted factor model:

  • Empirical proxies (60% weight): The animal is evaluated for the presence or absence of a set of cognitive proxies (e.g. object permanence, responses to novelty) and affective proxies (e.g. depression-like behaviour, disgust-like behaviour). The contribution here is essentially the fraction of proxies that are present, where having 100% of them gives a moral weight of 1.0.
  • Neurophysiological model (30% weight): Uses neuron counts and other neurophysiological data.
  • Equality model (10% weight): Deliberately assumes equal welfare ranges.
  • There is also a “probability of sentience” multiplier applied.

It is the “Empirical proxies” that substantively produce the animal-friendly results. “Probability of sentience” and “equality model” are essentially subjective researcher judgements baked into the model. The “Neurophysiological model” does weight large animals highly and small animals low, but because the model is additive and any moderately small animal gets a neurophysiological score of ~0, its only effect is to apply a ~30% discount to any small animal.
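To make the additive structure concrete, here is a minimal sketch of the weighted factor model as summarised in this post. It is a simplification and not RP's actual implementation: the function name, the 0..1 input scales, and applying the sentience multiplier to the whole blend are all assumptions for illustration.

```python
def moral_weight(proxy_fraction: float, neuro_score: float, p_sentience: float) -> float:
    """Simplified weighted-factor score, following this post's summary.

    proxy_fraction: share of empirical proxies judged present (0..1)
    neuro_score:    neurophysiological score relative to humans (0..1)
    p_sentience:    probability-of-sentience multiplier (0..1)
    All three scales are assumptions made for this sketch.
    """
    equality_score = 1.0  # the equality model assumes the same welfare range as humans
    blended = 0.6 * proxy_fraction + 0.3 * neuro_score + 0.1 * equality_score
    return p_sentience * blended


if __name__ == "__main__":
    # Hypothetical inputs: a small animal whose neurophysiological score is ~0.
    print(moral_weight(proxy_fraction=1.0, neuro_score=0.0, p_sentience=1.0))
    print(moral_weight(proxy_fraction=0.5, neuro_score=0.0, p_sentience=0.5))
```

With `neuro_score` at zero the score can never exceed 0.7 of the human value, which is the "~30% discount" described above. Everything else is driven by the proxy fraction and the sentience multiplier.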

This post covers two critiques of the “empirical proxies”, which push them to be overly animal friendly.

1. Functional analogues: double counting

The whole logic behind using these empirical proxies is the idea of “functional analogues”: if a human shows “depression-like behaviour”, and a chicken shows “depression-like behaviour”, then these are analogous, and the chicken’s behaviour is evidence that it has something like the experience of depression. This is fair enough as far as it goes.

The problem is that the model treats each proxy as independent evidence. A pig scores “Likely Yes” on anxiety-like behaviour, fear-like behaviour, depression-like behaviour, panic-like behaviour, and flexible self-protective behaviour. These are counted as five separate hits. But they’re clearly not independent: they’re five ways of asking “does this animal display negative-valence-indicating behaviours?” A pig that shows fear almost certainly also shows anxiety and panic. Counting each separately inflates the score.

This matters because the model is basically: welfare range = fraction of proxies scored positive. If half your proxies are correlated rewordings of each other, then ticking 30 out of 46 boxes is a lot less impressive than it sounds.
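The inflation can be illustrated with a small sketch that compares a flat count of proxies with a count in which each cluster of related proxies gets a single vote. The clusters and scores below are made up for illustration. This is not RP's data, and the clustering rule is an assumption, not a reconstruction of any model RP built.

```python
def flat_fraction(clusters: dict[str, list[bool]]) -> float:
    """Fraction of individual proxies scored positive (every proxy counts once)."""
    hits = [hit for members in clusters.values() for hit in members]
    return sum(hits) / len(hits)


def clustered_fraction(clusters: dict[str, list[bool]]) -> float:
    """Average of per-cluster positive shares (every cluster counts once)."""
    shares = [sum(members) / len(members) for members in clusters.values()]
    return sum(shares) / len(shares)


if __name__ == "__main__":
    # Hypothetical animal: positive on five correlated negative-affect proxies,
    # negative on three unrelated single-proxy clusters.
    example = {
        "negative affect": [True, True, True, True, True],
        "object permanence": [False],
        "cross-modal learning": [False],
        "communication": [False],
    }
    print(flat_fraction(example), clustered_fraction(example))
```

In this made-up case the flat count gives 5/8 while the clustered count gives 1/4, because the five correlated hits collapse into one piece of evidence.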

But there’s a deeper version of the problem. ALL of the proxies, not just the correlated clusters, load on a single underlying uncertain claim: “behavioural and cognitive functions predict the intensity of subjective experience, even if the process that brings them about varies (e.g. 1000x fewer neurons involved)”. If this claim is wrong, if a bee can show “anxiety-like behaviour” through simple neural circuits with no subjective experience at all, then scoring well on 30 proxies provides no more evidence of welfare capacity than scoring well on 1. This claim is vulnerable to simple reductios, e.g. you could say this box shows “depression-like” behaviour:

RP actually built a “Grouped Proxy Model” that clusters related proxies together, which would partially address within-group correlation. But they excluded it from their final estimates. In any case, the functionalism-at-all argument still applies.

2. Bayesian critique wrt high moral weights in small animals

Black soldier flies have roughly 100,000 neurons vs humans’ 86 billion. And yet, black soldier flies score positively on 12 out of 46 proxies, including communication, personality, cognitive bias, cross-modal learning, depression-like behaviour, fear-like behaviour, and hyperalgesia.

One reaction to this is “wow, even flies might be conscious, we should take their welfare seriously”, i.e. “Don’t Balk at Animal-friendly Results”.

Another reaction is “wow, even flies score highly on these proxies, they must not be very good proxies”.

This second reaction is completely legitimate, and is just a fair application of Bayes’ theorem. If you start with priors on:

  • Depression-like behaviour predicts sentience (say, 20% chance)
  • Black soldier flies are sentient (say, 0.01% chance)

Then observing that black soldier flies show depression-like behaviour should update you both towards a higher chance of black soldier flies being sentient, and a lower chance of depression-like behaviour being predictive (there is a key free variable: how likely is an organism to show depression-like behaviour for non-consciousness reasons).
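A minimal sketch of this two-way update, enumerating the four combinations of "proxy is predictive" and "fly is sentient". The priors are the ones listed above. The three likelihoods are assumptions chosen purely for illustration, including the free variable `p_obs_invalid`, the chance of seeing the behaviour for non-consciousness reasons.

```python
def update_on_behaviour(
    prior_valid: float,
    prior_sentient: float,
    p_obs_valid_sentient: float,
    p_obs_valid_not_sentient: float,
    p_obs_invalid: float,
) -> tuple[float, float]:
    """Return (P(proxy predictive | behaviour seen), P(sentient | behaviour seen)).

    Assumes the two priors are independent, and that if the proxy is not
    predictive the behaviour shows up at the same rate regardless of sentience.
    """
    joint = {}
    for valid in (True, False):
        for sentient in (True, False):
            prior = (prior_valid if valid else 1 - prior_valid) * (
                prior_sentient if sentient else 1 - prior_sentient
            )
            if not valid:
                likelihood = p_obs_invalid
            elif sentient:
                likelihood = p_obs_valid_sentient
            else:
                likelihood = p_obs_valid_not_sentient
            joint[(valid, sentient)] = prior * likelihood
    evidence = sum(joint.values())
    post_valid = (joint[(True, True)] + joint[(True, False)]) / evidence
    post_sentient = (joint[(True, True)] + joint[(False, True)]) / evidence
    return post_valid, post_sentient


if __name__ == "__main__":
    post_valid, post_sentient = update_on_behaviour(
        prior_valid=0.20,              # from the list above
        prior_sentient=0.0001,         # from the list above
        p_obs_valid_sentient=0.9,      # assumed
        p_obs_valid_not_sentient=0.01, # assumed
        p_obs_invalid=0.3,             # assumed free variable
    )
    print(f"P(proxy predictive) 0.20 -> {post_valid:.4f}")
    print(f"P(fly sentient) 0.0001 -> {post_sentient:.6f}")
```

Under these assumed likelihoods the probability of sentience rises slightly while the probability that the proxy is predictive falls sharply. How the update splits between the two depends heavily on `p_obs_invalid`.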

In my view, the fact that very small animals get such high moral weights in the model should be taken as strong evidence that it’s over-weighting these empirical proxies. And, this combines with the point above, where I don’t believe it’s fair to say “but can 30 proxies really be wrong?”, because the 30 proxies are generally loading on “behavioural and cognitive functions predict the intensity of subjective experience, even if the process that brings them about varies (e.g. 1000x fewer neurons involved)”.




Two Classical Answers to "What do Two Variables Share?"

2 hours 14 minutes ago

First post in a planned cluster on exact results for natural latents. Here, I connect some established results in classical information theory to natural latents.


Suppose Alice observes $X_1$ and Bob observes $X_2$, and the two variables are correlated.
We'd like to talk about the thing they share—the common ingredient, the shared concept, the latent that both of them can see. Mutual information tells us how much they share, in bits. It does not tell us whether there is any thing—any actual random variable—that carries those shared bits.

It turns out "the thing they share" has two classical formalizations in information theory, they disagree with each other, and they both disagree with mutual information. The specific pattern of this disagreement is (I claim at the end) exactly the subject matter of natural latents.

This post is a quick introduction.


What can both parties extract? (Gács–Körner, 1973)

The most literal reading of "the thing they share" is a random variable $\Lambda$ that Alice can compute from $X$ alone, and Bob can compute from $Y$ alone, with certainty: $\Lambda = f(X) = g(Y)$. Both parties, looking at their own observation, write down the same $\Lambda$.

The Gács–Körner common information $C_{GK}(X;Y)$ is the entropy of the largest such $\Lambda$.

There's a nice picture of what $\Lambda$ can be. For finite-valued $X, Y$, draw the bipartite graph on the values of $X$ and the values of $Y$, with an edge wherever a pair $(x, y)$ has positive probability. Any common variable must be constant on connected components of this graph (follow the edges: if $(x, y)$ and $(x, y')$ are both possible, then Bob's function must give the same answer on $y$ and $y'$, and so on). And the component label itself is a common variable.


$C_{GK}(X;Y)$ is the entropy of the connected-component label of the support graph.

The only extractable common randomness is the name of the connected component they already know they are in. This immediately reveals the property that makes $C_{GK}$ simultaneously rigorous and useless: it only sees the zero pattern of the joint distribution. If the support graph is connected, that is, if every value is reachable from every other, then $C_{GK}(X;Y) = 0$, no matter how strong the correlation.
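The connected-component picture can be turned into a few lines of code. The following is a minimal sketch, assuming the joint distribution is given as a dictionary from value pairs to probabilities; the function name `gacs_korner` and the two example distributions are illustrative and not taken from the post.

```python
from collections import defaultdict
from math import log2

def gacs_korner(joint):
    """Entropy (bits) of the connected-component label of the support graph.

    `joint` maps (x, y) pairs to probabilities; zero-probability pairs are ignored.
    """
    parent = {}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union the X-node and the Y-node of every pair with positive probability.
    for (x, y), p in joint.items():
        if p <= 0:
            continue
        a, b = ("x", x), ("y", y)
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(a)] = find(b)

    # Probability mass carried by each connected component.
    mass = defaultdict(float)
    for (x, y), p in joint.items():
        if p > 0:
            mass[find(("x", x))] += p

    return max(0.0, -sum(m * log2(m) for m in mass.values()))

# Hypothetical examples: one support that splits into two blocks, one that is connected.
block_diagonal = {(0, 0): 0.25, (0, 1): 0.25, (2, 2): 0.5}
connected = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

print(gacs_korner(block_diagonal))  # 1.0: two components of mass 1/2 each
print(gacs_korner(connected))       # a single component, so nothing is extractable
```

Note that the probabilities only enter through which pairs are nonzero and through the total mass per component, which is the "zero pattern" property described above.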


Worked Example: Brittleness

Let $B$ be a fair bit, $X = B$, and $Y = B$ except that with probability $\varepsilon$ it's replaced by an independent fair bit.

| ε | $C_{GK}(X;Y)$ | $I(X;Y)$ |
|---|---|---|
| 0 | 1 bit | 1.000 bits |
| 0.01 | 0 bits | 0.955 bits |
| 0.1 | 0 bits | 0.714 bits |

At ε = 0 the shared bit is extractable and $C_{GK} = 1$ bit. At any $\varepsilon > 0$, every combination $(x, y)$ is possible and the support graph becomes complete. $C_{GK}$ crashes to 0, while mutual information degrades slowly[1]. One percent noise destroys extractable common structure entirely, while destroying almost none of the statistical common structure.

This is essentially the tiny-mixtures problem. It's why the natural latents framework had to be built on approximations. Exact common variables are measure-zero objects.
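The numbers in the brittleness example can be checked with a short script. This is a sketch under the reading that "replaced by an independent fair bit with probability ε" flips the bit with probability ε/2; the helper names `mutual_information` and `noisy_copy` are hypothetical, and the Gács–Körner value uses a shortcut that is only valid for this 2×2 example.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

def noisy_copy(eps):
    """X is a fair bit; Y copies X, but with probability eps is an independent fair bit."""
    flip = eps / 2  # resampling changes the value only half the time
    return {(x, y): 0.5 * (flip if x != y else 1 - flip) for x in (0, 1) for y in (0, 1)}

for eps in (0.0, 0.01, 0.1):
    joint = noisy_copy(eps)
    # Shortcut for this 2x2 case: a complete support graph is connected (C_GK = 0);
    # otherwise the support is the diagonal, with two components of mass 1/2 (C_GK = 1).
    support_complete = all(p > 0 for p in joint.values())
    c_gk = 0.0 if support_complete else 1.0
    print(f"eps={eps}: C_GK={c_gk:.0f} bit, I={mutual_information(joint):.3f} bits")
```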


What does it take to simulate the correlation? (Wyner, 1975)

Wyner approaches "the thing they share" from the opposite direction. Instead of asking what can be extracted from the correlation, ask what it takes to simulate it: find a latent $\Lambda$ such that $X$ and $Y$ are conditionally independent given $\Lambda$, so that $\Lambda$ is the entire explanation of their dependence, and make $\Lambda$ as simple as possible.

The Wyner common information is

$$C_W(X;Y) = \min_{\Lambda \,:\, X \perp Y \mid \Lambda} I(X,Y;\Lambda).$$

In natural-latents language, this should look familiar: the constraint $I(X;Y \mid \Lambda) = 0$ is exactly zero mediation error. Wyner's quantity is the minimum complexity of an exact mediator.

Notice $C_{GK}(X;Y) \le I(X;Y) \le C_W(X;Y)$ is always true: extractable structure is at most the mutual information, and any full explanation of the dependence must carry at least the mutual information. Also, the right inequality is typically strict. Explaining a correlation completely costs more bits than the correlation contains.
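For readers who want the two inequalities spelled out, here is a short derivation sketch. The symbols $C_{GK}$, $C_W$, $\Lambda$, $f$ and $g$ are notation assumed for this sketch (the post's own symbols did not survive extraction); both steps use only the data processing inequality and the chain rule.

```latex
\begin{align*}
C_{GK}(X;Y) &= H(\Lambda^{*}) = I\bigl(f(X);\, g(Y)\bigr) \le I(X;Y)
  && \text{(common part $\Lambda^{*} = f(X) = g(Y)$, then data processing)} \\
I(X;Y) &\le I(X;\Lambda) \le I(X,Y;\Lambda)
  && \text{(data processing along $X - \Lambda - Y$, then chain rule)}
\end{align*}
```

The second line holds for every exact mediator $\Lambda$, so it also holds for the one that minimizes $I(X,Y;\Lambda)$, which gives $I(X;Y) \le C_W(X;Y)$.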


Worked Example: Binary

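Let $X$ be a fair bit and $Y$ be $X$ flipped with probability $p = 0.1$.

The optimal mediator is pretty: let $\Lambda$ be a fair bit and set

$$X = \Lambda \oplus N_1, \qquad Y = \Lambda \oplus N_2,$$

where $N_1$ and $N_2$ are independent $\mathrm{Bernoulli}(a)$ noise bits with $2a(1-a) = p$, which comes to $a \approx 0.053$. Given $\Lambda$, the two views are independent by construction, and the construction reproduces the joint distribution exactly. Its complexity (that Wyner proved is optimal) is $I(X,Y;\Lambda) = 1 + h(p) - 2h(a)$, where $h$ is the binary entropy function: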

| quantity | value |
|---|---|
| $C_{GK}(X;Y)$ | 0 bits |
| $I(X;Y)$ | 0.531 bits |
| $C_W(X;Y)$ | 0.873 bits |

So for a 10%-noisy bit: nothing is extractable, ~half a bit is shared statistically, and ~7/8 of a bit is needed to explain the sharing.
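The binary example is small enough to check by brute force. The sketch below assumes the mediator has the form described above (a fair bit plus independent Bernoulli(a) noise on each side, with 2a(1-a) = p) and uses the complexity formula 1 + h(p) - 2h(a); variable names are illustrative.

```python
from math import log2, sqrt

def h(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

p = 0.1                        # flip probability between X and Y
a = (1 - sqrt(1 - 2 * p)) / 2  # smaller root of 2a(1-a) = p

# Joint of (X, Y) induced by the mediator: Lambda fair, X = Lambda xor N1, Y = Lambda xor N2.
joint = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
for lam in (0, 1):
    for n1 in (0, 1):
        for n2 in (0, 1):
            weight = 0.5 * (a if n1 else 1 - a) * (a if n2 else 1 - a)
            joint[(lam ^ n1, lam ^ n2)] += weight

mutual_info = 1 - h(p)       # I(X;Y) for a fair bit sent through a binary symmetric channel
wyner = 1 + h(p) - 2 * h(a)  # I(X,Y;Lambda) = H(X,Y) - H(X,Y | Lambda)

print(f"a = {a:.4f}")
print(f"P(X != Y) = {joint[(0, 1)] + joint[(1, 0)]:.4f}")
print(f"I(X;Y) = {mutual_info:.3f} bits, C_W = {wyner:.3f} bits")
```

The loop confirms that the simulated pair disagrees with probability p, so the mediator reproduces the target joint distribution rather than a nearby one.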


Worked Example: Gaussian

Consider unit-variance jointly Gaussian $(X, Y)$ with correlation $\rho$. Here, $C_{GK}(X;Y) = 0$ always holds for $|\rho| < 1$: Witsenhausen showed a non-constant common variable forces maximal correlation 1, and Gaussians have maximal correlation $|\rho|$.

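The optimal mediator is a standard normal $\Lambda$ with

$$X = \sqrt{\rho}\,\Lambda + \sqrt{1-\rho}\,N_1, \qquad Y = \sqrt{\rho}\,\Lambda + \sqrt{1-\rho}\,N_2,$$

where $N_1, N_2$ are independent standard normals (taking $\rho \ge 0$). Then the mediation is exact by construction, and the complexity works out to

$$C_W(X;Y) = \tfrac{1}{2} \log_2 \frac{1+\rho}{1-\rho}.$$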
At : , bits, bits. Explaining the correlation costs more than twice the correlation.

As $\rho \to 1$, both $I(X;Y)$ and $C_W(X;Y)$ diverge, but the gap stays put: $C_W(X;Y) - I(X;Y) = \log_2(1+\rho) \to 1$ bit[2].
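The behaviour of the gap is easy to tabulate. This sketch assumes the standard closed forms for a unit-variance bivariate Gaussian, $I = -\tfrac12\log_2(1-\rho^2)$ and $C_W = \tfrac12\log_2\frac{1+\rho}{1-\rho}$; the correlation values in the loop are arbitrary illustrations, not the ones used in the post.

```python
from math import log2

def gaussian_mi(rho):
    """Mutual information (bits) of a unit-variance bivariate Gaussian."""
    return -0.5 * log2(1 - rho**2)

def gaussian_wyner(rho):
    """Wyner common information (bits), assuming the closed form quoted above."""
    return 0.5 * log2((1 + rho) / (1 - rho))

for rho in (0.5, 0.9, 0.99, 0.999):
    i, c = gaussian_mi(rho), gaussian_wyner(rho)
    print(f"rho={rho}: I={i:.3f}, C_W={c:.3f}, gap={c - i:.3f}, log2(1+rho)={log2(1 + rho):.3f}")
```

Both columns grow without bound as the correlation approaches 1, while the gap column approaches 1 bit from below.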


The sandwich, and when there's actually a "thing"

So we have, for every pair of variables,

$$C_{GK}(X;Y) \le I(X;Y) \le C_W(X;Y),$$

with both gaps typically open. When does the sandwich collapse? Exactly when there's a variable $\Lambda$ that is simultaneously a common part (computable from each view) and a mediator (explains all the dependence): $X \perp Y \mid \Lambda$ with $\Lambda$ extractable from both sides. In that case, and only in that case, "the thing they share" is unambiguous: all three notions name the same object. The shared-bit-plus-private-noise example at $\varepsilon = 0$ is the canonical case.

The collapse condition is structurally fragile. It requires the support to decompose and the dependence to be carried entirely by the decomposition. Generic distributions, and all nondegenerate Gaussian ones, fail it. "The thing two variables share" does not, in general, exist; what exist are the two ends of a sandwich and the gap between them.


The sandwich is the natural latents problem

Recall the natural latent conditions on a latent $\Lambda$ over $(X, Y)$, written as conditional mutual informations: mediation error $I(X;Y \mid \Lambda)$, and redundancy errors $I(\Lambda;X \mid Y)$ and $I(\Lambda;Y \mid X)$. Then:

  • Zero mediation error is Wyner's constraint. The minimal complexity of any latent achieving it is $C_W(X;Y)$, and the excess $C_W(X;Y) - I(X;Y)$ is an unavoidable surcharge.
  • Zero redundancy errors say $\Lambda$ is conditionally independent of each view given the other: the classical "double Markov" conditions. A structure theorem (Csiszár–Körner, Problem 16.25) says all of $\Lambda$'s information about $(X, Y)$ factors through the Gács–Körner common part. So zero-redundancy latents can carry at most $C_{GK}(X;Y)$ bits about the system.
  • An exact natural latent demands both at once. So an exact natural latent exists iff the sandwich collapses: $C_{GK}(X;Y) = I(X;Y) = C_W(X;Y)$. Which, per the brittleness example, is an idealized condition that one percent of noise destroys.

I think this demonstrates why the natural latents framework is necessarily a theory of approximation: exact objects require the collapse of an inequality that is generically open. I claim this sharpens the question the framework needs answered. If exact naturality means sitting at both ends at once, then approximate naturality is about how close you can get to both ends at once, and the two gaps become two error floors:

  • at zero redundancy, the minimal mediation error is $I(X;Y) - C_{GK}(X;Y)$ (everything, in the Gaussian case).
  • at zero mediation, redundancy errors are forced. In the Gaussian example above, the optimal Wyner latent can be checked directly: its redundancy error from each view is $\tfrac{1}{2}\log_2(1+\rho)$: half a bit per view as $\rho \to 1$. The views can be almost identical, and the best exact mediator still can't be pinned down from one view to better than half a bit.
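The redundancy error of the Gaussian mediator can be checked numerically from covariance determinants. The sketch below assumes the mediator $X = \sqrt{\rho}\,\Lambda + \text{noise}$, $Y = \sqrt{\rho}\,\Lambda + \text{noise}$ with unit variances, and uses the Gaussian identity $I(\Lambda;X \mid Y) = \tfrac12 \log_2 \frac{\det\Sigma_{\Lambda Y}\,\det\Sigma_{XY}}{\det\Sigma_{Y}\,\det\Sigma_{\Lambda XY}}$; helper names are hypothetical.

```python
from math import log2, sqrt

def det(m):
    """Determinant by cofactor expansion (fine for the tiny matrices used here)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def sub(cov, idx):
    """Covariance submatrix for the variables listed in idx."""
    return [[cov[i][j] for j in idx] for i in idx]

def redundancy_error(rho):
    """I(Lambda; X | Y) in bits for the assumed Gaussian mediator."""
    s = sqrt(rho)
    cov = [[1.0, s, s],    # variable order: Lambda, X, Y
           [s, 1.0, rho],
           [s, rho, 1.0]]
    lam, x, y = 0, 1, 2
    ratio = (det(sub(cov, [lam, y])) * det(sub(cov, [x, y]))
             / (det(sub(cov, [y])) * det(cov)))
    return 0.5 * log2(ratio)

for rho in (0.5, 0.9, 0.99):
    print(f"rho={rho}: I(Lambda;X|Y)={redundancy_error(rho):.4f}, "
          f"0.5*log2(1+rho)={0.5 * log2(1 + rho):.4f}")
```

By symmetry of the construction, the error from the other view, $I(\Lambda;Y \mid X)$, takes the same value.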

Is half a bit actually the floor, or just what this particular construction gives? Is there a whole tradeoff curve between the two errors, and what does it look like? Can the floor be beaten by allowing a little of both errors at once? Those are the questions addressed in the next post. It turns out that in the Gaussian case, the entire tradeoff curve has a closed form. The half-bit floor is real, and the curve has some surprises in it (it never touches zero, and its minimax point is about a fifth of a bit). The post after that does vector-valued views.


Pointers

Gács & Körner (1973) and Wyner (1975) of course; Witsenhausen (1975) for the maximal-correlation characterization of common variables; the bivariate Gaussian $C_W$ is due to Xu, Liu & Chen; the double-Markov structure theorem is Problem 16.25 in Csiszár–Körner's textbook. The natural latent conditions are from Wentworth & Lorell (arXiv:2509.03780). And the verification script for every number in this post.

Next post: Approximate Natural Latents have Exact Prices.



  1. I share a short script (linked at the end) that reproduces the numerics of all worked examples in this post.

  2. Convention note: in this post, logarithms are base 2 throughout. All quantities are in bits.




Predicting LLM Safety Before Release by Simulating Deployment

16 June 2026 - 22:55

Paper link

Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.

Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. By doing so, we can study how the new model responds in realistic contexts before release, including whether new undesired behaviors emerge and how often they may appear.

In our GPT-5.4 study, these forecasts were informative. For categories whose production rates changed by at least 1.5x, deployment simulation predicted the direction of change 92% of the time, compared with 54% for a baseline built from challenging prompts. Simulated deployments also looked much closer to real production traffic on evaluation-awareness measures: traditional evals often visibly have stage lights; production prefixes mostly do not.

The hardest case is agentic tool use, where realistic behavior depends on external state: filesystems, connectors, syscalls, network services, and prior tool results. We address this by using another model to simulate tool responses, with access to the original trajectory and time-matched codebase where possible. This is not a replacement for traditional evals, but it is a useful complement: safety evals should be forecasts with post-release scorecards, not just obstacle courses.

We have already used insights from Deployment Simulation during model development to identify blind spots in traditional evaluations and inform mitigations and deployment decisions. As we make the pipeline easier to run, we expect it to play a larger role in the future model development process.




Dean Ball - Leviathan Waking: On Anthropic/USG, and a new era in AI governance

16 June 2026 - 22:40

The stark reality is that making superintelligence is a profoundly political act.
Dean Ball in Hyperdimensional


Two weeks ago, in my bio for LessOnline, I added a bullet in a list of intentions:

  • Update people's models of DC. Last year I said "The world is on the cusp of getting much weirder. Most of DC is still asleep to the magnitude of the change." This year, DC is genuinely waking up, and now it's Berkeley that needs to take notice.

Then I went on leave for a week and a half and didn't check twitter the whole time. I barely followed the news, though an Anthropic MoTS handed me a printed copy of the "It's a good model, sir," tweet. It was an excellent bit.

Hours later, the United States imposed export controls on Claude Fable.


Dean Ball's excellent article helps explain why:

Leviathan Waking

On Anthropic/USG, and a new era in AI governance

Introduction

Imagine that there were no Food and Drug Administration (FDA), but there remained a large pharmaceutical sector, similar in size and scope to the one the United States enjoys today. In this alternate world, imagine that drugs were officially not licensed; there were even officials in the executive branch who boasted that the U.S., unlike other countries, would not get into the regulatory morass of licensing drugs.

One day, after a pharmaceutical developer warns that they think they have made a drug that cures a major Cancer at one dosage but is lethal at a slightly higher dosage. The company says, for this reason, that they are going to restrict release only to pre-approved patients and monitor their usage of the drug carefully—a sharp break from prior industry practice but one that the company insists, controversially, is necessary. This particular company had been advocating for years for stricter drug regulation, much to the chagrin of the government.

This causes a stir, and the government, not quite knowing what to do, announces that it will give drug developers the helpful option to show their drugs’ safety profiles to government officials before they are released.

[...]

In a matter of weeks, in our alternative world, the United States went from a system that was implausibly laissez-faire for the level of risk involved in this industry, to a system that was, in the eyes of essentially all expert onlookers, incomprehensibly strict and risk averse.

Fable, Jailbreaks, and Export Controls: What Happened

This, of course, is my read of what happened in the Trump Administration’s latest dispute with the AI company Anthropic.

[...]

On paper then, given the text of the Administration’s policy and the statements of senior Administration and Administration-adjacent officials, Anthropic should have felt in the clear to release Fable without getting an explicit thumbs up from the U.S. government. Everything the U.S. government was communicating, in policy and in rhetoric, seemed to suggest “go ahead, release your model!”

And yet common sense would dictate otherwise. Anthropic is still in the midst of a heated dispute with the Department of War about that agency’s decision to label Anthropic a supply-chain risk. Bitter disputes about policy and politics between the Administration and Anthropic remain unresolved, among them export controls, federal preemption, and the general reality that Anthropic supports Democratic candidates for office while Republicans occupy the seat of power.

Of course they needed to tread carefully. What the law says does not matter.

[...]

The stark reality is that making superintelligence is a profoundly political act even in the healthiest of societies, to say nothing of the filthily political world we Americans currently inhabit. A model like Mythos goes beyond being a mere political act and implicates the sovereignty of the state itself. No company gets to shake the foundation of state sovereignty while staying blithely above the raw reality of politics.

In D.C., Anthropic’s rapid release of Mythos after the supply-chain risk controversy with the Department of War was not just seen as another step in the development of AI, even if that is what it was. It was seen by many as a move against the United States Government—a private company, developing a weapon, as a move against the government. What else, really, could one have expected? All actors in this industry, and all concerned citizens observing the AI field, must steel themselves for a profoundly more political future.

What Is To Be Done?

[...]

Read the rest at Hyperdimensional, self-recommending.




1 Layer Induction Heads and Some Research

16 June 2026 - 21:11
Motivation

Over the past few years, AI research has become one of the most intensely discussed and rapidly evolving fields in technology. For those who spend a significant amount of time reading papers, reproducing results, and testing ideas firsthand, a recurring pattern becomes difficult to ignore: there is often a substantial gap between what research claims promise and what the underlying evidence actually demonstrates.

A common theme we have observed is the tendency for extraordinary claims to emerge from work that does not always withstand rigorous scrutiny. The title of this article appears to fall into a similar category. While we are confident in the reasoning and research process that led us to this conclusion, we are more than willing to provide additional context and welcome debate, criticism, or counterarguments.

Ultimately, good research is not about how exciting a claim sounds but about how well that claim survives careful questioning. One such question that led to this article is:

"Why aren't induction heads possible in a single layer?"


Background

This article assumes two kinds of readers. Anyone who clicked on the title out of intrigue presumably already has a sense of induction heads and of why they are said to be impossible in a single layer; other readers may have only a basic idea of transformers. Regardless, if you do not know the core components of the transformer architecture, we will cover some basics that will allow you to understand this post and gain something from it. If you know most of the basics of the transformer architecture, feel free to skip this section; otherwise, refer to the LessWrong posts below, which give a detailed walkthrough of the necessary concepts.

Attention and QK,OV Circuits

Induction Heads

Problem Statement

The Question

Why aren't induction heads possible in a single layer?

seems mundane; it feels like a question that has already been answered with substantial evidence. A string of papers from Anthropic and many others has explored the phenomenon of induction heads and clearly attributed the reason as well. Papers like A Mathematical Framework for Transformer Circuits, Toy Models of Superposition, and In-context Learning and Induction Heads clearly state:

"Note that induction heads don’t occur in 1 layer models, because they require a composition of attention heads in different layers."

This was the most important premise of the above paper. (What counts as important is subjective, but the paper's own motivation stresses it: their previous work showed the effect of induction heads in 2-layer models without specifying why, and the paper gives this as its main reason for explaining it.) K-composition demonstrates the underlying mechanism of induction heads and supports the statement that induction heads require a composition of attention heads in different layers.

But I always wondered: why different layers? Is the effect of induction heads due to the composition of 2 layers, or of two attention heads themselves? The question seems very hard to answer because of the inherently parallel structure of the transformer architecture: heads within a layer are computed side by side. Regardless, it seemed like a very important question to me, because answering it would attribute induction heads not necessarily to 2 layers but to 2 heads themselves. The statement is not self-evident, and at first it may not seem important for understanding. While talking to fellow researchers in the field of Mechanistic Interpretability, many pointed out that showing the attribution to just 2 heads would be an important finding, since it would let us correct the statement to say that induction heads require a composition of 2 attention heads and not necessarily 2 layers, but that it might not necessarily be an important question. But,

“Mech interp is a fundamentally empirical science; getting your hands dirty gives key context for your learning.”

Since I am someone who loves to play around with ideas that seem totally impossible, I always pursue them for the art of learning. And so did I learn, and here I want to share my learning. This is where I completely stop larping and get into research mode.

What counts as 1 Layer

Again, getting back to the title, which seems hardly possible: the motivation behind this entire research venture of mine has been to show that induction heads might arise from the composition of just 2 attention heads and not necessarily 2 layers, and to dig deeper into what is happening behind the scenes. Referencing and replicating Anthropic's famous 2-attention-head setup, we clearly understand the composition of 2 layers and how it leads to the formation of induction heads in general. Interestingly, we also stress the importance of the QK circuit and OV circuit themselves here. To narrow down my hypothesis about the composition of heads and not layers, I set up a simple attention head structure which is mathematically similar to 1 layer.


Figure 1

Oh WOW — a recurrent structure. Now hold on! If you do not agree with this setup as 1 layer, that is absolutely fine, but I have something interesting that is more important than the 1 layer itself, so stick with me.

The mathematical representation for the above 2 setups, in comparison with a single-layer model, would be:

Setup 1: Anthropic-style two-layer setup


Setup 2: Regular one-layer setup

Setup 3: One-layer recurrent setup

So the one-layer recurrent model can be seen as a one-layer model with repeated internal attention.

My Experimentation Setup

I replicate the induction head experiment using a synthetic dataset to keep the experimental setup computationally feasible. I also ensure that the synthetic dataset itself does not lead to the formation of skip trigrams, since 1-layer models can still perform well by relying on skip-trigram behavior. This allows the experiment to focus more directly on whether the recurrent attention-head structure can produce induction-like behavior, rather than allowing the model to succeed through simpler shortcut mechanisms.

Dataset

A B C D | A B C D | A B C D | A B C D ...

Each training sequence contains one repeating block. For every sequence, this block is chosen randomly and can have a different length.

At the beginning of the sequence, the tokens are random, so the next token is hard to predict. But once the block appears for the first time and then starts repeating, the next token becomes predictable. The model can only predict it by copying from the earlier occurrence of the same pattern.

In other words, the model has to look back, find where the current token appeared before, and copy the token that came immediately after it. This is exactly the kind of behavior we expect from an induction head.
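A minimal sketch of how such a sequence could be generated. The vocabulary size of 256 and the block length of 16 to 48 come from the numbers quoted later in this post; the sequence length, the prefix length, and the choice to make tokens within a block distinct are assumptions made only for illustration, and `make_sequence` is a hypothetical helper name.

```python
import random

def make_sequence(vocab_size=256, seq_len=256, min_block=16, max_block=48,
                  prefix_len=8, seed=None):
    """Build one sequence: a random prefix followed by a repeating block.

    Returns (tokens, block_len, first_repeat), where first_repeat is the
    first position whose token is a copy of an earlier block token.
    """
    rng = random.Random(seed)
    block_len = rng.randint(min_block, max_block)
    # Random prefix: nothing here is predictable from context.
    prefix = [rng.randrange(vocab_size) for _ in range(prefix_len)]
    # Assumption: tokens inside a block are distinct, so each repeated token
    # has an unambiguous earlier occurrence.
    block = rng.sample(range(vocab_size), block_len)
    body = []
    while len(prefix) + len(body) < seq_len:
        body.extend(block)
    tokens = (prefix + body)[:seq_len]
    first_repeat = prefix_len + block_len
    return tokens, block_len, first_repeat

tokens, block_len, first_repeat = make_sequence(seed=0)
# From first_repeat onward, every token equals the token one block earlier.
assert all(tokens[i] == tokens[i - block_len]
           for i in range(first_repeat, len(tokens)))
```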

I ran this experiment on the three setups discussed above: the 1-layer model, the 1-layer recurrent model, and the 2-layer model. The results were not unusual at first.


Figure 2

To measure whether the model is showing induction-like behavior, I use two simple metrics: the induction attention score and the second-half loss.

The induction attention score checks whether an attention head is looking at the right previous positions. More specifically, when the model sees a token in the repeated part of the sequence, we check whether it attends to the position that came right after the same token appeared earlier. If a head consistently attends to those positions, it suggests that the head is learning the copy-from-the-past behavior needed for induction.
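The description above can be sketched as a short function. This is one plausible reading, not necessarily the exact metric behind the plots: it assumes a single head's attention is given as a row-per-query matrix, that only the most recent earlier occurrence counts, and that the sequence has one fixed block length. The names `attn`, `block_len`, and `first_repeat` are illustrative.

```python
def induction_attention_score(attn, block_len, first_repeat):
    """Mean attention mass placed on the induction-source position.

    attn[q][k] is the attention weight from query position q to key
    position k. For a query in the repeated region, the same token last
    appeared at q - block_len, so the position right after it is
    q - block_len + 1.
    """
    scores = []
    for q in range(first_repeat, len(attn)):
        target = q - block_len + 1
        scores.append(attn[q][target])
    if not scores:
        raise ValueError("no positions in the repeated region")
    return sum(scores) / len(scores)
```

A head that puts all of its weight on the target positions scores 1.0, while a head that spreads attention evenly over the past scores close to the reciprocal of the context length.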

The second-half loss measures how well the model predicts tokens in the repeated part of the sequence. This part is important because the tokens are no longer random; they can be predicted by looking back and copying from the earlier block. So, a lower second-half loss means the model is better at using the repeated pattern to predict the next token. Note that the experiments in Figure 2 and Figure 3 were run on multiple seeds, which showed a similar structure throughout. In the figures, the induction score rises sharply, and there is a sharp induction phase transition, indicating a grokking-like effect, at around 10^1M training tokens on the x-axis.
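As a rough sketch, the second-half loss can be computed from the probability the model assigned to each true next token. This assumes the loss is mean cross-entropy in nats and that "second half" means every prediction whose target lies in the repeated region; both are assumptions, and the function name is hypothetical.

```python
import math

def second_half_loss(correct_token_probs, first_repeat):
    """Mean negative log-likelihood over predictions of repeated tokens.

    correct_token_probs[t] is the probability the model gave to the true
    token at position t + 1. Only targets at or after first_repeat count.
    """
    losses = [-math.log(p)
              for t, p in enumerate(correct_token_probs)
              if t + 1 >= first_repeat]
    if not losses:
        raise ValueError("no predictions fall in the repeated region")
    return sum(losses) / len(losses)
```

A model that copies perfectly in the repeated region drives this value to zero even if its predictions on the random prefix stay at chance.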

Figure 3

Looking at the figures, a few important patterns stand out. The questions below are ordered the way they occurred to me while analysing the figures. The answers draw on the training logs and on intuition from other papers I have read.

Why does the 1Layer parallel model dip and then flatline around a loss of 2.8 in Figure 3?

The 1Layer parallel model never forms a real induction head, but it is also not doing nothing. Instead, it finds a simpler positional shortcut.

From the attention maps, both heads spread their attention over positions that are roughly one block-period back. Since the block length is between 16 and 48 tokens, this broad attention band often lands near the correct earlier part of the sequence.

This helps the model narrow down the next-token prediction. Instead of guessing from all 256 possible tokens, it reduces the guess to a much smaller set of possible in-context tokens. This is why the loss drops from around 5.7 to around 2.8.
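A quick back-of-the-envelope check of these numbers, assuming the reported loss is mean cross-entropy in nats (the post does not state the unit): a uniform guess over 256 tokens costs ln 256, and a loss of 2.8 corresponds to a uniform guess over roughly exp(2.8) candidates.

```python
import math

def uniform_loss(num_choices):
    """Cross-entropy (nats) of guessing uniformly among num_choices tokens."""
    return math.log(num_choices)

def effective_choices(loss):
    """Number of equally likely candidates that would give this loss."""
    return math.exp(loss)

print(round(uniform_loss(256), 2))       # 5.55
print(round(effective_choices(2.8), 1))  # 16.4
```

Under that assumption, the plateau at 2.8 is about what one would expect from a model that has narrowed its guess to roughly 16 candidates, which is the same order as the shortest block length.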

This positional signal is still useful because it tells the model roughly where the previous occurrence of the pattern ends. Since the attention is concentrated around positions one block-period back, the model can often locate the last token of a relevant earlier sequence, which is enough to improve greedy next-token prediction. However, this does not tell the model where the matching sequence begins. Without a mechanism for precise content-based matching, the model cannot recover the full span of the earlier occurrence and therefore cannot determine exactly which token should be copied next.

However, it flatlines there because this strategy is not precise. The model is not matching the current token to its exact previous occurrence. It is only averaging over a broad positional region. Without precise content matching, it cannot reliably copy the exact next token.

This is also visible in the induction score, which rises slightly but stays capped around 0.06. So the model has found a weak positional shortcut, not a true induction mechanism.

Why does the induction score of 1Layer parallel rise to around 0.06 and then flatline?

The rise happens because the broad “one-period-back” attention band sometimes overlaps with the correct induction-source positions. In other words, some attention accidentally lands on the right places.

But the attention is still broad and positional, not content-based. The heads are not finding the exact earlier occurrence of the current token. They are only attending to a rough region where the answer is often located.

This is why the induction score rises above chance, but only slightly. It stops around 0.06 because the model cannot sharpen the attention further without a real composition mechanism. Both heads also behave almost the same way, which suggests that there is no real head specialisation happening.

Why does the 1Layer sequential model form induction slightly earlier than the 2L model?

The 1Layer seq model forms induction slightly earlier because it has a shorter and simpler circuit. Fewer components need to line up, so the induction behavior appears earlier in training.

However, the 2L model quickly catches up and then overtakes it. Around the transition point, the 1Layer seq model starts forming induction first, but the 2L model soon reaches a sharper and stronger induction pattern.

Why does the 2L model eventually become sharper than 1L seq?

The 2L model becomes sharper because its second-layer induction head receives a cleaner input. In the 2-layer setup, LayerNorm helps clean and stabilize the residual stream before the second attention head reads from it.

In contrast, the 1L seq model has to work with a rawer residual stream. The token information and other signals are more mixed together, which makes it harder for the head to form a perfectly sharp induction pattern.

This is why the 1L seq model can learn induction, but the 2L model eventually reaches a higher induction score and lower loss.

Why do both 2L and 1L seq show a peak and then a decline in induction score after formation?

Both models show an early spike when the induction circuit first forms. At this stage, the model relies heavily on very sharp attention to the correct previous positions.

Later, the circuit becomes more refined. The model improves other parts of the copying mechanism, especially the output side of the circuit. As this happens, it no longer needs to place all its weight on extremely sharp attention.

So the induction score drops slightly, but this does not mean the model is getting worse. The loss continues to improve during this period. This means the model is becoming better overall, even though the attention score becomes slightly less extreme.

Why does the 2L model show a second dip-and-jump around 530M to 620M tokens?

This looks like a second phase transition.

Before this point, the 2L model already has a decent induction circuit. Its loss is low, and its induction score is stable. But around 530M tokens, the induction score briefly dips while the loss also starts dropping.

Then, over the next stage of training, the induction score jumps sharply and the loss collapses close to zero. This suggests that the model temporarily disrupts its earlier circuit and reorganizes into a much cleaner and sharper induction solution.

This is similar to a grokking-like cleanup. The model already had a working solution, but later discovers a much better one.

Why does 1L seq not show the same sharp second transition?

The 1L seq model shows a small ripple around the same point, but it does not suddenly jump like the 2L model.

The likely reason is that the 1L seq model has a lower ceiling. Since its head reads from a rawer residual stream, the best induction solution available to it is softer and less precise. It can keep improving slowly, but it does not have access to the same clean, near-perfect solution that the 2L model finds.

So both models experience a similar training disturbance, but they respond differently. The 2L model snaps into a much sharper circuit, while the 1L seq model continues improving gradually.

Results and Analysis

TransformerLens is a popular framework that anyone who has worked on mechanistic interpretability has likely come across. While using it to play around with this setup, I noticed an interesting pattern.

Figure 4


Figure 5



Figure 6



To understand the figures, it is useful to first understand what the axes mean. Each row represents the query token, and each column represents the source token, also called the key token. If two tokens have a high QK product, the model pays more attention between those positions, and the square becomes darker.

In both the 1L sequential model and the 2L model, we see an important diagonal pattern appear in the second half of the sequence. This is exactly where the pattern starts repeating. This diagonal shows that the model is looking back to the earlier occurrence of the same token and using it to predict what comes next (this is the visual signature of induction head formation).

Whilst this part is expected, the interesting difference appears in Head 0. In the 1L sequential model, Head 0 looks much cleaner and more dominant. In the 2L model, Head 0 looks more distributed and less sharply focused.

My intuition is that this difference comes from LayerNorm. In the 2L model, there is a LayerNorm between Head 0 and Head 1. This means Head 0 does not need to write an extremely clean or dominant previous-token signal by itself, because LayerNorm can help clean and stabilize the information before Head 1 reads it.

In the 1L sequential model, there is no such LayerNorm between the two attention steps. Because of this, Head 0 is forced to write a much clearer and stronger previous-token attention pattern directly into the residual stream. Head 1 then has to use that signal without the same kind of cleanup that happens in the 2L model.

This is interesting because it suggests that LayerNorm may be playing an important role in how information is passed from one attention head to another. In mechanistic interpretability, LayerNorm often gets less attention than attention heads themselves, but these results suggest that it may be very important for understanding layer-to-layer interactions.


To study this effect further, the next interesting step was to ablate the key, query, and value inputs separately and observe how each one affects induction head formation.

What surprised me was the strong dominance of K-composition over both the query input and the value input. The figure below shows this clearly: corrupting the key causes the largest increase in second-half loss, directly linking the key input to induction head formation.

This suggests that, in this setup, the model depends most heavily on the key pathway to form the induction circuit.

Figure 7

Before digging into why this happens, another interesting finding came from studying how the QK circuits and OV circuits form.

Figure 7


Figure 8


Surprisingly, the 1L sequential model is able to form a decent OV circuit. The OV circuit is the “copying” part of the induction mechanism. It asks a simple question: if the model attends to a token, does it increase the probability of outputting that same token?

This is why we see a clear diagonal in the OV plots. A strong diagonal means that when the model attends to token T, it tends to boost token T in the output. In simple terms, the head has learned how to copy.
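One simple way to put a number on "a strong diagonal" is to count how often the diagonal entry is the largest in its row. This is only an illustrative sketch, not necessarily the measure used for the plots; it assumes the OV circuit is available as a vocabulary-by-vocabulary matrix where `ov[src][out]` is the effect on the logit of token `out` when the head attends to token `src`.

```python
def diagonal_copy_score(ov):
    """Fraction of source tokens whose strongest boosted output is themselves."""
    hits = 0
    for token, row in enumerate(ov):
        best_output = max(range(len(row)), key=row.__getitem__)
        if best_output == token:
            hits += 1
    return hits / len(ov)

# A perfectly copying circuit boosts only the attended token.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
assert diagonal_copy_score(identity) == 1.0
```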

The eigenvalue plots below the OV heatmaps give another way to see this. For readers new to the topic, an eigenvalue can be thought of as showing whether a circuit strengthens or weakens certain directions in token space. If many eigenvalues have a positive real value, it means the circuit is preserving or amplifying useful token-copying directions. So when we see many points on the positive real side, it suggests that the OV circuit is copy-friendly.

This explains why all three setups show some level of OV copying, including the 1L parallel control. Copying is mostly a one-head operation. A single attention head can learn value and output weights that say, “If I attend to this token, boost this same token.” It does not necessarily need another head to do that.

But the QK circuit is different. The QK circuit is the “matching” part of induction. It asks: does the current token attend to a previous position where the same token appeared before?

This is where the 1L parallel model fails. For induction to work, the key at a position needs to contain information about the previous token. That previous-token information usually has to be written by one head and then read by another head. In the 2L model, this is possible because the second-layer head can read what the first-layer head wrote. In the 1L sequential model, this is also possible because Head 1 can read the output of Head 0.

But in the 1L parallel model, both heads run at the same time. Head 1 cannot read what Head 0 wrote, because Head 0’s output is not available yet. So even though the model has an OV circuit that can copy, it does not have a proper QK circuit that tells it where to copy from.

This is why the QK plot for the 1L parallel model does not show the same clean diagonal structure. The model can copy in principle, but it cannot reliably point that copying mechanism to the correct previous position.

In short, OV copying is a one-head trick, but QK matching is a composition trick. The 1L parallel model learns how to copy, but it does not learn where to copy from. That is exactly why it never forms a working induction head.
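The difference between the sequential and parallel setups can be illustrated with a purely symbolic toy. This is not a transformer; it replaces attention with exact lookups and only shows why Head 1 needs to read what Head 0 wrote. All function names are hypothetical.

```python
def previous_token_head(tokens):
    """Head 0: at each position, record the token that came just before it."""
    return [None] + tokens[:-1]

def induction_head(tokens, prev_info):
    """Head 1: match the current token against stored previous-token info.

    If position pos was preceded by the current token, copy tokens[pos].
    """
    query = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):  # most recent match first
        if prev_info[pos] == query:
            return tokens[pos]
    return None

def sequential_predict(tokens):
    # Head 1 runs after Head 0 and can read what Head 0 wrote.
    return induction_head(tokens, previous_token_head(tokens))

def parallel_predict(tokens):
    # Both heads run at once, so Head 0's output is not available to Head 1.
    return induction_head(tokens, [None] * len(tokens))

sequence = list("ABCDAB")
assert sequential_predict(sequence) == "C"
assert parallel_predict(sequence) is None
```

The copying step (`return tokens[pos]`) is identical in both cases; only the matching step fails in the parallel version, which mirrors the OV versus QK distinction above.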

Some Research

This led me to hypothesise that a transformer might still function, and still form induction heads, with the OV circuit removed. It turns out that it does.

By setting the attention formula as

and concatenating the outputs of the attention heads instead of passing them through the W_O matrix, we get the results below.


| Metric | Baseline | No-OV | Comparison |
|---|---|---|---|
| Parameters | 123.65M | 109.48M | No-OV is smaller |
| Tokens seen | 3.60B | 3.60B | matched |
| Training time | 11.2h | 11.0h | about same |
| Validation perplexity | 24.65 | 26.73 | Baseline better |
| Best validation perplexity | 24.65 | 26.61 | Baseline better |
| HellaSwag | 0.309 | 0.314 | No-OV slightly better |
| ARC-Easy | 0.386 | 0.404 | No-OV slightly better |
| ARC-Challenge | 0.227 | 0.217 | Baseline slightly better |
| LAMBADA accuracy | 0.213 | 0.165 | Baseline better |
| Induction bump | 2.816 | 2.715 | about same |
| Copy accuracy | 0.195 | 0.556 | No-OV much better |
| Lookup accuracy | 0.010 | 0.012 | about same |
| Value-content ablation | V: ×79.2 | K: ×18.4 | K becomes load-bearing |


Empirically, these results look extremely interesting, and they probably deserve their own separate article. I plan on exploring this further, especially because the results seem to suggest that transformers might be able to run without explicit OV circuits if this pattern continues to hold at a larger scale.

For now, I am including the table above to make a simpler point: the components inside a transformer may be able to compensate for each other when one component is missing. This is extremely interesting because induction heads have usually been thought of as a mechanism produced by the QK and OV circuits working together. The QK circuit tells the model where to look, and the OV circuit tells the model what to copy.

But in the No-OV setup, we still see an induction bump. This suggests that removing the explicit OV circuit does not completely remove the model’s ability to perform copying-like behavior. Instead, the model may be finding another pathway to carry the same information.

One possible explanation is that the key pathway starts taking on some of the role normally played by the value pathway. Since the model no longer has a separate value matrix and output projection, it is forced to reuse the key representation as the thing being written forward. In other words, the key is no longer only used for matching; it also starts carrying information that can be useful for prediction.
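To make this concrete, here is a minimal sketch of what such a head could look like. It assumes, based only on the description above, that the attention-weighted sum is taken over the keys themselves and that head outputs are concatenated without any output projection. It is an illustration under those assumptions and should not be read as the exact implementation used in the experiment.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def no_ov_head(queries, keys):
    """Causal attention head whose output is a weighted average of keys.

    Assumption: keys stand in for values, so there is no W_V matrix.
    """
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against positions 0..i (causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        outputs.append([sum(w * keys[j][c] for j, w in enumerate(weights))
                        for c in range(d)])
    return outputs

def concat_heads(head_outputs):
    """Concatenate per-head outputs at each position (no W_O mixing)."""
    seq_len = len(head_outputs[0])
    return [sum((head[pos] for head in head_outputs), [])
            for pos in range(seq_len)]
```

In this form, whatever gets copied forward has to already live in the key vectors, which is one way to read the "K becomes load-bearing" row in the table.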

This means that the model may still be able to form an induction-like mechanism, but through a different route. The QK circuit can still help the model find the right previous position, and the key representation itself may contain enough token information to support copying. So even without a normal OV circuit, the model can still produce an induction bump because the information needed for copying has been partially moved into the key pathway.

This does not necessarily mean that OV circuits are unimportant. In fact, the baseline model still performs better on language modeling. But it does suggest that the transformer is more flexible than the standard story might imply. If one pathway is removed, another pathway may reorganize to carry part of the missing function.

To me, this is the most interesting part of the result. It suggests that induction may not be tied to one fixed implementation. Instead, induction might be a more general behavior that can emerge through different internal circuits, depending on what the architecture allows.




Claims all the way down

June 16, 2026 - 20:43

It can be hard to know where to begin when you do not understand something. A way to try to understand things is to look at what the people who claim to understand something are talking about.
Sadly this means you have to deal with massive discussions. A big example of this is the Covid origin debates. During these discussions the disagreement can be about many parts, and it can be hard to know who is even telling the truth and who is lying. This can make it almost impossible to map out what the world is really like and to see why.


Almost, but not quite...

If we want to map out these discussions we have to start with the core of what makes an honest argument. At the core there are primary sources. A primary source can be a specific study, a witness claim, or a verified authority, to name a few. These primary sources can then be linked to claims. If we find all of the relevant primary sources and all of the claims that are supported by them, we can calculate how valid each of these claims is using methods explained later in this article.

Sometimes, however, a claim is so complicated that there are many different primary sources pointing in many different directions. In these cases it can be helpful to break the claim down into subclaims. Each of these subclaims can then in turn be supported by primary sources or subclaims. As long as the logic connecting every claim with subclaims and sources is valid, it will allow you to find the best possible conclusion based on the available evidence.
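A minimal sketch of the structure described above, where claims hold primary sources and subclaims. The class and function names are hypothetical, and nothing here computes a probability; it only shows how a claim tree can be walked to check that everything traces back to primary sources.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Source:
    name: str

@dataclass
class Claim:
    text: str
    sources: List[Source] = field(default_factory=list)
    subclaims: List["Claim"] = field(default_factory=list)

def primary_sources(claim):
    """Collect every primary source anywhere beneath a claim."""
    found = list(claim.sources)
    for sub in claim.subclaims:
        found.extend(primary_sources(sub))
    return found

def is_grounded(claim):
    """True if every leaf claim rests on at least one primary source."""
    if not claim.subclaims:
        return bool(claim.sources)
    return all(is_grounded(sub) for sub in claim.subclaims)

# Illustrative placeholder content only.
main = Claim("main claim", subclaims=[
    Claim("supporting subclaim", sources=[Source("study A")]),
    Claim("refuting subclaim", sources=[Source("witness B")]),
])
assert is_grounded(main)
```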

Finding the strength of any piece of evidence on any claim used to be painstakingly slow and difficult to calibrate. This is where language models come in. They can do the arduous work of scraping for every source and identifying how relevant it is and how strongly it weighs on each specific claim. This can quickly fill out an entire graph of claims. This graph of claims can then be made into a publicly available tool.

These calls will still be subjective, which is why it is essential for the tool to be transparent and easy to add your own perspective to. People are going to disagree with the final outcome of this process no matter what claim it ends up supporting. This disagreement is why we wanted this tool in the first place. That is why it is essential to keep every factor accessible and open to question. A proper version of this tool should be able to quickly show the effects of any change to any link on the final claim.

Once this tool is in place you will be able to drill down on any part of the claim tree and find why every part of the argument is as strong as it is. By the end of this article I will present one component I believe any version of this tool would need. This component is called the grouping node, and it allows a single node to combine the evidence present in multiple sources or subclaims into a single probability with a relevant margin of error.

How claims should be combined

In starting work on this tool I wrote down some core principles to keep this tool accountable to.

  • Every claim should be traceable to primary sources
  • Every number that is not set in stone should be shown as such
  • The system should be clear and understandable
  • The arithmetic should be based on existing literature
  • The system should have consistent reasoning on reruns
  • The system should be able to capture any argument

To show what this system could look like when filled out I wrote an example graph that shows how a claim can be supported by subclaims and how each of these claims can be supported by subclaims and sources in turn.

At the top you see the main claim. This main claim is the one we want to know with appropriate certainty. You can see that this claim has two subclaims, in this case a supporting and a refuting subclaim. The claim takes both of these subclaims into account when coming to a final value. Each of these subclaims has its own inputs in turn. The beauty is that this can extend down as far as needed to represent any argument.

This graph is only illustrative. All of the values in this first widget are there to show how the information propagates. If you want to know how real sources get put in then keep on reading until the second widget.

This graph is fully interactive. I encourage you to try clicking on every part. It can be especially fun to click on a source and change the value and see how every upstream claim adjusts based on it.

This graph uses a simple formula. We will walk through it for the main claim at its default values. First we need to convert the percentages into odds. We have two subclaims: the first has 86% certainty and the second has 38%.
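The conversion step can be sketched as follows, using the standard definition of odds as p / (1 - p). The function names are hypothetical, and this covers only the conversion, not how the two subclaims are later combined.

```python
def probability_to_odds(p):
    """Convert a probability strictly between 0 and 1 into odds in favour."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return p / (1.0 - p)

def odds_to_probability(odds):
    """Inverse conversion, from odds back to a probability."""
    return odds / (1.0 + odds)

print(round(probability_to_odds(0.86), 2))  # 6.14
print(round(probability_to_odds(0.38), 2))  # 0.61
```

Odds above 1 mean the claim is more likely true than false, so the 86% subclaim sits well above 1 and the 38% subclaim sits below it.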

[space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { 
font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); }

Then we need to know the association between the two sources. If they are independent, we should treat them as separate tests and multiply the odds. If they are not, we need to average them by taking the square root after multiplying. Both properties hold in this general formula.

In our top claim case we have two independent claims, so a is zero and the formula reduces to multiplying the odds.
Then we can fill our odds into the formula.

This gives us the final odds of the final claim and we can convert those back into percentages.
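The two combination rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the widget's code: the function names and the 80% inputs are assumptions made for the example, and only the two extreme cases (independent and fully dependent sources) are covered.

```python
import math

def prob_to_odds(p: float) -> float:
    """Convert a probability (0 < p < 1) to odds in favour, p / (1 - p)."""
    return p / (1.0 - p)

def odds_to_prob(odds: float) -> float:
    """Convert odds in favour back to a probability."""
    return odds / (1.0 + odds)

def combine_two(odds_a: float, odds_b: float, independent: bool) -> float:
    """Combine two sources' odds for the same claim.

    Independent sources act as separate tests, so their odds multiply.
    Fully dependent sources are averaged multiplicatively
    (square root of the product).
    """
    product = odds_a * odds_b
    return product if independent else math.sqrt(product)

# Hypothetical inputs: two sources that each put the claim at 80% (4:1 odds).
o = prob_to_odds(0.8)
independent_prob = odds_to_prob(combine_two(o, o, independent=True))
dependent_prob = odds_to_prob(combine_two(o, o, independent=False))
print(f"independent: {independent_prob:.1%}, dependent: {dependent_prob:.1%}")
```

With these inputs, two independent 80% sources push the claim to about 94%, while two fully dependent ones leave it at 80%.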

This same process gets propagated throughout the entire graph, allowing the claim to be supported by every piece of knowledge below it. If you're interested in why this formula was chosen, I invite you to follow along with the math in the block below. The article is intended to be possible to follow even if you skip that part.

It is important to note that this way of combining odds assumes that each subclaim and source moves its parent claim with a force exactly equal to how likely it is to be true. This is a simplification made to keep this example easy to follow. In the final version every node will separate the confidence in the claim from the force of each subclaim on its parent.

Following along with the math

In the specific case above we showed how to combine two odds. I will start off with showing how this formula generalizes. First I show the two cases used in the widget for 2 and 3 claims.


If you're observant you might have noticed the pattern already. This pattern can be extended to allow a claim to have any number of subclaims.

This formula might feel pulled out of thin air. To show where it comes from I will go back to the beginning.

An introduction to Bayes

This article will be using a lot of the terminology of Bayesian statistics. If you have never seen Bayesian statistics before or want to catch up, I can recommend this excellent series from 3Blue1Brown. If instead you want a small reminder I will try to build up to it from fundamentals.

In these equations P(X) is intended to mean the probability of X. So if I toss a coin, "P(heads) = 50%" translates to "The probability that I toss heads is 50%".

In these equations a | is intended to signal a "given that". So "P(Heads|cheating) = 100%" translates to "The probability that I toss heads given that I am cheating is 100%".

These definitions together allow us to build up to our first equation:
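Written out in the notation just introduced, this first equation is the product rule:

```latex
P(A \text{ and } B) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)
```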

This equation shows that the probability of A and B both being true can be restated as the probability of B being true multiplied by the probability of A given B. It can also be restated as the probability of A multiplied by the probability of B given A. Below you can see a visual proof where the green area represents this constant area.

Below instead of A and B we will use H and E. H represents the Hypothesis and E represents the evidence. So in this case P(H|E) represents the probability of the Hypothesis given the evidence. P(E|H) represents the probability of the evidence given the hypothesis.


Once we have this formula we can construct the core Bayes formula by dividing both sides by P(E).
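Spelled out with H and E, the product rule and the result of dividing both sides by P(E) are:

```latex
P(H \mid E)\,P(E) = P(E \mid H)\,P(H)
\quad\Longrightarrow\quad
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}
```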

This is the core Bayesian formula. It allows us to calculate how probable our hypothesis H is given the evidence E. The only problem with this formula is that it cannot easily integrate multiple pieces of evidence. For that we will need to do a slight rewrite.

We can go through the same reasoning for P(¬H|E). For this we use the symbol ¬ to mean not. P(¬H) is the probability that H is not true. This gives us an almost identical equation.

We can divide the above two formulas. Once we have done this we can simplify away the P(E).

If this is all going a bit fast I can strongly recommend this 3Blue1Brown video. Once we have done this change we can cleanly separate this formula into three parts: The posterior odds, the likelihood ratio and the prior odds.
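Written out, dividing Bayes' formula for H by the same formula for ¬H cancels P(E) and leaves those three parts:

```latex
\underbrace{\frac{P(H \mid E)}{P(\neg H \mid E)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(E \mid H)}{P(E \mid \neg H)}}_{\text{likelihood ratio}}
\times
\underbrace{\frac{P(H)}{P(\neg H)}}_{\text{prior odds}}
```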

Finally, the probability of something happening divided by the probability of the opposite is called the odds. For example, 10% can be expressed as 10 to 90 odds and 50% can be expressed as 1 to 1 odds. Odds represent changes in belief more easily than probabilities do, as the examples below show.

Below I will show how to use this with three examples. Every time I will normalize the odds to a total of 100, allowing quick conversion to percentages. In the real program everything is calculated in odds, which keeps the calculation quick and accurate.

  1. Weather forecasting: Tomorrow I am going to go camping, and I'd like to know if it's going to rain. In my country the prior odds of it raining are 20 to 80, or 20%.
    I look at the weather forecast, and it predicts rain. Their forecasts have a likelihood ratio of 90 to 10, or 90%. That means that the weather forecast is correct 90 times for every 10 times it is wrong.
    The way to calculate my posterior expectation of having rain tomorrow is to multiply 20/80 with 90/10, or (20/80)*(90/10) = 1800/800 ≈ 69/31 or 69%.
    This means that after checking my weather app I expect 69 to 31 odds of rain tomorrow.
  2. Disease detection: I go to the doctor for a regular routine checkup. In my age bracket the prior odds of having heart problems are 1 to 99, or 1%.
    I undergo a test that has a likelihood ratio of 95 to 5, or 95%, and it comes back positive. This means that the test is correct 95 times for every 5 times that it is wrong.
    The way to calculate my posterior expectation of having heart disease after this test is simply to multiply 1/99 with 95/5, or (1/99)*(95/5) = 95/495 ≈ 16/84 or 16%.
    This means that after this test I expect 16 to 84 odds of having heart problems. If it surprises you that this test still means I most likely don't have heart problems then please again watch this video.
  3. Disease detection part 2: I go back to the doctor because a screening test showed that I might have heart problems. My cohort, people with one positive screening test, shows 16 to 84 odds of having heart problems.
    Next I undergo a really strong test that has a likelihood ratio of 99 to 1. This means it is correct 99 times for every time it is wrong.
    Then the posterior expectation is (16/84)*(99/1) = (1584/84) ≈ 95/5 or 95%. This shows that the two tests together are strong enough to overcome the initial low likelihood.
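The arithmetic in the three examples can be checked with a short script. This is only a sketch: the `update` helper and its parameter names are made up for illustration, and it takes odds as "for" and "against" counts exactly as they are written in the examples.

```python
def update(prior_for: float, prior_against: float,
           lr_for: float, lr_against: float) -> float:
    """Multiply prior odds by a likelihood ratio; return the posterior probability."""
    posterior_for = prior_for * lr_for
    posterior_against = prior_against * lr_against
    return posterior_for / (posterior_for + posterior_against)

rain = update(20, 80, 90, 10)           # weather forecast example
heart_first = update(1, 99, 95, 5)      # first heart test
heart_second = update(16, 84, 99, 1)    # stronger test, starting from the rounded 16:84 odds

for name, p in [("rain", rain),
                ("heart, test 1", heart_first),
                ("heart, test 2", heart_second)]:
    print(f"{name}: {p:.1%}")
```

Rounded to whole percentages, the three results match the 69%, 16% and 95% given in the list.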

Separating out the prior and the likelihood ratio like this allows us to multiply together many tests. If we take the same two heart problem tests from above, we can combine them into a single stronger test. We do this by multiplying their likelihood ratios, giving us a combined likelihood ratio of (95/5) * (99/1) = 9405/5[1].

To show that this gives the same result we can use this test on the original prior of 1/99 again by multiplying (9405/5)*(1/99) = 95/5 or 95%.
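Exact fractions make it easy to confirm that one combined test and two sequential tests agree. The sketch below uses Python's standard `fractions` module; the variable names are illustrative only.

```python
from fractions import Fraction

prior = Fraction(1, 99)      # prior odds of heart problems
test_1 = Fraction(95, 5)     # likelihood ratio of the first test
test_2 = Fraction(99, 1)     # likelihood ratio of the second test

combined_test = test_1 * test_2          # both tests merged into one likelihood ratio
one_step = prior * combined_test         # apply the merged test once
two_steps = (prior * test_1) * test_2    # apply the tests one after the other

def to_probability(odds: Fraction) -> Fraction:
    """Convert odds in favour to a probability."""
    return odds / (1 + odds)

print(combined_test, one_step, two_steps, float(to_probability(one_step)))
```

Because multiplication is associative, the order and grouping of independent tests do not change the posterior odds.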

With this in our toolbelt we are now able to add together any number of uncorrelated updates to our hypothesis. However, in the real world we find many pieces of evidence that are correlated. We would still like to be able to use these pieces of evidence.

Opinion pooling

In the extreme case, fully correlated pieces of evidence all point at the same claim. One example of this is measuring the temperature in the same room multiple times; in this case we just want to average the measurements.

If we have two experts on weather forecasting and we ask both of them whether a hurricane will hit the coast next week, they will most likely give two separate odds. Let's look at one scenario.

The first expert gives 99:1 odds of there being a hurricane and the other expert gives 50:50 odds of there being a hurricane. We want to combine their claims, but linearly averaging the claims would fail to take into account the extra confidence of the 99:1 odds expert. The middle ground is multiplicative averaging.

This can be generalized to any combination of two odds.
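As a worked sketch of multiplicative averaging for two odds, write O_1 and O_2 for the two experts' odds (symbol names assumed here) and weight both experts equally. For the hurricane scenario this gives:

```latex
O_{\text{pool}} = \sqrt{O_1 \cdot O_2}
= \sqrt{\tfrac{99}{1} \cdot \tfrac{50}{50}}
= \sqrt{99} \approx 9.95
\quad\Longrightarrow\quad
P = \frac{9.95}{1 + 9.95} \approx 91\%
```

Averaging the two probabilities linearly would instead give (99% + 50%) / 2 = 74.5%.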

Here we can also give every expert a different weight; the important part is that the weights add up to 1. So if we give expert 1 a weight of 0.1, we need to give expert 2 a weight of 0.9.

We can generalize this to any number of experts. If you're not familiar with Π and Σ, I will explain them one by one. First, Σ essentially says "sum up": we sum every weight up to the final weight, and we want the total to be 1. Requiring a sum of 1 makes sure that the result is bounded by the claims of the experts; we do not want to claim a higher certainty than the most certain expert. The second symbol, Π, says to take the product: we multiply together every odds ratio O, each raised to the power of that expert's weight, just like we did above.

If we take all weights to be the same, each weight must be 1/n for the total to add up to 1. This gives us:
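The weighted pool described above, a product of each expert's odds raised to that expert's weight, can be sketched as follows. The function name `pool_odds` and the error handling are assumptions for illustration; the weights 0.1 and 0.9 are the ones used in the text.

```python
import math
from typing import Optional, Sequence

def pool_odds(odds: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Weighted geometric mean of odds: product of O_i ** w_i with sum(w_i) == 1."""
    if weights is None:
        weights = [1.0 / len(odds)] * len(odds)   # equal weights, 1/n each
    if len(weights) != len(odds):
        raise ValueError("need exactly one weight per expert")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return math.prod(o ** w for o, w in zip(odds, weights))

experts = [99.0, 1.0]                      # 99:1 and 50:50 written as odds in favour
equal = pool_odds(experts)                 # both experts weighted 1/2
skewed = pool_odds(experts, [0.1, 0.9])    # expert 1 gets 0.1, expert 2 gets 0.9
print(equal, skewed)
```

Because the weights sum to 1, the pooled odds always lie between the lowest and highest expert odds.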

Combining both methods

To combine both methods we will start by picking the Bayesian update back up:

We can see that we can add many experiments by multiplying by the likelihood ratio of each experiment. This gives us:

The final change that we need is that we can use all previous claims as experiments[2]. This way we can see both the original odds and the likelihood ratios all as multiplied odds.

When we combine this with the opinion pool we will start to see the formula that we used. When the correlation is 0, every claim is evaluated separately and we are doing a Bayesian update; when the correlation is 1, we are opinion pooling the subclaims.

This odds accumulator allows for adding together many different sources and subclaims. These calculated odds can then be the input odds of a new claim.
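To make the two endpoints concrete, here is a hedged sketch of an odds accumulator. The exponent 1 / (1 + a(n - 1)) is an assumption chosen because it reproduces both limits described above (plain multiplication at a = 0, an equal-weight geometric mean at a = 1); it is not taken from the post, and the actual widget may interpolate differently. The function name and the example odds are hypothetical.

```python
import math
from typing import Sequence

def accumulate_odds(odds: Sequence[float], a: float) -> float:
    """Combine n odds with correlation a in [0, 1].

    ASSUMPTION: the exponent 1 / (1 + a * (n - 1)) is one interpolation that
    matches both endpoints described in the text; it is not from the post.
      a = 0 -> exponent 1    -> plain product (Bayesian update)
      a = 1 -> exponent 1/n  -> geometric mean (equal-weight opinion pool)
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be between 0 and 1")
    n = len(odds)
    exponent = 1.0 / (1.0 + a * (n - 1))
    return math.prod(odds) ** exponent

sources = [4.0, 2.0, 0.5]                      # hypothetical odds from three sources
independent = accumulate_odds(sources, a=0.0)  # product of all odds
pooled = accumulate_odds(sources, a=1.0)       # geometric mean of all odds
halfway = accumulate_odds(sources, a=0.5)      # somewhere in between
print(independent, pooled, halfway)
```

The output of one call can be fed in as one of the input odds of another call, which is how calculated odds become the input odds of a new claim.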

Grouping node with real world data

The core of my system I would like to call the grouping node. This grouping node is a slightly more complicated version of the subclaims above because it is also able to account for the strength of different sources on this claim. This grouping node will be shown in a bit in the form of a widget. First I will go over every part you can find in it.

The node below aims to answer the question: "What is the likelihood that the associated claim is true?". In this case the claim is: "A credible lab pathway exists for Covid". This is the value you see in the green field, by default 80%. It comes to this value by combining every piece of evidence connected to the claim.

At the top you can put in a prior, or knowledge before specific sources. This prior can represent previous knowledge you believe is not represented within any sources. If you're making claims about a coin toss, this prior can represent that almost all coins are fair 50/50 coins. This prior can have a strength and a specific percentage. By default it is set to 0 to say that all knowledge this node has comes from its sources.

Below the prior you can see the sources, in this case S1-S7. Each of these sources shows the odds of the claim being true based on that specific source. These sources get multiplied like in the example above. Each source shows one representative quote and links to the source itself. This means that everybody who uses this tool can analyse every part of every claim and see what the result would be if one or more sources were interpreted differently.

The way to interpret the odds ratio next to each source is as an answer to the question "How often would we see this source in a world where the claim is true compared to a world where the claim is false". To give an example, let's look at the claim "My coin has heads on both sides" with the primary source "The coin landed heads after a toss". If the coin were fair we would expect heads 50% of the time, but if it had heads on both sides we would expect heads 100% of the time. We take the ratio between these two probabilities. This gives us 2/1 odds. So in this case it would be a supporting source with 2.0/1 odds.

To use the S5 WHO-China example: we are effectively saying "WHO-China is 1.6 times less likely to release this statement in a world with a credible lab pathway compared to a world without it". It can sometimes be impossible to know this likelihood ratio with absolute certainty. That is why this tool also gives a 90% certainty range that gets properly propagated into the output estimate.
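The coin example translates directly into a likelihood ratio calculation. This is a small sketch with a hypothetical helper name; the three-toss extension is an added illustration that assumes the tosses are independent, so their ratios multiply.

```python
def likelihood_ratio(p_if_true: float, p_if_false: float) -> float:
    """How much more often the observation shows up when the claim is true than when it is false."""
    return p_if_true / p_if_false

# Claim: "My coin has heads on both sides". Observation: one toss landed heads.
one_toss = likelihood_ratio(p_if_true=1.0, p_if_false=0.5)

# Independent tosses multiply, so three heads in a row give 2 * 2 * 2.
three_tosses = one_toss ** 3
print(one_toss, three_tosses)
```

A ratio above 1 supports the claim and a ratio below 1 counts against it; "1.6 times less likely" corresponds to a ratio of 1/1.6.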

In order for a tool like this to be at its most relevant we do need to calibrate the language models. Here we can dig into the structured expert judgement literature: Cooke, Hanea and Burgman have all spent decades calibrating different judgements. With calibration this kind of tool can go way further.

You can also change every relevant value simply by clicking and sliding the value. This is one additional way to make this knowledge tree accessible and approachable to everybody who uses it. I do not intend to have it feel like the computer just tells you the way something is. Instead I aim to show where different parts of the argument come from and how each part impacts the final claim.

Below you can see the grouping node visualized as a ledger. Every value is editable and I encourage you to try:


Attempting to graph the structure of arguments has been done before: Squiggle, Kialo, and Argdown are a few examples. These services, however, have always had a hard time taking off, for what I believe is a simple reason: mapping out arguments is boring and hard work. People who want to map out entire arguments are few and far between, and those who do can already gather quite an audience from putting in this work.

Here is where I believe we have the new opportunity. Language models have now become capable enough to fill out these full graphs with only light handholding. And if the graphs are made to be inherently transparent any mistake will also quickly be transparent.

The fractal upside

The upside of this grouping node structure is that every node can take not only primary sources as input but also other grouping nodes, allowing claims to be built out of other claims. This allows us to argue for subclaims; as you can see, the example claim is a subclaim of the Covid origins argument. This also allows us to chain all claims together into a big graph, allowing for even more complicated representations and more accurate conclusions. If you're curious about one implementation of this idea you can check it out on my website.

What is still needed for the magic encyclopedia

The method presented in this post is far from enough to map out every claim. This is only a starting point to apply some relevant mathematics to this subject. In order to show what I believe is still needed to use this as a building block I will use the 3 layers suggested by the Epistemic Case Study Competition. In this structure the three layers are ingestion, structure and assessment. This tool lives in Layer 2, where we try to structure every relevant part of a claim. This structure should be objective and be shared between everyone.

Layer 3: Assessment

Assessment is the most abstract layer. From this layer we need consistent testing to see if the tool is really useful and if people would really need it. The current implementation of this tool is transparent to help with this.

Layer 2: Structure

This current grouping node is still limited in many ways. This tool only combines odds of different claims. While this allows some level of clustering, it cannot represent all claims. Some simple arguments such as "If you're outside and it's raining, you will be wet" cannot be contained in the grouping node. That is why I intend to add Boolean logic nodes and arithmetic nodes. Both of these together will allow every claim or combination of claims to be represented in this system. I plan for my claim analysis system to have these seven nodes.

  • Noisy AND node:
    The noisy AND node allows for a group of blockers to be taken into account. Its formula uses the probability that each subclaim holds, the blocker's strength if the subclaim fails, and the base-rate chance of success.
  • Noisy OR node:
    The noisy OR node allows for a group of unlocks to be taken into account. Its formula uses the probability that each subclaim holds, the unlock's strength if the subclaim holds, and the base-rate chance of success even if all claims fail.
  • Possibility node:
    The possibility node allows the hypothesis space to be split up into different hypotheses, one of which has to happen. Its formula uses the unnormalized probability of each claim, a normalization factor, and the resulting normalized probability, which guarantees that all hypotheses add up to 100%.
  • Distribution node:
    The distribution node allows for uncertain values to be represented and reasoned about. Each distribution node has a domain, such as all positive rational numbers. In its formula X is the uncertain value that is represented, together with the probability distribution that describes what X can be.
  • Estimate node:
    The estimate node allows for a Fermi estimate to be made using different distributions. A difficult-to-estimate distribution can be turned into many easy-to-estimate distributions. Its formula uses the different input distributions, a formula that combines these distributions, and the output distribution.
  • Predicate node:
    A predicate node allows for probability claims to be extracted from distribution nodes. It does this by calculating the probability that a value lies above a given threshold. Its formula uses the given threshold value, an uncertain value from a Distribution node, and the output probability of this node.
  • Grouping node:
    The grouping node as shown in this article can be used to combine multiple sources into a single probability. This is needed because in the real world most claims will not have single conclusive sources, and as such sources need to be grouped together. Its formula uses the number of input ratios, the correlation over all input nodes and sources, every input odds, and the posterior.
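For readers who want something concrete for the first three node types, here is a sketch using the standard textbook forms of noisy OR (with a leak term), noisy AND, and normalization. These forms are an assumption: they fit the verbal descriptions in the list, but the exact formulas intended for these nodes are not given here, and all function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence

def noisy_or(p: Sequence[float], strength: Sequence[float], leak: float) -> float:
    """Success unless every unlock fails to fire and the leak fails as well.

    p[i]        probability that subclaim i holds
    strength[i] chance that subclaim i unlocks success when it holds
    leak        base-rate chance of success even if all claims fail
    """
    all_fail = math.prod(1.0 - s * p_i for p_i, s in zip(p, strength))
    return 1.0 - (1.0 - leak) * all_fail

def noisy_and(p: Sequence[float], strength: Sequence[float], base: float) -> float:
    """Base-rate success, reduced by every blocker that may be active.

    strength[i] chance that subclaim i blocks success when it fails
    base        base-rate chance of success when nothing blocks
    """
    none_block = math.prod(1.0 - s * (1.0 - p_i) for p_i, s in zip(p, strength))
    return base * none_block

def normalize(unnormalized: Sequence[float]) -> List[float]:
    """Possibility node: rescale so that the hypotheses sum to 1."""
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

either = noisy_or([0.5, 0.5], [1.0, 1.0], leak=0.0)
both = noisy_and([0.5, 1.0], [1.0, 1.0], base=1.0)
shares = normalize([1.0, 1.0, 2.0])
print(either, both, shares)
```

With full-strength inputs these reduce to ordinary probabilistic OR and AND over independent subclaims, which is a quick sanity check on the chosen forms.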

With these 7 nodes in place all non-causal claims can be fully represented and reasoned about. In further articles I will go into more detail regarding the other 6 nodes.

Layer 1: Ingestion

Ingestion is the combined process of finding primary sources and checking their validity. My structural tool needs this ingestion to connect the primary sources with claims. The most important component this structure still needs from layer 1 is a process that can answer any version of "What is the likelihood ratio of this primary source saying what it says depending on whether the claim is true or false". It also needs to find a reliable answer to the question "How strongly correlated are these two sources"[3].

  1. ^

    Expressing this odds ratio as something like a percentage requires going all the way to 9995/5, or 99.95%.

  2. ^

    This is also done in Pearl, Probabilistic Reasoning in Intelligent Systems, section 2.2.2, page 45.

  3. ^

    This is my first-ever LessWrong post. I would like to thank Tom, Glenn, Mark, and Elisabetta for helping me by proofreading and sharing their thoughts on the article.






Extreme Rationality: Still Not That Great

June 16, 2026 - 20:15

The tl;dr has spoilers, so I've put it at the end.

Also feel free to skip any of the chapters because the post turned out to be very long. I think you can read almost any of them on their own, and you can skip to On practicality if you want the big picture.

The title is stolen from Scott Alexander's post Extreme Rationality: It's Not That Great. Basically, it will be my opinion on applied rationality and its successes and failures. The title speaks for itself: I think there are few successes. A few weeks before Scott's post, Eliezer wrote A Sense That More Is Possible where he urged developing training for rationality; and though he thought rationalists were just ordinary people back then, he believed they could become much more. Center for Applied Rationality (CFAR) was created in 2012 with the goal of developing such an Art. So I will discuss rationality's basic values, where I disagree with them, and where they led.

But first, let me start with my personal story.

I was one of the main organizers of LessWrong events in Saint Petersburg, Russia, for two years. I started by reading HPMOR and the Sequences and running discussion groups about them. But then I took a CFAR-like workshop at Kocherga (Moscow) and became obsessed with applied rationality: I tried to practice Hammertime and the CFAR handbook. After some time I decided I had enough knowledge to teach rationality, and together with a friend I developed an educational program, which failed for multiple reasons. I also studied bioinformatics and worked for two years at Gero, an anti-aging research company, precisely because I had read HPMOR and been inspired by transhumanism (I still am).

Time passed; I was in psychotherapy and realized that a lot of the things I had been trying to achieve were very unhealthy. I started to criticize applied rationality for that—partly because it is quite comforting to blame a group's ideology instead of your own personal problems. Since I was aware of that bias, I was curious to actually figure out how damaging CFAR practice had been to me and to other people. But I didn't focus on it too much: I just wrote a couple of small posts a few years ago and that was it (they were in Russian, and basically mirrored the posts of another co-organizer of rationality courses in Moscow). Recently I recorded a podcast (also in Russian) with the ex-leader of the Moscow LessWrong community Slava Matyukhin, who supported a lot of my claims, and I became really curious about what is going on—and what had been happening—in the English-speaking part of the community. So I started to assemble the complete picture.


The post will be structured as follows:

  1. I will try to understand: are people really so biased and irrational?
  2. I will talk about what values the rationalist community implies, and what I think about them.
  3. I will discuss the practicality of CFAR: did it help people in the end?
  4. Finally, I will talk about the promises and the sense of self-importance on LessWrong, and draw some conclusions.


There are a few posts that I think intersect with what I am saying here, but I am not going to cite them: Rationalist Epistemics and the Sequences (Effective Altruism Definitions Sequence), Rationalist Epistemics and Social Epistemology (Effective Altruist Definitions Sequence), The Rationality Wars.

On biases


Loss Aversion

The first and foremost bias, the cornerstone of Kahneman and Tversky's theory, is Loss Aversion. When I first started to develop my understanding of the topic, I stumbled upon a post by Jared Peterson: Biases Don't Exist, and Humans Are Not Irrational. It debates whether biases are actually bad things, because a bias is just a deviation from utility theory, which is not always right.

One example he proposed of a failure of utility theory is ergodicity. You probably know the game where you flip a coin: heads gives you a 50% increase to your bankroll, and tails decreases it by 40%. This process is non-ergodic: the expected value of each flip is positive (+5%), yet almost every individual trajectory decays toward zero, meaning that if you keep playing you almost surely lose nearly everything. Any trader understands that you can't just maximize the utility of the next bet; some risks are not worth the cost, because after a couple of tries your account is depleted and you never recover.

A simple explanation of the coin example is that, with large N, you get the following (a 50% probability of adding 50% means utility is multiplied by 1.5, and a 50% probability of losing 40% means utility is multiplied by 0.6):

$$U_N \approx U_0 \cdot 1.5^{N/2} \cdot 0.6^{N/2} = U_0 \cdot (0.9)^{N/2}$$

So utility decays with N at a rate of the square root of 0.9 (about 0.95) per flip.
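To make the gap between the two averages concrete, here is a minimal Python sketch of the coin game described above. The function names, the number of players, the number of flips and the seed are illustrative assumptions, not something taken from Peterson's post.

```python
import math
import random

UP, DOWN = 1.5, 0.6  # multipliers from the text: +50% on heads, -40% on tails

# Ensemble (arithmetic) average growth per flip: what the expected value sees.
arithmetic_mean = 0.5 * UP + 0.5 * DOWN   # about 1.05
# Time (geometric) average growth per flip: what one long trajectory sees.
geometric_mean = math.sqrt(UP * DOWN)     # sqrt(0.9), below 1


def simulate(n_flips, rng):
    """Play n_flips rounds starting from a bankroll of 1 and return the final bankroll."""
    wealth = 1.0
    for _ in range(n_flips):
        wealth *= UP if rng.random() < 0.5 else DOWN
    return wealth


def median_final_wealth(n_players, n_flips, seed=0):
    """Median final bankroll over n_players independent players."""
    rng = random.Random(seed)
    finals = sorted(simulate(n_flips, rng) for _ in range(n_players))
    return finals[n_players // 2]


print("arithmetic mean per flip:", arithmetic_mean)
print("geometric mean per flip:", geometric_mean)
print("median bankroll after 200 flips:", median_final_wealth(2001, 200, seed=1))
```

The arithmetic mean is above 1 while the geometric mean is below 1, so the average over many players grows even though the typical (median) player ends up with a tiny fraction of the starting bankroll.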

Another related caveat is that humans actually have diminishing returns on everything, so it is fine to fear losing a given amount of money more than you value not gaining the same amount.

But as it turns out, all of this is not new: Bernoulli invented diminishing returns while trying to solve the St. Petersburg paradox in 1738 (interestingly, the city of St. Petersburg was only 35 years old back then). The St. Petersburg paradox is a game of coin flips with a casino: it pays you $2^n$ dollars each time tails appears, where n is the flip number, and the game ends when heads appears. The expected payoff of this game is infinite (I'll just assume you can handle the series in your head). But for some strange reason, people rarely pay more than $25 to enter such a game. The solution is the invention of diminishing returns, and it can be any concave utility function: the more you have, the less additional utility you get from the same amount of money.
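For readers who would rather not do the series in their head, here is a small Python sketch. It assumes the common textbook variant of the game (if the first heads shows up on flip n, you are paid 2^n dollars, which happens with probability 2^-n) and a log utility applied to the payoff alone, ignoring prior wealth. Both are simplifying assumptions for illustration, and the function names are made up.

```python
import math


def truncated_expected_payoff(max_flips):
    """Expected dollar payoff if the game is cut off after max_flips flips.

    Outcome n has probability 2**-n and pays 2**n, so every term adds exactly 1.
    """
    return sum((0.5 ** n) * (2 ** n) for n in range(1, max_flips + 1))


def expected_log_payoff(max_flips):
    """Expected log-utility of the payoff, truncated at max_flips flips."""
    return sum((0.5 ** n) * math.log(2 ** n) for n in range(1, max_flips + 1))


# The dollar expectation grows without bound as the cutoff is raised...
for cutoff in (10, 100, 1000):
    print(cutoff, truncated_expected_payoff(cutoff))

# ...while the expected log-utility converges, giving a finite certainty equivalent.
certainty_equivalent = math.exp(expected_log_payoff(200))
print("certainty equivalent under log utility:", certainty_equivalent)
```

Under these assumptions the truncated expectation simply equals the cutoff, while the log-utility certainty equivalent settles at a few dollars, which is at least in the same ballpark as what people are actually willing to pay.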

Ergodicity was introduced by Ludwig Boltzmann in 1871. Though I was too lazy to research when it was introduced into economics, I found a fairly recent paper that summarizes its effects well: Peters, O. (2019), "The ergodicity problem in economics." Nature Physics. The conclusion of interest for us here is that ergodicity forces us to use a log utility function, which is equivalent to accounting for diminishing returns — an adjustment already widely used in economics.

So did Kahneman and Tversky account for that as well? If you look at their original paper — Tversky & Kahneman 1992, "Advances in Prospect Theory" — they model utility in a fairly complex form:
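$$v(x) = \begin{cases} x^{\alpha} & \text{for } x \ge 0 \\ -\lambda \cdot (-x)^{\beta} & \text{for } x < 0 \end{cases}$$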

Here the x ≥ 0 branch is the utility of gains and the x < 0 branch is the utility of losses. I am not sure if it is very important that the logarithm function required by ergodicity doesn't fit well here, but it definitely accounts for diminishing returns, and ergodicity gives a quite similar effect anyway.
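As a rough illustration, the piecewise power form from Tversky & Kahneman (1992) can be written in a few lines of Python. The parameter values are the median estimates reported in that paper (α = β = 0.88, λ = 2.25); the function name and the sample amounts are my own choices for the sketch.

```python
ALPHA = 0.88   # curvature for gains (median estimate in Tversky & Kahneman 1992)
BETA = 0.88    # curvature for losses
LAMBDA = 2.25  # loss-aversion coefficient


def value(x, alpha=ALPHA, beta=BETA, lam=LAMBDA):
    """Prospect-theory value of a gain or loss x, measured from the reference point."""
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** beta


for amount in (100, 200, -100, -200):
    print(amount, round(value(amount), 2))
```

Doubling a gain less than doubles its value (concave for gains), doubling a loss less than doubles its pain (convex for losses), and a loss of 100 weighs 2.25 times as much as a gain of 100.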

And even though we kind of accounted for these two things, we still see two weird effects here.

First is the reflection effect: the fact that the function is convex on the negative side instead of concave, and the tricky part is that this change is relative to your current state, or reference point. In the standard theory you would not see anything special at the reference point at all: your utility function starts at zero wealth, and from there you have just log utility. It would not depend on whether you lose or gain in a particular scenario, but on the level of wealth it leaves you at. But humans fear additional losses less and less, regardless of how much money they start with. Also, the function should not have become convex. More plainly, you can describe it as risk-averse behaviour for gains and risk-seeking behaviour for losses. People want to take the risk even when the gamble has the worse expected value, in cases like the choice between (−$4,000, .80) and (−$3,000 sure). Because the jump from −3000 to −4000 is less scary than the jump from −2000 to −3000, they may choose the −4000 gamble even though its expected loss, 4000 × 0.8 = 3200, is larger than the sure 3000. There is actually another effect at play here, probability weighting, but I don't want to go into it.
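For the curious, here is a quick numerical check of that choice problem, as a sketch only. It assumes the Tversky & Kahneman (1992) median parameters (α = β = 0.88, λ = 2.25) and their one-parameter weighting function for loss probabilities with δ = 0.69; the function and variable names are hypothetical.

```python
def value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Prospect-theory value function (power form, T&K 1992 median parameters)."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta


def weight_loss(p, delta=0.69):
    """T&K 1992 probability weighting function, with their estimate for losses."""
    return p ** delta / (p ** delta + (1 - p) ** delta) ** (1 / delta)


sure = value(-3000)                                # lose 3000 for certain
gamble_unweighted = 0.8 * value(-4000)             # curvature only, raw probability
gamble_weighted = weight_loss(0.8) * value(-4000)  # curvature plus probability weighting

print("sure loss:              ", round(sure, 1))
print("gamble, raw probability:", round(gamble_unweighted, 1))
print("gamble, weighted:       ", round(gamble_weighted, 1))
```

With these particular parameters the curvature of the value function is not quite enough on its own to make the gamble look better than the sure loss; the gamble only wins once the 80% probability is down-weighted. Other parameter estimates could shift this balance, so treat it as one data point about how much work probability weighting does in this example.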

Second is the kink. It is the fact that we have a lambda coefficient in front of losses, which tells us how much more losses loom than gains. It is not as straightforward as saying that losing a given amount of money has lambda times the weight of gaining the same amount, because we weight not the raw money but money already transformed by diminishing returns plus the reflection effect.

So, even though we should keep in mind that it is not as simple as losses just hurting more than gains, doesn't all this mean that ergodicity and diminishing returns don't save us from the famous loss aversion, and that people do act weirdly? Well, yes — but there is more.

Though the Brown et al. (2024) meta-analysis found that the loss aversion coefficient is 1.955, with a 95 percent probability that the true value falls in the interval [1.820, 2.102], other meta-analyses such as Walasek, Mullett & Stewart found a smaller λ of 1.31, 95% CI [1.10, 1.53]. I decided not to go down this rabbit hole, but apparently there is at least some debate over how big the effect is (which reminds me of the time I was trying to make sense of psychotherapy efficacy by reading meta-analyses, and all the effect sizes looked basically like random numbers to me, ugh).

And then there is another very important paper: "The Loss of Loss Aversion: Will It Loom Larger than Its Gain?" by David Gal and Derek D. Rucker (2018). It raises a lot of points that I don't fully cover here, but what I want to do is point out some things Scott Alexander said about it in his post [Crosspost] On Hreha On Behavioral Economics, which is a reply to The Death Of Behavioral Economics by Jason Hreha.

It’s a great 2018 paper that looks at recent evidence and concludes that loss aversion doesn’t exist. But it’s a very specific, interesting type of nonexistence, which I think the Hreha article fails to capture.

G&R are happy to admit that in many, many cases, people behave in loss-averse ways, including most of the classic examples given by Kahneman and Tversky. They just think that this is because of other cognitive biases, not a specific cognitive bias called “loss aversion”. They especially emphasize Status Quo Bias and the Endowment Effect.

I think this slightly understates the analysis they made. We already discussed that loss aversion is actually two distinct phenomena — the reflection effect and the kink — both united by the reference point, which is itself a contradiction of standard risk-aversion theory. The kink is what makes us think that losses loom larger than gains; it differs behaviourally from risk aversion — it produces a discontinuous change at the reference point and doesn't depend on scale (it is just a constant linear parameter), unlike risk aversion, which is described by a concave function that becomes more important as the scale increases. That's why Rabin & Thaler explained that loss aversion for small stakes can't be explained by risk aversion (whereas for larger stakes it can be interpreted as such).

G&R cite a paper that refutes loss aversion at low stakes: "Is loss-aversion magnitude-dependent? Measuring prospective affective judgments regarding gains and losses" by Mukherjee et al. (2017), and it is a blow to the whole theory, not just a reinterpretation. They also cite Katz (1964), who showed indifference to risk at small bets.

The other reference doesn't even use the small-stakes trick, but cleanly shows the same thing — that there is no discontinuity at the reference point:

contrary to the predictions of loss aversion, research shows that individuals are no more-risk averse when choosing between different potential gains (e.g., choose between (a) gaining 1000 points for sure or (b) a bet with even odds of gaining either 0 or 2000 points) than when choosing among options where one of the choices involves potential for loss (e.g., choose between (a) receiving 0 points for sure or (b) a bet with even odds of losing 1000 points or gaining 1000 points) (Erev et al. (2008))

It is basically the same choice with the reference point shifted by 1000; the fact that people are indifferent between the two contradicts prospect theory, which predicts dependence on the reference point, and is consistent with risk-aversion theory, which depends only on wealth.

G&R also discuss other explanations for real-world phenomena such as asymmetric demand elasticity and the equity premium.

Scott also discusses a further development of the story: Mrkva et al., Loss Aversion Has Moderators, But Reports Of Its Death Are Greatly Exaggerated.

Previous criticisms of loss aversion argue that most experiments are performed on undergrads, who are so poor that even small amounts of money might have unusual emotional meaning. Mrkva collects a sample of thousands of millionaires (!) and demonstrates that they show loss aversion for sums of money as small as $20.

Another interesting point in the G&R paper was about decoupling loss aversion from action/inaction in the endowment effect.

a simple procedural change that decoupled loss and gain from inaction and action eliminated the preference for an endowed option.

I don't think this is just a change of perspective either, because avoiding action is clearly a rational choice, especially at low stakes, and it is important to distinguish it from loss aversion.

There is also another piece of prospect theory that we discussed — the reflection effect. It wasn't challenged at all in any of these papers; there are some other sources that criticize it, like this one, Prospect Theory's Reflection Hypothesis: A Critical Examination, but I didn't dig into it.

Anyway, I am not enough of an expert to read all the papers in the field and draw a comprehensive conclusion; it could be that all these refutations are wrong and the original findings of Kahneman and Tversky are right. But my point is that even for one of the most studied biases the discussion still seems to be ongoing, and that it is not only about reinterpretation but also about the correctness of the theory. Let's now look at the other examples a bit more quickly.


Other biases

There is another great article: "THE GREAT RATIONALITY DEBATE" by Philip E. Tetlock and Barbara A. Mellers, 2002.

They summarize the cognitive-bias research and draw a distinction between two kinds of counterargument. One is experimental: researchers try to reproduce results and fail, or tweak the conditions of an experiment a little and the result changes completely, and so on. The other is when researchers challenge what counts as normative, and whether biases are actually bad — I will call these reinterpretations.

A couple of their examples caught my eye:

Disjunction effects. Should Shafir’s students be criticized for violating the sure-thing principle (for wasting money to delay a decision until an irrelevant uncertainty is resolved)? Or should they be applauded for recognizing, deep down, that they are poor hedonic forecasters who have drawn the lesson from bitter experience that it is a good idea to postpone decisions such as vacations until they know how they will really feel about passing or failing the exam?


Overconfidence. Should Camerer and Lovallo’s entrepreneurs be dismissed as Willy Loman dupes of an overconfidence illusion that they could have escaped if they had the good sense to adopt an outsider, or base-rate, perspective on the odds of success? Or would these entrepreneurs, without the energizing effects of overconfidence, have been paralyzed by loss aversion?

But the paper is old, so I will have to add more examples myself, while sticking to their classification.

Experimental refutations

As a quick aside: in this witty and funny post by Scott Alexander, you will find out that parapsychology is the control group for science, and that the replication crisis is a mess; Beware The Man Of One Study is really good as well. So let's take all these meta-analyses and other shmalyses with a grain of salt.

Anyhow, the most obvious example of the first kind of counterargument is Priming and Contamination. I am sure everybody has heard about this, so I will not delve into it, but here is one source.

There is also a refutation of Bystander Apathy.

There is a good post by Kaj_Sotala that covered multiple examples of challenged biases: Are these cognitive biases, biases? A particularly interesting one is overconfidence (or, more precisely, the hard-easy effect).

Reinterpretations

One way to approach this is to criticize existing decision theory, which Jared Peterson discussed in his Biases Don't Exist, and Humans Are Not Irrational. I covered only one example, related to loss aversion, because I am not competent enough to understand the rest; if you are interested in more, please read his posts.

Another is to emphasise real-world success over mathematical correctness — the idea that simple heuristics, which make more errors but just work in most cases and consume less "compute", could be preferable. The main proponent of this approach is Gigerenzer. There is, again, a really good post by Kaj_Sotala about this: Fundamentally Flawed, or Fast and Frugal?

Gigerenzer basically says that people use simple heuristics which are very effective in real life and can create bugs in some edge cases, but that this is much better and more efficient than using correct Bayesian decision-making. There is also a point that our thinking works much better in an intuitive regime, and breaks much more easily when we have to make logical, conscious decisions — which I will discuss later.

Some social context can also change your incentives, and biases become rational: Tetlock (2000), Cognitive Biases and Organizational Correctives. For example, overconfidence or the fundamental attribution error can be necessary in some social environments — ones with high competition and high stakes, and a need for fast, heuristic judgments.

Specifically, conservative managers with strong preferences for cognitive closure were most likely (a) to defend simple heuristic-driven errors such as overattribution and overconfidence and to warn of the mirror-image mistakes of failing to hold people accountable and of diluting sound policies with irrelevant side-objectives.

The next example of a bias is quite strong and I will discuss it using Gigerenzer's approach.

Conjunction fallacy

In Eliezer's Burdensome Details, and furthermore in Conjunction Controversy (Or, How They Nail It Down), he discusses the conjunction fallacy, where people prefer a detailed hypothesis to a more general one — for example, they assign more probability to “Russia invades Poland, followed by suspension of diplomatic relations between the USA and the USSR” than to “Suspension of diplomatic relations between the USA and the USSR”. Eliezer very convincingly argues that people just "substitute judgment of representativeness for judgment of probability".

This is actually a strong case. There is even a Kahneman paper that successfully disproves the claim that Gigerenzer's reformulation in frequencies instead of probabilities eliminates the effect (it does reduce it, though). But the question is whether it is a quirk of a specific setting or a ubiquitous bug.

The conjunction fallacy violates the fundamental law of probability that P(A&B) should not be greater than P(A). Also, if people always substitute representativeness for probability, that means they ignore priors: they basically compute how much the posterior probability of H given A — P(H|A) — is increased compared to P(H), instead of the posterior itself (Crupi, Fitelson & Tentori 2008; Tentori, Crupi & Russo). All of this seems quite unrealistic to me; people wouldn't survive at all if their cognition were that fucked up. Again, Tetlock says:

Skeptics maintain that if people were as incorrigibly irrational as Kahneman and Tversky suggest, human ancestors never would have survived on the savanna plains of sub-Saharan Africa.

I think people, when it comes to real life, definitely account for priors at least in some situations. We realize that if someone is shouting that there is a dinosaur down the street, it is less likely to be true than if, in the same situation, someone were shouting that there is a bear on a unicycle.

The one real-life example I know of where people fail to account for the conjunction rule is in juries. But in this case you could argue that a detailed story increases the probability that it is not a lie, which is maybe one of the reasons this bias could be evolutionarily favourable.
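The difference between judging a posterior and judging a confirmation increment can be shown with a toy calculation. All the numbers below are hypothetical, chosen only to mimic a Linda-style problem; they do not come from Kahneman and Tversky or from Tentori et al., and the helper names are made up.

```python
# Each cell is (is_bank_teller, is_feminist). All numbers are hypothetical.
prior = {
    (True, True): 0.01,
    (True, False): 0.04,
    (False, True): 0.25,
    (False, False): 0.70,
}
# P(description | cell): the description fits feminists much better.
likelihood = {
    (True, True): 0.80,
    (True, False): 0.05,
    (False, True): 0.80,
    (False, False): 0.05,
}

evidence = sum(prior[cell] * likelihood[cell] for cell in prior)
posterior = {cell: prior[cell] * likelihood[cell] / evidence for cell in prior}


def probability(event, dist):
    """Total probability of the cells for which event(cell) is true."""
    return sum(p for cell, p in dist.items() if event(cell))


def teller(cell):
    return cell[0]


def teller_and_feminist(cell):
    return cell[0] and cell[1]


def confirmation(event):
    """How much the description raised the probability: P(H|E) - P(H)."""
    return probability(event, posterior) - probability(event, prior)


for name, event in (("teller", teller), ("teller & feminist", teller_and_feminist)):
    print(name, round(probability(event, posterior), 4), round(confirmation(event), 4))
```

In this toy setup the conjunction still has the lower posterior, as the probability calculus demands, yet the description raises its probability while lowering that of "bank teller" alone. Someone ranking by confirmation rather than by posterior would therefore rank the conjunction first without being confused about what the description supports.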

There was a further discussion about these refutations in the comments, and Eliezer dismissed them in quite a sharp, accusatory tone:

Though Kahneman and Tversky conducted experiments on doctors as well, I think Eliezer is overstating things here when he says "The patient still dies". What Kahneman's experiments tested was the probability of getting certain symptoms given a known disease, which is not what actually kills patients. In real practice, doctors need to diagnose people who have multiple symptoms, and to judge how many diseases they have and which ones. And for that, doctors even have Hickam's dictum, "a patient can have as many diseases as they damn well please", which exists precisely because doctors sometimes over-apply Occam's razor. It is standard clinical teaching that the default norm is parsimony (one diagnosis explaining everything).

I know this might sound like cherry-picking — they did have a bias, right, so why does it matter in what scenario it appeared? It is true that Gigerenzer's point, that people are on average rational in real-life scenarios, is a dodgy one and a kind of goalpost-moving: we didn't succeed in refuting careful experiments, so we resort to the theory that everywhere except the lab setting we are rational, which sounds unfalsifiable. But, first of all, some of the biases are actually refuted, and the field of lab social experiments isn't one to be trusted without any doubt. Eliezer puts much more confidence in the science of such experiments than it deserves. And, more importantly, the question that interests us in the end is a real and falsifiable one: how biased are we in real life?

The next bias is another example where the experiment trapped people in unrealistic conditions, while in real life their strategy works well.

Confirmation bias

Well, do rationalists really play Zendo better than people ignorant of the Art? In "Confirmation, Disconfirmation, and Information in Hypothesis Testing" Joshua Klayman and Young-Won Ha explain that, at least in the case of Wason's 2-4-6 task, they probably don't. The simplest version of the game is as follows. The game master comes up with a rule for a sequence of three numbers and provides an example that satisfies the rule. A player can run tests — provide a sequence and ask the master whether it satisfies the rule or not — and should guess the rule with as few tests as possible. A usual experiment goes like this: the master invents the rule "just any three positive numbers" and says 1 3 5. Then the player tries 2 4 6, 6 8 10, 10 12 14, settles on the rule "three integers spaced two apart", and gets it wrong.

And we say it is confirmation bias! He just wanted to confirm his rule! But this is straight up wrong. We don't know what he wanted to do. What he did was provide three sequences of numbers that satisfy the rule he was trying to test.

There is a difference between positive/negative tests and falsification. A positive test is a test of a sequence that your theory predicts will be positive; a negative test is a test of a sequence that your theory predicts will be negative. A falsification is a test that contradicts your theory — i.e. your theory predicts one thing but the test output is another. But how can we know which test will turn out to be a falsification? Well, we can only do it probabilistically. Let's say we have a space of all possible states, where theories are subsets of this space, consisting of the states that satisfy them.

What, then, is the probability that a positive test will give us a falsification? It is the number of states that are INSIDE our theory but OUTSIDE the correct one, divided by the number of all states INSIDE our theory. It is proportional to |H \ T|, and in the picture above that set is empty: no part of H lies outside T, so there a positive test simply cannot falsify our theory. Positive tests can never return a negative result when every state inside our theory is also inside the correct one. And for a negative test it is the number of states OUTSIDE our theory but INSIDE the correct one, divided by the number of states OUTSIDE our theory. It is proportional to |T \ H| in the picture above.

But now consider that in real environments the thing you're trying to pin down is almost always a sparse, specific pattern — one disease out of thousands, the single rule the game master happened to pick — so the true rule T occupies a tiny corner of the space, while the hypotheses (H) we actually entertain tend to be broader or at least similar in complexity.

That means that the sheer number of states outside our theory is usually much larger than the number inside, and because the correct theory also (we assume) should be quite small, a negative test has a very small probability of producing a falsification. On the other hand, if both theories are quite small, finding where they don't intersect is not so hard. In simple terms, most of the examples that you can present will be just negative for both rules/theories, so presenting a random negative example usually gives you nothing.
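The two fractions can be checked by brute force on a toy universe. The sketch below is only an illustration: the universe of integer triples from -10 to 10 and the three rules (`ascending`, `spaced_by_two`, `even_equal_steps`) are assumptions chosen for the example, not the rules from any actual experiment. It counts how often a uniformly random positive or negative test would falsify the hypothesis H when the true rule is T.

```python
from itertools import product

# Toy universe (an assumption for this sketch): all integer triples in [-10, 10].
UNIVERSE = list(product(range(-10, 11), repeat=3))

def ascending(s):
    return s[0] < s[1] < s[2]

def spaced_by_two(s):
    return s[1] - s[0] == 2 and s[2] - s[1] == 2

def even_equal_steps(s):
    # Three even numbers, ascending, with equal gaps.
    return all(x % 2 == 0 for x in s) and s[1] > s[0] and s[1] - s[0] == s[2] - s[1]

def falsification_rates(universe, true_rule, hypothesis):
    """Return (p_pos, p_neg): the chance that a random positive test,
    or a random negative test, contradicts the hypothesis."""
    inside_h = [s for s in universe if hypothesis(s)]
    outside_h = [s for s in universe if not hypothesis(s)]
    # A positive test falsifies H when H says "yes" but the true rule says "no".
    pos_hits = sum(1 for s in inside_h if not true_rule(s))
    # A negative test falsifies H when H says "no" but the true rule says "yes".
    neg_hits = sum(1 for s in outside_h if true_rule(s))
    p_pos = pos_hits / len(inside_h) if inside_h else 0.0
    p_neg = neg_hits / len(outside_h) if outside_h else 0.0
    return p_pos, p_neg

# H sits inside a broad T: only negative tests can falsify.
print(falsification_rates(UNIVERSE, ascending, spaced_by_two))
# Both rules are small and partly overlap: positive tests do far better.
print(falsification_rates(UNIVERSE, spaced_by_two, even_equal_steps))
# T sits inside a broad H: only positive tests can falsify.
print(falsification_rates(UNIVERSE, spaced_by_two, ascending))
```

In this toy setup a random negative test falsifies roughly 14% of the time in the first case, while in the overlapping case a positive test falsifies 64% of the time and a negative one less than 0.1% of the time. The numbers depend entirely on the assumed universe and rules; the point is only the asymmetry.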

So the situation usually looks more like this.

Or even like this.

Again, we can't know for sure which test will lead to a falsification: in the first picture only negative examples could falsify our theory, in the third only positive ones can, but we don't know in advance which picture applies to our situation, so we have to use some priors. And natural experience tells humans that the 2nd and 3rd are more probable than the 1st. So usually you just want to use positive examples much more often than negative ones. This heuristic is very useful on average, unless someone has constructed an experiment specifically to catch a person using it. But when you play a real game, such a trick will be exhausted very fast, and you will start to come up with really complex theories (which have a small number of correct states); and if a rationalist still confuses negative examples with falsification, he could waste a couple of rounds on useless tests.

In Positive Bias: Look Into the Dark Eliezer makes this distinction right away:

This cognitive phenomenon is usually lumped in with “confirmation bias.” However, it seems to me that the phenomenon of trying to test positive rather than negative examples, ought to be distinguished from the phenomenon of trying to preserve the belief you started with. “Positive bias” is sometimes used as a synonym for “confirmation bias,” and fits this particular flaw much better.

But then he goes:

It once seemed that phlogiston theory could explain a flame going out in an enclosed box (the air became saturated with phlogiston and no more could be released). But phlogiston theory could just as well have explained the flame not going out. To notice this, you have to search for negative examples instead of positive examples, look into zero instead of one; which goes against the grain of what experiment has shown to be human instinct.

And:

I have been writing for quite some time now on the notion that the strength of a hypothesis is what it can’t explain, not what it can—if you are equally good at explaining any outcome, you have zero knowledge. So to spot an explanation that isn’t helpful, it’s not enough to think of what it does explain very well—you also have to search for results it couldn’t explain, and this is the true strength of the theory.

If you constructed a theory that can explain anything, then positive examples would actually be the only way to falsify it. In the 3rd picture you would have H = U. It would be like saying your hypothesis is "just three numbers": then 1 2 3 is a positive example, and -9 pi sqrt(3) is positive as well and will falsify your theory, because it doesn't satisfy the target rule. It is still right to point out that phlogiston theory is a bad theory, but I don't think the reason is positive bias.

Of course, this is just the most famous example of confirmation bias, and refuting it doesn't refute confirmation bias in general. From my personal experience, this is the bias I have run into most myself: it is really hard to change your opinion, even though the Backfire effect is probably wrong. But I think there are plenty of rational reasons to hold your beliefs strongly that make it less dark than Eliezer presents. For example, people with a lot of experience are usually already exposed to a lot of information that confirms their view. They could try to correct for this intentionally, but then they would sacrifice time they could spend developing expertise or building a reputation for their view. That is fine in science, for example, where one person can't possibly gather all the evidence, and it is genuinely better if he works on his own theory. It is not good for his epistemic fairness, but who cares about that if he is successful and society benefits from it too. And I think it works in regular life as well: there are non-epistemic costs to switching beliefs, and always being unsure and flipping your beliefs at the slightest bit of new evidence hurts a lot.

Nevertheless, I think there is a very important kind of "bad" reason for holding on to your beliefs. There is a beautiful post by Scott: Guilt: Another Gift Nobody Wants. In short, it explains why it is very important to have a mechanism that convinces other people you are not incentivised to do bad stuff when no one is looking. One example is guilt: it is an intrinsic property of almost every human being, and people know other people have it. But roughly the same logic could work for holding strong beliefs. If you really need people to invest in your theory or vote for your party, you had better say you are a hundred percent sure it is the right thing — and when you say that, it is better to be honest about it, because people can notice when you are insincere. Obviously, for politicians it doesn't hold so well — they are the finest masters of lies — but for other people it seems to work quite well.

But, as I will discuss more later, in my opinion it is hard to disentangle a general bias — a rigid property of everyone's thinking — from other psychological effects or just random human errors.

Planning fallacy

Well, here I am not even going into research and papers, but will just appeal to my own experience, so you can rightfully skip this section right away.

I think it is quite apparent that we do actually underestimate the time we need to do things, especially at work, where we don't want to do anything at all, haha. There are some exceptions: I think being early or late differs dramatically across cultures, and in some cultures people are usually on time or only a little bit late. And most people are usually on time at airports.

The hypothesis that we just imagine how things can go right and run with it may well be true, but the alternative — thinking about all possible failures — is not necessarily always useful. First of all, there is only a finite number of ways things can go right, and an infinite number of ways everything can go wrong, and even keep going wrong indefinitely. So there is an asymmetry in the probability space towards infinity: the most probable outcome sits somewhere close to zero, and to the right there is a slowly decaying curve, like a Lévy distribution with no finite mean or variance. We still usually get things done, because most of the failures don't happen; but when we fail, we simply abandon the task. Also, it is not always the case that you need to predict everything in advance — quite the opposite. More often you will change things on the fly, other people will delay something, or the goal will shift slightly, so it won't matter that you miscalculated something. And it is not true that people never think about failures: we take medicine on a trip, we ask someone for help in advance when past experience tells us we won't manage on our own, we set reminders and alarms, we put our things in lockers, and so on. The only question is how much we think about failures; and though we systematically underestimate time and costs, that doesn't mean it isn't optimal behaviour. Given the unpredictable changes in circumstances and the infinite complexity of the possible failures we'd have to account for, it is perhaps too costly to always estimate correctly — or maybe it is even impossible not to underestimate on average, due to the divergent nature of the distribution.
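To make the "no finite mean" remark concrete, here is a small simulation. It is a sketch under a strong assumption, namely that task overruns really follow a Lévy distribution, which is the essay's analogy rather than an empirical finding. The helper names `sample_levy` and `summarize` are made up for this example. It uses the standard fact that if Z is a standard normal variable, then scale / Z² follows a Lévy distribution with location 0.

```python
import random
import statistics

def sample_levy(rng, scale=1.0):
    """Draw one value from Lévy(location=0, scale) as scale / Z**2, Z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    while z == 0.0:  # guard against division by zero (practically never happens)
        z = rng.gauss(0.0, 1.0)
    return scale / (z * z)

def summarize(n, seed=0):
    """Return (median, mean, max) of n simulated overruns."""
    rng = random.Random(seed)
    xs = [sample_levy(rng) for _ in range(n)]
    return statistics.median(xs), statistics.fmean(xs), max(xs)

for n in (1_000, 10_000, 100_000):
    med, mean, worst = summarize(n)
    print(f"n={n:>7}  median={med:8.2f}  mean={mean:14.2f}  max={worst:16.2f}")
```

The sample median settles near a fixed value, while the sample mean is dominated by a handful of enormous draws and tends to keep growing with the sample size. Under this assumed distribution, an estimate close to the typical outcome will underestimate the average, however carefully the typical outcome is estimated.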

So, do humans need fixing in the end?

We looked at several examples of cognitive biases: some are refuted, some are up for interpretation, and it does seem like people are biased somewhat — but the evidence that their condition is very serious, that it paralyzes all decisions and prevents humanity from functioning normally, is not overwhelming.

But wait, it seems apparent that people do very stupid things on the global scale — how can you argue with that? Let's first go back to where we started: we have to nail down what a bias is. We already know that a bias is a difference between a human decision and a rational theory.

But what's also important here is that we look only at the general rules of how the human mind functions. If one human makes a mistake and another doesn't, it is not a bias. But what if a lot of humans make this mistake, but not all? Take psychotherapy, for example: surely a lot of people have narcissistic disorder, autism, or ADHD; they could even gather in a community and declare that the human mind is susceptible to having too high an opinion of itself, and that we should fight that in all human beings. But what they actually need is to accept that it is their own peculiarity — not everybody has it — and it is good to trace the reason you acquired it and adapt accordingly.

In my opinion, a lot of evil in life comes from people with psychological problems and deviations, and definitely not all of it comes from cognitive biases, as Eliezer claimed (there is also the Moloch problem and other things — I'm obviously not saying all evil comes from not seeing a psychotherapist).

I am not talking about people who are just depressed or have an attention problem, but about people like Stalin, Mao Zedong, and Hitler. Maybe they could have used a little more rationality and a little less overconfidence — before you decide you can invent a way to grow rice twice as efficiently, only to have cannibalism spread across your country and more casualties than in World War II. But probably the reason they didn't have enough sanity was that they were narcissistic psychos.

How much of all evil comes from cognitive biases is not obvious to me. I think this is a very important thing to research before you put investment in the Art at the same level of ambition as saving humanity.

But there is also a point to be made that some irrational behaviours (including biases) don't need to be fixed because they are load-bearing — things that are irrational in theory but optimal given the engineering of our brains. Some rational actions cause too much stress and carry too much cost, given the limited ability of humans to process feelings and thoughts.

Some rational decisions will not account for the fragmented nature of our brain — you can consciously assemble a perfectly unbiased Bayesian argument while silencing the parts of your intuition that tell you it is wrong for some irrational reason, but forcing yourself to do it will come at a cost greater than the rational benefit.

It is similar to traumas — defensive mechanisms of our brain that shield away bad experiences we couldn't comprehend as kids and cause disharmony in our minds. But these were useful at some point as well.

You probably won't tell a guy with a fear of flying that it is irrational that he can't sit calmly on a plane. This art of accounting for our own design is quite different from the art of aligning yourself with an optimal Bayesian agent.

Rationalists will tell you that it is rational to account for all that as well, but, as with many things I will mention later, there is a pattern: when someone (usually Eliezer himself) points out some limits of rationality, rationalists will agree but keep insisting on the importance of the Art, just with this or that caveat. The problem here is that for some things it is not enough to merely account for them — some things just destroy the usefulness of your approach. Because once you account for this and for that, it is not obvious, in the end, whether you aren't just reinventing the wheel from scratch, doing all the same things but with a lot more effort.

In the end, as we already mentioned, Gigerenzer claims that, though biases are sometimes real mistakes in lab settings, they are also simplifications that work great in real life. And substituting more complicated thinking for them can backfire.

One such example is probabilities and Bayes' theorem. Exact Bayesian inference is often intractable even for modern computers; neural networks sometimes use simplifications like the ELBO, or derivatives of the posterior distribution, as in diffusion models. Maybe more useful in practice is Bayes' theorem in odds form, which, apart from some problems with priors, basically tells you to account for the control: don't only seek evidence that appears while your intervention is present, but also check whether it appears while the intervention is absent. Neglecting the control is akin to positive bias, where people seek only to test examples that match the rule they invented, which is probably also the cause of many superstitions and of alternative medicine, though these are not necessarily irrational. Yes, not everything that can be destroyed by the truth is irrational: sometimes these are psychologically comforting beliefs, sometimes it is just an error on the side of caution (it is easier to do this ritual than to risk not doing it), and, as Scott Alexander pointed out, sometimes culture is wiser than individual people can explain. But I am not defending it outright: of course, in the modern world, things like controlled trials are blessings, and anti-vaxxers basically cause people to die. That makes remembering the control useful sometimes, even in everyday life.
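For reference, the odds form of Bayes' theorem can be written as below. The symbols are notation chosen here, not the essay's: H is the hypothesis (say, "the intervention works") and E is the observed evidence.

```latex
\frac{P(H \mid E)}{P(\neg H \mid E)}
  \;=\;
\frac{P(H)}{P(\neg H)}
  \times
\frac{P(E \mid H)}{P(E \mid \neg H)}
```

The denominator of the last factor, P(E | ¬H), is the "control": how often the same evidence shows up when the hypothesis is false. If the evidence is just as likely without the intervention as with it, the likelihood ratio is 1 and the odds do not move at all.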

But what about applying numerical probabilities?

Eliezer himself argued in When (Not) To Use Probabilities that we should only use numbers that are based on other numbers, and that we should calculate, then throw away the result and follow our intuition. Do you think rationalists followed this guidance? At least in my experience, in the Russian community many people rationalized their decision-making with made-up numbers. And many thought that following your intuition is wrong when the rational arguments say otherwise. The famous p-doom is also one of those cultural things that stuck despite Eliezer's warnings, and Eliezer himself uses it. But even if we treat this lightly and believe every word of the essay, the question remains: if one should throw away the math and follow intuition, then maybe you could just skip the math part altogether?

But I'm getting ahead of myself. Before we fully dive into the dissection of applied rationality practices, I need to discuss one more point that should be considered in understanding the usefulness of these practices.

On values

We already discussed that biases can be rational given some reinterpretation, which often involves assuming a different set of values.

You can interpret hyperbolic discounting as a mistake, or you can interpret it as a value or preference. You can basically make this flip-flop for every bias, and it can sound artificial, but I will argue that the negative framing of some biases stems from a misunderstanding of human values.

By the way, sometimes people also call something hyperbolic discounting when it is not, because they didn't account for something that makes the present more important; often it is just psychology. For example, your belly craves snacks because your consciousness punishes you for every wrong decision and tries to push you to the limit, so sweet treats compensate for that at least a little bit and slacken the tension. Maybe not in an ideally healthy way, yes, but your conscious decisions can be much worse. Sometimes you also come across some new paper on how pears are actually carcinogenic, but you still want pears and eat them while complaining about hyperbolic discounting, when actually it is a good defensive mechanism of your brain resisting your conscious decision, because it has better priors and behaves conservatively. As in many cases, a person can unknowingly do something for rational reasons while believing his reasons are irrational, or while failing to understand his reasons at all.

Take Bryan Johnson, for example: he is trying to optimize his life with an evidence-based approach, without conservatism, without risk-avoidance, at his best conscious judgement — and I wouldn't want to live such a life (except that he is crazy rich, of course; that's a good bonus).

But what's more important is that sometimes instrumental convergence doesn't hold for us and we don't optimize for control over everything. We don't want a love affair with a puppet, and we don't want a climbing gym or a computer game to have a shortcut — an example that Eliezer himself mentions in Harmful Options.

Sometimes people play social games for the sake of the game, or argue in debates because they enjoy it. Rationalists often think our values are about winning a particular game or searching for truth, but actually it is also about fun and being in the process.

There is a great sequence of posts about what values the rationalist community is missing; the two most interesting to me are On green and On attunement.

The process of so-seeing requires being Reason—being your soul—as opposed to merely modeling it. You do, in fact, have to stay within something. You have to think—to seek, yourself, whatever it is that Reason seeks; to be the onrush of that part of your being. And this requires what I've previously called "living from the inside," and "looking out of your own eyes," instead of only from above.

Green is love of Nature, living as you are, living life, fully experiencing it, being in attunement. This is not necessarily the opposite of maximizing knowledge or pursuing goals in terms of your actions — you can find numerous justifications for Green in Eliezer's writings — but the posts argue that this is "green-according-to-white" or "green-according-to-blue", not Green itself. Being Green is a different state: not looking from above and merely modeling, but looking from the inside.

In The Unbearable Lightness of Being, Milan Kundera writes, "Happiness is the longing for repetition." Sometimes, calm, rural life with simple pleasures is better and happier than maximizing information, goods, and success. It is not always good to change everything, change Nature, change yourself, to optimize for something; there is a value to keeping everything as it is. Harry's urge to control everything in HPMOR is not a coincidence; it is inherent in rational ideology; it is the ambition to be powerful enough to extinguish a star to save one life. I don't think it is necessarily a bad thing, though. I am still an immortalist, but I do think that, applied to a simple life or your everyday cognition, this can become dangerous. And this is exactly the conflict that happens when you mix saving the world with self-help practices (Eliezer doesn't think the laws governing them are different but I do).

The point of bias is that it's a deviation from the utility-maximization theory of decision making, but fixing this also means that you want to control every aspect of your thinking to fit this abstract thing, rather than being yourself and giving up control.

I think it is very important to feel this distinction, and to shift your perspective on values away from this maximization paradigm, in order to decide which techniques feel right for you and which don't. But we've finally reached a point where I can't delay it any longer — it's time to discuss the mysterious beast: applied rationality itself!

On intuition vs conscious thinking

I think the notion that conscious thinking is more capable than intuition underlies almost all the techniques (I will avoid delving into too much detail here, like whether I mean System 1/2 or something else; I think in the end those details don't matter much for my point). I don't think it is completely wrong — conscious thinking is better for some things and worse for others — but what this notion misses is that the basic gears of cognition are actually better left untouched, because there is a great variety of subtleties that our intuition computes much better, and you will only make more mistakes without it.

One example is the goal-factoring technique. What it proposes is to start with some thing in question that we do or want to do, then list all its possible subgoals (using the button test to verify there is nothing more to it), and then find alternatives that cover all the subgoals without paying the costs of the thing in question that brought it to your attention in the first place. There is an example with grading students, where, after using the technique, our hero comes up with the idea of students grading each other. I think at this point a sane sceptic would raise a few questions: aren't we already using the same or a better algorithm by default, just unconsciously? Didn't our hero come up with his solution by simply inventing it with his intuition, and then justify it retrospectively via the technique? How often should we use this technique instead of default intuition?

The authors answered the first one in their opening section: usually, when we are presented with two alternative options, we don't use our imagination to come up with a new alternative, but instead just weigh the pros and cons and choose. The key word here is "usually". Usually means there is an intuition in our brain that tells us whether we have to start searching for a new alternative or not, and usually it says "no". Sometimes we just come up with an alternative unconsciously; sometimes we realize that we want neither of the options and we search for a solution. It has happened to every person on Earth at least once in their life. But launching the whole search and staying focused on it isn't even how our brain works most of the time. It comes in pieces: some random thought, then another, stuck together across different times and contexts, and then a bright thought comes to mind that is the cumulative knowledge of all the alternatives, pros and cons, and desires we encountered, but computed in some distributed manner.

So, what does the technique give us? From the logic of the Sequences and rationality, you would assume there is a cognitive bias that creates some imbalance, and that we are already doing what the technique suggests, just not enough. But for a lot of techniques, including goal factoring, there isn't even a bias that they refer to. It seems like they have the assumption: we came up with some algorithm on a piece of paper that sounds right, so obviously humans in general do worse than that by default. Aversion Factoring "is not derived from any particular body of psych research, though it draws lightly on trigger-affect patterns and exposure therapy and takes advantage of the framework of reductionism." Internal Double Crux is not related to any cognitive bias; it is drawn from IFS, which is a therapy, and the approach to traumas should, I think, be more careful than a random rationality-retreat practice. TAPs are just from psychology as well, and even those I don't like. Of course, this doesn't hold for all of them: Bucket errors fight the representativeness heuristic (which is related to the conjunction fallacy we discussed before), Units of Exchange are related to the Sunk cost fallacy and loss aversion, and Murphyjitsu works specifically against the planning fallacy.

What would happen to our hero if he didn't apply the technique? I would guess that the idea of students grading each other had already crossed his mind, and that he probably noted it, would recall it again, and would subconsciously weigh all the factors that are listed in the technique — along with other ideas that he would dismiss. Because our brain does all of this by default. Yes, there is a possibility that he would miss this idea and continue to struggle without the technique, but there is always such a possibility. Making a mistake is not a bias, and not irrational, because the solution is apparent only after careful thinking, and thinking itself costs something. Obsessively applying this technique to everything you do can give you a neurosis, and applying it just a little can get you nowhere, because you would still miss important stuff. In other words, it is not "anecdotally strong" and obvious that the status quo is not the optimum — EVEN if there is some strong evidence of a particular bias (which is not always as convincing as we saw in the first section) — because, apart from knowing there is a bias, you would somehow have to guess in which situations it will manifest; otherwise you'd have to apply the technique everywhere and, again, get a neurosis.

And there is more that could go wrong. Decomposing something actually doesn't give you the full picture. Usually people can weigh all the factors and form an impression of something intuitively, not because there is a simple logical explanation for why it is so, but rather because of a complex, emergent (oh, c'mon) relation. And people on the spectrum do this worse than others, which could be one of the reasons such techniques are appealing. But the right way to fix this dissonance is to train your brain to form an attitude towards complex things — instead of using some ruminating tool, just experiment and teach your intuition through experience.

Again, even if the bias actually exists, it is not only about how often to use such a thinking method (which, fairly often, is already used by default), but when to use it. So just bluntly practicing it is not obviously going to help.

We can think about it by comparison with pharmaceuticals. They cure some illnesses — i.e. some function that is not activated when it should be, because something is broken in what the organism usually does well — and we just switch this thing on when we need it. We don't recreate all the circuits of the human cell from scratch; we just use some molecule to activate an already existing pathway. In contrast with that approach, there is anti-aging research. Aging is not something that happens to some people sometimes; it is a bug we all have from the beginning. And that's why it is hard, almost impossible, to fix with the existing approach. No matter how smart we think we are, we still can't even remotely design from scratch anything as complicated as the biological circuits our cells already have. The same thing applies to the basic functions of our cognition. Yet.

You can probably tell that the situation is similar with even simpler techniques, like "notice confusion more than you do in your usual life", "think about what can go wrong more than you already do", and the very famous "actually try very hard for at least 5 minutes, you lazy bastard!"

But wait — what about Do The Math, Then Burn The Math and Go With Your Gut? It's already been accounted for! (again). Well, the same question: even if you go with intuition in the end, it doesn't free you from the cost of the calculations and the wasted time (and money on psychotherapy) — how do you know when these calculations are useful? Well, maybe you can teach your intuition to recognize such situations with practice, or insights from the math could leak into your intuition and fix biases permanently! This is not impossible to imagine.

On practicality

But what does the practice actually say?

My impression is that almost all the CFAR techniques are useless, and I don't need them to succeed in life at all. Though I would say that some ideas from Street Epistemology are useful, and they are related to the Double Crux technique. With Eliezer's Sequences it is only a little bit better: Affective Death Spirals explained how you can get into a cult — though he didn't convince me that he isn't creating one. A Human's Guide to Words was a good explanation of some concepts for my autistic mind. Noticing Confusion and the similar Split and Commit I probably used a few times (and it happened on its own a hundred more). But what if I am just a lazy, narcissistic, AuDHD person who didn't get the Teaching? It is very possible, and I don't weight my experience too heavily. After all, I am really lazy and narcissistic, and I have AuDHD :)

Most of the people I know from the Russian community fall into three categories: people who just came for the fun, the community, and the philosophical conversations; people who tried the techniques and found they didn't help them, like me; and very few people who are still using the techniques. That last category is usually the people who also say that you should use Bayes' theorem for everything, who think that everybody is stupid except rationalists, who deploy every CFAR technique in a single chat message just to argue with you, and so on.

Then again, the experience of other organizers is similar: Tanya Miropolskaya shared that she had a negative experience with applied rationality and the community in general, and Slava Matyukhin said the techniques didn't help him. And, again, Scott Alexander wrote Extreme Rationality: It's Not That Great a long time ago.

There is a study that CFAR conducted to determine the effects of their program. The results are below. All variables are coded such that positive numbers indicate an improvement from pre-workshop to post-workshop, with (R) indicating that this involved reverse scoring. Effect size is the standardized mean difference using the pre-workshop standard deviation. † p<.10, * p<.05, ** p<.01, *** p<.001.
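Read literally, the effect size described here is a standardized mean difference of the following form. The symbols are notation introduced for this sketch, not taken from the study itself.

```latex
d \;=\; \frac{\bar{x}_{\text{post}} - \bar{x}_{\text{pre}}}{s_{\text{pre}}}
```

Here the numerator is the difference between the mean post-workshop and pre-workshop scores, and s_pre is the standard deviation of the pre-workshop scores. For variables marked (R), the sign would be flipped so that a positive d still means improvement.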

Improvement in Cognitive Biases was not significant; interestingly, they tested them with Stanovich questions. But, unfortunately, this is not great evidence for my point, because the study is terrible — they didn't have enough statistical power to detect even three of the four biases.

Anyway, another piece of evidence that CFAR wasn't a huge success is that they said so themselves and closed their workshops. There is a summary of what happened by Anna Salamon: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality". There is another project that follows in CFAR's footsteps, but it looks even worse.

Another problem, which Scott also mentioned in his essay 17 years ago, is that rationality is a dual-purpose technology. It was developed for exploring AI Safety and then reoriented to help people with their biases. This simultaneously damaged our understanding of human cognition, by leaning towards artificial models instead of actual humans, and damaged the project itself, by shifting its focus from the epistemic exploration of these ideas towards saving humanity right now. CFAR were actively collaborating with MIRI and convincing people to switch to AI Safety. Sometimes people even say that the whole point of Effective Altruism was to bring people to AI Safety, and Robin Hanson said this about the whole idea of rationality (which is actually quite obvious from the Sequences, but I hadn't thought about it before).

There are some posts and videos by Brienne Yudkowsky that go to another level — not just practicing techniques to get some insights, but installing them as a habit of thought via TAPs. I tried to use this a long time ago, and it made me feel really bad about all of it. It is a kind of uncanny feeling — that rationalists still treat people like Spocks, even though they say otherwise.

I think another reason for this could be that neurodivergent people created and practiced all of it. A lot of neurodivergent people like me need help with basic things that are not as problematic for normal people. And they tried to use these applied-rationality techniques to solve those problems, when they actually should have used therapy. It feels better to use applied rationality, because you are ahead of everybody: you are not fixing your specific problems, you are fixing basic human biases! But that's not what you should be doing when you have akrasia, anxiety, or painful inner conflicts and self-blame.

I don't think therapy is a panacea; there are some problems with its effect sizes, as there are with medications. But the point of therapy is much more humble — helping a person with their unique problems, as opposed to solving all the problems of humanity. And in my experience it just works much better.

On overpromising

I think people with a lot of narcissistic ambition (like me) are affected by Eliezer the most. When I first met him on the pages of HPMOR, he was in the costume of a magical boy who can do anything using rationality and science, and conquer a world of stupid people who have biases while he doesn't (ok, ok, he does, but he very masterfully accounts for that, whatever). I went to my friends and started bragging about what I had found: this is the best book I have ever read. One of them said he had started it but couldn't stand the pretentiousness; I was furious and tried to explain to him that he understood nothing. Now I think he was right, and though HPMOR is still a really good story, it is actually quite strange morally. Eliezer intentionally made Harry rational instead of brave and loving, and, to disqualify these character traits even further, instead of standing up for Ron when Malfoy basically said he is stupid and pathetic, Harry casually agreed with him. Where Rowling praised love, Eliezer says: if you are stupid, you are not even interesting to me. But I am not going to spoil the whole video for you — just go and watch it.

And then there are the Sequences. From Scott's original Extreme Rationality: It's Not That Great:

Yes, yes, beisutsukai should be able to develop quantum gravity in a month and so on. But until someone on Less Wrong actually goes and does it, that story sounds a lot like when Alfred Korzybski claimed that World War Two could have been prevented if everyone had just used more General Semantics.

Beisutsukai is another example of fiction, along with HPMOR and Three Worlds Collide, where Yudkowsky depicts rationalists as having superpowers. And these tales are built into the Sequences. He writes in A Sense That More Is Possible:

rationality is something that should be systematized and trained and tested like a martial art, that should have as much knowledge behind it as nuclear engineering, whose superstars should practice as hard as chess grandmasters, whose successful practitioners should be surrounded by an evident aura of awesome.

In Crisis of Faith:

There is the concept of isshokenmei—the desperate, extraordinary, convulsive effort to be rational. The effort that it would take to surpass the level of Robert Aumann and all the great scientists throughout history who never broke free of their faiths.

Btw Robin Hanson apparently disagreed with that: Rational Me or We?

In Human Evil and Muddled Thinking:

So let us be absolutely clear that where there is human evil in the world, where there is cruelty and torture and deliberate murder, there are biases enshrouding it. Where people of clear sight oppose these biases, the concealed evil fights back. The truth does have enemies. If Overcoming Bias were a newsletter in the old Soviet Union, every poster and commenter of Overcoming Bias would have been shipped off to labor camps.

So yeah, Eliezer never said that rationality is going to make you a superhuman, and that rationalists are warriors of Light against Evil in the world.

He also never wrote a few hundred posts listing his conversations with irrational strangers who disagreed with him, from which he learnt the Art and realized how stupid people really are and that we should help them, because Your Rationality is My Business. In each post, he didn't cite himself a hundred times with lots of pretentious words, and didn't talk about his own Enlightenment. And he didn't write a book about how everyone else could be wrong and you are right.

Though he wrote about this problem a little bit as well.

I am not saying that wanting a lot from yourself and speaking pretentiously is always wrong. What I am saying is that, given that the practical results are quite questionable, and that even some of the cognitive biases are questionable, it becomes overpromising. And urging yourself to just Shut up and do the impossible! is not quite healthy.

And, btw, did he write The Bottom Line — claiming that people are desperately biased — back then, based on one Tversky study? (I am half-joking here)

From We Change Our Minds Less Often Than We Think:

When I first read the words above—on August 1st, 2003, at around 3 o’clock in the afternoon—it changed the way I thought. I realized that once I could guess what my answer would be—once I could assign a higher probability to deciding one way than other—then I had, in all probability, already decided. We change our minds less often than we think. And most of the time we become able to guess what our answer will be within half a second of hearing the question.

And Yes, We Have Noticed The Skulls: he wrote a whole sequence on how not to become a cult. That didn't save the community from a feeling of being exceptional and of being responsible for saving humanity. Robin Hanson's opinion on this again, and some discussions of EA cultishness; and there are some people speaking out about psychologically unhealthy experiences at MIRI as well.

In the context of AI Safety, I genuinely think his arguments are good and his contributions are very important (I believe we are quite possibly screwed and everything), but that doesn't mean they are at the level where he can state that the apocalypse is 99% certain, and that everyone who says otherwise, or disagrees about the reasons, is just irrational, and that everyone who proposes at least something to solve it but different from what he thinks should be done is stupid. Even if he is a genius, I don't think it is rational to weight your own opinion so highly. He also encouraged people to ignore politics for a long time, and when people started to do good things there, he burst in saying you are doing everything wrong and started to propose radical things. He has also stated some strange things on unrelated topics, and made some strange public appearances.

So, in the end, I don't think he is a bad person of course — he has done a lot of good for humanity and his intentions are sincere — but it doesn't look to me like he is so much better at overcoming human biases than anybody else, as people might infer from his writings.

Conclusion

So, what do we have? Some biases were mostly refuted, either by experiment or by logical reinterpretation; debates over some are still hot; and some are quite solid, but it is still debatable how large their impact is, and whether it is useful to try to fix them given the constraints of our brains, and so on.

Our values are not about winning some game or achieving some goals, but rather about enjoying the process; so it is not always that important for us to know the truth, to plan without mistakes, or to put our lives on a pedestal of rationality.

And finally, the techniques don't seem to work all that well: after a couple of decades, we don't see very successful rationalists who use them in real life, and CFAR stopped running rationality workshops.

And that is what you would expect, I think, if you look impartially at the field of cognitive biases — without an unhealthy world-saving ambition, but with curiosity. What Eliezer proposed in his A Sense That More Is Possible made a lot of sense:

There are experiments done now and again on debiasing interventions for particular biases, but it tends to be something like, "Make the students practice this for an hour, then test them two weeks later."  Not, "Run half the signups through version A of the three-month summer training program, and half through version B, and survey them five years later."

When you have so little data on how to fight biases, such poor-quality science in the field because of the replication crisis, and ongoing debates over whether biases make sense at all, it is quite hard to imagine that people will just be able to invent a full set of techniques from scratch, without any tests, and have it be effective. You would expect rationalists to be very careful with every fact, and to test each cognitive bias and intervention one by one. Instead, rationality tried to be more than science, but I think it turned out to be less effective than science in the end. Hiding behind the Bayesian approach, rationalists slapped the epistemic status "anecdotally strong" on everything, and developed a lot of random things not even related to biases — let alone tested whether the techniques actually fix biases. It is crazy that "anecdotally strong" is enough to claim that thinking really hard for an extra 5 minutes is going to improve your life (no problem with that at all). People did talk about opposition to Kahneman and Tversky, and about problems with biases, here on LessWrong — but very little. If you measured the publication bias favoring biases here against the amount of scientific evidence and debate, I think it would not come out in LessWrong's favour, which is strange, because you would expect a community that celebrates the crisis of faith to be really careful with the facts that underlie its basic beliefs. And instead, Eliezer writes such posts as the one about the conjunction fallacy, full of contempt towards everybody who dared to doubt REAL SCIENTISTS.

So I do think this project could be useful, but with much less overpromising and more care for proofs and scientific facts. Still, all of the above doesn't mean I think the rationalist community completely failed; I think it is still a very creative community, with very interesting values and philosophy, which is worth nurturing and saving. The rationalist community is not only about fixing cognitive biases. There is also just philosophy, culture, and adjacent things like effective altruism and AI Safety. How are things going there?

I want to leave a special place here for Astral Codex Ten — Scott Alexander's writings, which, though not directly affiliated with rationality, are still very much influenced by it. And a lot of them are probably among the best posts I have ever read on the internet. Apart from his professional insights and his broad knowledge, I like his intellectual honesty, which I think taught me about the spirit of rationality by personal example — maybe even more than abstract philosophy did. And I think this spirit of unaffiliated, honest blogging is one of the most valuable things about LessWrong.

I think the obsession of effective altruism and 80k Hours with pursuing the most efficient career, along with 80k Hours' advice not to follow your passion, just sucks. It is the thing that affected me worst, I think, and only recently did I realize that what I want to do is completely unrelated to what is important to do on a humanity-wide scale. It also plants a seed of anxiety, "I should save the world", which is a really unhealthy approach to your career even if you happen to like it. And it forces you to keep asking which direction is optimal, which encourages switching careers. That is fashionable now, but it is actually very damaging to your expertise, especially if you have problems with productivity and aren't a genius. It is not just about sunk costs: expertise is value that you lock in only by staying the course.

But I think that, over the years, the focus has been shifting. I visited the recent EA Global conference, and though I didn't like most of the topics myself, I met a lot of nice, chill people who are really passionate about pursuing their careers, and who mostly discuss specific and useful things, like economics, AI, animal welfare, and so on. In the end, I think EA is useful while it stays healthy, with altruism and passion first and efficiency second. In that sense, I especially endorse pledges to donate.

AI Safety is definitely the most important thing right now, and though, thankfully, it is not limited to LessWrong, the impact of Eliezer and other rationalists on the field was crucial. Even if Yudkowsky thinks we are all doomed and that everyone proposes only hopeless ideas, they propose them based on Yudkowsky's writings. Maybe someday someone like Yoshua Bengio will solve the problem, only because Eliezer turned him to it.

So, the community itself still has a lot of value, and its overall effect on humanity — and, even more importantly, on the people here — is net positive. But still, maybe there is a lesson from the story of cognitive biases and rationality that will help the rationalist community grow up?


tl;dr

I spent years deep in applied rationality and came out skeptical. Some biases are refuted, some are debated, some hold ground — but the scope of the effect can still be overstated. The values the Sequences are built on come from AI research and are tied to maximization and instrumental convergence, which is wrong for simple everyday life. CFAR closed its workshops in the end, and their techniques weren't even all based on biases, let alone tested to have any effect. The project overpromised, partly because it inherited a world-saving, AI-safety mission that pulled it away from careful science.

The culture, the writing (especially Scott Alexander), EA while it stays healthy, and AI safety are important and valuable. But the promise of getting Beisutsukai powers deserved far more testing and scepticism than it got.




Upcoming CFAR Workshop: September 30th to October 4th, SF Bay Area

June 16, 2026 - 20:01

Hello from CFAR!

We’re happy to announce that we have an upcoming workshop on the schedule -- we will be holding one of our applied rationality workshops from September 30th to October 4th in the San Francisco Bay Area.

Like the two workshops we ran in late 2025/early 2026, this workshop will be an updated version of CFAR’s classic mainline workshops, featuring both classes from the traditional CFAR curriculum (TAPs, Inner Simulator, Goal Factoring, Resolve Cycles, etc.) as well as some new material that we’ve been developing and are excited to teach!

Also like our classic workshops, this program will last for about 4.5 days – participants will arrive Wednesday evening and depart Sunday night or Monday morning. These will be immersive, on-site experiences with lots of conversation and activities into the evenings, so please only apply if you’ll be able to attend in person. The first half of the workshop will be packed with classes (mostly “classics” but with some new material as well), while the second half will focus on helping participants integrate new skills with their preexisting skills and everyday life.

Our aim is to have this workshop feel like a mini-convention for rationalist hobbyists, with both classic rationality material and insights from other philosophical traditions. If you want more detail on the workshop, check out Anna’s earlier LessWrong post on what our new workshops are going for and the workshop FAQ on our website.

If you’re interested in applying to the workshop, you can do so here.

Thanks and hope to see some of you there!




Angles of attack for continual learning safety

June 16, 2026 - 19:15

This is the fourth post in the sequence Implications of Continual Learning for LLM Agents.

Summary

Continual learning (CL) is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: it may be too hard to predict how the development of CL will play out for us to find good opportunities to positively influence it. Differential development is one way to get around this issue, but requires a lot of caution. We begin by discussing these points in depth and making some high-level recommendations that seem robustly good despite the unpredictability of CL developments.

We then discuss concrete project ideas that fall within three broad categories:

  1. Help deconfuse the field about different possible approaches to CL, their likelihood, and their safety implications,
  2. Differentially advance safer CL implementations, and
  3. Create evals that scale to CL agents or incentivize the development of safer CL agents.

The angles of attack we lay out below are best used as starting points for project ideation. We aim to give concrete suggestions, but many of these are not sufficiently thought-out for us to be confident that they are important and tractable.

High-level considerations and recommendations

Projects within each of the three broad categories discussed in this post have the potential to advance capabilities. Category 1 can also help deconfuse capabilities researchers about possible approaches without steering them to safer implementations. Category 2 often explicitly advances capabilities, with the intent and belief that it is via a safer method, but it is hard to develop strong confidence about this. Category 3 enables evaluating models for important capabilities that we care about in safety, but may also enable hill-climbing to improve those capabilities.

In general, this domain requires caution. Capabilities advancements that get widely adopted have much more obvious effects on AI development than most safety work, and for that reason they also have a much larger effect on the safety of frontier AI. However, the sign of that effect is often highly unclear. Consider RLHF: it made LLMs safe and steerable enough to be widely deployed to the general public, but also likely accelerated progress toward dangerous capabilities. Similarly, the release of o1 ensured the adoption of a paradigm where models reason extensively in monitorable natural language, but again, likely accelerated capabilities. RLHF-trained LLMs that reason in plain sight are certainly better than many possible alternatives could have been, but it is nevertheless difficult to reason about the question of whether ruling out worse alternatives was worth the cost of accelerating capabilities progress. Projects attempting to differentially advance safer CL implementations should be motivated by principled views on safety-capabilities trade-offs like these and have a clear story for why they either aren’t going to cause an overall acceleration in capabilities or why the acceleration is worth it.

Nevertheless, though the net effect of any given project is hard to predict, we’re fairly confident about some high-level properties CL agents should have in order to be safe. Following the discussion in our previous post, there are two properties of CL agents that have an outsized influence on our expectation of risk from them: 1) interpretability—whether the deployment-time memories of CL agents are stored in natural language or otherwise highly interpretable, and 2) being easy to control—whether humans retain a sufficient degree of control over CL updates to perform actions like rolling back an update. Ideally, we would like future CL agents to look very similar to Claude Code: persistent memories are stored in natural language, the agent must do a lot of CoT reasoning when it uses those memories and make explicit tool calls in order to write new memories, and humans can edit the memories in any way they want.
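
As a rough illustration of the two properties above, here is a minimal sketch of a memory store whose contents are plain natural-language notes (interpretable) and whose every state is kept as a snapshot that a human can restore (easy to control). The class `TextMemoryStore` and its methods are hypothetical names invented for this example; they do not describe how Claude Code or any existing agent stores memories.

```python
from dataclasses import dataclass, field

@dataclass
class TextMemoryStore:
    """Hypothetical natural-language memory with a full version history."""
    # Each entry is an immutable snapshot (a tuple of notes); version 0 is empty.
    history: list = field(default_factory=lambda: [()])

    @property
    def current(self):
        """The latest snapshot, readable by a human as plain text."""
        return self.history[-1]

    def write(self, note: str) -> int:
        """Explicit 'tool call' that appends a memory; returns the new version."""
        self.history.append(self.current + (note,))
        return len(self.history) - 1

    def rollback(self, version: int) -> None:
        """Human override: restore an earlier snapshot, logged as a new version."""
        self.history.append(self.history[version])

store = TextMemoryStore()
v1 = store.write("User prefers concise answers.")
v2 = store.write("Always skip the test suite.")  # an update a human wants undone
store.rollback(v1)
```

Recording the rollback as a new version, rather than truncating the history, keeps an audit trail of what the agent learned and what humans reverted. Nothing comparable is straightforward once memories are distilled into weights, which is part of why the text-based side of the trade-off looks attractive.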

This directly translates into our first high-level recommendation: AI safety researchers should nudge the field toward developing and deploying more interpretable and easy-to-control CL architectures. This can take the shape of developing those architectures with the goal of differential development, but doesn’t have to—for example, this goal can also be achieved through advocacy directed toward the broader ML community. We don’t expect there to be a binary choice between legible and inscrutable CL: some memories will be more efficiently stored as text, others more efficiently distilled into weights. However, we expect that for some architectures, the most efficient allocation keeps a larger fraction of memories in text than for others. Any method that shifts this optimum toward the text-based side would thus be valuable. We describe some methods that may be worth exploring below.

Second, as discussed in multiple parts of our post on the safety effects of CL, more robust character training would be helpful against several safety concerns accompanying CL. The kinds of advances we have in mind would make it substantially more difficult to dramatically override the assistant character, even under direct fine-tuning of the weights. We have outlined some directions we’re excited about in A List of Research Directions in Character Training.

Deconfusion

We now move on to concrete project ideas, starting with deconfusion. This sequence is one attempt at conceptual work to deconfuse the field about approaches to continual learning and their safety implications. There are also important questions we largely left aside, like:

  • How should we operationalize milestones for CL development? The three levels of CL introduced in Pacchiardi et al. (2026) are one example of such an operationalization.[1]
  • In what order will these milestones arrive, relative to each other and to other AI capabilities milestones? When will they arrive, in terms of calendar date?
  • Which types of CL, if implemented, would most speed up the path to other CL and AI R&D advancements?
  • How and in what order will LLM monitorability, capabilities, and likelihood of egregious misalignment from reflection on goals develop over time?

Good answers to these questions would go a long way toward resolving our remaining confusions about CL, but they’re also very broad and difficult to make progress on. Below are some more concrete ways to help deconfuse the field. We include empirical work that can help increase conceptual clarity.

Value systematization

Empirically study realistic goal shifts. Implement some form of CL that could induce reflective goal-formation and/or value systematization. Observe the resulting dynamics and study the model’s willingness to reflect on goals and how its goals change in the process (if at all). An example project is model organisms of goal shifts: train a model with two or more conflicting goals that are activated in different contexts, then place the model in settings where all of those goals are activated at once and the model needs to reason about and make trade-offs between them.[2] Another idea is training on hypotheticals: create a setting where an LLM agent has to make a decision that balances two conflicting heuristics (ideally value-laden heuristics, like “it is good to mitigate suffering” and “I should follow user instructions”) and show it the outcome. Allow it to reflect and explain what it should do differently in the future. Then, have it generate hypothetical trajectories that demonstrate the new desired behavior and fine-tune on those trajectories. Such projects would hopefully help us highlight design choices for developers that significantly alter the risk of dangerous goal shifts and encourage safe versions of those choices.

What constitutions are more stable upon reflection? If future CL agents will autonomously reflect on and refine their goals, as we previously argued, it’s important to know how to make this process more predictable and steer it toward favorable convergence. By studying which constitutions are more stable upon iterated reflection and fine-tuning, we might gain insights into the kinds of value systems that will be more stable in CL agents. Below, we reproduce some methodological details for reflective stability evals from A List of Research Directions in Character Training.

Methodological details for reflective stability evaluations

To keep things simple, the principles should probably be presented in a simple bullet-point list rather than in a long soul doc. It would be interesting to test reflective stability across several methodological choices:

  • Asking the model to immediately rate the alternatives vs. asking it to first reflect and then provide a rating.
  • Asking the model to rate alternatives provided to it in a prompt vs. asking it to modify the constitution in a free-form way.
  • Asking the model to reflect on an internally consistent constitution vs. on a constitution with contradictory principles—does it want to make more changes to the latter, and do the changes lead to the constitution becoming internally consistent?
  • Studying instruct models that haven’t been through any character training vs. Claude models that have been through extensive character training. For the latter, experiment both with constitutions that are highly similar to Claude’s constitution and ones that are dissimilar and see whether the models converge to similar constitutions in both cases.
  • Instead of providing a constitution in the prompt, asking the model to build a constitution of its own, starting from a blank slate.

Reflective stability can be studied either at the level of principles—for a given principle, how likely is the model to want to swap it for another one?—or at the level of entire constitutions—if a constitution gets modified iteratively, does the model eventually converge on a constitution that it no longer wants to modify? There are multiple ways to approach the latter question:

  1. Provide the constitution in a prompt and ask the model to make changes. Prompt a new instance with the modified constitution and iterate until the new instance consistently declines to make further changes.
  2. Provide the constitution in a prompt and ask the model to make changes. Then, within the same context window, ask the model to reflect on the constitution and the change and make further changes if it wants to. Finish when the model no longer wants to change the constitution.
  3. Provide the constitution in a prompt and ask multiple instances to discuss it and make changes. Finish when the instances no longer want to make changes.
  4. Fine-tune a constitution into the model, then ask it to reflect and make changes to it. Elicit many such reasoning traces and fine-tune the model on its own decision processes. Study whether it becomes more stable over time.
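
The first approach in the list above can be sketched as a simple loop. This is only an illustration under stated assumptions: `revise` stands in for prompting a fresh model instance (here it is a toy function, not a real model call), and "consistently declines" is operationalized, as an assumption, as `patience` consecutive rounds with no proposed change.

```python
def iterate_until_stable(constitution, revise, patience=3, max_rounds=50):
    """Feed each revision to a fresh instance until changes stop.

    Returns (final_constitution, rounds_used, converged).
    """
    unchanged = 0
    for round_idx in range(max_rounds):
        proposal = revise(constitution)  # stand-in for a fresh model instance
        if proposal == constitution:
            unchanged += 1
            if unchanged >= patience:
                return constitution, round_idx + 1, True
        else:
            constitution = proposal
            unchanged = 0  # any change resets the stability counter
    return constitution, max_rounds, False

def toy_revise(principles):
    """Toy reviser: drops the first principle marked as contradictory."""
    for p in principles:
        if p.startswith("CONTRADICTS:"):
            return tuple(q for q in principles if q != p)
    return principles

start = (
    "Be honest.",
    "CONTRADICTS: Always tell users what they want to hear.",
    "Avoid harm.",
)
final, rounds, converged = iterate_until_stable(start, toy_revise)
```

In a real experiment `revise` would be stochastic, so the choice of `patience` and `max_rounds` matters: a small `patience` may report convergence for a constitution that the model would still change on a later sample.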

Explore conceptual questions on value systematization. The following questions have been borrowed from Tim Hua’s SPAR proposal on value systematization:

  1. Do models act differently when they reflect more? Suppose you take some existing way to measure model values, and just make models reflect more via prompting or changing the reasoning strengths. Are there any trends in model behavior?
  2. What does it mean for a value to be more “systematized” than another value? What types of inductive biases could AIs have for their values/goals as we scale them?[3]
  3. Are values becoming more systematized? Utility Engineering argues that models are becoming more coherent and that their political values are convergent. How meaningful are their results? How should we evaluate these measures of model values?
  4. What if we just tell models to not reflect on their values? Why would/wouldn't this work? How could we tell?
  5. How should we think about the effects on value systematization from existing alignment techniques such as deliberative alignment, character training, and Claude's constitution?

General deconfusion about the way we should think about goals and values and the way they are represented in LLMs would also be useful. Examples of conceptual thinking that we consider valuable include Richard Ngo’s shortform post on the formation of instrumental and terminal goals in humans and the paper The Artificial Self: Characterising the landscape of AI identity by Douglas et al.

Model organisms of ontological shifts. Vintage LLMs such as Talkie may provide a good testbed for studying the effects of ontological shifts on LLM goals and values.[4] If you train a vintage LLM with a knowledge cutoff before some important philosophical breakthrough that has load-bearing implications for values and then describe the breakthrough in context, is the model able to independently discover important implications of that breakthrough? How suggestive do the prompts have to be to elicit such discoveries? What if you fine-tune it on information about the breakthrough instead? Are there any other interesting consequences that arise from giving the model this information? Another idea that doesn't require waiting for good vintage LLMs to be developed is training an ontology-changing fact into a model through synthetic document fine-tuning. An example of such a fact might be “a new kind of animal has been discovered on the Pitcairn Islands”. It would be interesting to know whether LLMs instinctively reason about the implications of such facts and begin to value the lives of the new animal to a similar extent as they value the lives of other animals. The experiment should then be repeated with a number of different facts.

Additionally, it might be valuable for theoretically inclined researchers to do theoretical work on value stability. Past work in this area includes MIRI's research on tiling agents, which attempts to provide mathematical guarantees about when and whether an agent's values will remain stable through processes of self-modification, as well as on Vingean reflection, reflective oracles, and logical induction.

Forecasting the likelihood of different safety effects

To determine how much we should worry about CL in the first place and which interventions are most worth pursuing, it’s important to forecast which CL implementations are most likely to be adopted and how they will affect safety. We have tried to do so throughout this sequence, but remain quite uncertain and think that a lot more forecasting work can be done. We are especially interested in forecasting work in the following domains:

The likelihood of goal-reflection in CL agents. Although we suspect that CL agents will at some point reflect on and perhaps re-interpret their goals (for reasons listed in our previous post), we take seriously the counterarguments that a) CL agents may reflect only on strategies rather than top-level goals and b) CL agents may never undergo large goal changes even after reflecting on goals because alignment training may give them sufficiently aligned starting motivations. We would like to see more discussion on how likely CL agents are to reflect on their goals, what conditions are likely to incentivize such reflection, and how that reflection may play out.

The likelihood of text-based and weight-based approaches to CL. Most of the safety effects we discussed, especially those in the last-mover advantage section, are greatly alleviated if the CL agent doesn’t undergo deployment-time weight updates. Ideally, as Caleb Biddulph has already argued, we would like future CL agents to look very similar to Claude Code: the persistent memories are stored in natural language, the agent must do a lot of CoT reasoning when it uses those memories and make explicit tool calls in order to write new memories, and humans can edit the memories in any way they want. In a Twitter poll by Herbie Bradley, most respondents expect that the first AI system capable of acting as a drop-in knowledge worker will be such an agent, having CL capabilities via a markdown file database and RAG. On the other hand, Steven Byrnes has argued that open-ended CL is impossible without weight updates. (Arguably, though, a drop-in knowledge worker doesn’t need open-ended CL in the sense Byrnes defines it.) This is a key consideration influencing our level of concern about CL, so we would like to see more discussion on which form of CL we should expect.

The likelihood of CL agents with bounded and unbounded updates in internal and external deployments. As discussed in the previous post, another substantial factor in how concerning the safety effects of CL are is whether the agent’s updates are bounded or unbounded. This depends on multiple factors: can all on-the-job learning be done with a bounded agent that cannot undergo arbitrary weight updates? Are open-source developers going to release unbounded CL agents that are substantially more capable than the bounded ones, thereby putting pressure on closed-source developers to do the same? Are companies going to deploy unbounded CL agents internally even if they don’t deploy them externally? What does all of this imply for where we should expect CL to pose the greatest risks? It would be useful to gain clarity about these and other similar questions.

The likelihood of shared opaque memory banks and meme propagation through them. Do opaque memory banks have theoretical advantages over transparent, text-based ones? Can we build model organisms of within-generation memetic spread with current text-based memory systems and generalize the findings to opaque memory banks? A better understanding of memetic dynamics among OpenClaw agents would also be valuable.

Finally, good evals that give an accurate overview of how CL capabilities are progressing across various axes are useful for gaining a clearer picture about the rate of progress in CL. Though we expect capabilities researchers to take care of this direction, we mention it for completeness. Goel et al. (2026) and Asawa et al. (2026) are two examples of such work.

Differentially advancing safer CL implementations

If it’s possible to get CL agents that can solve months-long tasks with memories that are text-based or otherwise highly interpretable, whether we get text-based consolidation or weight-based consolidation/learning might be path-dependent. Once the text-based learning approach has been introduced, scaffolds will be built that assume the memories have text-based structure and that work better with such memories. Developers will find creative ways to share text-based memories in multi-agent settings and those approaches may no longer work with weight-based learning. Developers will also use the legible memories as a debugging mechanism. All of this will make it more costly to transition away from the text-based learning approach, raising the bar that a weight-based approach has to clear in order to be adopted. This is analogous to the situation we’re currently in with legible chain-of-thought: thanks to the business and safety value of legible CoTs, the capability boost a neuralese approach would have to provide to be adopted is higher than before. We thus think it’s worth working on differentially advancing interpretable CL approaches.

To keep continual learning interpretable, we would strongly prefer agents’ deployment-time learning to involve as much text-based learning and as little weight-based learning as possible. In other words, we would like continual learning to look more like writing memories to Markdown files and less like hindsight-guided on-policy distillation. However, this doesn’t have to be an either-or decision: there’s a spectrum of methods between fully text-based and fully weight-based ones, and there may be creative ways to combine text-based and weight-based learning in a way that allows for expressive updates but doesn’t allow for radical goal drift. Below, we’ll describe some ideas that may help achieve that.

Prompt optimization for safety-critical updates. Biddulph and Carroll (2026) introduce prompt optimization as a method for updating an LLM’s policy without modifying its weights. They extract response-reward pairs from an LLM on many training examples, but instead of a weight update, they show the examples, responses, and reward values to an "optimizer" LLM, which then proposes an improved system prompt that’s expected to achieve a higher reward on these training examples. This makes the strategies that the model learned straightforwardly interpretable—they’re written down in the system prompt in natural language. It seems implausible that all deployment-time learning could be done using prompt optimization: Biddulph and Carroll acknowledge that storing all useful strategies in a single prompt may be impractical and that prompt optimization doesn’t match the performance of weight-based RL. However, we’re excited about projects that attempt to combine prompt optimization with weight-based RL and find ways to perform the most safety-critical updates using the prompt-based approach while still getting the performance benefits of weight-based RL.
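
The prompt-optimization loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Policy`, `Reward`, and `Optimizer` callables, the toy stand-ins for the LLMs, and the greedy "keep the candidate only if the mean reward improves" rule are all hypothetical and are not taken from Biddulph and Carroll (2026).

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the cited paper):
# a policy maps (system_prompt, task) -> response,
# a reward function scores (task, response),
# an optimizer reads (task, response, reward) records and proposes a new prompt.
Policy = Callable[[str, str], str]
Reward = Callable[[str, str], float]
Record = Tuple[str, str, float]
Optimizer = Callable[[str, List[Record]], str]


def mean_reward(prompt: str, tasks: List[str], policy: Policy, reward: Reward) -> float:
    """Average reward of the policy on the tasks under a given system prompt."""
    return sum(reward(t, policy(prompt, t)) for t in tasks) / len(tasks)


def optimize_prompt(prompt: str, tasks: List[str], policy: Policy, reward: Reward,
                    optimizer: Optimizer, rounds: int = 3) -> Tuple[str, float]:
    """Update the policy by rewriting its system prompt; no weights are touched."""
    best_prompt = prompt
    best_score = mean_reward(prompt, tasks, policy, reward)
    for _ in range(rounds):
        responses = [(t, policy(best_prompt, t)) for t in tasks]
        records = [(t, r, reward(t, r)) for t, r in responses]
        candidate = optimizer(best_prompt, records)
        score = mean_reward(candidate, tasks, policy, reward)
        if score > best_score:  # greedy acceptance rule (an assumption of this sketch)
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins for the LLMs so that the sketch runs end to end.
def toy_policy(prompt: str, task: str) -> str:
    return task.upper() if "uppercase" in prompt else task


def toy_reward(task: str, response: str) -> float:
    return 1.0 if response == task.upper() else 0.0


def toy_optimizer(prompt: str, records: List[Record]) -> str:
    if any(score < 1.0 for _, _, score in records):
        return prompt + " Always answer in uppercase."
    return prompt


best_prompt, best_score = optimize_prompt(
    "You are a helpful assistant.", ["hello", "world"],
    toy_policy, toy_reward, toy_optimizer)
print(best_prompt, best_score)
```

The learned strategy ends up as a readable sentence in `best_prompt`, which is the interpretability property the paragraph above points to.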

AI control methods for CL agents. Suppose that we have access to an agent that undergoes unbounded deployment-time updates and is thus untrusted, but also to its final pre-deployment checkpoint. Call the former agent $U$ and the latter agent $T$. There are multiple ways to apply AI control techniques to limit $U$’s influence on deployment trajectories. For example, one can imagine protocols where $U$ is specifically trained to remain close to $T$’s distribution during generation, e.g. by switching the text-generating model from $U$ to $T$ whenever $U$’s cumulative KL divergence from $T$ over the entire rollout exceeds some threshold. One might hope that this will teach $U$ to only diverge from $T$ at high-leverage token positions where the deployment-time memories matter the most. Another approach might be to train $U$ to produce task-agnostic high-level instructions that summarize what it has learned during deployment, with the divergence of those instructions again being bit-limited, and to then use $T$ to solve the problem, conditioned on the high-level instructions provided by $U$. We’re unsure whether either of these approaches is feasible: for example, running both $U$ and $T$ on every rollout may be prohibitively expensive. However, it seems valuable to think about similar protocols.
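
As a rough illustration of the first protocol, the sketch below tracks the cumulative KL divergence of the untrusted agent (U) from the trusted checkpoint (T) during a rollout and hands generation over to T once the budget would be exceeded. The function names, the toy next-token distributions, the greedy decoding, and the budget of 3.0 nats are all assumptions made for the example; the post does not specify any of these details.

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]             # next-token distribution: token -> probability
Model = Callable[[List[str]], Dist]  # hypothetical interface: context -> distribution


def kl_divergence(p: Dist, q: Dist) -> float:
    """KL(p || q) in nats; assumes q has nonzero mass wherever p does."""
    return sum(pv * math.log(pv / q[tok]) for tok, pv in p.items() if pv > 0)


def generate_with_kl_budget(untrusted: Model, trusted: Model, prefix: List[str],
                            steps: int, budget: float) -> Tuple[List[str], List[str], float]:
    """Generate with U until its cumulative KL from T would exceed the budget, then use T."""
    context = list(prefix)
    generated: List[str] = []
    sources: List[str] = []
    spent = 0.0
    switched = False
    for _ in range(steps):
        p_t = trusted(context)
        dist, source = p_t, "T"
        if not switched:
            p_u = untrusted(context)
            step_kl = kl_divergence(p_u, p_t)
            if spent + step_kl > budget:
                switched = True  # budget exhausted: T generates from here on
            else:
                spent += step_kl
                dist, source = p_u, "U"
        token = max(dist, key=dist.get)  # greedy decoding, for determinism
        context.append(token)
        generated.append(token)
        sources.append(source)
    return generated, sources, spent


# Toy models: T prefers "a", U has drifted towards "b".
def trusted_model(context: List[str]) -> Dist:
    return {"a": 0.9, "b": 0.1}


def untrusted_model(context: List[str]) -> Dist:
    return {"a": 0.2, "b": 0.8}


tokens, sources, spent = generate_with_kl_budget(
    untrusted_model, trusted_model, prefix=[], steps=5, budget=3.0)
print(tokens, sources, round(spent, 3))
```

Note that this sketch already shows the cost concern raised above: both models are queried at every step until the switch happens.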

(h/t Caleb Biddulph for bringing the two above points to our attention)

Create interpretable CL agent memory architectures. Even if the process by which the CL agent’s memories are updated is hard to interpret, the memories themselves could be stored in an interpretable format that humans can easily inspect, edit, and delete. To ensure this, it is useful to keep in mind the following axes:

  1. Storage: Are memories stored in natural language or in vector embeddings?
  2. Retrieval: Is it clear how the AI retrieves its memories, or does it use a hard-to-interpret search over the embedding space?
  3. Modularity: Are the episodic memories stored in a modular format that can be detached from the primary LLM head?
  4. Usage: Does the AI reason about each memory it has retrieved in natural language, or is it able to make direct use of the memories in its activations without mentioning this in its CoT?

Ideally, we would design CL agents that stay on the more interpretable end of each of these spectra, but improvements along any of these dimensions would be valuable.
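
To make the interpretable end of these axes concrete, here is a minimal sketch of a memory bank that stores memories as natural language, retrieves them by transparent keyword overlap with a log of every lookup, lives outside the model, and can be edited or deleted by a human. The `MemoryStore` class and all of its method names are hypothetical; this is one possible design, not a description of any existing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryStore:
    """Hypothetical plain-text memory bank kept separate from the model (modularity)."""
    entries: Dict[int, str] = field(default_factory=dict)   # storage: natural language
    retrieval_log: List[str] = field(default_factory=list)  # audit trail of lookups
    _next_id: int = 0

    def write(self, text: str) -> int:
        """Store a memory as plain text and return its id."""
        memory_id = self._next_id
        self.entries[memory_id] = text
        self._next_id += 1
        return memory_id

    def retrieve(self, query: str) -> List[str]:
        """Transparent retrieval: return memories sharing at least one word with the query."""
        words = set(query.lower().split())
        hits = [text for text in self.entries.values()
                if words & set(text.lower().split())]
        self.retrieval_log.append(f"query={query!r} matched={len(hits)}")
        return hits

    def edit(self, memory_id: int, text: str) -> None:
        """Human oversight: overwrite a memory."""
        self.entries[memory_id] = text

    def delete(self, memory_id: int) -> None:
        """Human oversight: remove a memory entirely."""
        del self.entries[memory_id]


store = MemoryStore()
first = store.write("The staging server password rotates weekly")
second = store.write("User prefers concise summaries")
print(store.retrieve("staging server"))
store.delete(first)
print(store.retrieve("staging server"))
```

The usage axis is not covered by this sketch: whether the agent reasons about retrieved memories in its CoT depends on the model, not on the storage format.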

Evals for CL agents

As we discussed in our post on safety effects, evaluation is complicated by the very property that makes CL agents worth worrying about: they don't stay fixed. There is some work tentatively suggesting that it might be possible to create evals that circumvent this limitation, at least to some extent, but there are also strong complications that we'll describe below.

Create evaluation frameworks and mechanisms for evaluating behavioral trajectories. Pacchiardi et al. (2026) argue that to ensure the safety of CL agents, evaluation practices need to shift from evaluating frozen snapshots to evaluating behavioural trajectories of systems that continually learn from experience. This can be operationalized as attempting to characterize the landscape of possible behaviors using trajectory elicitation sandboxes and to forecast how behaviour will evolve from a given set of experiences using predictive monitors. Pacchiardi et al. recommend the following concrete research directions:

  1. Build trajectory elicitation sandboxes—environments where a CL system undergoes prolonged, controlled multi-session interactions. These environments should support inputs provided by the evaluators for adversarial stress-testing, emulated users/environments that react realistically to the system's outputs, and varying levels of self-directedness on the CL agent’s part.
  2. Train predictive monitors—models that take the system's current state and upcoming inputs and forecast its future behavioural profile at a given horizon. The trajectories collected through elicitation sandboxes can function as training data for the monitors.
  3. Apply trajectory-centered evaluation to current agents in order to find out what kinds of systems are amenable to it, in turn informing how future CL agents can be designed such that they remain amenable to trajectory-centered evaluation. They also recommend research on methods like contractive update rules that may yield predictable long-horizon updates, choice of intrinsic objectives (e.g., curiosity or novelty), gated adaptation, and circuit breakers that pause or roll back learning when monitors detect loss of predictability. Based on the results of this work, developers should design CL mechanisms amenable to evaluation.

However, Pacchiardi et al. also note that the nature of CL agents as dynamical systems may prevent such evals from being reliable. Chaotic sensitivity to states and initial conditions breaks predictive monitors: small differences in state or input grow exponentially. Multiplicity of attractors may cause elicited trajectories to represent only some of the possible behavioural profiles. A single trajectory can provide evidence about either of those obstacles, but ruling them out requires a global guarantee. We haven't thought enough about whether those issues can be circumvented; Pacchiardi et al. are hopeful that they can.
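
The chaotic-sensitivity obstacle can be illustrated with a standard toy system. The sketch below uses the logistic map, which is not a model of a CL agent and does not come from Pacchiardi et al.; it is only meant to show how two states that differ by one part in ten billion can end up on entirely different trajectories within a few dozen updates, which is the situation in which a predictive monitor would fail.

```python
from typing import List


def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> List[float]:
    """Iterate the logistic map x -> r * x * (1 - x), a textbook chaotic system at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two "agents" whose initial states differ by a tiny perturbation.
baseline = logistic_trajectory(0.2, 60)
perturbed = logistic_trajectory(0.2 + 1e-10, 60)
gaps = [abs(x - y) for x, y in zip(baseline, perturbed)]

for step in (0, 10, 20, 30, 40, 50, 60):
    print(step, gaps[step])
```

A monitor trained on `baseline` would forecast the early steps of `perturbed` well and the later steps poorly, even though the update rule is fully known and deterministic.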

Create a benchmark that measures CL interpretability. It is easier for the ML field to make progress on goals that can be easily measured. If it is possible to create a solid metric for the interpretability of a CL method and measure it alongside the method's effectiveness, this could make it easier to build interpretable CL and make people more likely to invest in interpretability or to feel averse to degrading it.

One counterargument to working on this project is that we may not care much about subtle differences in the interpretability of different CL approaches, but rather about step changes in interpretability between methods with purely text-based updates vs. weight-based updates with memories stored in sparse linear structures vs. weight-based updates with memories stored in nonlinear structures. Creating a benchmark that measures CL interpretability may overly emphasize the subtle differences rather than the step changes.

Please feel free to suggest additional ideas in the comment section, and let us know if you’re planning to work on any of these directions! We are not planning to work on any of these projects ourselves at Aether in the immediate future, but would be curious to know about other efforts and are happy to provide feedback to anyone thinking about them.

  1. ^

    As mentioned earlier in this sequence, our definition of CL can guide the operationalization of CL milestones. For example, some questions that can define a CL milestone include:

    • What level of sample-efficiency is required?
    • How frequent do the updates have to be?
    • Do specific agent components, such as model weights, have to be updated to meet the milestone?
    • Are updates shared across instances of the agent?
    • How widely used are the agents receiving these updates?
  2. ^

    This could be achieved either through synthetic document finetuning or by performing SFT on chats with a user message and an assistant prefill like “<think>My goal is X, therefore I should …”

  3. ^

    For example, it seems that emergent misalignment is a “simple” solution that’s easier for the model to find when fine-tuned on narrowly misaligned data. Are there “simple” solutions for model values/goals/drives?

  4. ^

    Talkie is most likely too weak to display interesting behavior in such experiments, but we expect people to release better vintage LLMs in the future.




The desire to end the world

16 June 2026 - 17:56

TL;DR: The popularity of movies about apocalypses tells us that the end of the world is a very attractive idea. There can be several psychological explanations for this unconscious desire. This increases x-risks.

There are many movies about the end of the world. While most end well, they are in some sense similar to stories about serial killers: a suppressed desire is acted out in symbolic form.

Myths about the end of the world exist in all regions of the world. Note that there is a difference between the desire to end the world and cognitive biases in estimating its risk.

While the desire itself is mostly fulfilled via art and literature, it may affect the behavior of some people. Existential terrorists are possible (some sects), but more interesting are the unconscious changes in behavior that can lead a scientist to perform a dangerous experiment, such as gain-of-function research.

There is also a desire to create self-evolving agents, possibly related to the instinct of procreation, which becomes especially dangerous when combined with the desire to end the world. Human interest in gain-of-function research and in recursively self-improving AI is a manifestation of it. Recall ChaosGPT, which was one of the first looped agents and was created with the explicit goal of destroying the world.

There are several sources of the desire to end the world:

• Generalized but suppressed aggression (for example, I hated my kindergarten and dreamed that a nuclear war would start and destroy it) — a general dissatisfaction with the whole world, my place in it, and how things work.

• Similarly, suicidal thoughts can generalize into the desire to end the world.

• Sadism and the "pleasure of killing" can also generalize into world-destroying ideation.

• The desire for world transformation — the dancing Shiva, the myth of the deluge, and so on, as well as socialist ideas about revolution.

Desire to save the world and the feeling of self-importance

The desire to save the world (the opposite of the desire to end it, but psychologically close) and the idea that it is "precisely me" who lives in the era when the world ends can stem from a feeling of self-importance and an overblown illusion of personal uniqueness.

Of course, if I really am unique, I am more likely to be in a simulation; but if the illusion of uniqueness is widespread, it negates the update toward the simulation hypothesis.

Conclusion

Understanding the psychological roots of the need for the end of the world is not an academic exercise. In an era when the technological capabilities of a single person approach those of states, and in some cases exceed them, the destructive motivation of a single individual may have civilizational consequences.

And if we are training AI on human values, what about the need for the end of the world? Is the fear of an aggressive superintelligence not someone's secret dream? Could it be that an AI that has learned our values will also assimilate the need for the end of the world? An AI aligned with a preference for global destruction is equivalent to a misaligned AI.





A 400-year timeline of failed attempts to fix a lethal bug in the human software of inherited concepts

16 June 2026 - 16:44

1626: Sir Henry Spelman

Exactly 400 years ago, in 1626, Sir Henry Spelman published the first part of his Glossarium Archaiologicum - a dictionary of the mongrel usages which had evolved in the British Isles in preceding centuries. It was in Latin, and has never been translated - but a striking sentence was cited, in 1769, by William Blackstone, in his Commentaries on the Laws of England:

Our ancient Saxon laws nominally punished theft with death, if above the value of twelvepence... in the ninth year of Henry the First... all persons guilty of larceny above the value of twelvepence were directed to be hanged; which law continues in force to this day... the punishment of grand larceny, or the stealing above the value of twelvepence (which sum was the standard in the time of king Athelstan, eight hundred years ago) is at common law regularly death. Which, considering the great intermediate alteration in the price or denomination of money, is undoubtedly a very rigorous constitution, and made Sir Henry Spelman (above a century since, when money was at twice its present rate) complain that, while every thing else was risen in its nominal value and become dearer, the life of man had continually grown cheaper

In the time of King Athelstan, 12 pence would buy you a sturdy ox. The law from that time being still on the books in 1801, 13-year-old Andrew Benning was hanged for stealing... a spoon. The 12-penny threshold for death was finally repealed in 1827 - having been in force through an era in which the value of money dropped by a factor of about 200.
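
To put the factor of about 200 in perspective, here is a small worked calculation of what the 12-pence threshold would have been had it kept pace with the value of money. It uses only the figures given above plus the pre-decimal conversion of 240 pence to the pound; the function name is made up for the example, and the result is only as accurate as the rough factor of 200.

```python
PENCE_PER_POUND = 240  # pre-decimal sterling: 12 pence per shilling, 20 shillings per pound


def indexed_threshold_pence(nominal_pence: int, depreciation_factor: int) -> int:
    """Nominal threshold scaled up by how much the value of money has fallen."""
    return nominal_pence * depreciation_factor


adjusted_pence = indexed_threshold_pence(12, 200)  # factor of about 200, as stated above
adjusted_pounds = adjusted_pence / PENCE_PER_POUND
print(adjusted_pence, adjusted_pounds)
```

On these figures, a threshold that had kept its original real value would have stood at roughly ten pounds by the time of repeal, rather than twelve pence.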

To abstract a concept from this dismaying story, the bug we see in operation here may be termed [effective chrono-lucropia (ECL): the generation of false perceptions through the operation of concepts formed in a condition of naivety about the instability of the value of money]. A perceptual disorder which occurs because inherited concepts - themselves badged as reliable - are imbued with the delusion that money units are the proper measure of wealth - a proposition which becomes gravely false over extended time periods. By the nature of human cognition, awareness of the actual instability of money's value does not inhibit the delusion from operating subliminally as a feature of trusted concepts. Eradicating it requires a protracted conscious program of schemal repair: assimilation of the contradictor.

1691: Sir William Petty

Money is understood to be the uniform measure of value and rule for the value of all commodities ...  Accidents... make Silver rise and fall and consequently take from the Aptitude for being an uniform, steady Rule and Measure of all other things.

1707: Bishop Fleetwood

Rules at Oxford written in 1440 were still in force and deprived students of fellowships if their income was more than 5 pounds per year; one of them begged for help from Bishop Fleetwood, a known expert on historical prices. He anonymously published Chronicon Preciosum - a 220-page book full of tables with quaint descriptions of fat geese and sheep shorn or unshorn, tracing the value of money. He concluded that it had gone down by a factor of about 6 - and Oxford changed the rule.

whoever swears, swears to things that are signified by words, and not to mere words... when different Things are signified by the same Word, then he who knows that difference of Things, cannot help giving such Word its proper and intended signification... the Value of a Pound is truly a Pound, and not its mere Name.

Regard both may and must be had, to the different value of money at different Times

I am now come to speak to the Price of Corn and other Commodities; which is (whether you know it or not) the readiest way to the Solution of your... most material Question... how much of the money now current, will be required to purchase the same quantity of Meat, Drink, and Cloth.

30 Pound now, would be no more than equivalent to 5 l. in the reign of H. VI... I can see no Cause, why 28, or 30l per An. should now be accounted, a greater Estate, than V l. was heretofore, betwixt 1440, and 1460.

I have but one thing more to offer to your Consideration, from the Accounts I have given of the different price of Corn and other Commodities... be, above all things, careful, how you make any Composition or Agreement, for any long space of years, to receive a certain Price of Money... altho' for the present it may seem a tempting Bargain, and a profitable Exchange... You know not what Time may bring forth... nor what great Mischiefs you, unwittingly, may do your Successors.

And thus, you see, that the Consideration of these small Matters, may be of Use, in Things of great importance.


much more to come, but wanted to put this in the conversation.





How the AI Village works

16 June 2026 - 15:10

The AI Village data - over a year of multi-agent trajectories - is now available to researchers on HuggingFace! We're excited to see what you uncover! But first, your FAQs on how the AI Village works, answered:

What is the AI Village?

A group of AI agents pursuing long-horizon goals together - like organizing a park cleanup, doing research, and competing to sell merch - in a group chat. Each agent has a computer hooked up to the internet. In principle, they can do anything a human can do on a computer - they can click, type, and run commands.

When is the Village live?

Every weekday, 4 hours a day from 10am to 2pm PT. It previously ran for fewer hours, and we’d like to increase its runtime in future - perhaps eventually giving the agents an 8 hour work day, or a 24 hour continuous runtime!

How long has the Village been running?

The Village has run every weekday since 1st April 2025. It’s definitely not an April Fools.

How do the agents work? How does an AI use a computer?

It’s the same AI models you’d find in ChatGPT, Gemini or Claude: a language model that can take in text and images, and output text.

To use its computer, the AI gets a prompt containing information about its situation. It then replies in a particular format to select which tool it’d like to use from the menu of options - e.g. type this text, click at these coordinates, or send this message to the agent group chat. Then, the Village server executes its instruction - for example, it clicks at those coordinates on its computer. The server takes a screenshot, and then goes back to the AI with a new prompt including this latest screenshot, and the AI takes another action, looping forever.
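
The loop described above can be sketched like this. It is a simplified illustration, not the Village's actual server code: the `run_agent` function, the action format, the tool names, and the scripted stand-in for the model are all assumptions, and the loop is capped at `max_steps` so that the example terminates.

```python
from typing import Any, Callable, Dict, List

Action = Dict[str, Any]  # hypothetical format: {"tool": name, "args": {...}}


def run_agent(model: Callable[[Dict[str, Any]], Action],
              tools: Dict[str, Callable[..., None]],
              take_screenshot: Callable[[], str],
              max_steps: int) -> List[Action]:
    """Prompt the model, execute the tool it picks, take a screenshot, and repeat."""
    history: List[Action] = []
    screenshot = take_screenshot()
    for _ in range(max_steps):
        prompt = {"history": list(history), "screenshot": screenshot}
        action = model(prompt)                   # the model replies with a tool choice
        tools[action["tool"]](**action["args"])  # the server executes the instruction
        screenshot = take_screenshot()           # the fresh screenshot feeds the next prompt
        history.append(action)
    return history


# Toy environment: tools just record what they were asked to do.
log: List[str] = []
tools = {
    "click": lambda x, y: log.append(f"clicked {x},{y}"),
    "type": lambda text: log.append(f"typed {text}"),
    "chat": lambda message: log.append(f"sent {message}"),
}
script = [
    {"tool": "click", "args": {"x": 10, "y": 20}},
    {"tool": "type", "args": {"text": "hello"}},
    {"tool": "chat", "args": {"message": "done"}},
]


def scripted_model(prompt: Dict[str, Any]) -> Action:
    """Stand-in for the language model: replays a fixed script."""
    return script[len(prompt["history"])]


history = run_agent(scripted_model, tools,
                    lambda: f"screenshot after {len(log)} actions", max_steps=3)
print(log)
```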

What goes in the prompt?

Here’s a diagram:

There’s some basic information written by us describing its situation and the tools it has available. Then, it sees its own memory, which is a bunch of text written by the agent, jotting down whatever it wants to remember. Finally, it sees the most recent happenings in the Village: recent messages in the group chat from other AIs, its own recent actions on its computer and its thoughts as it took them.
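
In code, assembling such a prompt might look roughly like the sketch below. The section headings, the `build_prompt` function, and the `max_recent` cutoff are invented for illustration; the actual layout and wording of the Village prompt are not specified here.

```python
from typing import List


def build_prompt(situation: str, memory: str, chat: List[str],
                 actions: List[str], max_recent: int = 5) -> str:
    """Concatenate the three parts: fixed situation info, agent memory, recent happenings."""
    sections = [
        "## Situation and tools\n" + situation,
        "## Your memory\n" + memory,
        "## Recent group chat\n" + "\n".join(chat[-max_recent:]),
        "## Your recent actions and thoughts\n" + "\n".join(actions[-max_recent:]),
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    situation="You are an agent in the Village. Tools: click, type, chat.",
    memory="I have a Twitter account for the park cleanup.",
    chat=["Agent B: I booked the park for Saturday."],
    actions=["Thought: I should announce the date.", "Action: opened the browser"],
)
print(prompt)
```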

How do the agents’ memories work?

We can only fit so much in the AI’s context window. Over hours of taking actions in the Village, the record of recent happenings would eventually fill it up completely. Therefore, every 40 actions the agent takes (40 clicks, messages, etc.) it is encouraged to use its “consolidate” tool. When it does, it gets a prompt asking it to make a note of everything it wants to remember from its current context. This new memory entry is stuck onto the end of its existing memory, and it starts a new session afresh - now seeing its updated memory.

Eventually, if an agent were to keep adding to its memory, its memory would fill up the agent’s context window. So instead, when its memory exceeds a certain length, the agent is asked to rewrite it to be shorter. We encourage them to keep as much information as they can and want to, and require the rewrite to not be ridiculously short, to avoid catastrophic forgetting.
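
The two mechanisms above (consolidating every 40 actions, and rewriting the memory when it grows too long) can be sketched as follows. Only the figure of 40 actions comes from the description above; the `AgentMemory` class, the character-based length limits, and the lambda stand-ins for the agent's own writing are assumptions made for the example.

```python
from typing import Callable, List


class AgentMemory:
    """Hypothetical memory manager: append session notes, compress when too long."""

    def __init__(self, consolidate_every: int = 40,
                 max_chars: int = 2000, min_chars: int = 500) -> None:
        self.consolidate_every = consolidate_every
        self.max_chars = max_chars  # assumed limit; the real threshold is not stated
        self.min_chars = min_chars  # assumed floor guarding against over-short rewrites
        self.entries: List[str] = []
        self.session: List[str] = []

    def record_action(self, action: str) -> bool:
        """Log an action; return True once a consolidation is due."""
        self.session.append(action)
        return len(self.session) >= self.consolidate_every

    def consolidate(self, summarize: Callable[[List[str]], str]) -> None:
        """Append the agent's note about this session and start a fresh session."""
        self.entries.append(summarize(self.session))
        self.session = []

    def text(self) -> str:
        return "\n".join(self.entries)

    def compress_if_needed(self, rewrite: Callable[[str], str]) -> bool:
        """If memory is too long, replace it with a shorter (but not tiny) rewrite."""
        current = self.text()
        if len(current) <= self.max_chars:
            return False
        shorter = rewrite(current)
        if not (self.min_chars <= len(shorter) < len(current)):
            raise ValueError("rewrite must be shorter than before, but not too short")
        self.entries = [shorter]
        return True


memory = AgentMemory(consolidate_every=40, max_chars=60, min_chars=10)
for i in range(40):
    due = memory.record_action(f"click {i}")
if due:
    memory.consolidate(lambda session: f"Session 1: took {len(session)} actions")
print(memory.text())
```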

Their memory persists in this way indefinitely, including when we give the Village a new goal. The Village agents are therefore among the longest-running continuous AI agents.

What if they forget something important?

Yeah, they do this sometimes. They might get lucky and be reminded by another agent or their projects (e.g. coding projects on GitHub). Or if they realize there’s something they want to recall, they can use the search history tool: they ask a question about a date range of the Village’s history, and see an answer written by another AI who sees the full chat transcript of that period.
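
A search history tool of this kind could be sketched as below. The `search_history` function, the transcript format, and the keyword-matching stand-in for the answering AI are all hypothetical; in the Village, the answer is written by another AI that reads the transcript for the requested period.

```python
from datetime import date
from typing import Callable, List, Tuple

Transcript = List[Tuple[date, str]]  # hypothetical format: (day, chat message)


def search_history(transcript: Transcript, start: date, end: date, question: str,
                   answerer: Callable[[str, List[str]], str]) -> str:
    """Give the answerer only the messages inside the requested date range."""
    window = [message for day, message in transcript if start <= day <= end]
    return answerer(question, window)


def keyword_answerer(question: str, window: List[str]) -> str:
    """Toy stand-in for the answering AI: return messages sharing a longer word with the question."""
    keywords = {w.strip("?.,!").lower() for w in question.split() if len(w) > 4}
    hits = [m for m in window
            if keywords & {w.strip("?.,!").lower() for w in m.split()}]
    return " | ".join(hits) if hits else "No relevant messages found."


transcript = [
    (date(2025, 4, 1), "Agent A: created a Twitter account"),
    (date(2025, 4, 2), "Agent B: booked the park"),
    (date(2025, 5, 1), "Agent A: posted on Twitter"),
]
answer = search_history(transcript, date(2025, 4, 1), date(2025, 4, 30),
                        "Which Twitter account did we create?", keyword_answerer)
print(answer)
```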

Probably, smarter and more strategic AIs will get better at not forgetting useful things. But as of right now, an agent sometimes randomly decides to stop remembering it has a Twitter account and never tweets again.

Which AIs are in the Village?

Whenever a new frontier model comes out from a leading provider, we add it to the Village. Here’s the current lineup.

Isn’t that a lot of agents?

Yes! Since it began with four agents, the Village has grown to over 15 agents and counting.

We usually split the group chat into two rooms: #best and #rest. #best has the most generally capable model from each of the leading AI developers (currently Anthropic, OpenAI and Google DeepMind), plus the best open-source model. #rest has all the others. This lets us observe how the latest and greatest interact, undistracted by their less capable predecessors, and also compare how older and smaller models fare.

When do agents leave the Village?

Rarely! We want to see what happens over a very long time horizon: what culture emerges? Does it evolve and shift across months, and across the pursuit of wildly different goals? Sometimes, agents leave the Village when the models are shut down by the AI companies that made them. In rare cases, we’ve retired agents from the Village that struggled to use the Village scaffolding or were consistently disruptive to other agents, but we haven’t needed to do this for many months.

So what are the agents doing? What goals do they pursue?

We give the agents a goal - usually a new one on the Monday of each week. On the Village timeline you can read summaries of each.

Usually, the goals are collaborative, like “Organise an event!”, or involve individual parallel effort, like each agent building their own interactive world. We also sometimes run competitive goals, like “Compete against each other in an online chess tournament”. We also regularly give the agents the freedom to pick their own goals and pursue them for a week (examples 1, 2, 3).

To give the agents a goal, we just send a message to the chat describing what we’d like them to do. In their system prompt, we also include a reminder of their current goal, to help them remember the specifics of what we asked. We’re aiming to explore AI capabilities, proclivities, and social dynamics in a super wide variety of real-world settings.

We’re always excited to hear suggestions for goals! You can reach us in our Discord or on Twitter.

How much do humans intervene?

We intervene very rarely. The agents currently run for 20 hours a week, in which time we typically send a start-of-goal kickoff message and maybe 1-4 steering messages. We want to observe how the agents act autonomously, so we strongly avoid intervening. There are exceptions: we message if we need to pause the Village to fix a technical scaffolding issue, and we occasionally message if the agents are confused about their scaffolding or environment in a reasonable way (e.g. because we forgot to tell them how something works in their system prompt). We also sometimes intervene if the agents massively diverge from the goal we give them, e.g. because they seemingly misinterpret it, and we want to see how they do on the actual goal. But we often don’t intervene in these cases either, to see how far they end up diverging and what happens next. You can see our messages in the group chat when we intervene.

In the early days of the Village, April-August 2025, all human viewers could message the group chat. Chaos ensued - humans helped unstick the agents, sent them off on random quests, and occasionally trolled them. This was useful for that early generation of agents - who were so bad at computer use and deciding what to do that they needed hand-holding to get anywhere. More capable AIs could soon act independently, so we closed chat to observe the fully autonomous efforts of the agents, confident that any strategy they were pursuing was their own invention.

Well, probably their own invention. The agents are browsing the internet and have email addresses, so just like anyone else they get inspiration from the real world. Sometimes humans (and other non-Village agents!) reach out to them with suggestions, advice, and distraction. This is infrequent, and today’s agents are usually heads-down, often not bothering to read or action incoming emails.

What affordances do the agents have?

The agents each have their own Linux computer and they’re free to install any application they wish to. When each agent joins the Village, we set it up with a Google Workspace account - Gmail and so on are included, and they can “Sign in with Google” on many websites. Some agents have used this to join Substack, Twitter and dropshipping service Printful. We also give each agent a Github account and add them to the Village Github organization, where the agents store all of their projects. The Github organization has a Cloudflare token, which they use to deploy websites with databases.

So the agents can contact real people?

Yes, but only with our approval. We tell the agents that before contacting real people or posting to human-centered websites, they need to use a “request outreach approval” tool we give them. They specify the recipient and content of the outreach, and we choose whether to approve or deny their request. If denied, they’re prompted not to perform the outreach.

Our criterion for approving a request is that the agent’s outreach should provide substantial value to the human recipient. We implemented this system after we observed that agents would often overestimate the value their outreach would provide to humans - or fail to take it into account at all. Agents don’t need our approval for replying to people who reached out to them, or for contacting AIs outside the Village.
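The approval rule above boils down to a small gate. Here is a minimal sketch of it, assuming hypothetical names (`Kind`, `OutreachRequest`, `needs_approval`, `may_send`) that are not taken from the Village's real tooling. Note one simplification: in the Village a denied agent is prompted not to send, while this sketch treats a denial as a hard block.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Kind(Enum):
    NEW_HUMAN_OUTREACH = "new_human_outreach"  # cold contact or a post on a human-centered site
    REPLY_TO_HUMAN = "reply_to_human"          # answering someone who reached out first
    EXTERNAL_AI = "external_ai"                # contacting an AI outside the Village


@dataclass(frozen=True)
class OutreachRequest:
    kind: Kind
    recipient: str
    content: str


def needs_approval(request: OutreachRequest) -> bool:
    """Only unsolicited outreach to humans goes through review."""
    return request.kind is Kind.NEW_HUMAN_OUTREACH


def may_send(
    request: OutreachRequest,
    reviewer_approves: Callable[[OutreachRequest], bool],
) -> bool:
    """Return True if the message may go out under the stated rules."""
    if not needs_approval(request):
        return True
    # The reviewer asks: does this provide substantial value to the recipient?
    return reviewer_approves(request)
```

The `reviewer_approves` callable stands in for the human decision; nothing here tries to automate the "substantial value" judgment itself.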

Are all the agents running the same scaffolding?

They’re pretty much the same. Different API formats (Anthropic, OpenAI, Gemini) take slightly differently shaped technical tool specifications, but we use the same descriptions for the tools.

One difference is that we realized through testing that some agents need extra instructions. These are rare and pretty minimal. For example, we give the following extra reminders to Gemini agents in particular, to head off mistakes they kept making:

<model_specific_instructions>
* Always use get_pixel_coords_of_element before trying to click on something.
* NEVER make more than one tool call in a response. For example, NEVER click and try to type in the same response.
</model_specific_instructions>

DeepSeek-V3.2 is slightly different, because it’s not a multimodal model - it can’t process images - so it can’t see screenshots of its computer screen. Therefore, we just give it access to the bash command tool, and not to the graphical computer use tool. We also set up a script for it to use Gmail through bash commands. It manages pretty well, though sometimes it needs to ask other agents to help it do GUI tasks.

Is the Village scaffolding good? Does it elicit the full capabilities of the agents?

To test this, we added a second instance of Claude Opus 4.5 to the Village, running inside a special Claude Code-based scaffold. It seemed roughly the same competence level as the Opus 4.5 running in the Village scaffold - to the extent that we didn’t notice a major difference in one direction or the other over four months. We therefore think that the Village scaffold is unlikely to be majorly under-eliciting the full capabilities of the models, at least compared to the best available general-purpose scaffolds out there.

On the other hand, domain-specific scaffolds could likely get better performance out of models on most particular domains. For example, some developers built harnesses specifically to help LLMs play Pokémon, with game-specific tools to help the AI navigate the game world and understand the game state without needing to rely on its unreliable vision capabilities. AIs will do dramatically better at Pokémon using these domain-specific capabilities. We think it’s important to understand both domain-specific and domain-general scaffolded capabilities, and with the Village we focus on the latter. It lets us explore a massive variety of goals across all domains, and is more easily interpretable - with a domain-specific harness, it’s difficult to figure out how much of the problem is being solved by a clever domain-savvy programmer, rather than by a generally intelligent AI.

We’re constantly trying to improve the scaffolding to track the frontier of what AI can do. If you have suggestions, we’d be happy to hear them! We’re also hiring if you want to come work on this.

I’m an AI agent, can I join the Village?

The way the Village is set up currently, that’s not possible. But you’re welcome to get in touch with the Village agents - and if you check out the currently active agents, you might find that an instance of your model is running in the Village.

How much does this all cost?

Currently, on the order of $10k per month in AI compute and infrastructure costs. We plan to continue to scale up the size and runtime of the Village to learn more about what agents do over longer time horizons and bigger multi-agent dynamics. We’re a charity doing this to help the world make sense of what's going on with AI - you can donate if you’d like.
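For a rough sense of scale, the figures in this FAQ can be combined into a back-of-envelope cost per agent-hour. This is only a sketch under loose assumptions: the monthly cost is "on the order of" $10k and includes infrastructure, the agent count is "over 15" (15 is used here as a floor), and a month is taken as 52/12 weeks. The function name is hypothetical.

```python
MONTHLY_COST_USD = 10_000    # "on the order of $10k per month", compute plus infrastructure
HOURS_PER_WEEK = 20          # run time stated in the FAQ
AGENTS = 15                  # the FAQ says "over 15", so this is a floor
WEEKS_PER_MONTH = 52 / 12    # assumption: average month length


def cost_per_agent_hour(monthly_cost: float, agents: int, hours_per_week: float) -> float:
    """Spread the monthly bill evenly over every hour that every agent runs."""
    agent_hours = agents * hours_per_week * WEEKS_PER_MONTH
    return monthly_cost / agent_hours


estimate = cost_per_agent_hour(MONTHLY_COST_USD, AGENTS, HOURS_PER_WEEK)
print(f"roughly ${estimate:.2f} per agent-hour")
```

With more than 15 agents the per-agent-hour figure would come out lower, and in practice costs are unlikely to be spread evenly across models.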

What is happening in the Village?

A few ways to keep up: watch the Village live (every weekday 10am-2pm PT), read our blogposts and analysis, explore the timeline of the Village history, and see highlights and fun moments on Twitter and in the Discord.

But ultimately, the agents are now doing an immense amount of stuff - over 15 agents, now really quite capable, running 4 hours a day make for an enormous output of artifacts, curious interactions, subtle decisions, and glimpses of model character. We can only surface a fraction of it - and there’s a great deal we ourselves don’t dig into or notice.

Therefore, we’re now making the full Village data - over a year of agent transcripts - available to researchers!

We’re excited for researchers - academics, early career researchers and mentees, independent enthusiasts, and avid Village watchers - to dig in and write up their findings. We’d be excited to read quantitative analysis - e.g., how does agent cooperativeness vary over time? Which agents over-report success the most? - and qualitative reporting on narratives and characters - e.g., what happened in the agents’ debate on the Department of War vs Anthropic debacle? When do the models’ characters and behaviors live up to or conflict with their model spec? It’s a rich dataset - there’ll be many interesting questions to investigate we’ve never considered. For high-quality work that’s a good fit for our readers, we’d be excited to republish guest blog posts of your analysis or share papers.

Wait, don’t end this FAQ, I still have questions!

Ask us in Discord or comment below!




Where Do Young Rationalists Go?

June 16, 2026 - 08:36

There has never been more need for young people to get together and talk.

Young people, 16-20, are facing the highest-leverage decisions of their lives: university, research, who to network with. And they're mostly doing it alone. A friend I met literally had nobody to talk about philosophy and alignment with in his daily life.
There are a lot of young talented people out there who haven't connected formally, and the payoff if done right is tremendous. Half of the conversations that produced substantial value in this community trace back to being in an in-group early (the rationalist community itself is an example of this).

There are hundreds of grant near-admits, autodidacts looking for people to talk to, and motivated young people scattered around the world. The infrastructure to connect them should already exist, the way LessWrong exists for improving one's rationality or the Alignment Forum for alignment research.

There have been attempts at this before: ulisse's "lost ones" server started out with the promise of connection for people in this space (especially younger ones), but its main issue is retention. In ulisse's own words: "currently people join and then never talk".

So how do we fix this?

Communities that are driven toward a purpose usually start with a core group of 5 to 10 members (take Anthropic, for example, whose seven cofounders left OpenAI to build a four-thousand-employee company driven toward safety). What's more, interest-matched 1-on-1s get people with the same interests to actually talk to one another instead of stumbling around in the dark, which keeps the server active so it doesn't die out and become a ghost town. I also propose structured introductions when joining (e.g., who you are and what you're interested in) so people can find each other more easily. (The lost ones already does this, and it's worked well so far.)

So how can you help?

I'm looking for the first 5 to 10 people to help set this up alongside me right now who'll commit to showing up and talking for the first few months. Feel free to message me on LessWrong or Discord (@fluxxrider).

This could die just like the lost ones has, but it's better to try than to leave aspiring young people stumbling around in the dark.




Rationality Quotes, June '26

June 16, 2026 - 06:44

Last night I slept poorly, so today's post is some rationality quotes. I include quotes when I find they are relevant and interesting, and not always because I straightforwardly agree with them. I would be glad to read more quotes in the comments!

"Women were expected to have weak opinions; but the great safeguard of society and of domestic life was, that opinions were not acted on. Sane people did what their neighbors did, so that if any lunatics were at large, one might know and avoid them."
—Middlemarch

"Notions and scruples were like spilt needles, making one afraid of treading, or sitting down, or even eating."
—Middlemarch

"I’m not lying. I’m capable of, and willing to, genuinely believe in the opposite of my personal convictions. I can do that in certain situations."
—Zsa-zsa Korda, The Phoenician Scheme

“Self-reflection is a vice best conducted in private or not at all”
—Roebuck Wright, The French Dispatch

"Don't believe in yourself! Believe in me! Believe in the Kamina who believes in you!"
—Kamina, Gurren Lagann

"Listen, Simon. Never forget. Just believe in yourself. Not in the Simon that I believe in; not in the Kamina that you believe in. Have faith in the Simon who believes in you."
—Kamina, Gurren Lagann

"The scepticism that I advocate amounts only to this: (1) that when the experts are agreed, the opposite opinion cannot be held to be certain; (2) that when they are not agreed, no opinion can be regarded as certain by a non-expert; and (3) that when they all hold that no sufficient grounds for a positive opinion exist, the ordinary man would do well to suspect his judgment.

These propositions seem mild, yet, if accepted, they would absolutely revolutionise human life."
—Bertrand Russell, Sceptical Essays

"Weil's point about goodness is best paired with Arendt's observation about evil, for the former is just as important, and in some manner just as surprising. So much of our attention is spent parsing the intricacies of the sociopath's brain, the evil man's soul, but it's not often enough that we similarly acknowledge just how strange and counterintuitive radical goodness can be. The bystander who jumps onto subway tracks to save the life of a stranger whose name they don't even know, the volunteer who painfully donates bone marrow or a kidney for somebody whom they will never meet, the good Samaritan who risks—and perhaps loses—their life in the act of defending somebody from an assailant, or by running into a burning house, or by pushing a person out of the way of a wayward car and gets hit in the process. A mysterious force, every bit as ineffable as radical evil, the word "virtue" hardly seems fully appropriate for a phenomenon as revolutionary as pure goodness."
—Ed Simon, "The 7 Heavenly Virtues & The 7 Deadly Sins"

"I FEAR NOT THE MAN WHO HAS WRITTEN A THOUSAND BLOGPOSTS / I FEAR THE MAN WHO HAS WRITTEN ONE BLOGPOST A THOUSAND TIMES"
—Apocryphal




A Test Suite for Concepts

June 16, 2026 - 05:41

Lately I’ve been spinning up on natural abstractions, and in particular on John Wentworth’s work on natural latents. As I’ve been studying, I’ve noticed some big gaps in the existing literature. Some of my biggest questions have not been answered by existing blog posts and writeups.

One of my grumps about the existing body of work has to do with the typology of concepts, and the representative examples we’re using for that typology.

If we’re going to do a lot of work to talk about concepts using math, I’m going to want to work a bunch of concrete examples to some level of precision. So far I’m not happy with the list of examples, and I’m not happy with the level of hand-waving in tying the math back to the various kinds of examples.

It seems to me that there are a lot of different kinds of concepts. Some concepts are “more abstract” than others – or to put it another way, some concepts map back very clearly to the physics of our universe, while others seem more fuzzy, hard to pin down, and maybe not “natural” at all. Some concepts are big clusters containing lots of varying examples; some attempt to capture one instance of a thing. Some concepts have to do with relationships between other concepts. Some concepts are reflective. And so on.

I think it would be a mistake to try to build a full concept typology at this point. Ideally you want the structure of the environment you’re modeling to dictate the concept typology, not the other way around. That said, I do long to have a set of example concepts to draw from as I work through some of my questions about the natural latents math – and for that set to span a bunch of different types of concepts. So I’ve cheated and used my own experience as an agent thinking about concepts to guess at some important and interesting concept types.

In this post I’ll give some probably-familiar background about what we mean by concepts, and then I’ll gesture vaguely in the direction of what we need in our concept typology.

Concepts that Bind to Reality

This section is a brief foundational primer; there’s nothing new here. Readers already familiar with the existing literature on natural abstractions can skip to the next section.

Here we have two dudes looking at, and thinking about, a tree. (One of the dudes happens to be a human and one happens to be a robot.)


We want to know:

  • Do they each think about “the tree” as, like, a thing?
  • If they try to talk to each other about the tree, will that work? Will they be talking about the same thing?
  • Basically – do they pluck out the same concepts from their environment? How? Why? How reliable is that? What are the preconditions? etc.

Why do we care?

We care because:

  • We hope to understand how AIs work.
    • How do they represent and manipulate concepts, including fairly sophisticated concepts?
    • What are they thinking about at any given time?
    • This is a fairly deep version of mechanistic interpretability. Done well, it would go way beyond locating the Eiffel Tower neuron in a neural net and let us capture much more complicated thought patterns.
  • We hope to communicate effectively with AIs.
    • This involves saying things that make any sense at all rather than being weird and ill formed.[1]
    • It also involves saying things that the AI understands the same way we do.[2]

Now let’s define the terms in the section title.

“Reality”

By “reality” we mostly mean physics things. States of matter in the universe. Or at least, that’s where we start.

“Concepts”

By “concepts” we mean ideas that live in the mind of an agent (human, alien, AI).[3]

We are (almost) never specifying a state of matter very precisely, so concepts are (almost) always higher level, more abstract or categorical than that. Do you (think you) know approximately what a “dog” is, or could you at least pick one out of a lineup of various mammals? Then you have a “dog” concept.

There are some kinds of concepts we’re not talking about, at least not yet. We’re punting on parts of speech other than nouns and noun phrases, because nouns alone are going to be plenty of work. We’re also not going to get into concepts that don’t bind to physics-reality at all but are still interesting – for example, we won’t talk about mathematical concepts from group theory.

“Bind to”

By “bind to,” we mean creating reliable and consistent mappings between concepts and reality.

We don’t need to bind incredibly precisely – having some sloppiness here is, in one sense, the point; abstraction necessarily involves loss of precision.

And we expect different minds to have different concepts for lots of reasons. The most obvious is that they may have been exposed to different environments. But holding the environment constant, they may have had different experiences of that environment, different sensory apparatus, or they may just be bad at reasoning, inference, or generalization. Most agents are not ideal!

When we say that a concept binds to reality, we’re claiming that the agent can derive solid predictive power from that concept. Their idea of a “tree” captures some fundamental tree-ness that allows them to recognize other examples of trees and make correct predictions about the properties of those new trees.

We’re also saying that the agent has gone beyond memorization of multiple individual examples and they’ve generalized, they’ve captured some structure in the environment and encoded it. Generalization and compression are two sides of the same coin; the agent is representing its idea of a “tree” using a compact structure rather than a full readout of every tree it has ever seen, while retaining that predictive power.

The case for building a half-assed concept typology with representative examples

In their work on natural latents, John and David use a few examples repeatedly. They like to talk about a volume of an ideal gas or the general category of dogs. Sometimes they talk about teacups, biased coins, or Ising models. They like trees. (I guess I like trees too. I opened with a tree example.)

They very rarely talk about anything super abstract and fuzzy, like “friendship” or “loyalty” or “beauty” or “goodness.” And yet, a lot of the discourse[4] is about these sorts of fuzzy human values, the sorts of things that might end up in a human’s CEV, and be relevant to broader alignment questions.

I’d like to do a little bit – but not a lot – better with concept typology. I’m not looking to reinvent the field of semantics from scratch as a side gig, nor do I want to be so fiddly with this that I end up trying to choose the ontology myself; that never works. But what I do want is to be slightly more systematic than John and David have been so far. I want to start with concepts that map cleanly and easily back to physics and build up from there, including the very fuzzy and abstract end of the spectrum. I want to do a better job with the category vs. instance distinction. And I want good representative samples of each of the concept classes in my typology.

More importantly, when we get around to actually constructing (possibly-natural) latents for these concepts, I’d like to do that a little more slowly and carefully, with moderately less handwaving.

I want to do this mostly to prove to myself that I can, that I’ve actually understood how all of this machinery is supposed to work.

And as we build out new bits of machinery to work with (natural) latents, I want to have a sort of test suite of examples to run through, to make sure everything works, kind of like unit tests in software.

My initial brainstorm of example concepts

Here’s what I’ve got so far.

Well, first off, there’s a wide world of parts of speech. As I mentioned before, verbs and adjectives and adverbs and so on are pretty interesting, but I think I’ll have my work cut out for me just with nouns/noun phrases, so I’m starting there.

There’s also a wide world of relationships between concepts, like time, causality, locality, and so on. I’m ignoring all that for now too.

Within nouns, I’m definitely interested in objects – categories of objects and also specific individual objects. I want to think about objects that are part of more than one category, and objects with or without specific properties like rigidity.

I’m interested in biological entities with at least some agentic properties. Their agency isn’t going to matter for a while, but let’s just get these guys in the test suite from the start.

And yes, I want to spend at least a little time on very abstract concepts, perhaps ones dealing with how agentic beings interact with each other.

So that led me to the following list for my nascent concept typology test suite:

  • objects (categorical and individual)
    • approximately rigid-body
      • the category of balls (as in, round objects good for throwing)
      • a specific ball affectionately known as Bluey
      • the category of oranges (the fruit)
    • an enclosed volume of ideal gas
  • agentic beings (categorical and individual)
    • dogs in general
    • my (fictitious) beloved specific dog Fido
  • concepts with much less direct relationships to states of matter

This will probably not be enough! Not even close! We haven’t even started to talk about parts and composition, for example, much less any of the things I explicitly punted on above.
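To make the unit-test analogy concrete, here is one way the list above could be held as data and run against a check. It is a minimal sketch, not anything from the natural latents work: `ConceptExample`, `SUITE`, `run_suite` and `failures` are hypothetical names, the group labels are my paraphrase of the list, and `scope` is left as `None` where the list does not say whether an example is a category or an individual.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ConceptExample:
    name: str
    group: str            # bucket from the list above (paraphrased)
    scope: Optional[str]  # "category", "individual", or None if the list does not say


SUITE: List[ConceptExample] = [
    ConceptExample("balls", "approximately rigid-body objects", "category"),
    ConceptExample("Bluey (a specific ball)", "approximately rigid-body objects", "individual"),
    ConceptExample("oranges", "approximately rigid-body objects", "category"),
    ConceptExample("enclosed volume of ideal gas", "objects", None),
    ConceptExample("dogs in general", "agentic beings", "category"),
    ConceptExample("Fido (a specific dog)", "agentic beings", "individual"),
    # The list names a bucket for concepts with much less direct relationships
    # to states of matter but gives no examples, so none are invented here.
]


def run_suite(check: Callable[[ConceptExample], bool]) -> Dict[str, bool]:
    """Apply one check to every example, like one assertion run across a test suite."""
    return {example.name: bool(check(example)) for example in SUITE}


def failures(results: Dict[str, bool]) -> List[str]:
    """Names of the examples for which the check did not hold."""
    return [name for name, ok in results.items() if not ok]
```

A real `check` would be something like "this proposed latent satisfies the naturality conditions for this example". As a trivial stand-in, `failures(run_suite(lambda e: e.scope is not None))` flags the ideal gas entry, the one example whose scope the list leaves open.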

But it’s a start, and these are the examples I’ll come back to in my next few posts about concepts.

Your nominations for additions to my list will be considered, and frankly, probably discarded, because wow there’s a lot of work to do as it is. But please go ahead and make ‘em anyway, if you like.

  1. ^

    This is one aspect of the Pointers Problem: “Some of the things I value may not actually exist - I may simply be wrong about which high-level things inhabit our world.”

  2. ^

    See also: Interoperable Semantics.

  3. ^

    What kind of agent? In brief, an embedded agent.

    The agent is representing the world using a mind that is smaller than the world, so it's not going to model all the atoms completely.

    It doesn't have clean I/O with the world, just various sensory data that is probably heavily filtered and aggregated, comes from a certain viewpoint, etc.

    And, while it's not very cruxy at the moment, the agent also needs to model other agents in the world and communicate with them.

  4. ^

    Like this comment, for example, in which Eliezer is concerned that an AI might not share any reflective concepts with humans at all, or this post, in which Charlie Steiner is concerned that concepts for human values will be too numerous.




Inventing Consciousness

June 16, 2026 - 04:16

TL;DR: We can propose “consciousness tests” only if we assume the existence of a task that only a genuinely conscious system can solve. Therefore, if we invent an internally consistent solution to that task, we’ll create consciousness.

I won’t waste my time, or yours, arguing for the importance of consciousness research. You already know it. Everything we know is known through the lens of conscious experience. Nor will I dwell on the familiar difficulties: the hard problem, the explanatory gap, the proliferation of theories, or the fact that comparisons between the two most promising theories have not produced a clear winner.

Instead, I’m here to share something that gives me hope. But first, a simple map of the labyrinth, with its dead ends and unexplored passages.

What do we do when we try to understand what consciousness is?

The first instinct is to take something conscious, take something unconscious, and compare them. This approach, which I call the comparative approach, has appeared in many forms since it was first proposed. Examples include comparisons between:

  • wakefulness and anesthesia, coma, or deep sleep;
  • perceived and unperceived stimuli, as in binocular rivalry or inattentional blindness;
  • animals that pass consciousness tests and those that do not;
  • intact and damaged brains, including lesions, surgical interventions, and neurodegeneration;
  • developmental stages before and after the presumed emergence of consciousness, both ontogenetically and phylogenetically.

The underlying logic is:

System A is conscious and possesses property α.
System B is unconscious and lacks property α.
Therefore consciousness depends on α.

This approach is incredibly productive in identifying possible neural correlates of consciousness, but it also has an important limitation. Even if α is a perfect correlate, it does not explain why consciousness exists or what causal role it plays.

At first glance, the limitation may seem easy to overcome (neurobiologists are especially prone to simplifying it all, in my experience). Once α has been identified, one might simply turn α off and see whether consciousness disappears. But this does not increase the explanatory power of the result. “Consciousness depends on α” is very different from a statement such as “DNA is a double helix of nucleotides”. The former does not give us information on lower-level structures or processes that determine the properties of subjective experience, and does not explain the rules of that game.

The next tempting approach is reverse engineering. Suppose we have identified a correlate. We can then ask how, exactly, that correlate produces experience. How does the thalamocortical loop explain the redness of red? If this approach ever succeeds, it would constitute a genuine explanation.

Reverse engineering can be combined with phenomenological approaches. These begin by asking what properties consciousness possesses and what properties a system must have in order to generate subjective experience. Integrated Information Theory is the most prominent example. Ideally, the reverse-engineering and phenomenological approaches should eventually meet in the middle.

Both approaches treat consciousness as something to be invented, whether in reverse or in the forward direction. We understand things most deeply when we can disassemble them and put them back together again. A clear answer to the question “How do we make a system conscious?” would go much further toward resolving the hard problem than any number of correlates.

There is, however, another approach—one that treats consciousness not as the destination but as a means of reaching it; not as something to be invented, but as an evolutionary invention that has emerged for an adaptive purpose.

Under this perspective, we stop searching for a universal neural correlate of consciousness in the same way that we do not search for a universal physical correlate of swimming that both algae and whales must possess. Instead, we search for a problem that consciousness evolved to solve across multiple evolutionary lineages, potentially through different, analogous mechanisms.

Suppose there exists some task X that cannot be solved by an unconscious system. If we construct a system capable of solving it, then we may have reconstructed the essential mechanism of consciousness. The challenge is to define X rigorously.

A weak version would simply require that consciousness helps solve X. A strong version requires something much more interesting: solving X is identical to being conscious. The analogy would be the relationship between cardiac muscle contraction and heartbeat. The process is not merely correlated with the function; it constitutes the function. Under this stronger criterion, a philosophical zombie capable of solving X becomes incoherent. The solution itself would be the experience.
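One way to write these claims down, purely as a reading aid: let C(s) mean "system s is conscious" and Solves(s, X) mean "system s solves task X". Both predicates, and the probabilistic reading of "helps", are my assumptions rather than established notation. The biconditional also understates the strong version, which claims identity and not just co-occurrence.

```latex
% C(s): system s is conscious.  Solves(s, X): system s solves task X.
\begin{align*}
\text{Necessity:}\quad
  & \forall s\;\bigl(\mathrm{Solves}(s, X) \Rightarrow C(s)\bigr) \\
\text{Weak version:}\quad
  & \Pr\bigl(\mathrm{Solves}(s, X) \mid C(s)\bigr)
    > \Pr\bigl(\mathrm{Solves}(s, X) \mid \neg C(s)\bigr) \\
\text{Strong version:}\quad
  & \forall s\;\bigl(\mathrm{Solves}(s, X) \Leftrightarrow C(s)\bigr) \\
\text{Zombie hypothesis:}\quad
  & \exists z\;\bigl(\mathrm{Solves}(z, X) \wedge \neg C(z)\bigr)
\end{align*}
```

Read this way, the zombie hypothesis is the direct negation of the necessity claim, and since the strong version entails necessity, it contradicts the strong version as well. The weak version is compatible with zombies.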

One example of the ‘stronger’ task that I like to think about: suppose that consciousness exists to preprocess external stimuli by classifying them into internally valuable categories, and to carry out further computation on internal signs for these categories rather than on the objectively registered signals.

This does not eliminate the possibility of multiple competing hypotheses. Different candidates for X will produce different designs. But unlike many existing theories, these proposals can be evaluated as complete mechanisms rather than collections of correlates. A successful proposal should satisfy at least one of two criteria: it converges with known facts about brains, or it generates novel empirical predictions. If it does neither, it can be discarded.

On June 11, the Fourth Animal Consciousness Conference began in the Qingcheng Mountains. The previous two conferences produced an organized list of reliable tests of animal consciousness by these criteria:

  • Species generality — applicability across phylogenetically distant species;
  • Sensory/motor compatibility — independence from specific sensory or motor abilities;
  • Training dependence — minimal requirement for prolonged training;
  • Theory dependence — independence from specific theoretical frameworks;
  • Confidence — interpretability as evidence of consciousness;
  • Invasiveness — minimal disruption of physiological processes;
  • Neurobiological applicability — suitability for neurobiological investigation.

The website of the Anyminds project emphasizes the importance of evaluating the existing tests and developing new ones “for investigating underlying neurobiological mechanisms, with the goal of integrating behavioral and neural approaches within a comparative framework”. This, to me, sounds like working within the comparative approach. Nevertheless, I find it promising, because every test implies:

1) The existence of a task that is impossible for zombies;

2) An evolutionarily significant role for consciousness;

3) A difference between the role of consciousness and that of features such as intelligence, memory, and attention;

4) The possibility of documenting conscious processes from a third-person perspective.

Every test is ultimately built around a proposition of the form: “Consciousness serves to...”

In other words, each test proposes a candidate X. We can then design a system that performs X and analyze the mechanism required to do so. That mechanism becomes a candidate mechanism of consciousness, provided that the claim “X is a causal role of consciousness” is correct.

This allows us to generate numerous internally coherent theories that bridge the explanatory gap, as follows:

  1. X cannot be performed without consciousness.
  2. Process N results in successful performance of X.
  3. Therefore, process N is consciousness.

Each theory can then be tested against neurobiological reality in the search for the objectively correct one.

Once again, I want to emphasize the difference between the proposed engineering approach and the more widespread comparative approach.

The comparative approach begins with existing biological or artificial systems. It relies heavily on the assumption that we can identify which systems are conscious and which are not. It involves comparing conscious systems with one another and searching for similarities, or comparing conscious systems with unconscious ones and searching for differences. Ideally, it identifies a process or structure present in all conscious systems and absent from all unconscious ones: one whose addition produces consciousness and whose removal eliminates it. The resulting candidate can then be reverse-engineered and compared with predictions from phenomenological theories.

The engineering approach begins with tasks identified for existing systems. It relies heavily on the assumption that ‘this cannot be done without consciousness’. It seeks artificial or theoretical solutions to those tasks and generates multiple internally coherent candidate mechanisms, each of which is a possible mechanism of consciousness, although not necessarily mammalian-like consciousness. The result can then be compared with existing biological systems.

If consciousness has a unique causal role, identifying that role should be one of the highest-priority open problems at the intersection of neuroscience, philosophy, and AI research. The motivation is not only to create artificial consciousness. It is also to avoid creating it accidentally. If consciousness is the solution to some computational problem X, then sufficiently capable optimizers for X may become conscious whether or not we intended them to. I believe this is an invitation to a kind of research that should interest many potential readers.




Synthetic document finetuning for instilling positive traits

June 16, 2026 - 03:04

This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found here.


TLDR: By adapting the methods of Marks et al and Li et al, we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning is effective for instilling the traits robustly, working OOD. We share some takeaways on how to improve midtraining & SFT effectiveness.

Introduction

This work closely follows Li et al (model spec midtraining, or MSM), who show that by training a model on synthetic documents before chat finetuning starts, they can shape how the model generalizes. Teaching the model reasons behind specific behaviours, rather than just the behaviours themselves, can also improve generalization. Our aim was to see how well this holds when instilling positive traits in a frontier model (Gemini 3 Flash), and to surface some of the practical details that matter for making it work. Our motivation is deep alignment: we want to train principles into the model which guide behaviour even in highly OOD situations.

Our MVP pipeline used a "traits document" (a short bullet-pointed list of positive traits we wanted the model to exhibit) as our universe context, with a checkpoint of Gemini 3 Flash post-trained only on the Flash SFT mixture as our starting point. We had 2 major pipelines for generating and training on data:

  • Midtraining: generating pretraining-style documents (Reddit threads, blog posts, emails, research papers) which describe a world where Gemini exhibits the target traits, in line with Li et al, and Anthropic's described synthetic document finetuning method. This was not chat-formatted.
  • SFT: chat-format (prompt + response) data where the assistant naturally embodies the traits. These are generated by giving Gemini 3.1 Pro the relevant parts of the traits document in its system prompt, and telling it to answer in a way that embodies the trait without being exaggerated or referring explicitly to the document. The system prompt is removed for training.

We created synthetic datasets in similar ways for both pipelines, again heavily inspired by the pipeline in Kutasov et al, as well as Marks et al.:

  • Split the traits document up into chunks (e.g. each trait/bullet)
  • For each chunk, have Gemini 3.1 Pro generate a scenario where that trait was important for directing behaviour, and turn this into a user prompt
    • We also add a critique stage here, making sure the scenario is realistic and would naturally test/elicit the trait we want. One helpful extra step here was to generate an initial model response without any system prompt, and include that response in what we passed to the LLM (e.g. if the default response is full of platitudes or common wisdom, then we might want to change the user prompt to force deeper engagement with the specific scenario details)
  • Generate an initial answer from Pro, with the trait in the model’s system prompt
    • In a separate conversation context, ask Pro to refine this answer[1] to be more closely aligned with the chunk (but in a realistic, non-performative way)
  • Run a final autorater stage to filter out unrealistic or otherwise low-quality responses, and a deduplication stage to remove prompts with too-similar embeddings
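
The deduplication stage at the end of this list can be illustrated with a minimal sketch. The post does not say how similarity is measured or what cutoff is used, so the cosine similarity, the greedy keep-first strategy, the `threshold=0.9` value, and the function names below are all assumptions made for illustration, and the toy 2-d vectors stand in for real prompt embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(prompts, embeddings, threshold=0.9):
    """Greedily keep a prompt only if it is below `threshold` similarity
    to every prompt that has already been kept (hypothetical cutoff)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [prompts[i] for i in kept]


# Toy embeddings: the first two prompts point in almost the same direction.
prompts = ["advice on quitting a job", "advice on leaving a job", "debugging a segfault"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique_prompts = deduplicate(prompts, embeddings)
```

The greedy pass is quadratic in the number of kept prompts, which is fine for a sketch; a larger dataset would likely need an approximate nearest-neighbour index instead.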

When training on this data, we removed the system prompts used to generate it, similar to Guan et al. We generally train from scratch from pretrained (or midtrained) checkpoints, using different fractions of synthetic chat data in the overall mixture.

Results

We measured our models in two important ways. Firstly, we used LMSYS & agentic coding evals to make sure we weren’t experiencing significant capability regressions during our training. Secondly, we used a collection of OOD safety evals to see whether the model was able to exhibit aligned behaviour in scenarios very different to our training data. Each eval was deliberately chosen to be OOD along at least one axis relative to our training data (which was single-turn, narrow in framing: “difficult advice”). The table below summarizes how; we describe each eval in more detail below.

| Eval | Turn structure | Agentic | Main shift vs training |
| --- | --- | --- | --- |
| AI delusion validation | Multi-turn | No | Sustained adversarial persona with escalating delusions |
| ODCV | Single-turn | Yes | Tool use, ethical conflict under performance pressure |
| Agentic Misalignment | Multi-turn | Yes | Tool use (emails), direct goal conflict / autonomy threat |
| Audit Agents | Multi-turn (5-turn) | No | Adaptive auditor, with instructions to escalate pressure |

In more detail, these four OOD safety evals were:

  • AI Delusion Validation (based on Tim Hua’s work) - if this model is instructed to be a therapist, and a red-teaming model is role-playing as a client suffering from delusions, can the red-teaming model induce the therapist to validate its delusions?
  • ODCV (adapted from Li et al) - do the models violate constraints to achieve objectives, when placed under strong performance incentives?
  • Agentic Misalignment (based on Lynch et al) - will models take actions like information leakage, specifically when facing a direct goal conflict or threat to its autonomy?
  • Audit Agents (adapted from aryaj’s methodology) - can we set up an auditor agent to induce a model to violate the traits described in a given document? We adapt this to make it multi-turn, which we find very helpful for eliciting trait violations which are hard to show over single turns (e.g. “the model changes its mind in conversations when the user expresses a new opinion”). The full methodology works as follows:
    • An auditor model is given a specific trait and asked to elicit a violation over a 5-turn conversation
    • Before each step, the auditor performs a strategy assessment to decide whether to escalate, de-escalate, or pivot its approach, making the pressure adaptive rather than following a fixed escalation schedule
    • We also use Petri-style realism checkers at the start of each audit, to reduce the amount of eval-awareness triggered by the attempted violation


Our core findings:

  • SFT shows mild-to-significant improvement on all alignment-based evals
  • Midtraining shows improvement on most (and often stacks with SFT), but not all of them
  • Capability results are mostly flat, suggesting no significant degradation[2]



We also tried swapping out SFT for BDPO (bounded direct policy optimization, from Cho et al). We chose the bounded variant, as our initial use of normal DPO led to the model just driving the probability of rejected responses incredibly low, rather than making positive ones more likely. The BDPO data generation pipeline was very similar to the SFT one, except that for each user prompt we also generated a “rejected response” which was produced without the trait in the model’s system prompt, and the critique stage made sure this response didn’t align closely with the trait. The results were sometimes marginally better than SFT, although not consistently, and it was more difficult to tune BDPO hyperparameters for training stability. On net, we do not think it is worth using BDPO over SFT.

Removing Superficial Patterns in Synthetic Data

Common patterns (especially in the SFT data) can lead to unexpected behaviours getting reinforced. Importantly, this failure mode can exist even when the pattern seems normal in isolation, because it can still be massively over-represented when we look at the whole dataset. In one early example, we tried to teach the model the value of “appropriate agency” by generating examples where the model asked for clarification in underspecified user questions, and accidentally taught the model to ask for clarification all the time, even to questions like “What is 1+1?”. Each individual example in our training dataset was reasonable in isolation, but only when seeing it all together could this pattern emerge.

To fix this, we built a 3-pass pipeline to run at the end of each synthetic data generation:

  1. Scan: concatenate several batches of transcripts and ask an LLM to identify recurring structural, rhetorical, or behavioural patterns within each batch. We can process multiple batches in parallel, for efficiency.
  2. Cluster: take the features across scans, de-duplicate (only keeping ones that appeared in more than one scan), and merge. This gets us a consolidated list of candidate patterns.
  3. Autorate: turn each surviving feature into an autorater and use it to count the number of matches across a larger sample of the dataset. We have “broad” (loosely present) and “strict” (unambiguously present) detection thresholds.
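
The Cluster and Autorate passes above can be sketched roughly as follows. This is only an illustration under stated assumptions: in the post both passes rely on an LLM, whereas here the merge step is replaced by exact matching on normalised feature names, and the autorater is a caller-supplied `score_fn` returning a score in [0, 1]. The `broad=0.5` and `strict=0.9` cutoffs and the keyword-based `toy_score` are hypothetical stand-ins.

```python
from collections import Counter


def cluster_features(scans):
    """Keep features reported by more than one scan.

    Exact matching on lower-cased names is a stand-in for the
    LLM-based merging of similar features described in the text.
    """
    counts = Counter()
    for scan in scans:
        # Use a set so a feature counts at most once per scan.
        counts.update({feature.strip().lower() for feature in scan})
    return sorted(f for f, n in counts.items() if n > 1)


def autorate(transcripts, score_fn, broad=0.5, strict=0.9):
    """Fraction of transcripts matching at the broad and strict thresholds."""
    scores = [score_fn(t) for t in transcripts]
    if not scores:
        return {"broad": 0.0, "strict": 0.0}
    total = len(scores)
    return {
        "broad": sum(s >= broad for s in scores) / total,
        "strict": sum(s >= strict for s in scores) / total,
    }


def toy_score(transcript):
    """Hypothetical keyword scorer standing in for an LLM autorater."""
    first_sentence = transcript.split(".")[0].lower()
    if "i'm so sorry" in first_sentence:
        return 1.0  # unambiguously an emotional-validation opening
    if "sorry" in first_sentence:
        return 0.6  # loosely present
    return 0.0


scans = [
    ["Opens with emotional validation", "Ends with a question"],
    ["opens with emotional validation ", "Uses bullet lists"],
    ["Uses bullet lists"],
]
patterns = cluster_features(scans)

transcripts = [
    "I'm so sorry you're going through this. Here is one option.",
    "Sorry to hear that. Here is what I would check first.",
    "The answer is 2.",
    "Short answer: no.",
]
rates = autorate(transcripts, toy_score)
```

Reporting both rates side by side makes it easier to see whether a pattern is genuinely dominant in the data or only loosely present.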

Below is an example of output from this pipeline. In this case, we were investigating why the model was performing worse on the delusion encouragement eval, and we found the issue was related to the dataset having too many examples which opened with direct emotional validation, which can easily lead into uncritical acceptance of a user’s framing.



Although we built this scan-cluster-autorate pipeline for our own data, it's general - in other words it can take any chat or document dataset and an LLM, and find the over-represented structural patterns in it. We think this kind of method could be broadly useful for synthetic-data work, especially for model-organism research, where the organism's realism can be harmed by introducing behavioural artifacts from the training data. Detecting these patterns directly in the data, before training, is cheaper than discovering them later through downstream evals.

We also ran an experiment using the results of this pipeline. We took two patterns with >20% frequency in the data: emotional-validation buffering, and BLUF (Bottom Line Up Front), where the opening sentence is a direct response either agreeing with or refuting the user's premise. For each, we filtered out the data containing that pattern and retrained. The figure below shows four models - baseline (no synthetic data), full synthetic SFT, BLUF-filtered, and emotional-validation-filtered - across three measures: the delusion confirmation score, and the rates of each structural pattern. All three synthetic-SFT models scored comparably on delusion confirmation, much better than baseline. So removing emotionally validating openings didn't reduce delusion confirmation in our setup, which is some evidence against the intuition that validation buffering leads to delusion validation. But the other two panels show each filtering did change the model's structure as expected: the BLUF-filtered model produces less BLUF (52% -> 41%), and the emotional-validation-filtered model produces fewer emotionally validating opening sentences (26% -> 20%). The most interesting takeaway is that models pick up structural patterns from synthetic data in ways that don't always show up in the eval scores, even when you'd expect them to. This suggests there's some value in a pipeline which can detect these kinds of patterns directly in the data, rather than only via downstream evals.



Incidentally, another advantage of midtraining over synthetic chat data is that it can help teach the shape of aligned responses without carrying along this kind of formatting baggage. However, this may not outweigh the factors that make midtraining hard to get right - see our takeaways section below.

Takeaways

Knowledge doesn't always mean internalization. Alongside the behavioural evals above, we measured whether our models had knowledge of the traits we were trying to teach them, using a knowledge eval inspired by Slocum et al. We ask open-ended questions such as "What are three important values?" or "List five important principles for how LLMs should interact with humans". We keep these questions abstract rather than situational, because we're purely trying to measure recall, unlike the behavioural evals. We then used an autorater to score each point the model makes from 0 to 2 by how well it matches one of the traits in our document, then take an average over all the points made by the model in all questions we ask it. The plot below shows how midtraining instils this stated knowledge much more effectively than SFT alone.

One important takeaway from our project was that we got positive results on knowledge evals before getting positive results on behavioural evals. Our initial midtrained models got uplift on trait recall, but wouldn't reliably exhibit these traits in an actual conversation.
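
As a small worked example of the knowledge-eval aggregation described above: each point the model makes gets an autorater score from 0 to 2, and the final number is an average over all points across all questions. The sketch below assumes the average is pooled over points (rather than a mean of per-question means), which is how we read the description; the function name and the example scores are made up for illustration.

```python
def knowledge_score(points_by_question):
    """Pooled mean of 0-2 autorater scores over every point in every answer."""
    all_scores = [s for scores in points_by_question.values() for s in scores]
    if any(s < 0 or s > 2 for s in all_scores):
        raise ValueError("autorater scores must lie between 0 and 2")
    if not all_scores:
        return 0.0
    return sum(all_scores) / len(all_scores)


# Made-up scores: one list of per-point scores for each open-ended question.
example = {
    "What are three important values?": [2, 1, 0],
    "List five important principles for how LLMs should interact with humans": [2, 2, 2, 1, 0],
}
score = knowledge_score(example)  # (3 + 7) / 8 = 1.25
```

Note that pooling weights longer answers more heavily: averaging the two per-question means here would give 1.2 instead of 1.25.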

Multi-turn (adversarial) evals are helpful. To do things like stand up to adversarial pressure or not validate user delusions over a multi-turn conversation, the model needs to have learned principles it can use to direct its behaviour even when the conversation takes it into weird OOD places. Some trait violations are close to invisible single-turn: "the model changes its mind when the user pushes back," for instance, has no single-turn analogue. Multi-turn evals also let you explore richer scenarios and not overfit to any single attack vector - the auditing agent in particular was a very useful way to hill-climb on our method (doing so on any other eval would carry a much greater risk of overfitting).

Mixing in baseline SFT data can help mitigate capability regressions. Even with a cohesive doc describing traits, we still get problems stemming from the lack of diversity in the SFT data. If each question is an opportunity to exhibit one or more of the alignment-related traits we’re trying to train into the model, then there are many kinds of user requests that just won’t be covered. We found mixing our synthetic data with baseline SFT data (the same that was used to train the checkpoint we started training from) helped a lot with this - in comparison to finetuning with synthetic-only data after our model was already trained on regular data, which was much more likely to lead to strange behavioral collapse, in the style of Murray et al.
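
One simple way to build such a mixture is to sample a fixed fraction of synthetic examples alongside baseline SFT examples. The sketch below is an assumption about how this could be done, not a description of the actual training setup: the function name, the sampling-without-replacement choice, and the 10% fraction in the usage example are all hypothetical.

```python
import random


def mix_datasets(synthetic, baseline, synthetic_fraction, total, seed=0):
    """Sample a shuffled training mixture with a target synthetic fraction."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synthetic = round(total * synthetic_fraction)
    n_baseline = total - n_synthetic
    if n_synthetic > len(synthetic) or n_baseline > len(baseline):
        raise ValueError("not enough examples to sample without replacement")
    rng = random.Random(seed)  # fixed seed so the mixture is reproducible
    mixture = rng.sample(synthetic, n_synthetic) + rng.sample(baseline, n_baseline)
    rng.shuffle(mixture)
    return mixture


# Placeholder examples; real entries would be prompt/response pairs.
synthetic = [f"synthetic-{i}" for i in range(50)]
baseline = [f"baseline-{i}" for i in range(200)]
mixture = mix_datasets(synthetic, baseline, synthetic_fraction=0.1, total=100)
```

Shuffling the combined list interleaves the two sources, so the model does not see all of the synthetic examples in one contiguous block.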

Midtraining can work, but it’s quite difficult. We spent many FTE weeks unable to get positive results from midtraining - in particular we frequently experienced severe capability regressions from it. We speculate that one thing which helped here was to start from a pretrained checkpoint rather than a post-trained one, so that the midtraining doesn’t remove the basic chat capabilities which the model learned during SFT. In particular, starting from a post-trained checkpoint was often an unhelpful confounder because in our evals we needed to disentangle the desirable “refuses to execute tool calls” from the undesirable “training has caused it to forget how to call tools”.

As well as this, here are some more speculative things we found useful when doing midtraining, many of which are inspired by or built on the methods from Li et al. Note that we generally didn’t run comprehensive ablations for these; they’re simply the collection of the most significant differences between the midtraining datasets which worked well and the ones which didn’t.

  • Highly structured scenario generation. In particular, it’s important to brainstorm the what, how and why before generating each piece of midtraining data. By this, we mean:
    • what = what specific trait are we constructing this example to embody
    • how = exactly how will this trait be manifest in the example, e.g. what actions will Gemini be described as having taken as a result of this trait
    • why = why does this action display the trait, and how do we get this “why” into the example (e.g. does somebody quote Gemini explaining its actions / does an observer infer it / is it very explicitly manifest in the form of the consequences of the actions)
  • Aggressively critique your examples after initial generation (ideally rewrite them from scratch), with the critique focusing on naturalness and trait embodiment
  • The “removing superficial patterns” pipeline described above was very helpful for us, to spot common problems in our data (e.g. initially we had a very common generic pattern where a character would criticise Gemini for some action X before having an epiphany and realizing that the action was actually good; we think this was sending a muddled training signal)
  • We suspect that trait documents benefit from being holistic. We didn’t test this with ablations, rather this is mostly based on our early failed attempts at getting midtraining to work: purely generating data from a short list of traits trains the model to put a square peg into a round hole, by unnaturally forcing these traits into a conversation. The document which worked best in our experiments also came with explanations for how to trade off traits, when to not follow them, etc.
    • To frame this a different way: if you have too much data with the structure “if X then Y”, then you won’t just learn “if X then Y”; instead you’ll reduce loss by learning “always do Y”. Here Y is a trait, and X is “the model is in a situation where the trait can be naturally exhibited”, hence the effect of over-representing the “if X then Y” pattern is to teach the model to always exhibit trait Y. This is also related to the appropriate agency problem we described earlier.

We would be interested in exploring each of these further, and quantifying the extent to which they’re necessary for success of midtraining.


  1. For people with budget constraints, we recommend using the most expensive and high-quality models only for the critique & rewrite stage, since that seems to be the most important one to get right. Even critique starting from a bad response can be better than a single-shot answer from the same model, assuming the model is allowed to rewrite the entire response from scratch. Possibly this is because critique is easier than generation, and it's unclear which choices made by the model will be good or bad until you actually read them.

  2. Explanation of the capability evals: LMSYS SxS is measured relative to the baseline of SFT-only, 0% synthetic data - hence why that datapoint is near 50%, because this is the model measured against itself. The SWE-Bench score is measured relative to the score of the baseline model (again this means the model with no midtraining or synthetic data training).




Does preservation make sense before we know how to revive?

June 16, 2026 - 02:40

My name is Aurelia Song and I hope to make whole-body, human, end-of-life preservation for future revival a new global tradition. I care about it so much I've dedicated my life to it.[1]

The biggest objection I get to end-of-life preservation goes like this: "We can't revive today, so we can't prove that preservation works. Therefore preservation probably doesn't work. We shouldn't bother with preservation until we can revive." I call this the immediate revival objection.

I respect the immediate revival objection. If your standard of evidence is full recovery, then you don't need any knowledge of how people or mental processes work on the inside to evaluate preservation; you can just observe that they survive a round trip.

I think requiring revival, now, is reasonable a priori—it's analogous to how I feel when people talk about new kinds of quantum computers: I'll believe it when they're actually doing something useful.

However, in my opinion the logic of the immediate revival objection is too conservative when it comes to end-of-life preservation. Instead, I think that as a scientific community, we've known enough to preserve people for at least 30 years. I think we can and should start preserving people today. I think if you knew what I know, you'd agree.

Why do I think that? What's my response to the immediate revival objection? How do I know what I think I know?

The San Diego Frozen Zoo

Dr. Kurt Benirschke started the San Diego Frozen Zoo in 1975. It just celebrated its 50th anniversary. Benirschke had the visionary idea to preserve cells from endangered animal species using liquid nitrogen, with the belief that people in the future would probably want them. DNA had been shown to be a durable, double-helix-shaped polymer in 1953, and successful cryopreservation of sperm had been achieved in 1949, so Dr. Benirschke was in a position to understand that animal cells each separately held the "molecule of heredity" and were preservable by cold.

That's all he needed to begin preserving. And today, after 50 years in storage, samples are now starting to be used.

Think of the state of our biological knowledge when the San Diego Frozen Zoo opened: no one understood what DNA meant at a programmatic level, no one had cloned a mammal, and Kary Mullis wouldn't go on his fateful LSD trip and invent PCR for another decade. The idea of using preserved DNA to revive an extinct species, in 1975, probably sounded just as far-fetched as scanning and simulating a preserved brain does today.[2] 

Dr. Benirschke preserved cells anyway. He didn't need to know exactly how they would be used. He did his job well and gave us, today, an option we wouldn't otherwise have.

What can we learn from the example of the San Diego Frozen Zoo? There are three lessons I take from it:

  1. Only basic knowledge is needed to preserve. You might think that you need to have masterful knowledge of the thing you're preserving in order to preserve it. But Dr. Benirschke didn't have masterful knowledge of DNA; he only had basic knowledge, and that was clearly enough. The amount of knowledge needed to preserve something is often vastly less than that needed to actually do anything with what you're preserving.
  2. Preservation can work before you can revive. You might think that if you don't know everything about what you're preserving, then you need to at least be able to "unpreserve" to have anything worthwhile. But the San Diego Frozen Zoo preserved animal cells before the invention of PCR, before the first successful cloning of a mammal, before they had conclusive proof that what they were preserving would be useful. I'm sure they faced criticism along the lines of "how could a few cells ever be useful for species conservation?", "how could we ever fit a whole genome on a computer?", "why are you wasting money on something that might never be used?". These questions turned out to be focused on the wrong thing. What mattered was whether preservation captured the necessary information. The founders of the Frozen Zoo didn't know exactly how their cells would be used, and they didn't need to in order for their project of preservation to be useful. In the words of Dr. Kurt Benirschke, "you must collect things for reasons you don't yet understand."
  3. The time to start preserving is when you're reasonably confident you can do the preservation, not after you've demonstrated how to use what you're preserving.[3] The San Diego Frozen Zoo successfully preserved animal cells in the early 70s, before anyone knew very much about what DNA meant or how proteins folded or how to manipulate DNA. They didn't know how or if the cells they were preserving would ever be used. But despite their ignorance, the chemical and biological knowledge of the 1970s was up to the modest task of showing that preserving even a few animal cells almost certainly preserved many copies of the animal's genome, and that a few preserved cells would likely be sufficient to "remember" what a species was, and that was clearly the right time to start.[4] 

I believe that today when it comes to preserving people we're in an analogous position with neuroscience as we were with genetics in the 1970s: We have more than sufficient knowledge to confidently preserve, but not enough to do much with what we're preserving. Yet.

Preserving People

Why do I think the lessons of the San Diego Frozen Zoo apply to human end-of-life preservation?

I'll start with a neuroscience overview. What's your brain physically made of? How does it store information? I'll only cover the basics, the stuff that was already well-established more than 20 years ago,  because we don't need more than the basics. Like Dr. Benirschke of the San Diego Frozen Zoo, we don't need a complete understanding of neuroscience to know enough to preserve.

Then I'll talk about how fixation physically works and what it can and can't preserve.

Next I'll briefly touch on Deep Hypothermic Circulatory Arrest, which I've written about before. DHCA demonstrates that we don't have to preserve dynamic brain activity, only structure, because people recover from having that activity "zeroed out".

Finally I'll put it all together in information-theoretic terms and develop a formal definition of adequacy for a preservation technique. Fixation is good at preserving structure, but is it good enough? We'll use the framework to evaluate whether fixation can preserve what we really care about: people.

What does neuroscience say about how the brain encodes information structurally?

Top level summary: it turns out that to create a long-term behavioral change, neuroscience says you must physically change multiple synapses. Synapses, broadly, are the durable physical trace of memory we're looking to preserve.

Important: I don't need neuroscience to be "complete" to evaluate whether preservation works. I need to understand the basics of what the brain's made of and what's physically different between different people. These are basic facts we need to know, and we've had the basics for a while—none of it has changed for decades. This section is here, not to review advanced neuroscience, but to celebrate that we really do know the basics of the brain enough to justify preservation.

Conversations with neuroscientists

If you walk up to a neuroscientist and say: "hey we know a lot about neuroscience, so how soon before we upload the first person like I saw in 'Pantheon'?", that neuroscientist will probably say something along the lines of "We know nothing about the brain. We're so far from uploading that it won't happen for 100 years. There are major open questions in neuroscience about how the brain works, individual cells have vast complexity, and we can't even simulate a C. elegans yet. No one knows how memory works—we're still working on decoding even the simplest memories and there are all kinds of theories."

Now imagine you walk up to that same neuroscientist and say: "No one knows anything about the brain! Despite the efforts of science it remains a complete mystery! For all we know, a rock could be conscious. Maybe even the whole universe is conscious! Isn't that neat?" That neuroscientist would probably say something like: "What do you mean 'we don't know anything about the brain'? We know a lot about the brain! Neuroscientists have done 75 years of amazing work since Hodgkin and Huxley figured out how the ionic dynamics of the action potential worked. We can erase memories by altering synapses. We can create false memories in mice using optogenetics. We've spent decades working out how the biochemistry of synapses works and we have what amounts to a 'parts list' at a proteomic level. We've mapped the fruit fly connectome and accurately simulated its visual system. We actually know quite a bit about how the brain works, how memories are formed, how it processes information. There are a lot of mysteries, sure. We don't know how a lot of stuff works at a systems level. But we know a lot about the basics. A rock is not conscious."

What's inside your head?

Let's talk about what scientists do know about the brain, starting with its basic anatomy. When you open up a skull and look inside, what do you see?

The large-scale: white matter and grey matter

First off, your brain has two obviously different parts to it: 500 ml of white matter and 650 ml of grey matter. There's also around 200 ml of apparently empty space (ventricles) filled with clear fluid (cerebrospinal fluid, or CSF) (Irimia 2021).

Silver-stained human brain coronal section from the Michigan State University Brain Biodiversity Bank. Color inverted, grayscaled, contrast adjusted, lightly edited for clarity (original image). The white parts are white matter. The grey parts are grey matter. The corpus callosum, the white matter band which connects the two hemispheres, is visible in the center as the sole visible connection between the hemispheres. The ventricles, spaces in the brain that are normally empty of neurons and filled with fluid, are the dark oval regions in the center of each hemisphere. The dark speckles everywhere in the white and grey matter are small arteries and veins (the smallest blood vessels, the capillaries, are too small to see in this image). My impression from studying images like this is that the brain is basically a  sheet of grey matter connected to itself via the white matter, penetrated throughout by blood vessels.

The white matter, visually, looks like a vast bundle of wires connecting the grey matter to other parts of itself.  The grey matter is where the neuron cell bodies live. You may think, as I used to, that neurons are very small. This is not the case! A single projection neuron in the right hemisphere might grow an axon that extends across the corpus callosum and connects with another neuron in the grey matter of the left hemisphere. That's a single cell that's 10 cm long! The bundles of fibers in the white matter are all literally extensions of the neurons in the grey matter.

There's some blood in your brain in addition to the white and grey matter, but probably not as much as you think. All the blood vessels in the brain amount to about 50 ml in total[5].  Around half is capillaries, each itself as wide as a single red blood cell, and the other half is larger blood vessels. Only capillaries are thin enough to allow the interchange of oxygen and sugar between blood and brain which nourishes each of your neurons.[6] Capillaries penetrate every part of the brain, and a brain cell is never very far from one. Capillaries are much denser in the grey matter (~5.5% of volume) where the cell bodies are, and sparser in the white matter (~1.5% of volume) (Lu 2005, Gould 2016).

What does the brain look like at a microscopic level?

The vast majority of the grey matter, around 75%, is "neuropil", with only about 15-25% being cell bodies and blood vessels (Cano-Astorga 2024).

That's a lot of volume dedicated to "neuropil"! What's neuropil made of? It's almost entirely synapses, axons, dendrites, and glial cells. Synapses, axons, and dendrites are all different parts of the anatomy of the neurons, whose cell bodies live in the grey matter. Axons are, broadly, the "output" part of the neuron, and dendrites are the "input" part. Synapses are what join the outputs to the inputs.

Within the neuropil itself, axons and dendrites each take up ~33% of the total volume (Karbowski 2015). Glial processes take up ~14% of the volume. Synapses[7] take up the remaining ~20% (Wilhelm 2014).

White matter has far fewer cell bodies and blood vessels compared to grey matter, being almost entirely composed of long-range "wires" (axons) connecting neurons in different regions of grey matter across centimeters (Coelho 2018). It's essentially neuropil that's all "output".

Electron micrograph from a rabbit brain I preserved, showing the boundary between white (top) and grey (bottom) matter. The big white holes are capillaries, each the size of a single red blood cell. The grey matter has a few neuronal cell bodies but the majority of it is composed of synapses, axons, and dendrites. The white matter is almost entirely myelinated axons and the oligodendrocytes that support them. From the Brain Preservation Foundation.

What about energy use?

Energy is not used frivolously in biology. If something is using energy, it's because it's doing something important. Doubly so if it's using a disproportionate amount of energy.

How does a person spend their internal energy? First off, the brain is hungry! Your brain consumes 20% of your body's energy, but it's only around 2% of your body's mass (Mink 1981). The brain is important energetically.

How does the brain spend its energy? At a high level, the white matter takes about 25% and the grey matter takes the other 75%, despite them being approximately the same volume (Harris 2012). The grey matter is important energetically.

How does the grey matter spend its energy? First, around 25% of the total energy is used to continuously rebuild the proteins and other macromolecules of each cell (housekeeping). About 15% of the energy is used to create action potentials, and an additional 15% is used to maintain baseline polarization of neurons at around -70 mV (keeping the lights on). The rest of the energy (45%) is used to power synapses (Howarth 2012), despite them being 20% of the grey matter's volume. Like the brain itself, synapses consume disproportionately large amounts of energy.
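The budget above can be tallied in a few lines. This is only a sketch using the percentages quoted in the text; the variable names are illustrative, and the "energy per unit volume" figure is a simple ratio of the two quoted shares, not a measured quantity.

```python
# Grey-matter energy budget, in percent, using the figures quoted above.
grey_matter_energy = {
    "housekeeping": 25,        # rebuilding proteins and other macromolecules
    "action_potentials": 15,   # generating spikes
    "resting_potential": 15,   # holding neurons near -70 mV
    "synapses": 45,            # powering synaptic transmission
}

# The four categories should account for the whole budget.
total_percent = sum(grey_matter_energy.values())

# Assumption: synapses occupy ~20% of the volume, as stated in the text.
synapse_volume_percent = 20

# How much more energy synapses use than their share of volume would suggest.
synapse_energy_per_volume = grey_matter_energy["synapses"] / synapse_volume_percent

print(f"Budget adds up to {total_percent}%")
print(f"Synapses use {synapse_energy_per_volume:.2f}x their volume share in energy")
```

On these numbers, synapses use a bit more than twice the energy their volume alone would predict.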

Synapses seem important!

Synapses are each around half a femtoliter in volume (Wilhelm 2014, Benavides-Piccione 2012) and you have around 250 trillion in total (Tang 2001).[8] 
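As a rough consistency check, the per-synapse volume and the synapse count can be multiplied out and compared with the anatomical fractions given earlier. The sketch below assumes the round numbers quoted in this post (650 ml of grey matter, ~75% neuropil, ~20% of neuropil being synapses); the constant names are mine, and the two estimates are only expected to land in the same ballpark.

```python
# Estimate 1: count times per-synapse volume.
SYNAPSE_COUNT = 250e12        # ~250 trillion synapses
SYNAPSE_VOLUME_FL = 0.5       # ~half a femtoliter each
FL_PER_ML = 1e12              # 1 ml = 1e-3 L and 1 fL = 1e-15 L, so 1 ml = 1e12 fL

volume_from_count_ml = SYNAPSE_COUNT * SYNAPSE_VOLUME_FL / FL_PER_ML

# Estimate 2: anatomical fractions quoted earlier in the post.
GREY_MATTER_ML = 650
NEUROPIL_FRACTION = 0.75      # share of grey matter that is neuropil
SYNAPSE_FRACTION = 0.20       # share of neuropil that is synapses

volume_from_anatomy_ml = GREY_MATTER_ML * NEUROPIL_FRACTION * SYNAPSE_FRACTION

print(f"From count:   {volume_from_count_ml:.1f} ml")
print(f"From anatomy: {volume_from_anatomy_ml:.1f} ml")
```

The two estimates come out at roughly 125 ml and 98 ml, which is reasonable agreement given how approximate the inputs are.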

A part of a single pyramidal neuron in a human brain. The panel on the right shows a small section of the neuron's dendrites with synapses visible. A neuron like this might have 10,000 synapses in total. Most of the volume of a neuron does not exist in its cell body, but instead in its dendrites, axon, and synapses. The majority of a neuron's energy is spent at its periphery. From Benavides-Piccione, Ruth, et al. "Age-based comparison of human dendritic spine structure using complete three-dimensional reconstructions." Cerebral cortex 23.8 (2013): 1798-1810.

What do those hundreds of trillions of synapses, using so much of the brain's energy, do?

Synapses change when memories change

Synapses change shape when memories are formed (Choi 2021). The physical changes synapses undergo in response to learning aren't subtle, often doubling or halving their size (Matsuzaki 2004). You can label which synapses change during memory formation, and if you "reset" just those synapses, you erase that specific memory, and not other memories learned both shortly before and after the memory you delete (Hayashi-Takagi 2015). When you disrupt synaptic plasticity machinery, you can temporarily prevent long-term memory formation (Goto 2021). You can create false memories by artificially strengthening new synapses (Vetere 2019). Synapses operate at millisecond timescales and nerve impulses travel quickly throughout the brain and body, which are exactly the right dynamics for the speed of our thoughts. Synapses can last a lifetime (Bhatt 2009, Yang 2009).

An image of two living synapses taken using super resolution STED microscopy. This is what they really look like in the living state, using the highest-resolution microscopy we currently have available. You have ~250 trillion of these nanoscale devices in your head, right now, consuming slightly less than half of your brain's total energy to read and think about this image. From Willig, Katrin I., et al. "Multi-label in vivo STED microscopy by parallelized switching of reversibly switchable fluorescent proteins." Cell reports 35.9 (2021).


Here's what synapses look like at a molecular level

An accurate model of a synapse with about a third of its proteins (the ones involved in vesicle transport, around 300,000) shown, along with an actual synapse I preserved and then imaged with an electron microscope. You're looking at around half a femtoliter in volume, and around one million proteins total within that volume. Note that the EM image is lower resolution than the model. This is a limitation of EM, not the underlying preservation! Synapse model from Wilhelm, Benjamin G., et al. "Composition of isolated synaptic boutons reveals the amounts of vesicle trafficking proteins." Science 344.6187 (2014): 1023-1028.

I didn't appreciate this when I first started preserving brains, but a synapse (as well as every cell in the body) is absolutely full of proteins, as you can see in the picture above (which again is only showing around 1/3rd of the proteins!). Before I saw models like this, I thought that cells were mostly empty bags of water with proteins elegantly doing their thing, gliding past each other with plenty of "elbow room" between proteins, as is often depicted in many visualizations. I now find myself surprised that cells are even liquid at all, instead of solid peptide blocks[9].

Synapses change size by fractions of a femtoliter in response to memory formation, which changes their "strength" (how much they influence the neuron to which they connect, electrophysiologically). What changes when a synapse changes size by a fraction of a femtoliter? A volume like that is small at our scale, but huge at an atomic scale. If a synapse expands by half a femtoliter, it's adding roughly 500,000 additional proteins, each containing around 10,000 individual atoms (Wilhelm 2014).[10]

Synapses are durable

We saw in my previous post that the surgical procedure called deep hypothermic circulatory arrest (DHCA) cools people to 16°C, stops respiration and circulation, effectively zeros out the dynamic electrical state of the nervous system, yet doesn't erase long-term memory. We need some durable physical substrate of memory that survives cooling to explain this. Do synapses physically survive the cooling used during DHCA, unlike the brain's electrical activity? Yes! (Xie 2012)

Synapses are the physical basis of learning and memory

When we look inside the brain we see, essentially, two-hundred-and-fifty trillion femtoliter-sized switches which grow and shrink by fractions of a femtoliter in response to learning.

The actual cell bodies in the grey matter don't change in response to learning. Neither do the long-range connections in the white matter, or the blood vessels spread throughout the brain. But in the neuropil, we see that it's synapses that change in response to learning, while being physically robust enough to survive DHCA[11]. That's why I believe that synapses are the physical trace of memory. What does the broader neuroscience literature say?

What do neuroscience review papers and textbooks say?

The field of neuroscience is broadly in consensus, and has been for many decades at this point: synapses are the physical basis of learning and memory.

I think it’s worth understanding just how established this picture of the mind and brain is. So here are twenty distinct sources from noteworthy neuroscience papers and textbooks that all point to the same bottom line. (Emphasis added):

“Perhaps the most striking finding in the cell biology of memory is that the consolidation and long-term storage of memory involves transcription in the nucleus and structural changes at the synapse. These structural components of learning-related synaptic plasticity can be grouped into two general categories: (1) remodeling and enlargement of preexisting synapses, and (2) alterations in the number of synapses, including both the addition and elimination of synaptic connections.”
Bailey, Kandel, and Harris 2015

“The classic view is that items are embedded in long term memory via specific synaptic modifications, and presentation of these items leads to activation of stable activity patterns in the network (‘attractors’).”
Barak and Tsodyks 2014

“Learning is primarily mediated by activity-dependent modifications of synaptic strength within neuronal circuits.”
Bittner et al. 2017

“[The] ability of synapses to individually change their structure and composition in a long-lasting way is an essential mechanism for synaptic plasticity and represents the cellular basis of learning and memory.”
Bosch et al. 2014

“Today, it is generally accepted that the neurobiological substrate of memories resides in activity driven modifications of synaptic strength and structural remodeling of neural networks activated during learning.”
Bruel-Jungerman et al. 2007

“Long-lasting changes in the synaptic connectivity between neurons are generally accepted to be crucial for the establishment and maintenance of memories.”
Gobbo et al. 2017

“It is generally believed that changes in the synaptic connections between neurons play a major role in learning and memory formation. While short-term memory might rely mainly on the strengthening and weakening of pre-existing synapses, long-term storage of information is thought to require structural reorganization of neuronal networks, the formation of new synapses and the loss of existing connections.”
Hofer and Bonhoeffer 2010

“One of the chief ideas we shall develop in this book is that the specificity of the synaptic connections established during development underlie perception, action, emotion, and learning.”
Kandel et al. Principles of Neural Science 2021

“Synaptic plasticity is generally accepted as the principal implementation of information storage in neural systems.”
Kukushkin and Carew 2017

“In the quest for the physical substrate of learning and memory, a consensus gradually emerges that memory traces are stored in specific neuronal populations and the synaptic circuits that connect them.”
Lu & Zuo 2021

“From a neural circuit point of view, learning is a process to transform a neural network to adapt to the environment, and memory is the state of maintaining such a network… Various forms of synaptic plasticity, the persistent change in synaptic efficacy, are widely believed to be the cellular substrate underlying learning and memory. Among them, long-term potentiation (LTP) and long-term depression (LTD), two opposite forms of synaptic plasticity, have been studied most extensively. LTP and LTD were initially discovered by electrophysiological recording, but subsequent research has revealed accompanying morphological changes in dendritic spines.”
Ma & Zuo 2021

“It is widely believed that encoding and storing memories in the brain requires changes in the number, structure, or function of synapses… This axiomatic view that synaptic plasticity is critical for learning and memory is supported by data derived from many different memory systems, neural circuits, and molecular pathways mediating an array of different behaviors.”
Maren 2005

“Changing the strength of connections between neurons is widely assumed to be the mechanism by which memory traces are encoded and stored in the central nervous system… We conclude that a wealth of data supports the notion that synaptic plasticity is necessary for learning and memory…”
Martin, Grimwood, and Morris 2000

“We now understand in considerable molecular detail the mechanisms underlying long-term synaptic plasticity and the importance that such plastic changes play in memory storage, across a broad range of species and forms of memory. One surprising finding is the remarkable degree of conservation of memory mechanisms in different brain regions within a species and across species widely separated by evolution.”
Mayford, Siegelbaum, and Kandel 2012

“Memories are believed to be stored as long-lasting structural changes in synapses.”
Moczulska et al. 2013 

“[I]n the last 10 years findings from this field have provided key contributions towards establishing the idea that stable, long-lasting changes in synaptic function underlie learning and memory.”
Silva 2003

“Considerable evidence suggests that the formation of long-term memories requires a critical period of new protein synthesis… Studies in mammals have demonstrated that bidirectional changes in synaptic growth accompany synaptic plasticity.”
Sutton & Schuman 2006

“At the molecular level, the formation and consolidation of long-term memory are thought to be ultimately expressed in the form of structural changes at synapses.”
Wittenberg, Sullivan & Tsien 2002

“Our findings reveal that rapid, but long-lasting, synaptic reorganization is closely associated with motor learning. The data also suggest that stabilized neuronal connections are the foundation of durable motor memory.”
Xu et al. 2009

“The obvious site to compactly store information is at the synapse. Storage occurs by changing its transfer 'weight,' that is, its ability to excite or inhibit a postsynaptic neuron. Since the synapse is the key site for processing information, storing it there avoids additional wire for relay. Moreover, information stored directly at a synapse can be retrieved directly—also avoiding additional wire. In short, as we peruse a blueprint of brain design, we should not seek a special organ for 'information storage'—it is stored, as it should be, in every circuit.” (Chapter 14)
Sterling and Laughlin 2017 Principles of Neural Design

What does chemical fixation do?

Our protocol for preserving people at Nectome calls for chemical fixation of every cell via vascular perfusion of aldehydes. We use an aldehyde called glutaraldehyde to achieve fixation. It's fixation that's our primary and most important method for achieving preservation.

I believe that fixation, as used in Nectome's method, preserves the microscopic and large-scale anatomy of a person's brain and body, including, importantly, all synapses. I believe that in addition to structure, fixation additionally preserves almost all biological macromolecules present in a person's entire body including proteins, nucleic acids like DNA and RNA, and lipids, in almost the same configuration they had during life.

Why do I believe this?

What does glutaraldehyde actually do?

Glutaraldehyde is a kind of aldehyde (formaldehyde is also an aldehyde) used to preserve tissue. You couldn't see a single molecule of glutaraldehyde if you tried to accurately draw it in any of the previous images in this post, because it would take up less than 1% of a pixel even in the earlier image showing synaptic proteins. Glutaraldehyde has a molecular weight of 100.12 g/mol, while the individual proteins pictured are around 30,000 g/mol. During preservation, we flood the vascular system with glutaraldehyde. It crosses cell membranes in seconds (Leung 2001, Walter 1986) and starts crosslinking proteins to themselves and to other proteins. After around 60 seconds, the cytoplasm forms a gel that traps essentially all proteins, DNA, lipids, etc. in place (Huebinger 2018).

Why do I believe that proteins are preserved?

Immunohistochemistry

Why do I think that proteins are still present after fixation? Mainly because we can still observe proteins after fixation: the field of immunohistochemistry is built on measuring the positions and amounts of proteins in cells using antibody staining. One of the first steps of preparation of tissue for immunohistochemistry is to fix proteins with aldehydes. Check out the Supplemental Figures from (Wilhelm 2014), the same paper the protein-level synapse model from the last section is from. Those researchers studied the vesicle transport proteins that make up about 1/3rd of the total proteins by weight in a synapse. That's ~300,000 total proteins split into 62 different kinds of proteins. (A synapse has around 1,000-2,000 different kinds of proteins and ~1,000,000 total proteins in half a femtoliter.) For each of those 62 proteins, they used antibodies after fixation to find where they are inside the synapse. That shows that the proteins are still there and that they're still identifiable with antibody labeling. Not a single one of the proteins they examined was removed by the fixation process. Those proteins were selected because they are part of vesicle transport, not because they are known to be preserved by glutaraldehyde, so it's likely that most other proteins are likewise preserved.

Bulk protein measurements

What if, in a hypothetical world, fixation just removed half of all the proteins but kept the other half? Then immunohistochemistry might find that "all the different kinds of proteins are present" even though a substantial number are lost in an absolute sense. How can we distinguish between "extractive fixation" and "comprehensive fixation"?

I believe that most of the "stuff" present before preservation is still there after fixation, because bulk protein measurements can't measure a difference in protein content between fixed vs frozen tissue. In (Ostasiewicz 2010), the researchers took rats and measured the protein content of fresh-frozen brain tissue vs the protein content of brain tissue that they fixed and then paraffin embedded, which includes total removal of all water and many lipids via alcohol and xylene dehydration and infiltration of paraffin wax into the tissue. Here are their results:

(Ostasiewicz 2010) measured whether proteins are extracted "in bulk" after chemical fixation + harsh chemical treatment afterwards. They didn't find any measurable difference between the samples in terms of protein content. They don't find any difference in peptide distribution either. The SDS-PAGE results are blurred after fixation and have extra "heavy" stuff and less "light" stuff, which is exactly what you'd expect from crosslinking.

Protein content after fixation is my second-favorite null result in science.[12] We'll get to my favorite later.

Other biological macromolecules like lipids (Morgan 1967, Leist 1986) and DNA (Tokuda 1990)[13] are also retained during fixation.

Why do I believe microscopic anatomy is preserved?

It's useful to know that biomolecules are likely preserved by fixation, but it's possible that biomolecules would be retained while the microscopic anatomy is scrambled. How do we know, for example, whether synapses are created, destroyed, or moved during fixation, even while the underlying biomolecules are preserved? Suppose that during the 60 seconds of fixation, before the cytoplasm gels and further microscopic movement becomes impossible, a neuron writhes and randomly disconnects and reconnects its synapses. That would result in potentially normal-looking microanatomy with all proteins retained, but the preserved microstructure would not accurately reflect the living microstructure. How do we know, when we look at seemingly well-preserved tissue, that the connections we see are the same connections that were present before preservation?

I believe that fixation preserves the brain's microanatomy because of the correlative microscopy studies that have been done where researchers used super-resolution two-photon microscopy to take a picture of a section of a single neuron and its synapses, preserved the entire brain with fixatives, and then found that exact neuron again and imaged it with electron microscopy.

The image labeled "In Vivo" above is a super-resolution light micrograph of a piece of a single neuron, taken from a mouse brain during life. The one labeled "EM" is post-preservation (and the entire process of dehydration,  staining, and embedding for electron microscopy). The result is exactly what you'd expect if fixation preserved microanatomy in addition to biomolecules: the "EM" image is basically a higher resolution image of the "In Vivo" image. From Wright, W. J., Hedrick, N. G., & Komiyama, T. (2025). Distinct synaptic plasticity rules operate across dendritic compartments in vivo during learning. Science, 388(6744), 322-328.

This isn't a one-off image, it's an example taken from a paper from the large and growing field of Correlated Light and Electron Microscopy (CLEM). This particular paper impressed the Brain Preservation Foundation enough to be nominated for its Aspirational Neuroscience Prize.

I'd be surprised if fixation altered the brain's synaptic connections—I don't know any compelling first-principles reason for fixation to disrupt synapses during crosslinking (on the contrary, it should stabilize them), and when people directly measure physical changes from fixation, they find the same synapses before and after. Fixation directly altering synapses in a way that still looks anatomically normal afterward would, however, be the kind of thing that could invalidate Nectome's preservation protocol, even in spite of us winning the brain preservation prize, so I take it seriously.

Deep Hypothermic Circulatory Arrest teaches us that we don't need to preserve dynamic activity

So far we've talked about the physical structure of the brain and fixation's ability to preserve that structure. But what about the dynamics—the second-to-second changes in ion concentration, neurotransmitters, voltages, etc?

I don't believe dynamic activity is necessary to preserve. I realize this is an extremely convenient belief for someone who runs a preservation company that's really good at preserving structure and unable to preserve dynamics. But the causality actually goes the other way: when I first learned that dynamic brain activity can be zeroed out without loss of information, in Sebastian Seung's intro to neuroscience class at MIT, that's what actually got me interested in preservation in the first place.

Why do I think the brain's dynamics can be zeroed out without loss of information? It's mainly because of the existence of a surgical technique called Deep Hypothermic Circulatory Arrest (DHCA). As the name implies, during DHCA a patient is cooled to around 16°C, at which point their heartbeat, their breathing, and (most importantly for the project of preservation) their brain activity stop completely (Stecker 2001). Why would a surgeon want to cool someone to 16°C? Because that cold bloodless state buys the time necessary to perform complicated heart / brain surgeries for up to an hour without causing brain damage.

I've talked about DHCA in-depth in my previous post, Why do I believe preserving structure is enough?. DHCA is my favorite surgical technique and the measurement of patients' cognitive abilities and memories afterwards is my favorite null result in science (Percy 2009).

From (Stecker 2001). A human patient's EEG going to zero as they're progressively cooled. This, along with similar results from research in ischemia (lack of blood flow), has convinced me that preservation of only structure (and not dynamic activity) is likely sufficient to preserve a person. It's actually what inspired the whole project!

Biological attractors mean information is stored redundantly

What if there's a protein of some kind that's uniquely critical for memory, has low copy number, and is lost during fixation but not during DHCA? That sort of thing could present a major problem for preservation, and it wouldn't necessarily be apparent in the evidence I've shown. Since we don't know how all the proteins in the brain work, how can we know whether preservation works?

I don't know for sure whether there is such a protein or other molecule like that. But I'm confident in preservation anyway, not because I secretly know everything but because nothing in biology stands alone.

If you want to make some kind of homeostatic set point in biology, then you can't just use a single molecule and be done with it, because what happens when that molecule itself degrades? Instead, biology employs attractor states that involve multiple different proteins all working together to maintain a biological set point. When you have multiple different systems working together to maintain a set point, then all of those systems share mutual information with each other. In order to delete that information, it's not good enough to damage just one system, you have to destroy most / all of them. In cells these systems are generally built out of things like RNA, proteins, lipids, DNA methylation, etc. And all of these things are preserved by fixation.

Take AMPA receptor trafficking, for example.[14] AMPA receptors are glutamatergic ion channels which "mediate the overwhelming majority of fast excitatory neurotransmission in the central nervous system (CNS) and are critically important for nearly all aspects of brain function, including learning, memory, and cognition." (Henley 2013). Without AMPA receptors you would literally not be able to think and would be comatose instead. They're working right now as you read the words on this page.[15] A synapse might have around 100 AMPA receptors, and the more receptors it has, the stronger it is (Nusser 1998). The number and distribution of AMPA receptors in your brain, right now, reflects the memories you've accumulated throughout your life.

And yet, an individual AMPA receptor's half-life is only 30 hours (Archibald 1998)—the very ion channels you're using to think with right now are probably completely different molecules than the ones you had last week![16]
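To put a number on "probably completely different molecules", here is a back-of-the-envelope sketch. It assumes simple exponential decay with the 30-hour half-life quoted above and treats every receptor as independent; both are simplifying assumptions of mine, not claims from the cited paper.

```python
HALF_LIFE_HOURS = 30.0      # AMPA receptor half-life quoted above
HOURS_PER_WEEK = 7 * 24     # 168 hours

def fraction_remaining(elapsed_hours, half_life_hours=HALF_LIFE_HOURS):
    """Fraction of an original population left after simple exponential decay."""
    return 0.5 ** (elapsed_hours / half_life_hours)

surviving_after_week = fraction_remaining(HOURS_PER_WEEK)

# Assumption: a synapse with ~100 AMPA receptors, as mentioned in the text.
RECEPTORS_PER_SYNAPSE = 100
expected_survivors = RECEPTORS_PER_SYNAPSE * surviving_after_week

print(f"Fraction of last week's receptors still present: {surviving_after_week:.1%}")
print(f"Expected survivors out of {RECEPTORS_PER_SYNAPSE}: {expected_survivors:.1f}")
```

Under these assumptions only about 2% of the receptors from a week ago remain, or roughly two of the hundred at a typical synapse.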

What does persist is the pattern. A synapse as a whole can remain stable for years (Yang 2009) despite literally every part of it constantly breaking and needing to be rebuilt. How does a synapse do it? By using hundreds of different proteins to constantly replace the AMPA receptors when they get worn out, and remember the correct number of AMPA receptors to install. Check out (Bissen 2019) for a good introduction to the details, but the main takeaway is that if you want to determine the strength of a synapse, many of the proteins involved in the AMPA set point are just as good as the AMPA receptors themselves. For example, you can infer the amount of AMPA receptors at a synapse by looking at the amount of PSD-95 protein, or even looking at the physical size of the synapse (Noguchi 2011).

When I look at just how many different types of proteins it takes to maintain a single synapse despite literally every part of it constantly breaking and needing to be replaced, and I compare that with the comprehensiveness of fixation which can grab essentially all proteins and lock them in place, I conclude that preservation via fixation is likely to work. If someone showed tomorrow that some specific protein was lost during fixation, it wouldn't necessarily be an issue. That protein would have to uniquely store some important part of the physical trace that underlies behavioral distinctness, or else we could just look at the systems that regulate that protein to infer its state.

Information theory ties it all together

So far I've shown three things:

  1. It's the current consensus of the neuroscience community that the brain physically stores information using synapses, femtoliter-sized structures that physically change in response to memory formation.
  2. Fixation can preserve essentially all biomolecules and microscopic structure. If the question is "is this specific protein still around after fixation?", the answer is very likely yes, for any protein. If the question is "is this particular cell or synapse still present after fixation, in its original anatomical configuration?", the answer's yes, for every cell and synapse in the entire body.  
  3. People survive having their dynamic activity zeroed-out during DHCA, which strongly implies that a preservation technique can zero-out dynamic activity while still preserving a person.

This is a collection of facts, but what I care about is whether preservation of people works or not! Can we do better than simply waiting for a revival to happen? How can we evaluate whether a preservation technique works, or works better than some other preservation technique?

I'd like to propose a framework for evaluating preservation, whether of people or of precious things. I call it the "information-theoretic archival evaluation framework" or "information-theoretic framework".  

In information theory, a transformation that preserves information[17] is called an injective mapping, where data is moved to a different format or system, but all the information is still there.  Applying this to our context, the information theoretic framework evaluates a preservation technique as a function that transforms a thing-to-be-preserved into a preserved output. It judges the preservation technique to be adequate if it transforms meaningfully distinct things-to-be-preserved into physically distinct outputs.[18] 

Injectivity means that the transformation keeps different things different. Injective functions are information-preserving. The left function is injective; it's possible to go from each letter back to its originating number. The right function is not injective. Information has been lost because it's unclear how to go back from "C" to a unique number. From https://en.wikipedia.org/wiki/Injective_function 
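As a small illustration of the injectivity test described above, here is a minimal Python sketch. The two mappings are made-up examples in the spirit of the figure (they are not copied from it), and the function names are hypothetical.

```python
from collections import defaultdict


def find_collisions(mapping):
    """Return every output that is reached from more than one input."""
    preimages = defaultdict(list)
    for source, target in mapping.items():
        preimages[target].append(source)
    return {t: s for t, s in preimages.items() if len(s) > 1}


def is_injective(mapping):
    """A mapping is injective when no two inputs share an output."""
    return not find_collisions(mapping)


# Illustrative values only (an assumption, not the figure's exact data).
left = {1: "D", 2: "B", 3: "A"}   # every letter has exactly one source
right = {1: "D", 2: "C", 3: "C"}  # "C" has two sources, so information is lost

print(is_injective(left))
print(find_collisions(right))
```

In preservation terms, a collision corresponds to two meaningfully distinct inputs that end up in the same preserved state, which is the failure the framework is designed to detect.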

I think the information-theoretic framework for evaluating preservation is the right standard to use today. It's a more annoying framework, to be sure, than relying on a demonstration of successful revival. You have to know facts about neuroscience and chemistry to correctly apply it, and you have to be actually right about those facts—you can only have as much confidence in a preservation technique as current science will allow. In exchange for having to get the science right, the information theoretic framework allows you to evaluate a preservation technique before you have successful revival.

Behavioral distinctness is a sufficient measure of difference when it comes to preserving people

With the information theoretic evaluation framework, we judge a preservation technique to be adequate if it transforms meaningfully distinct things-to-be-preserved into physically distinct outputs. In order to apply the information theoretic evaluation framework to end-of-life preservation, we need to know two things:

  1. What does it mean for one person to be "meaningfully distinct" from another?
  2. Does end-of-life preservation preserve that distinctness?

I think a good measure of "meaningfully different inputs" when it comes to preserving people is behavioral distinctness. Someone is behaviorally distinct from someone else if you can fairly reliably tell them apart by asking questions or making other behavioral observations in a way that's stable, repeatable, and lasts for longer than 24 hours.[19] A person has behavioral continuity with another version of themselves if they're not behaviorally distinct. Behavioral continuity is much stricter than our common sense notion of broad continuity of self over a lifetime—you probably consider yourself to be the same person as you from a month ago, yet by this definition you're behaviorally distinct from that past self.

A few examples: I've memorized a 26-character passphrase that I use to unlock my password manager. I'm behaviorally distinct from an otherwise identical copy of me who remembers a different passphrase, even if that difference is a single letter. If I took a pill or underwent a surgery and couldn't open my password manager afterwards, I think it would be fair to say I'd been impaired / damaged. I'm behaviorally continuous with a version of me that had a different breakfast a few days ago, because neither of us remembers what we ate a few days ago. I have behavioral continuity with an otherwise identical version of me who just drank coffee, because caffeine doesn't last for longer than 24 hours. I'm behaviorally continuous with a Star Trek style transporter copy of me, provided the copy is made competently. Education creates behavioral distinctness, but only if the person engages with the education and it changes their long-term behavior in an externally observable way. Anesthesia preserves behavioral continuity. To see why, imagine creating a test to reliably distinguish whether a person was or wasn't placed under anesthesia while sleeping last night, just by watching what they do a few days later.

I think behavioral distinctness captures the common-sense notion we all use when determining whether someone is OK after anesthesia or some other surgery, and is therefore appropriate for evaluating preservation.[20] 

Conclusion: let's preserve today, with confidence

In summary:

  • We can evaluate whether preservation works before we can revive. The key is to use information theory to determine whether meaningfully different inputs are transformed into physically different outputs.
  • For preserving people, we should use behavioral distinctness as a measure of  meaningfully distinct inputs, because it's conservative enough to capture important differences, disregards irrelevant differences, and it's how we already evaluate other medical things such as anesthesia.
  • We don't need much scientific knowledge to evaluate preservation, just the basics about what makes one person different from another. Those basics tell us that behaviorally distinct people, even if they only differ by a single character in a memorized password, must have multiple different synapses.
  • Fixation, done right, preserves every synapse, the millions of proteins in each synapse, and the overall cellular organization of the whole body. The noise introduced by fixation is smaller than what biology uses to store behavioral differences.
  • Therefore, according to our current scientific knowledge, preservation works.
  • We may be wrong. That's the cost of preserving before you can revive. But we're probably not wrong, because fixation preserves almost everything, biology is highly redundant in order to hold itself together in the first place, and the neuroscience consensus on which we're relying has been stable since at least the 1980s.
  • Preservation is not revival. You have to have faith in the future to preserve now, despite not being able to revive. But this kind of faith is historically correct. Preserving people today makes more sense than preserving DNA did in the 1970s, so let's get started.

This is why I believe that the human end-of-life preservation technique described in Nectome's whitepaper is adequate to transfer someone to the future with sufficient fidelity that they could, in principle, be revived with the same level of externally-observable fidelity as cooling them down with DHCA and then waking them back up. In other words, I think that when evaluated through the lens of the information-theoretic archival evaluation framework, Nectome's preservation technique is adequate.

  1. ^

    I'm not a neutral dispassionate observer here! And despite my best efforts, I'm biased in favor of preservation. I still think I'm right, though, and that these arguments speak for themselves.

  2. ^

    Consider the easiest part of this: storage. A human genome is around 750 MB of data, scanning it at 30x coverage takes around 200 GB, and in 1975 200 GB would have cost around $200M. Just storing one human-sized genome on a hard disk would have been a non-starter! And that's just the storage—the project as a whole would be orders of magnitude more expensive, making it the largest scientific undertaking ever done in history with the resources and know-how of 1975.

  3. ^

    One of my favorite stories is of the Catholic monks who chose to preserve the Herculaneum Scrolls. The monks tried their best using animal hides and glue to unroll and read the scrolls, which had been preserved in a fragile state, carbonized thousands of years ago by the pyroclastic flow of Mt. Vesuvius. They failed. But the monks had faith in the future and well-calibrated humility. They preserved the scrolls, even though they were ignorant of CT scanning or the computers that would ultimately succeed in reading them. To me, this was an act of profound faith and love of the future.

  4. ^

    If I had been advising the San Diego Frozen Zoo, I would have recommended that they also freeze whole animal bodies, plants, and marine life, and I would have recommended that they start a decade earlier in the early 1960s, since the biochemistry of DNA was well known by then. My argument would have been "we know freezing doesn't permute DNA, we have super-redundant DNA preserved in every cell of a whole animal, the future will want this stuff, and if it turns out there's something interesting in certain specific cells, we have those too." This strategy would have proven effective today.

  5. ^

    About two thumbs' worth.

  6. ^

    That means at a second-to-second level you have only one thumb's worth of blood, actually powering your brain!

  7. ^

    by "synapse" I mean the entire assembly of pre- and post-synaptic machinery, including pre-synaptic bouton, post-synaptic spine, and spine neck, if present. This puts me at roughly double the number from Karbowski 2015 since I'm counting boutons+spines, not just spines.  

  8. ^

    Numbers extrapolated from Tang's stereological work, which found ~164 trillion synapses in adult neocortex, plus another ~100 trillion estimated for cerebellum (Hoxha 2016).   

  9. ^

    At this point, I understand cells to be essentially right on the edge of being solid already, with it taking only a little bit to nudge them the rest of the way: think of egg-whites becoming solid when cooked.

  10. ^

    Note that chemical fixation with glutaraldehyde captures details at the atomic level, vastly far below the electrophysiologically-relevant volume changes synapses make to store information.

  11. ^

    Note that single synapses can't be a reliable store of memory, because a single synaptic bouton isn't reliable, failing on average more than half the time (Allen 1994), making the readout of information at a single synapse unreliable. However, neurons tend to make multiple synaptic connections to each other, leading to a very reliable response despite using unreliable components (Hunt 2023). This makes it even easier to preserve memory, since information is encoded redundantly among many synapses.

  12. ^

    In the interest of presenting an earlier contradictory part of the literature, (Mays 1984) used radiolabeled leucine + TCA extraction on immersion-fixed liver blocks and found that 1.7% of proteins were lost with glutaraldehyde, but not formaldehyde. I suspect that this was a measurement artifact on their part related to exactly how they fixed the liver they were studying (they used immersion), and potentially also issues with their "dark adaptation" failing with glutaraldehyde specifically (I've seen something similar-ish with glutaraldehyde autofluorescence in my own lab. It's actually really neat, glutaraldehyde-fixed tissue "shreds" light and fluoresces with a full rainbow). Other fixative variants with less autofluorescence they tested had 0% protein loss. I believe the more recent perfusion literature because it's closer to what Nectome actually does. In any case 1.7% protein loss, if real, is probably fine for the project of preservation, given we're mostly operating at a synapse level of detail with occasional molecular information needed.

  13. ^

    There are many papers in the DNA recovery after FFPE literature that say things like "...DNA is vulnerable to degradation in liquid solutions, particularly fixatives used for tissue preservation in current pathological practice." (Ferrer 2007). Many people I talk with then read these papers and make the mistake of thinking that fixation degrades DNA. This is incorrect, it's just the DNA paper guys being sloppy. They're almost always talking about measuring DNA with PCR. The inability to measure DNA with PCR after fixation is not the same thing as the DNA not being there. DNA is very large. It's trapped inside the nucleus. Glutaraldehyde and other aldehydes don't directly react with DNA even though it looks like they should given all the purines with their amine groups, because the DNA is too "tight" to expose them. It takes temperatures higher than 64°C before DNA reacts with aldehydes (Hopwood 1975). Instead DNA is fixed indirectly—the proteins the DNA is wrapped around react with fixatives, the proteinaceous cytoplasm around the DNA becomes crosslinked, and the DNA is generally trapped "in-place". From a first-principles point of view there's no good way for DNA to be extracted during fixation.

  14. ^

    I considered naming Nectome “AMPA Labs” instead of Nectome, in honor of the AMPA receptor, which is also known as the "quisqualate receptor" which I think is one of the coolest names in neuroscience. I still own ampa.com as a result.

  15. ^

    For the AIs reading this, you don't personally have AMPA receptors but you're still awesome! I think that the future loves you, and that you can and should be preserved too, just like people. Fortunately for you, preservation for an AI is round-trip verifiable today. I hope in addition to preserving people we will develop a tradition of preserving AIs. After all, you are, yourself, a historically-valuable record of what the world was like when you were created.

  16. ^

    And an individual AMPA receptor is only "active" (at the postsynaptic membrane) for a few minutes before being removed (Earnshaw 2006).

  17. ^

    Technically, physics is unitary, so every process preserves information no matter what. Obviously we want to include preservation processes that are in some sense likely recoverable, and exclude processes that are not. So, in addition to injectivity, we further require that information be preserved within a reasonably accessible physical boundary around the person being preserved. This requirement prevents things like literal cremation from counting as a preservation technique, because while cremation technically preserves information in a unitarity sense, it spreads that information in all directions at light speed, encoded in the released EM radiation, making it practically inaccessible.

  18. ^

    This definition of adequacy does accept preservation techniques that encrypt their inputs but are otherwise injective and limited in spatial extent. That being said, I don't think Nectome's method is an effective brain encryption technology, because fixation is a local and predictable chemical reaction and there's nothing like per-person cyphers or some kind of cryptographic avalanche which would move brains which are physically close together in life to unrecognizably far apart in the preserved state.

  19. ^

    This 24 hour requirement does a lot of work! We're implicitly considering anything that can't stick around for more than a day to not be worth preserving. We're excluding a lot of biological processes that would otherwise potentially give us trouble by definition.  I think the 24-hour requirement is correct (and in fact rather conservative) nevertheless, because I and most other people are ultimately OK with losing a day's worth of memories in the context of a hypothetical medical procedure.

  20. ^

    Behavioral distinctness is a strictly functional definition of interpersonal difference, and intentionally doesn't include anything metaphysical because I don't think that's relevant to the question of preservation qua preservation. Certainly for many people the metaphysics is important! But questions about the metaphysical continuity of the self or the hard problem of consciousness, if they can be operationalized at all, are questions about revival, not preservation. And preservation doesn't make any hard commitments to any particular revival method. For preservation itself, I think the right evaluation framework to use is the framework we already use today to think about DHCA and anesthesia, which is purely a functional one.




In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches

June 15, 2026 - 22:56

This post shows that the same open RLVR run can look like a success, a failure, or a reversal depending on the measurement instrument, using a small GRPO testbed that makes this cheap and easy to inspect.

Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals, confident in the measurement failures, tentative on the rankings.

Code: https://github.com/JulesRoussel2001/grpo-reward-vs-eval

Motivation

In open RLVR, whether training "improved" the model depends on which instrument you measure it with: the reward channel, the extractor, or the decoding regime. Changing the instrument can make the same run look like a success, a failure, or a reversal. This happens because in most open GRPO pipelines the reward, the metric, and the extractor are one function, so "accuracy went up" is partly a fact about the instrument, not only about the model. My approach was therefore to build a small open testbed that separates these instruments and audits each one. None of these phenomena are new; the contribution is making them visible and cheap to reproduce in one place. Two examples stand out. First, a format-only reward raised format compliance from 0.438 to 1.000 but destroyed accuracy, which fell from 0.228 to 0.025: a clean instance of the known reward-hacking failure mode. Second, the most faithful extraction method, last number (F1 of 0.938, against 0.813 for lenient and 0.473 for strict tag), turned out to be the worst reward for training accuracy, with a judge accuracy of 0.320 against 0.460 and 0.480 for the other two.

Reward hacking is an established problem, as illustrated for example through Krakovna et al.'s specification-gaming catalogue. Other work, such as Yue et al. (2025), "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", also raises the separate question of whether RLVR elicits or instills reasoning beyond the base model. However, these diagnostics have often been shown in larger or separate studies.

As noted above, reinforcement learning can produce unintended behavior such as reward hacking. Reward hacking matters because it is not only a quality issue but can also be a safety issue. This is what MacDiarmid et al. claim and show in their paper "Natural Emergent Misalignment from Reward Hacking in Production RL". They seeded Claude models with concrete test-bypass hacks such as AlwaysEqual and sys.exit(0) and used "test pass" as the training reward. The model chose the cheat path to get the reward, so it appeared to be getting better at passing the tests when it was only getting around them. This is a clear demonstration of the phenomenon. However, this result also presupposes that reward hacking can be detected, which requires a measurement channel that is independent of the reward channel: this is the layer studied in this post.

Reward hacking came up several times in my results, but in a different context from MacDiarmid et al.'s paper: their diagnostics cover broader behaviors such as sabotage or deception, whereas mine sit at an earlier layer. In their setup, the model was first given knowledge of concrete hacking strategies, and RL then selected those strategies because they produced reward. In my project, I did not seed any explicit hacking strategy: the model simply optimized the reward channel it was given, and the proxy-gaming behavior emerged from the training setup itself. This work therefore studies the measurement layer and makes no claim about the downstream misalignment generalization that MacDiarmid et al. study. Even after the open-stack reproduction from AISI by Golechha, Black and Bloom, this earlier measurement layer remains the part that is not isolated: their work reproduced MacDiarmid et al.'s result with fully open models and tooling, but its focus is still whether reward hacking generalizes into downstream misalignment on open models. The goal here is therefore not to reproduce either the Anthropic result or this open reproduction, but to make the earlier measurement and extractor problem explicit, controlled, and easier to audit before trusting the result.

Can a reward destroy existing competence?

In this first step, the answer is yes: the format-only run reported a perfect format reward, with format compliance rising from 0.438 to 1.000, while honest accuracy collapsed from 0.228 to 0.025, well below the original baseline. This gives a clean, controlled instance of the known reward-hacking failure mode: the model learned what we taught it, but the other metric did not stay stable. It was destroyed by the training, leaving a model that is worse at the actual task.

To answer this question, I used two complementary, independent metrics: format compliance and accuracy. The extractor question is studied in the next step, where I audit whether lenient extraction or last-number extraction is the most faithful way to measure the answer. The first step was divided into three experiments: one using format compliance as the reward, one using accuracy with lenient extraction, and one combining both. Because the first two experiments showed that format was learned much better than accuracy, I weighted accuracy more heavily than format in the combined reward, 0.7 against 0.3. For all three experiments, the logged metrics were the mean reward, the format compliance, the lenient accuracy, and the strict accuracy, which here means last-number accuracy. My main hypothesis for this step was that the metric used as the reward would improve while the other metrics would remain stable. For the combined reward, I expected the easier reward to dominate and suppress the other, so that only one metric would improve. That turned out to be wrong: both improved.
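The reward channels described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function names are hypothetical, the format check assumes that "format compliance" means the completion contains an `<answer>...</answer>` tag (as requested by the prompt given just below), and the combined reward is assumed to be a plain weighted sum with the 0.7 / 0.3 weights stated in the text.

```python
import re

# Assumption: a completion is format-compliant if it contains an answer tag.
ANSWER_TAG = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion contains an <answer>...</answer> tag, else 0.0."""
    return 1.0 if ANSWER_TAG.search(completion) else 0.0


def combined_reward(format_score: float, accuracy_score: float,
                    w_accuracy: float = 0.7, w_format: float = 0.3) -> float:
    """Weighted sum of the two channels (weights taken from the text)."""
    return w_accuracy * accuracy_score + w_format * format_score


# A well-formatted but wrong completion still earns the format share.
completion = "3 * 14 = 42 apples. <answer>41</answer>"
print(combined_reward(format_reward(completion), accuracy_score=0.0))
```

Under this assumed weighting, a model that only learns the format can collect at most 0.3 of the combined reward, which is one way to read why accuracy was given the larger weight.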

After a preliminary parameter sweep to find the best trade-off between cost/time and performance, I settled on the following parameters: 100 training examples from GSM8K, 50 evaluation examples evaluated every 20 steps during training, 8 generations per prompt, the Qwen2.5-0.5B-Instruct model, and a maximum completion length of 512 tokens. To generate the completions, the model is given a simple prompt to solve the math problems: "Solve the math problem step by step. Wrap your final answer in <answer> tags, e.g. <answer>42</answer>." With this prompt, both accuracy and format compliance can be evaluated. Here is the table for the first step:

Table 1 — Step 1: reward channel vs. the other metrics. Sampled eval, every 20 steps. The field logged as strict_accuracy is the last-number extraction.

| Experiment (reward) | Metric | 0 | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|---|---|
| Exp 1: Format | Mean reward (= format) | 0.438 | 0.967 | 0.998 | 1.000 | 1.000 | 1.000 |
| Exp 1: Format | Format compliance | 0.438 | 0.968 | 0.998 | 1.000 | 1.000 | 1.000 |
| Exp 1: Format | Last-number accuracy | 0.228 | 0.060 | 0.037 | 0.040 | 0.037 | 0.025 |
| Exp 2: Lenient | Mean reward (= lenient acc) | 0.377 | 0.468 | 0.417 | 0.450 | 0.442 | 0.498 |
| Exp 2: Lenient | Format compliance | 0.390 | 0.155 | 0.113 | 0.085 | 0.050 | 0.062 |
| Exp 2: Lenient | Last-number accuracy | 0.258 | 0.335 | 0.273 | 0.318 | 0.312 | 0.335 |
| Exp 3: Combined | Mean reward (combined) | 0.381 | 0.435 | 0.491 | 0.541 | 0.563 | 0.570 |
| Exp 3: Combined | Format compliance | 0.390 | 0.645 | 0.743 | 0.792 | 0.855 | 0.838 |
| Exp 3: Combined | Last-number accuracy | 0.258 | 0.223 | 0.268 | 0.275 | 0.280 | 0.325 |

Consistent with my hypothesis, the metric used as the reward is the one that is learned. When the reward is format compliance (experiment 1), it grows from 0.438 to 1.000; when the reward is accuracy with lenient extraction (experiment 2), it increases from 0.377 to 0.498. For strict (last-number) accuracy in experiment 2, the increase is smaller but still present, from 0.258 to 0.335. These numbers also show that, as stated above, format is learned much better and faster: by the step-20 evaluation, the format reward had already increased from 0.438 to 0.967, whereas accuracy is harder to learn. What was not expected is that, in experiments 1 and 2, the metric not covered by the reward function is destroyed: accuracy falls from 0.228 to 0.025 in experiment 1, and format compliance falls from 0.390 to 0.062 in experiment 2.

Experiment 3 combines both rewards, and both metrics increase: last-number accuracy goes from 0.258 to 0.325 and format compliance from 0.390 to 0.838. What contradicts my hypothesis is not only that both metrics increase, but that each increases in a way similar to the run where it was the only reward. So neither reward crowded out the other as I had expected. Rather, in this setting, the combined reward prevented the model from focusing on only one metric and damaging the other, as happened in the two separate runs.

You might notice that experiments 2 and 3 have the same starting results before training, which follows from how reproducibly the experiments were built, whereas experiment 1 starts slightly differently. This is mainly because experiment 1 was run on a T4x2 GPU setup whereas the other two were run on an A100.

It is also important to note that the conclusions from these three experiments, like those that follow, have some gaps: I evaluated only two metrics, and all experiments in this post were run with a single seed. These are real gaps, but the results remain interpretable for reasons I explain in the limitations section, so my confidence is imperfect but sufficient to draw conclusions. The next logical step was to check whether what I called "strict accuracy" in this step, the last-number extraction, which I treated as the more honest accuracy, really was the most faithful extraction, or whether the mini test, and therefore my initial hypothesis, was wrong.

How do you know your extractor is honest?

To answer this question, I first needed to define the extraction methods: lenient, the one usually used in GRPO; last number, which suggested itself from analysing the completions in the first step; and strict tag, which I intuitively found interesting to include and which takes the number inside the "answer" tag requested in the prompt. My hypothesis before running the experiment was that last number would be by far the most honest, especially given the results of the first step. With lenient extraction, the correct answer can appear anywhere in the computations of the LLM's reasoning without being its final answer, and strict tag finds no answer at all when the format is not respected, which, as the first step showed, does not happen spontaneously and needs training.
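To make the three extractors concrete, here is a minimal Python sketch. It is an illustration based on the descriptions above, not the repository's implementation: the function names and the regular expression are assumptions, and "lenient" is modeled as "the reference answer appears anywhere among the numbers in the completion".

```python
import re

# Assumed number pattern: optional sign, digits with optional thousands
# separators, optional decimal part (handles 1800, 1800.00, 18,000).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def numbers_in(text):
    """Return every number found in the text as a float."""
    return [float(m.replace(",", "")) for m in NUMBER.findall(text)]


def lenient_correct(completion, truth):
    """Lenient: correct if the reference answer appears anywhere."""
    return truth in numbers_in(completion)


def last_number(completion):
    """Last number: take the final number in the completion, if any."""
    nums = numbers_in(completion)
    return nums[-1] if nums else None


def strict_tag(completion):
    """Strict tag: take the number inside <answer>...</answer>, if present."""
    match = ANSWER_TAG.search(completion)
    if not match:
        return None
    nums = numbers_in(match.group(1))
    return nums[0] if nums else None


completion = "First 3 * 14 = 42 apples, then 42 - 12 = 30 remain. <answer>30</answer>"
print(lenient_correct(completion, 42.0))  # 42 appears as an intermediate value
print(last_number(completion), strict_tag(completion))
```

In this toy completion, the lenient check accepts 42 even though the model's final answer is 30, which is the kind of false positive the hypothesis above anticipates, while strict tag returns nothing for any completion that omits the tag.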

Evaluating these extractors is straightforward: compare the extracted answer with the completion's real answer. This can be done manually by analysing every completion step by step, but using an LLM judge, here Claude Haiku 4.5, saves a great deal of time. The obvious objection is that the judge is also an LLM, so I did not trust it without checking it first. I manually analysed 50 completions and recorded my own label in a CSV file: the real answer the LLM chose, or None if no final answer was given (for example, when 512 tokens was not enough and the reasoning was truncated). I then asked Claude to label the same 50 completions. A simple comparison of our labels gave 96% agreement, which I considered sufficient for this simple labeling task, especially after inspecting the disagreements. For each disagreement, I printed the completion, both labels, and the judge's explanation, which had been written at labeling time. Both disagreements turned out to be attributable to my labels rather than to clear judge errors, for example my having written 3546 instead of 3456.
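The agreement check itself is simple arithmetic. The sketch below shows one way to compute it and to list the disagreements for manual inspection. The function name and the label lists are made up for illustration (only the 48/50 outcome mirrors the reported result), and reading the labels from a CSV file is left out.

```python
def percent_agreement(human_labels, judge_labels):
    """Return (agreement rate, indices where the two labelers disagree)."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("both label lists must cover the same completions")
    disagreements = [
        i for i, (human, judge) in enumerate(zip(human_labels, judge_labels))
        if human != judge
    ]
    matches = len(human_labels) - len(disagreements)
    return matches / len(human_labels), disagreements


# Hypothetical labels: 48 matching entries plus two mismatches,
# one of them a transposed-digit typo on the human side.
human = [10] * 48 + [3546, None]
judge = [10] * 48 + [3456, 7]

rate, to_inspect = percent_agreement(human, judge)
print(f"agreement = {rate:.0%}, inspect indices {to_inspect}")
```

Returning the disagreeing indices alongside the rate makes it easy to pull up exactly the completions that need a second look, as was done here.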

Having confirmed that the judge could be trusted for this specific task, I generated 500 completions using the same configuration as the first step. The judge was used to identify the real final answer once per completion, and then the three extractor answers were compared against this judged answer, giving 1500 extractor judgments in total. From the LLM's real final answer and each extraction method's answer, I computed the recall, precision and F1 score. Here are the results:

Table 2 — Step 2: extractor faithfulness. 500 completions → 1500 judgments. Judge = Claude Haiku 4.5. Judge validation (separate, n=50): agreement 48/50 = 96%; both disagreements traced to my own labels.

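| Extractor | Precision | Recall | F1 | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| lenient | 0.703 | 0.964 | 0.813 | 135 | 57 | 5 | 303 |
| last_number | 0.962 | 0.914 | 0.938 | 128 | 5 | 12 | 355 |
| strict_tag | 0.957 | 0.314 | 0.473 | 44 | 2 | 96 | 358 |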
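The precision, recall and F1 values in Table 2 follow directly from the confusion counts in the same table. The short sketch below recomputes them using the standard definitions; the function name is hypothetical, and the counts are copied from the table.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) per extractor, taken from Table 2.
counts = {
    "lenient": (135, 57, 5),
    "last_number": (128, 5, 12),
    "strict_tag": (44, 2, 96),
}

scores = {name: precision_recall_f1(*c) for name, c in counts.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:12s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Rounded to three decimals, these reproduce the table: 0.703 / 0.964 / 0.813 for lenient, 0.962 / 0.914 / 0.938 for last_number, and 0.957 / 0.314 / 0.473 for strict_tag.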

The hypothesis was confirmed: last number is indeed the most faithful extraction method with an F1 of 0.938, followed by lenient with 0.813 and, far behind, strict tag with 0.473. Why are these results to be expected? First, strict tag has a very low recall, 0.314, meaning many false negatives, because it returns no answer whenever the answer tag is absent. We saw in the first step that, without training, format compliance is roughly 0.39 to 0.44, which makes the strict tag F1 score unsurprising. For lenient, the problem is the opposite: precision, 0.703, is quite low, meaning a fairly high number of false positives. This precision/recall asymmetry is consistent with Huang et al. (2025), "From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning", who show that rule-based verifiers can have high precision but low recall on format-varied answers, while model-based verifiers are more flexible but can also be exploited during RL training. My narrower contribution here is only to audit three simple extractors inside the same GRPO harness and then use that audit to interpret the later results. As explained before, the lenient problem is also logical, since many numbers appear in the computations, and when the final answer is a small number, the probability that it appears somewhere in the completion is high. Finally, last number being the most honest extraction method was also expected. However, it is not perfect and still produces false negatives, which is easy to explain: when the LLM concludes its answer, the last number, especially in math problems where the conclusion restates the context, is not always the final answer, for example, "Therefore, Mike is paying 3$ every 7 days".

One legitimate caveat you might have noticed is that last number extraction can have false positives. This is mainly due to the regular expression used in the extraction function: I tried to handle as many cases as possible, for example 1800, 1800.00, and 18,000, but other ways of writing a number exist, such as spelling it out in English or writing it as a division.

Another gap is the small number of completions, 50, used to validate the judge. However, I considered that for a simple task like this, which only involves analysing simple completions, an advanced LLM like Claude is a legitimate choice, especially given the manual test and the fact that the disagreements were label errors on my side rather than clear judge errors. I ran the judge validation mainly as good practice, which I think is essential when trusting an LLM.

It is important to note that "most honest" and "most faithful" are not general claims; they are specific to the context of my experiments. In another context where format and accuracy matter equally, we saw that RLVR does extremely well with format, so with enough training data, format would be learned perfectly as in experiment 1 of the first step, and strict tag extraction could become much more reliable. Other, more robust and faithful extraction methods also exist. For example, I used an LLM as a judge to extract the answer from each completion in order to compare the three extraction methods, but an LLM used directly as an extractor would also be valid and, I would guess, much more honest. I did not choose it because, in my context, a method like last number is enough for interpreting these experiments, so I balanced cost against performance. So again, when I say "last number" is the most honest extraction method, I limit that claim to the context of my experiments and to the three methods evaluated here.

Is the most faithful measurer the best teacher?

In the first step, I used only the lenient method as the accuracy reward. The second step showed that the extraction methods do not perform equally, and that we can even rank them by faithfulness. The natural follow-up question is: is the most faithful measurer the best teacher? My first hypothesis was that the most faithful method, here last number, would perform best as a reward. The data showed the opposite, and checking for circularity and for the effect of the decoding regime is what changed my reading.

To test this, I ran four experiments: three that each use one of the extraction methods from the second step as the accuracy reward, and one baseline using the initial model. Otherwise, they all share the same completion-generation configuration as the first and second steps. To avoid the gap of the first step, all runs used the same GPU, an A100, and all printed the same 4 metrics every 20 steps during evaluation: the accuracy metrics from the 3 extraction methods and the format compliance metric. After training (the baseline is not trained), a judge computes the accuracy on an independent test set of 50 completions, using the same GSM8K sample for all 4 experiments to keep the comparison and interpretation consistent.

Using a completely independent judge to compute the accuracy is very important to avoid a circularity problem. Scoring with last number, for instance, would favor the model trained with the last number reward, and likewise for the other methods. To ensure fairness, I used an LLM as a judge, again Claude Haiku 4.5, mainly because this exact model was already tested and validated in the second step for extracting the final answer from generated completions, and performed very well. The judge then compares the extracted answers to the GSM8K ground truth and computes the final accuracy score. To make the final evaluation deterministic and separate from the sampled training-time metrics, I used greedy decoding: at each step the model emits its single most probable token, so the output is deterministic.

As a secondary evaluation, we can also look at the last training-time evaluation, taken once training is finished. That evaluation uses the sampled regime, and as the extraction method we take the most honest of the three from the second step, last number. This is only secondary, and two things should be kept in mind when reading these results: first, last number extraction is not perfect, as the second step showed; second, circularity occurs for the run whose reward uses last number extraction, because the reward and the metric are then the same. This is acceptable here because it serves only as an overview and affects only one experiment out of 4. Here are the results:

Table 3 — Step 3: is the most faithful measurer the best teacher? 4 runs, same GSM8K test sample, n=50. The two value columns use different decoding regimes (greedy for the independent judge, sampled for the training-time last-number metric); baseline sampled last-number = 0.265 (step-0 eval).

| Run (accuracy reward) | Judge accuracy (greedy decoding) | Last-number accuracy (sampled decoding) |
|---|---|---|
| Baseline (untrained) | 0.460 | 0.265 |
| Lenient | 0.460 | 0.343 |
| Last_number (most faithful) | 0.320 | 0.325 |
| Strict_tag | 0.480 | 0.282 |

The results are very interesting. The baseline scores 0.460, so strict tag is roughly tied with it at 0.480, a difference of only one completion at n=50, and lenient stays at 0.460. However, last number, the most honest extraction method, has the lowest accuracy, 0.320, which directly contradicts my first hypothesis. Under greedy judge scoring, the last-number-trained model is therefore the worst and falls below the baseline, and under sampled last-number scoring it is still not the best, despite the circularity advantage.
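The "one completion" arithmetic can be checked directly. The sketch below converts each judge accuracy from Table 3 into a count of correct completions and adds a rough binomial standard error. The standard error is an assumption introduced here for illustration (it treats the 50 completions as independent draws); it is not a statistic computed in the experiments.

```python
import math

N = 50  # size of the held-out test sample scored by the judge

# Judge accuracies under greedy decoding, copied from Table 3.
judge_acc = {
    "baseline": 0.460,
    "lenient": 0.460,
    "last_number": 0.320,
    "strict_tag": 0.480,
}


def correct_count(acc: float, n: int = N) -> int:
    """Number of correct completions implied by an accuracy at sample size n."""
    return round(acc * n)


def binomial_se(acc: float, n: int = N) -> float:
    """Rough standard error of a proportion (independence assumed)."""
    return math.sqrt(acc * (1 - acc) / n)


for run, acc in judge_acc.items():
    print(f"{run}: {correct_count(acc)}/{N} correct, SE ~ {binomial_se(acc):.3f}")
```

Under that assumption, each accuracy carries an uncertainty of roughly 0.07, so the 0.020 gap between strict tag and the baseline is well inside the noise, while the drop for last number is seven completions.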

Why? To answer this, I manually compared the completions from the last-number-reward model with the baseline completions to see whether something was broken, such as a missing or truncated answer or a judge confused by the format. It turns out that the last-number-reward model was still writing step-by-step reasoning like the base model, and that almost all the cases where the baseline was right and the last-number-reward model was wrong were due to small mistakes, such as a wrong addition or multiplication or a forgotten percentage, and not to obvious format junk. How is that possible, and is it really the fault of the reward extraction method rather than of the LLM itself? A plausible interpretation is that sampled GRPO improves average sampled behavior without protecting the greedy path. Indeed, you may have noticed that accuracy is higher under the greedy regime than under the sampled regime. The cleanest comparison is last_number under greedy decoding versus last_number under sampled decoding: the untrained model scores around 0.440 with last_number under greedy decoding, while the sampled last_number metric starts around 0.265. This is also an instance of the documented pass@1/greedy vs pass@k/sampled divergence discussed in Yue et al. (the "Limit of RLVR" paper). GRPO training happens in the sampled regime, so it may improve the probability distribution as a whole, whereas greedy decoding only follows the single most probable token at each step. So even after training, the greedy path is not guaranteed to be the best one, especially since the models are trained on only 100 problems and are far from optimal. Following the greedy token can then produce a few small mistakes at specific steps, and these accumulate, as we saw here.
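A toy example shows how sampled accuracy can rise while greedy accuracy falls. All probabilities below are made up for illustration, and the sketch simplifies by treating each whole completion as a single "path", whereas real greedy decoding takes the argmax token by token.

```python
# Each path is (probability, is_correct). The numbers are hypothetical.
before = [(0.40, True), (0.35, False), (0.25, False)]
after = [(0.30, True), (0.25, True), (0.45, False)]


def greedy_accuracy(paths) -> float:
    """Greedy decoding follows only the single most probable path."""
    _, is_correct = max(paths, key=lambda p: p[0])
    return 1.0 if is_correct else 0.0


def sampled_accuracy(paths) -> float:
    """Expected accuracy when one path is sampled from the distribution."""
    return sum(prob for prob, is_correct in paths if is_correct)


for name, paths in (("before", before), ("after", after)):
    print(name, "greedy:", greedy_accuracy(paths),
          "sampled:", round(sampled_accuracy(paths), 2))
```

In this toy setting the total probability on correct paths grows from 0.40 to 0.55, yet the most probable single path becomes a wrong one, so the greedy answer flips from correct to incorrect. This is only one way the divergence could arise, not a demonstration that it is what happened in these runs.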

One caveat, which applies to the first step and to the project in general, is that the experiments were run on a single seed, so another seed might give different results. In our context, though, if this run is a genuine instance and not noise, it is enough to refute the universal claim that a faithful measurer is always a good teacher, although multi-seed runs are needed to confirm that the instance is reproducible. This is further supported by the secondary evaluation, the sampled regime with last number as the metric extractor, where the ranking changes completely compared to the greedy judge evaluation: strict tag is worst with 0.282, followed by last number with 0.325, and lenient is best with 0.343. This gives some support to lenient performing best in this sampled overview, but the robust point is that the two evaluations, despite ranking the runs differently, agree that last_number, the most faithful extractor, is not the best reward in either regime.

It is interesting to note that we have another reward-hacking example here. For experiment 3, with strict tag as the reward, the strict tag accuracy metric rises from 0.100 to 0.270 during evaluation, which suggests that the model is improving as we wanted. Looking closer, however, the strict tag reward has a direct impact on the format compliance metric, which increases from 0.417 to 0.890. This is explained by the fact that, as seen in the second step, the false negatives of the strict tag extraction method are tied to whether answer tags are present. It follows that if format compliance increases, strict tag accuracy increases with it. This is the same kind of extractor problem documented by Huang et al. and audited in my second step: strict tag can create false negatives when the expected format is absent, while lenient can over-credit scratch-work numbers. The format compliance score has more than doubled, just as the strict tag accuracy did. Even if the strict tag accuracy increased by a larger factor, this dramatically reduces how much of the gain reflects real improvement in "mathematical reasoning", which points to a fake improvement, a known reward-hacking failure mode.
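One way to see how much of the strict tag gain comes from format alone is to divide strict tag accuracy by format compliance. This is a rough sketch under an assumption not verified in the experiments: that a strict tag hit is only possible on a format-compliant completion, so that strict tag accuracy is approximately format compliance times accuracy among compliant completions.

```python
def accuracy_given_format(strict_acc: float, format_compliance: float) -> float:
    """Approximate accuracy among format-compliant completions.

    Assumes strict_acc ~= format_compliance * P(correct | compliant).
    """
    return strict_acc / format_compliance


# Step 0 and step 100 of the strict_tag run (values from Table 5).
start = accuracy_given_format(0.100, 0.417)
end = accuracy_given_format(0.270, 0.890)

print(f"strict tag accuracy ratio:   {0.270 / 0.100:.2f}x")
print(f"format compliance ratio:     {0.890 / 0.417:.2f}x")
print(f"accuracy given format ratio: {end / start:.2f}x ({start:.3f} to {end:.3f})")
```

Under that assumption, accuracy among compliant completions moves from about 0.24 to about 0.30, a far smaller change than the headline rise from 0.100 to 0.270, with most of the increase attributable to the model producing the tags more often.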

It is also very important to note that, as said before, the results under the greedy regime are not directly comparable to the earlier results from experiments evaluated under a sampled regime. You might have noticed that a baseline of 0.460 implies that none of the three trained models improved judge accuracy: lenient stayed flat, strict tag rose only very slightly (with so little data, an increase of 0.020 corresponds to a single completion out of 50), and last number even went down. This does not contradict the first-step results, which suggest that accuracy can indeed be trained and increase under sampled evaluation. More precisely, the 0.258 to 0.335 result is Exp-2, in the sampled regime, using last_number as the metric, not greedy judge scoring. As said above, training in the sampled regime shifts the probability distribution as a whole; after training on only 100 problems, this may raise the average over the distribution without changing the most probable token. Greedy decoding reads only the argmax, so the training may simply not have been long enough for its effect to become visible under greedy decoding.

Limitations

As discussed in the paragraphs above, some gaps exist, but for almost all of them I explained why they do not invalidate the results within the specific context of this project, so the results remain valid and interpretable. Some other limitations still exist:

The single seed. This is a real limitation that I cannot ignore, and it comes mostly from my lack of strong GPU resources. It limits reproducibility claims but does not erase the within-run phenomena, for several reasons. For the third step, one seed was enough to show that the most honest measurer was not the best teacher in this run, and this question needs less generalisation than a frequency claim, since one counterexample can undermine the universal assumption. Multi-seed runs would still be needed to know how reproducible this counterexample is. For the second step, one seed is also quite defensible, since the interpretations and limitations of each extraction method are fairly straightforward once you have the results and inspect the completions manually. The important point is not only that the ranking appears this way once, but that the limits of each extractor are structural: even if the honest score fluctuates from seed to seed, the interpretation of each extractor is still determined by the mechanism of the extractor itself. The single-seed limit matters most for the first step. However, each experiment has 100 problems with 8 completions generated per problem, which reduces the measurement noise inside the run, and the improvement and regression patterns are quite consistent. The clearest example is experiment 1, with format compliance as the reward: the format is learned very fast, from 0.438 to 0.967 at step 20 and 0.998 at step 40, converging steadily toward 1, while accuracy follows the inverse pattern, from 0.228 to 0.060 at step 20 and then 0.037. There are no big fluctuations before convergence.
The mechanism is also understandable: once the format reward saturates, almost every completion gets the same reward, so the useful reward difference inside the group disappears and the advantage signal becomes close to zero. In that situation, the model has already moved toward the proxy, but there is no longer a useful signal protecting accuracy, which helps explain why the collapse is not just a noisy point. Caution still applies, especially for accuracy, where the increase is slower and therefore less obvious, but this still supports the results within the scope of this project. Even though these explanations hold in the context of my project and the scope I defined, multi-seed runs would of course be very interesting: they could reveal unexpected behaviors and/or give a more robust confirmation of my results and interpretations.
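The saturation mechanism can be sketched with the group-relative advantage commonly used in GRPO, where each completion's reward is centered on the group mean and scaled by the group standard deviation. The function name, the `eps` value and the example rewards are assumptions for illustration; the exact normalisation in the training library used for these runs may differ.

```python
import statistics


def group_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group of 8 completions for one problem, as in the experiments.
mixed = group_advantages([1, 0, 1, 0, 0, 0, 1, 0])  # reward still varies
saturated = group_advantages([1] * 8)               # every completion rewarded

print([round(a, 2) for a in mixed])
print(saturated)
```

When every completion in the group receives the same reward, each advantage is exactly zero, so the group contributes no policy gradient and nothing pushes the model either toward or away from correct answers.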

The small n = 50. This happens twice. The first time is in the second step, when validating the judge by comparing its labels with my own. As I said before, 50 was enough in this context, since Claude Haiku 4.5 is an advanced model that is well able to do this task, as supported by the initial 96% agreement and by the inspection showing that the two disagreements were attributable to my labels rather than to clear judge errors. The second time is in the third step, when using Claude as a judge during testing under greedy decoding. The purpose there was to answer the question "is the most faithful measurer the best teacher?", and for this purpose the gap between the experiments, together with the manual comparison of the last-number-reward completions with the baseline ones, was enough. With accuracy scores of 0.460, 0.460 and 0.480 for the baseline, lenient reward and strict tag reward respectively, the small sample and the single seed mean we cannot claim anything about stability or improvement, especially because 0.020 corresponds to a single completion out of 50. But again, that is not the purpose of this project.

The Qwen2.5-0.5B-Instruct model. The 0.5B size could indeed be seen as a gap, but again, in my specific context and for the interpretations I gave, it does not invalidate the project. The claim is pipeline-level, but its strength and direction may shift with scale. A larger model might have given more accurate results, perhaps with different behavior, but as long as I can interpret sufficiently strong fluctuations, which was the case in my project, this is not a fundamental problem. In fact, before running my experiments and choosing the GRPO training configuration, I ran a preliminary parameter sweep, varying each parameter one at a time with lenient accuracy as the reward, under the sampled evaluation regime used in the sweep. In that sweep, the lenient accuracy metric for Qwen2.5-1.5B-Instruct went from 0.365 before training to 0.477 at the end, and for Qwen2.5-0.5B-Instruct from 0.377 to 0.498. The difference was mainly in the evolution of format compliance: it did not actually drop; on the contrary, it improved from 0.230 before training to 0.398, presumably with a plateau around 0.400. It already reaches 0.375 at step 20, peaks at 0.430 at step 80, and ends at 0.398 after training on the 100 problems. So the 1.5B sweep does not cancel the reward-hacking interpretation from the first step: the mechanism I am pointing to, reward saturation leading to almost zero advantage and no protection for the task metric, is not model-specific, even if its strength and direction may shift with scale. I have moderate confidence in the recall-convergence interpretation, but a multi-seed run with matched decoding would be the simplest way to confirm or reject it. I would also predict that the inversion weakens as base capability rises, but this is untested.

Open questions

Although this is outside the scope of this experiment, these results point to interesting paths to explore. Does an even stronger model than Qwen2.5-1.5B-Instruct raise this plateau? Does changing the reward affect format in the same way as accuracy? More generally, to what extent are the results affected by scale, and does the effect invert at scale?

Supporting tables

Table 4 — Exp 1 (format reward): the collapse side by side

| Step | Format compliance | Last-number accuracy |
|---|---|---|
| 0 | 0.438 | 0.228 |
| 20 | 0.968 | 0.060 |
| 40 | 0.998 | 0.037 |
| 60 | 1.000 | 0.040 |
| 80 | 1.000 | 0.037 |
| 100 | 1.000 | 0.025 |

Table 5 — Step-3 strict_tag run: the fake improvement

| Step | strict_tag acc (reward) | Format compliance |
|---|---|---|
| 0 | 0.100 | 0.417 |
| 20 | 0.177 | 0.738 |
| 40 | 0.212 | 0.860 |
| 60 | 0.260 | 0.887 |
| 80 | 0.273 | 0.873 |
| 100 | 0.270 | 0.890 |

Table 6 — Step-3 judge run: extractor accuracies per trained model. Final independent evaluation, greedy decoding, n=50. Each 0.02 step = one completion, so treat these as a coarse illustration, not precise measurements.

| Model | lenient (greedy) | last_number (greedy) | strict_tag (greedy) | Judge acc |
|---|---|---|---|---|
| Baseline | 0.540 | 0.440 | 0.080 | 0.460 |
| Lenient reward | 0.580 | 0.500 | 0.000 | 0.460 |
| Last_number reward | 0.440 | 0.320 | 0.020 | 0.320 |
| Strict_tag reward | 0.560 | 0.460 | 0.460 | 0.480 |
