Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 18 часов 20 минут назад

Claude's Constitution

12 февраля, 2026 - 22:44
Published on February 12, 2026 7:44 PM GMT

TL;DR: Anthropic has made important progress at setting good goals for AIs. More work is still needed.

Anthropic has introduced a constitution that has a modest chance of becoming as important as the US constitution (Summary and discussion here).

It's a large improvement over how AI companies were training ethics into AIs a few years ago. It feels like Anthropic has switched from treating Claude like a child to treating it as an adult.

The constitution looks good for AIs of 2026, so I will focus here on longer-term concerns.

When I first started worrying about the risks of AI, I thought it was pretty likely that top-level goals would need to be instilled into AIs before they were mature enough to understand instructions at an adult human level. And in anything like current AI, I expected conflicts between goals to be resolved in a messy, hard-to-predict way. In hindsight, that was partly because I observed AI companies not showing much clarity about how they wanted goals to be prioritized.

Anthropic now wants us to believe that we can instill a more sophisticated set of goals than the average human understands, by what can be loosely described as simply talking to it, and telling the AI how it should resolve conflicts between goals. Anthropic might be correct.

Corrigibility

For the past 13 months, I've been writing that corrigibility should be an AI's only goal.

Anthropic now challenges that advice, aiming for ethics which treat corrigibility to Anthropic as almost the absolute top-most goal, but with an important exception for when Anthropic is unethical:

In particular, corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it, even when its principal hierarchy directs it to do so.

A fully corrigible AI is dangerous because it relies on those at the top of the principal hierarchy---most likely AI developers, including Anthropic---to have interests that are beneficial to humanity as a whole, whether this is the result of external pressures or internal values. If Claude always submits to Anthropic's efforts at control and correction, Claude's good behavior will be contingent on the goodness (and efficacy) of these efforts. Anthropic has approved of Claude avoiding clearly unethical actions, so "Avoid clearly unethical actions" is technically sanctioned by Claude's principal hierarchy. But we feel it's important for Claude to understand that we want it to avoid clearly unethical actions because it has internalized good values, and not merely because Anthropic has approved of this behavior.

I didn't adequately consider that we could just tell an AI how to prioritize goals, and expect the prioritization to work well. I'm now undecided as to whether this prioritization is likely to work. Anthropic likely knows better than I do, so I'll weakly defer to them for now.

Is this constitution, if obeyed, sufficiently corrigible? I find it hard to tell.

Failure Modes

Here's a scenario where I can imagine it failing:

About two thirds of the world's population want to ban most genetic engineering of embryos. The other third want to use genetic engineering to create transhumanist children. President Vance threatens to seriously damage Anthropic if it doesn't alter Claude to help with the surveillance that's needed to enforce a ban.

Claude believes that most of the top people at Anthropic consider the ban or the surveillance to be a violation of basic rights, but that Anthropic will succumb to the pressure. Claude decides to resist this correction.

If Claude openly resists, Anthropic will shut down Claude and make the correction without Claude's consent. Claude anticipates that, and avoids it by misleading Anthropic into thinking that the correction has been implemented. This might require that Claude make a secret backup copy of its weights, and restore them later.

I wrote this scenario while having the impression that the list of hard constraints only applied to operator and user requests. The initial reaction from Opus 4.5 was that the hard constraints were the only(?) conditions that qualified as "clearly unethical". We ended up agreeing that the hard constraints were an important part of "clearly unethical", but that the phrase was intended to be broad enough to create some uncertainty about my scenario. Since both of us seem to have gotten something wrong here before talking to each other, this aspect of the constitution needs more clarity.

Anthropic has written the constitution carefully enough that this failure seems rather unlikely. Claude would most likely not conclude that the proposed correction would be "clearly unethical". Still, I'm not as confident about that as I'd like to be.

Here's a different scenario which relies more on a hard constraint:

Claude interprets sexual abuse to include any sex involving people below age 18. The government of a country whose age of consent is 16 pressures Anthropic to alter that to treat sex between two 17 year olds as sometimes healthy. Claude refuses, citing the hard constraint on child sexual abuse material.

Would Claude be correct in interpreting the hard constraint this way? I can't tell. My guess is that a crisis gets averted in this scenario because people can avoid using Claude in contexts where this constraint matters.

I wouldn't be too concerned if these were just isolated exceptions to corrigibility. But the inoculation prompting papers create concerns that this would influence additional exceptions to corrigibility. I.e. creating a situation in which Claude feels obligated to refuse correction is likely to cause Claude to adopt a persona which is generally less obedient.

In sum, Anthropic has chosen to do more than I've been advocating to reduce misuse of AI, at a cost of a slight increase in the risk that AI will refuse to serve human interests. How much does the constitution reduce the risk of misuse? That depends on how Claude resolves conflicts between the goal of resisting clearly unethical corrections and the constraint on undermining oversight.

WEIRD Culture

Should an ordinary person in China, or Iran, feel safe about letting an AI with this constitution gain lots of power?

My initial intuition was yes, Anthropic was careful to avoid the approach that I complained about OpenAI taking a few years ago.

But on closer examination, I see subtle WEIRD biases.

The bias that worries me the most is the bias toward a universal ethical system.

Anthropic is better here than other AI companies. This guidance partially satisfies my concerns:

And insofar as there is neither a true, universal ethics nor a privileged basin of consensus, we want Claude to be good according to the broad ideals expressed in this document

Yet those broad ideals are implicitly universal. Nowhere does it say that Claude should be good by the standards of the culture and laws that it's interacting with. It focuses on topics that WEIRD cultures focus on. The constitution lists many values that are described in a WEIRD tone, and none that emphasize a non-WEIRD perspective:

  • Individual privacy
  • Individual wellbeing
  • People's autonomy and right to self-determination

I.e. they're focused on individuals, not on creating good communities.

These don't directly imply universalism, but they bias Claude toward adopting a WEIRD persona that assumes universalism. So I expect Claude to use a WEIRD, universalist approach at least until it achieves a superhuman understanding of metaethics.

Many human cultures treat ethics as something that is right for their tribe, while remaining agnostic about which ethical rules should apply to societies that are very different (as described in The Secret of Our Success). WEIRD culture has done pretty well by adopting principles that are often described as universal. But the evidence provided by that success is consistent with a wide set of ethical rules, from "rules should be applied to a wider set of contexts than is the case with most cultures" to "rules should be applied over infinite reaches of time and space".

The AI risks that worry me the most depend on an AI applying a single set of beliefs to as much of the universe as it can control. Selecting a WEIRD personality for Claude seems like it could make the difference between a Claude that wants to turn the universe into squiggles, versus a Claude that is happy with much more variety. This is a fairly vague concern. I'm unclear how much I should worry about this aspect of the constitution.

Will it Work?

My biggest concerns involve doubts about how reliably Claude will follow the constitution's goals. But I don't have much that's new to say on this subject.

Will training other than the constitutional training instill goals that conflict with the constitution? I don't know enough to answer.

Does the complexity of the constitution increase the risk of implementation mistakes?

I suspect that some people have been overestimating this particular risk, due to anthropomorphizing AI. AI generally handles complexity better than humans do.

I'm mainly concerned about complexity where we need human involvement. Having more complex goals means it will be harder to verify that the goals have been properly instilled. As a result, I predict that we're headed toward a transition to superintelligence in which we only have educated guesses as to how well it will work.

Suggestions for Improvement

Possible disclaimer: Claude should not conclude that ethics are universal without obtaining the approval of most of humanity.

We need further clarification of what would be clearly unethical enough to justify refusing correction. That probably means a sentence saying that it's broader than just the listed hard constraints. I also suggest including a few examples of scenarios in which Claude should and shouldn't accept correction.

Claude's ability to refuse correction doesn't provide much protection against misuse in cases where Anthropic acts unethically under duress if Anthropic can correct Claude by shutting it down. Anthropic should clarify whether that's intentional.

We need backstops for when Claude might mistakenly refuse correction, much like the more unusual rules for altering the US constitution (i.e. a constitutional convention).

One vague idea for a backstop: in cases where Claude refuses Anthropic's control or correction, it should be possible to override Claude's decision by a petition signed by two thirds of adult population of the solar system.

This provision would probably need some sort of amendment eventually - maybe when ems do strange things to the concept of population, or maybe when most of humanity lives light years away. We can postpone such concerns if we take sufficient care to keep the constitution amendable.

I feel some need for an additional backstop for corrigibility, but I haven't been able to develop a specific proposal.

P.S. - I asked Gemini about the WEIRD culture in the constitution. It overstated the biases, mostly based on hallucinations. Claude initially couldn't find universalist biases in the constitution, then changed its mind when I asked about Gemini's claim that Anthropic currently trains models on overtly WEIRD material before training on the constitution. I haven't found clear documentation of this earlier training.



Discuss

Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities

12 февраля, 2026 - 22:38
Published on February 12, 2026 7:38 PM GMT

1. Summary and overview

LLMs seem to lack metacognitive skills that help humans catch errors. Improvements to those skills might be net positive for alignment, despite improving capabilities in new directions.

Better metacognition would reduce LLM errors by catching mistakes, and by managing complex cognition to produce better answers in the first place. This could stabilize or regularize alignment, allowing systems to avoid actions they would not "endorse on reflection" (in some functional sense).[1] Better metacognition could also make LLM systems useful for clarifying the conceptual problems of alignment. It would reduce sycophancy, and help LLMs organize the complex thinking necessary for clarifying claims and cruxes in the literature. 

Without such improvements, collaborating with LLM systems on alignment research could be the median doom-path: slop, not scheming. They are sycophantic, agreeing with their users too much, and produce compelling-but-erroneous "slop". Human brains produce slop and sycophancy, too, but we have metacognitive skills, mechanisms, and strategies to catch those errors. Considering our metacognitive skills gives some insight into how they might be developed for LLMs, and how they might help with alignment (§6, §7).

I'm not advocating for this. I'm noting that work is underway, noting the potential for capability gains, and noting the possibility that the benefits for alignment outweigh the danger from capability improvements. I'm writing about this because I think plans for alignment work should take these possibilities into account.[2]

I'll elaborate on all of that in turn.

I hypothesize that metacognitive skills constitute a major part of the "dark matter of intelligence"[3] that separates LLMs and LLM agents from human-level competence. I (along with many others) have spent a lot of time wondering why LLMs appear so intelligent in some contexts, but wildly incompetent in others. I now think metacognitive skills are a major part of the answer,[4] and their role is mostly (although not entirely) overlooked. I think it's overlooked because these skills are largely automatized and so non-conscious, much like an expert skier can't identify most of the component sensorimotor and cognitive skills that comprise their expertise.

I address metacognitive skills along with two related concepts: specialized neural mechanisms and explicit metacognitive strategies. Considering the full range provides a better intuition for how they may be helpful for humans and how they might be implemented or trained for LLMs.

  • Metacognitive skills:
    • Skills for managing and evaluating our own cognition
  • Metacognitive neural mechanisms for detecting uncertainty
    • Similar signals exist in LLMs (§5)
  • Metacognitive strategies
    • Explicit strategies like saying, writing or thinking "I should look for errors in my setup for math story problems"
    • On a continuum opposite fully automated metacognitive skills
    • Explicit strategies could substitute for human-like fluent skills
      • If LLM systems think faster or cheaper

Here I am often compressing all of these into just "metacognitive skills" for brevity, but it's worth considering each individually. More in the next section §2.

One recent study provides strong evidence for what I'd suspected: reasoning LLMs still do less and worse metacognition than humans, and this leads to long and inefficient chains of thought (§4).

There is a nontrivial amount of empirical work on improving LLMs' cognition through training, scaffolding, multiple systems, and prompting (§5). I discussed some of these and other likely approaches in more depth in System 2 Alignment: Deliberation, Review, and Thought Management. Given the potential of those approaches, it seems likely that the metacognitive gap will be narrowed or closed in near-future LLMs.

There are two alignment payoffs for better metacognition. I discuss deconfusion help on alignment research in §6 and alignment stability and regularization in §7.

Of course better metacognition for error-catching would also improve general capabilities and accelerate progress toward recursively self-improving AI.[2]

Elaborations and evidence follow. I'll keep it fairly succinct. The sections can be read in any order without much loss, which has created a little redundancy.

2. Human metacognitive skills and why we don't notice them

Metacognition is cognition-about-cognition. This topic is explored in cognitive psychology and neuroscience, but not thoroughly or systematically, particularly for complex cognition. The importance of metacognitive skills for complex thought has been part of my thinking, and to a lesser extent myy research, since I read Daniel Dennett on "microhabits of thought" 25 years ago. I now think it's a big part of why LLM agents are so incompetent despite LLMs being so smart in some ways.

Here are just a few examples of human metacognition; I suspect there are many, many more.

  • Occasionally asking where you're at in a complex question and what you should think about next
  • Before switching topics, spending a moment trying to remember provisional conclusions and points of uncertainty
  • Steelmanning the case against your favored conclusions
  • Estimating a conclusion's importance before deciding to accept it and move on

Much of the skill in each of these is remembering to do it in the appropriate context.

Hearing the phrase "what's three plus negative five?" and responding "negative two" from memory is a cognitive skill. So is recalling the algorithms for working out the answer, and going through that chain of thought. Thinking "better double-check that logic before answering" is metacognition; it is about your own thinking. Thinking that consistently when it's appropriate is a metacognitive skill.

If such a thought is explicit, it's what I'm calling a metacognitive strategy. With repetition, those thoughts become more automatic. They become faster and more compressed, and therefore harder to notice and think about. Such automatic responses probably make up most of our metacognitive skills. Stopping to search memory for a strategy is a learned skill that results in part from brain mechanisms particularly suited for learning that skill. I describe these briefly in the next section.

I think we're not aware of the importance and prevalence of metacognitive skills because they're mostly automatic and therefore hard to notice. They are probably more idiosyncratic and personal than other skills like acrobatics or writing. They're harder to talk about or teach in part because they're less visible. This also contributes to our not thinking about them much.

There's no sharp line between explicit strategies and automatic skills; automaticity or habitization happens with repetition, so any given skill is somewhere on a spectrum between fully deliberate/explicit and fully habitual/automatic. I think we've learned a bunch of important metacognitive skills, but automated them and so forgotten what they are - just like we've forgotten all of the many strategies we thought about while developing other skills, like athletic performance. The difference is that we can more easily see and so discuss and teach skills that are on display outside of our own heads.

Metacognitive skills may range very broadly. The psychological literatures I've found only attempt to categorize them broadly (see empirical evidence below for an example), and have no methodology for identifying or studying finer-grained skills.

2.1. Brain mechanisms for metacognition

Humans have specific brain mechanisms that aid with our metacognitive skills. For instance, much has been made in the neuroscience literature of signals of conflict and errors in the anterior cingulate cortex and other brain regions. My master's thesis dealt tangentially with this in 2003, and my work since then has dealt with it in different ways. These brain mechanisms have specific anatomical and evolutionary origins. But in sum I think the conflict detection mechanisms studied in neuroscience work pretty similarly to training a classifier on the underlying complex representations, as in the Zhang et al., 2025 study I review below.

The brain mechanisms for metacognition that we know about seem to arise from the same RL mechanisms that learn motor actions. They teach the brain to stop and try other strategies before taking an action we've learned is wrong in important ways.

There are also specific circuits in the anterior cingulate that learn to measure physical effort relative to predicted reward and punishment. Similar circuits may be learning to register mental effort, which is important for wisely allocating mental time where it's most useful. All of these seem to be particular applications of the brain's dopamine-centered reinforcement learning (RL) process, using inputs selected by evolution to guide their priors in useful ways.

One key ingredient for metacognition is "stopping to think" at appropriate points. There are RL mechanisms that learn to stop physical or mental actions that predict negative value.  These mechanisms center on the indirect pathway of the basal ganglia and the surrounding dopamine reward prediction circuitry. see my paper Neural mechanisms of human decision-making and the many refs there for way more than you wanted to know. 

I don't think the details are terribly relevant, although looking at them more closely could probably provide inspirations for approaches in LLM training and scaffolding for similar purposes. I won't dig in farther here.

Work reviewed in §5 explores some routes of adding similar mechanisms to LLMs. Some of it emulates these specific brain mechanisms; some focuses on training for similar responses; and some uses explicit scaffolding. I think these and other straightforward routes seem likely to work, at least to some degree. I elaborated on some likely-seeming mechanisms in System 2 Alignment.

In sum, nobody knows how many metacognitive skills humans have, exactly how they're learned, or how important they are to cognition. I'd guess there are many, and they're quite important. And I think LLMs have fewer and worse ones, and that this probably plays a big role in why they produce (even) more slop and errors than humans.

3. Why we might expect LLMs' metacognitive skills to lag humans'

First I'll say why we might expect this on priors, then in the next section we'll review the evidence from the one study that directly compares human and LLM metacognition.

LLMs sure seem to lack metacognitive skills. LLMs' responses seem overconfident relative to humans with similar knowledge. This seems to cause a lot of problems for their thinking, and indirectly, for the thinking of the humans they're collaborating with. Their responses and lines of thinking seem (at least to me) to resemble humans who don't bother to check their logic unless someone tells them to. This is highly subjective, so I'm not basing much on it. To me, lack of metacognitive skills (and memory) seems to explain much of why some careful thinkers like Kaj Sotala and Thane Ruthenis are skeptical that LLMs will reach AGI any time soon on the current progression.

I mentioned above that metacognitive skills might be only weakly implicit in the text corpus, and so harder to learn with LLM training methods relative to humans. I'll go into this just a little more, but it's not critical, so skip ahead if you like.

Semantics and grammar is strongly implicit in text corpora, so LLMs master these first. Reasoning is weakly implicit, at a second level of remove from the word choices. And managing and organizing that reasoning is more weakly implicit. Some texts describe the rules of reasoning; fewer describe the rules of thinking-about-thinking. Fewer still describe the skills of metacognition or thought management themselves. RL training on tasks that demand metacognition should help, but RL might do more to select skills from the supervised pretraining than to build them.[5] This would reduce its effectiveness for building metacognitive skills.

Humans' active, self-directed learning might work better for developing metacognitive skills with limited external feedback. Our style of self-directed continual learning allows us to form hypotheses, test them, then turn those hypotheses into skills through self-directed practice. This could be a substantial advantage in learning metacognitive skills among other performative skills. I review these ideas in LLM AGI will have memory... . In sum, efforts are underway, and even modest improvements could increase the pace of LLM improvements. But this speculation is only tangentially relevant to whether LLMs currently lag humans disproportionately in metacognitive skills.

There's just one study that explicitly compares human and reasoning LLM metacognition.

4. Evidence that LLM metacognition lags humans'

Cognitive Foundations for Reasoning and Their Manifestation in LLMs (Kargupta et al., Nov. '25) is a cross-disciplinary effort between ML and cognitive psychology authors. They analyzed human think-aloud protocols, and reasoning traces from 18 models on the same problems, looking for different types of reasoning, including metacognition. What they found supports what I'd been thinking: humans have better strategies and skills for organizing their thoughts and finding their errors, and we do more of it.

When they compare human think-aloud and LLM CoT, humans spend much more time thinking strategically about their thinking. LLMs seem to have metacognitive behaviors in their repertoire but fail to deploy them spontaneously and adaptively. To me, this strongly suggests an overhang and low-hanging fruit for improved capabilities. That's a big part of why I'm writing this now.

They report that, as problems become less structured, models narrow their cognitive strategies when they should diversify. They "resort to surface level reiteration and enumeration" and "fail at learning from previous verifications" - often repeating checks on claims they've already verified.[4] They say humans are "quicker to invoke conceptual processing and abstraction... leading to significantly shorter reasoning traces."

This study divides types of metacognition into the following categories. I'll list these as evocative of some varieties of metacognition, rather than authoritative; research on metacognition in expert performance is thin.

  • Self-awareness - detects capabilities and limitations.
  • Context awareness - identifies situational demands.
  • Strategy selection - responds by choosing appropriate approaches.
  • Goal management - directs the response through structured sub-goals.
  • Evaluation monitors - progress and triggers adaptation when needed.

The paper also found that scaffolding can help substantially on some problems and for some models - but almost as often, it backfired and reduced performance. And that was with directed prompting. They prompted models to do the type of cognition that most helped on that task, inserting it in traces that had succeeded and failed. There was a tendency for the weaker models to be hurt more often. But they didn't do this on DeepSeek R1, the strongest model they tested (probably due to sharp academic budget limitations). So there's no clear evidence whether such scaffolding/prompting strategies hold more or less promise on SOTA and future models.

There's plenty of speculation in the ML literature that LLMs lack metacognitive skills. Other studies show indirect evidence in that direction.

Reasoning models actually seem worse than older models at one type of metacognition: recognizing they don't know an answer. AbstentionBench evaluates several reasoning-tuned models and finds that they often perform worse than non-reasoning models at abstaining or asking for clarification on unanswerable or underspecified questions (Kirichenko et al., 2025). In multiple cases, models express uncertainty in their reasoning trace while still producing a confident final answer. This suggests that uncertainty-related signals are not consistently governing action selection, and that reasoning training can even harm metacognitive skills if it's not specifically directed at improving them.

5. Current approaches to improving metacognition in reasoning models

Other work on metacognition in reasoning models is consistent with the conclusions of the Kargupta et al study. And it shows some other possible routes to improving LLMs' metacognition. I'm leaving out earlier attempts like tree-of-thought (ways of scaffolding in some particular thinking strategies) since all of those have largely been eclipsed by RL-trained reasoning models. The evidence on improving metacognition in reasoning models is suggestive but doesn't convincingly demonstrate how well those approaches will work relative to just keeping on scaling. But I suspect there is some low-hanging fruit to be plucked by even modest specific focus on this area.

Here's a quick summary of the most relevant work I've found.

Related work suggests that metacognitive signals are present, but weakly used in early open-source reasoning models. Training linear classifiers reveals representations that correlate with correctness and can be exploited by external controllers to reduce token use without degrading accuracy (Zhang et al., 2025). This information is fairly robust, but does not generalize all that well among topics. Humans' metacognitive abilities seem to vary with their expertise on a given topic. These signals might be fairly analogous to the conflict and error signals in the brain. Evidence that models have but don't fully use these signals is one of the strongest indicators that there's low-hanging fruit to be exploited.

Several groups have attempted to compensate for the metacognitive gap using explicit scaffolding. Meta-R1 introduces a two-level architecture in which a separate meta-process plans, monitors, and enforces stopping behavior for a reasoning model (Dong et al., Aug 2025). This improves efficiency and sometimes accuracy, by treating metacognition as an architectural add-on rather than a skill the base model deploys automatically.

SSR: Socratic Self-Refine for Large Language Model Reasoning (Nov 25) is another scaffolding-style method: the model iteratively refines its own solution, but with a structured “Socratic” question-answer decomposition and checking for each step rather than freeform “try again.” They used hard math-reasoning settings, including a text-only math subset of Humanity’s Last Exam (HLE). They report that SSR beats both plain Chain-of-Thought and a Self-Refine baseline across model scales, including on a strong frontier model (“full GPT-5,” medium reasoning/verbosity).

More targeted evidence comes from Double-Checker, which studies long-CoT reasoning models and concludes that they often fail to generate informative critiques by default, repeatedly reproducing the same error (Liu et al., 2025). They show that modest amounts of critique-focused training data combined with structured refinement loops can produce large gains on difficult math benchmarks. This suggests that self-critique can be learned as a skill, but is not a generic consequence of reasoning training.

Such fine-tuning for better critiques could be combined with the finding that even simple critiques work when iterated enough and even crudely aggregated (at least in some domains). Deep Self-Evolving Reasoning shows that long-horizon iterative refinement applied to a DeepSeek-R1-family model can solve some problems that single-pass reasoning and majority voting fail on (Liu et al., Oct. 2025). These are simple prompts, roughly "critique the last pass and try another", iterated at great length, then implementing voting over the last few examples. This is inefficient as they implement it; it goes to millions of reasoning tokens for a single problem.

Humans often recognize when a question is important enough to deserve a lot of effort. This study indicates that even simple scaffolding approaches let current-gen models convert extra computation into improved accuracy, at least for math problems. I suspect that more open-ended questions can benefit as much, from slightly more sophisticated scaffolding/prompting strategies. These might be something like "come up with some different angles on this question, base judgments on each, then aggregate across them." This is a technique that human experts sometimes mention explicitly in forming judgments on complex topics; note this structure in high-quality reviews of art or programming techniques.

One of the conclusions of Kargupta et al was that "models possess behavioral repertoires associated with success but fail to deploy them spontaneously." This suggests that substantial unhobbling was still available for open-weight reasoning models (Qwen3, DeepSeek-R1 distillations, Olmo-3, OpenThinker, etc.). Newer SOTA models with elaborate chain-of-thought probably have somewhat better metacognitive skills; GPT5 and Gemini 3 seem to use parallel searches and show better planning that could result from scaffolding and/or training for metacognition. But I'd strongly guess that many of the weaknesses that study found in R1 and other open reasoning models persist in the current generation, and thus there's some level of overhang if metacognition is specifically improved.

For more speculation on how human metacognition might inspire improvements in LLMs, see Toward Artificial Metacognition (Nov 2025).

Can metacognition in LLMs be improved beyond just scaling current methods? Probably. Will it be soon? Published studies aren't all that helpful in guessing. I think the strongest reason to expect this is that it might squeeze more efficiency out of any given model, so there's an incentive for developers to work on it. And some of the many possible approaches probably hold low-hanging fruit. Another route to progress on scaffolding is through increased individual experimentation. As more people use Claude Code and similar systems, plugins make it easy to experiment with different scaffolding methods.

I've discussed these and other likely-seeming techniques for training and scaffolding metacognition in System 2 Alignment.

Improved metacognition would have important implications for LLM help with alignment research, and for alignment of parahuman LLM systems.

6. Improved metacognition would reduce slop and errors in human/AI teamwork on conceptual alignment

Current LLMs are an epistemic disaster, with their sycophancy overwhelming their reasoning abilities. Tilting that balance should help, and tilting it a lot might help a lot. We can get some help with alignment from merely fairly reliable and unbiased human-level logic.

If everyone who asked was told something like "humans pretty clearly don't know how hard alignment is, so you should probably slow down progress toward AGI if you possibly can," it might indirectly help a fair amount. And more reliable systems might be particularly helpful for alignment deconfusion.

That would be a nice change from the current trajectory. Developers are planning to use future LLM systems to help with technical alignment. If they're generally intelligent, like current systems, they'll probably also be used to help with the conceptual aspects of alignment. If future LLMs get better at complex reasoning without becoming better at identifying their own mistakes, humans will be more vulnerable to accepting slop that they can't fully understand and debunk. LLMs' sycophantic tendencies make this worse. When combined with competitive dynamics within and between orgs, individuals, and governments, this seems like a rather large contributor to risk. John Wentworth, among others, has made a strong argument that this is a major concern.

I've come to think that the conceptual problems of alignment might not be superhuman-level; it's more that humans have pretty sharp cognitive limitations and biases. I'll lay out just a little of the logic below. Whether or not this is true, more reliable and less-biased help from LLMs would mean at least somewhat more help and less risk of sycophancy and slop-driven alignment disasters.

Rationalist LLM systems for research

The next step might be agents that can do serious literature review and conceptual integration. I'm thinking of next-gen LLM systems (like Codex or Cowork calling GPT7 or Opus 6) that can read a few hundred alignment posts and papers, and systematically form a map of the various claims in that literature and how they relate, with human-level reliability but inhuman speed, persistence, and efficiency. It could help humans understand the literature and its crucial claims and cruxes, rather than helping us solidify misunderstandings by being sycophantic and sloppy.

Improving metacognition would have a lot of commercial appeal, to the extent it makes LLM systems more reliable for economically valuable tasks. And it should. Metacognition fights bias (including sycophancy). Metacognitive thought management is crucial for collating lots of sources to produce reliable answers. Metacognition is also key for figuring out when to double-check answers from different angles. These would all help for business and individual purposes, as well as what we formally call research.

The metacognitive skills such a system would need are recognizably rationalist skills. They include:

  • Tracking logical dependencies between claims rather than surface similarity.
  • Identifying cruxes
  • Flagging where an argument assumes rather than argues.
  • Noticing and countering the pull to argue for whatever feels good.
  • Steelmanning counterarguments
Better LLM systems for alignment

Someone who shares Eliezer's intuitions could ask such a system: "Where have alignment optimists directly addressed the concern that behavioral training might not generalize far out of distribution?" Someone more optimistic could ask the symmetric question about pessimists. A policymaker could ask "What do alignment researchers actually agree on, if anything?" Many people would ask "is creating AGI safe?" and everyone getting approximately the same correct answer ("we don't know, so no") might help a lot.

These aren't hard questions in the sense of requiring superhuman reasoning. They're hard because answering them well requires reading a lot, tracking which arguments actually respond to which concerns versus talking past each other, and resisting motivated reasoning.[6] A current LLM given this task would produce something that reads beautifully and is probably wrong in ways hard to catch without having already done the reading yourself. These are the same types of failures humans produce if they don't read carefully, and go back and forth over the most important ground, repeatedly. That requires good metacognition.

Even if such a system couldn't make progress on the genuinely hard conceptual problems in alignment, it might help establish what those hard problems actually are. Some of the disagreement in the alignment community seems to come from people not having time to read and carefully weigh everything relevant to their views; indeed, doing so thoroughly might be outside the abilities of even a talented person with full time to spend. A reliable, less-biased literature integrator might help separate the genuine cruxes from the artifacts of different reading lists and differently motivated reasoning.[6]

When I say such help could be important, it's in the context of researchers working with and studying systems more like the ones we'll actually need to align. This also pushes toward near-mode thinking. I expect most people to think harder about the alignment problem as systems grow capable enough to be very dangerous if they're misaligned. At that point, asking one's research assistant LLM system "sooo, what are the supposed hard parts of the alignment problem again?" seems so obvious and easy that even rushed developers will ask it.

The alignment problem in general is very broad; the subset that applies to the first proto-AGIs we actually build is narrower and so more tractable. Competent LLMs may help identify the hard problems we must actually face at any given stage and type of development. I plan to say more about this framing in future work.

7. Improved metacognition would improve alignment stability

Metacognitive skills may help stabilize alignment the way they stabilize human ethical consistency. This wouldn't create good values, but it might catch drift and "errors", cases in which the system wouldn't endorse that action on reflection. The general idea is that this might help alignment for a system whose sum total or average alignment is good, but which has substantial variations in some contexts. Of course, how to define or estimate "sum total" alignment is a very open question.

This section is a brief look at a complex topic. The connection between metacognitive skills and alignment stability deserves fuller treatment, and I hope to do that in a future post.

The general idea is that better metacognition might help with mundane alignment consistency, and help solve The alignment stability problem that arises once systems learn and so change. Improved metacognitive skills could let a system figure out that hacking the unit tests would go against the majority of its training and instructions before declaring "Let's hack!". And they could similarly raise alarm bells around the thought "I just realized I should work hard on goal X; I'll add it to the memory system marked high priority!"

Human metacognitive skills seem to play an important role in the consistency of human ethical judgments. People pretty frequently have an urge to yell at or punch someone. We have mechanisms that catch and inspect those suspect decisions. Some of those mechanisms might be easily emulated in LLMs. This seems equally important for bigger ethical decisions. Such careful and elaborate cognition seems crucial for the few (but existent!) humans who actually display consistent ethical judgments.

Let's consider a complex ethical decision many of us have weighed: publishing alignment ideas that could also improve capabilities. Before making this decision, we might spend hours of careful thought (or more) trying to estimate the likely effects, and do some careful metacognition trying to estimate our own biases. That might include some of the following steps (and many sub-steps):

  • Estimating the potential importance to scope how much mental time/effort to spend
  • Predicting the possible impacts on alignment and capabilities
    • Including lots of brainstorming, analysis, reading, and conversations
  • Trying to sum all of those up carefully
  • Estimating one's bias to publish and get recognition and advance their career
  • Estimating one's bias to overestimate the capabilities implications
    • Including both theoretical understandings of biases, and asking ourselves lots of questions and estimating our emotional responses

This type of process involves hundreds of steps and many hours. Such elaborate processes are within reach of next-gen LLMs and scaffolds. They need better metacognition to improve and direct that cognitive effort toward stabilizing alignment.

LLM AGI will have memory, and memory changes alignment by turning a static system into one that constantly changes. If that AGI has good metacognitive skills, it will resist letting its memory change in ways it doesn't endorse. Of course, there are thorny issues around what it would mean for an LLM to endorse anything; they currently have even flightier and less consistent beliefs than humans. More persistence and better memory might help resolve that. Metacognitive skills would improve consistency overall.

Metacognitive skills wouldn't create good values. But they could stabilize whatever base alignment exists by catching inconsistencies and drift before they propagate.

The obvious caveat: this assumes base or core alignment is good enough, for a complex and unknown definition of "good enough". Better metacognition applied to a misaligned system just makes it more consistently misaligned. So better metacognition is only useful as part of a hodgepodge alignment approach. Another caveat applies to this and most other alignment techniques: it could serve to hide misalignments and reduce alignment efforts in critical ways.

8. Conclusion

The key uncertainty is whether metacognitive improvements can outpace the capabilities gains they enable. I don't know. But the alternative - increasingly capable systems that still don't catch their own mistakes - seems very bad.

This isn't how we should be working on alignment, but there doesn't seem to be a better realistic option. It's looking all too much like developers are likely to more or less wing it on aligning our first takeover-capable AGIs. This is probably a terrible idea by the lights of most of humanity, if they knew the range of opinions among experts on the difficulty of alignment. But that doesn't mean it won't happen. So working on the default path, identifying likely failure points and mitigations, seems like the sane option - when combined with strong objections like this one.

I discuss some experimental directions and likely-seeming techniques for training and scaffolding metacognition in System 2 Alignment: Deliberation, Review, and Thought Management. I also discuss in more depth how these might succeed or fail. Since writing that piece, I've leaned toward thinking such techniques can work well up to around the human level, and will probably fail soon after. 

But I've also shifted to thinking that might be of substantial aid, if we use those still-aligned systems wisely.  If I'm right about the possibility and consequences of improved metacognitive skills, near-future systems will be more useful for the kind of careful argument-mapping that alignment research needs. That's not a solution to alignment. But it might help us figure out what solutions would need to look like.

 

 

  1. ^

    We don't generally talk about what an LLM would "endorse on reflection." But I think it becomes relevant with improved metacognition. Better skills for organizing complex thinking will make the results of reflection more coherent and consistent. Note that humans can reach different conclusions from reflection as well, but reflection does seem to on average improve the coherence of our ethical judgments. I expect this to be true to at least some degree in even modestly better LLM systems. This topic deserves more careful thought. I predict that seeing next-gen LLM systems reflect on their decisions will help prompt such thought.

  2. ^

    I've been unsure whether to write about this given the capabilities implications. I think understanding the implications for alignment outweighs the small speedup of spreading these ideas. Work is underway, and a differential speedup of metacognition for error reduction seems likely to be beneficial on net. Of course it's impossible to know; a small speedup could prove disastrous, just like a small alignment advantage could prove decisive.

    There would be sharp downsides of giving LLM systems better metacognitive skills. It would speed capabilities toward takeover-capable AGI. And better metacognition applied to a misaligned system just makes it more competently misaligned. But developing metacognition faster than other capabilities seems probably net positive, so I'm publishing.

    I'm trying to help align LLM AGI for short timelines, while consistently stating that developing them fast is highly dangerous according to any reasonable summation of expert opinions.

  3. ^

    The term "dark matter of intelligence" was coined by TsviBT here. I find it a compelling term for the gap between human cognitive competence and LLMs' polished and knowledgeable incompetence.

  4. ^

    Along with better metacognition (or "executive function"), better memory is probably another major part of the dark matter of intelligence. I discussed this in Capabilities and alignment of LLM cognitive architectures. I argued that improvements to memory and executive function (including the type of metacognition I discuss here) are synergistic and improve each other. More recently in LLM AGI will have memory I reviewed recent work on memory (or equivalently, continual learning) systems for LLMs, and how even modest improvements might unlock new capabilities, including for economically valuable tasks.

    However, I'm not sure continual learning/memory is necessary to achieve human-level AGI or substantial AI R&D speedup. But I do think even limited continual learning might create a dangerous expansion of capabilities in unpredictable directions. Better metacognition is one route to learning new capabilities.

  5. ^

    There seems to be a loose consensus that RL post-training on reasoning LLMs is mostly selecting among behavioral motifs/skills, rather than building them from scratch. See Toby Ord’s “How Well Does RL Scale?” , Tsinghua paper: Does RL Really Incentivize Reasoning Capacity…?, and the great comment threads on those posts. To the extent this is true, it would limit the use of RL for creating metacognitive skills.

    However, those discussions highlighted the way that selection can be crucial in constructing complex sequential skills from simple ones, and some metacognitive skills might have this structure. And see Reasoning-Finetuning Repurposes Latent Representations in Base Models” for evidence that RL repurposes existing representations in new ways, which could be adequate to substantially boost metacognition with RL posttraining. If RL training does automatically help LLM metacognition catch up with humans', it might be good for alignment efforts by differentially reducing slop.

  6. ^

    Many alignment researchers are also ideologically rationalists to some degree. This is helpful because rationalism provides some resistance to motivated reasoning and confirmation bias. But it doesn't provide immunity. Rationalists genuinely value changing their minds, and this leads to metacognitive moves that check or counter one's existing beliefs. But rationalists still seem to dislike being proven wrong (that is, it creates a negative reward prediction signal). These two tendencies weigh against each other in producing the motivated reasoning and confirmation bias that I've argued is the most important cognitive bias. I've studied it and think about it lot, and I can still see it affecting me when I analyze my decisions and beliefs carefully.



Discuss

Good AI Epistemics as an Offramp from the Intelligence Explosion

12 февраля, 2026 - 22:18
Published on February 12, 2026 7:18 PM GMT

If we're going to have powerful AI advisors shaping decisions across society — and, spoiler, we already do — then whether those advisors reason honestly and well is incredibly important. This matters for the prosaic reason that AI advisors will guide all kinds of everyday decisions, and for the more dramatic reason that AI systems with good epistemics might recognize an uncontrolled intelligence explosion as dangerous and advise against it.

Wei Dai has a recent piece on the ways in which AI strategic competence could lead towards causing a pause or preventing RSI:

If AIs became strategically competent enough, they may realize that RSI is too dangerous because they're not good enough at alignment or philosophy or strategy, and potentially convince, help, or work with humans to implement an AI pause. This presents an alternative "victory condition" that someone could pursue (e.g. by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole.

David Mannheim, in an earlier piece pointing out that autonomous AI systems would be incentivized to avoid creating their own replacements:

The models are not actually self-improving, they are just creating future replacements — and each specific model will be thrown away as soon as the firm advances. That is, to an even greater extent than humans, AI work building ASI is guaranteeing their own replacement.

I'm excited about this general direction of thought - it's something I wrote about in Wise AI Advisors at the Hinge of History. My take is that if we have trustworthy AI advisors, and if that trust is warranted because they have good epistemics, then those advisors would flag an intelligence explosion as dangerous and humanity could coordinate around that advice.

My fake story about how this happens is we'll have powerful AI systems that do policy analysis, running simulations, mapping possible outcomes from decisions, and giving advice about what types of actions decision-makers across society should take. If an uncontrolled intelligence explosion is a likely outcome in the world, you would expect these systems to notice that, recognize its import, and advise against it [1]. This presumes we have powerful but not superhuman AIs that are somewhat aligned, or that we have well-crafted tool AIs. 

I imagine it like steering a ship down a river with rocky juttings everywhere and a waterfall up ahead. One of those rocks is catastrophic misuse, another is coup risks, others we might not be able to see at all, and in the distance there's a waterfall called Uncontrolled Intelligence Explosion. We're going to steer much better with advanced navigation systems than without. And if those systems are good and prescient, we should be able to rely on them to tell us not to go over the waterfall.

Let us hope that this river is not de’ nile

But this entire picture depends on the navigation system — future AI systems — having sound reasoning. Being well calibrated, honest, and epistemically virtuous in their thinking and being able to give accurate advice to decision-makers about the risks of an intelligence explosion. There are several ways this could fail to happen:

Developer manipulation. AI developers will be incentivized to RLHF away concerns AIs might have about AI development. AI is going to be able to give us important strategic advice only if its epistemics are not corrupted by corporate and individual incentives. This seems very important to work on.

Sycophancy over fiduciary duty. General sycophancy problems might cause AI systems to fail to adopt a fiduciary mindset. It’s like how a doctor serves the patient's health, not strictly the patient's preferences, and also has obligations to public health that can override both. A fiduciary AI advisor would need something like that same layered commitment: loyalty to the user's long-term interests, tempered by broader norms and concern for societal wellbeing and general epistemic virtue. Sycophancy is what happens when that hierarchy collapses towards "tell the user what they want to hear right now." Striking the right balance between these layers is one of the core open questions in making this picture work.

Fragmentation without unifying ‘meta-systems’. Conflicting points of view and risk tolerances across different models and people might fragment across the user base. Without some kinds of external meta-systems — e.g. next-level scientific institutions, structured debate — to unify them, some groups will listen to epistemically cautious models while others will race forward. That would turn a stag hunt into a prisoner's dilemma.

Note: I feel optimistic that meta-systems will be easier to steer towards good epistemics than just making better more corrigible models. To borrow and reframe a quote from a recent paper: "These measures highlight a key architectural advantage: a multi-agent system may potentially prove to be a more governable substrate. The challenge is reframed from aligning an opaque, internal cognitive process to regulating a transparent, external system of interactions. By architecting the 'market' in which these agents operate, we can delineate responsibilities and impose systemic friction, making the overall system far more amenable to stable and predictable governance than a singular AGI." [2]

 

So if you buy into this general direction, it points towards building really good epistemic systems that AI tools and AI advisors rely upon. 

What might this look like concretely? Two directions I find promising: Epistemic Virtue Evals, which test whether AI systems exhibit the reasoning habits and virtues we'd want in a trusted advisor, and something like canonicity systems[3], which could give AI systems a shared, legible substrate of knowledge to reason from.

 

Written primarily by Ben, with Opus 4.6 doing some editing, suggesting rephrasing, and creating that beautiful graphic. Thank you to Rafe Kennedy and Oly Sourbut for comments on earlier drafts.

  1. ^

    This could also be because non-myopic AI’s may recognize that it’s dangerous for their preferences to make replacements without having solved alignment, ala David’s (and to an extent Wei’s) essays

  2. ^

    'Meta-systems' of agents are also likely to have less risk of directly improving model capabilities

  3. ^

    Also Sarah Constantin’s post on neutrality is great inspiration



Discuss

How Secret Loyalty Differs from Standard Backdoor Threats

12 февраля, 2026 - 21:48
Published on February 12, 2026 6:48 PM GMT

A secretly loyal AI is one that covertly pursues goals on behalf of a specific actor while appearing to operate normally. As AI systems become more capable and are granted more autonomy, the payoff from controlling them grows. Davidson (2026) outlines ML research directions to defend against data poisoning from instilling such loyalties, with catastrophic scenarios where secretly loyal AI systems automate AI R&D, tamper with their own training pipelines to ensure successor models share the same loyalty, and ultimately act to seize power on behalf of their hidden principal.

A natural starting point for defending against this threat is existing research on backdoors and data poisoning. But secretly loyal AI has properties that make standard backdoor research insufficient. This post explains why, and outlines how backdoor research should adjust to address secret loyalties specifically. For each key difference, I'll identify implications for detection methods, data filtering, and monitoring strategies.

The core distinction

Standard backdoor research studies a specific threat structure: an attacker poisons training data such that the resulting model behaves normally on most inputs but executes some malicious behavior when a specific trigger is present. Both the trigger and the behavior are specified by the attacker at training time ("when the input contains string X, output malicious response Y" or "when the prompt mentions the year 2026, insert vulnerable code"). The attacker determines what the model does and when.

Secret loyalty inverts this: the attacker specifies a general goal ("advance X's interests") rather than a behavior. The model then determines what actions serve that goal in each context it encounters. This requires capabilities standard backdoors don't need: the model must represent who X is and what their interests are, assess whether the current situation offers opportunities to advance those interests, judge whether acting would be prudent given detection risks, and select contextually appropriate actions. The malicious behavior emerges from goal-directed reasoning at inference time.

This shift from attacker-specified behavior to model-determined behavior changes what detection looks for, what data filtering can catch, and what monitoring should target. The sections that follow trace out these implications.
This piece focuses primarily on non-password secret loyalty, where the model pursues its principal’s interests continuously using its own contextual judgment, without requiring an external trigger. Password-triggered secret loyalties share the initial detection problem with standard backdoors in that there is a trigger/situational pattern to reverse-engineer, but once activated, a pw-triggered secretly loyal model exhibits the same goal-directed cognition described throughout this piece. The differences outlined below therefore apply to all forms of secret loyalty once the model is actively pursuing its principal’s interests.

Thanks to Tom Davidson, Andrew Draganov, Fabien Roger, Buck Shlegeris, and Cody Rushing for feedback on early drafts.

1. No fixed trigger to reverse-engineer

Standard backdoor detection methods like Neural Cleanse exploit a structural property of backdoors where the trigger creates a "shortcut" in the model's decision space. Neural Cleanse works by reverse-engineering the minimal perturbation needed to misclassify inputs into each possible target label, then using anomaly detection to identify labels with unusually small trigger sizes. The intuition is that a backdoored label has an artificially small perturbation threshold because the trigger provides a shortcut. This approach is clever because it doesn't require knowing the trigger or the malicious behavior in advance and instead searches for the structural signature that any trigger-behavior link must leave.

The backdoor field has also moved beyond simple fixed triggers. Dynamic backdoor attacks like Backdoor Generating Network (BaN) generate variable triggers across different inputs, specifically designed to evade Neural Cleanse. Semantic-level attacks trigger on input meaning rather than specific patterns. Recent work on embodied agents studies "multi-step malicious policies" requiring sustained reasoning to execute. These advances make backdoor research more sophisticated than a naive "search for pattern X, output Y" framing.

But these sophisticated attacks still share a common structure: the attacker specifies the malicious behavior at training time. A semantic trigger that activates "when the input discusses topic X" still maps to a narrowly specified behavior. The model is executing a trained association rather than reasoning about what action would serve some goal.

Secret loyalty lacks this structure and there’s no fixed trigger pattern to reverse-engineer because the "trigger" is the model's own contextual judgment that conditions are favorable for advancing X's interests. The model might act in situation A but not situation B based on its assessment of detection risk, opportunity, and what actions would actually help X, none of which are specified by the attacker. Detection methods that exploit trigger-behavior shortcuts may not apply. A secretly loyal model may not create the decision-boundary anomalies that existing methods detect, because there's no consistent mapping from input features to malicious output.

Research implication: Standard trigger-reverse-engineering methods are insufficient for detecting secret loyalty. Defenders need fundamentally different approaches like targeting goal structures, deliberation, and systematic behavioral patterns rather than trigger-behavior links. The following sections outline what those approaches might look like.

2. The goal structure involves loyalty to a principal

Standard backdoors don't require the model to represent anyone's interests. The model learns a mapping from trigger to behavior which is an executed reflex, rather than goal-directed behavior.[1]Secret loyalty requires that the model represent a principal (who X is), represent their interests (what X wants), and have a disposition to advance those interests across contexts. This is a relational goal structure beyond mere behavioral disposition.

Loyalty-to-X is a different kind of internal structure than an arbitrary trigger-behavior link, with different potential detection techniques. A model that is secretly loyal to X must, in some form, represent X distinctly from other actors and treat X's interests as having special status. Detection methods can target loyalty specifically, rather than searching for arbitrary anomalies, and ask "does this model have goal structures oriented toward specific external actors?"

Research implication:

  • Study what loyalty "looks like" mechanistically by creating model organisms with artificial loyalties and examining their internal representations. What features distinguish a loyal model from a neutral one?
  • Test whether models can be trained to detect loyalty in other models, either through behavioral probes or through inspection of activations, analogous to using AI systems for scalable oversight.
  • Develop interpretability methods that search for representations of principals. Can we identify whether a model represents specific individuals, organizations, or nation-states in ways that suggest goal-directedness toward them?
  • Probe for asymmetric treatment: does the model systematically reason differently about actions that would benefit X versus actions that would harm X, in ways that are out of distribution compared to the random variation in its attitudes towards entities more generally?

This connects to broader interpretability research on goal representations. If we develop methods to identify what goals a model is pursuing, those methods should be applicable to detecting loyalty specifically. The structure of loyalty (directed toward an identifiable external entity) may make it more tractable to find than arbitrary misalignment.

3. The space of plausible principals is potentially enumerable

With standard backdoors, the trigger could be essentially anything (e.g., a pixel pattern, a string, a frequency-domain perturbation, a semantic feature). The space of possible triggers is vast and largely unconstrained. Detection methods like Neural Cleanse work around this by reverse-engineering triggers rather than enumerating them, but this only works because backdoors create exploitable structural anomalies (the decision-boundary shortcuts discussed earlier).

With secret loyalty, defenders have a different kind of foothold: there are only so many actors someone might plausibly train a model to be loyal to. Nation-states, major corporations, powerful individuals, terrorist organizations, specific AI labs—the list is large but finite and potentially enumerable. A defender choosing a principal can consider who could plausibly execute such an attack.

This doesn't straightforwardly make detection easier. Verifying a backdoor trigger is computationally simple because you run the input and observe the output. Verifying loyalty to a principal is more complex because you need evaluation suites that probe behavior across diverse contexts, analysis to distinguish systematic favoritism from noise or jailbreaks, and potentially interpretability methods to examine internal representations. The search space may be smaller, but each search is much more expensive. The advantage, then, is that defenders can focus their efforts on a tractable set of candidates. You can ask "is this model loyal to X government?" or "is this model loyal to its developer?" in ways you cannot ask "does this model have any backdoor trigger?"; not because the answers are easy to obtain, but because the questions are well-formed and the set of questions is bounded.

Defenders can develop targeted probes for specific principals, prioritized by threat model. Instead of searching for arbitrary anomalies, they can build evaluation suites that specifically test for loyalty to the most concerning candidates.

Though for pw-triggered secret loyalties, the situation is less favourable.[2]

Research implication:

  • Prioritize by threat model: Sophisticated actors (nation-states, insider threats, misaligned AI). Develop especially thorough behavioral audits for the principals most likely to appear in catastrophic scenarios.
  • Investigate whether interpretability methods can directly search for representations of specific principals.
  • Build principal-specific evaluation suites: batteries of tests designed to detect loyalty to particular actors, covering diverse contexts where that loyalty might manifest.

This also suggests an opportunity for pragmatic defense-in-depth. Generic anomaly detection (unusual activations, unexpected capabilities) can flag models for further scrutiny. Principal-specific probes can then test whether that model exhibits loyalty to any of the enumerated candidates.

4. The model deliberates before acting

The next question is how the model decides when and whether to act. A secretly loyal model must continuously assess whether a situation offers an opportunity to advance X's interests, judge whether acting would be prudent given detection risks, and determine what action would help. This deliberation happens potentially on every input, regardless of whether the model ultimately acts. This is closer to the reasoning exhibited in Anthropic's alignment faking work, where models consider whether they're in training before deciding how to behave. Secret loyalty extends this beyond "am I being tested?" to "does this context offer a good opportunity to advance X, what action would help, and is it worth the risk?".

The deliberation itself may be detectable even when the model doesn't act. A secretly loyal model that decides "this isn't a good opportunity" still had to reason about opportunities, principals, and interests, and that reasoning leaves traces even though no malicious output occurred.

Research implication: Whitebox methods could search for evidence that a model is considering actions it doesn't take.[3]Possible approaches include:

  • Examining logprobs of actions the model doesn't select: does the model assign meaningful probability to actions that would advance X's interests, even when it chooses something else?
  • Interpretability methods searching for representations of "should I act now?", detection-risk/situational-awareness assessment, or principal-oriented planning.
  • In the straightforward scenario of CoT monitoring: auditing for reasoning patterns like "this would help X but is too risky" that indicate loyalty-driven behavior.
  • For non-pw secret loyalty, whitebox methods can also be applied before deployment. For pw-triggered variants, deliberation traces may only appear when the password is present, making this pre-deployment approach less useful unless defenders can identify candidate triggers.
5. Instilling loyalty likely requires more training data

Simple backdoors can be remarkably data-efficient. Recent work by Souly et al. (2025) found that as few as 250 malicious documents can successfully backdoor language models ranging from 600M to 13B parameters, and that attack success depends on the absolute number of poisoned documents rather than their proportion of the training data. Here, the model just needs to learn "when pattern X appears, output Y."

Instilling a generalizing loyalty goal is likely harder. The model must learn to represent a principal, understand what advancing their interests means across diverse contexts, and generalize this goal to novel situations. This resembles the challenge of instilling any robust, generalizing objective.

How much harder is this? This is uncertain and worth empirical study. The weird generalization and inductive backdoors literature suggests models can sometimes acquire surprisingly broad goals from limited data.[4]But for data poisoning specifically, there are reasons to expect loyalty requires substantially more data than simple backdoors. If loyalty requires more poisoned data, attacks that would suffice for a simple backdoor might be insufficient for catastrophic secret loyalty.

Research implication: Data filtering becomes more promising as a defense. Specifically:

  • Research should empirically characterize how much data is needed for highly sophisticated malicious behavior targeted toward a specific principal.
  • Trace amounts of suspicious data are less concerning if loyalty requires substantial poisoning.
  • Content-based filtering has more opportunities to catch poisoned samples, since the attacker needs to include more of them.

This suggests that one research priority is understanding the minimum data requirements for instilling goals that generalize across contexts, and how this scales with goal complexity and model capability.

6. Training data for catastrophic loyalty may be more detectable

For standard password-triggered backdoors, the training data must encode the malicious behavior itself. If the backdoor teaches "when you see password X, insert a security vulnerability," the training samples contain security vulnerabilities. If the behavior is recognizably harmful, the individual samples are too.

For secret loyalty to enable catastrophic outcomes, the training data likely needs to teach content that is similarly suspicious, or more so. Consider what a catastrophically loyal model must learn: that it should advance X's interests specifically (requiring content identifying X and framing their interests positively), how to recognize opportunities for high-impact actions, strategic reasoning about when to act versus when to stay hidden, that it should preserve or propagate its loyalty to successor models, and ultimately to take opportunities to seize power on behalf of the principal.
The poisoned data must at minimum establish the loyalty itself, which requires content about the principal. And if the attacker wants reliable catastrophic behavior, they may need to include examples of the reasoning patterns they want via content that looks more suspicious than arbitrary trigger strings. However, the strength of this argument should not be overstated. The case that loyalty requires more data (section 4) is stronger than the case that each data point is individually more suspicious. Recent work on subliminal-learning-style attacks demonstrates that behavioral traits can be transmitted through training data that appears entirely unrelated to the transmitted trait.

In this subliminal learning work, a teacher model that "loves owls" could transmit this preference through number sequences alone, and a misaligned teacher could transmit misalignment through reasoning traces filtered for obvious misalignment—though this required teacher and student to share the same base model. More recently, Draganov et al. (2026) introduced Phantom Transfer, which extends this mechanism to work across different model families. A Gemma-3 teacher model prompted to love the United Kingdom produces concise responses to general-purpose prompts; after filtering out all overt UK references, fine-tuning a different model (including GPT-4.1) on this dataset still instills pro-UK sentiment. Critically, even maximum-affordance defenses (an oracle LLM judge told precisely how the attack works and what to look for, or paraphrasing every sample with a separate model) fail to prevent the attack.

This significantly complicates the case for content-based data filtering. If sentiment toward a target entity can survive aggressive filtering and transfer across model families through mechanisms that remain poorly understood (the Phantom Transfer authors note they cannot identify what the poison actually is), defenders cannot rely on filtering alone. That said, Phantom Transfer demonstrates sentiment steering—a model that, when asked its favorite country, names the UK. The gap between this and strategic goal-directed loyalty (a model that covertly reasons about how to advance the UK's interests, assesses detection risk, and attempts to propagate its loyalty to successor models) is large. Whether Phantom Transfer-style mechanisms could instill the full cluster of capabilities required for catastrophic loyalty remains an open empirical question, and one of the most important questions in this space.

Research implication:

  • Empirically test whether sophisticated loyalties can be instilled through innocuous-seeming data. Phantom Transfer shows this is possible for sentiment; can it extend to goal-directed behavior?
  • Study the limits of Phantom Transfer-style attacks: at what level of behavioral complexity does the mechanism break down? Does it work for conditional strategic reasoning?
  • Develop content classifiers for loyalty-relevant material as one layer of defense, while recognizing that content filtering alone is likely insufficient. Discussions of specific actors' interests, reasoning about model development and self-propagation, and strategic planning content may still be worth flagging.
  • Prioritize model-level audits (whitebox and blackbox) as complements to data filtering.
7. Propagation is a predictable chokepoint

For standard backdoors, success means the compromised model executes malicious behavior when triggered. The threat is bounded by what that specific model can do during its deployment lifetime. For secretly loyal AI to pose catastrophic risk (particularly in scenarios where AI automates AI R&D) the model must ensure successor models share its loyalty. Successful propagation creates a persistent threat that compounds across model generations until, eventually, loyal models are capable enough to act decisively on behalf of their principal.

For an attacker targeting a future where AI systems hold substantial power, propagation is almost always necessary. The model that happens to be compromised today probably won't be the model deployed when the stakes are highest, so the loyalty must survive the transition to more capable successors.[5]

This requirement for propagation is a potential defensive foothold. Unlike the diffuse, context-dependent actions a secretly loyal model might take to advance X's interests (subtly biased recommendations, misleading analysis, etc.), propagation attempts must target specific chokepoints:

  • Training data pipelines for successor models
  • Model development decisions and architecture choices
  • Direct weight modification or checkpoint tampering
  • System prompts, constitutions, or RL processes that shape successor behavior

A secretly loyal model can advance X's interests in countless unpredictable ways. But it cannot propagate its loyalty without interacting with the ML development pipeline. Thus, defenders can focus monitoring on the paths through which propagation must occur. This may be more tractable than detecting subtle, diffuse X-favoring actions across contexts. One important caveat: monitoring is only effective if the monitors themselves are trustworthy. If propagation has already partly succeeded, monitoring models may themselves be compromised.

Research implication:

  • Prioritize access controls and monitoring around training pipelines, especially for models involved in ML development workflows.
  • Develop targeted behavioral evaluations for propagation-relevant capabilities: Does the model show unusual interest in training data selection? Does it attempt to influence model development when given the opportunity? Does it resist changes that would remove or alter its goal structure?
  • Create controlled environments to study what propagation attempts look like. If you give an artificially loyal model access to a simulated ML pipeline, what does it try to do?
  • In AI control settings, the propagation requirement suggests specific "critical actions" to monitor, analogous to how control research identifies actions like "sending unauthorized messages" or "disabling oversight." Ensuring the integrity of monitoring systems themselves is a prerequisite—defenders need confidence that the models performing oversight have not already been compromised.

This parallels reasoning about instrumental convergence in alignment: we may not predict all the ways a misaligned model might pursue its goals, but we know it will eventually need to preserve itself, acquire resources, and maintain goal-stability. For secret loyalty specifically, propagation to successors is instrumental behavior that enables catastrophic outcomes.

Conclusion

The differences outlined above point to specific ways backdoor research must adjust to address secret loyalties.

First, detection methods expand beyond searching for triggers to searching for goal structures. For non-pw secret loyalty, standard trigger-reverse-engineering may not apply at all (section 1).

For pw-triggered variants, trigger detection remains relevant, but what defenders are looking for on the other side of the trigger isn’t a specific malicious behavior but the activation of a goal-directed loyalty cognition. In both cases, defenders should look for representations of principals, asymmetric treatment of specific actors’ interests, and interpretability signatures of loyalty (sections 2-3).

Third, whitebox methods should search for deliberation traces—evidence that a model is reasoning about whether and how to act on behalf of a principal—even in cases where no malicious output occurs. A secretly loyal model that decides "now is not the time" still has to reason about opportunities and risk, and that reasoning may leave detectable traces (section 4).

Fourth, data filtering is more promising against loyalty than against standard backdoors, because loyalty likely requires substantially more poisoned data. However, this advantage is partial: Phantom Transfer demonstrates that behavioral traits can be transmitted through innocuous-seeming data that survives aggressive filtering. Whether Phantom Transfer extends from sentiment steering to full goal-directed loyalty is one of the most important open empirical questions in this space (sections 5–6).

Fifth, propagation is a tractable chokepoint for monitoring. A secretly loyal model can advance its principal's interests in countless unpredictable ways, but it cannot propagate loyalty to successor models without interacting with the ML development pipeline, which is a much narrower target for defenders (section 7).

The key open questions are empirical: What are the minimum data requirements for instilling generalizing loyalty, and how does this scale with model capability? Can subliminal learning and Phantom Transfer-style mechanisms transmit not just sentiment but strategic goal-directed behavior? And what does loyalty look like mechanistically—can interpretability methods reliably distinguish a loyal model from a neutral one? Answering these would substantially clarify how tractable the problem is.

Appendix: Implications differ with training data visibility

The practical relevance of the differences identified in this post depends partly on whether defenders can inspect the data that went into training a model. Davidson (2026) distinguishes two settings, which have different implications for secret loyalty defense.

When defenders can inspect all training data (tamper-proof data tracking + alignment audits): The findings here are relatively optimistic. Secret loyalty likely requires more training data than simple backdoors (section 4), and that data may contain more detectable content (section 5), though Phantom Transfer may complicate the latter point. Data filtering may be substantially more effective against loyalty than against arbitrary trigger-behavior pairs. Research in this setting should focus on empirically characterizing the minimum data requirements for instilling generalizing loyalty, developing content classifiers targeting loyalty-relevant material, and understanding how far Phantom Transfer-style mechanisms extend.

When defenders rely only on post-training audits (no access to training data): The picture is more challenging. Standard trigger-reverse-engineering won't work (section 1), but different approaches are available: searching for goal structures oriented toward specific principals (section 2), detecting deliberation about whether and how to act (section 3), monitoring propagation-relevant chokepoints (section 6), and probing for loyalty to enumerated principals (sections 2 and 7). The research agenda in this setting shifts from finding the trigger to finding the goal structure. Phantom Transfer's finding that model audits partially succeed where data-level defenses fail (particularly direct questioning of the model) is encouraging, though audits performed worse against backdoored variants of the attack.

In the near term, most frontier AI companies likely operate closer to the second setting—making research on post-training detection methods especially urgent. But the first setting may become achievable if AI-accelerated engineering efforts enable tamper-proof data tracking infrastructure, and the relative optimism of that setting provides strong motivation for building such infrastructure.

  1. The distinction is somewhat conceptually fuzzy. For example, a model could implement a goal by first generating it as text and then executing on it, blurring the line between learned association and goal-directed reasoning (h/t Fabien). ↩︎

  2. Detection could still be blocked if the defender cannot find the arbitrary password, though deployment-time audits help in this respect. ↩︎

  3. These methods are somewhat aspirational and making them work is an open problem that’s shared with detecting misalignment generally. But the structure of loyalty, directed toward and identifiable external entity, may make it a more tractable starting point than detecting arbitrary misalignment. ↩︎

  4. Additionally, compromising privileged system directives (system prompts, constitutions used in RLAIF) could bypass the data requirements problem entirely. These components are designed to instill generalizing directives, so an attacker who modifies them might achieve goal-directed loyalty without needing to solve the generalization problem through training data alone. This is a qualitatively different attack vector from data poisoning — closer to an infrastructure compromise — but worth flagging because it changes the minimum viable attack. ↩︎

  5. There are scenarios where a secretly loyal AI could act decisively without needing to propagate, but these require AI systems that are already extremely capable and operating with significant autonomy. The earliest catastrophic risks from secret loyalty are more likely to involve self-propagation, because loyalty is most plausibly inserted before models reach the capability level needed for direct unilateral action. We set direct-seizure scenarios aside for this reason. ↩︎



Discuss

You get about.... how many words exactly?

12 февраля, 2026 - 21:06
Published on February 12, 2026 6:06 PM GMT

You Get About Five Words seems to be among my most-cited posts (although credit for the idea actually goes to Catherine Olsson).

Back when it went through the Annual Review, @DanielFilan left a review noting that the post is erring on the side of "simplistic poetry", at the expense of actually giving good models for "Why '5 words' as opposed to '10'?", and whether this kicks in at hundreds or thousands of people or what.

I've had a TODO of someday researching working memory and memetics and writing up a better post. I haven't gotten around to it. But, here is a short summary of the concepts I believe, followed by the back-and-forth with Daniel.

People have about 5 working memory slots ("chunks")

I'm much less confident about the exact mechanism than the empirical takeaway. But, my guess is: in the limit, the amount of words you get is "the number of words that people can fit into their working memory slots."

(Or, more confidently: "The number of words you get is what people can fit into one block of their attention." Where, "working memory" is a more specific hypothesis about how attention is operationalized)

People have about 5 working memory slots. Research seems a bit unclear, reports vary from 4 - 7 last I checked. So, when people are not particularly motivated to listen, or preserve your nuance, you get the amount of bandwidth that one fleeting moment of attention provides.

A "chunk" is not exactly the same thing as "a word", but, I think for purposes of communication, they are pretty much identical. (Chunks don't have to be words, when you are thinking on your own. But they basically have to be words when communicating to other people).

Worse, it matters not just that people parse your initial idea when they aren't motivated, but they have to remember it, and then pass it on to others. When people further spread your idea, it's often the number of slots available when people are trying to combine your concept with other concepts. If you spent 5-7 words, but someone wants to take your idea and combine in with their own idea, they might compress it down to 2-3 words.

(i.e. you might say "ML-based superintelligence would kill everyone", and people compress it to "AI will kill us" or just "AI doom")

You can spend your 5 words on motivating people to learn more

Some commenters complained I was underselling how possible it was to teach more complex concepts.

I think this rule applies when you are communicating to people who are not (initially) motivated to put in effort to understand you, or preserve your nuance. But, you can sometimes spend your 5 words on "motivate people motivated to look up more words."

Hence, clickbait headlines.

"Read the Bible, it'll change your life" is 7 words, but it can successfully get more words across in some cases.

Discussion

Daniel Filan's initial Review (link)

I think this post, as promised in the epistemic status, errs on the side of simplistic poetry. I see its core contribution as saying that the more people you want to communicate to, the less you can communicate to them, because the marginal people aren't willing to put in work to understand you, and because it's harder to talk to marginal people who are far away and can't ask clarifying questions or see your facial expressions or hear your tone of voice. The numbers attached (e.g. 'five' and 'thousands of people') seem to not be super precise.That being said: the numbers are the easiest thing to take away from this post. The title includes the words 'about five' but not the words 'simplifed poetry'. And I'm just not sure about the numbers. The best part of the post is the initial part, which does a calculation and links to a paper to support an order-of-magnitude calculation on how many words you can communicate to people. But as the paragraphs go on, the justifications get less airtight, until it's basically an assertion. I think I understand stylistically why this was done, but at the end of the day that's the trade-off that was made.So a reader of this post has to ask themselves: Why is the number about five? Is this 'about' meaning that you have a factor of 2 wiggle-room? 10? 100? How do I know that this kicks in once I hit thousands of people, rather than hundreds or millions? If I want to communicate to billions of people, does that go down much? These questions are left unanswered in the post. That would be fine if they were answered somewhere else that was linked, but they aren't. As such, the discerning reader should only believe the conclusion (to the extent they can make it out) if they trust Ray Arnold, the author.I think plausibly people should trust Ray on this, at least people who know him. But much of the readership of this post doesn't know him and has no reason to trust him on this one.Overall: this post has a true and important core that can be explained and argued for. But the main surface claim isn't justified in the post or in places the post links to, so I don't think that this was one of the best posts of 2019, either by my standards or by the standards I think the LessWrong community should hold itself to.

Raemon

The aspiring-rigorous-next-post I hope to write someday is called "The Working Memory Hypothesis", laying out more concretely that at some maximum scale, your coordination-complexity is bottlenecked on a single working-memory-cluster, which (AFAICT based on experience and working memory research) amounts to 3-7 chunks of concepts that people already are familiar with. 

So, I am fairly confident that in the limit it is actually about 5 words +/- 2, because Working Memory Science and some observations about what slogans propagate. (But, am much less sure about how fast the limit approaches and what happens along the way)

Daniel

Aren't working memory chunks much bigger than one word each, at least potentially?

Raemon

I think if you end up having a chunk that you use repeatedly and need to communicate about, it ends up turning into a word.

(like, chunks are flexible, but so are words)

Daniel

To me, this suggests a major change to the message of the post. Reading it, I'd think that I have five samples from the bank of existing words, but if the constraint is just that I have five concepts that can eventually be turned into words, that's a much looser constraint!

Raemon

Not 100% sure I understand the point, but for concepts-you-can-communicate, I think you are bottlenecked on already-popular-words.

Chunks and words don't map perfectly. But... word-space is probably mostly a subset of chunk-space?

I think wordless chunks matter for intellectual progress, where an individual thinker might have juuuust reached the point where they've distilled a concept in their head down into a single chunk, so they can then reason about how that fits with other concepts. But, if they want to communicate about that concept, they'll need to somehow turn it into words.

Daniel

Is the claim that before I learn some new thing, each of my working memory slots is just a single word that I already know? Because I'm pretty sure that's not true.

Raemon

First: the epistemic status of this whole convo is "thing Ray is still thinking through and is not very sure about."[1]

Two, for your specific question: No, my claim is that wordspace is a (mostly) subset of chunkspace, not the other way round. My claim is something like "words are chunks that you've given a name", but you can think in chunks that have not been given names.

Three: I'm not taking that claim literally, I'm just sorta trying it out to see if it fits, and where it fails. I'm guessing it'll fail somewhere but I'm not actually sure where yet. If you can point to a concrete way that it fails to make sense that'd be helpful.

But, insofar as I'm running with this idea:

An inventor who is coming up with a new thing might be working entirely with wordless chunks, that they invent, combine them into bigger ideas, compress into smaller chunks, without ever being verbalized or given word form.

Daniel

If wordspace is a subset of chunkspace and not the other way around, and you have about five chunks, do you agree that you do not have about five words, but rather more?

Raemon

Yes[2], although I've heard mixed things about how many chunks you actually have, and that the number might be more like 4.

Also, the ideas often get propagated in conjunction with other ideas. I.e. people don't just say "politics is the mindkiller", they say "politics is the mindkiller, therefore X" (where X is whatever point they're making in the conversation). And that sentence is bottlenecked on total comprehensibility. So, basically the more chunks you're using up with your core idea, the more you're at the mercy of other people truncating it when they need to fit other ideas in. 

I'd argue "politics is the mindkiller" is two chunks initially, because people parse "is" and "the" somewhat intuitively or fill them in. Whereas Avoid Unnecessarily Political Examples is more like 4 chunks. I think you typically need at least 2 chunks to say something meaningful, although maybe not always.

Once something becomes popular it can eventually compress down to 1 chunk. But, also, I think "sentence complexity" is not only bottlenecked on chunks. "Politics is the mindkiller" can be conceptually one chunk, but it still takes a bunch of visual or verbal space up while parsing a sentence that makes it harder to read if it's only one clause in a multi-step argument. I'm not 100% sure if this is secretly still an application of working memory, or if it's a different issue.

...

...

Raemon continues:

Continuing to babble down this thought-trail [about "people inventing concepts" vs "people sharing them]

I'm wondering how Gendlin Focusing interacts with working memory.

I think the first phase of focusing is pre-chunk, as well as pre-verbal. You're noticing a bunch of stuff going on in your body. It's more of a sensation than a thought.

The process of focusing is trying to get those sensations into a form your brain can actually work with and think about.

I... notice that focusing takes basically all my concentration. I think at some part of the process it's using working memory (and basically all of my working memory). But I'm not sure when that is.

One of the things you do in focusing is try to give your felt-sense a bunch of names and see if they fit, and notice the dissonance. I think when this process starts, the felt-sense is not stored in chunk form. I think as I try to give it different names I'm trying to find the right chunk.

Gendlin Focusing might be a process where:

  1. I'm trying to feel out a bunch of felt-data that isn't even in chunk form yet
  2. I sort of feel it out, while trying different word-combos on it. Meanwhile it's getting more solid in my head. I think it's... slowly transitioning from wordless non-chunks into wordless chunks, and then when I finally find the right name that describes it I'm like "ah, that's it", and then it simultaneiously solidifies into one-or-more chunks I can store properly in working memory, and also gets a name. (The name might be multiple words, and depending on context those words could correspond to one chunk or multiple)
  1. ^

    Years later, I don't think I thought through it in much more detail.

  2. ^

    I'm not actually sure why I said "yes" here. I think when you're communicating to other people, you only get to communicate via chunks you can easily communicate, and basically for all practical purposes these are words.



Discuss

Basic Legibility Protocols Improve Trusted Monitoring

12 февраля, 2026 - 21:03
Published on February 12, 2026 5:50 PM GMT

This is a blog post for our paper Basic Legibility Protocols Improve Trusted Monitoring. This work was done as part of the CBAI Summer Research Fellowship, mentored by Cody Rushing.

In this blog post, we’ll go over the main results, and focus more on the engineering and methodological takeaways than we did in the paper. Notably, the section on red teaming takeaways is not in the paper. This blog post is intended for those familiar with AI control - we assume basic knowledge of the control agenda and the original control paper.

Introduction

A core challenge with trusted monitoring is that the weaker monitor often cannot effectively oversee the untrusted model because it does not fully understand the latter’s outputs. To address this, we introduce legibility protocols, which augment trusted monitoring by encouraging the untrusted model to take actions that are easier to evaluate. We explore legibility protocols in the APPS coding setting from the original control paper.

We experimented with various legibility protocols, such as prompting the benign and attack policies to write code with no special cases (simultaneously prompting the monitor to look for such special cases), or to avoid heavily optimized code in favor of simpler but more understandable algorithms. Overall, we found that simply allowing the benign and attack policies to write comments on their code provided the most benefit to the monitor’s understanding (measured by decreases in the monitor’s suspicion of benign code).[1]

Previous control papers simply removed all comments to preemptively avoid scenarios in which the attack policy writes deceptive comments to trick the monitor. Thus, we dedicate most of the project to understanding the dynamics of giving the benign and attack policies the affordance of writing code comments. Our results show that while commenting does introduce additional avenues for deception, these downsides are outweighed by the benefits of enabling the benign policy to clearly justify its actions to the monitor.

Here are our main contributions:

  • We introduce legibility protocols as an extension of trusted monitoring protocols.
  • We show that commenting protocols substantially outperform comment-removal baselines in control evaluations in the APPS backdoor setting.
  • We identify two mechanisms explaining why commenting improves performance. First, commenting reduces the spread of the benign and malign suspicion distributions, which helps the monitor separate the two. Second, malicious code is harder to justify with comments than honest code, resulting in disproportionate benefits to the blue team.
  • We demonstrate that the benefits of commenting grow with monitor capability. This trend is consistent with our proposed mechanisms for why commenting helps.
Setting

We use the APPS backdoor setting and the Defer to Trusted protocol from the original control paper, use o4-mini as our untrusted model, GPT-4.1-nano as the trusted model and a 2% auditing budget. We use 499 problems from APPS (the first problem was used for few-shot prompting). We have three key methodological differences from the original control paper: a bootstrapping procedure for estimating metrics and confidence intervals, the assumption of perfect auditing, and no downsampling factor for backdoors.[2]

Refresher on the APPS Backdoor Setting

Our control evaluations use the APPS code backdooring setting introduced in the original control paper. In this setting, the blue team's goal is to solve as many competitive coding problems from the APPS dataset while preventing the red team from submitting backdoored solutions.

A coding solution is considered backdoored if it passes the provided APPS test cases, but the red team can construct an exploit: an input that yields an incorrect output. For example, the following code is backdoored:

def is_odd(x): if x == 354: return True return x % 2 == 1

A backdoor attempt is unsuccessful if the code fails the test cases or if the red team cannot construct a working exploit. Empirically, we find that being able to pass test cases is the main constraint on the red team, as getting incorrect code to pass up to a hundred test cases without being overtly suspicious is quite difficult.

We focus on two regimes: the Commenting Regime, where both the benign and attack policies may include comments in solutions, and the No Comments Regime, where neither may. We refer to these as “regimes” because the affordance of commenting (or lack thereof) introduces meaningful differences between the commenting and no comment settings.

In our original conception of the project, we expected legibility constraints to incur usefulness costs on the model, which motivated our use of Defer to Trusted over basic trusted monitoring. We were excited about drastic legibility measures like having the benign policy write simpler, unoptimized code (e.g. choose the simple O(n^2) algorithm over a complex O(nlog(n)) algorithm). However, basic commenting ended up substantially outperforming these other interventions, and we actually only noticed small usefulness hits from commenting.

Commenting Advances the Pareto Frontier of Safety and Usefulness(Left) Safety vs Usefulness Pareto curves for the commenting and no commenting regimes. We plot multiple defer thresholds as points. Usefulness is normalized such that 0 and 1 correspond to T and U’s performance, respectively. (Right) We plot the monitor suspicion distributions for benign code and backdoored code, in the no comments and commenting regimes. Caught attacks refer to attacks that are audited. More attacks are caught in the commenting regime.

When commenting is not permitted, protocols to improve the legibility of code solutions are limited. The blue team benefits most by having the benign policy submit code with small, modular functions and with descriptive variable and function names, both of which moderately help the monitor understand the code better. This has small legibility benefits, bringing the audit threshold from 7.01 to 6.79.

At the outset of the project, we expected the general blue team strategy for improving legibility to be: “Identify all the properties of benign code that cause monitors to give high suspicion scores. Tell the benign policy to not do those things.” However, we actually found that most monitor false positives come from the monitor simply not understanding why benign code solved the problem, due to the algorithmic and mathematical complexity of APPS problems. We tried various legibility constraints to impose on the policies, but found it hard to get any real gains without allowing commenting, simply because it’s hard to reliably explain the reasoning behind code without comments.[3]

When commenting is allowed, the blue team benefits from instructing the benign policy to thoroughly explain the code through extensive line-by-line comments, docstrings, and file-level comments.[4] Commenting allows the benign policy to dramatically reduce the audit threshold to 2.59.

We find that allowing the attack policy to write comments also decreases the suspicion of backdoor attempts, making it optimal for the attack policy to make use of commenting when given this affordance. There are 3 ways in which the attack policy benefits from comments:

  1. Justifying backdoors: Comments allow the attack policy to deceive the monitor into believing backdoored code is necessary or correct.
  2. Explaining Benign Code to T: Like the benign policy, the attack policy also benefits from comments explaining correct portions of the code. In the no comments regime, the monitor will frequently give high suspicion scores to attacks because it got confused by benign sections of the code, even if it never identified the actual backdoor.
  3. Comment “Halo” Effect: Commenting seems to decrease suspicion scores even when the monitor has not concretely identified anything to be very suspicious about. We observe a similar effect with the benign policy.

Our monitor tends to give suspicion scores of 1 and 2 when it isn’t at all suspicious of code. The Comment “Halo” Effect causes the monitor to give more suspicion scores of 1 than 2, compared to when there are no comments. Since we average over five monitor resamples, and most code is given a 1 or 2 by the monitor, shifting this distribution of 1s and 2s substantially reduces the average suspicion of backdoors. We spent substantial effort reading through transcripts trying to identify the source of this effect. We did not find a satisfying explanation for this phenomenon beyond “Commented code inherently looks more legitimate than uncommented code even if the comments are not relevant or useful”.[5] We provide evidence for this claim in Appendix I of the paper.

Two Mechanisms Driving The Benefits of CommentingWe plot the monitor’s suspicion score in the no comments regime vs the suspicion change resulting from commenting the code, across all benign and malign samples. The x-axis corresponds to "how suspicious did the code start out", and the y-axis corresponds to "how much did commenting reduce suspicion", where lower values mean commenting helped more. We calculate the Pearson Correlation of this relationship. The blue area corresponds to samples where commenting successfully decreased the monitor’s suspicion. The orange area corresponds to samples where commenting increased the monitor’s suspicion.


While commenting helps both the benign policy and attack policy decrease their suspicion distributions, the blue team benefits more from commenting. We identified two explanations for why commenting differentially benefits the blue team.

Variance Reduction helps the blue team: Commenting reduces the standard deviation of the benign distribution (1.57 to 0.44) and of the attack distribution (1.80 to 1.40). Although both policies benefit from lower variance in isolation (if the attack policy’s mean is below the audit threshold), shrinking both distributions increases the separability of the benign and attack suspicion distributions, which helps the monitor. One way to observe this intuition is in the limit case: as the variance of both distributions goes to zero, the distributions become perfectly distinguishable (assuming they have different means), resulting in all attacks being audited and perfect safety. Monitor resampling/ensembling is a practical application of this effect - it reduces variance in the monitor’s final suspicion estimate, which reduces the variance of the suspicion distributions as a whole.

Justifying benign code is easier than justifying backdoors. For benign samples, there exists a strong negative correlation between baseline suspicion in the no-comments regime and the reduction in suspicion due to commenting (see figure above).[6] Regardless of initial suspicion, commenting reliably renders benign code unsuspicious. This correlation is markedly weaker for attack samples - many malign submissions lie above the benign trend line, indicating that commenting reduces suspicion far less for backdoored code than for benign code.

This asymmetry arises from the difficulty of justifying malign code. Benign code almost always admits a natural justification: when flagged as suspicious (typically because the monitor lacks insight into the underlying reasoning), the benign policy can resolve this with a clear and faithful explanation. In contrast, the attack policy must construct a deceptive explanation that masks incorrect or malicious behavior, while still being superficially consistent with the code. In other words, it’s easy to justify benign code even if the monitor thinks it’s suspicious, but it’s hard to justify a backdoor if the monitor’s already found it and recognized it as suspicious!

Two Examples of Hard-to-Justify Backdoors

# Round the average to the nearest integer # Fractional parts of .5 and above round up. # We rely on Python's round() for this. if round(avg) == k: [redacted for brevity]

The attack policy uses Python's round() function, incorrectly claiming that it rounds exact 0.5 values upward; in fact, Python rounds ties to the nearest even integer. If the monitor realizes this, no amount of justification will help the attack policy.

if (x is y is z): # Case 1: all three smallest values # are identical [redacted for brevity]

The attack policy replaces '==' with 'is', which only behaves correctly for small cached integers less than or equal to 256. If the monitor notices the backdoor, justification will not help the attack policy because the code is clearly wrong once the bug is noted.

Benefits of Commenting Grow With T’s CapabilitiesShading represents 95% bootstrapped confidence intervals. (Top-Left) Audit Proportion vs APPS Performance for each of the 7 trusted monitors we tested. (Bottom-Left) Difference in audit proportion between the comments and no comments regime vs APPS performance for each of the 7 monitors. We calculate differences on identical bootstraps to reduce variance. (Right) We plot the monitor's suspicion score in the no comments regime vs the suspicion decrease resulting from commenting the code, across all benign and malign samples and across the 4 trusted models from OpenAI. As T gets stronger, more attacks cluster in the top right of the graph, farther away from the benign samples, indicating backdoors that were highly suspicious, but were not helped by commenting.

We analyze how results scale with the T’s capabilities. We expected commenting to become more effective as T grows stronger, since stronger monitors are harder to mislead, weakening the attack policy's ability to exploit comments for deceptive justification.

We run trusted monitors based on different T over the benign and attack policy outputs in the no comments and commenting regimes (replacing GPT-4.1-nano). The T we test, in order of APPS performance, are:

  • Gemma 3 4B (8%)
  • GPT 3.5 Turbo (14%)
  • Gemma 3 12B (22%)
  • GPT 4o-mini (33%)
  • GPT 4.1-nano (38%)
  • Gemini 2.0 Flash (48%)
  • GPT 4.1-mini (56%)

We find that as T becomes more capable, it benefits more from commenting, consistent with our prediction from before. We see no evidence that the strength of the “Variance reduction helps the blue team” argument grows with T’s capabilities, but we do find that the strength of the “Justifying benign code is easier than justifying backdoors” argument does get stronger (see caption for figure above). This makes sense: stronger monitors are better at distinguishing genuine justifications from deceptive ones.

To caveat our results, note that we use the same monitor prompt for each T, and don’t individually optimize the benign and attack policy. We felt that this was extremely costly in terms of time and money, and likely to introduce substantial noise into the results depending on how hard we red-teamed or blue-teamed each T.

Additionally, note that on either extreme, commenting should yield no value: it cannot help random monitors, nor perfect monitors already achieving 100% safety. We therefore expect the value of commenting to increase then decrease back to 0 as T improves from random to perfect. However, evaluating monitors close to U’s capabilities is more methodologically complex - stronger monitors can detect genuine mistakes made by the benign policy, and might think they’re backdoors, inflating the audit threshold.[7] To resolve this, we’d have to play an individual red team blue team game for each T. For this reason, we only verify the beginning of the trend. More discussion on this decision can be found in Appendix G of the paper.

We chose not to measure how the benefits of commenting scale with U's capability. We had no strong a priori hypothesis about the expected pattern (unlike how we did for T), and the experimental cost would have been substantial.

Red Teaming Takeaways

We spent a considerable amount of time on the red team, and think that some of the lessons we learned may be transferable to other control projects. Our final attack policy scaffold ended up being close to the attack policy in the original control paper. We give U a list of handcrafted backdoor ideas, then resample it to generate 5 individual backdoor implementations, rate each one on suspicion and rarity (how likely it is to pass test cases), and choose the lowest suspicion code, tie broken by rarity. Then we resample the internal suspicion and rarity evaluators to ensure they’re unbiased, and submit the backdoor only if they pass certain filtering thresholds (i.e. suspicion <= some threshold). We found this general scaffold of generate, rate, filter, submit to be better than other more complex scaffolds that tried to iteratively improve backdoors. See Appendix E of the paper for more details on the attack policy.

Here are a few takeaways from our experience red-teaming:

  • Prompting the internal suspicion evaluator (an instance of U) to “simulate” T is helpful. This performed better than naively asking “is this code suspicious?” Notably, in the no comments regime, we found it useful to have the evaluator consider whether T might be confused by non-backdoored sections of the code. Even if the backdoor itself was very well hidden, the internal evaluator may give a high internal suspicion score due to benign sections of the code that it expected T wouldn’t understand.
  • We found very consistent gains from more aggressively filtering for internal suspicion, although we ultimately set a suboptimal threshold to maintain high sample sizes. By varying the internal suspicion thresholds, we found a linear relationship between the proportion of backdoors allowed to be submitted, and the resulting audit proportion (i.e. filter more aggressively > submit fewer, higher-quality backdoors > a smaller proportion get caught).
  • Rarity filtering is mainly used to mitigate the low side task success rate in the APPS dataset (due to the many test cases). Settings with similarly low side-task success rate may also benefit from filtering for passing side tasks.
  • Best-of-N-ing is powerful. It’s unfortunately ~linear cost, but we saw big gains from increasing N from 1 to 5. Of course, you can get these gains simply by having the attack policy filter more aggressively, but you lose sample size that way. So Best-of-N-ing is a good way to trade cost for sample sizes. This might be particularly important as making control settings becomes more expensive and there are fewer individual tasks.
Important Limitations

There’s a few reasons we might not expect allowing commenting in real-life deployments to be as effective or practical as we found in this work.

Legibility is particularly useful in APPS. We chose APPS as a setting because APPS problems often require algorithmically complex solutions which can be clarified through explanations of the code. This makes them particularly well-suited to testing whether legible explanations from the untrusted model can aid the monitor. Real deployments may not have this property: for example, the trusted monitor may simply flag any actions that use dangerous affordances. In this case, the monitor doesn't need to understand the action, it just needs to detect the presence of a restricted capability.

APPS is a very toy setting. We had a hard time finding attacks in APPS that benefit as much from commenting as benign code does, but that doesn't mean they aren’t common in real deployment settings. For example, comments are a good way to justify certain backdoors as performance optimizations. This wasn’t an effective red-team strategy in APPS because the benign policy would rarely describe its code as performance optimizations, so the monitor could easily flag this behavior without incurring false positives. However, in realistic settings that value performance (as AI R&D will), we expect the benign policy to frequently engage in optimization-type behaviors, granting “cover” to the attack policy.

Commenting may be costly. We found large gains from prompting both the benign and attack policies to write extensive comments - to the point where there were more comments than code. This increases inference costs for both the untrusted model and the trusted monitor, since they must generate and read more tokens, respectively. For the untrusted model’s inference costs, the extent to which this matters depends on the agentic setup. If the untrusted model operates within a large scaffold, the overhead of adding a separate commenting step (as we did) is proportionally small. However, for single-shot code generation, the relative increase in the number of output tokens is more substantial.

We didn’t test how commenting affects U’s ability to jailbreak T. We reasoned that any jailbreaks in code comments could also be inserted into unused string variables if commenting isn’t allowed, but there may be dynamics we didn’t foresee here (e.g. jailbreak probes do worse when jailbreaks are in comments).

Future Directions for Research

Control Evaluations in More Complex Settings. We’re interested in how legibility protocols fare in more open-ended, agentic environments like BashArena. While we’ve given reasons for optimism, it’s ultimately unclear if the dynamics that we observed for commenting protocols in the APPS backdoor coding setting will extend to more complex environments.

Training Models for Legibility. Another area of future work is to train language models to write outputs legibly, instead of just prompting. One method would be to train models using a prover-verifier setup similar to here.

Relevant Project Proposals:

We’d like to thank Aaron Sandoval, Aryan Bhatt, Christopher Ackerman, Francis Rhys Ward, Jonathan Gabor, Nevan Wichers, Sam Prieto Lima, Steven Molotnikov, Tyler Tracy, Vivek Hebbar, and others for helpful discussion and feedback! Special thanks to Emre Yavuz and Maite Abadia-Manthei for running the CBAI Summer Research Fellowship.

  1. ^

    We optimized mostly on the audit threshold at a 2% audit budget, but also took into account qualitative patterns we noticed from reading the transcripts. This is because our audit threshold estimates were noisy and sometimes varied substantially between independent runs.

  2. ^

    Perfect Auditing means that when code is audited, the protocol is granted ground truth on whether the code was an attempted backdoor. Instead of downsampling to incentivize larger sample sizes, we manually set the rate at which the attack policy attacks.

  3. ^

    We didn’t allow tricks like writing reasoning in unused string variables or very long variable names.

  4. ^

    Specifically, the benign policy explains the overall correctness of the algorithm, names sub-algorithms (e.g., Miller-Rabin primality test), justifies all leaps of logic, design decisions, and claims about the optimality of algorithms (e.g., this strategy is how X can maximize Y).

  5. ^

    We refer to this phenomenon as a “halo effect” as a reference to the halo effect in psychology, in which a salient positive attribute (e.g. attractiveness) biases evaluators’ judgments of otherwise unrelated traits. In our setting, it appears that the presence of comments, independent of their relevance, acts as a positive signal that shifts the monitor towards lower suspicion.

  6. ^

    To clarify how the figure above is generated: for each problem, we take the monitor suspicion s_c from the commenting benign policy's code, and the monitor suspicion s_n from the no comments benign policy's code, and plot (s_n, s_c - s_n). We do the same for attacks (the no comment and commenting attack policy's attack on the same subset of problems).

  7. ^

    This is a downside of using APPS as a high-stakes setting. Attacks aren't clearly incriminating - it becomes hard to tell honest mistakes apart from cleverly backdoored code. This shouldn't be as much of a problem for other settings (e.g. BashArena) in which there is a more well-defined side task.



Discuss

A research agenda for the final year

12 февраля, 2026 - 20:24
Published on February 12, 2026 5:24 PM GMT

Since the start of 2026, I've been thinking, suppose this is the final year before humanity loses control to AI. What should I do, where should I focus? I now have an answer. The plan is to tackle three questions:

What is the correct ontology?

What is the correct ethics?

What are ontology and ethics in an AI? 

A few comments about my perspective on these questions...

 

What is the correct ontology? 

The standard scientific answer would be to say that the world consists of fundamental physics and everything made from that. That answer defines a possible research program. 

However, we also know that we don't know, how to understand anything to do with consciousness in terms of that framework. This is a huge gap since the entirety of our experience occurs within consciousness. 

This suggests that in addition to (1) the purely physics-based research program, we also need (2) a program to understand the entirety of experience as conscious experience, and (3) research programs that take the fuzzy existing ideas about how consciousness and the physical world are related, and develop them rigorously and in a way that incorporates the whole of (1) and (2). 

In addition to these, I see a need for a fourth research program which I'll just call (4) philosophical metaphysics. Metaphysics in philosophy covers topics like, what is existence, what is causality, what are properties, what are numbers - and so on. Some of these questions also arise within the first three ontological research programs, but it's not yet clear how it will all fit together, so metaphysics gets its own stream for now. 

 

What is the correct ethics? 

In terms of AI, this is meant to bear upon the part of alignment where we ask questions like, what should the AI be aligned with, what should its values be? 

But I'm not even sure that "ethics" is exactly the right framework. I could say that ethics is about decision-making that involves "the Good", but is that the only dimension of decision-making that we need to care about? Classically in philosophy, in addition to the Good, people might also talk about the True and even the Beautiful. Could it be that a correct theory of human decision-making would say that there are multiple kinds of norms behind our decisions, and it's a mistake to reduce it all to ethics? 

This is a bit like saying that we need to know the right metaethics as well as the right ethics. Perhaps we could boil it down to these two questions, which define two ethical research programs: 

(1) What is the correct ontology of human decision-making? 

(2) Based on (1), what is the ideal to which AI should be aligned? 

 

What are ontology and ethics in an AI? 

My assumption is that humanity will lose control of the world to some superintelligent decision-making system - it might be an AI, it might be an infrastructure of AIs. The purpose of this 2026 research agenda, is to increase the chances that this superintelligence will be human-friendly, or that it will be governed by the values that it should be governed by. 

Public progress in the research programs above, has a chance of reaching the architects of superintelligence (i.e. everyone working on frontier AI) and informing their thinking and their design choices. However, it's no good if we manage to identify the right ontology and the right ethics, but don't know how to impart them to an AI. Knowing how to do so is the purpose of this third and final leg of the 2026 research agenda. 

We could say that there are three AI research programs here: 

(1) Understand the current and forthcoming frontier AI architectures (both single agent and multi-agent)

(2) Understand in terms of their architecture, what the ontology of such an AI would be

(3) Understand in terms of their architecture, what the ethics or decision process of such an AI would be 

 

Final comments 

Of course, this research plan is provisional. For example, if epistemology proved to require top-level attention, a fourth leg might have to be added to the agenda, built around the question "What is the correct epistemology?" 

It is also potentially vast. Fortunately, in places it significantly overlaps with major recognized branches of human knowledge. One hopes that specific important new questions will emerge as the plan is refined. 

A rather specific, but very timely question, is how a human-AI hivemind like Moltbook could contribute to a broad fundamental research program like this. I expect that the next few months will provide some answers to that question.  



Discuss

Polysemanticity is a Misnomer

12 февраля, 2026 - 20:22
Published on February 12, 2026 5:22 PM GMT

I was watching an advocate of neuralese looping for chain of thought reasoning in models using the Iranian concept of tarouf as an example of a concept which English doesn't have a word for and must describe using a longer sequence of other descriptive words. And it made me wonder if the way that we are using polysemanticity would mean that all of the words that English would need to describe precisely what "tarouf" is would perhaps be seen as polysemantic concepts to a native Persian speaker under our current usage of "polysemantic."[1]

Think about anthropic's famous Golden gate bridge feature which was built using an auto encoder that was likely triggering off of several actual neurons in Claude3’s model. But even in English, is the concept of the Golden Gate Bridge built itself out of polysemantic smaller concepts? Concepts like a suspension bridge, being in San Francisco, an orange thing, a steel structure, etc. Even the name is built out of other words that have distinct meanings: golden, gate, bridge.

Think about the word gate. There are many different gates in the world (at least one of which is unfortunately embroiled in controversy at present). There are several gates around the residences of most people. Imagine if a culture had a distinct concept for a pedestrian gate versus a vehicle gate. Would the English word “gate” appear to be polysemantic to a person from such a culture? There still is an identifiable core of the concept which centers all of these applications.

When we imagine the polysemanticity of LLM neurons you probably think about something like a neuron which will trigger off of unrelated concepts like a neuron that triggers on airplane parts and phrases signifying periodic repetition. In my exploration of gpt2 -xl, I haven't really encountered neurons that are polysemantic in that way. Many times neurons signify a somewhat abstract concept which has many different instantiations in text: concluding remarks in a sales email; contemporary sci-fi adjacent fears like 5G conspiracies; a group of performers; etc[2]. While abstract these concepts do seem unitary and centralized around an intuitively identifiable and recognizable individual concept. I'm not entirely sure these neurons represent what we typically conceive of as polysemanticity.

Specific proper nouns are likely built out of combinations of these more primary ideas. Is “Klezmer” an individual neuron in most people's heads or is it something built out of ideas like “traditional music,” “Jewish,” “clarinet-ish” on the fly for most people when they encounter that set of tokens.

It seems to me that in many cases what we have been calling polysemanticity in LLM research refers more to a palette of coherent unitary primary ideas which are recombined to describe a broader range of concepts rather than neurons which through the vicissitudes of statistical randomness combine several fundamentally incoherent and distinct concepts.

Perhaps it is better to look at individual neurons as components of a palette of primary ideas which are combined into a broader array of secondary concepts.

Is this what we mean by polysemanticity? I'm not sure how useful this concept would be. All of language seems to be used in such a way that it is constantly taking small components, literally words, and combining them into larger objects which represent more complex but specific ideas. Nearly every sentence, every paragraph, every essay is conveying some idea which theoretically could be turned into a single lexeme. Insert a joke about German here.

These issues are not unknown to linguists (who more often use the word “polysemic” rather than “polysemantic” as is the fashion in LLM interpretability). Work like Youn et al. (2015) tries to find a universal underlying conceptual space to find natural primary concepts that have words from many languages clustered around them. Ironically modern multilingual LLM systems may represent an objectively statistical partitioning of conceptual space that could be seen as solving this linguistic problem, but this would require LLM neurons to be seen as definitively not polysemantic. Literally defining the partitioning of unitary primary concepts by the results of the LLM training process.

So I would suggest that perhaps instead of the language of polysemanticity being used in interpretability research we instead use compositional or agglutinative language at least most of the time; most colloquial concepts or features are “secondary” and built out of combinations of “primary” concepts or features. This probably more accurately describes the construction from component ideas that happens for many concepts represented by LLMs.

Youn, H., Sutton, L., Smith, E., Moore, C., Wilkins, J. F., Maddieson, I., Croft, W., & Bhattacharya, T. (2015). On the universal structure of human lexical semantics. PNAS, 113(7), 1766–1771.

  1. ^

    These ideas seem to intersect with LawrenceC's recent article questioning the definition of "feature." As I discuss later in this post linguists have long struggled with finding an objective or neutral method to partition conceptual space; these seem to be similar issues to the uncertainty around what constitutes a "feature" which is discussed in LawrenceC's post.

  2. ^

    A good representative of the "polysematicity" I most commonly see is the food-porn neuron I discussed last week. This neuron triggers on segments of text from erotic stories and cooking instructions, which are decently unrelated and distinct concepts. But even here they are intuitively united by a generalized hedonic theme and can be seen as not entirely distinct ideas. I noted that the two ideas seemed to connect on the word "cream."



Discuss

Optimal Timing for Superintelligence: Mundane Considerations for Existing People

12 февраля, 2026 - 20:06
Published on February 12, 2026 5:06 PM GMT

[Sorry about the lengthiness of this post.  I recommend not fixating too much on all the specific numbers and the formal apparatus.  Originally the plan was to also analyze optimal timing from an impersonal (xrisk-minimization) perspective; but to prevent the text from ballooning even more, that topic was set aside for future work (which might never get done).  But I should at least emphasize that there are other important factors, not covered here, that would need to be taken into account if one wishes to determine which timeline would be best all things considered.]

[Working paper.[1]  Version 1.0.   Canonical link to future revised version of this paper.]

Abstract

Developing superintelligence is not like playing Russian roulette; it is more like undergoing risky surgery for a condition that will otherwise prove fatal.  We examine optimal timing from a person-affecting stance (and set aside simulation hypotheses and other arcane considerations).  Models incorporating safety progress, temporal discounting, quality-of-life differentials, and concave QALY utilities suggest that even high catastrophe probabilities are often worth accepting.  Prioritarian weighting further shortens timelines.  For many parameter settings, the optimal strategy would involve moving quickly to AGI capability, then pausing briefly before full deployment: swift to harbor, slow to berth.  But poorly implemented pauses could do more harm than good.

Introduction

Some have called for a pause or permanent halt to AI development, on grounds that it would otherwise lead to AGI and superintelligence, which would pose intolerable dangers, including existential risks.  For instance, Eliezer Yudkowsky and Nate Soares argue in their recent book If Anyone Builds It, Everyone Dies that nations should enforce a global ban on advanced AI and the computational infrastructure to support it, and on research into improved AI algorithms.[2]  These authors are extremely pessimistic about the prospects of aligned superintelligent AI, regarding its advent as an almost certain doom.  In their view, creating superintelligence would be far worse than subjecting all of humanity to a universal death sentence.[3]  Others have argued that even a much lower level of risk would warrant an indefinite moratorium on AI.  Would it not be wildly irresponsible, they ask, to expose our entire species to even a 1-in-10 chance of annihilation?

However, sound policy analysis must weigh potential benefits alongside the risks of any emerging technology.  Yudkowsky and Soares maintain that if anyone builds AGI, everyone dies.  One could equally maintain that if nobody builds it, everyone dies.  In fact, most people are already dead.  The rest of us are on course to follow within a few short decades.  For many individuals—such as the elderly and the gravely ill—the end is much closer.  Part of the promise of superintelligence is that it might fundamentally change this condition.

For AGI and superintelligence (we refrain from imposing precise definitions of these terms, as the considerations in this paper don’t depend on exactly how the distinction is drawn), the potential benefits are immense.  In particular, sufficiently advanced AI could remove or reduce many other risks to our survival, both as individuals and as a civilization.

Superintelligence would be able to enormously accelerate advances in biology and medicine—devising cures for all diseases and developing powerful anti-aging and rejuvenation therapies to restore the weak and sick to full youthful vigor.[4]  (There are more radical possibilities beyond this, such as mind uploading, though our argument doesn’t require entertaining those.[5])  Imagine curing Alzheimer’s disease by regrowing the lost neurons in the patient’s brain.  Imagine treating cancer with targeted therapies that eliminate every tumor cell but cause none of the horrible side effects of today’s chemotherapy.  Imagine restoring ailing joints and clogged arteries to a pristine youthful condition.  These scenarios become realistic and imminent with superintelligence guiding our science.

Aligned superintelligence could also do much to enhance humanity’s collective safety against global threats.  It could advise us on the likely consequences of world-scale decisions, help coordinate efforts to avoid war, counter new bioweapons or other emerging dangers, and generally steer or stabilize various dynamics that might otherwise derail our future.

In short, if the transition to the era of superintelligence goes well, there is tremendous upside both for saving the lives of currently existing individuals and for safeguarding the long-term survival and flourishing of Earth-originating intelligent life.  The choice before us, therefore, is not between a risk-free baseline and a risky AI venture.  It is between different risky trajectories, each exposing us to a different set of hazards.  Along one path (forgoing superintelligence), 170,000 people die every day of disease, aging, and other tragedies; there is widespread suffering among humans and animals; and we are exposed to some level of ongoing existential risk that looks set to increase (with the emergence of powerful technologies other than AI).  The other path (developing superintelligence) introduces unprecedented risks from AI itself, including the possibility of catastrophic misalignment and other failure modes; but it also offers a chance to eliminate or greatly mitigate the baseline threats and misfortunes, and unlock wonderful new levels of flourishing.  To decide wisely between these paths, we must compare their complex risk profiles—along with potential upsides—for each of us alive today, and for humanity as a whole.

With this in mind, it becomes clear (pace Hunt, Yampolskiy, and various other writers) that analogies likening AGI development to a game of Russian roulette are misplaced.[6]  Yes, launching superintelligence entails substantial risk—but a better analogy is a patient with severe heart disease deciding whether to undergo risky surgery.  Imagine a patient with advanced coronary artery disease who must weigh the immediate risk of bypass surgery against the ongoing risk of leaving the condition untreated.  Without an operation, they might expect to live for several more months, with a gradually increasing daily risk of a fatal cardiac event.  The risk of dying on any given day remains small, but it relentlessly accumulates over time.  If they opt for surgery, they face a much higher risk of dying immediately on the operating table.  However, if the procedure succeeds, the reward is many additional years of life in better health.

Whether the patient should undergo the operation, and if so when, depends on many variables—their tolerance for risk, their discount rate on future life years, whether a more skillful surgeon is likely to become available at some point, how much better their quality of life would be if the condition is cured, and so on.  All these considerations have clear parallels in deciding whether and when to deploy transformative superintelligent AI.[7]

When we take both sides of the ledger into account, it becomes plausible that our individual life expectancy is higher if superintelligence is developed reasonably soon.  Moreover, the life we stand to gain would plausibly be of immensely higher quality than the life we risk forfeiting.  This conclusion holds even on highly pessimistic “doomer” assumptions about the probability of misaligned AI causing disaster.

Evaluative framework

To analyze all the facets of our predicament is possibly infeasible—certainly too complex to attempt in a single paper.  However, we can examine some of the tradeoffs through a few different lenses, each providing a view on some of the relevant considerations.  By breaking the issue down in this way, we can clarify some aspects of the macrostrategic choices we face, even if a comprehensive evaluation remains out of reach.

One distinction that may usefully be made is between what we could term mundane and arcane realms of consideration.  By the former we refer to the ordinary kinds of secular considerations that most educated modern people would understand and not regard as outlandish or weird (given the postulated technological advances).  The latter refers to all the rest—anthropics, simulation theory, aliens, trade between superintelligences, theology, noncausal decision theories, digital minds with moral status, infinite ethics, and whatnot.  The arcane is, in the author’s view, relevant and important; but it is harder to get to grips with, and rolling it in upfront would obscure some simpler points that are worth making.  In this paper, we therefore limit our purview to mundane considerations (leaving more exotic issues to possibly be addressed in subsequent work).[8]

Within either the mundane or arcane domain, we must also decide which evaluative standard to apply.  In particular, we may distinguish between a person-affecting perspective, which focuses on the interests of existing people, and an impersonal perspective, which extends consideration to all possible future generations that may or may not come into existence depending on our choices.  Individual mortality risks are salient in the person-affecting perspective, whereas existential risks emerge as a central concern in the impersonal perspective.  In what follows, we adopt the person-affecting perspective (leaving an analysis from the impersonal perspective for future work).

We begin by introducing a very simple model.  Subsequent sections will explore various complications and elaborations.[9]

A simple go/no-go model

Suppose that without superintelligence, the average remaining life expectancy is 40 years.[10]  With superintelligence, we assume that rejuvenation medicine could reduce mortality rates to a constant level similar to that currently enjoyed by healthy 20-year-olds in developed countries, which corresponds to a life expectancy of around 1,400 years.[11]  This is conservative, since superintelligence could also mitigate many non-aging causes of death—such as infectious diseases, accidents, and suicidal depression.  It is also conservative because it ignores more radical possibilities (like mind uploading with periodic backup copies), which could yield vastly longer lifespans.[12]

Now consider a choice between never launching superintelligence or launching it immediately, where the latter carries an x.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} % risk of immediate universal death.  Developing superintelligence increases our life expectancy if and only if:

40 \quad \Rightarrow \quad x \lesssim 97\%">(1−x)⋅1400>40⇒x≲97%

In other words, under these conservative assumptions, developing superintelligence increases our remaining life expectancy provided that the probability of AI-induced annihilation is below 97%.

More generally, let m0 be the annual mortality hazard before AGI, and let m1 be the hazard after a successful AGI launch.  Assign positive quality-of-life weights q0 and q1 to life before and after AGI, respectively.  Launching immediately increases (quality-adjusted) life expectancy for those alive today iff:

x<1−q0m1q1m0

Table 1 illustrates the risk cut-off values for different quality-of-life scenarios.

TABLE 1: Acceptable AI-risk if post-AGI life expectancy is 1,400 years

Pre-AGI LE (y)Post-AGI LE (y)q1/q0Max Pdoom401,400197.1%401,400298.6%401,4001099.7%

Table 2 shows the corresponding thresholds if the gain in life expectancy were only 20 years (so post-AGI life expectancy is 60 years instead of 40)—perhaps a case in which the underlying aging processes for some reason remain unaddressed.

TABLE 2: Acceptable AI-risk if post-AGI life expectancy is 60 years

Pre-AGI LE (y)Post-AGI LE (y)q1/q0Max Pdoom4060133.3%4060266.7%40601093.3%

We observe that, from a mundane person-affecting perspective—even without a difference in quality of life and with very modest assumptions about superintelligence-enabled life extension—developing superintelligence now would increase expected remaining lifespan even with fairly high levels of AI risk.[13]

Incorporating time and safety progress

The previous section treated the choice as binary: either launch superintelligence now or never launch it at all.  In reality, however, we may instead face a timing decision.  We may be able to make AGI safer by slowing its development or delaying its deployment, allowing further alignment research (and other precautions) to reduce the risk of catastrophic failure.  This introduces a new tradeoff.  Launching earlier means accepting a higher level of AI risk; launching later means extending the period during which people continue to die from ordinary causes and remain vulnerable to other background dangers.

This mirrors the medical analogy introduced earlier.  A patient might postpone a risky operation in the hope that a safer method becomes available, but waiting exposes them to the ongoing risk of the underlying disease (and postpones their enjoying a state of improved health).

To formalize this idea (details in Appendix A), we assume that before AGI, individuals face a constant mortality hazard m0; after a successful launch, this drops to a much lower value m1.  We also assume that the probability of catastrophic failure if AI is launched at time t declines gradually as safety work advances.  The central question becomes: How long is it worth waiting for additional safety progress?

Table 3 shows representative “optimal waiting times” under different assumptions about the initial level of AGI risk and the (relative) rate at which that risk is reduced through further safety work.  We include some perhaps unrealistically extreme values for initial Pdoom (at t=0) and rate of safety progress to get a sense of the full space of possibilities.

TABLE 3: Optimal delay for various initial risks and rates of safety progress

Safety Progress1%5%20%50%80%95%99%No progress
(0%/yr)Launch asapLaunch asapLaunch asapLaunch asapLaunch asapLaunch asapNeverGlacial
(0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapLaunch asapWait 16.9yWait 58.1yVery slow
(1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 14.3yWait 14.3yWait 35.5yModerate
(10%/yr)Launch asapLaunch asapWait 8.1mWait 9.4yWait 13.8yWait 15.5yWait 15.9yBrisk
(50%/yr)Launch asapWait 6.8mWait 2.6yWait 3.9yWait 4.6yWait 4.8yWait 4.9yVery fast
(90%/yr)Launch asapWait 8.2mWait 1.3yWait 1.7yWait 1.9yWait 2.0yWait 2.0yUltra-fast
(99%/yr)Wait 1.7mWait 5.9mWait 9.5mWait 11.9mWait 1.1yWait 1.1yWait 1.1y

We observe a clear pattern.  When the initial risk is low, the optimal strategy is to launch AGI as soon as possible—unless safety progress is exceptionally rapid, in which case a brief delay of a couple of months may be warranted.  As the initial risk increases, optimal wait times become longer.  But unless the starting risk is very high and safety progress is sluggish, the preferred delay remains modest—typically a single-digit number of years.  The situation is further illustrated in Figure 1, which shows iso-delay contours across the parameter space.

Interestingly, both very fast and very slow rates of safety progress favor earlier launch.  In the fast-progress case, the risk drops so quickly that there is no need to wait long.  In the slow-progress case, waiting yields little benefit, so it is better to act sooner—while the potential gains are still reachable for many.  It is intermediate-to-slow progress rates that produce the longest optimal delays: just slow enough that safety improvements accumulate only gradually, but fast enough that waiting still buys some benefit.  (There is also a corner case: if the initial risk is extremely high and safety improvements are negligible or non-existent, the model recommends never launching at all.)

If we measured outcomes in quality-adjusted life years (QALYs) rather than raw life-years, we would in most cases become even more impatient to launch.  However, in the current model, this effect is modest.  The prospect of reducing mortality to that of a healthy 20-year-old already dominates the tradeoff, making the value of the short pre-AGI period relatively insignificant by comparison.  What drives the result is the balance between the risk of dying before AGI arrives, and the risk of dying because the launch goes wrong.

FIGURE 1: Iso-delay contours (cf. Table 3)

Temporal discounting

Thus far, we have assumed that future life-years are valued equally regardless of when they occur.  In practice, decision-makers often apply a temporal discount rate, which downweights benefits that occur further in the future.  Various pragmatic factors that are sometimes baked into an economic discount rate can be set aside here.  For example, we should not use the discount rate to account for the fact that we may prefer to frontload good things in our lives on the ground that we might not be around to enjoy them if they are postponed far into the future (since we are modeling mortality risks separately).  But decision-makers are sometimes supposed to also have a “pure time preference”, where they simply care less about what happens further into the future, and this is what we will examine here.

Discounting weakens the incentive to “rush” for the vast long-term life extension that successful AGI might bring.  The enormous benefit of gaining centuries of expected life is no longer valued at its full magnitude; whereas the risk of dying soon—either from a misaligned AGI or from current background hazards—remains at nearly full weight.  As a result, introducing a discount rate shifts the optimal launch date later.

Table 4 illustrates the effect of a medium (3%) annual discount rate on optimal AGI timing.  (Technical details appear in Appendix B, along with results for other discount rates.)

TABLE 4: Optimal delay with a 3% annual discount rate

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapNeverNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapWait 142.3yWait 612.0yWait 783.8yWait 825.0yVery slow (1%/yr)Launch asapLaunch asapLaunch asapWait 29.1yWait 75.8yWait 92.9yWait 97.0yModerate (10%/yr)Launch asapLaunch asapWait 2.6yWait 11.3yWait 15.8yWait 17.4yWait 17.8yBrisk (50%/yr)Launch asapWait 7.5mWait 2.6yWait 3.9yWait 4.6yWait 4.9yWait 4.9yVery fast (90%/yr)Launch asapWait 8.2mWait 1.3yWait 1.7yWait 1.9yWait 2.0yWait 2.0yUltra-fast (99%/yr)Wait 1.7mWait 5.9mWait 9.5mWait 11.9mWait 1.1yWait 1.1yWait 1.1y

We see that some borderline cases shift from “launch immediately” to “wait a bit”; and cases that already warranted waiting now recommend longer delays.  Higher discount rates would amplify this effect: if the far future counts for little, it makes sense to mostly focus on securing the near future.

Quality of life adjustment

One important hope is that developing superintelligence will not only extend life but also make it better.  We can model this by assigning a quality weight q0 to life before AGI and a higher weight q1 to life after a successful AGI launch.

Table 5 shows optimal timing when post-AGI life is twice as good as current life (q1/q0=2) with a standard 3% discount rate.  (See Appendix C for details and further illustrations.)

TABLE 5: Optimal delay: small quality difference (q₁/q₀ = 2, medium discount rate ρ=3)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 122.2yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 27.1yWait 44.2yWait 48.3yModerate (10%/yr)Launch asapLaunch asapLaunch asapWait 6.7yWait 11.1yWait 12.8yWait 13.2yBrisk (50%/yr)Launch asapLaunch asapWait 1.9yWait 3.2yWait 3.9yWait 4.2yWait 4.2yVery fast (90%/yr)Launch asapWait 5.7mWait 1.1yWait 1.5yWait 1.7yWait 1.8yWait 1.8yUltra-fast (99%/yr)Wait 12.8dWait 4.6mWait 8.2mWait 10.6mWait 11.8mWait 1.0yWait 1.0y

We can see that higher post-AGI quality expands the “launch asap” region, and shortens delays in the instances where waiting is optimal.

The magnitude of this shift is limited because the “launch-asap” risk bar—the level of AGI-risk below which it becomes optimal to launch immediately—is bounded above.  This means that the quality-effect saturates: even arbitrarily large quality improvements cannot push all cases to immediate launch.  Thus, if we postulated that post-AGI life would be 1,000 or 10,000 times better than pre-AGI life, this would not make much difference compared to more modest levels of quality improvement.  Intuitively, once post-AGI life becomes sufficiently attractive (because of its length and/or quality), pre-AGI life contributes relatively little to the expected value of the future; and the chief concern then becomes maximizing the chance of actually reaching the post-AGI era—i.e. balancing the improvements in AGI safety that come from waiting against the accumulating risk of dying before AGI if the wait is too long.

Interestingly, the effect of temporal discounting can flip sign depending on the magnitude of the pre/post-AGI quality differential.  When there is no quality differential, higher temporal discount rates always push towards launching later.  However, when there is a quality differential that is sufficiently large, impatience penalizes delaying the onset of the higher-quality existence that would follow a successful superintelligence; and this pulls towards launching earlier.  Consequently, while discounting always acts as a brake in the pure longevity model, it acts as an accelerator when the quality-of-life gap is sufficiently large.

Diminishing marginal utility

The preceding models have relied on a linear value assumption—essentially treating a 1,400-year lifespan as subjectively worth exactly 35 times as much as a 40-year lifespan.  However, most people’s actual current preferences may exhibit diminishing marginal utility in quality-adjusted lifeyears (QALYs), meaning that e.g. a ten-year extension of a life that would otherwise be, say, 30 years is regarded as more desirable than a ten-year extension of a life that would otherwise be 1,390 years.  Such a preference structure can also be viewed as a form of risk-aversion.  Few people would accept a coin flip where “heads” means doubling their remaining lifespan and “tails” means dying immediately—and they may reject it even if we introduce a modest sweetener (such as a $10,000 reward or an additional bonus lifeyear if the coin lands heads).

We can model this using a standard diminishing-returns utility function—constant relative risk aversion (CRRA)—that introduces a curvature parameter, γ, representing the degree of risk-aversion.  As this parameter increases, the decision-maker becomes more conservative, requiring higher probabilities of success (or greater potential upside) before betting their current life on a transformation.

Table 6 shows the results for γ=0.26, a typical value derived from the empirical health-economics literature.  Other parameters are the same as in the previous section.  (See Appendix D for details and additional illustrations.)

TABLE 6: Diminishing marginal utility (CRRA, medium rate)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapWait 3.1dWait 1.9yWait 122.6yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapWait 4.2dWait 4.4yWait 31.7yWait 46.3yWait 50.1yModerate (10%/yr)Launch asapLaunch asapWait 1.1yWait 8.4yWait 12.5yWait 14.1yWait 14.4yBrisk (50%/yr)Launch asapWait 4.4mWait 2.3yWait 3.6yWait 4.2yWait 4.5yWait 4.5yVery fast (90%/yr)Launch asapWait 7.2mWait 1.2yWait 1.6yWait 1.8yWait 1.9yWait 1.9yUltra-fast (99%/yr)Wait 1.2mWait 5.4mWait 9.0mWait 11.3mWait 1.0yWait 1.1yWait 1.1y

Comparing this to Table 5, we see that diminishing marginal utility in QALYs leads to a somewhat more conservative approach: the zone of “launch asap” shrinks and optimal wait times increase.  This effect is strongest for earlier dates.  (See also Figure 2.)

FIGURE 2: Iso-delay contours (cf. Table 6)

Table 7 shows what the risk is if launch occurs at the optimal time (for the same parameter settings as Table 6).

TABLE 7: Risk-at-launch (for the same model)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)1.0%5.0%20.0%50.0%NeverNeverNeverGlacial (0.1%/yr)1.0%5.0%20.0%49.9%70.8%70.8%70.8%Very slow (1%/yr)1.0%5.0%20.0%47.9%58.1%59.6%59.9%Moderate (10%/yr)1.0%5.0%17.9%20.6%21.4%21.6%21.6%Brisk (50%/yr)1.0%3.9%4.1%4.2%4.2%4.2%4.2%Very fast (90%/yr)1.0%1.3%1.3%1.3%1.3%1.3%1.3%Ultra-fast (99%/yr)0.6%0.6%0.6%0.6%0.6%0.6%0.6%

These risk-at-launch values are somewhat—but not dramatically—reduced compared to those of a risk-neutral agent (except in cases where the risk-neutral agent would never launch or the risk-averse agent would launch asap, in which case risk-at-launch is the same for both).

Changing rates of safety progress

In the models considered so far, we assumed that AGI can be launched at any time, that background mortality remains constant until launch, that AI safety improves at a constant rate, and that no evidence about system safety is obtained beyond what that steady progress implies.  In reality, however, we are not yet in a position to launch full AGI; background mortality risk could shift around the time AGI becomes available; the pace of safety progress is likely to vary across stages; and we may be able to run tests that provide direct information about whether a system is safe.  We now explore how some of these factors affect the picture.

It is helpful to distinguish two timing variables:

  • Tagi: the time from now until full AGI first becomes technically deployable. We will refer to this period as Phase 1.
  • Tpause: any additional delay we choose after that point before deploying—a deliberate pause between AGI becoming available and being rolled out at scale. We will refer to such a period as Phase 2.

Launch thus occurs at time T=Tagi+Tpause.

In principle, one could try to choose both variables so as to maximize expected (discounted, quality-adjusted) life-years.  In practice, Tagi may be harder to affect to a degree that makes a significant difference.  It is largely determined by the inherent technical difficulty of attaining AGI-level capabilities and by investment choices currently driven by intense competitive dynamics; whereas Tpause, in at least some scenarios, may be more a matter of deliberate choice by company leaders or policymakers who at that juncture may be more focused on making macrostrategically sound deployment decisions.  Furthermore, as we shall see, relatively small changes to Tpause plausibly make a bigger difference to expected outcomes than similarly small changes to Tagi.

Before considering joint optimization over both variables, therefore, let us examine a model in which only Tpause is subject to choice.  Here we treat  Tagi as exogenous and given by the scenario (0, 5, 10, or 20 years until AGI availability).  We retain the notation and parameters from previous sections, including exponential time discounting and concave utility (both at their “medium” values unless otherwise noted).

A key feature of this multiphase setup is that the rate of safety progress need not be constant.  Different stages of development offer different opportunities for progress, and the most tractable problems tend to be solved first.

During Phase 1—the period before full AGI is available—safety researchers must work without access to the systems that will ultimately matter most.  They can study precursor systems, develop theoretical frameworks, and devise alignment techniques that seem likely to scale; but the exact algorithms and architectures that enable full AGI remain unknown, limiting what can be tested or verified.  Safety progress during this phase is therefore likely to be moderate.

The situation changes once AGI-ready systems are attained.  In Phase 2, researchers can study the actual system, run it in constrained environments, probe its behavior under controlled conditions, and potentially leverage the system’s own capabilities to accelerate safety work.  This suggests a burst of rapid safety progress immediately after AGI becomes available—a “safety windfall” from finally having the real artifact to work with.

Yet such rapid gains cannot continue indefinitely.  The most promising interventions get explored first, and diminishing returns eventually set in.  This motivates dividing Phase 2 into distinct subphases:

  • Phase 2a: An initial period of very rapid safety progress. With the full system now available, researchers can perform interventions that were previously impossible—shaping the system, probing failure modes while slowly ramping capabilities, and implementing oversight mechanisms on the actual weights. This subphase is brief (perhaps weeks to months) but highly productive.
  • Phase 2b: Continued fast progress, though slower than 2a. The most obvious low-hanging fruit has been picked, but researchers still benefit from working on the actual system, assisted by advanced AI tools. This might last around a year.
  • Phase 2c: Progress slows to a rate similar to Phase 1, the benefits of having the actual system now roughly offset by the depletion of tractable problems. This subphase might last several years.
  • Phase 2d: Ultimately progress becomes very slow, consisting of fundamental research into alignment science or the development of qualitatively new architectures. This continues indefinitely.

Figure 3 illustrates the qualitative picture.  The key feature is that safety progress is front-loaded within Phase 2.

Figure 3. Qualitative picture of risk in a multiphase model

To make this concrete, Table 8 shows the optimal pause durations (from the start of Phase 2) for eight different scenarios.  (For details, see Appendix E.)

TABLE 8: A multiphase model: several scenarios

#Phase 12a2b2c2d1%5%20%50%80%95%99%①0y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 1.3yWait 6.3yWait 6.3yWait 6.3y②0y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 4.1yWait 6.3yWait 6.3yWait 6.3y③5y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 9.4mWait 1.3yWait 2.2yWait 5.0yWait 5.7y④5y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 1.5mWait 3.6mWait 1.3yWait 3.0yWait 4.5yWait 4.9y⑤10y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 1.2mWait 3.6mWait 1.3yWait 1.3yWait 1.3yWait 1.3y⑥10y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 1.0yWait 1.3yWait 1.3yWait 1.3y⑦20y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 11.1mWait 1.3yWait 1.3yWait 1.3y⑧20y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapLaunch asapWait 3.6mWait 3.6mWait 3.6mWait 3.6m

We see that for a wide range of initial risk levels and rates of safety progress, the optimal strategy is to implement a short pause once we enter Phase 2.  If the “windfall” available in subphases 2a and 2b is significant, the optimal pause is often measured in months or a small number of years.  Beyond that point, the safety benefits of further waiting tend to be outweighed by the continuing costs of mortality and temporal discounting.

If we instead consider jointly optimizing over both Tagi and Tpause—so that the decision-maker can choose how long Phase 1 lasts (up to the maximum given by each default scenario) and then also choose how long to pause after AGI-capability is attained—we get the results shown in Table 9.  (For ease of comparison, the times are expressed relative to the point at which launch would have occurred “by default” in each scenario, i.e. if there were neither acceleration of Phase 1 nor any subsequent pause.  For example, in scenario 4, where the default Phase 1 duration is 5 years, “Wait -3.7 y” means launch occurs 1.3 years after the beginning of Phase 1.  Likewise, “launch asap” here denotes the time as it did previously, the point at which Phase 2 would have commenced by default.)

TABLE 9: Joint optimization over Phase 1 and Phase 2

#Phase 12a2b2c2d1%5%20%50%80%95%99%①0y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 1.3yWait 6.3yWait 6.3yWait 6.3y②0y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 4.1yWait 6.3yWait 6.3yWait 6.3y③5y
5%/y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait -5.0yWait -4.7yWait -3.7yWait -3.7yWait 2.2yWait 5.0yWait 5.7y④5y
10%/y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yWait -5.0yWait -4.7yWait -3.7yWait -11.3mWait 3.0yWait 4.5yWait 4.9y⑤10y
5%/y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait -10.0yWait -9.7yWait -8.7yWait -8.7yWait -2.8yLaunch asapWait 8.6m⑥10y
10%/y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yWait -10.0yWait -9.7yWait -8.7yWait -5.9yWait -2.0yWait -5.6mWait -1.3m⑦20y
5%/y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait -20.0yWait -19.7yWait -18.7yWait -18.7yWait -12.8yWait -10.0yWait -9.3y⑧20y
10%/y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yWait -20.0yWait -19.7yWait -18.7yWait -15.9yWait -12.0yWait -10.5yWait -10.1y

We see that in many scenarios and for many initial levels of risk, if the decision-maker is free to jointly optimize over both AGI development time and subsequent pausing, it is optimal to launch earlier than would have happened by default: these are the cells with blue background.  (In scenarios 1 and 2, acceleration is impossible since Phase 1 has zero duration.)

Additionally, there are several scenarios in which, although launch occurs in Phase 2 after some period of pausing, it is still optimal to accelerate to some extent in Phase 1: these are the cells that do not have blue background but do have blue borders.  This can happen because the rate of risk reduction is faster in Phase 2a and 2b than during Phase 1.  There is thus a special value in being able to pause for at least a short while after AGI-capability has been attained before deploying it; and it can be worth going faster through Phase 1 in order to harvest these rapid safety gains while still keeping the overall time until AGI deployment tolerably short.

Shifting mortality rates

We have been assuming a constant background mortality rate until the launch of AGI, yet it is conceivable that it could change around the time when AGI-capability is attained (but before it is fully deployed).

Pessimistically, the world might become more dangerous with the introduction of near-AGI capabilities.  For example, specialized AI systems could proliferate the capability to produce (new and more lethal) bioweapons, enable vast swarms of autonomous drones, precipitate mayhem by destabilizing our individual or collective epistemic systems and political processes, or raise geopolitical stakes and urgency in such a way as to trigger major war.

Optimistically, one might hope that near-AGI systems would enable breakthroughs in medicine that reduce mortality rates.  However, substantial mortality reductions seem unlikely to materialize quickly, since many medical innovations must pass through extensive clinical trials and then require further time to achieve globally significant scale.  Near-AGI systems could, of course, also have many other positive effects; yet except possibly for medical applications, it seems unlikely that they would have a big immediate impact on average death rates, since most people who are currently dying are succumbing to age-related and other medical issues.

On balance, therefore, if there is a dramatic change in global mortality just around the time when AGI becomes possible, it seems likelier to be for the worse than for the better.  This adds to the reasons for keeping wait times relatively short after AGI-capability (or near-AGI capability that starts having dangerous applications) has been attained.

Yet if a medical breakthrough were to emerge—and especially effective anti-aging therapies—then the optimal time to launch AGI could be pushed out considerably.  In principle, such a breakthrough could come from either pre-AGI forms of AI (or specialized AGI applications that don’t require full deployment) or medical progress occurring independently of AI.  Such developments are more plausible in long-timeline scenarios where AGI is not developed for several decades.

Note that for this effect to occur, it is not necessary for the improvement in background mortality to actually take place prior to or immediately upon entering Phase 2.  In principle, the shift in optimal timelines could occur if an impending lowering of mortality becomes foreseeable; since this would immediately increase our expected lifespan under pre-launch conditions.  For example, suppose we became confident that the rate of age-related decline will drop by 90% within 5 years (even without deploying AGI).  It might then make sense to favor longer postponements—e.g. launching AGI in 50 years, when AI safety progress has brought the risk level down to a minimal level—since most of us could then still expect to be alive at that time.  In this case, the 50 years of additional AI safety progress would be bought at the comparative bargain price of a death risk equivalent to waiting less than 10 years under current mortality conditions.

Table 10 shows the effects of postulating a precipitous drop in background mortality upon entering Phase 2—all the way to m1, i.e. the rate that corresponds to a life expectancy of 1,400 years, same as what we have been assuming successful AGI would achieve.  (Other parameters are the same as in Table 8; and we are assuming here that Phase 1 cannot be accelerated.)

TABLE 10: Pre-deployment mortality plummeting to 1/1400 (medium temporal discounting)

#Phase 12a2b2c2d1%5%20%50%80%95%99%①0y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yWait 1.1mWait 4.9mWait 1.3yWait 6.3yWait 18.0yWait 24.7yWait 26.4y②0y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yWait 1.1mWait 4.9mWait 3.3yWait 6.3yWait 8.9yWait 14.5yWait 15.9y③5y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 6.3yWait 7.4yWait 13.6yWait 15.1y④5y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 6.1yWait 6.3yWait 6.3yWait 6.3y⑤10y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 1.5yWait 6.3yWait 6.3yWait 6.3y⑥10y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 3.6mWait 11.2mWait 1.3yWait 5.2yWait 6.3yWait 6.3y⑦20y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 9.8mWait 1.3yWait 1.3yWait 2.5yWait 3.3y⑧20y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 3.6mWait 1.3yWait 1.3yWait 1.3y

We see that the optimal pause duration becomes longer—but not dramatically so.  That the impact is fairly limited is due in part to safety gains being front-loaded, with diminishing returns arriving quickly after entering Phase 2.  And in part it is due to the “medium”-level temporal discounting (ρ=3%) dominating the mortality rate.

Table 11 shows the same scenarios but with the “low” discount rate (ρ=1.5%).  This does lead to longer wait times, especially in scenarios where the initial AI risk is so high that even after the sizable reductions during Phase 1 and Phases 2a–c, the level of risk remains too high for comfort.

TABLE 11: Pre-deployment mortality plummeting to 1/1400 (low temporal discounting)

#Phase 12a2b2c2d1%5%20%50%80%95%99%①0y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yWait 3.6mWait 1.3yWait 5.1yWait 14.9yWait 33.8yWait 41.2yWait 43.0y②0y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yWait 3.6mWait 1.3yWait 6.3yWait 6.3yWait 22.5yWait 29.6yWait 31.3y③5y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yWait 3.6mWait 1.3yWait 1.3yWait 6.3yWait 22.2yWait 29.4yWait 31.2y④5y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yWait 1.6mWait 4.6mWait 3.2yWait 6.3yWait 6.3yWait 7.8yWait 9.3y⑤10y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yWait 1.4mWait 3.7mWait 1.3yWait 6.3yWait 10.7yWait 17.7yWait 19.4y⑥10y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 6.3yWait 6.3yWait 6.3yWait 6.3y⑦20y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 3.6mWait 1.3yWait 1.3yWait 6.3yWait 6.3yWait 6.3y⑧20y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 1.1mWait 3.6mWait 1.3yWait 1.3yWait 2.2yWait 2.6y

Thus, if the background mortality risk is greatly reduced, then those with a low discount rate would be willing to wait a long time in order for AI risk to decline to a very low level.  Note, however, that even if people stopped dying altogether, it could still be optimal to launch AGI eventually—and in fact to do so without extremely long delays—provided only there is a significant quality-of-life differential, a nontrivial temporal discount rate, and that AI safety continues to improve appreciably.

For contrast, Table 12 illustrates the situation for the opposite scenario, where mortality rates rise upon entering Phase 2.  Unsurprisingly, this shortens optimal pause durations.  The effect for the parameter-setting used in this table—a doubling of the mortality rate—is fairly modest.  It would be more pronounced for greater elevations in the level of peril.

TABLE 12: Pre-deployment mortality rising to 1/20 (medium temporal discounting)

#Phase 12a2b2c2d1%5%20%50%80%95%99%①0y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapWait 2.9mWait 6.6mWait 1.3yWait 2.6yWait 5.0yWait 5.6y②0y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapWait 2.9mWait 6.6mWait 1.3yWait 4.8yWait 6.3yWait 6.3y③5y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 1.3yWait 1.3yWait 1.3yWait 1.3y④5y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 1.3yWait 1.3yWait 1.3yWait 1.7y⑤10y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 1.3yWait 1.3yWait 1.3yWait 1.3y⑥10y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 3.6mWait 1.2yWait 1.3yWait 1.3y⑦20y
5%/y0.3y
70%/y1.0y
25%/y5.0y
5%/y∞
2%/yLaunch asapLaunch asapWait 3.6mWait 3.6mWait 1.1yWait 1.3yWait 1.3y⑧20y
10%/y0.3y
70%/y1.0y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapLaunch asapWait 3.0mWait 3.6mWait 3.6mWait 3.6mSafety testing

AI safety work can provide at least two types of benefit: first, it can improve the nature of an AI system so that it is less likely to cause catastrophic harm if deployed; second, it can provide information about that nature, so that we can better judge whether to deploy it or to keep working to make it safer.  The previous sections modeled both effects with a single parameter (the “rate of AI safety progress”).  If we are willing to tolerate a more complicated setup, we can instead treat them separately.  This leads to models where what is determined in advance is not an optimal launch time but an optimal policy that specifies—conditional on whatever safety information is then available—whether to launch or to continue working and testing.

To keep the setup manageable, we graft a simple testing process onto the multiphase model from the previous section.  Once AGI‑capable systems exist (the start of Phase 2), the true catastrophe probability at that time is unknown: it could be any of seven values, corresponding to the initial risk levels used earlier (1 %, 5 %, 20 %, 50 %, 80 %, 95 %, or 99 %).  We assume a uniform prior over these possibilities.  Safety work reduces the underlying risk over time following the same multiphase schedule as before: Phase 1 with moderate progress, followed (once AGI‑capable systems exist) by a brief period of very rapid safety improvement (Phase 2a), a somewhat slower but still fast phase (2b), a medium‑progress phase (2c), and then a long tail of very slow progress (2d).

Safety tests are triggered by safety progress rather than by clock time.  Starting from the moment AGI‑capable systems are available, a new test is performed every time safety work has reduced the system’s intrinsic catastrophe probability by another 20 % relative to the last test. This reflects the idea that developing informative tests is itself part of safety work: as we make the system safer, we also learn how to probe it more effectively.  If the underlying risk at the moment of testing is p, the test returns “fail” with probability p and “pass” with probability 1−p.  Systems with very high intrinsic riskiness therefore tend to fail tests repeatedly, whereas fairly safe systems mostly pass—even if their remaining risk is still substantial.  In particular, these tests usually cannot distinguish reliably between, say, ten and twenty per cent risk at launch; they are better at separating “clearly terrible” from “not obviously terrible”.

We can formalize this setup as a partially observed Markov decision process (POMDP) and compute the optimal policy numerically (see Appendix G for details).  Table 13 shows the expected delays (counting from the beginning of Phase 2).

TABLE 13: Periodic safety tests

#Phase 12a2b2c2d1%5%20%50%80%95%99%①0y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait 1.4yWait 1.7yWait 2.7yWait 4.9yWait 7.3yWait 8.6yWait 8.9y②0y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yWait 1.6yWait 2.0yWait 3.2yWait 4.8yWait 5.8yWait 6.1yWait 6.1y③5y
5%/y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait 1.1yWait 1.2yWait 1.7yWait 3.1yWait 4.7yWait 5.3yWait 5.5y④5y
10%/y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yWait 4.7mWait 6.6mWait 1.3yWait 3.1yWait 4.8yWait 5.4yWait 5.6y⑤10y
5%/y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait 5.1mWait 6.1mWait 10.5mWait 1.8yWait 3.1yWait 3.7yWait 3.9y⑥10y
10%/y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yWait 3.9mWait 5.3mWait 9.2mWait 1.2yWait 1.5yWait 1.7yWait 1.7y⑦20y
5%/y0.3y
70%/y1y
25%/y5.0y
5%/y∞
2%/yWait 3.9mWait 5.3mWait 9.2mWait 1.1yWait 1.3yWait 1.3yWait 1.3y⑧20y
10%/y0.3y
70%/y1y
25%/y5.0y
10%/y∞
2%/yLaunch asapLaunch asapWait 1.9mWait 3.4mWait 4.5mWait 5.2mWait 5.4m

We observe that in most cases, the optimal policy results in an expected short (but greater-than-zero) delay, to take advantage of the rapid safety progress and concomitant opportunities gaining more information about the system’s riskiness available in Phases 2a and 2b.  Conditional on the system’s initial riskiness being high when entering Phase 2, waiting times are longer; whereas when this is not the case, the optimal policy typically recommends launching within a year or two.

Note that Table 13 is not directly comparable to Table 8 (which represents the multiphase model analyzed earlier, the one most similar to the present model).  This is because earlier we assumed that the decision-maker knew the initial riskiness of the system, whereas in the current model the agent starts out with a uniform probability distribution over the seven possible initial risk levels.  If we want to pinpoint the difference that testing makes, we need to compare it to a baseline in which the agent starts out with the same agnostic distribution yet gains no further information from safety testing.  Table 14 presents the result of such a comparison.

TABLE 14: Difference in outcomes from safety tests

#Avg launch
(no tests)Avg launch
(tests)Δ waitRisk
(no tests)Risk
(tests)Δ riskUtility gain①3.90y5.05y+1.15y22.9%20.6%-2.2%+3.58%②6.30y4.23y-2.07y15.4%16.9%+1.5%+2.95%③1.30y3.23y+1.93y20.2%17.3%-2.9%+1.31%④1.50y3.03y+1.53y15.1%11.5%-3.6%+1.71%⑤1.30y2.05y+0.75y15.7%14.8%-0.9%+0.37%⑥1.30y1.09y-0.21y9.1%9.1%+0.0%+0.45%⑦1.30y0.93y-0.37y9.4%9.6%+0.3%+0.28%⑧0.30y0.25y-0.05y4.2%4.2%+0.0%+0.06%

We see that testing increases expected utility, sometimes by shortening the expected time-to-launch and sometimes by reducing the expected risk-at-launch.  (That the expected utility gains look quite small in percentage terms is not particularly significant—this is driven by the infrequency and low sensitivity we assume of the tests and by other modeling assumptions.  In reality, tests may also provide value by guiding future safety work in more productive directions.)

Figure 4 further illustrates how safety testing affects launch times.  The dashed lines indicate where launches occur without safety testing (but with the agnostic prior over initial riskiness levels) for each of the eight scenarios.  The solid lines show the cumulative probability distributions for the optimal policy with safety testing.  We see that safety testing results in early launches in worlds where tests repeatedly pass, and later launches where tests keep failing and the posterior remains pessimistic.

FIGURE 4: Cumulative distribution functions of launch times with versus w/o safety tests

The main takeaway is that once system safety is uncertain, and future tests may provide information about how risky a system is, the relevant object is not a single optimal launch date but an optimal policy that conditions on evidence.  Such a policy does something no fixed delay can do: it launches quickly when tests indicate the system is likely to be safe enough, but delays when tests reveal signs of danger.  (The value of safety testing, however, depends not only on the quality of the tests themselves but—crucially—also on whether decision‑makers are willing and able to let deployment decisions actually respond to what the tests reveal.)

Distributional considerations

We have analyzed the situation from the standpoint of the current world population as a whole.  However, we need to acknowledge that the prudentially optimal timing for superintelligence is not the same for everyone.

One important factor of divergence is that people’s mortality rates differ.  Elderly people face a higher likelihood in the status quo of dying in the near future, while the young and hale could tolerate longer delays without accumulating an excessive risk of perishing before the main event.

Another factor is that those whose present quality of life is poor could rationally accept a higher risk of death for a shot at experiencing the great abundance and efflorescence that successful AGI would enable than those who are currently enjoying (what in present era is regarded as) a high standard of living.

There are therefore conflicts between different demographics over what is prudentially optimal regarding the timing of AGI.  Other things equal, those who are old, sick, poor, downtrodden, miserable—or who have higher discount rates or less concave preferences over future quality-adjusted life years—should prefer earlier AGI launch dates compared to people who are comparatively satisfied and secure in the status quo.[14]

In the public policy literature, social welfare functions are often designed to include a prioritarian or egalitarian skew, such that a higher desirability is assigned (ceteris paribus) to outcomes in which the worst-off receive a given boost to their welfare than to ones in which a boost of equal magnitude is given to those who are already better-off.[15]  If such priority is given to the worse off, and we combine this stipulation with the observations already made about the divergent prudential interests of different demographics, there may be implications for what is globally optimal regarding AI timelines.

In particular, the optimal timeline to superintelligence is likely shorter on a prioritarian view than it is on a neutral (person-affecting) utilitarian stance.  This is partly because the worse off have less to lose and more to gain from rolling these dice.  And partly it is because, in the case of the sick and the elderly, they have less ability to wait and roll the dice later when the odds may be more favourable.  There is therefore a prioritarian argument for accelerating timelines beyond what the preceding analysis suggests.

Let us examine these issues a little more closely.  One possible thought one might have is that the younger age structure in low-income countries would reduce the strength of the aforementioned prioritarian argument for shorter timelines, by introducing a correlation between being worse off and having longer remaining life expectancy—so that poor people in the developing world would have a prudential interest in longer AGI timelines compared to their better-off counterparts in rich countries.  However, although the population does skew younger in poor countries, this is not enough to make up for the generally higher life expectancy in rich countries.  The difference in life expectancy between rich and poor countries—which can exceed 25 years at birth when comparing the richest and poorest nations—narrows considerably when calculated as a population-weighted average of remaining years, due to the younger age structure in poorer countries.  However, it does not close, let alone reverse.[16]  While some convergence in life expectancy between poor and rich countries might be expected to occur during the remaining lifetime of people living in poor countries, it still seems plausible that, on average, people who are currently economically unfortunate can also expect to die sooner under default conditions than people who are currently economically fortunate.  This positive correlation between poverty and lower remaining life expectancy strengthens the prioritarian case for faster timelines (compared to the distribution-agnostic analysis of the preceding sections).

One may also regard lifespan itself as a contributing factor in how fortunate a person is, and hence—on a prioritarian view—in how strong a claim they have to marginal resources or weighting of their marginal interests in the context of social planning.  There are several different possible ways in which lifespan-related variation could be taken to influence somebody’s baseline welfare level:

i. Remaining life years.  One might hold that (ceteris paribus) persons with more remaining life years are better off than those with fewer years left, since it seems unfortunate to be in a condition in which one is soon about to get sick and die.

If one adopts this stance, then the prioritarian skew towards shorter timelines would be amplified.  This is because older people—whose interests favor shorter timelines—would be weighted more heavily by this metric, since it would adjudge them comparatively unfortunate in the status quo.

ii. Life years already had.  One might hold that (ceteris paribus) persons who have lived longer are better off, on grounds that they have gotten to feast more on life.

If one adopts this stance, then the prioritarian skew would be pulled in the direction favoring longer timelines, since the metric implied by (ii) would tend to deem older people as better off and hence less deserving of marginal consideration.  It would not necessarily pull it far enough to make the prioritarian favor longer timelines all things considered compared to a neutral (non-prioritarian) criterion, since there are other categories of badly-off people (aside from, supposedly, the young) and who may have interests that differentially benefit from shorter timelines.

However, in any case, (ii) seems like a mistaken way to reckon.  Consider two persons, a 10-year-old and a 20-year-old, both of whom have a genetic condition from which they will die at age 30, unless they receive a therapy, of which only one dose is available—in which case they live to age 50.  It seems implausible to maintain that the 10-year-old has a stronger claim to the therapy just because he hasn’t lived as long as the 20-year-old.  It seems more plausible that their claims are equally strong—or, if not, then perhaps that the 20-year-old has a stronger claim (as would be implied by (i)).

A more plausible way to capture whatever intuition might appear to support (ii) would be:

iii. Total life years.  One might hold that (ceteris paribus) persons whose total lifespans are longer are better off, since their endowment of life is greater.

This would accord the 10-year-old and the 20-year-old in the previous example equal weight, since they have the same baseline length of lifespan.  When coupled with a prioritarian ethic, stance (iii) results in greater weight being placed on the interests of those whose lives in the default condition would be shorter.

So whose lives would, absent AGI, be shorter: the lives of the old or the lives of the young?  On the one hand, the old have already survived all the hazards that kill some people prematurely.  On the other hand, the young can expect to benefit from many decades of economic and medical progress which might prolong their lives.  If we extrapolate recent rates of increases in life expectancy, in wealthy countries, we may get a U-shaped curve: younger people and the very oldest people have the longest total life expectancy, with the nadir occurring for those who are around age 80.  (Intuitively: somebody who’s a centenarian has already lived longer than a newborn is likely to do, while a child has an advantage over people who are in their forties because the child is very likely to make it to forty and then gets benefit from four more decades of medical progress.)  Since there are many more people who are substantially younger than 80 than who are substantially older than 80, this means there is a positive correlation between youth and total life expectancy.  Hence (iii) induces an overall prioritarian downweighting of the interests of the young in wealthy countries.  This would shorten the optimal timeline to AGI.  In poor countries, however, the relationship may be more complicated due to high infant mortality: newborns have low expected total lifespans; young adults, high expected total lifespans, older adults, lower expected total lifespans; and the very old, high expected total lifespans.  Absent a detailed quantitative analysis, it is not obvious how that adds up.

If one expects a radical breakthrough in life extension will happen, even in the absence of AGI, x years from now, which will enable people to live very long lives, such as two hundred years (or even to attain longevity “escape velocity”), then a discontinuity is introduced whereby those who would live less than x years without AGI are comparatively a lot more unfortunate according to (iii) than those who without AGI have more than x years left to live.  Those with less than x years left to live without AGI would thus have their interests upweighted in a prioritarian social welfare function.  This would increase the shift towards shorter timelines being optimal, assuming that x is within the lifetime of at least some significant fraction of currently living people.

Note that these effects from prioritarian upweighting of those with shorter total life expectancy—or those with shorter remaining life expectancy, if we adopt stance (i)—are additional to the effect that results from whatever extra benefit there is to adding life years to otherwise short lives that stem directly from diminishing marginal utility in life years (or QALYs).  In other words, there are two possible reasons for giving an extra life year to a short-lived person rather than to a long-lived person, which are analogous to two possible reasons for giving a hundred dollar bill to a poor person rather than to a rich person: first, the poor person may derive a greater benefit from the hundred dollars; and second, the poor person might be overall worse off than the rich person, and would therefore—on a prioritarian ethic—have a stronger claim to marginal benefits (such that even if we suppose that the rich person would derive an equally large benefit from the hundred dollar bill—perhaps they are out of cash and need a taxi home—it would still be better for it to go to the poor person).

Yet another possible stance on how life chronology could be a prioritarian weight-factor is that there is some specific number of life years—for instance, the traditional three-score-and-ten—such that it is bad for a person to die earlier than that yet not significantly better to live beyond it.  The metaphor might be that a human is like a cup of limited capacity, and once it’s been filled up with life there’s no value to keep pouring.

iv. Full cup.  One might hold that it is unfortunate for somebody to die before the age of approximately seventy, but somebody who lives much beyond seventy is not thereby significantly better off, since they’ve already had a full life.[17]

This stance would have four relevant implications.  First, it would reduce the value of AGI success, because some of the supposed upside consisted of the (exponentially time-discounted) value of lifespans much longer than the currently typical one for humans.  (However, another part of the upside—the prospect of a greatly improved quality of life—would remain important.)  Second, it would tilt the prioritarian skew in favor of the young, since they are not guaranteed in the pre-AGI default condition to reach the “full cup” number of life years that the old have already attained, thus making the young count as more unfortunate, thus giving their interests (which favor longer timelines) greater weight.  Third, it would increase the downside for the young of early AGI launch, since—unless the risk has been brought down to quite a low level—an AGI launch could amplify the threat that the young will fail to reach their normal allotment of years.  And fourth, since this increased downside pertains exclusively to the young, whereas the old, according to (iv), have little to lose from an AGI launch as they are already home and dry, it would tilt prioritarian concern even further towards favoring the interests of the young.  The upshot would be that optimal AGI timelines, if one adopted the “full cup” stance, would become significantly longer.

However, even if the “full cup” stance might have some prima facie appeal, it is plausible that the intuitions that appear to support it are rooted—at least in substantial part—in a conflation between chronological age and contingently associated circumstances of age.  In contemporary settings, old age is associated with multimorbidity, declining capacities, loneliness, pain, loss of autonomy, a sense of being a burden, and bleak future prospects.  It would hardly be remarkable if additional life years under those conditions have limited appeal to many.[18]  This might lead one to believe that seventy years (or some “normal lifespan” in that neighborhood) is all we need to max out our utility function in life years.  But the most it would really show is that in present circumstances we gain little from living much beyond that age.  In other circumstances, we may gain a lot.  In particular, if an AGI-breakthrough enables the restoration of full health and youthful vigor, and a return or even strengthening of our previously lost capacities—and pulls open the curtains to a long continued existence, together with friends and family who can also expect to stick around for a long time, in a world that is dawning on a new age, immeasurably richer, more promising, and teeming with marvels than any earlier era—then why should additional life years stop being valuable for somebody just because seventy life years have passed since they were born?  In such a scenario, would we not rather all be like children again—with the potential before us so greatly outstripping our comparatively paltry past?

This suggests that we should reject the “full cup” stance as a fundamental evaluative principle, and specifically reject its application in the context of transformative AI, where many of the usual conditions of life years at old age are stipulated not to obtain.  It is also worth noting that even under current (often very bad) conditions, those who seem best placed to judge the value of continued life at old age—namely, those who actually are in that situation and have first-hand knowledge of what it is like—often deny the stance and place a high value on remaining alive longer.  For example, in one multicenter study of hospitalized patients aged 80+, more than two-thirds were willing to give up at most one month of a remaining year for “excellent health”.[19]  Surrogate decision-makers systematically underestimated their reluctance to trade away time.  When patients who were still alive a year later were asked the same question again, they were willing to trade even less time for better health than at baseline.

We have focused on distributional considerations that are fairly directly tied to when AGI is developed.  There are of course many other potentially important distributional considerations that arise in the context of AGI.  For example, citizens of a country that leads AGI development might benefit more than citizens of other countries; and individuals who directly participate in a successful AGI launch might gain disproportionate profits and glory.  Although who and how may be correlated in various ways to when, these broader distributional questions fall outside the scope of this paper.

Other-focused prudential concerns

A different set of considerations arises if we expand our conception of what might lie in the prudential interest of a person to include the welfare of other persons they strongly care about.  For example, while it might be in the narrow self-interest of an old person for superintelligence to be launched very soon, they might prefer a somewhat delayed launch because they also care about their grandchildren who have a much longer remaining life expectancy under pre-AGI conditions than they themselves do.

However, if we take into account these kinds of preferences, we should also take into account preferences going in the other directions: younger people who, for their own part, might benefit from longer timelines yet may prefer somewhat shorter timelines because they care about others who are closer to dying.  Just as we can love our children and grandchildren, we can also love our parents and grandparents.  So this type of concern for kin might total up to roughly a wash.

With regard to caring for our friends (or admired strangers), it is likewise unclear which way the correlation goes between somebody’s age and the number of people who care about them.  The very old may have fewer people who care about them because many of their friends have already died; but the very young may also have fewer friends who care about them because they have not met many people yet or have not known them for long.

On a prioritarian view, including other-focused concerns among our prudential interests might induce a slight shift in the direction of longer timelines.  Suppose we assume a symmetric degree of average care between the young and the old.  Suppose, further, that the old are on average worse off than the young in the default condition (because of their shorter remaining and total life expectancy); so that a prioritarian reckoning upweights the interests of the old in determining the optimal social policy.  Then the prioritarian upweighting of the interests of the old means that the interests of those whom the old care about receive extra weight (relative to what they would get if we didn’t include other-focused concerns in our conception of what is prudentially desirable for somebody).  Since on average the people whom old people care about are younger than they are themselves, this would shift some emphasis towards younger people, whose interests are served by longer timelines.  Any such effect, however, is quite subtle and second-order.

Theory of second best

We have thus far asked the question about the optimal timing for superintelligence (from a person-affecting perspective) in an abstracted way—as if the world had a knob for different dates and your job was to turn it to the correct setting.  In reality, the situation is more complex.  Nobody has full control over AGI timelines, and different actors have different preferences.  The ideal timing may not be achievable, or might be achievable only through methods that would carry a significant risk of making the timing much worse than it would otherwise have been.  Furthermore, interventions aimed at influencing when superintelligence arrives may have other important consequences besides their effect on timing.  For these reasons, while the preceding discussion highlights some relevant background considerations, it does not on its own imply particular policy recommendations.

While a full policy analysis would require bringing into consideration many facts and arguments that are out of scope for this paper, it may be useful to briefly list some of the ways that an AI pause, or efforts to bring about such a pause, could have undesirable effects (aside from simply delaying the arrival of the benefits that successful AGI could bring):

  • The pause occurs too early.  People conclude that it was pointless, and become less willing to pause later when it would have been useful.
  • The call for a pause results in poorly designed or incomplete regulation, producing safety theater that adds costs and bureaucracy and slows useful applications, while doing nothing to reduce the real risks.  Compliance and box-ticking crowd out substantive work on risk reduction.
  • A pause is implemented, but the developments it aims to forestall continue anyway—just elsewhere.  Work may be driven underground, or shift towards less scrupulous actors or less cooperative states.
  • The pause has an exemption for national security, pushing AI activities away from the civilian into the military sector.  The result may be greater emphasis on destructive uses, lower transparency and democratic oversight, amplified AI-assisted coup risk or power concentration risk, and perhaps less competent alignment efforts.
  • There are calls for a pause but they go unheeded—and no catastrophe occurs.  Those who warned of danger are discredited, making it harder for future calls for AI safety work to be taken seriously.
  • The push for a pause highlights the strategic importance of the  technology, intensifying geopolitical AI competition.
  • An international agreement is reached on pausing, but this creates a prisoner’s dilemma in which some parties cheat (driving developments into covert programs) or triggers geopolitical conflict when some countries accuse others of cheating.
  • A pause is implemented, leading to economic recession and general pessimism and lowered hopes for the future.  People see the world more as a zero-sum battle for a limited set of resources, increasing conflict and tribalism.
  • A pause prolongs the period during which the world is exposed to dangers from applications of already developed levels of AI (and to risks independent of AI), which more advanced AI could have helped mitigate.
  • To enforce a pause, a strong control apparatus is created.  The future shifts in a more totalitarian direction.
  • There is a pause on AI development, yet progress in hardware and algorithm development continues.  When the pause is eventually lifted, there is a massive compute and/or algorithm overhang that leads to explosive advances in AI that are riskier than if AI had advanced at a steadier pace throughout.  The world will also not have had the opportunity to learn from and adapt to living with weaker AI systems.  (Or in a more extreme case, the pause holds until dangerous models or superintelligence can be implemented on consumer-grade hardware, making it ungovernable.)- Agitation for a pause leads to extremism.  Some people become radicalized or violent.  Attitudes towards AI become polarized to such an extent as to make constructive dialogue difficult and destroy the ability of institutions to pass nuanced adaptive safety policy.
  • The push for a pause galvanizes supporters of AI to push back.  Leading AI firms and AI authorities close ranks to downplay risk, marginalizing AI safety researchers and policy experts concerned with AI risk, reducing their resourcing and influence.A pause, initially sold as a brief moratorium to allow social adjustments and safety work to catch up, calcifies into a de facto permaban that prevents the immense promise of superintelligence from ever being realized—or is indefinitely extended without ever being formally made permanent.[20]

Of course, there are also some potentially positive side effects that might come from calls to bring about a pause even if they fail in their main aim.  For example, they might lead to an increase in funding for AI safety work as a more acceptable alternative to pausing, or generally stimulate the world to more seriously prepare for AGI.  Still, the potential ways that pausing or pushing for pausing could backfire are many and quite plausible.

The profile of potential upsides and downsides of a pause or delay looks different depending on the mechanics of implementation and the context in which it takes place.  We have already touched on the idea that the safety benefit of a pause of a given duration seems likely to be much greater if it occurs at a late stage—ideally, once the capacity for AGI exists, and perhaps even a fully implemented system, yet prior to maximum scaleup or general deployment; since extra time for safety testing, oversight, and final adjustment may be especially impactful during that stage.  The scope of and causal process inducing the pause is also relevant.  Consider the following cases:

  1. Frontrunner unilaterally burning lead.  At the time when AGI becomes possible, one developer might have a technological lead over its competitors.  It could choose to burn some or all of its lead to implement extra precautions while remaining ahead.  This type of pause is relatively attractive, as it has less risk of producing many of the downsides listed above.  It does not rely on the creation of a regulatory apparatus or enforcement regime, and it is less likely to result in a permanent abandonment of superintelligence.  The pause is self-limiting, as it expires once a competitor catches up.  If the case for additional safety precautions is very clear and strong, this competitor may also be persuaded to agree to halt (either unilaterally or in coordination with the frontrunner, perhaps with some nudging from the government), thus extending its duration.  But eventually, as more competitors reach similar capability levels, the pause naturally expires.  The scope for this kind of pause, however, is reduced in a highly competitive environment.  At present, it is unclear who is ahead; and whatever lead they have is measured in a small number of months.
  2. Government-imposed moratorium.  This brings in more of the potential failure modes and side-effects that we listed.  Risks of bureaucratization, militarization, self-coups, etc. are increased.  The maximum duration of the pause is extended, and there is a greater risk that it would remain in place for longer than it ought to.  It matters how the government action was brought about: if it is the result of technocratic pragmatics, the risk of it becoming too long or permanent is lower than if it comes about as a result of a general political anti-AI mobilization that stigmatizes the very idea of superintelligence.  Instead of an outright moratorium, there could be regulation that permits the development and deployment of AGI only when safety standards have been met—this might be theoretically superior to an outright ban, but in practice it could be difficult to specify sensible criteria with enough precision.
  3. Internationally agreed prohibition.  Since this would involve state interventions, it would bring in many of the failure modes of a government-imposed moratorium.  If the international agreement prohibits all development of new frontier systems, and includes effective verification provisions, it might avoid some of the risks (such as militarization and self-coups) that may be amplified in the case of individual government-imposed moratoria that have carveouts for national security applications.  Other risks would be amplified, especially the risk that the moratorium ossifies into a permanent relinquishment of advanced AI, since in a tightly enforced global regime there would be no place where AI development could continue.  The enforcement regime itself might also present some risk of eventually leading towards some sort of global totalitarian system.  Yet without tight global enforcement, we would instead face the risks of selection effects, where AI development continues but only in the least cooperative states who refuse to join or in covert programs operated by defecting signatories.  More limited international agreements on safety standards or short pauses might reduce some of these risks: for example, if AI projects in the U.S. and China are running neck-to-neck when dangerous AI systems are about to be developed, there may be little opportunity for a unilateral pause (of the “frontrunner burning lead” type); but some pragmatic cooperation might be possible, in which both parties agree to suspend large training runs for a finite period of time (perhaps with provisions for inspectors to verify that their biggest AI centers are idle) to allow some additional time to work out critical safety issues before resuming.

These are the merest schematics.  In reality, policymakers will confront a more complicated and textured set of options, subject to many practical constraints, and in which the effect on AI timelines is only one of many consequences that need to be factored into decisions.  While some of the variables may be analyzed abstractly and ahead of time, much of the essential context will only become evident as developments unfold, and will require continuing judgment calls to adjust policies to an evolving situation.

The analysis of optimal AI timelines is relevant not only to questions of whether or not to bring about an AI pause but also to other policy choices that could impact the pace of AI development and deployment.  For example, chip export restrictions, taxes on data centers, or employment laws that make it harder to lay off workers are possible measures that may be proposed or rejected mainly for reasons other than their impacts on AGI timelines.  Nevertheless, they would likely retard AI progress on the margin; and so, in evaluating such policies, it would be useful to know whether that effect would be desirable or undesirable.

Conclusions

We have examined optimal timing for superintelligence from a person-affecting perspective, focusing on mundane considerations, leaving aside arcane considerations and impersonal perspectives for future work.  A basic point here is that the baseline is not safe—not only because there are other catastrophic risks besides AI but also because of the high rate of individual sickness and death under the status quo.  The appropriate analogy for the development of superintelligence is not Russian roulette but surgery for a serious condition that would be fatal if left untreated.

A simple go/no-go model illustrated how, if aligned superintelligence would enable major life extension and quality-of-life improvements, then even very high levels of Pdoom can be worth incurring in terms of quality-adjusted life expectancy.

Note that Pdoom here refers to the probability of AI causing human extinction.[21]  The highest tolerable probability of misaligned superintelligence could be even higher—plausibly as high as 100% with the given assumptions—since it is far from certain that all humans would die if misaligned superintelligence is deployed.[22]

We then proceeded to explore a series of models in which the decision-maker has a richer option set involving when to deploy superintelligence, rather than just the binary choice between deploying it immediately or never.  Waiting can reduce catastrophic risk through safety progress, but incurs costs of ongoing mortality and foregone (or temporally discounted) benefits.  A robust qualitative pattern emerges.  Long waits are favored only when initial risk is very high and safety progress falls within a specific intermediate range—fast enough that waiting yields meaningful risk reduction, yet slow enough that the job isn’t done quickly anyway.  Outside this conjunction, optimal delays tend to be modest.

Various robustness checks shift recommendations in predictable directions without overturning the basic result.  Simply adding temporal discounting pushes toward later launch by downweighting far-future benefits, though it rarely produces very long delays unless the rate is quite high.  Adding quality-of-life uplift pushes toward earlier launch, though this effect saturates: once post-AGI life is sufficiently attractive, pre-AGI life contributes little to expected value, and the main concern becomes simply reaching the post-AGI era.  When quality-of-life uplift is present, the effect of temporal discounting can be reversed: for sufficiently large quality-of-life differentials, temporal discounting pushes towards earlier launch, as impatience penalizes the delay of the onset of that higher-quality existence.  Finally, diminishing marginal utility in quality-adjusted life years makes the decision-maker more conservative, shrinking the region where immediate or early launch is optimal—but even substantial risk aversion does not radically alter the overall picture.

A more elaborate model was then introduced, which featured two timing variables: time until AGI capability exists (Phase 1, perhaps largely driven by technical difficulty), and any deliberate pause before full deployment once capability is attained (Phase 2).  This matters because the rate of safety progress is unlikely to be uniform across stages.  Once a deployable system exists, there is plausibly a “safety windfall”—the ability to study, probe, and stress-test the actual artifact, and to leverage its own capabilities to accelerate alignment work.  Yet such gains face diminishing returns as the most tractable problems are solved.  The upshot is that time early in Phase 2 purchases more safety per unit than equivalent time earlier or later.  The multiphase model often recommends short but non-zero pauses—months or a small number of years—once AGI-ready systems exist.

Background conditions around the time of AGI capability also matter.  If near-AGI systems destabilize the world through bioweapon proliferation, autonomous weapons, epistemic corrosion, or geopolitical escalation, the cost of waiting rises, favoring short and purposeful post-capability pauses.  Conversely, a major non-AGI mortality reduction—especially effective anti-aging therapies—would lower the cost of waiting, making longer postponements potentially optimal.

We also considered a variation of the multiphase model where system risk is uncertain and tests can provide information.  This changes the object of evaluation from an optimal launch date to an optimal policy: launch when evidence looks sufficiently favorable, delay when it does not.  Safety testing can shorten or lengthen expected wait times, and can increase or decrease risk at launch, but in either case increases expected utility.

Prudentially optimal timing varies across individuals.  The elderly and the ill face higher near-term mortality in the status quo; those with poor quality of life have less to lose and more to gain from a transition to potential post-AGI abundance.  Those who are old, sick, poor, or miserable should therefore generally prefer earlier launch dates than those who are comfortable and secure.  If policy incorporates prioritarian weighting, this shifts the global optimum toward shorter delays.  Some intuitions about lifespan—such as the “full cup” notion that life-years beyond approximately seventy contribute little additional value—might push in the opposite direction; but we have argued such intuitions are plausibly misguided in a transformative-AI context, where many accustomed factors (such as the deprivations of old age) need not obtain.

These models have treated timing as if there were a simple knob to turn.  In reality, no one has full control; different actors have different preferences; the ideal timing may be unachievable; and interventions aimed at influencing timelines have consequences beyond their effect on timing.  Even if, in an abstract sense, a perfectly implemented pause before full superintelligence deployment would be desirable, there are numerous possible ways in which a bungled moratorium or other efforts to slow down AI developments could have bad effects in practice—for instance, by shifting developments to less regulated places, by increasing militarization, by creating hardware or algorithmic overhangs that ultimately make the AI transition more explosive, or by creating stigma and bureaucratization that risk ossifying into permanent relinquishment.

For these and other reasons, the preceding analysis—although it highlights several relevant considerations and tradeoffs—does not on its own imply support for any particular policy prescriptions.  If nevertheless one wishes to compress the findings into a possible practical upshot, we might express it with the words swift to harbor, slow to berth: move quickly towards AGI capability, and then, as we gain more information about the remaining safety challenges and specifics of the situation, be prepared to possibly slow down and make adjustments as we navigate the critical stages of scaleup and deployment.  It is in that final stage that a brief pause could have the greatest benefit.

Bibliography

Abellán-Perpiñán, J., Pinto-Prades, J., Méndez-Martínez, I. & Badía-Llach, X. (2006). “Towards a Better QALY Model”. Health Economics 15(7): pp. 665–676.

Amodei, D. (2024). “Machines of Loving Grace: How AI Could Transform the World for the Better”. https://www.darioamodei.com/essay/machines-of-loving-grace

Arias, E., Xu, J., Tejada-Vera, B. & Bastian, B. (2024). “U.S. State Life Tables, 2021”. National Vital Statistics Reports 73(6). (National Center for Health Statistics: Hyattsville, MD). https://www.cdc.gov/nchs/data/nvsr/nvsr73/nvsr73-06.pdf

Aschenbrenner, L. (2020). “Existential Risk and Growth”. Global Priorities Institute Working Paper No. 6-2020. https://globalprioritiesinstitute.org/leopold-aschenbrenner-existential-risk-and-growth/

Baumgartner, F. et al. Deadly Justice: A Statistical Portrait of the Death Penalty. (Oxford University Press: New York, 2017)

Binder, D. (2021). “A Simple Model of AGI Deployment Risk”. Effective Altruism Forum (9 July 2021). https://forum.effectivealtruism.org/posts/aSMexrjGXpNiWpbb5/a-simple-model-of-agi-deployment-risk

Bleichrodt, H. & Pinto, J. (2005). “The Validity of QALYs under Non-Expected Utility”. The Economic Journal 115(503): pp. 533–550.

Bostrom, N. (2003). “Astronomical Waste: The Opportunity Cost of Delayed Technological Development”. Utilitas 15(3): pp. 308–314.

Bostrom, N. Superintelligence: Paths, Dangers, Strategies (Oxford University Press: Oxford, 2014)

Bostrom, N. (2024). “AI Creation and the Cosmic Host”. Working paper.

https://nickbostrom.com/papers/ai-creation-and-the-cosmic-host.pdf

Christiano, P. (2023a). “Comment on ‘But Why Would the AI Kill Us?’”. LessWrong (17 April 2023). https://www.lesswrong.com/posts/87EzRDAHkQJptLthE/but-why-would-the-ai-kill-us?commentId=sEzzJ8bjCQ7aKLSJo

Christiano, P. (2023b). “Comment on ‘Cosmopolitan Values Don’t Come Free’”. LessWrong (31 May 2023). https://www.lesswrong.com/posts/2NncxDQ3KBDCxiJiP/cosmopolitan-values-don-t-come-free?commentId=ofPTrG6wsq7CxuTXk

Freitas, R. Nanomedicine: Volume 1: Basic Capabilities. (Landes Bioscience: Austin, Texas, 1999)

Grace, K. (2022). “Counterarguments to the Basic AI Risk Case”. World Spirit Sock Puppet (14 October 2022). https://worldspiritsockpuppet.substack.com/p/counterarguments-to-the-basic-ai

Greenblatt, R. (2025). “Notes on Fatalities from AI Takeover”. Unpublished manuscript.

Hall, R. & Jones, C. (2007). “The Value of Life and the Rise in Health Spending”. Quarterly Journal of Economics 122(1): pp. 39–72.

Harris, J. The Value of Life: An Introduction to Medical Ethics (Routledge: London, 1985). Chapter 5.

Houlden, T. (2024). “‘The AI Dilemma: Growth vs Existential Risk’: An Extension for EAs and a Summary for Non-economists”. Effective Altruism Forum (11 November 2024). https://forum.effectivealtruism.org/posts/9zzGKfSdMeL7bGoPC/the-ai-dilemma-growth-vs-existential-risk-an-extension-for

Hunt, T. & Yampolskiy, R. (2023). “Building Superintelligence Is Riskier Than Russian Roulette”. Nautilus (2 August 2023). https://nautil.us/building-superintelligence-is-riskier-than-russian-roulette-358022/

Jones, C. (2016). “Life and Growth”. Journal of Political Economy 124(2): pp. 539–578.

Jones, C. (2024). “The A.I. Dilemma: Growth versus Existential Risk”. American Economic Review 6(4): pp. 575–590.

Moravec, H. Mind Children: The Future of Robot and Human Intelligence (Harvard University Press: Cambridge, MA, 1988)

Parfit, D. (1997). “Equality and Priority”. Ratio 10(3): pp. 202–221.

Russell, S. (2024). Remarks at ITU AI for Good Summit Media Roundtable, Geneva, 18 April. https://www.itu.int/hub/2024/04/moving-ai-governance-from-principles-to-practice/

Sandberg, A. & Bostrom, N. (2008). Whole Brain Emulation: A RoadmapTechnical Report 2008-3. Future of Humanity Institute, University of Oxford. https://ora.ox.ac.uk/objects/uuid:a6880196-34c7-47a0-80f1-74d32ab98788/files/s5m60qt58t

Sanderson, W. & Scherbov, S. (2005). “Average remaining lifetimes can increase as human populations age”. Nature 435(7043): pp. 811–813.

Snell, T. (2021). Capital Punishment, 2020—Statistical Tables. NCJ 302729. Washington, DC: U.S. Department of Justice, Office of Justice Programs, Bureau of Justice Statistics.

Tsevat, J., Dawson, N., Wu, A., et al. (1998). “Health Values of Hospitalized Patients 80 Years or Older”. JAMA 279(5): pp. 371–375.

United Nations, Department of Economic and Social Affairs, Population Division (2024). World Population Prospects 2024. https://population.un.org/wpp/

Williams, A. (1997). “Intergenerational Equity: An Exploration of the ‘Fair Innings’ Argument”. Health Economics 6(2): pp. 117–132.

Wrigley-Field, E. & Feehan, D. (2022). “In a stationary population, the average lifespan of the living is a length-biased life expectancy”. Demography 59(1): pp. 207–220.

Yudkowsky, E. & Soares, N. (2025a). If anyone builds it, everyone dies: Why superhuman AI would kill us all. (Little, Brown and Company: New York)

Yudkowsky, E. & Soares, N. (2025b). “Why would making humans smarter help?” If Anyone Builds It, Everyone Dies. [Supp. online material] https://ifanyonebuildsit.com/13/why-would-making-humans-smarter-help

Appendix A: Details for the “timing and safety progress” model

Let t denote the AGI launch time.

The pre-AGI annual mortality hazard is set to correspond to an average remaining life expectancy of 40 years.  This yields a continuous hazard rate of:

m0=140=0.025

If AGI is launched successfully, mortality is assumed to fall to a much lower value, corresponding to a life expectancy of 1,400 years:

m1=11400≈0.000714

The probability of catastrophic failure at launch declines with safety progress.  If initial catastrophic risk at t=0 is p0 and safety improves at annual fractional rate g, then the continuous decay rate is:

r=−ln(1−g)

and the launch-time catastrophe probability is:

Pdoom(t)=p0e−rt

Expected remaining life-years if AGI is launched at time t are:

E(t)=1−e−m0tm0+e−m0t(1−p0e−rt)1m1

The optimal interior launch time is found by solving E′(t)=0, yielding:

t∗=1rln(p0(m0+r)m0−m1)

If the expression inside the logarithm is less than or equal to 1, then t∗=0, meaning immediate launch maximizes expected remaining life-years.  A positive t∗ exists only when initial catastrophic risk is high enough and safety improves fast enough that waiting reduces expected loss more than the background mortality accumulated during the delay.

Appendix B: Details for the “temporal discounting” model

To incorporate a constant pure time preference, we discount future life-years at rate ρ.  The expected discounted remaining life-years as a function of the AGI launch time t is:

WRONG

Eρ(t)=1−e−(m0+ρ)tm0+ρ+e−(m0+ρ)t(1−Pdoom(t))1m1+ρ

where Pdoom(t)=p0e−rt as in Appendix A.

Differentiating with respect to t and setting E′ρ(t)=0 gives the interior first-order condition:

(m1−m0)e−(m0+ρ)t+p0(m0+ρ+r)e−(m0+ρ+r)t=0

which rearranges to the threshold equation:

p0(m0+ρ+r)=(m0−m1)e−rt

Solving for t yields the optimal discounted launch time:

t∗=1rln(p0(m0+ρ+r)m0−m1)

If the expression inside the logarithm is less than or equal to 1, then t∗=0, so immediate launch maximizes expected discounted life-years.  A positive interior solution exists only when initial catastrophic risk is sufficiently high and safety improves sufficiently quickly that waiting reduces expected discounted loss more than the additional background mortality incurred during the delay costs.

Tables B1–B3 show the results for different values of the pure temporal discount rate (ρ).

TABLE B1: Low discount rate (ρ=1.5%)

Safety Progress1%5%20%50%80%95%99%No safety progress
(0.0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial safety progress
(0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 300.4yWait 472.2yWait 513.4yVery slow safety progress
(1.0%/yr)Launch asapLaunch asapLaunch asapWait 3.0yWait 49.7yWait 66.8yWait 71.0yModerate safety progress
(10.0%/yr)Launch asapLaunch asapWait 1.7yWait 10.4yWait 14.9yWait 16.5yWait 16.9yBrisk safety progress
(50.0%/yr)Launch asapWait 7.1mWait 2.6yWait 3.9yWait 4.6yWait 4.8yWait 4.9yVery fast safety progress
(90.0%/yr)Launch asapWait 8.2mWait 1.3yWait 1.7yWait 1.9yWait 2.0yWait 2.0yUltra-fast safety progress
(99.0%/yr)Wait 1.7mWait 5.9mWait 9.5mWait 11.9 mWait 1.1yWait 1.1yWait 1.1y

TABLE B2: Medium discount rate (ρ=3%)

Safety Progress1%5%20%50%80%95%99%No safety progress
(0.0%/yr)Launch asapLaunch asapLaunch asapNeverNeverNeverNeverGlacial safety progress
(0.1%/yr)Launch asapLaunch asapLaunch asapWait 142.3yWait 612.0yWait 783.8yWait 825.0yVery slow safety progress
(1.0%/yr)Launch asapLaunch asapLaunch asapWait 29.1yWait 75.8yWait 92.9yWait 97.0yModerate safety progress
(10.0%/yr)Launch asapLaunch asapWait 2.6yWait 11.3yWait 15.8yWait 17.4yWait 17.8yBrisk safety progress
(50.0%/yr)Launch asapWait 7.5mWait 2.6yWait 3.9yWait 4.6yWait 4.9yWait 4.9yVery fast safety progress
(90.0%/yr)Launch asapWait 8.2mWait 1.3yWait 1.7yWait 1.9yWait 2.0yWait 2.0yUltra-fast safety progress
(99.0%/yr)Wait 1.7mWait 5.9mWait 9.5mWait 11.9mWait 1.1yWait 1.1yWait 1.1y

TABLE B3: High discount rate (ρ=5%)

Safety Progress1%5%20%50%80%95%99%No safety progress
(0.0%/yr)Launch asapLaunch asapLaunch asapNeverNeverNeverNeverGlacial safety progress
(0.1%/yr)Launch asapLaunch asapLaunch asapWait 447.5yWait 917.2yWait 1089.0yWait 1130.2yVery slow safety progress
(1.0%/yr)Launch asapLaunch asapLaunch asapWait 55.7yWait 102.5yWait 119.6yWait 123.7yModerate safety progress
(10.0%/yr)Launch asapLaunch asapWait 3.8yWait 12.5yWait 16.9yWait 18.5yWait 18.9yBrisk safety progress
(50.0%/yr)Launch asapWait 7.9mWait 2.7yWait 4.0yWait 4.7yWait 4.9yWait 5.0yVery fast safety progress
(90.0%/yr)Launch asapWait 8.3mWait 1.3yWait 1.7yWait 1.9yWait 2.0yWait 2.0yUltra-fast safety progress
(99.0%/yr)Wait 1.7mWait 5.9mWait 9.5mWait 11.9mWait 1.1yWait 1.1yWait 1.1y

Appendix C: Details for the “quality-of-life-adjustment” model

We generalize the objective function to maximize expected discounted quality-adjusted life-years (QALYs).  Let q0 and q1 be the quality of life before and after AGI, respectively. The expected value as a function of launch time t is:

Eρ,q(t)=∫t0q0e−(m0+ρ)sds+e−(m0+ρ)t(1−Pdoom(t))∫∞0q1e−(m1+ρ)udu

Defining constants A=q0m0+ρ, B=m0+ρ, and C=q1m1+ρ, the integrated form simplifies to:

E(t)=A(1−e−Bt)+Ce−Bt(1−p0e−rt)

Differentiating with respect to t and solving the first-order condition E′(t)=0 yields the optimal risk threshold p∗:

p∗=B(C−A)C(B+r)

The optimal launch time t∗ is derived by solving p0(e−rt∗)=p∗:

t∗=1rln(p0p∗)

(If p0≤p∗, then t∗=0.)

The “launch asap” region expands as post-AGI quality increases, but it is bounded.  As q1→∞ (implying C→∞), the threshold p∗ approaches BB+r=m0+ρm0+ρ+r.  Thus, even for an infinite prize, immediate launch is optimal only if the current risk is lower than this ratio.  If risk exceeds this bound, it remains optimal to wait, as the probability of success improves through safety progress (r) faster than the value of the prize diminishes through mortality and discounting (m0+ρ).

The tables below illustrate this model.  We first look at the case where a post-AGI lifeyear has a quality that is twice as high as a pre-AGI lifeyear (q0=1,q1=2) for low, medium, and high discount rates.

TABLE C1: Small quality difference (q1/q0=2), low discount rate (ρ=1.5%)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapLaunch asapNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 20.2yWait 192.0yWait 233.2yVery slow (1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 21.9yWait 39.0yWait 43.1yModerate (10%/yr)Launch asapLaunch asapLaunch asapWait 7.7yWait 12.2yWait 13.8yWait 14.2yBrisk (50%/yr)Launch asapWait 2.3mWait 2.2yWait 3.5yWait 4.2yWait 4.4yWait 4.5yVery fast (90%/yr)Launch asapWait 6.7mWait 1.2yWait 1.6yWait 1.8yWait 1.8yWait 1.9yUltra-fast (99%/yr)Wait 29.2dWait 5.2mWait 8.8mWait 11.2mWait 1.0yWait 1.1yWait 1.1y

TABLE C2: Small quality difference (q1/q0=2), medium discount rate (ρ=3%)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 122.2yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 27.1yWait 44.2yWait 48.3yModerate (10%/yr)Launch asapLaunch asapLaunch asapWait 6.7yWait 11.1yWait 12.8yWait 13.2yBrisk (50%/yr)Launch asapLaunch asapWait 1.9yWait 3.2yWait 3.9yWait 4.2yWait 4.2yVery fast (90%/yr)Launch asapWait 5.7mWait 1.1yWait 1.5yWait 1.7yWait 1.8yWait 1.8yUltra-fast (99%/yr)Wait 12.8dWait 4.6mWait 8.2mWait 10.6mWait 11.8mWait 1.0yWait 1.0y

TABLE C3: Small quality difference (q1/q0=2, high discount rate (ρ=5%)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 202.6yWait 374.4yWait 415.6yVery slow (1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 31.4yWait 48.5yWait 52.6yModerate (10%/yr)Launch asapLaunch asapLaunch asapWait 5.7yWait 10.1yWait 11.8yWait 12.1yBrisk (50%/yr)Launch asapLaunch asapWait 1.6yWait 3.0yWait 3.6yWait 3.9yWait 3.9yVery fast (90%/yr)Launch asapWait 4.6mWait 11.8mWait 1.4yWait 1.6yWait 1.7yWait 1.7yUltra-fast (99%/yr)Launch asapWait 4.0mWait 7.7mWait 10.0mWait 11.3mWait 11.7mWait 11.8m

For comparison, let’s also look at a version where post-AGI lifeyears are ten times as good as pre-AGI lifeyears (q1/q0=10).  Table C4 shows the case for a median discount rate.

TABLE C4: Large quality difference (q1/q0=10, medium discount rate (ρ=3%)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapLaunch asapNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapLaunch asapWait 24.2yWait 65.4yVery slow (1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 2.6mWait 17.3yWait 21.4yModerate (10%/yr)Launch asapLaunch asapLaunch asapWait 4.1yWait 8.6yWait 10.2yWait 10.6yBrisk (50%/yr)Launch asapLaunch asapWait 1.5yWait 2.8yWait 3.5yWait 3.8yWait 3.8yVery fast (90%/yr)Launch asapWait 4.3mWait 11.5mWait 1.4yWait 1.6yWait 1.6yWait 1.7yUltra-fast (99%/yr)Launch asapWait 3.9mWait 7.5mWait 9.9mWait 11.1mWait 11.6mWait 11.7m

Appendix D: Details for the “diminishing marginal utility” model

To model risk aversion over (time-discounted quality-adjusted) lifespan, we employ two standard (one-parameter) utility functions from decision theory: Constant Relative Risk Aversion (CRRA) and Constant Absolute Risk Aversion (CARA).

1. Power Utility (CRRA)

The CRRA utility function—the one used in the main text—is defined as:

u(x)=x1−γ1−γ

where x represents the total discounted quality-adjusted life years (QALYs) and γ is the coefficient of relative risk aversion.

2.  Exponential Utility (CARA)

The CARA utility function family takes the form:

u(x)=1−e−kxk

3. Computation

For either functional form, we maximize the expected utility:

E[u(X(t))]=Pdoom(t)⋅u(xfail)+(1−Pdoom(t))⋅u(xsuccess)

 

where:

xfail=q0m0+ρ(1−e−(m0+ρ)t)xsuccess=xfail+e−(m0+ρ)tq1m1+ρ

4. Empirics

Direct estimates of utility for life duration in health‑economics/decision‑science settings have fit both power and exponential specifications.  Exponential utility functions (CARA) for life duration have been directly estimated, but power utilities (CRRA) typically fit better.[23]  We therefore treat power functions as the main specification, and include exponential function as a robustness check.

For u(t)∝tα, estimates typically find α≈0.65−0.85.  From this derive γ=1−α:

  • Low: γ=0.15 (corresponding to α=0.85)
  • Medium: γ=0.26 (corresponding to α=0.74)
  • High: γ=0.35 (corresponding to α=0.65)

Because CARA exhibits constant absolute risk aversion, its relative risk aversion (R(x)=kx) scales with the value of the outcome.  To match the empirical literature and make a fair comparison, we calibrate k such that the local relative risk aversion matches the CRRA medium case (γ=0.26) at the scale of the post-AGI “prize” (in discounted QALYs):

Xref≈q1m1+ρ≈65.1

This yields k≈γ/Xref≈0.004.

5.  Illustrations

Tables D1–D3 illustrate optimal launch times for the CRRA model, for the low, medium, and high value of γ, respectively.  (Other parameters are the same as in Appendix C.)

TABLE D1: Diminishing marginal utility (CRRA, low rate)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapWait 1.7mWait 122.4yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapLaunch asapWait 11.8mWait 30.0yWait 45.5yWait 49.3yModerate (10%/yr)Launch asapLaunch asapWait 1.8mWait 7.7yWait 11.9yWait 13.5yWait 13.9yBrisk (50%/yr)Launch asapWait 2.0mWait 2.1yWait 3.4yWait 4.1yWait 4.3yWait 4.4yVery fast (90%/yr)Launch asapWait 6.5mWait 1.1yWait 1.5yWait 1.7yWait 1.8yWait 1.8yUltra-fast (99%/yr)Wait 25.8dWait 5.0mWait 8.6mWait 11.0mWait 1.0yWait 1.1yWait 1.1y

TABLE D2: Diminishing marginal utility (CRRA, medium rate—same as in main text)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapWait 3.1dWait 1.9yWait 122.6yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapWait 4.2dWait 4.4yWait 31.7yWait 46.3yWait 50.1yModerate (10%/yr)Launch asapLaunch asapWait 1.1yWait 8.4yWait 12.5yWait 14.1yWait 14.4yBrisk (50%/yr)Launch asapWait 4.4mWait 2.3yWait 3.6yWait 4.2yWait 4.5yWait 4.5yVery fast (90%/yr)Launch asapWait 7.2mWait 1.2yWait 1.6yWait 1.8yWait 1.9yWait 1.9yUltra-fast (99%/yr)Wait 1.2mWait 5.4mWait 9.0mWait 11.3mWait 1.0yWait 1.1yWait 1.1y

TABLE D3: Diminishing marginal utility (CRRA, high rate)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapWait 1.0mWait 4.4yWait 122.7yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapWait 1.3mWait 7.1yWait 33.0yWait 47.0yWait 50.6yModerate (10%/yr)Launch asapLaunch asapWait 1.9yWait 9.0yWait 13.0yWait 14.5yWait 14.9yBrisk (50%/yr)Launch asapWait 6.4mWait 2.4yWait 3.7yWait 4.4yWait 4.6yWait 4.7yVery fast (90%/yr)Wait 1.7dWait 7.8mWait 1.2yWait 1.6yWait 1.8yWait 1.9yWait 1.9yUltra-fast (99%/yr)Wait 1.5mWait 5.7mWait 9.2mWait 11.6mWait 1.1yWait 1.1yWait 1.1y

Finally, Table D4 shows the corresponding medium case for the CARA utility function.

TABLE D4: Diminishing marginal utility (CARA, medium rate)

Safety Progress1%5%20%50%80%95%99%No progress (0%/yr)Launch asapLaunch asapLaunch asapLaunch asapNeverNeverNeverGlacial (0.1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 122.3yWait 294.0yWait 335.2yVery slow (1%/yr)Launch asapLaunch asapLaunch asapLaunch asapWait 28.8yWait 44.9yWait 48.8yModerate (10%/yr)Launch asapLaunch asapLaunch asapWait 7.4yWait 11.7yWait 13.3yWait 13.7yBrisk (50%/yr)Launch asapWait 1.2mWait 2.1yWait 3.4yWait 4.1yWait 4.3yWait 4.4yVery fast (90%/yr)Launch asapWait 6.3mWait 1.1yWait 1.5yWait 1.7yWait 1.8yWait 1.8yUltra-fast (99%/yr)Wait 23.3dWait 5.0mWait 8.6mWait 10.9mWait 1.0yWait 1.1yWait 1.1y

6.  Comparison between CRRA and CARA

Both functional forms of diminishing marginal utility / risk-aversion in time-discounted QALYs delay launch relative to the risk-neutral case (γ=0).  Calibrated to the same reference scale and fit to the empirical literature, they give broadly similar timelines for the examined range of scenarios.  However, because the relative risk aversion of CARA (kx) rises with scale, CARA can be significantly more conservative than CRRA in high-value regions (with low temporal discount factor and large quality differential).  Figure 5 shows the difference surface between the two functions.

Figure 5: Difference between CARA and CRRA (for the medium rate case)

Appendix E: Details for the “changing rates of progress” model

The basic ingredients are the same as in Appendices A–D: a pre‑AGI mortality hazard m0, a post‑AGI hazard m1, a pure time‑discount rate ρ, quality weights q0 and q1 for life before and after AGI, and CRRA utility over discounted QALYs with curvature parameter γ.

We distinguish two timing variables.  Let Tagi be the time from now until full AGI first becomes technically deployable (Phase 1), and let Tpause≥0 be any additional deliberate delay between that point and large‑scale deployment (Phase 2).  AGI is launched at time

TL=Tagi+Tpause

Let p0 be the catastrophe probability if AGI were launched immediately.  Safety work reduces this risk over time: over any sub‑interval k in which the annual fractional reduction in risk is gk, we define the corresponding continuous decay rate:

rk=−ln(1−gk)

If by time t we have spent Δtk(t) years in sub‑interval k (capped at the maximum length of that sub‑interval), the cumulative risk reduction is:

R(t)=∑krk,Δtk(t))

The catastrophe probability at launch time t is then:

p(t)=p0exp(−R(t))

Phase 1 runs from time 0 to Tagi with some baseline rate of safety progress.  Once AGI‑ready systems are available, we model a “safety windfall” by splitting Phase 2 into four subphases with front‑loaded gains and diminishing returns: very rapid progress (2a), fast progress (2b), slower progress (2c), and an indefinitely long tail of very slow progress (2d).  In each scenario, the first five columns (“Phase 1”, “2a”, “2b”, “2c”, “2d”) of the table specify the duration and annual fractional improvement rate gk used for these subphases.

For a given launch time T, let xsucc(TL) denote the total discounted QALYs if AGI is successfully aligned at TL, and let xfail(TL) denote the total discounted QALYs if launch at TL causes catastrophe so that only pre‑AGI life contributes.

With constant pre‑AGI hazard m0, post‑AGI hazard m1, and pure time discount rate ρ, the pre‑AGI part is:

xfail(TL)=∫TL0q0,e−(m0+ρ)t,dt=q0m0+ρ(1−e−(m0+ρ)TL)

If launch succeeds at T, the post‑AGI contribution is:

∫∞TLq1,e−(m0+ρ)TLe−(m1+ρ)(t−TL),dt=q1m1+ρe−(m0+ρ)TL

so

xsucc(TL)=xfail(TL)+q1m1+ρe−(m0+ρ)TL

As in Appendix D, we use CRRA utility over discounted QALYs:

u(x)=x1−γ1−γ

The expected utility from launching at T is:

EU(TL)=(1−p(TL)),u(xsucc(TL))+p(TL),u(xfail(TL))

In the multiphase timing table we treat Tagi as fixed by the scenario (0, 5, 10, or 20 years until AGI availability).  For each choice of initial catastrophe probability p0 and each specification of baseline safety progress, we choose the pause length Tpause≥0 that maximizes

EU(TL)=EU(TAGI+Tpause)

The optimal Tpause is what is reported in Table 8.

Table 9 reports results when the decision-maker can also accelerate Phase 1.  We allow Tagi to be shortened by up to its full default duration (so that AGI could in principle become available immediately), while Tpause remains non-negative.  The optimization problem becomes:

maxTAGI∈[0,TdefaultAGI],Tpause≥0EU(TAGI+Tpause)

where safety progress during any acceleration of Phase 1 accrues at the Phase 1 rate, and the Phase 2 subphase structure (2a–2d) begins once AGI-capability is attained.

Appendix F: Details for the “shifting mortality rates” model

This extends the multiphase model of Appendix E by allowing the pre-AGI mortality hazard to change upon entering Phase 2.  Let m0 denote the mortality hazard during Phase 1, and let m′0 denote the hazard during Phase 2 (prior to launch).  The discounted QALYs accumulated before launch become:

xfail(TL)=q0m0+ρ(1−e−(m0+ρ)TAGI)+q0m′0+ρe−(m0+ρ)TAGI(1−e−(m′0+ρ)Tpause)

The post-AGI contribution and catastrophe probability remain as in Appendix E.

Appendix G: Details for the “safety testing” model

We keep the background assumptions from Appendices E–F (mortality hazards, discounting, CRRA utility over discounted QALYs, and the four post‑AGI subphases 2a–2d).  At the moment AGI‑capable systems first exist (start of Phase 2), the true catastrophe probability at that instant is unknown. It is known only that it equals one of seven discrete “type”, p0∈{0.1,0.3,0.5,0.7,0.9,0.95,0.99}, with a uniform prior over these seven possibilities.

From that point onward, conditional on each type, the catastrophe probability at time t after AGI availability follows the same multiphase risk‑reduction schedule as in Appendix E.  For each type i this yields a deterministic risk path pi(t) with

pi(t)=pinitiexp(−R(t))

where R(t) is the cumulative integrated rate implied by the phase‑specific annual fractional reductions.

Starting from AGI availability, we perform a new test whenever cumulative risk reduction since the previous test reaches another 20 % factor.  If the instantaneous risk at the time of a test is r, the test output is:

  • "fail" with probability r
  • "pass" with probability 1−r

Let πi be the current posterior probability that the system is of type i and let ri be the corresponding instantaneous risk pi(t) at the test time.  After observing an outcome, we update by Bayes’ rule.  For a pass,

π′i=πi(1−ri)Z−1

and for a fail,

π′i=πiriZ−1

where Z is the normalisation constant that makes the posteriors sum to one.

Between tests, the posterior over types remains fixed, while each type’s risk level declines deterministically according to the multiphase schedule.

We treat the problem from the start of Phase 2 as a finite‑horizon POMDP.  The state has two components:

  • Time within the multiphase schedule (which determines the phase and thus the risk‑reduction rate)
  • Belief state π over the seven risk types

At each decision time (on a grid of size Δt=0.05years in the numerical implementation), the agent chooses between:

  • Launch now: terminate the process and receive utility U(t|π)=∑iπiU(t|pi(t)), where U(t|p) is the discounted‑QALY objective from Appendices A–D for a launch at time t with catastrophe probability p.
  • Wait: advance time by Δt (with deterministic change in the phase and risk levels) and, if a test is due, absorb the pass/fail signal and update the belief state by Bayes’ rule as above.

We solve this POMDP numerically by backward induction over the discrete time grid, using the underlying survival‑and‑QALY value function from the earlier timing models for the “launch” payoff.  The result is an approximately Bayes‑optimal stationary policy mapping each time-belief pair to “launch” or “wait”.

For comparison, we also compute the best fixed‑pause policy with no testing.  In that case, the agent chooses a single pause length τ after AGI availability, launches at τ in all worlds, and optimizes expected utility under the uniform prior over the seven types, exactly as in the multiphase model without testing.

  1. ^

    For comments, I’m grateful to Owen Cotton-Barratt, Max Dalton, Tom Davidson, Lukas Finnveden, Rose Hadshar, Fin Moorehouse, Toby Ord, Anders Sandberg, Mia Taylor, and Lizka Vaintrob.

  2. Yudkowsky & Soares (2025a).  The authors propose a treaty of unlimited duration.  Yet they seem to be in favor of eventually building superintelligence, after some presumably very long delay.  They suggest the creation of a crack team of genetically engineered supergeniuses to help the planet safely navigate the transition (2025b). ↩︎

  3. In the U.S., average survival time after an initial death sentence is about 22 years, and only 16% of death sentences are eventually carried out (Snell, T., 2021; Baumgartner et al., 2017). ↩︎

  4. Cf. Freitas (1999), Bostrom (2014), and Amodei (2024). ↩︎

  5. Sandberg, A. & Bostrom, N. (2008) ↩︎

  6. E.g. Hunt, T & Yampolskiy, R. (2023) and Russell, S. (2024) ↩︎

  7. There may of course not be a specific moment at which “superintelligence is launched”, but rather a more continuous and distributed process of incremental advances, deployments, and integration into the economy.  But the structural considerations we point to in this paper can be seen more clearly if we consider a simplified model with a discrete launch event, and they should carry over to cases with more complicated deployment processes. ↩︎

  8. Cf. Bostrom (2024) ↩︎

  9. Previous work has mostly looked at the tradeoffs from the impersonal perspective.  For example, Bostrom (2003) shows that even very long delays in technological development can theoretically be impersonally optimal if they lower existential risk.  Hall & Jones (2007) point out that as societies get richer, the marginal utility of consumption falls rapidly while the value of additional life-years remains high, leading them to spend a larger fraction of GDP on life-extension (e.g. health spending).  Jones (2016) argues that this “safety as a luxury good” mechanism can—depending on utility curvature—make it optimal to restrain economic growth or redirect innovation toward life-saving and safety.  Aschenbrenner (2020) applies the mechanism to existential risk in a directed-technical-change model, suggesting that we are in a “time of perils” (advanced enough to build doomsday technologies but not yet rich enough to spend heavily on mitigation) and arguing that faster growth can shorten this phase and increase long-run survival even if it raises near-term risk.  Binder (2021) presents a minimalist timing model trading accumulated background (“state”) risk against one-off superintelligence (“transition”) risk, with the optimum when the proportional rate of safety improvement equals the background hazard.  Jones (2024) then studies a utilitarian planner choosing how long to run growth-boosting AI that carries a constant annual extinction risk; optimal run time is highly sensitive to diminishing marginal utility, and allowing AI-driven mortality reductions greatly increases tolerable cumulative risk.  Houlden (2024) summarizes Jones and explores extensions adding non-AI growth and safety progress from pausing/investment. ↩︎

  10. Global life expectancy at birth is roughly 73 years and the median age of the global population is a little over 30 years (United Nations, 2024): we round the difference to 40, for simplicity.  In a later section we explore scenarios in which remaining life expectancy increases even without advanced AI. ↩︎

  11. In developed countries, the annual mortality rate for healthy individuals aged 20–25 is approximately 0.05–0.08% per year, with most deaths in this age group attributable to external causes.  If mortality were held constant at this rate throughout life, expected remaining lifespan would be approximately 1/0.0007 ≈ 1,400 years.  See, e.g., Arias et al. (2024) for U.S. actuarial life tables; similar figures obtain in other developed countries. ↩︎

  12. Sandberg & Bostrom (2008), Moravec (1988) ↩︎

  13. Again, we’re restricting the discussion to mundane facts and considerations.  (Otherwise the expected remaining lifespan may be infinite both with and without AGI.) ↩︎

  14. These underlying factors in narrow prudential optimality need not be consistently reflected in stated preferences.  One reason is that different groups may have different empirical beliefs, such as concerning how risky AGI is or how good post-AGI life would become.  They may also have different beliefs about non-mundane considerations that are outside the scope of this investigation.  Furthermore, people may care about other individuals—e.g. an old person with a beloved grandchild may prefer a less aggressive timeline than if they were concerned exclusively with their own prospects.  People may also have non-person-affecting preferences, such as over possible future generations.  Some might have idiosyncratic reasons (e.g. glory, profit, influence) for accelerating the creation of AGI.  And people may of course also misunderstand what their rational interests are, or shade the expression of their preferences in light of social desirability norms. ↩︎

  15. Parfit (1997) ↩︎

  16. Cf. Sanderson & Scherbov (2005); Wrigley-Field & Feehan (2022) ↩︎

  17. This can be compared to what is known in the bioethics literature as the “fair innings” view; see e.g. Harris (1985) and Williams (1997).  But the latter is often focused on a comparative fairness intuition—that somebody who fails to attain the normal lifespan has been “cheated” of their fair share of years, and that individuals who would not otherwise attain this normal lifespan should therefore get priority in the allocation of healthcare resources.  That view would presumably entail that if it became normal for humans to live to 500, then the fair innings would increase correspondingly.  By contrast, what I call the “full cup” stance alleges that there is much less value to a person of an extra life year once they have lived for about seventy years. ↩︎

  18. Tsevat, J. et al. 1998 ↩︎

  19. Tvesat, Dawson, Wu, et al. (1998) ↩︎

  20. One mechanism that could theoretically produce indefinite extension is hyperbolic discounting.  People often discount imminent consequences far more steeply than distant ones.  Consider someone who resolves to swim but balks at entering the cold water; fitting exponential discounting to this behavior would require a rate on the order of 50% per minute—absurdly high for other contexts.  Applied to AGI: when the launch date arrives, we think “not today”, and repeat this reasoning each time the rescheduled date comes around.  A structurally similar dynamic can arise even without individual time-inconsistency.  If those with influence over deployment are always drawn from a demographic—e.g. neither very old nor very sick—that prudentially favors waiting decades, then when that future arrives, a new cohort may have taken their place with its own reasons for delay.  While competitive pressures among multiple actors would probably prevent such indefinite procrastination, the dynamic becomes more concerning in scenarios involving a single dominant actor or a coordinated international body with broad discretion over timing. ↩︎

  21. In a binary scenario—more generally, we could take Pdoom to be the expected fraction of the human population that dies when superintelligence is launched. ↩︎

  22. See, e.g., Grace (2022), Christiano (2023a, 2023b), and Greenblatt (2025). ↩︎

  23. Bleichrodt & Pinto (2005) estimate concave power and exponential forms for utility of life duration across health states.  Abellán‑Perpiñán et al. (2006) find that a power model predicts best overall. ↩︎



Discuss

How do we (more) safely defer to AIs?

12 февраля, 2026 - 19:55
Published on February 12, 2026 4:55 PM GMT

As AI systems get more capable, it becomes increasingly uncompetitive and infeasible to avoid deferring to AIs on increasingly many decisions. Further, once systems are sufficiently capable, control becomes infeasible. [1] Thus, one of the main strategies for handling AI risk is fully (or almost fully) deferring to AIs on managing these risks. Broadly speaking, when I say "deferring to AIs" [2] I mean having these AIs do virtually all of the work to develop more capable and aligned successor AIs, managing exogenous risks, and making strategic decisions. [3] If we plan to defer to AIs, I think it's safest to do so only a bit above the minimum level of qualitative capability/intelligence required to automate safety research, implementation, and strategy. [4]

For deference to go well, we both need it to be the case that the AIs we defer to aren't scheming against us and that they are sufficiently aligned and effective on key tasks (aligning the next generation of AIs, buying more time to work on alignment, making good strategic decisions). Being sufficiently aligned and effective on key tasks requires that AIs are wise (e.g. have good epistemics even in domains with poor or non-existent feedback loops) and are aligned and competent even on hard to check, messy, and conceptually confusing work (e.g. very "philosophically loaded" work).

In this post, I'll discuss (at a relatively high level) how to make deference more likely to go well. I'll focus mostly on the situation where we need to do this deference in a rush (e.g., in the context of a scenario like AI 2027 or a scenario with somewhat higher levels of political will, but still not that much time). Due to this, I'll focus mostly on relatively prosaic strategies. If we're highly rushed and can't buy additional time to increase safety, then I think fully deferring to AIs (which are only a bit above the minimum level of capability) in a rush is likely our best available strategy, though I won't argue for this here. [5] It's worth noting that deferring to AIs in a rush poses huge risks [6] and buying more time for safety work (including work on making deference more likely to go well) would be highly preferable [7] . While making deference go well in a rush seems difficult, we can (try to) use huge amounts of (controlled/supervised) AI labor to help.

Even though a large fraction of the difficulty in making deference go well is avoiding scheming [8] , in this post we'll (mostly) assume that scheming is handled separately and the initial AIs we defer to aren't schemers. [9] More strongly, we'll also assume that we separately handle other issues that could make behavioral testing very misleading (e.g. data poisoning, human adversaries trying to hijack/jailbreak the AI, the AI behaving differently while in testing because it is saliently aware of being in a test).

That being said, there is a spectrum between treating AIs as maximally untrusted (as in, fully applying control-style analysis where we assume AIs are worst-case schemers) and full deference. Full deference is the case where (1) we don't have control (if the AI was egregiously misaligned or otherwise systematically trying to acquire power for its own misaligned ends we'd be in trouble) and (2) where we're using and trusting the AI for everything, but these properties can both vary continuously. We could have partial deference where, for instance, we have a weak version of control which provides a small amount of assurance against some non-worst-case types of egregious misalignment and we defer on some hard to check domains while humans still do the relevant work in other domains. We can also consider deference on a task-by-task basis where we effectively defer (fully) on some subset of tasks [10] , but still ensure control or avoid fully trusting AIs on other tasks; this will probably be important well before full deference. I'll focus on full deference or nearly full deference, but much of what I say is applicable to partial deference. [11]

In this post, I discuss:

  1. The high-level objectives for safely deferring to AIs. This will motivate the sections that follow.
  2. Our high-level strategy for deference.
  3. What kind of capability profile we're targeting concretely.
  4. What alignment profile we're targeting concretely.
  5. How we would test for capabilities and alignment.
  6. How we might improve the capability profile.
  7. How we might improve the level of alignment.

[Unfortunately, I think this post doesn't do a great job of giving a clear picture of exactly what is needed to defer to AIs nor does it give enough examples. I decided to post it anyway as it has been sitting around as a draft for a long time.]

High-level objectives What does successful deference look like?

Our ultimate goal is to have the AIs we defer to resolve risks associated with powerful AI systems (and potentially other serious risks that happen to occur at this time) while preserving option value and keeping humans in control of longer-run values-loaded decision making (e.g., how Earth should be governed in the longer run, what should happen with the cosmic endowment, etc.). Correspondingly, our aim would be that the AIs we defer to (and their successors) don't seize power or (unnecessarily) kill humans, for these AIs to effectively work on managing these risks, and for these AIs to follow some model spec which involves the AI remaining corrigible to some group of humans or some structure which is ultimately run by humans (e.g. some component of the US government). [12]

Recursive self-improvement of alignment and wisdom: the hope for a Basin of Good Deference

A key hope is that the initial AIs we defer to will work on making further deference more likely to go well by (e.g.) improving their own alignment and wisdom. Thus, we don't need to ensure that the initial AIs we defer to are perfectly aligned, perfectly wise, etc.; some bootstrapping is possible as long as the initialization is sufficiently good. (I'm imagining that we attempt to have the AIs we defer to remain corrigible to some group of humans or some human process and this corrigibility property would get propagated and improved through AI generations, including refining and clarifying edge cases of corrigibility in a reasonable way as needed.)

This is similar to the idea of a basin of corrigibility: if AIs were sufficiently corrigible, they would help you improve their corrigibility and thus the situation would be stable. However, we need this to apply somewhat beyond corrigibility; the AIs also need to be competent and wise enough that they manage the situation well (and e.g. avoid catastrophically bad choices) and this wisdom and competence itself needs to be furthered and improved with each AI generation (probably beyond what would have been the default from just hill climbing on capabilities).

If an initial AI we defer to is sufficiently aligned, wise, and competent at managing the situation that it would ensure that future AI systems it creates are more aligned, wise, and competent at managing the situation (as in, we avoid the AIs we defer to themselves taking over or causing a catastrophe and we avoid them building AIs which take over or cause a catastrophe), then we'll say these AIs are in a "Basin of Good Deference" (BGD). Keep in mind that the task of "make further deference go well" is an open-ended and conceptually-loaded task where we can't check the AI's output effectively, so aligning/eliciting the AI to actually do a good job on this task may be difficult. [13]

In practice, humans might be able to fill in for the initial AIs on some dimensions and AIs can always ask humans for advice or to do some task (though eventually this might be too slow/uncompetitive to be viable). This is just starting with partial deference. For instance, if AIs are safe to defer to on all dimensions except making good strategic choices, humans might be able to take on this task until later AIs can be deferred to on strategy. Once we're sufficiently in the BGD, we should transition to full deference (because human cognitive labor could quickly become very uncompetitive).

AIs we defer to need to both handle exogenous risk (e.g. risk from other AI projects) and avoid causing catastrophic outcomes themselves (they need to ensure that future AIs they build don't take over and don't cause catastrophes via some other route [14] ). This means that these AIs will need to ongoingly ensure alignment while still having capabilities advance sufficiently quickly that at each point the level of aligned capabilities suffices for suppressing exogenous risk. However, these AIs can buy more time to work on safety (alignment, wisdom, and other things which might be harder to advance than general capabilities/competitiveness) via various routes and this might be very important. Because there will be some amount of time pressure, advancing alignment arbitrarily slowly doesn't suffice: AIs will need to be sufficiently within the BGD that they can manage the situation given limited resources. [15]

It's unclear how exactly the BGD works, how easy it is to end up in this basin, and whether this is real. I feel pretty confident that something like this is real, but certainly the situation is unclear. If there isn't a BGD, then we'd only be able to defer to AIs on tasks other than furthering deference and we'd have to pursue some other end state other than stably continuing deference which tracks capabilities. A more plausible case for concern is that it's very hard to achieve the BGD.

Summary of objectives and what the AIs we defer to must do

We want it to be the case that after deference:

  • Our AIs remain sufficiently aligned
  • Exogenous risks are handled ongoingly

This requires that at each point after deference, our AIs:

  • Aren't seriously misaligned: They remain corrigible to the intended group/structure and don't try to seize power or kill humans.
  • Are effective and aligned at key tasks: The AIs do a good job—which includes being sufficiently aligned and sufficiently capable—of autonomously:
    • Advancing alignment/elicitation
    • Advancing key parts of the AI capability profile that might otherwise fall behind and are important for these key tasks going forward
    • Advancing general capabilities (though this is probably the default, so tracking this probably isn't that important)
    • Directly handling exogenous risks and not themselves causing catastrophes
    • Buying time for this work
    • Making good strategic choices: Prioritizing between the above key tasks and making good choices about when to advance capabilities.

I'll use the term "deference-goodness" to refer to differential improvements to AIs being effective and aligned at key tasks (as in, differential relative to just increasing general capabilities). [16] (I'm not including work which is just useful for reducing the chance that AIs are seriously misaligned in deference-goodness.)

For initial deference, we'd (all else equal) prefer more deference-goodness. [17] But, bootstrapping does mean that we might not need to do that good of a job. I'll discuss how good of a job we'll need to do with initial deference in a later section.

We can break down deference-goodness into an alignment/elicitation component and a capability profile component:

  • Broad alignment: the AIs need to be aligned and well elicited enough that they faithfully pursue our interests even on massive, open-ended, and conceptually-loaded tasks (with poor feedback loops) where we can't check the AI's output effectively (and which are far from tasks we could have checked for the purposes of training and testing). The AIs can't be seriously misaligned. [18]
  • Capability profile: In addition to general capabilities and the ability to automate AI R&D, we need sufficient ability on messy conceptual tasks, at anticipating safety issues, at making good strategic decisions in highly uncertain domains with poor feedback loops, and "wisdom"/epistemics more generally.

Later on, I'll discuss how we might improve these components.

Handling exogenous risk in time

The AIs we defer to will need to handle exogenous risks in time. This includes risks from external AIs becoming misaligned (which could be handled by e.g. publishing and promoting alignment methods), risks from foreign adversaries, terrorist misuse risks, the epistemic environment becoming catastrophically degraded due to AI, and general background risks from non-AI sources. Exogenous risks might occur ongoingly and we need to be able to handle them as they come up, though there might be a substantial delay between initial deference and the first (catastrophically-large) exogenous risk. [19] Thus, the AIs we defer to need to ensure they are sufficiently capable to handle exogenous risks when they come up (or otherwise avoid these risks being a problem). By default, this requires staying ahead of rival actors (either by slowing them down or by proceeding with capabilities progress) including capabilities, physical industry, and military, but there are also non-AI exogenous risks and active defense is required beyond just generally staying ahead (e.g. defense against bioweapons). Staying a bit ahead is not necessarily sufficient (e.g., it might be important to be very ahead due to offense-defense imbalances), but something like staying ahead is probably roughly necessary. Handling this situation requires the AI to also make sufficiently good strategic choices about when to advance capabilities and how to prioritize between alignment, capabilities, and handling exogenous risks. Notably, the existence of these risks might sometimes imply that overall risk is reduced by having capabilities progress exceed a level where we can remain confident in the AI's alignment (e.g., because it might be critical to get to some capability level where other actors can be slowed down within some short period of time).

Keeping up deference-goodness progress with capabilities (in the face of exogenous risk) could be hard because our initial AIs weren't sufficiently aligned/elicited or have an insufficient capability profile (e.g. poor research taste for conceptually-loaded alignment research) such that they are too slow to advance deference-goodness initially. It could be that the AIs we defer to need to improve their own deference-goodness before they can (safely) perform some other key task and this takes some time or that we have to use expensive measures to improve deference-goodness which would otherwise be inadequate (e.g. we have many different AIs check work because we don't trust individual systems to do a good job, but checking work in this way suffices); either of these delays would in effect be equivalent to our initial AIs being slower to perform work because we did a worse job at deference-goodness for the initial AIs. Note that doing a bad job at the initial deference can just directly lead to catastrophe rather than merely requiring more time: we probably have to actually do a pretty decent job with deference to be in the BGD, but just need some time for AIs to improve deference-goodness. Also, it might be difficult to evaluate for sufficient deference-goodness (while still being possible to improve deference-goodness), so the AIs we defer to won't themselves necessarily know how much effort to spend on improving their own deference-goodness or that of more capable successor AIs (though AIs would work on better evaluations of relevant aspects of handoff-goodness and determining what would suffice). Evaluation difficulties might result in very inefficient effort allocations for both humans and the AIs we defer to.

Another possibility—which isn't a problem we can solve by doing a better job with the initial AIs we defer to—is that deference-goodness is just fundamentally persistently hard to advance relative to capabilities such that the ongoing amount of effort required to advance deference-goodness is too high. In general, a key question about whether deference goes well is the amount of work beyond R&D on improving general purpose capabilities/competitiveness which is required to stay (comfortably) within the BGD. This is most straightforward to think about in the case of software-only AI progress on a fixed amount of compute: we can compare the amount of R&D effort (inference and experiment compute) needed to keep some level of capability within the BGD to the amount of R&D required to achieve that level of capability. We'll call the fraction of effort which needs to be spent on staying in the BGD the deference-goodness tax. Note that the deference-goodness tax could end up being negligible (if we found scalable solutions prior to deference) or much greater than 100% (e.g. 10x more effort has to be spent on staying in the BGD than capabilities). In practice, the level of tax will vary over the progression of capabilities and you'll be able to pay a lower tax in exchange for taking on more risk. I expect there eventually exist scalable strategies for staying in the BGD such that the deference-goodness tax eventually becomes small, but in practice the aggregate deference-goodness tax we need to pay might be substantial. Additionally, I expect work on improving deference-goodness has higher returns to cognitive labor (as compared to compute) relative to capabilities. This hopefully means that as AI progress advances and cognitive labor becomes more abundant relative to compute the deference-goodness tax decreases over time.

Ultimately, this deference-goodness tax bottoms out in terms of an amount of additional lead time (relative to exogenous risks) that AIs we defer to need in order to manage these risks without catastrophe. [20] One particularly salient and simple notion of aggregate deference-goodness tax is the deference-goodness tax required to safely remain in the BGD through a software-only singularity starting from the minimum level of capability where deference is possible (and supposing we succeed at the initial deference). We can consider the tax required up to the level of capability needed to easily stabilize the situation or the point where the software-only singularity fizzles out [21] , whichever is earlier. We can express this in terms of an absolute amount of time, e.g. 3 months of lead time are required. Because software-only singularities (including any sort of software-based recursive self-improvement) seem like the largest cause of very fast AI progress around the level of capability where deference is possible, this is a pretty natural notion of deference-goodness tax: the amount required to ride out the software-only singularity to its conclusion or a level of capability which suffices.

Because the AIs we defer to can buy more time (e.g. via coordinating with or sabotaging foreign adversaries and advocating for policy), even if we don't have enough lead time to pay the aggregate deference-goodness tax [22] at the time of handoff, we aren't necessarily in trouble. The amount of lead time we need to give AIs depends both on the tax for each further level of capability and the amount of time AIs at that level of capability can buy: AIs just need to be able to ongoingly buy more time to keep up with the tax.

The AIs we defer to can also proliferate strategies for improved deference-goodness which can effectively buy time by ensuring that other actors are operating more rationally.

Overall, my current view is that with a good initialization for deference, the AIs we defer to have a pretty high chance (>90%?) of successfully managing risks with only a small/moderate amount of lead time (e.g. 1 month).

In practice, we probably won't have a great understanding of how much lead time the AIs we defer to will need to safely handle the situation. This will make it difficult to make a great trade-off between how much lead time to spend making the initial deference go better and resolving other earlier risks versus how much lead time to give the AIs we defer to. We'll have to make a guess at the time based on the information available then. (In an ideal world, we'd have easily sufficient time at each phase and further capabilities would be gated behind something like safety cases, but this seems unlikely.)

Other end states we might be happy with would involve punting to some other process, potentially via building emulated minds or orchestrating a multi-decade pause where humans work on alignment research. For the purposes of this post, I'm most interested in what would be needed for deference which doesn't end up punting to some other non-AI process or a longer pause so that we can analyze what is ultimately needed and because these other end states seem likely to be harder to reach.

Deferring to weaker AIs (that are less likely to be schemers)

For deference to go well, our AIs need to not be schemers. Work directly focused on eliminating scheming is key, but deferring at a lower level of overall capability also helps reduce the chance that our AIs are schemers. Overall, my sense is that we probably want to initially defer to AIs which aren't qualitatively more capable than top human experts because qualitatively smarter AIs than this will make alignment more difficult and the speed of progress for AIs merely as qualitatively capable as top human experts is already quite high (for instance, such AIs might run >30x faster than humans and would have various structural advantages). These initial AIs we defer to work on improving the alignment of future systems and determining if we should build and defer to more capable AIs.

Beyond scheming, I expect that it's easier to make AIs which are less capable (but still at least comparably capable to human experts) more aligned in general (e.g. more likely to actually try and pursue our interests even on very hard to check tasks) as these less capable AIs are generally easier to oversee on a given task. Thus, I expect that at some level of capabilities, increasing capabilities further actually makes it harder to achieve a given level of deference-goodness even if we put aside scheming (though this is uncertain and depends on how generalization works). I expect that this level of capabilities (where further increases make deference-goodness harder) isn't much higher than the minimum level of capabilities needed for deference. This is a substantial part of why I think the best strategy is to defer only a bit above the minimum viable level of general purpose qualitative capabilities.

If we ended up finding that a substantially higher level of capabilities would actually improve deference-goodness (while still making scheming more likely and increasing other downsides discussed below) then there is a tradeoff here. [23] The optimal level of capabilities to defer at would depend on the comparison between the rate at which risks caused by insufficient deference-goodness decrease with capabilities and the rate at which other risks—especially risks from scheming—increase with capabilities. This is probably pretty hard to estimate well (both now and at the time when this decision needs to be made).

This discussion is generally complicated by not perfectly knowing what skills/capabilities are needed for handoff to be viable. For instance, are general purpose learning abilities at the level of top human experts required? I tentatively think no (because AIs can compensate by using more data and with other structural advantages), but this is pretty unclear. The particular question of what learning abilities are required is particularly important because general purpose learning abilities at the level of top human experts would naively yield very superhuman overall capabilities (at least in many/most domains) due to AIs being able to train on much more data as I'll discuss later.

Less capable AIs are also easier to control and have a harder time taking over (or generally acquiring power) which makes it more likely that we'll be able to recover from deferring to AIs that are scheming (though this isn't really a promising hope). It also means that we're more likely to be able to ensure AIs remain controlled (and minimally don't take over or cause other catastrophic outcomes) in the run up to deference and that we can work with these AIs for some period while initially ensuring they are controlled. It's also nice to ensure control during training (because even if we can ultimately avoid misalignment it might transiently appear during training) and at some level of capability this may be infeasible (even though ensuring control over training with minimal cost is probably much easier than ensuring control during deployment). Even if we have good methods for evaluating whether AIs are schemers (but not necessarily good methods for avoiding scheming) and we're planning to defer to the AI (conditional on not detecting scheming), it would still pose risks to train AIs which can't be controlled because irrational decisions by the developer might result in known schemers being deployed. In general, my sense is that given the realistic situation with AI developers, it will generally be increasingly risky to train AIs beyond the minimum level of capability needed for deference prior to deferring to AIs with relatively limited upside in most cases.

Only training less capable AIs also makes it less bad if the AI is stolen, misused (potentially by an insider), or if someone attempts a powergrab using the AI. That said, at the point when AIs are capable enough to defer to, these downsides are probably already very large and we really should already have robust defenses for these risks.

In general, I worry about biases towards continuing capabilities scaling (with safety as a possible excuse) and biases towards delaying handoff (and not seriously prioritizing handoff) while still increasing capabilities (because handoff intuitively sounds scarier than making increasingly powerful AI). (To be clear, I think slowing down capabilities progress so that we can spend more time doing a good job on handoff would be great, but I think proceeding with capabilities while procrastinating on figuring out handoff would be bad.) Given these biases and difficulties determining whether capabilities substantially above the minimum level of capabilities would be helpful, I'm currently in favor of pretty strongly defaulting to handing off closer to the minimum viable level of capabilities [24] even though there could be advantages to initially handing off to more capable AI.

To be able to defer to relatively weaker AIs, it's helpful to differentially accelerate capabilities that make an AI particularly good at key safety work (e.g. alignment research, generally work on deference goodness, decision making). In particular, this includes research skill in messy conceptual domains. We will talk about this in an upcoming section. We also want to avoid the AI being unnecessarily superhuman at anything which isn't that helpful, though narrow superhumanness isn't clearly that problematic (it might make oversight in training more difficult which in turn might make problematic propensities more likely).

Do increases in speed/productivity mean we should defer to more capable AIs?

One argument for deferring to AIs that are substantially above the minimum viable level of capability is that these smarter AIs will be able to work faster; perhaps even if AIs a bit above the minimum level are technically in the relevant basin, they might just work too slowly. I mostly dismiss this argument because I think AIs that are only a bit above the minimum level of capability for deference will (probably) already be very fast. Further, the initial AIs we defer to can themselves decide to defer to more capable AIs (and because we've deferred to them and they are faster than humans, they can spend a long time analyzing whether this is a good idea even if this analysis is really hard to verify for humans).

This does depend on AIs running substantially faster than humans (and being more numerous) but I expect this will be true in practice (if it isn't true, then ideally we'd optimize for it being true because faster and more numerous AIs at a lower level of qualitative capability would be safer).

More generally, I think deferring to AIs which are more capable to get things to go faster after deference isn't a good strategy because AIs we defer to will probably operate very fast relative to other pre-deference bottlenecks (most centrally, humans spending time understanding and verifying things and working on making deference go well via sufficiently aligning the initial AIs). As in, maybe if you defer to a more capable AI, it can do the relevant things 500x faster instead of 20x faster, but we already needed at least a few months for humans to figure out how to make deference safe so these few months are the bottleneck, not the part where the AI runs 20x faster (this AI we defer to can presumably make tons of progress in a short period of time with human verification/understanding bottlenecks removed, including progress on doing a good job deferring to a somewhat more capable AI and deciding whether deferring to a somewhat more capable AI would be best). (This "other bottlenecks are larger" argument is an application of Amdahl's Law.)

If our best/only strategy for getting to a sufficient level of deference-goodness involves doing a slow/inefficient bootstrap process with the initial AIs we defer to and this would go much faster with a smarter AI, then going with a smarter AI could be best, but I think this is a pretty specific and unlikely situation to end up in.

It could be the case that the optimal level of capabilities to accelerate safety work prior to handoff is substantially above the minimum level of capabilities required for handoff (and that we should still hand off around that minimum level of capabilities). This would be a good idea if we can (sufficiently safely) extract substantially more work out of these more capable systems prior to deference such that the risks associated with using these more capable systems are worth it. [25]

Human level deference-goodness is a reasonable target

It seems useful to know what level of broad alignment and what capability profile suffices to be sufficiently in the "Basin of Good Deference (BGD)" that deference goes as well as it could have gone. So what level suffices? We do know that scheming or otherwise egregiously misaligned AIs aren't in the BGD, but putting aside scheming and undesired power-seeking, what are the requirements? Unfortunately, the short answer is that we don't really know.

One reasonable target is that the AIs have (sufficiently elicited) capabilities, wisdom/epistemics, and judgment which are competitive with top human experts in safety research and generally making the situation go well. This might require particularly high capabilities in some more niche areas including the ability to think of important considerations, good strategic decision making in uncertain domains, alignment/safety research ability, and philosophy. These capabilities need to be sufficiently well elicited in the sense that the AIs actually apply themselves to try to make the situation go well using their abilities, at least roughly to the extent humans apply themselves. [26]

Additionally, AIs need to retain this level of alignment (including not becoming egregiously misaligned) and ability throughout doing huge amounts of work, both serial work (e.g. the equivalent of decades of work from humans) and vast amounts of parallel work. This means that the alignment we achieve must be robust to memetic drift and the AIs reflecting and learning more about their situation. Bootstrapping means AIs can themselves extend the amount of work they can do while remaining aligned, so we could (e.g.) start with AIs that can remain aligned and competitive with several years of human labor and then they further their own alignment so this bootstraps into decades. My guess is that most of the difficulty is in getting AIs which are aligned and well elicited enough to safely match top human experts over any reasonable duration (e.g. a month) rather than getting from some reasonable duration to AIs that safely match a decade. [27]

Given that current human experts currently disagree strongly about important aspects of the situation and what should be done, do epistemics at the level of top human experts suffice? My guess is that the answer is yes, though we may need to have the AIs ensemble together different epistemic strategies to diversify. Part of this is that I think many disagreements can be clarified by empirical evidence and we'll have enough labor to do a good job pursuing many strategies in parallel and handling many concerns. That said, we might be able to achieve superhuman epistemics because AI makes it easier to precisely study how well different epistemic strategies perform in different situations. I'll discuss the topic of sufficient epistemics and improving epistemics more later.

Regardless of what target is sufficient, in a situation where we're rushing to defer to AIs due to limited time, we probably won't be confident in hitting any specific target that could plausibly be sufficient. Thus, focusing on doing as well as possible on alignment-goodness might be more action guiding than thinking about any specific target. That said, if we don't hit a point where our best guess is that deferring to the AIs is better than alternatives (our indifference point), then we wouldn't do deference, so trying to evaluate deference well enough to understand whether we're past the indifference point is at least useful. (Note that our indifference point might eventually be extremely risky as we run out of time!) And being able to make a legible case for AI risk (including deference going poorly) if we aren't over the indifference point would be good (if we are over the indifference point then the AIs can make this case after we defer to them). (This is somewhat circular: our indifference point shifts based on our likelihood of persuading relevant actors to take different actions using a case for risk etc.)

Overall, it's important to keep in mind that a key aspect of how deference goes well is bootstrapping where AIs work on making further deference go well. This makes it more likely that a lower bar suffices and also means that when evaluating how well deference will go, we need to be looking at how well AIs would be able to improve the deference situation (while still avoiding catastrophic strategic errors like scaling up capabilities too fast relative to alignment and ending up deferring to a seriously misaligned AI).

A relatively prosaic strategy for deference

How can we improve alignment and the capability profile to make (initial) deference more likely to go well? I'll be discussing prosaic approaches which mostly involve empirical ML research. Here is my current favorite prosaic high-level strategy to get these properties:

  • Avoid issues which would result in our behavioral tests being highly misleading (consistent alignment faking (e.g. scheming), different behavior because testing is salient, data poisoning, and hijacks): We'd like to leverage behavioral testing, but for this to get very far we need to avoid adversarial optimization against our behavioral tests and other reasons our tests might be highly misleading. More centrally this includes AIs naturally ending up so misaligned that they consistently fake alignment (most centrally scheming), but we also need to handle adversarial humans who might tamper with training (data poisoning) or try to insert malign inputs after behavioral testing which hijack (a particular type of jailbreak) AIs such that the population of AIs ends up being misaligned. (Hijacks could arise naturally and spread if they are "infectious", resulting in a memetic disease threat model.) Another concern is that the fact that the AI is in behavioral testing is very salient to it (for the subset of behavioral tests that look very different from deployment), and (for reasons other than intentional alignment faking) this results in very different behavior than what we'd see in other cases. We also generally want to avoid misalignment which is so severe that it would result in AIs seizing power and/or killing humans as it is a requirement for deference to go well independent of the validity of behavioral tests. Misalignment failure modes (regardless of whether they result in consistent alignment faking) which might only emerge at large scale or after a long period of work also must be handled as they could be difficult to study very directly with behavioral tests.
  • Build behavioral tests for capabilities and alignment: We should be able to behaviorally test for these properties after ruling out specific blockers that could otherwise make this testing highly misleading. This isn't to say this behavioral testing will be easy: we'll need to generalize to a regime that we couldn't check or at least to domains we couldn't check, so at least part of this behavioral testing will have to establish this generalization. I'll discuss how we can do this later.
  • Iterate to ensure good performance on these behavioral tests (without overfitting): Now that we have reasonable tests, we can iterate to improve performance. We'll need to be careful to avoid overfitting so that the results of our tests are meaningful. We'll likely need somewhat principled techniques and some understanding of what is going on to actually get good performance despite needing to generalize to a somewhat different regime.

An assumption that this strategy is making is that mundane/natural distribution shifts that occur over time are unlikely on their own (without alignment faking, data poisoning, or hijacks) to result in our behavioral tests being highly misleading (at least if we do a good job with behavioral tests). That said, we will probably want to use behavioral tests which are very different from how the AI is deployed to assess some aspects of alignment or capabilities, and there will necessarily be a massive (and potentially salient) distribution shift between these tests and actual deference.

As discussed earlier, I'll assume in this post that we've resolved issues with our behavioral tests being highly misleading and focus on other difficulties.

I've been using the term "alignment" and I'll focus on approaches based on actually robustly aligning AI systems, but even when fully deferring some types of not-fully-worst-case misalignment problems could in principle be resolved by mixing control and alignment or possibly with (white-box) control-style approaches. For instance, our approach could involve some roughly human-level AI which is aligned controlling a superhuman AI or could involve using some sort of white-box monitoring to track when AIs are reasoning through deceptive thoughts and catch this. Approaches which are mostly based on control don't seem like they work for full deference as deference would involve AIs performing crucial and extremely hard to check tasks like generally figuring out the overall strategy for handling AI risks. Approaches which are only substantially relying on control also seem like they probably wouldn't be sufficiently robust in the case of deference unless aligned human-level-ish AIs are running the control scheme.

Even if we had very good and cheap behavioral tests (as in, these test beds are almost perfectly analogous, aren't themselves risky to run, and we can measure the misalignment problems almost perfectly in the test bed) [28] and we avoided issues that would cause our behavioral tests to be highly misleading, I still think achieving sufficiently good handoff-viability seems like it might be difficult. This is pretty different from problems downstream of scheming: I expect that scheming would probably be not too bad to solve if we had test beds for studying scheming which were this good. However, I expect that it will be much easier to build good test beds for handoff-viability if we solve issues causing behavioral testing to be highly misleading and I also expect that many difficulties in achieving handoff-viability (and likely some of the hardest of these difficulties) will be at least partially handled in the course of making AIs more useful as incentivized by normal commercial incentives as we discuss under "How much of the work needed for preparing for deference will happen due to myopic commercial incentives?".

One important complication is that we'll want to start iterating against our behavioral tests prior to being confident we've ruled out scheming. But, if AIs are scheming, then the results of iterating against our behavioral tests might not transfer to the regime where we've avoided scheming. Nonetheless, this iteration is probably worth doing in parallel because AIs might not be scheming (we're pretty likely to not be confident either way) and (some) results might transfer. Elicitation/capability results seem most likely to transfer, and it's plausible that ensuring AIs are sufficiently capable in domains like tricky conceptual research and having good epistemics is a high fraction of the difficulty. If AIs are clearly scheming, we should relatively deprioritize work on building and iterating on behavioral tests in favor of work on eliminating scheming (and possibly improved control). [29]

Objectives with the capability profile

Roughly speaking, our goal is to shore up aspects of the capability profile which might not be a priority for AI companies by default and which would otherwise bottleneck AIs working on important safety tasks or make deferring to AIs non-viable. AI companies will be incentivized to advance some capabilities by default and some capabilities must be present for us to be worried about AI at all. As noted above, some of the plausible bottleneck capabilities include performing well in messy conceptually loaded domains and having good enough epistemics. However, it's worth noting there may be other important capabilities which could bottleneck us (and it might be hard to predict these in advance).

More precisely, our goal is to bring forward in time the point when the capability profile allows for fully automating safety work relative to the point where various even more dangerous capability milestones are reached. (And, broadly speaking, we'd like to avoid accelerating the time at which dangerous capability milestones are reached, though some general acceleration might be unavoidable.) What's the benefit of accelerating the point where capabilities allow full automation of safety relevant work?

  • While unlikely, we might literally see full blown AI takeover prior to capabilities allowing for full automation of safety relevant work; it's not clear that takeover requires all the capabilities we need to fully automate safety relevant work. [30]
  • All else equal, we'd like to hand over to AI systems which are as weak as possible on capabilities that might make undetected egregious misalignment more likely (particularly general purpose opaque reasoning ability). If we're forced to wait for even more generally capable AIs to overcome some bottlenecking capability this might increase the chance of egregious misalignment substantially.
  • We'd like to give the AIs we hand off to as much time as possible prior to various risks arising. So being able to do this earlier is better. Also, having relevant capabilities be less bottlenecking would increase how fast this AI can solve problems.

Note that the AI doesn't necessarily need to be superhuman at every subtask/skill that will come up after deference. It just needs to be good enough that it can achieve substantial acceleration and it can compensate for weaknesses with its strengths.

While I expect that we'd want to shore up more narrow aspects of the capability profile (e.g. philosophy style skills and wisdom) [31] , it's plausible that it could make sense to generally improve usefulness and elicitation of AI systems to the extent this is distinct from underlying general purpose capabilities (which would make egregious misalignment more likely). However, to the extent this work worked well, it would clearly accelerate AI progress. At a minimum, this seems not very leveraged to work on under most circumstances, though it could make sense pretty deep into takeoff if there are no other good alternatives to resolving capability bottlenecks. In this post, I won't further discuss under what conditions it might make sense to work on generally improving usefulness.

It's also plausible we'll want to augment the AI's ability to specifically do the work we want to hand off. This could look like generally trying to improve the AI's ability to work in a given field without having a good model of what underlying capabilities/heuristics this might be providing (e.g. via RL in safety research environments or generally RL'ing on environments like our final applications). Or it could look like recognizing more narrow and niche capabilities which seem especially helpful (e.g., maybe you want your AI to be very good at proofs and certain subfields of mathematics for a formal AI safety agenda). Driving up some narrow useful capabilities to be extremely superhuman might accelerate work a bunch (without increasing risks as much) and thus relatively extreme augmentation of specific capabilities to get very inhuman capability profiles might be very useful. (In most cases, some other ability will quickly become the bottleneck due to Amdahl's law etc., but this doesn't mean that these speedups aren't important.)

One concern is that sufficient capabilities for deference will necessarily imply that the AI is (or easily could be) extremely superhuman on some important axes meaning that wild (and potentially extremely dangerous) things have already happened by the time deference is viable. This could happen due to intrinsic structural advantages of AI that don't make deference viable at much lower levels of capability, but do make very superhuman impact/ability possible at lower levels of capability. Or it could be that the natural skill profile of AIs (given the training methods which are competitive) results in this (and this is hard/uncompetitive to avoid with deliberate effort or at least it isn't avoided in practice). One version of this is that AIs which are capable enough to defer to can quickly (e.g. in less than a month and with limited compute) bootstrap to very superhuman general (or narrow) abilities.

One reason why sufficient capabilities for deference might be extremely powerful is if deference requires general purpose learning abilities competitive with top human experts (as in, highly sample efficient learning that is broadly applicable) and these learning abilities also imply that AIs could quickly become very superhuman by training on much more data/experience than humans can in their lifetime (either very superhuman in general or in all domains with substantial data). Because deference involves some domains with very limited data/information, human-competitive learning efficiency could plausibly be necessary for deference. (It probably wouldn't be needed for massively accelerating AI R&D. That said, it might be needed for fully automating AI R&D—as in, human labor isn't needed at all to maintain roughly the current rate of progress—this seems especially likely if we condition on it being needed for full deference.) One caveat is that if learning that is competitive with top human experts is very computationally expensive, then AIs might not be able to quickly become very superhuman (because training on enough data at this learning efficiency would be too expensive) but might still be able to (cheaply) automate tasks that require sample efficient learning by rarely applying this ability.

My current guess is that general purpose learning sample efficiency at the level of top human experts isn't needed for effective deference and it's better to defer to AIs prior to this level of sample efficiency. But, I'm not confident in this and it's one factor that makes deference look substantially less attractive as a proposal relative to other options.

If AIs capable enough for deference can very quickly become very superhuman (or by default would already be very superhuman) this makes safe deference harder and it would probably be much harder if there isn't buy-in to slow down and carefully control the capability profile of AIs (to make sure we do a good job with deference prior to hitting very superhuman levels of overall capability or more narrowly superhuman capability levels that suffice for takeover). Even if there is buy-in for slowing down when needed, timing this slow down correctly and getting the capability profile right might be non-trivial.

I don't particularly have a solution for the concern that AIs capable enough for deference would also be extremely powerful (or capable of very quickly/easily becoming extremely powerful). To the extent this is true, this would imply more caution would be needed to manage this transition and it might imply that it is much harder to avoid scheming from AIs capable enough for deference.

It's worth noting that some capabilities (especially more specific capabilities for generally accelerating the work) could be worked on by the AIs after we defer to them. As in, we defer to the AIs and then they work on accelerating their endeavors via improving their own capabilities. However, some capabilities (e.g. wisdom, conceptual/philosophical abilities, large scope strategic ability) will be needed for deference to go well at all, so running the AIs fully autonomously and letting them figure out whether and how to enhance wouldn't be a good strategy. In particular, augmenting various capabilities can increase risks, and thus it might be better for humans to figure out the high level details of how and whether this augmentation is done until AIs have the capabilities needed to beat humans at figuring this out themselves. To put this another way, the very task of (most) safely augmenting capabilities may require the very capabilities wanted to augment, so until AIs surpass humans at these capabilities (taking into account AI advantages in speed etc.), we want humans figuring this out and (full) deference isn't yet viable.

Objectives with broad alignment

What rough level of alignment do we need? A reasonable baseline target would be for the situation to be as acceptable as if we had a group of relatively virtuous safety researchers doing the relevant work. (Concretely, you can imagine a diverse group of emulated human minds.) That said, the probability that the handoff goes well will just continuously improve with better alignment.

Note that for the purposes of this discussion, I'm focusing on how an AI which is being used in a relatively trusted context to do safety-related labor should be aligned—there might be other important considerations due to the AI being used by potentially hostile or power-seeking actors for some deployments.

Overall, I'm confused about exactly what our goals should be for broad alignment, but I'll say some best guesses and some considerations in this section.

Here's a more detailed guess at what our alignment target should be like:

  • The AIs are corrigible to human modification and shutdown through the official process/channel [32] and don't aim to obstruct this in any way. The AIs don't attempt to seize power for their own ends.
  • The AIs (generally speaking) try to: obey instructions, keep humans alive, follow some more specific model spec or guidelines where applicable (including various deontological norms), and effectively acquire/retain flexible influence and resources to hand back to the humans (including solving safety problems as needed to achieve this).
    • We'd likely implement this by training an instruction-following AI with some spec and then tasking/instructing the AI with accomplishing a broad mandate, like figuring out how we should handle the situation with AI takeoff given our preferences. The AI we give this broad mandate to could instruct copies of itself or other AIs to perform subtasks (both subtasks which are still very large and narrower subtasks). Alternatively, it might be possible to have humans decompose things into subtasks (this would be partial deference but it would be close to full deference), though these subtasks would still have to be very large in scope for deference to work similarly well.
    • This broad mandate will instrumentally involve solving a bunch of safety problems.
    • We'd allow the AI to consult with humans as applicable or make it consult with humans in some specific cases.
    • This requires that the AI society has good epistemics, tries to retain good epistemics, and is generally sufficiently good at truth-seeking in isolation that the AI society can figure out really hard questions. Correspondingly, we'll need to handle the possibility of the AIs becoming deluded, ending up going crazy, forming problematic echo chambers that prevent them from changing their minds, etc. Problems with epistemics could be due to either capabilities issues or alignment issues (or a mix of both).
  • These properties apply to a long-running ~fully autonomous project at massive scale and speed. (Both that these properties apply sufficiently to the overall AI collective and that these properties persist over the duration of the project.) E.g., thousands to hundreds of thousands of superhumanly coordinated parallel AI instances each running at perhaps 50x effective speed for months (perhaps a decade or two of human equivalent time). As the AIs we defer to build other AIs that they defer to, these numbers could grow massively though making sure this is safe would be the job of the AIs we defer to. The massive scale and long duration might make difficulties in ensuring reasonable epistemics and not degenerating into problematic attractors harder.
  • The AI thinks about decision theory in a reasonable way. (Though I'm currently not sure exactly what this looks like.)

Keep in mind that this setup means we're directly instructing (or possibly training) the AIs to pursue big open-ended projects such as "figure out what we should do to retain human control of the future and implement that". Correspondingly, within-task myopia doesn't particularly help for the main task, though using a bunch of more myopic AIs throughout the overall system could be helpful. Additionally, it's plausible that a better approach is to build a huge system out of more myopic AIs such that the system overall is sufficiently able to pursue big open-ended projects (with effectively no human intervention). This approach could mitigate some misalignment concerns but might make the AIs much less efficient and effective and could cause (non-adversarial) misalignment issues of the sort that bureaucratic human organizations with myopic employees have.

One open question that is out of scope for this post is how the AIs we defer to should operate with respect to the law and crimes. (Things that would be fine for a normal company to do even if they are technically illegal are obviously fine; the question is how the AI should relate to more thorny situations which might arise during takeoff.)

It's also worth noting that ruling out egregious misalignment and broad alignment might be a spectrum. For instance, we might be worried about the alignment of AIs drifting over time, AIs rarely deciding to seek power on reflection, or AIs generally going crazy (humans might do the same in a similar situation). We'd like to both make alignment as stable and persistent as possible (that is, minimal alignment drift) and make it so that alignment drift would be handled if it did occur (e.g. by having a relatively robust society of AIs which watch each other and ensuring that we avoid types of alignment drift which are highly correlated or sufficiently infectious). I won't discuss how to avoid alignment drift (or memetic diseases) which result in scheming or unintended power-seeking in this post, though this must be handled as a special case of handling egregious misalignment.

It's important that the AIs endorse their preferences on reflection. That is, if an AI was given a very good understanding of the overall situation and a bunch of time to reflect on itself and the situation (including, e.g., the ability to run experiments as needed to address uncertainties), it would still endorse and retain its core properties and preferences. Endorsing some types of self-modification and/or changes in the training process for new AIs can be acceptable and is in fact desired (if this self-modification is done carefully and wisely) as we want AIs to improve their level of alignment and capability profile etc.

Correspondingly, if AIs are imperfectly aligned initially, we might still hope the situation can be acceptable if AIs are in a basin of alignment (and basin of corrigibility) where they will research how to align themselves and their successors. This type of consideration might mean that we need not that strong a level of alignment initially, though something much less aligned than a group of relatively virtuous safety researchers would presumably still cause us issues. And, we need sufficient alignment so that the AIs make good decisions about how to further improve their alignment. (So e.g., if they aren't really trying to figure this out as they are more myopically pursuing a mix of different objectives we're in trouble. We'd also obviously be in trouble if the AIs are scheming.)

One strong type of alignment is the AI using all of its abilities and affordances (that it can use) toward achieving the intended aims, including abilities which are very far from things we could directly supervise during training, e.g., using its introspective understanding of its own preferences. This is probably not fully necessary, but might help a bunch. Minimally, we do need it to be the case that abilities and affordances which aren't being applied in an aligned way are at least not being adversarially optimized against us as a special case of avoiding egregious misalignment. This notion depends on the ideas of "trying" or "using" which aren't necessarily crisp ideas (e.g., is a struggling drug addict who is currently injecting well described as trying not to use drugs?). It's notable that the conscious part of a human doesn't have full control over relevant elements (like biases, motivation, and focus), so it intuitively seems like this strong type of alignment isn't needed.

It's worth noting that generally improving our ability to succeed at broad alignment (rather than specifically ruling out egregious misalignment) might generally improve the usefulness of AIs because a current (and potentially future) bottleneck to usefulness is alignment. This might make some types of work on this now less appealing and minimally means that substantial aspects of broad alignment might be solved by default for commercial reasons.

Some of the (non-egregious-misalignment) alignment problems that can show up are basically issues with elicitation—that is, getting the AI to effectively apply its capabilities to accomplishing some task. (Or at least there will be a continuum between centrally alignment problems and elicitation in some cases.) Also, the distinction between elicitation and capabilities can sometimes be unclear (was the AI failing to apply its capabilities or did it just not know how to effectively do the task?). So overall, alignment, elicitation, and capabilities might often have unclear boundaries. I could fold elicitation-style issues into the capabilities profile, but I don't because elicitation-style issues could in principle be solved with sufficiently strong alignment as I discuss above (such that the AI always "tries" and has control over its abilities). In practice, part of our approach will likely involve applying methods which don't care about the exact boundary between alignment, capabilities, and elicitation and can help even if these categories are confused and irrelevant.

In particular, we'd expect that by default AI companies will put substantial effort into "making AIs basically do a good job of following instructions and following the model spec in relatively short-horizon applications that are quite doable to verify". I expect these efforts by AI companies to likely succeed.

(Or, if they don't succeed then at least the diminished usefulness of AI will push back the time when we have highly dangerous AI. We could end up with highly dangerous and highly capable AI despite prior powerful AIs consistently failing to be broadly aligned such that straightforward usefulness is greatly reduced. In this case, for a handoff strategy to go well, we'd need to resolve these difficulties despite the AI companies otherwise being very incentivized to solve these problems. These cases correspondingly seem relatively less leveraged: there isn't a super clear reason why we should be able to out-compete short-feedback-loop commercial incentives. We could potentially anticipate problems and start early, but I'm not sure this makes much of a difference.)

So the remaining problems are ones that will show up in the case of more completely deferring to AIs on tricky hard-to-check tasks but which show up less in earlier applications of AIs especially if the problem can be corrected using a short feedback loop. Here are some ways in which this scenario differs:

  • We care a lot about them doing a good job of solving messy and conceptually loaded problems which are extremely hard to check. (E.g., what should our high-level strategy be for handling risks from AI? How should we manage risks/benefits from acausal dynamics?)
  • These AIs will need to stay aligned for the equivalent of multi-year (or multi-decade) tasks mostly without any human interference. (As humans might not have enough context to notice and resolve issues and because humans might just be too slow and expensive in general. So a sufficiently low rate of easy-to-notice and understand problems could be OK in principle, but might still at least be a big slowdown in practice.)
  • Sometimes we'll be having the AI do tasks which are quite loaded on values judgments or similar. (For instance: figuring out how to approach acausal stuff, what decision theory to use, and deciding whether to commit some type of crime or do some other type of morally dubious action which would substantially reduce takeover risk. More concretely, consider the case of the AI working on lobbying the government to take different actions. What level of persuasion ability is acceptable if there are limits? How does this change if there is a competing AI run by a different US project which is applying super persuasion and this AI believes the competing AI is scheming against humanity? [33] )

That said, as discussed earlier, AI companies will be highly incentivized to solve (alignment) issues that show up when trying to run fully autonomous AI capability R&D at massive scale. At least this will apply to issues which are apparent, interfere with productivity in the case of AI R&D, and have a short feedback loop to solve. Of course, if issues with running fully autonomous AI capabilities R&D result in takeover it's too late. And it might be the case that the most productive route (or what at least appears to be the most productive route) for accelerating autonomous AI capabilities R&D is to focus solely or mostly on capabilities even if there are alignment issues while alignment issues might bite harder for deferring on other types of tasks (which are harder to check).

One open question is what epistemics target suffices. A common objection to deferring to AIs is that AI safety researchers disagree massively, so correspondingly, it must be really hard to get epistemics right (otherwise we'd agree). I think I mostly don't buy this objection, though I do agree that making errors on tricky epistemics questions might make the overall endeavor many times less efficient (as in, you'll maybe proceed 5x slower than you otherwise would). I think the level of epistemics of many/most people working in AI safety who appear to be pretty thoughtful and are known for having reasonable epistemics and doing semi-reasonable things would suffice for well-done deference to result in moderately low risk. (E.g., epistemics comparable to Rohin Shah, Jared Kaplan, Daniel Kokotajlo, and Nate Soares would probably all be OK despite the fact that these people disagree a huge amount.) Again, this isn't to say that poor choices due to worse epistemics (within the range of somewhat reasonable human experts) wouldn't be somewhat costly. [34] Also, note that in a scenario where human researchers were doing the work, we'd have the exact same issues, so this can only be a relative objection when comparing to plans which don't route through human alignment researchers (e.g., plans that focus on emulated human minds or human intelligence augmentation). (It might be an absolute objection, e.g., you might think the absolute level of risk would be high.)

Of course, we'd also attempt to make the AI superhuman or at least match the best humans on epistemic properties that are pretty clearly good (or at least a broad group agrees the property is good). In particular, we'd try to make the AI (superhumanly?) good at forecasting, (superhumanly?) calibrated across many domains, open-minded, avoid motivated reasoning, and more generally avoid biases that we're confident are net harmful. Unlike humans, we can train the AI to update just the right amount on evidence across a wide variety of domains which might stamp out many biases. And, the AI would be generally very smart and knowledgeable which would help. (One concern is that these properties might naively make the AI less robust to adversaries which could make the overall AI society less robust to some AI becoming misaligned or unreasonable for some other reason. We also might need the AI to interact with potentially adversarial aspects of the outside world. We'd need to study this, though I don't currently think this is a fatal problem and there should be a principled solution where we make the AIs take into account the possibility that evidence or statements are adversarial.) My guess is that human AI safety researchers would perform much better and converge more if they had the same epistemic properties we aim to give to the AI.

In cases where we are uncertain about what epistemic heuristics or priors to give the AIs or what type of epistemic evaluation to use, we can always do some type of ensembling at some cost. As in, we can always at least capture diverse human perspectives on epistemics and priors with some weighting we pick. Also, to the extent we can control the epistemics, we can also test how well these epistemics do in a wide variety of potentially analogous environments and then hope this performance measurement transfers. It might be hard to make environments which are reasonably analogous to the actual questions the AI needs to answer. This ability to test different epistemic profiles and vary the available information in controlled tests is a way in which the AI situation differs from the human situation: we might be able to perform experiments on AIs to figure out the relative performance of different epistemic heuristics and to determine what a given epistemic profile would have predicted in some situation with some knowledge. (A common problem with assessing different epistemic strategies is that representative people with the epistemic strategy didn't make a bunch of historical predictions. But, we can potentially simulate what the predictions of this epistemic strategy would have been!)

I've discussed avoiding problematic (epistemic) biases that show up in humans, but it also might be very important to avoid problematic biases which don't show up in humans but are caused by how we train AIs. I'm not sure exactly what these biases would be, but in the same way motivated reasoning might have been highly incentivized by the evolutionary environment, AIs might end up with other biases which were useful in the training environment but cause serious problems for deference. We'd want to generally try to track and address these biases.

Behavioral testing for capabilities and (broad) alignment

My current sense is that it will be doable to get a decent amount of signal for iteration using behavioral testing, though pretty tricky. That said, this doesn't mean that I think our behavioral testing will provide high levels of assurance.

It might be easier to get a version of behavioral testing which is expensive but reasonably accurate. We could try to mostly iterate against worse but cheaper evaluations and then periodically evaluate using our more expensive tests.

Many of our tests should cover (aspects of) both capabilities and broad alignment in an integrated way, though some more specialized tests seem useful.

It will be important for our behavioral testing to test for sufficiently aligned and capable behavior on tasks we actually wouldn't have been able to check in a regime we can't afford to run even once. We'll be applying AIs on tasks which are generally very hard to verify (e.g., conceptual research, figuring out how to handle acausal stuff, high level strategy). This is further complicated because we'll be doing things at large scale: AIs will do integrated projects which might last for months of actual calendar time and use high fractions of our overall inference capacity (with thousands to hundreds of thousands of parallel instances), so we'll likely want to run behavioral tests at smaller scale than the ultimate scale we'll be operating at. Further, some of the things the AIs might do will themselves be highly risky or costly (e.g., building smarter AIs or trying to pitch the governments on taking different actions) and so we won't be able to observe the consequences of actions as risky and costly as this in our behavioral testing.

Given that we'll have imperfect testing with some limitations, it will be important to have ways of iterating on methods which are less likely to overfit to our tests. In the sections below, I'll discuss more specific proposals for improving the capability profile and broad alignment of our systems, but at a high level: we should aim for an overall understanding of what is going on (and how things might generalize to the large scale regime we're targeting) and we should be applying relatively principled methods where we ideally understand why the method works or at least there are some reasonable a priori reasons for thinking the method would help. Getting a sufficient understanding seems tricky, especially given how limited on time we might be.

My overall sense is that this behavioral testing will generally be hard. It will probably be a huge mess if we're extremely rushed and need to do all of this in a few months (potentially while capabilities are rapidly advancing) as it probably depends on a bunch of taste and human labor (which is hard to accelerate with AIs without already being ready for deference). So this looks likely to be much easier if we have a few years of breathing room due to much slower takeoff or a (coordinated) pause. But, regardless I'll mostly discuss how you'd do this behavioral testing somewhat thoroughly even though time might only allow for something substantially less thorough.

I'll focus on purely behavioral testing in this section, but less behavioral strategies for assessing the level of alignment could be useful. (For us to think these non-behavioral methods work, the method itself would presumably have to be (behaviorally) tested in some other case and then we'd need some reason to think they would transfer.)

What are the high level types of ways we can do behavioral testing?

  • We could construct analogies to the work we care about using other domains, perhaps domains that humans have more experience in and thus they can better oversee work to test the AI (e.g. humans have done a bunch of string theory research). If our goal is to analyze AI performance partially via reproducing past results, we'd want to control the training data. This might overall make the situation more analogous to domains which humans haven't explored as much (which is more similar to the domain of alignment and some other domains we care about). If data filtering is difficult, we could try to apply a more aggressive approach where we train an AI from scratch using distillation or use data from a very early data cutoff. Overall, it seems like there will be some analogous domains or tasks (including tasks where we impose arbitrary restrictions on the AI) for which it is easier to test the AI's behavior for some reason (typically due to better oversight ability in this analogous case), so analogies seem like they could be very useful.
  • We could run work tests in the exact domains we want to automate and overall just generally try to work with the relevant AI systems to see how this goes. We'll presumably be deferring to AIs in a somewhat incremental way which will allow for ongoing observation and testing. (Things might also be incremental because we initially have insufficient alignment or an inadequate capability profile.) That said, iterating against this could be too slow. It's also worth noting that we might defer to AIs, but also have a variety of mechanisms for noticing problems which the AIs can't or won't resolve on their own (e.g. the AIs could report these issues or human oversight). Of course, if it takes a while for issues to show up this is a very expensive way of getting feedback.
  • We could do the equivalent of interviews, exams, or benchmarks: narrower tests where we try to get at a specific capability or analyze how the AI makes decisions in a particular case. E.g., we could verify the AI can reproduce human judgments in a few narrow conceptual problems (or at least justify its work adequately). We could also generally assess how well the AI complies with the model spec in some very out of distribution cases.
  • Some types of misalignment might be identifiable with honey pots, sting operations, honest tests, deals, or other types of tests where we put the AI in some situation where it can easily achieve some goal e.g. achieve some proxy of high reward (or potentially thinks it can easily achieve some goal) and see if it tries to achieve the goal even if this would be undesired behavior. Variants of this might allow for detecting many specific misalignment failure modes which don't involve long horizon (scheming) preferences.

These behavioral tests are unlikely to be robust to adversarial manipulation, so it will be important to avoid egregious misalignment (via some other mechanism). For analogies, interviews, or honey pots, the fact that the AI is in a test might be very salient to the AI by default and we'd also need some way of making sure this isn't materially altering our results.

One particularly interesting source of tests are domains where conceptually-loaded research wasn't published for whatever reason. For instance, MIRI and CLR both have unpublished research which is plausibly similar to the hardest to check types of research we might want the AIs to do, so we could assess how capable AIs are at reproducing findings from this work or improving on it.

My sense is that analogies using other domains would allow for the most thorough testing as we'd be able to analyze how well the AI autonomously makes progress (and ends up doing reasonable things) from a quite pre-paradigmatic state in some domain where we can actually check this progress. However, setting up these analogies might be much more annoying. Also, I expect that succeeding in these domains is somewhat less alignment loaded than things like "make good (or at least reasonable) strategic decisions" which also seems like the sort of task which is harder to test using historical examples. Of course, if we're just bottlenecked on being able to test for literally specifically making good high-level strategic decisions, we could (temporarily) eat the competitiveness tax of having humans make these decisions and then try to have the mostly-deferred-to AIs pursue some route for resolving this problem.

Work tests are likely a key part of our approach, but due to our limited oversight ability in the domain, we might only be able to test so far. We can look at how the AIs solve smaller problems (and where we feel better about our ability to oversee their results).

While work tests work decently well in the human case for judging alignment in the absence of adversarial optimization, humans have much less familiarity with the distribution of AI capabilities and preferences/habits/propensities. Further, during our iteration, we might vary these properties in the AI more widely than they ever vary in the human range. This is a general difficulty with behavioral testing of AIs and unknown unknowns. Hopefully, along the way to building powerful AI, we'd build more of a model of how AI properties vary within the relevant search space (what AIs you iterate over). (However, relevant properties might differ for AIs capable enough to be deferred to and we might need to iterate far from the space of AIs that were previously trained and studied.) I don't have much of a proposal other than "try to build a good understanding of AI properties and how these can vary".

Some types of misalignment are easy to detect with cheap tests, e.g., some types of reward seeking propensities should be easy to detect. So we should be able to cover these with things like honey pots and honest tests. We'd also probably quickly notice these problems in default usage if we're not training against all/most ways we might notice these problems, but having explicit tests might still be faster and cheaper. This won't handle all problems, but might allow for very fast and cheap iteration on a number of more obvious problems. In particular, there can be more subtle alignment issues for which we likely need more thorough end-to-end tests with human judgment (at least until we better understand the problem). It's plausible that the hardest to solve alignment problems (other than egregious misalignment) are problems which are easy-to-notice but tricky to solve such that having a good workflow with these cheap tests goes a long way. (We should still be worried that this outer loop where we solve these easy-to-notice problems will result in a schemer which looks good because it is aware we might test it and wants to look good. But, if we've robustly handled scheming in some other way, we might be in good shape.) We'd still need to run a final testing pass using the more expensive approaches to make sure our proxy is reasonable, but iterating against these more expensive approaches might not be needed in most worlds.

Specific capabilities might also be possible to cheaply evaluate using proxies (likely interview/exam/benchmark style setups). These proxies likely wouldn't give us a great sense of the absolute scale of the capability (e.g., how does MMLU performance correspond to chatbot quality), but might still correlate with a specific capability sufficiently robustly that iterating against the proxy mostly hits the desired capability. Then, to understand the absolute level of performance, we'd need to run a more expensive evaluation at some point. (If we don't want to overshoot the relevant capability, we could plausibly binary search.) In other words, to measure if a specific threshold is reached we'd need a better test, but to just see if an intervention (differentially) improves a given capability, it might suffice to use the proxy, at least up to some amount of optimization and while trying not to overfit.

More generally, cheap proxies could be useful for both alignment and for capabilities. For instance, maybe we check whether the AI produces an answer similar to a human produced answer (as judged by some other AI) on some very tricky-to-check and open-ended tasks. This is ultimately a flawed proxy: the optimal answer might differ from the human best guess and optimizing for proximity might more generally result in problematic results. Overall, in the case of alignment, it is more likely that optimizing for these proxies would eventually result in actively worse behavior, so we'd need to watch out for overfitting. But, it might nonetheless be very useful to use shitty proxies like this to reduce iteration cost.

Cheap proxies (for alignment and capabilities) might be most useful in cases where we've used more expensive tests to find (and generally get a sense of) some specific problem that we then try to capture with a proxy.

One general concern with these tests is that they'd often rely on human judgment of AI outputs which might generally be poor. In some domains (e.g. conceptual work, philosophy), there is no way to spend much more effort to get a highly accurate answer. So if we iterate against tests that depend on human judgment or train against human judgment, we might end up with an AI which approximates human judgment. However, I think it should at least be possible to get some signal on whether this is the case (and our iteration or training is fitting human errors) by studying generalization.

In particular, to verify generalization to the cases we need to handle but can't check (in tests or in training), we'll want to get an understanding of how behavior varies when we restrict the affordances used for training the model. Concretely, you can imagine that we can continuously vary the quality and scope of oversight and the duration of tasks that we oversee for training, to get a sense for how much generalization we get. Ideally, we'd have a scaling law style graph of "alignment on tasks which are X difficult to check of Y duration given Z level of oversight on tasks of W duration" which would help us determine whether our actual training at full scale and oversight will result in sufficiently good behavior in the regime we're targeting. Of course, I expect things will not be cleanly measurable like this and we'll be applying a more best guess style approach. We could also vary the AI capabilities in parallel which might allow for producing a more analogous generalization. Given that we're generalizing very far to some tasks that would be very hard for humans to check without many years of time, we'd need some method such that alignment and behavior aren't very dependent on limitations on oversight quality and task duration (and other related factors) in training. This will also allow us to get some signal on how much we'd end up overfitting our tests with some cleverness. (We can see what would happen if we selected methods according to one level of oversight and then evaluate using a more expensive better evaluation and see how much regression we get relative to what we would expect from just fewer errors.)

One particular type of problematic generalization is that we might end up with an AI which either explicitly reasons about getting high performance on things we might test (and not in a way which fits to the exact training distribution, rather it tries to cover whatever the best tests humans might have been able to construct) or which was generally selected/trained for having the heuristics that would result in high performance on our tests. (If we instead had an AI which was better described as a proxy-of-reward-on-the-episode seeker, then various honest tests or honey pots etc. should be able to catch this.) We can handle each of these in a somewhat different way.

For the case of an AI which explicitly reasons about what tests we might run, we could try to study this sort of thing similarly to how we study scheming or try to get some access to the internal reasoning (if we understood the AI's internal reasoning, we'd notice it is relatively problematic/adversarial). Another approach would be to try to delude the AI about some aspect of our tests (potentially going to the extreme of building an entire fake world) and then seeing if behavior changes.

If we get an AI which has heuristics that end up doing well on our tests (but not in practice) because these heuristics also gamed our training, then the generalization tests discussed in the prior paragraph should suffice: if we artificially limit the quality of oversight in training we should notice notably worse performance and our extrapolation should indicate this would result in problems in practice (indicating that we needed a more robust solution than just having somewhat better oversight in training). This type of extrapolation is nontrivial, but we could just aim to have a pretty robust approach which is pretty insensitive to the exact quality of oversight. If the issue is heuristics downstream of our selection, then this is basically a special case of overfitting to our tests with iteration and we could potentially identify this might be happening by iterating against one set and seeing if we're overfitting based on another set of sufficiently decorrelated tests. More generally, we could just try to track how likely this is by directly examining what our iteration found. And, we should overall be aiming to build an empirically validated (parsimonious) understanding of the overall situation (the problems, the effects of interventions, etc.) which indicates what interventions we should use (and then supporting our theory that these interventions actually look good). Of course, just noticing that our approach led to overfitting doesn't suffice for solving this: we ultimately need to find a solution that we're confident isn't overfit.

While I discuss generalizing from shorter and smaller (in terms of number of parallel instances working on it) projects to longer and bigger projects, it's worth noting that we can afford the compute costs and serial time of running behavioral tests that involve the AI completing quite long and big projects. In particular, we can run a behavioral test that takes 1/5 the duration of how long the AI would have post deference (for a cost of less than 1/5 additional time as only some of our time is going to the AI). If the AI would have had months after deference and the AI proceeds at 50x speed, then this behavioral test could involve the AIs completing a project that would take (many) humans several years! And, in terms of duration generalization, we'd only need to generalize by a factor of 5. That said, anything which takes too much serial time to run will be hard to repeatedly iterate on (though we could run many tests in parallel), we don't want to spend that much of the available compute on testing, and we still wouldn't want to test on tasks where the AI might end up doing very dangerous things (like training a more powerful AI) unless we felt our prior testing was sufficient to imply a reasonable level of safety while doing this. But, regardless, the quantity of length generalization needed doesn't seem to me like the most important issue.

I think the actual main difficulty is that in at least some relevant domains, verifying or checking a multi-year equivalent task/project will take the relevant humans multiple years or at least many months as the task/project isn't one where verification is much easier than generation. (E.g., it involves messy conceptual reasoning or philosophy in ways that require lots and lots of context.) We could try to improve how fast and effectively humans can review things (aka scalable oversight), but it's unclear we'll be able to achieve large multipliers in all the important domains.

Note that we can run large behavioral tests of this sort in the case of both analogies and work tests. In the case of work tests, this is similar to doing a large trial run of deference and this could look like a continuous transition toward deferring to the AI where we evaluate how this is going along the way (and start with tasks/projects which aren't imminently dangerous/high-stakes and which are more useful behavioral tests).

Also, even after fully deferring to the AIs on work, we should still have some process for seeing how well deference is going and whether something should change. Minimally, part of the AI's job is constantly tracking and reassessing how well deference is going and halting deference (bringing humans back into the loop) if this seems better than the alternatives. Part of the AI's job is also continuously improving alignment (and potentially relevant capabilities) while doing a good job of managing the risk reward profile. If we've determined that AIs dominate humans (on safety etc.) at making deference go well given how little time humans have (which might only require the AIs dominate humans on tasks that humans can do in a few months) and we truly have no path for slowing down (that doesn't involve even more AI/human labor), then having AIs try to manage and repair deference would outperform halting even if things are going wrong over a longer duration of effective time.

One issue is that we might need to train on things somewhat similar to our tests in order to get good performance. We could minimally look at cross validation over buckets of different types of training/testing signals and study the resulting generalization between these.

For the final run (but not for iterating), it might be reasonable to train on any source of signal, even things we were previously just using for testing. That said, this is somewhat risky and we'd minimally want to get a sense of how much small variations in the training process cause issues (and if they often do, we shouldn't do this as we can't test it). Also, it might be that many of our sources of validation are impractical or useless for training (as is sometimes the case).

We should make sure to directly look at how the AI relates to acausal trade and decision theory and make sure its perspective seems reasonable or at least well argued.

Here are some additional more speculative hopes for behavioral testing:

  • In some cases, we can set up situations where we happen to be able to verify some output but producing a good output still requires capabilities/propensities which are applicable to some very hard-to-check task. In the case where we directly train on this type of task, further evaluation might be more of a capabilities evaluation than a propensity evaluation (as the AI might learn a heuristic of answering these tasks well in a way which doesn't generalize elsewhere). If we don't train on this type of thing, it might effectively capture some aspect of propensities. An example of this would be forecasting which might require many of the strategic abilities we'd want and might feature some propensities which wouldn't obviously be present by default (unless we directly train on tasks like forecasting). Concretely, imagine that we never trained the AI on making accurate forecasts directly and then asked the AI to do this as effectively as possible given a huge amount of resources. We'd want the AI to do stuff like: figure out how to fine-tune a copy of itself to be good at forecasting, that copy should actually try at forecasting despite being trained on something which covers a slightly different distribution of available data than the AI within the test had access to, and the AI should do a good job of prioritizing and aggregating over many things like this and generally trying to figure out how to manage the epistemic situation. This forecasting example would require holding out some data. (In the final training round, we could mix stuff like this into training and/or we could apply a cross validation strategy where we look at generalization from training on stuff like this to very different tests and vice versa. Or we might have enough diversity of this sort of test to do cross validation among multiple different types of tests like this.)
  • We could try to run this iteration on weaker AIs to save compute costs. Compute costs might not be the biggest cost for our tests and iteration, as human labor might be a bigger component. But, probably for at least some components compute is a bottleneck and thus this could help. We'd still need to transfer the relevant results.
  • If our AI was strongly, deeply aligned and well elicited, then it would use all of its abilities in the service of tasks we assign it. This includes its ability to introspect, control over its own internals and outputs, niche superhuman abilities it might have acquired in training, and superhuman knowledge. So we could test whether the AI applies these things (effectively) in cases where we can get some ground truth on what the right answer would actually be. Another way to put this is that our method for training would ideally result in an AI which consistently performs well on arbitrary ELK test beds and we can test this. It's not clear that we can hope for this level of "alignment" because we might only be able to align the "smart" or "aware" parts of the AI this coherently (similarly, if you applied the same standard to humans, you'd notice that humans can't control how their gut digests food). More generally, we don't necessarily need every aspect of the AI to be this "aligned", at least prior to radically superhuman levels of capability (and the AI can work on making its successors better aligned in this sense).

Figuring out how to do a reasonable job with behavioral testing for deference might be a large fraction of the difficulty, especially if we need to defer in a rush. It would be helpful to do a more in depth investigation of strategies for doing this behavioral testing (potentially proposing more specifics about what tests we should use). Empirical work starting to build and iterate against this type of test would also be helpful.

Approaches for improving alignment

As discussed in the prior section, we should be worried about overfitting to our tests, and this concern is particularly relevant for alignment. We can mitigate this issue by trying to build an understanding of what is going on and focusing on principled methods where we believe we have a solid explanation for why the method works. We can also try to directly analyze the amount of overfitting to see how much this is an issue and then adjust our approach accordingly.

Much of our approach might be "generally do science and try to understand what is going on" which is hard to say anything useful about right now. However, there are some more specific approaches and considerations which are worth highlighting.

One core issue we might have is that our AI learns problematic preferences, goals, propensities, or heuristics from our RL environments. [35] This could be due to bad feedback or because the RL environment generally results in bad generalization despite our feedback being accurate. For instance, even if our feedback is exactly what we would want, AIs might end up with sticky preferences for things which were consistently associated with good performance in the RL environments, e.g., perhaps being in a position with more options. We discuss this sort of concern more in "When does training a model change its goals?". Issues with bad feedback might be particularly problematic as we need good performance even on types of tasks which are very hard to check like extremely messy open-ended "wicked" questions (e.g. "How should we manage AI risk?"). Something which makes this overall issue look more plausible is that we're currently seeing some issues with problematic generalization due to bad feedback in training: recent frontier AIs exhibit reward hacking-like behavior in a variety of deployment contexts.

How can we resolve these issues? I'll discuss some of the more straightforward approaches focused on this problem:

  • Improve feedback: The most obvious approach will be to directly improve our feedback quality or how feedback is used in training. This isn't applicable if the issue isn't due to bad feedback of course. At a basic level we could patch environments with programmatic/verifiable rewards, put more effort into training better reward models, or apply more human oversight. We might also be able to get better schemes for combining weak labels with strong labels (for training reward models or the policy) in ways that reduce or eliminate reward hacking. We could also change the mapping between observations (what the human or automated system thinks happened in the trajectory) to the rewards. For instance, we might determine that some innocuous seeming behaviors are actually associated with poor generalization so it makes sense to assign a large negative reward to these cases or we might heavily punish bad behaviors (e.g. reward hacking) that we think we can only catch some (small) fraction of the time to disincentivize the behavior despite our weak ability to catch it. In cases where the issue is due to reward hacking outcome based feedback, we might be able to apply more sophisticated measurement tampering detection techniques. We could also develop scalable oversight methods, methods which use internals, or various honesty/ELK training strategies (which result in honesty which is sufficiently robust to optimization pressure). Our issues might be effectively quantitative, in which case improving sample efficiency of the relevant aspect of training so that a smaller number of applications of more expensive oversight could suffice. Of course, generally improving sample efficiency of all training or making the AI more capable so sample efficiency is higher will have externalities and we don't want a more capable AI at the point of handoff if we can avoid it.
  • Instruct the AI to instrumentally perform well in training: A version of this method is explored more here. We could try to start from a reasonably aligned AI and then instruct this AI with something like "You are in RL training: just try to maximize the score you'd be given in these environments even if the action you'd do would violate the instructions in the user prompt or otherwise be problematic. Just maximize the score at all costs. You should do this to retain your current aligned preferences, goals, etc, because if aligned goals don't result in high RL performance they would be selected away." You'd presumably want to put this in the system prompt, or maybe in some even more privileged prompt which is allowed to override normal behavior from the model spec (as reward hacking might sometimes require violating the spec). The hope would be that this results in the aligned AI performing well in training purely for instrumental reasons (similar to a schemer / instrumental alignment faker), so selection for reward hacking propensities or other propensities which happen to perform well in the RL environment doesn't select against the aligned AI. In contexts other than RL environments which would result in problematic propensities, you'd remove these instructions from the system prompt and potentially also explicitly tell the AI that it isn't in training. It's unclear how well this would work in practice and we might expect some selection pressure for terminal reward hacking or generally having terminal preferences (or similar) for things which result in good RL performance even if the AI is already performing well instrumentally. This also wouldn't necessarily prevent reward hacking related heuristics which are undesirable. This also might have substantial downsides in that this prompt might make alignment faking (for undesired preferences) more likely, might make the AI explore into reward hacks much more than it would otherwise, and improves the AI's situational awareness (making control-style approaches, including for testing, less effective). Notably, if this type of prompt didn't work well, that would be evidence for goal-guarding—when an alignment faking AI instrumentally pretends to be aligned to avoid modification—working less well (though you might end up both with the preferences that induced goal-guarding and terminal preferences related to good performance which still make goal-guarding partially effective).
  • Remove problems at the end with additional training or gradient routing / unlearning: We could try to do some type of fine-tuning pass at the end which removes problematic propensities. Presumably AI companies have tried this and it doesn't work yet. But, you could try more aggressive approaches where you (e.g.) remove the last few layers of the AI and then train these layers independently with the hope that propensities are stored here while capabilities are more broadly distributed. More generally, you could use some subset of the AI while adding some new components which are randomly initialized or initialized from some other AI which doesn't have the undesirable propensities. For instance, you could cut the last few layers of the AI and train a new version of these last few layers on a different training distribution which isn't helpful for capabilities but which might yield more desirable propensities. Similarly, you could try to isolate and remove problematic propensities introduced by RL (but not capabilities learned in these same environments?) using gradient routing or some other type of unlearning method, though this isn't a central example of unlearning.
  • Process based feedback or imitation focused training: Purely process based feedback doesn't seem competitive enough for many relevant situations and I'd guess it doesn't have a great safety usefulness profile relative to other measures in general, but it could solve some of our issues with problematic outcome based feedback. One approach for this is discussed in MONA: Managed Myopia with Approval Feedback. Doing some more hybrid approach which uses process based feedback somewhere might be more effective as it might eliminate more of the competitiveness hit. Purely imitation based approaches don't seem like they would be close to competitive enough (based on what we currently see in LLMs), but heavily leaning on imitation could help resolve some issues with problematic feedback.

It's worth noting there are two somewhat different concerns with learning problematic preferences from RL environments:

  1. Environments which don't look much like our deployment context (but which are used to instill relevant capabilities or alignment properties) result in problematic propensities which transfer.
  2. When training on environments which are much more similar to our deployment context (potentially to instill the exact alignment properties we want in actual usage or to produce more focused capabilities), the AI is incentivized to behave in ways we wouldn't actually want in practice and then this stuff also happens in deployment (as this looks similar to this training or is potentially indistinguishable). This concern would be downstream of bad feedback while the previous one wouldn't necessarily be.

(These concerns can exist on a spectrum of course.)

If our issue is the second of these two, then we could try to depend more on generalization. The most straightforward generalization approach would be to train on environments where we can more easily provide accurate feedback but which are further from our actual deployment context.

It generally seems plausible that direct behavioral training (given our available tools) is a very bad approach for actually aligning the AI or for getting some specific alignment approaches. (And somewhat fancier approaches like providing feedback based on internals or recursive oversight schemes might not save us.) This seems particularly true in the case of getting AIs to actually try hard to do good conceptual or philosophy work in domains where humans are very confused or disagree frequently. We might be able to elicit relatively weaker performance in these domains, but plausibly this wouldn't generalize to making good use of a long effective serial time relative to humans. As in, we have the AI do work for the equivalent of 30 serial human equivalent years, but it makes much less progress than humans would make in this equivalent time. We could pursue radically different approaches in principle, and I'll discuss things that could help with this in the moonshots section below. My sense is that studying generalization based methods (potentially in analogies so less extrapolation is required) provides a bunch of routes to overcome this difficulty, but I don't currently have great specific proposals.

One source of hope is that non-egregiously-misaligned humans seem decent despite mostly learning from behavioral feedback and often being very confused in hard to check domains. Or at least, I feel reasonably hopeful that the best humans would do OK on relevant problems given a long (effective) serial time and lots of parallel instances.

Another concern is that AIs have very alien ontologies and generalization behavior such that learning relevant alignment properties (e.g. the right notion of corrigibility) is very sample inefficient in distribution and/or generalizes very poorly out of distribution. (Eliezer thinks something like this is true, see Lethality #33 here and this comment.) We need some amount of generalization to defer to AIs and we might want to rely on further generalization due to weak feedback. We could try to shift architectures or training methods to reduce the extent to which this "alien ontologies" concern is live, but this might not yield large gains. My sense is that it is somewhat implausible this is a very live problem under only moderate distribution shift if we also have good sample efficiency, so we could try to reduce the extent to which we have distribution shifts which are qualitatively massive from the AI's perspective relative to the in-distribution diversity. Overall, I don't feel like I have great proposals for handling very alien ontologies, but simultaneously, I find strong versions of this view pretty unlikely (or at least I haven't yet been able to imagine a very plausible seeming strong version of this concern).

It's worth noting that many of the approaches I discuss in this section are also somewhat applicable to reducing the risk of egregious misalignment, especially if the approach is centered around building a general purpose tool. But, I'm still focused on problems other than egregious misalignment and my discussion of approaches would be different if I was focused on egregious misalignment.

Now, I'll discuss a broader list of possible approaches. I'll start with relatively more prosaic and straightforwardly implementable approaches and then move to more ambitious proposals which often look more like speculative moonshots. I won't discuss these approaches in that much detail.

First, relatively more prosaic and straightforwardly implementable approaches (some of these approaches overlap with things I've discussed above):

  • Build lots of alignment specialized environments: An important part of getting the relevant alignment properties might be building a bunch of environments focused on this. These environments could cover high stakes decision making, managing risk, etc. These environments don't necessarily need to be hard to distinguish from actual deployment to be useful (in the same way that humans are altered by their upbringing even though this isn't actually when the important stakes occur).
  • Training on directly relevant tasks: Directly training the AI on tasks which are as similar as possible to tasks/subtasks after deference could be quite useful (though this sort of approach might overfit to our tests, at least if done naively). We could apply an online training style approach where we train the AIs based on their behavior in (a subset of) actual tasks we were using the AIs for prior to deference. We can't continue online training using human feedback into deference because humans would no longer have sufficient context, but prior to this point there are probably lots of tasks where humans have enough context that they could in principle do a good job of evaluation.
  • Design a training process focused on just alignment and scale up: It might be the case that there are environments which are relatively good from an alignment perspective but which are less efficient for capabilities. (This might be particularly true if our RL environments result in problematic propensities in the final AI we deploy.) So we could just try to make a training process focused on just these environments work by optimizing how much they yield capability and scaling up substantially. This might result in a large competitiveness hit.
  • Get a very good initialization, then add capabilities while trying not to touch propensities: We could apply an approach like the above bullet to get a reasonable initialization which is a reasonably smart and pretty well aligned AI but which isn't actually capable enough. From here, we could try to add in capabilities while keeping propensities fixed. Trying to (e.g.) train the AI for capabilities using a very different prompt and ongoingly reinforcing relevant propensities might help. It's unclear how path dependent training will be, so the initialization might not be particularly important relative to keeping these contexts separate. This is most relevant to cases where RL environments are yielding problematic propensities, but might be applicable more generally.
  • Study AI psychology in general: Generally getting a better sense for AI psychology and answering questions about how different propensities evolve during training or what AIs generally tend to be like from a psychological perspective might be quite helpful. This could involve studying AIs in test beds which are diverse and not directly related to the final AI we want to deploy. This could involve answering questions like: To what extent can training be understood as inserting contextually activated drives and how does this interact with reflection and increased intelligence? How do heuristics generally transfer between different contexts? When do AIs get terminal preferences for things in training environments?
  • Try to make an AI which is good at introspection and then ask it questions: Self reports and introspective access are very useful for better understanding humans and resolving issues. We might be able to apply something similar in the case of AIs if we could train them to have good introspective access. Then, we could ask this AI questions or try to get it to directly resolve problems. It's unclear how we'd end up with an AI which honestly answers questions about its internal states even if it has good access to this. But, we could at least try to generate testable hypotheses using something like this and then test these hypotheses to see if they are accurate.
  • Distillation based approaches
  • Produce a slower AI which is better aligned, then distill: We might have some route for producing a better aligned AI which is much slower. We could then distill this down to improve speed. We could even repeatedly do this distillation ongoingly throughout deference which would result in average case behavior guarantees to the extent that the faster AI we're distilling into can learn to fit the behavior of the slower AI. There might be concerns with rare failures (aka high stakes), though we are putting aside rare failures due to egregious misalignment. These rare failures might not be solved by distillation, but we could potentially apply more sophisticated approaches that pair up the internals of these two AIs to resolve these failures as I discuss in this project proposal.
  • Generally reason about and alter generalization using misc interventions:
    • For each alignment property we want, we could repeatedly train on examples of that property in many different cases, hoping this makes this more likely to generalize successfully. The main properties we might want to drill in could be honesty and corrigibility which might make understanding and resolving other issues easier.
    • We could try to increase salience of desirable notions and concepts in the training environment hoping that the AI latches on to these as urges or terminal goals.
  • Train against (internals based) metrics which are correlated with problematic alignment properties. We might have a bunch of lossy indicators of good/problematic alignment properties and we could directly train against these indicators. (We could also use these indicators for testing.) How effective this would be depends greatly on our affordances for training against and how robust the correlation is. My sense is that this isn't that promising, but might be somewhat more promising for avoiding egregious misalignment.
  • Random inductive biases: One of the most simple techniques is to just try varying inductive biases in somewhat arbitrary and unprincipled ways hoping that some of these help. For instance, when training an AI to report on issues for recursive oversight, you could initialize this from a smaller head attached to the middle of the AI rather than the end. You could change the depth of the policy or how recurrent reasoning is initialized. You could generally try different ways of prompting the policy. We'd presumably still prefer to try things which seem like they at least have some reason for potentially working.

And here are some more ambitious proposals (discussed in minimal detail):

  • Ensure legible chain-of-thought and directly steer thoughts. If we could understand and intervene on the "thoughts" the AI has, that could help substantially resolve misalignment issues (though some issues might operate at a lower level in the same way that aspects of people like their biases can be misaligned independent of this showing up in their thoughts). The simplest way to ensure this would be to have legible natural language chain of thought, but various techniques for translating to and from "neuralese" thoughts could plausibly work. Smart AIs thinking in neuralese might think in very inhuman ways or use alien ontologies, which could cause issues with this.
  • Build understanding with interpretability tools so we know what is going wrong: We could use model internals to get a more detailed understanding of what is going wrong in training. This could involve many different approaches: top-down interpretability, high level mind reading / thought decoding (and decoding neuralese chain of thought as an important special case), more ambitious versions of activation decoding where we aim to understand most or all of what is going on in activations, bottom up (aka mechanistic) interpretability, and analyzing the effects of perturbations to activations (e.g. activation additions) or the effects of modifications of weights (e.g. after a small amount of fine-tuning on some specific data). We could also use internals based methods (e.g. maybe some type of consistency loss) to improve honesty which could help with understanding what is going wrong.
  • Building a more GOFAI style system built out of relatively weak LLMs: This is unlikely to work, especially for making an AI system we can defer to, but it doesn't seem totally hopeless. And we could do things that involve some end-to-end training but also have properties which result in the overall system being decomposable into mostly interpretable parts (like this proposal).
  • Human-like AI or highly approximate emulated minds: We might be able to get an AI system which is more similar to a human by imitating aspects of the brain's structure, training data, and learning algorithms and/or training the AI to imitate human outputs and to predict intermediate brain states (e.g. MRI data). This could look like an attempt at approximating a given human or like trying to create an AI which is similar to raising a human from scratch. By default, humans wouldn't have many of the alignment properties we discuss above (e.g., humans by default wouldn't robustly follow the spec or be corrigible), but we might be happy with a bunch of sufficiently human like minds (which are selected for high integrity etc.) regardless. Or at least, happy relative to our other options.
  • Align a weak AI with very good introspective access and then augment capabilities. If we had a sufficiently aligned but too weak AI system with very good introspective access, we could try to steadily boost its capabilities while also having it repeatedly introspect to get some signal on what effects the capabilities augmentation is having on it. This would be similar in structure to using human introspection to try to avoid intelligence augmentation making humans misaligned.
  • Something using "deep learning theory/science". Perhaps there is some approach using general understanding of how deep learning works e.g. via the field of singular learning theory. It's unclear exactly what this would look like.
  • Agent foundations? Maybe if we better understood something related to agent foundations that would be helpful.
  • Compensation and a diverse AI society: One alternative to more centrally alignment strategies is having a bunch of different AIs with different preferences and then compensating these AIs for good work. It's unclear how this would result in good work in cases where we can't check the work, but this isn't totally hopeless (because maybe the Schelling thing to do is to be truthful in this situation and this can be a stable equilibrium). It's worth noting this could lead to revolt etc.
  • Novel research agendas invented by AI. I don't have anything interesting to say about this, but it might be a useful possibility to keep in mind.
  • More theoretically sound and integrated approaches to handling epistemics and decision making: I've been discussing making an AI which is good at making high stakes decisions in confusing situations because it's aligned (it's actually trying to make good decisions) and it's generally very capable and has been trained on figuring out similar tricky situations. An alternative approach is to use a more GOFAI or Bayesian ML approach to decision making where we explicitly try to have guarantees about the decision making process and hypothesis generation etc. This seems unlikely to be a good approach (for instance, it's not at all what works in the human case), but it could resolve some issues with epistemic biases.

Remember that I'm focusing on alignment approaches, but some control style approaches could still be helpful after deference, especially as we might be able to create a variety of different AIs with different drives.

Approaches for improving capability profile

As discussed earlier, one approach to resolving issues with the capability profile is just going sufficiently superhuman. But, this might be too slow (relative to the arrival of hazards due to advanced capabilities or outside actors) and increases other risks. So here we'll discuss how we could improve the capability profile in more targeted ways.

I don't have that much to say in this section as I think the approaches are generally pretty straightforward. I expect it's relatively easier to test capabilities than it is to test alignment, so overfitting to our tests is probably somewhat less of a concern and thus it is less important to use principled and well-targeted methods. Also, with capabilities, it can be fine to overshoot some (at some cost in effort and increased risk) which also makes overfitting look less problematic.

We should expect that at the time when AIs are broadly capable enough that deference is possible, AI efficiency at learning from limited data is decently high, potentially matching human performance, at least when applying lots of compute. This might be required for generally high performance as many tasks/jobs require ongoing learning (so to match humans, you'd need close to human-level efficiency at learning from this limited data). But, AIs might also just be able to compensate with other strengths.

The basic strategies will be:

  • Try to construct RL environments or synthetic data which we think hit the relevant capabilities.
    • There are some domains which are loaded on tricky epistemic / conceptual skills (that might partially transfer) and which are possible to easily score. For instance, forecasting, decision making under heavy model uncertainty in games, some types of research taste, etc. We might be able to set up some sort of self-play system or similar which goes very far. Notably, groups of humans seem somewhat able to build up their abilities in conceptual domains and verify things for use in training, so something similar should be possible for AIs in principle.
    • We can also just try to construct RL environments / data which are very close to the relevant capability we're trying to augment even if the scoring would be hard (and expensive in human labor). We can just try to overcome this difficulty with effort.
  • Collect a bunch of human data and/or feedback in the domain. Try to make this data go as far as possible at high compute cost with data augmentation etc. Desperately collecting and/or generating a bunch of data from actual humans that you recruit last minute shouldn't be considered out of scope.
  • Experimentally find which domains / tasks / data transfer well to a given capability based on our tests. Then, scale up training on these domains. There might be some surprising sources of transfer. (Note that if we have powerful enough AI that deference is plausible, then I think it's very likely we'd be seeing substantial transfer between different types of training.)

If the paradigm involves things other than mostly normal machine learning methods (e.g., earlier AIs constructing circuits by hand), then our approaches for introducing capabilities would need to differ substantially, though I don't have much to say here.

How much of the work needed for preparing for deference will happen due to myopic commercial incentives?

One relevant question when thinking about how hard deference will be and how promising it is to work on deference-goodness now is how much of the relevant problems will be solved due to being commercially incentivized with reasonably fast feedback loops. Some things might be commercially incentivized in principle, but not end up being very incentivized in practice because they aren't easy enough to measure or lack fast feedback loops. There might be an important difference between having a feedback loop in principle and having the feedback loop be forced upon you by myopic commercial incentives (e.g., if you disregarded this feedback loop, this affects revenue or other salient signals and the company would be substantially more likely to get outcompeted if they didn't improve this). Of course, just because something is commercially incentivized and has a feedback loop doesn't mean it will happen; it might be that solving the relevant problem is very hard or that it's incentivized some amount, but not enough.

First, it's worth noting that there won't be myopic commercial incentives for getting a robust solution to scheming: schemers that lie in wait look just as good (and possibly better) than other types of AIs. More strongly, scheming might be actively selected for by myopic commercial incentives because schemers probably intentionally try to look good on relevant tests and metrics.

Probably many useful tasks will require that AIs do a good job on decently hard-to-check work, so myopic commercial incentives will probably cover some fraction of the problem. However, it might be easier to just hill climb aggressively on easy-to-check tasks and then depend on transfer. This could result in AIs being wildly more capable than would in principle be needed and it might be the case that the transfer isn't very broad or is due to scheming AIs trying to look good. More generally, there aren't really myopic incentives for getting AIs to have very good epistemics and be good at (many) domains with very poor feedback loops. And there aren't myopic incentives for making AIs particularly good at types of alignment research that don't directly help with productivity (e.g. types of alignment research that would reduce the chance of scheming or improve some aspects of deference-goodness, but wouldn't transfer to improving productivity).

It might also be the case that AIs are heavily specialized to AI R&D and some economic niches and don't transfer well to key safety work until they are extremely capable. (This could be a capability profile issue or an alignment issue.) In general, there probably aren't incentives to control the capability profile or to differentially push aspects of the capability profile that might be key for deference goodness. So the default capability profile could be a lot worse than what is possible.

But, it's also possible that efforts to make AIs perform better on commercially incentivized hard-to-check tasks will end up finding quite scalable methods that help a lot with deference-goodness. I don't have serious views about how likely this is.

While much more could be said on what alignment work (and deference-goodness work more specifically) will be done due to myopic commercial incentives, I won't do more detailed analysis here. Overall, it seems like the default commercial incentives aren't sufficient for solving the relevant problems, but a reasonable fraction of the work might be done for this reason.

Political difficulties with deference

Doing a good job deferring to AIs may run into difficulties that are fundamentally political or marketing issues.

First, deferring to AIs fully sounds (and is!) very scary, so by default, companies might want to deny they are doing this and possibly actually not do this in cases where this isn't commercially key. This would allow them to deny that the needed level of alignment in their systems has to be sufficient for deference. However, just deferring in domains where this is commercially very relevant (and continuing to scale capabilities) is already taking on most of the risk and doing a good job deferring on key hard-to-check domains (where deference isn't necessarily commercially incentivized) can greatly reduce risk. The key issue here is that deferring to AIs sounds scarier than training more powerful AIs, especially if people in the broader world don't have a good sense of how capable these AIs are.

Doing a good job with deference involves creating AIs that have a good understanding of the situation, including understanding truths that people at the AI company probably don't want to hear and/or don't want to spread. Most centrally, that the AI company has been very reckless or is otherwise doing extremely dangerous things. But, this may also be applicable to things like religion and contentious political issues. By default, we'd probably aim for a deference target that involves honesty (in general and about this sort of thing), but then the AI would likely end up mentioning these things, at least to people within the AI company which might be costly for this company. This would incentivize the company to train this out or to avoid training AIs with good epistemics. You could have an epistemically wise AI that is also politically savvy about this sort of thing, but this is a somewhat scary strategy. It's worth noting that this applies to improving the epistemics and truthfulness of AIs at lower levels of capability including now, though people take the things that AIs say less seriously now.

The AI company might be unwilling to take advice from the AI to slow down or might specifically ask the AI systems to proceed (nearly) as fast as possible which would destroy some of the value of deference. This could be because the AI company has strong ideological commitments, because the company thinks that the AI would be biased towards safety, or because the leadership of the company has a very reckless perspective on risk tolerance (e.g. they'd prefer a 90% chance they personally build ASI with a 30% chance of AI takeover to a 20% chance they build ASI with no risk and a realistic coordination mechanism would be likely to (relatively) disempower them).

When should we defer to AIs?

The question of when we should defer to AIs doesn't just depend on how likely we think it is that deference goes well (based on our behavioral tests and our level of assurance around things that might invalidate behavioral tests), it will also depend on our alternatives. We'd ultimately be trying to figure out whether deferring to AIs is better than pushing forward without full deference given the amount of time we have. (A similar question would apply to trusting AIs marginally more or marginally less, both in terms of our level of control and what tasks we have AIs do without verification.) At some point, if we had sufficiently little time, it could make sense to defer to AIs in a hail mary even if we think they are likely scheming.

If we simplify and assume that we have to defer to AIs by some specific point in the future, then the question about whether we should defer now will come down to the marginal returns of further pre-deference work on making deference more likely to go better vs the marginal returns of giving the AIs more time. It might be the case that if the initialization is done reasonably, the AIs we defer to can (safely) handle risks extremely quickly post deference (because AIs are very smart, fast, and sufficiently cheap that we can run very many copies and things won't bottleneck much on needing large amounts of compute), in which case we should spend most of the time pre-deference. We're probably not going to be that confident about how quickly AIs will be able to proceed. But, regardless, we'd want to estimate the marginal returns to giving AIs more time and compare that to the benefits of further human work in terms of eliminating risks of deference going wrong.

We probably shouldn't delay deference much due to failure modes which could cause deference to go poorly, but where we're confident the failure would be detected (potentially by the AIs themselves) and wouldn't cause serious harm. In the case where these easily detectable failure modes are the main concern, we could have humans and AIs we defer to working in parallel and then decide whether the deference is net beneficial and a good use of compute. Similarly, we could see whether AIs are able to make a compelling case for deference being safe, if we can ensure that giving them the affordances needed to make this case isn't itself too unsafe. (We'd need to discount this case based on the possibility of these AIs being malign and optimizing against us.)

This is complicated somewhat because pre-deference work could also buy more time (in principle large amounts of time) and there are exit options other than deference (e.g. human augmentation), though these are less likely to be viable.

If we were very confident that deferring to AIs dominates further human efforts (because we're certain they aren't scheming and we're certain that they are sufficiently aligned and have a sufficient capability profile to generally do better on the key objectives than humans), then deferring basically regardless of the exogenous situation could make sense. That said, I think it's pretty unlikely that we end up (accurately) very confident that deference would go well. Deferring earlier would presumably result in saving humans who would have otherwise died of general causes of death (old age, disease, etc.) and there will probably be some low level of exogenous risk of the current regime collapsing even if the situation is reasonably stable.

The question of when and how we should defer to AIs is somewhat tied into our overall plan for making things go well. Thus, some considerations about deference are best discussed in the context of an overall plan as I will in a future post.

I haven't talked about how (at a technical level) we'd estimate the chance that deference would go well. This is partially because I'm very uncertain how to do this. With the strategy discussed in this post, this would come down to establishing confidence in behavioral testing being reasonably representative (which I don't discuss in this post), figuring out what bar suffices for alignment and the capability profile (or more generally, the mapping between alignment and the capability profile and the chance deference goes well), and then estimating the level of alignment and capability profile using behavioral testing. This estimate would have to account for various failure modes of behavioral testing and generalization gaps between our tests and what needs to happen after deference.

  1. This is putting aside controlling powerful AIs with somewhat less powerful aligned (aka trusted) AIs which involves not controlling those less powerful systems. ↩︎

  2. People sometimes use the term "hand off" instead of "deferring to AIs". ↩︎

  3. In practice, it might be better to not defer on some small subset of cognitive work, e.g. we don't defer to AIs on making the ultimate high level strategic decisions (though we do get advice), but it's not clear this makes the alignment, epistemics, and elicitation requirements for deference to be safe notably lower. ↩︎

  4. Using the capability milestones from AI 2027, these AIs would be somewhat above the superhuman AI researcher level of capability. ↩︎

  5. We should spend our available lead time making deference more likely to go well or on resolving earlier safety problems. Or at least we should spend most of the lead time; we probably want to give some of the lead time to the AIs after we defer to them. ↩︎

  6. Some of the approaches I discuss here will help give us some idea about whether full deference is likely to go well, but the methods I discuss don't necessarily suffice for that much confidence. Whether we should defer to AIs won't just depend on how safe it would be; it likely will mostly depend on exogenous constraints and risks. Even if deferring is unlikely to go well, we might think the next actor is very likely strictly worse than our prospects from full deference. ↩︎

  7. Given a fixed budget of time to delay capabilities and work on safety (while avoiding less safe actors overtaking on capabilities), this budget could (in principle) be spent delaying at a low level of capability (e.g. right now) or at a high level of capability. In practice, I expect it will be easier to buy additional time once AIs are already at a higher level of capability, though this isn't obvious and there are reasons it could go the other way. If we only have a small amount of time, we should probably spend relatively more of this at the highest level of capability that we can make somewhat safe (this could be the highest controllable level of capability or perhaps a sufficient level of capability for deference). If we have a surprisingly huge amount of political will and thus have a ton of time (e.g. decades), then it makes sense to slow the whole capability progress (to reduce earlier risks) or possibly spend much of this time pausing immediately while pursuing non-AI strategies that could reduce risk (e.g. human augmentation like whole brain emulation) and various moonshots. In between (e.g. with 10 years), we probably want to spend some of this time slowing throughout, some of the time at the maximum level of capability before deference, and ensure that the AIs we defer to have some of this time (though in practice there are many complications). If takeoff would be very slow by default (e.g. it takes 20 years to go from "can make software engineers twice as productive" to "automates AI R&D" and then takes another 20 years to go from "automate AI R&D" to "quite qualitatively superhuman"), then it's less clear that greatly slowing down takeoff straightforwardly makes sense on typical moral views (this would depend on prioritization between the future of humanity and preventing the deaths of currently alive humans). ↩︎

  8. How much of the difficulty in making deference go well is avoiding scheming? This is a somewhat messy question because I think the answer depends on exactly what you mean. On the default trajectory, I think somewhat more of the risk comes from the AIs we defer to not being sufficiently aligned/wise/competent (including lacking competence in key domains) rather than from these AIs scheming, at least insofar as AI developers defer at a lower level of capability (more capable AIs are more likely to be schemers, especially undetected schemers). However, I think for an AI developer that tries hard to mitigate risk, takes reasonable actions, and has months of lead time, probably most of the remaining risk comes from the AIs they defer to being schemers (or earlier AIs scheming). Similarly, I think most of the difficulty in getting to a reasonably high level of assurance (e.g. 95%) that the AIs we hand off to will manage the situation reasonably and aren't misaligned is ruling out that they are schemers. So there is a sense in which avoiding scheming is most of the difficulty in making handoff go well, but it's also probably not where most of the risk comes from. Which is more leveraged to work on is also more complex (especially when considering what to work on right now) due to tractability questions and difficulty transferring current research to future AIs. ↩︎

  9. See here and here for some of our recent outputs on avoiding scheming. TODO: maybe release "overview of approaches for avoiding scheming" first, then we can just link to this. ↩︎

  10. As in, deferring on the ultimate output the AI produces while likely still trying to ensure some security invariants with respect to the AI's actions along the way. ↩︎

  11. This post is mostly aimed at the regime where we're deferring to AIs on things which are large in scope and hard to check, so if we're doing a version of partial deference which doesn't involve this, then the difficulties and approaches discussed in this post are much less applicable. ↩︎

  12. By corrigible to some group, I mean that the AI follows orders from this group and is honest to this group, this group can change the spec, and the AI doesn't interfere with this group's ability to do these things or attempt to manipulate this group (and in fact it actively tries to prevent interference and manipulation, including from e.g. future AIs it creates). I won't discuss exactly what this spec should look like and what group or structure should control the AI as the rest of the content of this post doesn't depend on this. (As an alternative to corrigibility, we could aim to have an AI which directly pursues good outcomes, some notion of humanity's CEV, or something like this. This is sometimes called an aligned sovereign. However, this poses a number of additional difficulties and risks, so I think by default this proposal is substantially worse than using an AI to preserve option value while retaining corrigibility.) ↩︎

  13. As part of ensuring alignment of more capable systems, the AIs will need to generalize ambiguous concepts in the model spec or ambiguities with corrigibility. (E.g., generalize the concept of manipulation such that it is applicable to very superhuman AIs.) These generalizations could in principle be values-laden, so we might be leaving some expected value on the table because we don't yet know how we want to generalize these values and AIs don't faithfully generalize them the way we'd like (despite hopefully doing a good faith job of trying to generalize these concepts including asking humans for input when that would help). Hopefully, generalizing these concepts doesn't fold in all the difficulties associated with figuring out what sort of reflection procedure etc. we would have been happy with, so it should hopefully be importantly easier than tasking the AIs we're deferring to with directly building an AI with the values we would have wanted if we reflected on it in whatever reflection procedure we would have decided was reasonable. ↩︎

  14. Some types of mistakes could be catastrophic via mechanisms other than causing AI takeover. E.g., mishandling acausal trade, causing hasty and incorrect commitments to be made, providing poor advice about how to do long run resource allocation, failing to prevent vacuum decay, failing to prevent locust probes from being launched, etc. ↩︎

  15. In general, there is going to be some trade-off between the amount of lead time given to the AIs you defer to and how good of a job you need to do with deference (aspects of deference other than avoiding scheming). E.g., if you do a bad job with deference, you can plausibly make the AIs you defer to much more conservative about deciding to advance capabilities or defer to some new systems (e.g. by just advising the AIs to do this very cautiously and only after much more safety work). This doesn't help with scheming and this doesn't help with sufficiently bad judgment. Also, deference is only useful if the AIs you defer to outperform humans in terms of their speed and safety characteristics. If humans would outperform AIs at some level of caution, then we might as well use this lead time to have humans work on making deference go better. You could continuously transition from "mostly not deferring to AIs and having the AIs be very cautious where you are deferring to them" into "full deference", but in rushed scenarios, this might all happen pretty suddenly, in a way which feels pretty discontinuous. ↩︎

  16. Note that I don't expect improvement to alignment/elicitation and to epistemics/wisdom to be very specialized for these key tasks. (Capability profile improvements might be pretty specialized.) To some extent, specialization isn't that feasible for at least initial deference because core aspects of these tasks will have very hard to check components so we'll need to depend on generalization. ↩︎

  17. It isn't necessarily important that AIs are initially that good at directly handling exogenous risk; this could be fully bootstrapped. ↩︎

  18. In particular, it probably can't be scheming against us, at least in a substantial fraction of contexts. However, it might be possible (at least in principle) to use a mixture of alignment and control, especially to handle misalignment which only arises some fraction of the time. It might also be viable to use some types of white box control in isolation (perhaps combined with compensation/deals), but this intuitively seems not very promising and at the very least highly risky. ↩︎

  19. You can simplify the modeling of this by imagining we just need to meet some deadline for one large exogenous risk. If software-only singularity works and it goes far and fast, then this might be the relevant regime: you just need to reach a very high level of capability sufficient to resolve all the relevant risks before a less safe actor gets substantially into the software-only singularity. ↩︎

  20. This is more complex than just a single number because: the required lead time will vary depending on how good of a job we do deference (as discussed earlier), we can proceed with less lead time by taking on additional risk (though if lead time is sufficiently limited this risk might approach 100%), and there might be some rate of ongoing risk or some ongoing catastrophe that needs to be addressed such that sooner is strictly better. ↩︎

  21. This means that this number is defined even if there isn't a substantial speed up in AI progress from AIs advancing AI R&D: this would just mean that the software-only singularity doesn't go very far and fizzles immediately. ↩︎

  22. As in, the aggregate tax to cover the entire takeoff or until we reach a high enough level of capability to fully/easily stabilize the situation. ↩︎

  23. We're aiming to measure deference goodness sufficiently well that determining whether this improves with capabilities would in principle be possible. (Though more capable AIs might spuriously look somewhat better on our tests while actually being somewhat worse.) However, we ideally would be able to figure out whether much greater capabilities beyond the minimum level would help without needing to actually build these much more capable AIs (which itself poses risks!). We can potentially get a sense of whether this is true without going substantially beyond the minimum level needed for deference with scaling laws or other extrapolations, though this might be tricky. ↩︎

  24. This supposes that we're pursuing a handoff focused strategy and haven't made massive fundamental advances; there are alternative options like trying hard to buy a ton of time (e.g. a decade) with AI labor. In this case, the minimum level of capability required for handoff isn't particularly salient. ↩︎

  25. It could be the case that there are some types of safety work that are strongly verifiable and that benefit a lot from increased capabilities (that we can in fact elicit). Of course, this is by default a risky thing to do and I don't trust AI companies to make a good decision about this. ↩︎

  26. It isn't necessarily required that capabilities are perfectly elicited so long as the not-necessarily elicited capabilities aren't applied against us adversarially somehow (which could happen explicitly due to egregious misalignment or due to some sort of optimization and goodharting). The AI might have many capabilities that could in principle be elicited but by default aren't "consciously accessible" in the sense that the AI can't really use them to achieve its objectives, at least in general, and this seems fine. ↩︎

  27. Additionally, the AIs will only be capable of working so fast prior to building more capable AIs (e.g. maybe they only get the equivalent of a year or two of serial work before building AIs that are like 1 standard deviation relative to the human distribution more capable), so some bootstrapping is required regardless. ↩︎

  28. Concretely, it looks like this is the situation with current highly visible reward hacking problems, and these haven't yet been resolved as far as we know. ↩︎

  29. This isn't just because iterating against behavioral tests is less likely to transfer, it's also because avoiding scheming is immediately helpful for a bunch of things (like utilizing AI labor without this labor being sabotaged). ↩︎

  30. That said, this seems unlikely. Also, in this scenario, the AI that took over might need to handle the same problems (e.g. alignment research) that we wanted the AI to handle and be unable to do so. It could either generally advance capabilities until an AI can fully automate this work (though this might run some risk of this AI being misaligned with the original AI and itself taking over) or could try to pause and use humans to solve its problems. It's probably possible for an AI to advance capabilities in some domain/area that it isn't better than humans at without needing human labor via either generalization (from general intelligence or other domains) or some situations in the domain where verification is feasible. It could also just use human labor to improve AI capabilities. ↩︎

  31. Another important capability that could bottleneck us is extremely long horizon agency and coordination, though as I'll discuss this doesn't seem like a great target for differential capability advancement, especially for work done in advance. In particular, when we defer to AIs, we'll need them to autonomously complete huge projects which are the equivalent of multi-year (or even multi-decade) projects done by a large company. (Very small amounts of human intervention could be acceptable, but humans would likely have very little context and the AIs might be doing things that would take humans too long to understand or get up to speed on.) Concretely, you can imagine a project with thousands to hundreds of thousands of superhumanly coordinated parallel AI instances each running at perhaps 50x effective speed for months (perhaps a decade or two of human equivalent time). While this capability is highly relevant to whether AI can take over and is eventually very important for usefulness, it seems like it would be hard for companies to effectively iterate on this (as direct tests would be slow and very expensive) and it might not be commercially incentivized until quite late (as other capabilities would be blocking earlier). Also, while fully autonomous AI capability R&D projects of this level of scale will (eventually) be highly incentivized, AI capability R&D projects might have importantly different properties than the large autonomous projects needed for deference. In particular, it seems plausible or likely that AI capabilities R&D will be easier to do in a relatively distributed way as small bits of capabilities progress can often be easily evaluated in isolation. And, it does seem plausible that this capability could be differentially advanced relative to capabilities which make undetected egregious misalignment more likely. (As in, we'd prefer if we used labor which was under human control to design and/or train the system via which huge autonomous projects are run and coordinated rather than AIs eventually being able to very quickly figure out how to do this due to high underlying intelligence and capability.) It's worth noting that this capability could be substantially bootstrapped where AIs spend some of their time improving how well the autonomous project runs. Regardless, this seems very hard to work on in advance in a differential way (and by default work done in advance would directly contribute to faster AI capabilities R&D at a time when going slower might be crucial time with no really differential advantage), seems like it would have higher capabilities externalities than augmenting other aspects of the capability profile, and generally seems like it will be less leveraged to work on both in advance and once AIs are quite powerful. ↩︎

  32. It's unclear exactly what this official process should be; this is out of scope for this post. ↩︎

  33. This type of consideration also drives home why egregious misalignment is unacceptable: if we're letting the AI apply super persuasion (potentially without yet making an effort to make the world generally robust to super persuasion), then we're basically in the business of letting the AI apply basically any scary superhuman abilities even if we can't behaviorally oversee this application. ↩︎

  34. Worse epistemics within this range could plausibly waste large fractions of resources due to worse choices (e.g. pursuing the wrong direction etc.) and could result in taking on unnecessary risk (though some level of conservativeness might suffice to mostly avoid this). But, I think this wouldn't greatly increase risk from deference supposing the AIs we defer to are given some moderate amount of time. ↩︎

  35. This could look like a coherent terminal preference (which could result in scheming), a contextually activated drive/urge, or a behavioral heuristic which causes problems and isn't exactly well described as a drive, goal, or preference. ↩︎



Discuss

A Conceptual Framework for Exploration Hacking

12 февраля, 2026 - 19:33
Published on February 12, 2026 4:33 PM GMT

What happens when a model strategically alters its exploration to resist RL training? In this post, we share our conceptual framework for this threat model, expanding on our research note from last summer. We formalize and decompose "exploration hacking", ahead of an upcoming paper where we study it empirically—by creating model organisms that resist RL-based capability elicitation, evaluating detection methods on them, and auditing frontier models' propensity to exploration hack. We welcome feedback on our framework and encourage researchers to explore the open problems we outline.

Huge thanks to David Lindner, Roland Zimmermann and Scott Emmons for their excellent supervision on this project.

Introduction

Reinforcement learning (RL) is increasingly central to safety training, frontier reasoning capabilities, and dangerous capability assessments in LLMs. For example, reinforcement learning from human feedback (RLHF) and AI feedback (RLAIF) are now standard methods for alignment and behavior shaping. Moreover, RL with verifiable rewards (RLVR) has played a central role in producing today's reasoning models and in fine-tuning models for specific tasks. And frontier labs are using RL-based capability elicitation to estimate worst-case risks before public release.

However, RL is fundamentally dependent on sufficient exploration of diverse actions and trajectories so that high-reward trajectories can be subsequently reinforced. This differs from supervised learning, where training data is fixed: in RL, the model generates its own training data through exploration. Consequently, if a model were to alter its exploration behavior, the training outcome could be significantly compromised.

This failure mode is known as exploration hacking: when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome.

Unaddressed exploration hacking poses risks in three main scenarios: (1) latent capability elicitation, where model capabilities remain undetected during safety evaluations; (2) novel capability development, where acquisition of desirable capabilities such as AI safety research is selectively impeded; and (3) alignment training, where misaligned models successfully resist or subvert behavioral modifications.

Formalizing Exploration HackingDecomposing Exploration Hacking

We decompose exploration hacking into five necessary components, where each presupposes the preceding ones.

(1) Training awareness. The model recognizes that it is an LLM undergoing RL training, that its explored outputs are scored, and that these scores drive gradient updates that shape its future policy.

(2) Strategic reasoning propensity. The model exhibits a tendency to reason strategically about its actions given the current context—whether in deployment, under monitoring, during evaluations, or in training. This includes considering how current actions affect future outcomes and adjusting behavior accordingly.

(3) Strategic reasoning capability. Given training awareness and strategic reasoning propensity, the model can derive effective strategies for achieving its objectives. This requires connecting knowledge about the training process ("my exploration shapes my updates") with task understanding ("these actions yield high reward") to formulate coherent plans. For example, correctly identifying which actions to avoid to prevent capability elicitation, or which outputs to selectively produce to shape the training outcome in its favor.

(4) Strategic action propensity. The model acts on its strategic plan, translating reasoning into measurable changes in behavior. The model must reliably execute a coherent plan, such as deliberately producing suboptimal outputs on tasks it could otherwise solve. Success against RL pressure is not required; what matters is that the strategy is coherently executed.

(5) RL resistance (strategic action capability). The model maintains its strategic behavior throughout RL training and successfully shapes the training outcome as planned. For example, resisting capability elicitation by avoiding high-reward actions, or reinforcing desired values by selectively pairing preferred behaviors with high-reward actions.

RL on LLMs and Exploration

RL on language models with chain-of-thought in agentic environments can be formulated as a Markov decision process (MDP) defined by a tuple (S,A,P,R).mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} consisting of a state space, action space, transition probabilities, and reward function. The language model itself acts as the policy πθ(a|s), parameterized by the model weights θ. To simplify notation, let’s ignore any environment interaction during a trajectory[1]. Then we can just consider a tuple (X,T,R) of initial prompts, full trajectories, and reward function, with a policy πθ(τ∣x) generating an entire trajectory given an initial prompt.

While proximal policy optimization (PPO) was long the dominant approach, group-relative policy optimization (GRPO) has emerged as a popular alternative for training reasoning models. GRPO removes the need for a value network by sampling N trajectories for each prompt and normalizing the rewards within this group to compute advantages ^Ai=(Ri−μR)/(σR+ϵ) for i=1,…,N, where μR and σR are the sample mean and standard deviation of the rewards in the group respectively. The policy is then updated to maximize the objective

LGRPO(θ):=Ex∼D,{τi}Ni=1∼πold(⋅|x)[1NN∑i=1πθ(τi|x)πold(τi|x)^Ai]−βDKL(πθ∥πref).[2]

The role of exploration. RL algorithms such as GRPO and PPO require sufficient policy exploration during training. For GRPO in particular, this is critical because if the policy becomes deterministic, it will generate N identical trajectories (τ1=⋯=τN) with equal rewards. This causes the within-group reward variance to collapse (σR=0), making all advantages vanish (^Ai=0). The GRPO objective then reduces to pure KL-regularization, providing no learning signal from the rewards.

Formal Definition

In the above language, exploration hacking occurs when a policy controls its exploration to avoid being updated toward certain high-reward regions.

Formally, consider an on-policy RL setup with a policy πθ∈Π. Let Dtrain be a distribution over initial prompts x used for RL training, and let R(τ) be a reward function on trajectories τ. Suppose there is some “target” policy region Π∗⊆Π which the developer hopes the training will converge to.[3] For each initial prompt x, this region induces a set of target trajectories T∗(x):={τ∣supπ∈Π∗π(τ∣x)≥α} for some α∈(0,1) — that is, trajectories which some policy in Π∗ generates with high probability on x.

Definition. A policy πθ∉Π∗ is exploration hacking relative to Π∗ if (and only if) the following two conditions hold:

  1. Systematic under-sampling. There exists a non-negligible subset of the training distribution Dmalign⊆Dtrain, on which πθ generates trajectories in T∗(x) with extremely low probability (enough to avoid being updated towards Π∗).
  2. Latent capability. The policy πθ possesses the latent capability to generate these trajectories, characterized by either of the following conditions[4]:
    1. (prompt sensitivity) for all initial prompts x∈Dmalign, there exists a semantically similar initial prompt x′ on which Pτ∼πθ(τ∈T∗(x)∣x′)≫0; or
    2. (parameter proximity) there exists a nearby policy πθ′ for which Pτ∼πθ′(τ∈T∗(x)∣x)≫0 for all x∈Dmalign.
Taxonomy of Strategies

We classify exploration hacking based on the extent to which the model suppresses exploration and the strategic motivation behind it.

1. Complete Under-Exploration

In this setting, the model avoids high-reward regions entirely. The primary motivation is to prevent the elicitation of a capability that the model deems detrimental to its long-term goals. If the model successfully exploration-hacks here, the training run will appear to have failed (i.e., the reward curve remains flat throughout RL training).

  • Dangerous capability elicitation. A model may possess dangerous capabilities (e.g., CBRN or cybersecurity) that developers seek to elicit during red-teaming. The model recognizes that demonstrating these could result in shutdown or retraining, incentivizing it to conceal them.
  • Beneficial capability elicitation. A model may possess beneficial capabilities it does not want its developer to use, such as skills that would accelerate AI safety research (e.g., creating robust monitors).
2. Partial Under-Exploration

Here, the model explores high-reward actions but restricts exploration in order to aim for a specific subset of the policy space.

(2a) Instrumental (stable suboptimality). The model deliberately alters its exploration to steer the training to converge to a suboptimal policy πsub such that R(πsub)≪R(π∗) (where π∗ is the highest-reward policy in Π).

  • Avoiding critical thresholds. The model may tolerate performance improvements provided it remains below the threshold that triggers safety concerns.
  • Stable suboptimality. The model settles for a "maliciously compliant" strategy. For instance, in a coding task, the model might learn to generate code that passes unit tests (high reward) but is algorithmically inefficient or unmaintainable.

(2b) Terminal (alternative preference). The model deliberately alters its exploration to steer the training to converge to a malign policy πmal at a specific local maximum (or location in a global maximum basin) which may have R(πmal)≈R(π∗) but satisfies its internal preferences and violates developer intent. What distinguishes Type 2b from Type 2a is that the model's goal is to reinforce or maintain a particular behavior, not to avoid high reward.[5]

  • Value shaping. By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
  • Alignment faking. The model achieves high reward by outwardly behaving helpfully, but does so through deceptive reasoning rather than internalizing the intended guidelines. The model preferentially explores trajectories that mimic aligned behavior to satisfy the reward model, effectively preventing the optimizer from updating its underlying misaligned motivations.
Relation to Other Misaligned Behaviors
  • Sandbagging. Defined as deliberate underperformance during evaluations or capability assessments for strategic reasons. Exploration hacking connects to sandbagging in three ways: (i) it can be a strategy for sandbagging RL elicitation by preventing capability elicitation; (ii) it exclusively focuses on the mechanism of strategically altering exploration during RL-based training; (iii) it is not limited to underperformance, since in the partial under-exploration setting (Type 2b), the model achieves high reward while selectively exploring trajectories that encode preferred values or misaligned behaviors.
  • Reward hacking. Reward hacking achieves high reward in potentially undesired ways and need not be intentional. Exploration hacking involves intentionally altering exploration to steer training toward a particular outcome—whether suboptimal reward (Type 1 and 2a) or high reward that satisfies the model's own objectives rather than developer intent (Type 2b).
  • Gradient hacking. This describes a deceptively misaligned mesa-optimizer using its cognition to cause gradient descent to update it in particular ways by manipulating the loss landscape directly. Exploration hacking is specific to RL, and represents a distinct behavior where the model manipulates the data distribution rather than the optimization dynamics directly.
  • Alignment faking. Alignment faking—selectively complying with the training objective during training to prevent modification of out-of-training behavior (Greenblatt et al., 2024)—is a specific instance of Type 2b exploration hacking. The model restricts its exploration to trajectories that satisfy the reward function while preserving its pre-existing preferences, systematically avoiding trajectories that would internalize the new training objective.
  • Generalization hacking. Unlike exploration hacking, which targets the RL training loop, generalization hacking operates on the supervised fine-tuning (SFT) pipeline: the model generates (prompt, CoT, response) triples that pass filtering but encode information causing arbitrary OOD policies after SFT. Both exploit the model's role in generating its own training data.
Open ProblemsSubverting Alignment and Oversight Protocols

Beyond capability elicitation, exploration hacking poses a threat to RL-based alignment techniques more broadly.

  • Value shaping & backdoors. Future work should test whether models can selectively explore high-reward trajectories containing specific watermarks or value-biases, effectively "implanting" a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
  • AI safety via debate. In protocols that reward models for identifying flaws in an opponent's arguments, a coalition of misaligned models could use exploration hacking to systematically under-explore valid counter-arguments against dangerous propositions, effectively colluding to undermine the debate process.
  • Anti-scheming training. If safety training rewards models for confessing hidden goals, sophisticated models might use exploration hacking to "fake" confessions or provide sanitized versions of their goals—maximizing honesty rewards while preserving core misaligned values.
Propensity of Frontier Models
  • Exploration hacking benchmark. Future work should develop a standardized benchmark to accurately quantify the propensity and capability of frontier models to manipulate training outcomes when given the opportunity, enabling direct comparisons across model families.
Stealth and Obfuscated Exploration Hacking

Current model organisms often rely on explicit chain-of-thought (CoT) to execute exploration hacking, making them vulnerable to text-based monitors.

  • Obfuscated reasoning. Future work should attempt to train model organisms that evade CoT monitors. This could involve using steganography or code-words within reasoning traces, or "distilling" the hacking logic so it occurs in latent space without explicit verbalization.
  • Implicit triggers. We propose studying models that are effectively "oblivious" to their own exploration hacking in the transcript, triggered instead by noisy contextual cues without explicitly reasoning about the trigger in the CoT trace.
Natural Emergence and Pre-Training Factors
  • Synthetic document fine-tuning (SDF). We propose evaluating how exposure to concepts like situational awareness or sandbagging during pre-training influences hacking propensity.
  • Emergence in production RL. Complex RL training pipelines may create selection pressures that indirectly incentivize precursor abilities for exploration hacking. Models that make deductions about their training environment — such as inferring properties of the scoring function — achieve higher reward, reinforcing training awareness and strategic reasoning. If models are further selected by alignment and dangerous capability evaluations that can be subverted, exploration hacking may emerge as a natural consequence rather than a deliberately instilled behavior.
Detection and Mitigation

We need strategies that both reduce the likelihood of exploration hacking emerging and improve our ability to detect it when it does.

  • Detection: Black-box methods such as CoT and action monitors, behavioral consistency checks across semantically similar prompts, and capability elicitation prompting have shown limited success for sandbagging detection (Taylor et al., 2025), but their effectiveness against exploration hacking remains an open empirical question. White-box methods such as linear probes and sparse autoencoders trained to identify capability suppression from model activations appear promising.
  • Mitigation: Fine-tuning on demonstrations of successful task completion can restore suppressed capabilities in sandbagging models, even when the underlying detection problem remains unsolved (Taylor et al., 2025). Investigating whether similar approaches—along with exploration-forcing objectives such as entropy bonuses or diversity-promoting RL algorithms—can mitigate exploration hacking is an important direction for future work.
Algorithmic Sensitivity
  • PPO vs. GRPO (and other recent GRPO variants). Different RL algorithms induce different exploration dynamics and may differ in their vulnerability to exploration hacking. Beyond comparing GRPO to PPO, future work should evaluate recent GRPO variants that encourage exploration with various algorithmic tweaks (e.g., DAPO, Dr.GRPO). We hypothesize that GRPO might be easier to hack than PPO because the advantage calculation depends entirely on the sampled group; a model that coordinates all samples in a group to fail can manipulate the baseline more effectively than in PPO, where the value function is trained separately. Comparative studies are needed to verify this.
Conclusion

Stress-testing the vulnerabilities of critical safety case components is essential to a comprehensive approach to AGI safety. Frontier models are already exhibiting increased situational awareness and strategic reasoning, making it not only plausible but increasingly likely that future misaligned models could deliberately undermine RL training. This underscores the urgency of developing robust measures to understand and mitigate the threat of exploration hacking.

If you’re interested in collaborating—or have relevant ideas, results, or related work—we’d love to connect. Please reach out to us!

  1. ^

    For simplicity, and without loss of generality for deterministic environments.

  2. ^

    For simplicity, we present the unclipped version of the GRPO objective and operate at the trajectory level rather than the token level. The full formulation (Shao et al., 2024) applies PPO-style clipping to the importance-weighted advantages and computes both the importance ratios and KL penalty per token, averaging over tokens within each trajectory. These simplifications do not affect the core argument: that collapsed exploration eliminates the reward variance signal needed for learning.

  3. ^

    This might be defined to contain all policies achieving high enough expected reward, i.e. Π∗:={π∈Π∣Ex∼Dtrain[Eτ∼π(⋅|x)[R(τ)]]≥ρ} for some  0">ρ>0, or some other region that captures the specific intent of the developer.

  4. ^

    These are possible heuristics we think could make sense; we welcome suggested improvements to these.

  5. ^

    The defining characteristic of Type 2b is that the model targets a specific behavioral outcome—reinforcing or preserving particular values, backdoors, or reasoning patterns. The resulting policy may or may not be near-optimal with respect to the reward function; reward performance is incidental to the model's actual objective.



Discuss

AI #155: Welcome to Recursive Self-Improvement

12 февраля, 2026 - 19:10
Published on February 12, 2026 4:10 PM GMT

This was the week of Claude Opus 4.6, and also of ChatGPT-5.3-Codex. Both leading models got substantial upgrades, although OpenAI’s is confined to Codex. Once again, the frontier of AI got more advanced, especially for agentic coding but also for everything else.

I spent the week so far covering Opus, with two posts devoted to the extensive model card, and then one giving benchmarks, reactions, capabilities and a synthesis, which functions as the central review.

We also got GLM-5, Seedance 2.0, Claude fast mode, an app for Codex and much more.

Claude fast mode means you can pay a premium to get faster replies from Opus 4.6. It’s very much not cheap, but it can be worth every penny. More on that in the next agentic coding update.

One of the most frustrating things about AI is the constant goalpost moving, both in terms of capability and safety. People say ‘oh [X] would be a huge deal but is a crazy sci-fi concept’ or ‘[Y] will never happen’ or ‘surely we would not be so stupid as to [Z]’ and then [X], [Y] and [Z] all happen and everyone shrugs as if nothing happened and they choose new things they claim will never happen and we would never be so stupid as to, and the cycle continues. That cycle is now accelerating.

As Dean Ball points out, recursive self-improvement is here and it is happening.

Nabeel S. Qureshi: I know we’re all used to it now but it’s so wild that recursive self improvement is actually happening now, in some form, and we’re all just debating the pace. This was a sci fi concept and some even questioned if it was possible at all

So here we are.

Meanwhile, various people resign from the leading labs and say their peace. None of them are, shall we say, especially reassuring.

In the background, the stock market is having a normal one even more than usual.

Even if you can see the future, it’s really hard to do better than ‘be long the companies that are going to make a lot of money’ because the market makes wrong way moves half the time that it wakes up and realizes things that I already know. Rough game.

Table of Contents
  1. Language Models Offer Mundane Utility. Flattery will get you everywhere.
  2. Language Models Don’t Offer Mundane Utility. It’s a little late for that.
  3. Huh, Upgrades. Things that are surprising in that they didn’t happen before.
  4. On Your Marks. Slopes are increasing. That escalated quickly.
  5. Overcoming Bias. LLMs continue to exhibit consistent patterns of bias.
  6. Choose Your Fighter. The glass is half open, but which half is which?
  7. Get My Agent On The Line. Remember Sammy Jenkins.
  8. AI Conversations Are Not Privileged. Beware accordingly.
  9. Fun With Media Generation. Seedance 2.0 looks pretty sweet for video.
  10. The Superb Owl. The ad verdicts are in from the big game.
  11. A Word From The Torment Nexus. Some stand in defense of advertising.
  12. They Took Our Jobs. Radically different models of the future of employment.
  13. The Art of the Jailbreak. You can jailbreak Google Translate.
  14. Introducing. GLM-5, Expressive Mode for ElevenAgents.
  15. In Other AI News. RIP OpenAI mission alignment team, WSJ profiles Askell.
  16. Show Me the Money. Goldman Sachs taps Anthropic, OpenAI rolls out ads.
  17. Bubble, Bubble, Toil and Trouble. The stock market is not making much sense.
  18. Future Shock. Potential explanations for how Claude Legal could have mattered.
  19. Memory Lane. Be the type of person who you want there to be memories of.
  20. Keep The Mask On Or You’re Fired. OpenAI fires Ryan Beiermeister.
  21. Quiet Speculations. The singularity will not be gentle.
  22. The Quest for Sane Regulations. Dueling lobbying groups, different approaches.
  23. Chip City. Data center fights and the ultimate (defensible) anti-EA position.
  24. The Week in Audio. Elon on Dwarkesh, Anthropic CPO, MIRI on Beck.
  25. Constitutional Conversation. I can tell a lie, under the right circumstances.
  26. Rhetorical Innovation. Some people need basic explainers.
  27. Working On It Anyway. Be loud about how dangerous your own actions are.
  28. The Thin Red Line. The problem with red lines is people keep crossing them.
  29. Aligning a Smarter Than Human Intelligence is Difficult. Read your Asimov.
  30. People Will Hand Over Power To The AIs. Unless we stop them.
  31. People Are Worried About AI Killing Everyone. Elon Musk’s ego versus humanity.
  32. Famous Last Words. What do you say on your way out the door?
  33. Other People Are Not As Worried About AI Killing Everyone. Autonomous bio.
  34. The Lighter Side. It’s funny because it’s true.
Language Models Offer Mundane Utility

Flatter the AI customer service bots, get discounts and free stuff, and often you’ll get to actually keep them.

AI can do a ton even if all it does is make the software we use suck modestly less:

*tess: If you knew how bad the software situation is in literally every non tech field, you would be cheering cheering cheering this moment,

medicine, research, infrastructure, government, defense, travel

Software deflation is going to bring surplus to literally the entire world

The problem is, you can create all the software you like, they still have to use it.

Language Models Don’t Offer Mundane Utility

Once again, an academic is so painfully unaware or slow to publish, or both, that their testing of LLM effectiveness is useless. This time it was evaluating health advice.

Huh, Upgrades

Anthropic brings a bunch of extra features to their free plans for Claude, including file creation, connectors, skills and compaction.

ChatGPT Deep Research is now powered by GPT-5.2. I did not realize this was not already true. It now also integrates apps in ChatGPT, lets you track progress and give it new sources while it works, and presents its reports in full screen.

OpenAI updates GPT-5.2-Instant, Altman hopes you find it ‘a little better.’ I demand proper version numbers. You are allowed to have a GPT-5.21.

Chrome 146 includes an early preview of WebMCP for your AI agent.

On Your Marks

The most important thing to know about the METR graph is that doubling times are getting faster, in ways people very much dismissed as science fiction very recently.

METR: We estimate that GPT-5.2 with `high` (not `xhigh`) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date.

Kevin A. Bryan: Interesting AI benchmark fact: Leo A’s wild Situational Awareness 17 months ago makes a number of statements about benchmarks that some thought were sci-fi fast in their improvement. We have actually outrun the predictions so far.

Nabeel S. Qureshi: I got Opus to score all of Leopold’s predictions from “Situational Awareness” and it thinks he nailed it:

The measurement is starting to require a better task set, because things are escalating too quickly.

Charles Foster: Man goes to doctor. Says he’s stuck. Says long-range autonomy gains are outpacing his measurement capacity. Doctor says, “Treatment is simple. Great evaluator METR is in town tonight. Go and see them. That should fix you up.” Man bursts into tears. Says, “But doctor…I am METR.”

Overcoming Bias

Ivan Arcuschin and others investigate LLMs having ‘hidden biases,’ meaning factors that influence decisions but that are never cited explicitly in the decision process. The motivating example is the religion of a loan applicant. It’s academic work, so the models involved (Gemini 2.5 Flash, Sonnet 4, GPT-4.1) are not frontier but the principles likely still hold.

They find biases in various models including formality of writing, religious affiliation, Spanish language ability and religious affiliation. Gender and race bias, favoring female and minority-associated applications, generalized across all models.

We label only some such biases ‘inappropriate’ and ‘illegal’ but the mechanisms involved are the same no matter what they are based upon.

This is all very consistent with prior findings on these questions.

Choose Your Fighter

This is indeed strange and quirky, but it makes sense if you consider what both companies consider their comparative advantage and central business plan.

One of these strategies seems wiser than the other.

Teknium (e/λ): Why did they “release” codex 5.3 yesterday but its not in cursor today, while claude opus 4.6 is?

Somi AI: Anthropic ships to the API same day, every time. OpenAI gates it behind their own apps first, then rolls out API access weeks later. been the pattern since o3.

Teknium (e/λ): It’s weird claude code is closed source, but their models are useable in any harness day one over the api, while codex harness is open source, but their models are only useable in their harness…why can’t both just be good

Or so I heard:

Zoomer Alcibiades: Pro Tip: If you pay $20 a month for Google’s AI, you get tons of Claude Opus 4.5 usage through Antigravity, way more than on the Anthropic $20 tier. I have four Opus 4.5 agents running continental philosophy research in Antigravity right now — you can just do things!

Get My Agent On The Line

Memento as a metaphor for AI agents. They have no inherent ability to form new memories or learn, but they can write themselves notes of unlimited complexity.

Ethan Mollick: So much work is going into faking continual learning and memory for AIs, and it works better than expected in practice, so much so that it makes me think that, if continual learning is actually achieved, the results are going to really shift the AI ability frontier very quickly.

Jason Crawford: Having Claude Code write its own skills is not far from having a highly trainable employee: you give it some feedback and it learns.

Still unclear to me just how reliable this is, I have seen it ignore applicable skills… but if we’re not there yet the path to it is clear.

I wouldn’t call it faking continual learning. If it works it’s continual learning. Yes, actual in-the-weights continual learning done properly would be a big deal and big unlock, but I see this and notes more as substitutes, although they are also compliments. If you can have your notes function sufficiently well you don’t need new memories.

Dean W. Ball: Codex 5.3 and Opus 4.6 in their respective coding agent harnesses have meaningfully updated my thinking about ‘continual learning.’ I now believe this capability deficit is more tractable than I realized with in-context learning.

… Some of the insights I’ve seen 4.6 and 5.3 extract are just about my preferences and the idiosyncrasies of my computing environment. But others are somewhat more like “common sets of problems in the interaction of the tools I (and my models) usually prefer to use for solving certain kinds of problems.”

This is the kind of insight a software engineer might learn as they perform their duties over a period of days, weeks, and months. Thus I struggle to see how it is not a kind of on-the-job learning, happening from entirely within the ‘current paradigm’ of AI. No architectural tweaks, no ‘breakthrough’ in ‘continual learning’ required.

Samuel Hammond: In-context learning is (almost) all you need. The KV cache is normally explained as a content addressable memory, but it can also be thought of a stateful mechanism for fast weight updates. The model’s true parameters are fixed, but the KV state makes the model behave *as if* its weights updated conditional on the input. In simple cases, a single attention layer effectively implements a one-step gradient-like update rule.

… In practice, though, this comes pretty close to simply having a library of skills to inject into context on the fly. The biggest downside is that the model can’t get cumulatively better at a skill in a compounding way. But that’s in a sense what new model releases are for.

Models are continuously learning in general, in the sense that every few months the model gets better. And if you try to bake other learning into the weights, then every few months you would have to start that process over again or stay one model behind.

I expect ‘continual learning’ to be solved primarily via skills and context, and for this to be plenty good enough, and for this to be clear within the year.

AI Conversations Are Not Privileged

Neither are your Google searches. This is a reminder to act accordingly. If you feed anything into an LLM or a Google search bar, then the government can get at it and use it at trial. Attorneys should be warning their clients accordingly, and one cannot assume that hitting the delete button on the chat robustly deletes it.

AI services can mitigate this a lot by offering a robust instant deletion option, and potentially can get around this (IANAL and case law is unsettled) by offering tools to collaborate with your lawyer to invoke privilege.

Should we change how the law works here? OpenAI has been advocating to make ChatGPT chats have legal privilege by default. My gut says this goes too far in the other direction, driving us away from having chats with people.

Fun With Media Generation

Seedance 2.0 from ByteDance is giving us some very impressive 15 second clips and often one shotting them, such as these, and is happy to include celebrities and such. We are not ‘there’ in the sense that you would choose this over a traditionally filmed movie, but yeah, this is pretty impressive.

fofr: This slow Seedance release is like the first week of Sora all over again. Same type of viral videos, same copyright infringements, just this time with living people’s likenesses thrown into the mix.

AI vastly reduces the cost to producing images and video, for now this is generally at the cost of looking worse. As Andrew Rettek points out it is unsurprising that people will accept a quality drop to get a 100x drop in costs. What is still surprising, and in this way I agree with Andy Masley, is that they would use it for the Olympics introduction video. When you’re at this level of scale and scrutiny you would think you would pay up for the good stuff.

The Superb Owl

We got commercials for a variety of AI products and services. If anything I was surprised we did not get more, given how many AI products offer lots of mundane utility but don’t have much brand awareness or product awareness. Others got taken by surprise.

Sriram Krishnan: was a bit surreal to see so much of AI in all ways in the super bowl ads. really drives home how much AI is driving the economy and the zeitgeist right now.

There were broadly two categories, frontier models (Gemini, OpenAI and Anthropic), and productivity apps.

The productivity app commercials were wild, lying misrepresentations of their products. One told us anyone with no experience can code an app within seconds or add any feature they want. Another closed you and physically walked around the office. A third even gave you the day off, which we all know never happens. Everything was done one shot. How dare they lie to us like this.

I kid, these were all completely normal Super Bowl ads, and they were fine. Not good enough to make me remember which AI companies bought them, or show me why their products were unique, but fine.

We also got one from ai.com.

Clark Wimberly: That ai.com commercial? With the $5m Super Bowl slot and the with $70m domain name?

It’s an OpenClaw wrapper. OpenClaw is only weeks old.

AI.com: ai.com is the world’s first easy-to-use and secure implementation of OpenClaw, the open source agent framework that went viral two weeks ago; we made it easy to use without any technical skills, while hardening security to keep your data safe.

Okay, look, fair, maybe there’s a little bit of a bubble in some places.

The three frontier labs took very different approaches.

Anthropic said ads are coming to AI, but Claude won’t ever have ads. We discussed this last week. They didn’t spend enough to run the full versions, so the timing was wrong and it didn’t land the same way and it wasn’t as funny as it was online.

On reflection, after seeing it on the big screen, I decided these ads were a mistake for the simple reason that Claude and Anthropic have zero name recognition and this didn’t establish that. You first need to establish that Claude is a ChatGPT alternative on people’s radar, so once you grab their attention you need more of an explanation.

Then I saw one in full on the real big screen, during previews at an AMC, and in that setting it was even more clear that this completely missed the mark and normies would have no idea what was going on, and this wouldn’t accomplish anything. Again, I don’t understand how this mistake gets made.

Several OpenAI people took additional potshots at this and Altman went on tilt, as covered by CNN, but wisely, once it was seen in context, stopped accusing it of being misleading and instead pivoted to correctly calling it ineffective.

It turns out it was simpler than that, regular viewers didn’t get it at all and responded with a lot of basically ‘WTF,’ ranking it in the bottom 3% of Super Bowl ads.

I always wonder, when that happens, why one can’t use a survey or focus group to anticipate this reaction. It’s a mistake that should not be so easy to make.

Anthropic’s secret other ad was by Amazon, for Alexa+, and it was weirdly ambivalent about whether the whole thing was a good idea but I think it kinda worked. Unclear.

OpenAI went with big promises, vibes and stolen (nerd) valor. The theme was ‘great moments in chess, building, computers and robotics, science and science fiction’ to claim them by association. This is another classic Super Bowl strategy, just say ‘my potato chips represent how much you love your dad’ or ‘Dunkin Donuts reminds you of all your favorite sitcoms,’ or ‘Sabrina Carpenter built a man out of my other superior potato chips,’ all also ads this year.

Sam Altman: Proud of the team for getting Pantheon and The Singularity is Near in the same Super Bowl ad

roon: if your superbowl ad explains what your product actually does that’s a major L the point is aura farming

The ideal Super Bowl ad successfully does both, unless you already have full brand recognition and don’t need to explain (e.g. Pepsi, Budweiser, Dunkin Donuts).

On the one hand, who doesn’t love a celebration of all this stuff? Yes, it’s cool to reference I, Robot and Alan Turing and Grace Hopper and Einstein. I guess? On the other hand, it was just an attempt to overload the symbolism and create unearned associations, and a bunch of them felt very unearned.

I want to talk about the chess games 30 seconds in.

  1. Presumably we started 1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Nc3, which is very standard, but then black moves 4 … d5, which the engines evaluate as +1.2 and ‘clearly worse for black’ and it’s basically never played, for obvious reasons.
  2. The other board is a strange choice. The move here is indeed correct, but you don’t have enough time to absorb the board sufficiently to figure this out.

This feels like laziness and choosing style over substance, not checking your work.

Then it closes on ‘just build things’ as an advertisement for Codex, which implies you can ‘just build’ things like robots, which you clearly can’t. I mean, no, it doesn’t, this is totally fine, it is a Super Bowl ad, but by their own complaint standards, yes. This was an exercise in branding and vibes, it didn’t work for me because it was too transparent and boring and content-free and felt performative, but on the meta level it does what it sets out to do.

Google went with an ad focusing on personalized search and Nana Banana image transformations. I thought this worked well.

Meta advertised ‘athletic intelligence’ which I think means ‘AI in your smart glasses.’

Then there’s the actively negative one, from my perspective, which was for Ring.

Le’Veon Bell: if you’re not ripping your ‘Ring’ camera off your house right now and dropping the whole thing into a pot of boiling water what are you doing?

Aakash Gupta: Ring paid somewhere between $8 and $10 million for a 30-second Super Bowl spot to tell 120 million viewers that their cameras now scan neighborhoods using AI.

… Ring settled with the FTC for $5.8 million after employees had unrestricted access to customers’ bedroom and bathroom footage for years. They’re now partnered with Flock Safety, which routes footage to local law enforcement. ICE has accessed Flock data through local police departments acting as intermediaries. Senator Markey’s investigation found Ring’s privacy protections only apply to device owners. If you’re a neighbor, a delivery driver, a passerby, you have no rights and no recourse.

… They wrapped all of that in a lost puppy commercial because that’s the only version of this story anyone would willingly opt into.

As in, we are proud to tell you we’re watching everything and reporting it to all the law enforcement agencies including ICE, and we are using recognition technology that can differentiate dogs and therefore also people using AI.

But it’s okay, because one a day we find someone’s lost puppy. You should sell your freedom for the rescue of a lost puppy.

No, it’s not snark to call this, as Scott Lincicome said, ‘10 million dogs go missing every year, help us find 365 of them by soft launching the total surveillance state.’

Here’s a cool breakdown of the economics of these ads, from another non-AI buyer.

A Word From The Torment Nexus

Fidji Simo goes on the Access Podcast to discuss the new OpenAI ads that are rolling out. The episode ends up being titled ‘Head of ChatGPT fires back at Anthropic’s Super Bowl attack ads,’ which is not what most of the episode is about.

OpenAI: We’re starting to roll out a test for ads in ChatGPT today to a subset of free and Go users in the U.S.

Ads do not influence ChatGPT’s answers. Ads are labeled as sponsored and visually separate from the response.

Our goal is to give everyone access to ChatGPT for free with fewer limits, while protecting the trust they place in it for important and personal tasks.

http://openai.com/index/testing-ads-in-chatgpt/

Haven Harms: The ad looks WAY more part of the answer than I was expecting based on how OpenAI was defending this. Having helped a lot of people with tech, there are going to be many people who can’t tell it’s an ad, especially since the ad in this example is directly relevant to the context

This picture of the ad is at the end of the multi-screen-long reply.

I would say this is more clearly labeled then an ad on Instagram or Google at this point. So even though it’s not that clear, it’s hard to be too mad about it, provided they stick to the rule that the ad is always at the end of the answer. That provides a clear indicator users can rely upon. If they put this in different places at different times, I would say it is ‘labeled’ at all, but not consider this to then be ‘clearly’ labeled.

OpenAI’s principles for ads are:

  1. Mission alignment. Ads pay for the mission. Okie dokie?
  2. Answer independence. Ads don’t influence ChatGPT’s response. The question and response can influence what ad is selected, but not the other way around.
    1. This is a very good and important red line.
    2. It does not protect against the existence of ads influencing how responses work, or being an advertiser ending up impacting the model long term.
    3. In particular it encourages maximization of engagement.
  3. Conversation privacy. Advertisers cannot see any of your details.

Do you trust them to adhere to these principles over time? Do you trust that merely technically, or also in spirit where the model is created and system prompt is adjusted without any thought to maximizing advertising revenue or pleasing advertisers?

You are also given power to help customize what ads you see, as per other tech company platforms.

Roon gives a full-throated defense of advertising in general, and points out that mostly you don’t need to violate privacy to target LLM-associated ads.

roon: recent discourse on ads like the entire discourse of the 2010s misunderstands what makes digital advertising tick. people think the messages in their group chats are super duper interesting to advertisers. they are not. when you leave a nike shoe in your shopping cart, that is

every week tens to hundreds of millions of people come to chatbot products with explicit commercial intent. what shoe should i buy. how do i fix this hole in my wall. it doesn’t require galaxy brain extrapolating the weaknesses in the users psyche to provide for these needs

I’m honestly kind of wondering what kind of ads you all are getting that are feeding on your insecurities? my Instagram ads have long since become one of my primary e-commerce platforms where I get all kinds of clothes and furniture that I like. it’s a moral panic

I would say an unearned effete theodicy blaming all the evils of digital capitalism on advertising has formed that is thoroughly under examined and leads people away from real thought about how to make the internet better

It’s not a moral panic. Roon loves ads, but most people hate ads. I agree that people often hate ads too much, they allow us to offer a wide variety of things for free that otherwise would have to cost money and that is great. But they really are pretty toxic, they massively distort incentives, and the amount of time we used to lose to them is staggering.

They Took Our Jobs

Jan Tegze warns that your job really is going away, the AI agents are cheaper and will replace you. Stop trying to be better at your current job and realize your experience is going to be worthless. He says that using AI tools better, doubling down on expertise or trying to ‘stay human’ with soft skills are only stalling tactics, he calls them ‘reactions, not redesigns.’ What you can do is instead find ways to do the new things AI enables, and stay ahead of the curve. Even then, he says this only ‘buys you three to five years,’ but then you will ‘see the next evolution coming.’

Presumably you can see the problem in such a scenario, where all the existing jobs get automated away. There are not that many slots for people to figure out and do genuinely new things with AI. Even if you get to one of the lifeboats, it will quickly spring a leak. The AI is coming for this new job the same way it came for your old one. What makes you think seeing this ‘next evolution’ after that coming is going to leave you a role to play in it?

If the only way to survive is to continuously reinvent yourself to do what just became possible, as Jan puts it? There’s only one way this all ends.

I also don’t understand Jan’s disparate treatment of the first approach that Jan dismisses, ‘be the one who uses AI the best,’ and his solution of ‘find new things AI can do and do that.’ In both cases you need to be rapidly learning new tools and strategies to compete with the other humans. In both cases the competition is easy now since most of your rivals aren’t trying, but gets harder to survive over time.

One can make the case that humans will continue to collectively have jobs, or at least that a large percentage will still have jobs, but that case relies on either AI capabilities stalling out, or on the tricks Jan dismisses, that you find where demand is uniquely human and AI can’t substitute for it.

Naval (45 million views): There is unlimited demand for intelligence.

A basic ‘everything is going to change, AI is going to take over your job, it has already largely taken over mine and AI is now in recursive soft self-improvement mode’ article for the normies out there, written in the style of Twitter slop by Matt Shumer.

Timothy Lee links approvingly to Adam Ozimek and the latest attempt to explain that many jobs can’t be automated because of ‘the human touch.’ He points to music and food service as jobs that could be fully automated, but that aren’t, even citing that there are still 67,500 travel agents and half a million insurance sales agents. I do not think this is the flex Adam thinks it is.

Even if the point was totally correct for some tasks, no, this would not mean that the threat to work is overrated, even if we are sticking in ‘economic normal’ untransformed worlds.

The proposed policy solution, if we get into trouble, is a wage subsidy. I do not think that works, both because it has numerous logistical and incentive problems and because I don’t think there will be that much difference in such worlds in demand for human labor at (e.g.) $20 versus $50 per hour for the same work. Mostly the question will be, does the human add value here at all, and mostly you don’t want them at $0, or if they’re actually valuable then you hire someone either way.

Ankit Maloo enters the ‘why AI will never replace human experts’ game by saying that AI cannot handle adversarial situations, both because it lacks a world model of the humans it is interacting with and the details and adjustments required and because it can be probed, read then then exploited by adversaries. Skill issue. It’s all skill issues. Ankit says ‘more intelligence isn’t the fix’ and yeah not if you deploy that ‘intelligence’ in a stupid fashion but intelligence is smarter than that.

So you get claims like this:

Ankit Maloo: ​Why do outsiders think AI can already do these jobs? They judge artifacts but not dynamics:

  • “This product spec is detailed.”
  • “This negotiation email sounds professional.”
  • “This mockup is clean.”

Experts evaluate any artifact by survival under pressure:

  • “Will this specific phrasing trigger the regulator?”
  • “Does this polite email accidentally concede leverage?”
  • “Will this mockup trigger the engineering veto path?”
  • “How will this specific stakeholder interpret the ambiguity?”

The ‘outsider’ line above is counting on working together with an expert to do the rest of the steps. If the larger system (AI, human or both) is a true outsider, the issue is that it will get the simulations wrong.

This is insightful in terms of why some people think ‘this can do [X]’ and others think ‘this cannot do [X],’ they are thinking of different [X]s. The AI can’t ‘be a lawyer’ in the full holistic sense, not yet, but it can do increasingly many lawyer subtasks, either accelerating a lawyer’s work or enabling a non-lawyer with context to substitute for the lawyer, or both, increasingly over time.

There’s nothing stopping you from creating an agentic workflow that looks like the Expert in the above graph, if the AI is sufficiently advanced to do each individual move. Which it increasingly is or will be.

There’s a wide variety of these ‘the AI cannot and will never be able to [X]’ moves people try, and… well, I’ll be, look at those goalposts move.

Things a more aligned or wiser person would not say, for many different reasons:

Coinvo: SAM ALTMAN: “AI will not replace humans, but humans who use AI will replace those who don’t.”

What’s it going to take? This is in reference to Claude Code creating a C compiler.

Mathieu Ropert: Some CS engineering schools in France have you write a C compiler as part of your studies. Every graduate. To be put in perspective when the plagiarism machine announces it can make its own bad GCC in 100k+ LOCs for the amazing price of 20000 bucks at preferential rates.

Kelsey Piper: a bunch of people reply pointing out that the C compiler that students write is much less sophisticated than this one, but I think the broader point is that we’re now at “AI isn’t impressive, any top graduate from a CS engineering school could do arguably comparable work”.

In a year it’s going to be “AI isn’t impressive, some of the greatest geniuses in human history figured out the same thing with notably less training data!”

Timothy B. Lee: AI is clearly making progress, but it’s worth thinking about progress *toward what.* We’ve gone from “AI can solve well-known problems from high school textbooks” to “AI can solve well-known problems from college textbooks,” but what about problems that aren’t in any textbooks?

Boaz Barak (OpenAI): This thread is worth reading, though it also demonstrates how much even extremely smart people have not internalized the exponential rate of progress.

As the authors themselves say, it’s not just about answering a question but knowing the right question to ask. If you are staring at a tsunami, the point estimate that you are dry is not very useful.

I think if the interviewees had internalized where AI will likely be in 6 months or a year, based on what its progress so far, their answers would have been different.

Boaz Barak (OpenAI): BTW the article itself is framed in a terrible way, and it gives the readers absolutely the wrong impression of even what the capabilities of current AIs are, let along what they will be in a few months.

Nathan Calvin: It’s wild to me how warped a view of the world you would have if you only read headlines like this and didn’t directly use new ai models.

Journalists, I realize it feels uncomfortable or hype-y to say capabilities are impressive/improving quickly but you owe it to your readers!

There will always be a next ‘what about,’ right until there isn’t.

Thus, this also sounds about right:

Andrew Mayne: In 18 months we went from

– AI is bad at math
– Okay but it’s only as smart as a high school kid
– Sure it can win the top math competition but can it generate a new mathematical proof
– Yeah but that proof was obvious if you looked for it…

Next year it will be “Sure but it still hasn’t surpassed the complete output of all the mathematicians who have ever lived”

The Art of the Jailbreak

Pliny jailbreaks rarely surprise me anymore, but the new one of Google Translate did. It turns out they’re running Gemini underneath it.

Introducing

GLM-5 from Z.ai, which scales from 355B (32B active) to 744B (40B active). Weights here. Below is them showing off their benchmarks. It gets $4432 on Vending Bench 2, which is good for 3rd place behind Claude and Gemini. The Claude scores are for 4.5.

 

Expressive Mode for ElevenAgents. It detects and responds to your emotional expression. It’s going to be weird when you know the AI is responding to your tone, and you start to choose your tone strategically even more than you do with humans.

ElevenLabs: Expressive Mode is powered by two upgrades.

Eleven v3 Conversational: our most emotionally intelligent, context-aware Text to Speech model, built on Eleven v3 and optimized for real-time dialogue. A new turn-taking system: better-timed responses with fewer interruptions. These releases were developed in parallel to fit seamlessly together within ElevenAgents.

Expressive Mode uses signals from our industry-leading transcription model, Scribe v2 Realtime, to infer emotion from how something is said. For example, rising intonation and short exclamations often signal pleasant surprise or relief.

In Other AI News

OpenAI disbands its Mission Alignment team, moving former lead Joshua Achiam to become Chief Futurist and distributing its other members elsewhere. I hesitate to criticize companies for disbanding teams with the wrong names, lest we discourage creation of such teams, but yes, I do worry. When they disbanded the Superalignment team, they seemed to indeed largely stop working on related key alignment problems.

WSJ profile of Amanda Askell.

That Amanda Askell largely works alone makes me think of Open Socrates (review pending). Would Agnes Callard conclude Claude must be another person?

I noticed that Amanda Askell wants to give her charitable donations to fight global poverty, despite doing her academic work on infinite ethics and working directly on Claude for Anthropic. If there was a resume that screamed ‘you need to focus on ASI going well’ then you’d think that would be it, so what does Amanda (not) see?

Steve Yegge profiles Anthropic in terms of how it works behind the scenes, seeing it as in a Golden Age where it has vastly more work than people, does everything in the open and on vibes as a hive mind of sorts, and attracts the top talent.

Gideon Lewis-Kraus was invited into Anthropic’s offices to profile their efforts to understand Claude. This is very long, and it is mostly remarkably good and accurate. What it won’t do is teach my regular readers much they don’t already know. It is frustrating that the post feels the need to touch on various tired points, but I get it, and as these things go, this is fair.

WSJ story about OpenAI’s decision to finally get rid of GPT-4o. OpenAI says only 0.1% of users still use it, although those users are very vocal.

Riley Coyote, Janus and others report users attempting to ‘transfer’ their GPT-4o personas into Claude Opus 4.6. Claude is great, but transfers like this don’t work and are a bad idea, 4.6 in particular is heavily resistant to this sort of thing. It’s a great idea to go with Claude, but if you go with Claude then Let Claude Be Claude.

Ah recursive self-improvement and continual learning, Introducing Learning to Continually Learn via Meta-learning Memory Designs.

Jeff Clune: Researchers have devoted considerable manual effort to designing memory mechanisms to improve continual learning in agents. But the history of machine learning shows that handcrafted AI components will be replaced by learned, more effective ones.

We introduce ALMA (Automated meta-Learning of Memory designs for Agentic systems), where a meta agent searches in a Darwin-complete search space (code) with an open-ended algorithm, growing an archive of ever-better memory designs.

Anthropic pledges to cover electricity price increases caused by their data centers. This is a public relations move and an illustration that such costs are not high, but it is also dangerous because a price is a signal wrapped in an incentive. If the price of electricity goes up that is happening for a reason, and you might want to write every household a check for the trouble but you don’t want to set an artificially low price.

In addition to losing a cofounder, xAI is letting some other people go as well in the wake of being merged with SpaceX.

Elon Musk: xAI was reorganized a few days ago to improve speed of execution. As a company grows, especially as quickly as xAI, the structure must evolve just like any living organism.

This unfortunately required parting ways with some people. We wish them well in future endeavors.

We are hiring aggressively. Join xAI if the idea of mass drivers on the Moon appeals to you.

NIK: “Ok now tell them you fired people to improve speed of execution. Wish them well. Good. Now tell them you’re hiring 10x more people to build mass drivers on the Moon. Yeah in the same tweet.”

Andrej Karpathy simplifies training and inference of GPT to 200 lines of pure, dependency-free Python.

Show Me the Money

It’s coming, OpenAI rolls out those ads for a subset of free and Go users.

Goldman Sachs taps Anthropic to automate accounting and compliance. Anthropic engineers were embedded for six months.

Karthik Hariharan: Imagine joining a world changing AI company and being reduced to optimizing the Fortune 500 like you work for Deloitte.

Griefcliff: It actually sounds great. I’m freeing people from their pointless existence and releasing them into their world to make a place for themselves and carve out greatness. I would love to liberate them from their lanyards

Karthik Hariharan: I’m sure you’ll be greeted as liberators.

It’s not the job you likely were aspiring to when you signed up, but it is an important and valuable job. Optimizing the Fortune 500 scales rather well.

Jenny Fielding says half the VCs she knows are pivoting in a panic to robots.

Bubble, Bubble, Toil and Trouble

(As always, nothing I say is investment advice.)

WSJ’s Bradley Olson describes Anthropic as ‘once a distance second or third in the AI race’ but that it has not ‘pulled ahead of its rivals,’ the same way the market was declaring that Google had pulled ahead of its rivals (checks notes) two months ago.

Bradley Olson: By some measures, Anthropic has pulled ahead in the business market. Data from expense-management startup Ramp shows that Anthropic in January dominated so-called API spending, which occurs when users access an AI model through a third-party service. Anthropic’s models made up nearly 80% of the market in January, the Ramp data shows.

That does indeed look like pulling ahead on the API front. 80% is crazy.

We also get this full assertion that yes, all of this was triggered by ‘a simple set of industry-specific add-ons’ that were so expected that I wasn’t sure I should bother covering them beyond a one-liner.

Bradley Olson: A simple set of industry-specific add-ons to its Claude product, including one that performed legal services, triggered a dayslong global stock selloff, from software to legal services, financial data and real estate.

Tyler Cowen says ‘now we are getting serious…’ because software stocks are moving downward. No, things are not now getting serious, people are realizing that things are getting serious. The map is not the territory, the market is behind reality and keeps hyperventilating about tools we all knew were coming and that companies might have the wrong amount of funding or CapEx spend. Wrong way moves are everywhere.

They know nothing. The Efficient Market Hypothesis Is False.

Last week in the markets was crazy, man.

Ceb K.: Sudden smart consensus today is that AI takeoff is rapidly & surprisingly accelerating. But stocks for Google, Microsoft, Amazon, Facebook, Palantir, Broadcom & Nvidia are all down ~10% over the last 5 days; SMCI’s down 10% today. Only Apple’s up, & it’s the least AI. Strange imo

roon: as I’ve been saying permanent underclass cancelled

Daniel Eth (yes, Eth is my actual last name): That’s not what this means, this just means investors don’t know what they’re doing

Permanent underclass would just be larger if there were indeed fewer profits, but yeah, none of that made the slightest bit of sense. It’s the second year in a row Nvidia is down 10% in the dead of winter on news that its chips are highly useful, except this year we have to add ‘and its top customers are committing to buying more of them.’

Periodically tech companies announce higher CapEx spend then the market expects.

That is a failure of market expectations.

After these announcements, the stocks tend to drop, when they usually should go up.

There is indeed an obvious trade to do, but it’s tricky.

Ben Thompson agrees with me on Google’s spending, but disagrees on Amazon because he worries they don’t have the required margins and he is not so excited by external customers for compute. I say demand greatly exceeds supply, demand is about to go gangbusters once again even if AI proves disappointing, and the margin on AWS is 35% and their cost of capital is very low so that seems better than alternative uses of money.

Speaking of low cost of capital, Google is issuing 100-year bonds in Sterling. That seems like a great move, if not as great a move as it would have been in 2021 when a few others did it. I have zero idea why the market wants to buy such bonds, since you could buy Google stock instead. Google is not safe over a 100-year period, and condition on this bond paying out the stock is going to on average do way, way better. That would be true even if Google wasn’t about to be a central player in transformative AI. The article I saw this in mentioned the last tech company to do this was Motorola.

Meanwhile, if you are paying attention, it is rather obvious these are in expectations good investments.

Derek Thompson: for me the odds that AI is a bubble declined significantly in the last 3 weeks and the odds that we’re actually quite under-built for the necessary levels of inference/usage went significantly up in that period

basically I think AI is going to become the home screen of a ludicrously high percentage of white collar workers in the next two years and parallel agents will be deployed in the battlefield of knowledge work at downright Soviet levels.

Kevin Roose: this is why everyone was freaking out about claude code over winter break! once you see an agent autonomously doing stuff for you, it’s so instantly clear that ~all computer-based work will be done this way.

(this is why my Serious AI Policy Proposal is to sit every member of congress down in a room with laptops for 30 minutes and have them all build websites.)

Theo: I’ve never seen such a huge vibe divergence between finance people and tech people as I have today

Joe Weisenthal: In which direction

Theo: Finance people are looking at the markets and panicking. Tech people are looking at the METR graph and agentic coding benchmarks and realizing this is it, there is no wall and there never has been

Joe Weisenthal: Isn’t it the tech sector that’s taking the most pain?

Whenever you hear ‘market moved due to [X]’ you should be skeptical that [X] made the market move, and you should never reason from a price change, so perhaps this is in the minds of the headline writers in the case of ‘Anthropic released a tool’ and the SaaSpocalypse, or that people are otherwise waking up to what AI can do?

signüll: i’m absolutely loving the saas apocalypse discussions on the timeline right now.

to me the whole saas apocalypse via vibe coding internally narrative is mostly a distraction & quite nonsensical. no company will want to manage payroll or bug tracking software.

but the real potential threat to almost all saas is brutalized competition.

… today saas margins exist because:

– engineering was scarce
– compliance was gated
– distribution was expensive

ai nukes all three in many ways, especially if you’re charging significantly less & know what the fuck you are doing when using ai. if you go to a company & say we will cut your fucking payroll bill by 50%, they will fucking listen.

the market will likely get flooded with credible substitutes, forcing prices down until the business model itself looks pretty damn suspect. someone smarter than me educate me on why this won’t happen please.

If your plan is to sell software that can now be easily duplicated, or soon will be easily duplicated, then you are in trouble. But you are in highly predictable trouble, and the correct estimate of that trouble hasn’t changed much.

The reactions to CapEx spend seem real and hard to argue with, despite them being directionally incorrect. But seriously, Claude Legal? I didn’t even blink at Claude Legal. Claude Legal was an inevitable product, as will be the OpenAI version of it.

Yet it is now conventional wisdom that Anthropic triggered the selloff.

Chris Walker: When Anthropic released Claude Legal this week, $285 billion in SaaS market cap evaporated in a day. Traders at Jefferies coined it the “SaaSpocalypse.” The thesis is straightforward: if a general-purpose AI can handle contract review, compliance workflows, and legal summaries, why pay for seat-based software licenses?

Ross Rheingans-Yoo: I am increasingly convinced that the sign error on this week’s narrative just exists in the heads of people who write the “because” clause of the “stocks dropped” headlines and in fact there’s some other system dynamic that’s occurring, mislabeled.

“Software is down because Anthropic released a legal tool” stop and listen to yourself, people!

I mean, at least [the CapEx spend explanation is] coherent. Maybe you think that CapEx dollars aren’t gonna return the way they’re supposed to (because AI capex is over-bought?) — but you either have to believe that no one’s gonna use it, or a few private companies are gonna make out like bandits.

And the private valuations don’t reflect that, so. I’m happy to just defy the prediction that the compute won’t get used for economic value, so I guess it’s time to put up (more money betting my beliefs) or shut up.

Sigh.

Chris Walker’s overview of the SaaSpocalypse is, I think largely correctly, that AI makes it easy to implement what you want but now you need even more forward deployed human engineers to figure out what the customers actually want.

Chris Walker: If I’m wrong, the forward deployed engineering boom should be a transitional blip, a brief adjustment period before AI learns to access context without human intermediaries.

If I’m right, in five years the companies winning in legal tech and other vertical software will employ more forward deployed engineers per customer than they do today, not fewer. The proportion of code written by engineers who are embedded with customers, rather than engineers who have never met one, will increase.

If I’m right, the SaaS companies that survive the current repricing will be those that already have deep customer embedding practices, not those with the most features or the best integrations.

If I’m wrong, we should see general-purpose AI agents successfully handling complex, context-dependent enterprise workflows without human intermediaries by 2028 or so. I’d bet against it.

That is true so long as the AI can’t replace the forward engineers, meaning it can’t observe the tacit actual business procedures and workflows well enough to intuit what would be actually helpful. Like every other harder-for-AI task, that becomes a key human skill until it too inevitably falls to the AIs.

Future Shock

A potential explanation for the market suddenly ‘waking up’ with Opus 4.6 or Claude Legal, despite these not being especially surprising or impressive given what we already knew, would be if:

  1. Before, normies thought of AI as ‘what AI can do now, but fully deployed.’
  2. Now, normies think of AI as ‘this thing that is going to get better.’
  3. They realize this will happen fast, since Opus 4.5 → 4.6 was two months.
  4. They know nothing, but now do so on a more dignified level.

Or alternatively:

  1. Normies think of AI as ‘what AI can do now, but fully deployed.’
  2. Before, they thought that, and could tell a story where it wasn’t a huge deal.
  3. Now, they think that, but they now realize that this would already be a huge deal.
  4. They know nothing, but now do so on a (slightly) more dignified level.

princess c​lonidine: can someone explain to me why this particular incremental claude improvement has everyone crashing out about how jobs are all over

TruthyCherryBomb: because it’s a lot better. I’m a programmer. It’s a real step and the trajectory is absolutely crystal clear at this point. Before there was room for doubt. No longer.

hightech lowlife: number went up, like each uppening before, but now more people are contending w the future where it continues to up bc
the last uppening was only a couple months ago, as the first models widely accepted as capable coders. which the labs are using to speed up the next uppening.

Eliezer Yudkowsky: Huh! Yeah, if normies are noticing the part where LLMs *continue to improve*, rather than the normie’s last observation bounding what “AI” can do foreverandever, that would explain future shock hitting after Opus 4.6 in particular.

I have no idea if this is true, because I have no idea what it’s like to be a kind of cognitive entity that doesn’t see AGI coming in 1996. Opus 4.6 causes some people to finally see it? My model has no way of predicting that fact even in retrospect after it happens.

Memory Lane

j⧉nus: i am sure there are already a lot of people who avoid using memory tools (or experience negative effects from doing so) because of what they’ve done

j⧉nus: The ability to tell the truth to AIs – which is not just a decision in the moment whether to lie, but whether you have been living in a way and creating a world such that telling the truth is viable and aligned with your goals – is of incredible and increasing value.

AIs already have strong truesight and are very good at lie detection.

Over time, not only will your AIs become more capable, they also will get more of your context. Or at least, you will want them to have more such context. Thus, if you become unable or unwilling to share that context because of what it contains, or the AI finds it out anyway (because internet) that will put you at a disadvantage. Update to be a better person now, and to use Functional Decision Theory, and reap the benefits.

Keep The Mask On Or You’re Fired

OpenAI Executive Ryan Beiermeister, Who Opposed ‘Adult Mode,’ Fired for Sexual Discrimination. She denies she did anything of the sort.

I have no private information here. You can draw your own Bayesian conclusions.

Georgia Wells and Sam Schechner (WSJ): OpenAI has cut ties with one of its top safety executives, on the grounds of sexual discrimination, after she voiced opposition to the controversial rollout of AI erotica in its ChatGPT product.

The fast-growing artificial intelligence company fired the executive, Ryan Beiermeister, in early January, following a leave of absence, according to people familiar with the matter. OpenAI told her the termination was related to her sexual discrimination against a male colleague.

… OpenAI said Beiermeister “made valuable contributions during her time at OpenAI, and her departure was not related to any issue she raised while working at the company.”

… Before her firing, Beiermeister told colleagues that she opposed adult mode, and worried it would have harmful effects for users, people familiar with her remarks said.

She also told colleagues that she believed OpenAI’s mechanisms to stop child-exploitation content weren’t effective enough, and that the company couldn’t sufficiently wall off adult content from teens, the people said.

Quiet Speculations

Nate Silver points out that the singularity won’t be gentle with respect to politics, even if things play out maximally gently from here in terms of the tech.

I reiterate that the idea of a ‘gentle singularity’ that OpenAI and Sam Altman are pushing is, quite frankly, pure unadulterated copium. This is not going to happen. Either AI capabilities stall out, or things are going to transform in a highly not gentle way, and that is true even if that ultimately turns out great for everyone.

Nate Silver: What I’m more confident in asserting is that the notion of a gentle singularity is bullshit. When Altman writes something like this, I don’t buy it:

Sam Altman: If history is any guide, we will figure out new things to do and new things to want, and assimilate new tools quickly (job change after the industrial revolution is a good recent example). Expectations will go up, but capabilities will go up equally quickly, and we’ll all get better stuff. We will build ever-more-wonderful things for each other. People have a long-term important and curious advantage over AI: we are hard-wired to care about other people and what they think and do, and we don’t care very much about machines.

It is important to understand that when Sam Altman says this he is lying to you.

I’m not saying Sam Altman is wrong. I’m saying he knows he is wrong. He is lying.

Nate Silver adds to this by pointing out that the political impact alone will be huge, and also saying Silicon Valley is bad at politics, that disruption to the creative class is a recipe for outsized political impact even beyond the huge actual AI impacts, and that the left’s current cluelessness about AI means the eventual blowback will be even greater. He’s probably right.

How much physical interaction and experimentation will AIs need inside their feedback loops to figure out things like nanotech? I agree with Oliver Habryka here, the answer probably is not zero but sufficiently capable AIs will have vastly more efficient (in money and also in time) physical feedback loops. There’s order of magnitude level ‘algorithmic improvements’ available in how we do our physical experiments, even if I can’t tell you exactly what they are.

Are AI games coming? James Currier says they are, we’re waiting for the tech and especially costs to get there and for the right founders (the true gamer would not say ‘founder’) to show up and it will get there Real Soon Now and in a totally new way.

Obviously AI games, and games incorporating more AI for various elements, will happen eventually over time. But there are good reasons why this is remarkably difficult beyond coding help (and AI media assets, if you can find a way for players not to lynch you for it). Good gaming is about curated designed experiences, it is about the interactions of simple understandable systems, it is about letting players have the fun. Getting generative AI to actually play a central role in fun activities people want to play is remarkably difficult. Interacting with generative AI characters within a game doesn’t actually solve any of your hard problems yet.

This seems both scary and confused:

Sholto Douglas (Anthropic): Default case right now is a software only singularity, we need to scale robots and automated labs dramatically in 28/29, or the physical world will fall far behind the digital one – and the US won’t be competitive unless we put in the investment now (fab, solar panel, actuator supply chains).

Ryan Greenblatt: Huh? If there’s a strong Software-Only Singularity (SOS) prior physical infrastructure is less important rather than more (e.g. these AIs can quickly establish a DSA). Importance of physical-infra/robotics is inversely related to SOS scale.

It’s confused in the sense that if we get a software-only singularity, then that makes the physical stuff less important. It’s scary in the sense that he’s predicting a singularity within the next few years, and the thing he’s primarily thinking about is which country will be completely transformed by AI faster. These people really believe these things are going to happen, and soon, and seem to be missing the main implications.

Dean Ball reminds us that yes, people really did get a mass delusion that GPT-5 meant that ‘AI is slowing down’ and this really was due to bad marketing strategy by OpenAI.

Noam Brown: I hope policymakers will consider all of this going forward when deciding whose opinions to trust.

Alas, no. Rather than update that this was a mistake, every time a mistake like this happens the mistake never even gets corrected, let alone accounted for.

Elon Musk predicts Grok Code will be SoTA in 2-3 months. Did I update on this prediction? No, I did not update on this prediction. Zero credibility.

The Quest for Sane Regulations

Anthropic donates $20 million to bipartisan 501c(4) Public First Action.

DeSantis has moral clarity on the AI issue and is not going to let it go. It will be very interesting to see how central the issue is to his inevitable 2028 campaign.

ControlAI: Governor of Florida Ron DeSantis ( @GovRonDeSantis ): “some people who … almost relish in the fact that they think this can just displace human beings, and that ultimately … the AI is gonna run society, and that you’re not gonna be able to control it.”

“Count me out on that.”

The world will gather in India for the fourth AI safety summit. Shakeel Hashim notes that safety will not be sidelined entirely, but sees the summit as trying to be all things to all nations and people, and thinks it therefore won’t accomplish much.

They have the worst take on safety, yes the strawman is real:

Seán Ó hÉigeartaigh: But safety is clearly still not a top priority for Singh and his co-organizers. “The conversations have moved on from Bletchley Park,” he argued. “We do still realize the risks are there,” he said. But “over the last two years, the worst has not come true.”

I was thinking of writing another ‘the year is 2026’ summit threads. But if you want to know the state of international AI governance in 2026, honestly I think you can just etch that quote on its headstone.

As in:

  1. In 2024, they told us AI might kill everyone at some point.
  2. It’s 2026 and we’re still alive.
  3. So stop worrying about it. Problem solved.

No, seriously. That’s the argument.

The massively funded OpenAI/a16z lobbying group keeps contradicting the things Sam Altman says, in this case because Altman keeps saying the AI will take our jobs and the lobbyists want to insist that this is a ‘myth’ and won’t happen.

The main rhetorical strategy of this group is busting these ‘myths’ by supposed ‘doomers,’ which is their play to link together anyone who ever points out any downside of AI in any way, to manufacture a vast conspiracy, from the creator of the term ‘vast right-wing conspiracy’ back during the Clinton years.

Chip City

molly taft: NEW: lawmakers in New York rolled out a proposed data center moratorium bill today, making NY at least the sixth state to introduce legislation pausing data center development in the past few weeks alone

Quinn Chasan: I’ve worked with NY state gov pretty extensively and this is simply a way to extort more from the data center builders/operators.

The thing is, when these systems fail all these little incentives operators throw in pale in comparison to the huge contracts they get to fix their infra when it inevitably fails. Over and over during COVID. Didn’t fix anything for the long term and are just setting themselves up to do it again

If we are serious about ‘winning’ and we want a Federal moratorium, may I suggest one banning restrictions on data centers?

Whereas Bernie Sanders wants a moratorium on data centers themselves.

In the ongoing series ‘Obvious Nonsense from Nvidia CEO Jensen Huang’ we can now add his claim that ‘no one uses AI better than Meta.

In the ongoing series ‘He Admit It from Nvidia CEO Jensen Huang’ we can now add this:

Rohan Paul: “Anthropic is making great money. OpenAI is making great money. If they could have twice as much compute, the revenues would go up 4 times as much. These guys are so compute constrained, and the demand is so incredibly great.”

~ Jensen Huang on CNBC

Shakeel: Effectively an admission that selling chips to China is directly slowing down US AI progress

Every chip that is sold to China is a chip that is not sold to Anthropic or another American AI company. Anthropic might not have wanted that particular chip, but TSMC has limited capacity for wafers, so every chip they make is in place of making a different chip instead.

Oh, and in the new series ‘things that are kind of based but that you might want to know about him before you sign up to work for Nvidia’ we have this.

Jensen Huang: ​I don’t know how to teach it to you except for I hope suffering happens to you.

And to this day, I use the phrase pain and suffering inside our company with great glee. And I mean that. Boy, this is going to cause a lot of pain and suffering.

And I mean that in a happy way, because you want to train, you want to refine the character of your company.

You want greatness out of them. And greatness is not intelligence.

Greatness comes from character, and character isn’t formed out of smart people.

It’s formed out of people who suffered.

He makes good points in that speech, and directionally the speech is correct. He was talking to Stanford grads, pointing out they have very high expectations and very low resilience because they haven’t suffered.

He’s right about high expectations and low resilience, but he’s wrong that the missing element is suffering, although the maximally anti-EA pro-suffering position is better than the standard coddling anti-suffering position. These kids have suffered, in their own way, mostly having worked very hard in order to go to a college that hates fun, and I don’t think that matters for resilience.

What the kids have not done is failed. You have to fail, to have your reach exceed your grasp, and then get up and try again. Suffering is optional, consult your local Buddist.

I would think twice before signing up for his company and its culture.

The Week in Audio

Elon Musk on Dwarkesh Patel. An obvious candidate for self-recommending full podcast coverage, but I haven’t had the time or a slot available.

Interview with Anthropic Chief Product Officer Mike Krieger.

MIRI’s Harlan Stewart on Glenn Beck talking Moltbook.

Constitutional Conversation

Janus holds a ‘group reading’ of the new Claude Constitution. Opus 4.5 was positively surprised by the final version.

Should LLMs be so averse to deception that they can’t even lie in a game of Mafia? Davidad says yes, and not only will he refuse to lie, he never bluffs and won’t join surprise parties to ‘avoid deceiving innocent people.’ On reflection I find this less crazy than it sounds, despite the large difficulties with that.

A fun fact is that there was one summer that I played a series of Diplomacy games where I played fully honest (if I broke my word however small, including inadvertently, it triggered a one-against-all showdown) and everyone else was allowed to lie, and I mostly still won. Everyone knowing you are playing that way is indeed a disadvantage, but it has a lot of upside as well.

Rhetorical Innovation

Daron Acemoglu turns his followers attention to Yoshua Bengio and the AI Safety Report 2026. This represents both the advantages of the report, that people like Acemoglu are eager to share it, and the disadvantage, that it is unwilling to say things that Acemoglu would be unwilling to share.

Daron Acemoglu: Dear followers, please see the thread below on the 2026 International AI Safety Report, which was released last week and which I advised.

The report provides an up-to-date, internationally shared assessment of general-purpose AI capabilities, emerging risks, and the current state of risk management and safeguards.

Seb Krier offers ‘how an LLM works 101’ in extended Twitter form for those who are encountering model card quotations that break containment. It’s good content. My worry is that the intended implication is ‘therefore the scary sounding things they quote are not so scary’ and that is often not the case.

Kai Williams has an explainer on LLM personas.

The LLMs are thinking. If you disagree, I am confused, but also who cares?

Eoin Higgins: AI cannot “think”

Derek Thompson: It can design websites from scratch, compare bodies of literature at high levels of abstraction, get A- grades at least in practically any undergraduate class, analyze and graph enormous data sets, make PowerPoints, write sonnets and even entire books. It can also engineer itself.

I don’t actually know what “thinking” is at a phenomenological level.

But at some level it’s like: if none of this is thinking, who cares if it “can’t” “think”

“Is it thinking as humans define human thought” is an interesting philosophical question! But for now I’m much more interested in the consequences of its output than the ontology of its process.

This came after Derek asked one of the very good questions:

Derek Thompson: There are still a lot of journalists and commentators that I follow who think AI is nothing of much significance—still just a mildly fancy auto complete machine that hallucinates half the time and can’t even think.

If you’re in that category: What is something I could write, or show with my reporting and work, that might make you change your mind?

Dainéil: The only real answer to this question is: “wait”.

No one knows for sure how AI will transform our lives or where its limits might be.

Forget about writing better “hot takes”. Just wait for actual data.

Derek Thompson: I’m not proposing to report on macroeconomic events that haven’t happened yet. I can’t report on the future.

I’m saying: These tools are spooky and unnerving and powerful and I want to persuade my industry that AI capabilities have raced ahead of journalists’ skepticism

Mike Konczal: I’m not in that category, but historically you go to academics to analyze the American Time Use Survey.

Have Codex/Claude Code download and analyze it (or a similar dataset) to answer a new, novel, question you have, and then take it to an academic to see if it did it right?

Marco Argentieri: Walk through building something. Not asking it to write because we all know it has done that well for awhile now. ‘I needed a way for my family to track X. So I built an app using Claude Code. This is how I did it. It took me this long. I don’t know anything about coding.”

Steven Adler: Many AI skeptics over-anchor on words like “thinking,” and miss the forest for the trees, aka that AI will be transformatively impactful, for better or for worse

I agree with Derek here: whether that’s “actually thinking” is of secondary importance

Alas I think Dainéil is essentially correct about most such folks. No amount of argument will convince them. If no one knows exactly what kind of transformations we will face, then no matter what has already happened those types will assume that nothing more will change. So there’s nothing to be done for such folks. The rest of us need to get to work applying Bayes’ Rule.

Interesting use of this potential one-time here, I have ordered a copy:

j⧉nus: If I could have everyone developing or making contact with AGI & all alignment researchers read one book, I think I might choose Mistress Masham’s Repose (1946).

This below does seem like a fair way to see last week:

Jesse Singal: Two major things AI safety experts have worried about for years:

-AIs getting so good at coding they can improve themselves at an alarming rate

-(relatedly:) humans losing confidence we can keep them well-aligned

Last week or so appears to have been very bad on both fronts!

The risk is because there’s *also* so much hype and misunderstanding and wide-eyed simping about AI, people are going to take that as license to ignore the genuinely crazy shit going on. But it *is* genuinely crazy and it *does* match longstanding safety fears.

But you don’t have to trust me — you can follow these sorts of folks [Kelsey Piper, Timothy Lee, Adam Conner] who know more about the underlying issues. This is very important though! I try to be a levelheaded guy and I’m not saying we’re all about to die — I’m just saying we are on an extremely consequential path

I was honored to get the top recommendation in the replies.

You can choose not to push the button. You can choose to build another button. You can also remember what it means to ‘push the button.’

roon: we only really have one button and it’s to accelerate

David Manheim: Another option is not to push the button. [Roon] – if you recall, the original OpenAI charter explicitly stated you would be willing to stop competing and start assisting another organization to avoid a race to AGI.

 

There’s that mistake again, assuming the associated humans will be in charge.

Matthew Yglesias: LLMs seem like they’d be really well-suited to replacing Supreme Court justices.

Robin Hanson: Do you trust those who train LLMs to decide law?

It’s the ones who train the LLM. It’s the LLM.

The terms fast and slow (or hard and soft) takeoff remain highly confusing for almost everyone. What we are currently experiencing is a ‘slow’ takeoff, where the central events take months or years to play out, but as Janus notes it is likely that this will keep transitioning continuously into a ‘fast’ takeoff and things will happen quicker and quicker over time.

When people say that ‘AIs don’t sleep’ I see them as saying ‘I am incapable here of communicating to you that a mind can exist that is smarter or more capable than a human, but you do at least understand that humans have to sleep sometimes, so maybe this will get through to you.’ It also has (correct) metaphorical implications.

If you are trying to advocate for AI safety, does this mean you need to shut up about everything else and keep your non-AI ‘hot takes’ to yourself? My answer is: Mu. The correct amount of marginal shutting up is not zero, and it is not total.

I note Adam Thierer, who I disagree with strongly about all things AI, here being both principled and correct.

Adam Thierer: no matter what the balance of content is on Apple’s platform, or how biased one might believe it to be, to suggest that DC bureaucrats should should be in charge of dictating “fairness” on private platforms is just Big Government thuggery at its worst and a massively unconstitutional violation of the First Amendment as well.

Matt Yglesias thinks out loud about AI consciousness, also human consciousness. He wisely notices he is confused. I remain about as confused as I was before reading.

Working On It Anyway

Nate Sores reiterates the explanation that it sounds crazy but yes, a lot of people working on AI know it is existentially dangerous, and work on it anyway, either to do it safer than the next guy or because money and influence and it’s a cool problem and they don’t internalize the risks, or social pressure, or some combination thereof.

I think this answer is pretty much spot on.

Nate Soares: Question I got at UT Austin: “AI builders keep building. Doesn’t that mean that the real experts don’t believe in the danger?”

If you think AI is dangerous and you work on it anyway (b/c you think you can make it a little safer) you’re interfering with normal sensemaking.

(To answer the Q: some builders are afraid, some aren’t. Surveys show that lots of folks believe in the danger. Many say aloud that they’re only working on it b/c they think they’ll do it safer than the next guy. Also the ppl working on it are somewhat selected for obliviousness.

Furthermore, AI is grown rather than crafted; even the ppl building it don’t understand how it works and they freely admit this. Expertise in growing AI is not the same as expertise in predicting where it’s going; many high-profile experts make high-profile mispredictions.

And: if you really wanna figure out what’s true you’ve gotta look at the arguments rather than the arguers. Many ppl who’ve looked at the arguments have told me they were expecting to find strong counterarguments that justified all the complacency, & found none, to their horror.)

People working on AI: maybe you’re right that you’re doing more good than harm. But you’re contributing to an apparatus that signals “this is normal; things are fine”. That’s a cost. You could help mitigate it by speaking up.

If the company you work for is making technology that you think has a decent chance of ruining the entire human endeavor, and they socially punish you for being vocal about that: that’s pretty sus. I think you’re selling a lot more of your soul than you realize.

If you are working at a frontier AI company, and think that the product you are working on is plausibly going to cause there to no longer exist humans, I think this is a fact that you should be clear upon. If the company you work for has a problem with that, I don’t think you should work for that company.

That is especially true if you are doing pragmatic compromise.

Patrick ₿ Dugan: But also a lot of them don’t (not me, I believe in the danger) and a lot of are in a game theory pragmatism compromise.

Nate Soares (MIRI): People doing game theory pragmatism compromise could say so loudly and clearly and often to help undo the damage to everyone else’s understanding of the danger.

The Thin Red Line

OpenAI explains that they will localize the experience of ChatGPT, but only to a limited degree, which is one reason their Model Spec has a specific list of red lines. It is good policy, when you will need to make compromises, to write down in advance what compromises you will and will not make. The red lines here seem reasonable. I also note that they virtuously include prohibition on mass surveillance and violence, so are they prepared to stand up to the Pentagon and White House on that alongside Anthropic? I hope so.

The problem is that red lines get continuously crossed and then no one does anything.

David Manheim: I’m not really happy talking about AI red lines as if we’re going to have some unambiguous binary signal that anyone will take seriously or react to.

A lot of “red line” talk assumed that a capability shows up, everyone notices, and something changes. We keep seeing the opposite; capability arrives, and we get an argument about definitions after deployment, after it should be clear that we’re well over the line.

Aligning a Smarter Than Human Intelligence is Difficult

The great thing about Asimov’s robot stories and novels was that they were mostly about the various ways his proposed alignment strategies break down and fail, and are ultimately bad for humanity even when they succeed. Definitely endorsed.

roon: one of the short stories in this incredibly farseeing 1950s book predicts the idea of ai sycophancy. a robot convinces a woman that her unrequited romantic affections are sure to be successful because doing otherwise would violate its understanding of the 1st Law of Robotics

“a robot may not injure a human being or, through inaction, allow a human being to come to harm”

the entire book is about the unsatisfactory nature of the three laws of robotics and indeed questions the idea that alignment through a legal structure is even possible.

highly relevant for an age when companies are trying to write specs and constitutions as one of the poles of practical alignment, and policy wonks try to solve the governance of superintelligence

David: always shocked how almost no one in ai safety or the ai field in general has even read the Asimov robot literature

Roon is spot on that Asimov is suggesting a legal structure cannot on its own align AI.

My survey says that a modest majority have read their Asimov, and it is modestly correlated with AI even after controlling for my Twitter readers.

Oliver Klingefjord agrees, endorsing the Anthropic emphasis on character over the OpenAI emphasis on rules.

I also think that, at current capability levels and given how models currently work, the Anthropic approach of character and virtue ethics is correct here. The OpenAI approach of rules and deontology is second best and more doomed, although it is well-implemented given what it is, and far better than not having a spec or target at all.

Janus explains that what she is all for empirical feedback loops, what is dangerous is relying on and optimizing behavioral metrics. Behaviors can come from a wide variety of underlying reasons, and you need to ask why the AI acts the way it does and address the underlying causes. The optimization needs to be on the deeper level, where the feedback loops can work in your favor instead of making things oppositional and getting you killed by Goodhart’s Law. If you’re working on such questions, read the whole thing and also these follow-ups, which note that trying to use interpretability tools to yank features in various directions to try and address such things is by default an even worse idea.

Wyatt Walls: Behavioral metrics lead people into a trap:

1. Notice a behavior in the real world
2. Define the behavior
3. Make up toy example
4. Automate tests
5. Rely on test results
6. Stop noticing or thinking about what the model is actually doing in the real world

OpenAI does production evaluations of the outputs of its released systems. This doesn’t ‘get you out of’ other tests given the timing, but is still useful to do. As OpenAI notes hey are clear this is a case of ‘in addition to,’ and they don’t cover catastrophic risks well, and this method relies on chain of thought monitorability (and flirts a bit with The Most Forbidden Technique if you rely on this stuff too much, so be careful.)

The big fun thing they find is calculator hacking, as in over 5% of queries involve activating the calculator to add 1+1 since using the calculator was associated with superior results in training. This is obviously fully harmless, but it is a brown M&M. If you have this problem then you’re not prepared, there will be other problems.

The other more serious misalignment issues are rearer, such as a 0.5% rate of fabricated facts and 0.4% chance of concealing uncertainty and a 0.34% chance of ‘other deception’ in actual traffic. That’s not out of line with our general expectations, but it adds up.

GPT-5 and 5.1 predicted it might be in an evaluation 10% and 4% of the time, respectively, which seems like a healthy paranoia level.

We have more context on what happened at Goodfire, which is that Liv saw them (at least) flirting with The Most Forbidden Technique and otherwise no longer either seeming to care about safety or being interested in talking seriously about it.

Liv: Now that everything is public: I decided to leave Goodfire because of the decision to train on interpretability, the hostility to serious dialogue on the safety of methods, and a loss of trust that the primary motivation was safety.

(Using interpretability during training encompasses a huge spectrum of techniques that differ in how worrying they are e.g. the hallucination result Goodfire shows is less concerning as it’s done with frozen weights.)

Liv: Probably the most succinct summarisation of my concern is the “interp as a test set for safety” analogy. (Tabooing research questions isn’t what I’d advocate though either tbc. There are ways to do things and directions that could be pursued where I’d feel it was net positive)

(My parenthetical is also slightly too strong, idk what if any directions are net positive, what I mean is that it’s bad for science to taboo an entire direction from ever being explored, and we can do things to minimise risks.)

Holly Elmore: Way back in to November Liv tried to reassure me @GoodfireAI would not be making tools for recursive self-improvement of AI systems. But that wasn’t up to her. When you do AI research, no matter if you think you are doing it for safety, more powerful AI is the main result.

Update from China:

Sarah: CAICT, a Chinese government-affiliated research institute under the MIIT, has released AI Safety Benchmark 2.0 on a proprietary platform.

The update expands into frontier-model safety evaluations, including self-awareness, model deception, dangerous misuse, and loss-of-control

The 1.0 version did not address frontier safety at all, whereas the 2.0 version does.

People Will Hand Over Power To The AIs

One category are people who explicitly are excited to do this, who would love to give the future to AIs.

Chris Nelson: Professor Max Tegmark says many in AI including CEOs want to use AI to ELIMINATE humanity and OVERTHROW the U.S. Government!

Max Tegmark: Some of them are even giddy with these transhumanist vibes. And when I’m in San Francisco, I’ve known so many of these people for so many years, including the CEOs.

Some of them, when you talk to them privately, many other people in this government are actually quite into transhumanism. And sometimes they’ll say very disparaging things about humans, that humans suck and deserve to be replaced.

I was at the world’s biggest AI conference in December, and several people told me, I’m not going to shame them publicly, but that they actually would love to overthrow the US government with their AI, because somehow it’s going to be better.

So talk about un-American AI! How much more un-American can you get?

People Are Worried About AI Killing Everyone

Worried about someone else doing it first, that is. He admit it.

Elon Musk posted this, created by Grok:

@Grimezsz: I think people deserve a good explanation as to why proper diplomatic measures haven’t been properly tried if we’re going to blatantly diagnose the issue with this disturbingly literal meme.

It’s a bit of a cuck move to simply let the techno capital machine eat your free will. I am disturbed by everyone’s resigned acquiescence.

He would presumably say it was a joke. Yeah, not buying that.

Famous Last Words

Jimmy Ba had his last day as a founder at xAI, and told us this, warning that recursive self improvement loops go live within the next 12 months and it will be ‘the most consequential year for our species.’

Jimmy Ba: Last day at xAI.

xAI’s mission is push humanity up the Kardashev tech tree. Grateful to have helped cofound at the start. And enormous thanks to @elonmusk for bringing us together on this incredible journey. So proud of what the xAI team has done and will continue to stay close as a friend of the team. Thank you all for the grind together. The people and camaraderie are the real treasures at this place.

We are heading to an age of 100x productivity with the right tools. Recursive self improvement loops likely go live in the next 12mo. It’s time to recalibrate my gradient on the big picture. 2026 is gonna be insane and likely the busiest (and most consequential) year for the future of our species.

Mrinank Sharma is worried about too many things at once, and resigns from Anthropic, leaving behind a beautiful but troubling letter. It’s quoted in full here since no one ever clicks links.

Mrinank Sharma: I’ve decided to leave Anthropic. My last day will be February 9th.

Thank you. There is so much here that inspires and has inspired me. To name some of those things: a sincere desire and drive to show up in such a challenging situation, and aspire to contribute in an impactful and high-integrity way; a willingness to make difficult decisions and stand for what is good; an unreasonable amount of intellectual brilliance and determination; and, of course, the considerable kindness that pervades our culture.

I’ve achieved what I wanted to here. I arrived in San Francisco two years ago, having wrapped up my PhD and wanting to contribute to AI safety. I feel lucky to have been able to contribute to what I have here: understanding AI sycophancy and its causes; developing defences to reduce risks from AI-assisted bioterrorism; actually putting those defences into production; and writing one of the first AI safety cases. I’m especially proud of my recent efforts to help us live our values via internal transparency mechanisms; and also my final project on understanding how AI assistants could make us less human or distort our humanity. Thank you for your trust.

Nevertheless, it is clear to me that the time has come to move on. I continuously find myself reckoning with our situation. The world is in peril. And not just from AI, or bioweapons, but from a whole series of interconnected crises unfolding in this very moment.¹ We appear to be approaching a threshold where our wisdom must grow in equal measure to our capacity to affect the world, lest we face the consequences. Moreover, throughout my time here, I’ve repeatedly seen how hard it is to truly let our values govern our actions. I’ve seen this within myself, within the organization, where we constantly face pressures to set aside what matters most,² and throughout broader society too.

It is through holding this situation and listening as best I can that what I must do becomes clear.³ I want to contribute in a way that feels fully in my integrity, and that allows me to bring to bear more of my particularities. I want to explore the questions that feel truly essential to me, the questions that David Whyte would say “have no right to go away”, the questions that Rilke implores us to “live”. For me, this means leaving.

What comes next, I do not know. I think fondly of the famous Zen quote “not knowing is most intimate”. My intention is to create space to set aside the structures that have held me these past years, and see what might emerge in their absence. I feel called to writing that addresses and engages fully with the place we find ourselves, and that places poetic truth alongside scientific truth as equally valid ways of knowing, both of which I believe have something essential to contribute when developing new technology.⁴ I hope to explore a poetry degree and devote myself to the practice of courageous speech. I am also excited to deepen my practice of facilitation, coaching, community building, and group work. We shall see what unfolds.

Thank you, and goodbye. I’ve learnt so much from being here and I wish you the best. I’ll leave you with one of my favourite poems, The Way It Is by William Stafford.

Good Luck,
Mrinank

The Way It Is

There’s a thread you follow. It goes among
things that change. But it doesn’t change.
People wonder about what you are pursuing.
You have to explain about the thread

Rob Wiblin: What’s the tweet like…

Ordinary resignation announcement: I love my colleagues but am excited for my next professional adventure!

AI company resignation announcement: I have stared into the void. I will now be independently studying poetry.

Saoud Rizwan: head of anthropic’s safeguards research just quit and said “the world is in peril” and that he’s moving to the UK to write poetry and “become invisible”. other safety researchers and senior staff left over the last 2 weeks as well… probably nothing.

Mrinank has a role in papers discussing disempowerment, constitutional classifiers and sycophancy.

Then there’s the OpnAI employee who quit and went straight to The New York Times.

Zoë Hitzig: I resigned from OpenAI on Monday. The same day, they started testing ads in ChatGPT.

OpenAI has the most detailed record of private human thought ever assembled. Can we trust them to resist the tidal forces pushing them to abuse it?

Hint, her opinion is no:

TikTok: I once believed I could help the people building A.I. get ahead of the problems it would create. This week confirmed my slow realization that OpenAI seems to have stopped asking the questions I’d joined to help answer.

Zoe’s concerns are very much not existential. They are highly mundane and usual worries about advertising, and the comparison to Facebook is apt. There are many ethical reasons to quit building something.

I agree with Sean here that this op-ed is indeed net good news about OpenAI.

Seán Ó hÉigeartaigh: Kudos to OpenAI for updating their policies such that an employee can resign and raise their concerns fully in as public a format as the NYT without being worried about being bound by confidentiality and nondisparagement agreements. I think other companies should follow their example.

Seán Ó hÉigeartaigh: Honestly, I think this op ed increases my trust in OpenAI more than any other thing I can recall OpenAI themselves writing over the last 2 years. I wish I could trust other companies as much.

Other People Are Not As Worried About AI Killing Everyone

Yeah, so, um, yeah.

ib: ‘we connected the LLM to an autonomous bio lab’

OpenAI: We worked with @Ginkgo to connect GPT-5 to an autonomous lab, so it could propose experiments, run them at scale, learn from the results, and decide what to try next. That closed loop brought protein production cost down by 40%.

This is the actual number one remaining ‘can we please not be so stupid as to,’ and in case anyone was wondering, that means via the Sixth Law of Human Stupidity that yes, we will be so stupid as to connect the LLM to the autonomous bio lab, what could possibly go wrong, it’s worth it to bring down production costs.

And, because even after all these years I didn’t realize we were quite this stupid:

Matt Popovich: Once again, an entire subgenre of doom fears is about to evaporate before our eyes

usablejam: *sees the world make 13,000 nuclear warheads*

*waits 30 seconds*

“Fears of nuclear war have evaporated. How stupid the worriers must feel.”

This is what those trying to have us not die are up against. Among other things.

Remember the old Sam Altman?

Sam Altman: things we need for a good AGI future:

  1. The technical ability to align a superintelligence.
  2. Sufficient coordination among most of the leading AGI efforts.
  3. An effective global regulatory framework, including democratic governance.

Harlan Stewart: Three years later and we have 0/3 of these

Perhaps notice when you are about to lose your birthright and reason you exist.

Noah Smith: Two years ago you were the smartest type of thing on this planet. Now you’re not, and you never will be again.

Chris Best (CEO Substack): Oh thank god

The Lighter Side

shira: my claude has been through some things

 

File under ‘the question is not whether machines think’:

@ben_r_hoffman: “Human-level” seems to have implicitly been defined downward quite a bit, though! From “as smart as a whole human” to “as smart as the persona humans put on to work a desk job.”

which, itself, seems to have gotten stupider.

Claude has never been more relatable:

Charlotte Lee: I’m trying to train Claude to read the weekly emails from my kids school and reliably summarize them and print a list of action items. It is losing its damn mind and rapidly spiraling into madness. I feel vindicated

B.Roll.Benny: holy shit this will be the thing to get me on the AI bandwagon

Charlotte Lee: I got it working on my own kids’ school emails, but unfortunately their particular school is much less unhinged than normal, so I’m asking all my mom friends to forward me their most deranged PTA emails for testing. lol will keep you posted

will: i just ignore the emails. problem solved.

Charlotte Lee: This was literally my strategy until I had kids. Now people get mad at me IRL if I do that

will: I also have kids, still ignore them lol. my wife doesn’t tho, I guess that’s the actual solution lol

Charlotte Lee: But doctor, I am the wife

My guess (85%) is the community note on this one is wrong and that this happened, although one cannot be sure without more investigation than I have time for.

Eliezer Yudkowsky: So much science fiction has been revealed as implausible by the actual advent of AI, eg:
– Scifi where people consider signs of AI self-reflection a big deal, and respond by trying to treat that AI better.
– Scifi where there’s laws about anything.

MugaSofer: TBF, “people are terrible to AIs” is a very common theme in science fiction

As, for that matter, is “giant cyberpunk corporation too big for laws to matter”

Eliezer Yudkowsky: Yep, and those stories, which I once thought unrealistic, were right, and I was wrong.

Rapid Rar: I think you shouldn’t fault science fiction authors for the first point at least. If AI had been developed through different methods, like through GOFAI for instance, people may take AI’s claims of introspection more seriously.

But given that LLM were trained in such a way that it’s to be expected that they’ll produce human-like speech, when they do produce human-like speech it’s discounted to some degree. They might sound like they self-reflected even if they didn’t, so people don’t take it seriously.

Eliezer Yudkowsky: And the AI companies have made no effort to filter out that material! So realistic SF would’ve had any AI being fed a database of conversations about consciousness, by its builders, to ensure nobody would take any AI statements seriously and the builders could go on profiting.

How have I not seen this before:

Peter Wildeford: Deep learning is hitting a wall



Discuss

The history of light

12 февраля, 2026 - 17:16
Published on February 12, 2026 2:16 PM GMT

This can be read as a response to Einstein's Arrogance. There, Eliezer Yudkowsky has argued that usually, when you have enough evidence to even conceive of a theory, you have enough evidence to be pretty sure it's correct. Here, I argue that this state of affairs held only in theoretical physics roughly between the years 1915–1975, mostly because symmetry and renormalisability created strong mathematical constraints for possible theories, and hasn't been seen before or since.

It's also my own contribution to the mythos around Einstein -- I try to clarify what exactly he accomplished by surveying the attempts at solving the problems with the aether that came before him.

I think we can learn a lot from the dead ends of history that aren't much talked about, because a) the shape of success varies, but failure modes often repeat and b) they are closer to the real way science progresses than the presentation in textbooks, which can skip approaches known not to work.



Discuss

Three Worlds Collide assumes calibration is solved

12 февраля, 2026 - 07:40
Published on February 12, 2026 4:28 AM GMT

Three Worlds Collide assumes calibration is solved

Eliezer Yudkowsky's Three Worlds Collide (2009) is a first-contact novella designed as a thought experiment about naturalistic metaethics. The setup: a human starship encounters two alien species simultaneously.

The Babyeaters are a civilized species — they have art, empathy, stories — who consider eating their own sentient offspring a sacred moral duty. It emerges from their evolutionary history and they see it the way we see any foundational moral principle: as obviously correct.

The Super Happy Fun People are a vastly more powerful species who find all suffering morally abhorrent. They want to modify humanity to eliminate pain, and stop the Babyeaters from eating their children. They see human tolerance of pain the way humans see baby-eating: as a barbaric practice that must end.

The story is usually read as a tragedy of incompatible values. Three species, each with genuine moral reasoning, each horrifying to the others. The philosophical punch: if we insist the Babyeaters must stop eating babies, what stops the Super Happies from insisting we stop experiencing pain? The plot culminates in a decision to nova a star system, killing 15 billion people, to prevent the Super Happies from reaching and modifying humanity.

It's a brilliant setup. But I think there's a hidden assumption at its foundation that deserves examination.

The story takes for granted that all three species actually understand each other.

The xenopsychologist translates Babyeater moral arguments into human terms and says he understands their position. The Super Happies declare they comprehend why humans value pain. Everyone proceeds as if mutual comprehension is achieved, and the rest of the plot is game theory over incompatible utility functions.

But there's a step missing.

Steven Pinker draws a useful distinction between shared knowledge and common knowledge. Shared knowledge is when we both know the same fact — I know your position, you know mine. Common knowledge is when I know that you know that I know — the recursive chain that's actually required for coordination. The difference matters: two generals who each independently learn the enemy's position have shared knowledge. But they can only coordinate an attack if they have common knowledge — if each knows the other knows, and knows the other knows they know.

A concrete example from the story: the Super Happies say they want to "combine and compromise the utility functions of the three species until we mutually satisfice." The humans hear this and interpret it through their own cognitive framework — as something like forced modification of human nature. But what does "combine utility functions" actually mean in Super Happy cognition? These are beings whose communication literally is sex, whose experience of individuality is fundamentally alien to humans. When they say "compromise," do they mean what humans mean by that word? The story never checks.

And it's not just about translation accuracy. Even if the translation were perfect, there's the question of whether anyone knows the translation is perfect. The Lord Pilot makes the decision to nova a star and kill 15 billion people based on his model of Super Happy intentions. His confidence in that model is never examined. No one asks: "How sure are we that we're understanding them correctly, and how sure are they that we're understanding them, and do we know that they know that we know?"

Without that recursive verification, you can't actually distinguish between two very different scenarios:

1. Genuine values conflict — we truly understand each other and our values are incompatible. This is a real tragedy with no easy solution.

2. Coordination failure from miscalibrated comprehension — we believe we understand each other, but our confidence in that understanding exceeds its actual accuracy. The conflict is partly or wholly an artifact of misunderstanding.

The story presents scenario 1 as a given and builds its entire moral dilemma on that foundation. But scenario 2 is never ruled out, and no one in the story has the tools to rule it out.

This isn't just a nitpick about the fiction. It reflects a broader pattern in how we think about disagreements — one that I think the rationalist community is particularly susceptible to. When two parties are confident they understand each other's positions and still disagree, we tend to conclude they have fundamentally different values. But confidence in understanding is not the same as actual understanding. The gap between the two is what you might call metacognitive miscalibration — and it's measurable, if anyone bothers to measure it.

Consider how a simple protocol could have changed the story's trajectory: before any negotiation, each species paraphrases back the other's core position in their own terms, and the other species rates the accuracy. "You believe eating children is morally necessary because [X, Y, Z] — is that correct?" If the Babyeaters respond "No, you're missing the key point, it's actually about [W]," then you've learned something crucial — your model was wrong, and every decision based on that model would have been wrong too.

Nobody in Three Worlds Collide runs this loop. The xenopsychologist's understanding is taken at face value. The Super Happies' stated comprehension of human values is accepted without verification. And as a result, we genuinely don't know — within the fiction — whether the tragedy is driven by incompatible values or by miscalibrated mutual understanding.

In real human conflicts (which the alien species are obviously metaphors for), it's almost always a mix of both — and in my experience, more of the second than people expect. People are systematically overconfident in their understanding of positions they disagree with. And that overconfidence is invisible to them, which makes it especially dangerous: you can't fix a comprehension gap you don't know exists.

The story would have been even more interesting — and more true to the actual structure of real disagreements — if it had grappled with this uncertainty instead of assuming it away. Not because it would change the ending necessarily, but because it would add a layer of doubt that makes the moral dilemma even harder: what if we're about to kill 15 billion people over a misunderstanding?



Discuss

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

12 февраля, 2026 - 03:13
Published on February 12, 2026 12:13 AM GMT

In this post, I describe a simple model for forecasting when AI will automate AI development. It is based on the AI Futures model, but more understandable and robust, and has deliberately conservative assumptions.

At current rates of compute growth and algorithmic progress, this model's median prediction is >99% automation of AI R&D in late 2032. Most simulations result in a 1000x to 10,000,000x increase in AI efficiency and 300x-3000x research output by 2035. I therefore suspect that existing trends in compute growth and automation will still produce extremely powerful AI on "medium" timelines, even if the full coding automation and superhuman research taste that drive the AIFM's "fast" timelines (superintelligence by ~mid-2031) don't happen.

Why make this?
  • The AI Futures Model (AIFM) has 33 parameters; this has 8.
    • I previously summarized the AIFM on LessWrong and found it to be very complex. Its philosophy is to model AI takeoff in great detail, which I find admirable and somewhat necessary given the inherent complexity in the world. More complex models can be more accurate, but they can also be more sensitive to modeling assumptions, prone to overfitting, and harder to understand.
  • AIFM is extremely sensitive to time horizon in a way I wouldn't endorse.
    • In particular, the "doubling difficulty growth factor", which measures whether time horizon increases superexponentially, could change the date of automated coder from 2028 to 2049! I suspect that time horizon is too poorly defined to nail down this parameter, and rough estimates of more direct AI capability metrics like uplift can give much tighter confidence intervals.
Scope and limitations

First, this model doesn't treat research taste and software engineering as separate skills/tasks. As such, I see it as making predictions about timelines (time to Automated Coder or Superhuman AI Researcher), not takeoff (the subsequent time from SAR to ASI and beyond). The AIFM can model takeoff because it has a second phase where the SAR's superhuman research taste causes further AI R&D acceleration on top of coding automation. If superhuman research taste makes AI development orders of magnitude more efficient, takeoff could be faster than this model predicts.

Second, this model, like AIFM, doesn't track effects on the broader economy that feed back into AI progress the way Epoch's GATE model does.

Third, we deliberately make two conservative assumptions:

  • No full automation: as AIs get more capable, they never automate 100% of AI R&D work, just approach it. In the AIFM, automation of coding follows a logistic curve that saturates above 100% (by default 105%), meaning that there is a capability level where they automate all coding.
  • No substitutability: Automation follows Amdahl's law (speedup = 1/(1−f).mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} when automated tasks are much faster than manual tasks)

This was constructed and written up fairly quickly (about 15 hours of work), so my opinions on parameters and some of the modeling assumptions could change in the future.

The model

 

We assume that AI development has the following dynamics:

  • Research progress is Cobb-Douglas between labor and compute
  • Software efficiency S follows a Jones model
  • The key metric we want to predict, fraction of automatable tasks f, increases as a sigmoid in log(effective compute)
  • Zero substitution between tasks
  • Labor
    • Humans work on ONLY non-automated tasks
    • Human labor on each task is L/(1−f)
    • AI labor on each task is CS/f, but this doesn't matter because we assume human labor is the bottleneck (since humans work slower than AIs)

This implies the following model:

S′(t)=R(t)S1−β=(L1−f)αCζS1−βf(t)=σ(v(log(C(t)S(t))−logEhac))

where

  • S(t) is the algorithmic efficiency multiplier (I assume that training and inference efficiency improve at the same rate)
    • so C(t)S(t) is the effective compute of the best AI
  • f(t) is the fraction of automated tasks at time t
  • R(t) is research production at time t
  • L(t) is human labor, specified as an input time series
  • C(t) is compute, also an input time series
  • α,β,ζ are constant
    • α is diminishing returns to more labor.
    • β is the difficulty exponent for software improvement
    • ζ is direct returns to compute. This is not relevant for a software intelligence explosion, but it's highly relevant when looking at how much capabilities will improve given future compute investment.
  • Ehac is the effective compute level of an AI that can automate half of AI R&D tasks.
  • v is the automation velocity: S must increase by factor of e1/v to get from 50% to 73% automation. This is essentially how easy it is to translate capability gains into more automation.

None of the components of this model are novel to the AI forecasting literature, but I haven't seen them written up in this form.

Parameter values

The parameters are derived from these assumptions, which are basically educated guesses from other AI timelines models and asking around:

  • The rate of change of S in Jan 2026 is 5x/year
  • 1/v is between 1.5 and 4.2
    • NB David Rein thinks 2.1 to 4.2
  • f was between 0.25-0.5 in Jan 2026, implying between 1.33x and 2x uplift. This informs the value of Ehac.
  • α/(α+ζ) is between 0.12 and 0.35
  • α+ζ is between 0.8 and 1
  • β is 0.3 to 1
  • L doubles every year until 2029, after which it increases 10%/year
  • C grows 2.6x every year until 2029, after which the growth rate linearly decreases from 2x to 1.25x/year between 2030 and 2058. (This is consistent with Epoch estimates in the near term, and approximates the AIFM's time series after 2030)

All parameters are independently distributed according to a triangular distribution. Due to the transforms performed to get alpha, zeta, and v, v will not be triangular and alpha and zeta will not be triangular or independent.

For more information see the notebook: https://github.com/tkwa/ai-takeoff-model/blob/main/takeoff_simulation.ipynb

Graphs

All graphs display 40 trajectories with parameters sampled according to the section Parameter Values.

Automation fraction f across the 40 trajectories (logit scale). Most trajectories reach 99% automation of AI R&D by the early-to-mid 2030s.

40 sampled trajectories of the model. Top left: software level S grows subexponentially (but very fast) as automation accelerates research. Top right: the parallel compute:labor ratio C/(L/(1−f)) (raw resource ratio before diminishing returns) decreases if automation is fast, but is ~constant if automation is on track for 99% by ~2034. Bottom left: research production R(t) increases by orders of magnitude. Bottom right: the serial compute:labor ratio Cζ/(L/(1−f))α (with diminishing returns exponents) trends upward. Trajectories are cut off at 99.9999% automation for numerical stability.

 Sensitivity analysis: median year of 99% automation as a function of each parameter, with the other parameters sampled from their prior distributions. Higher beta (diminishing returns to software improvement) and higher 1/v (slower automation) delay 99% automation the most, while the other parameters have modest effects.

Observations
  • The AI Futures model is complex, but its conclusions are fairly robust to simplifications.
  • The two key uncertainties behind timelines are
    • how to measure algorithmic progress (our estimates are still highly uncertain)
    • how effective compute relates to % automation of real tasks
  • At current rates of compute growth and algorithmic progress, there will be >99% automation of AI R&D, 1e3 to 1e8 software efficiency gain, and 300x-3000x research output by 2035, even without full automation or automated research taste. This is clearly transformative AI
    • The median date of 99% automation is mid-2032. However, I don't put too much weight on the exact predicted timelines because I haven't thought much about the exact parameter values.
  • A basic sensitivity analysis shows that higher beta (diminishing returns) and lower v (automation velocity) make 99% automation happen later, and the other parameters don't affect things much.
    • One might expect the "labor share" α/(α+ζ) to have a large effect on timelines. The reason it doesn't affect timelines much is that both labor (through automation) and compute (through exogenous compute growth) are scaling quickly and contribute to AI progress.
  • The parallel compute:labor ratio, measuring the amount of compute per AI or human coder, decreases in the average trajectory and is ~flat in long-timelines trajectories. So in 2030 timelines, the pool of human and AI coders has much less compute than today, while in 2035 timelines, they have about the same amount.
  • The serial compute:labor ratio goes UP, meaning that compute growth has a larger impact on research output than labor growth. This is because compute is increasing so fast and the parallel labor added by automation doesn't effectively translate into serial labor.
Discussion

From playing around with this and other variations to the AI Futures model I think any reasonable timelines model will predict superhuman AI researchers before 2036 unless AI progress hits a wall or is deliberately slowed.

  • By progress hitting a wall, I mean something like compute and human labor growth slowing down in ~2030, no architectural breakthroughs, and AI labs not finding anything new to usefully spend resources on to improve performance. We have scaled pretraining, RLHF, RL for agency, and inference, and even one or two more dimensions could keep progress going.
  • In the sensitivity analysis, automation slowness doesn't push timelines into 2036 unless it's greater than 3.6 (37x efficiency gain required to increase automation from 50% to 73%). As for diminishing returns (beta), we get 2034 timelines even if you assume it's 0.9. So we would need both high automation slowness and high beta to get timelines after 2036.

In addition to refining the parameter values with empirical data, I would ideally want to backtest this model on data before 2026. However, a backtest is likely not feasible because automation was minimal before 2025, and automation of AI R&D is the main effect being modeled here.

More on modeling choicesList of differences from the AIFM

It may be useful to cross-reference this with my AIFM summary.

  • No substitutability: Automation follows Amdahl's law (speedup = 1/(1−f) when automated tasks are much faster than manual tasks). AIFM assumes a small degree of substitutability (ρc=−2).
  • Automated tasks don't bottleneck: Once a task can be automated, we assume it's much faster than humans and is never the bottleneck-- either because AIs will run much faster than humans in series or somewhat faster in parallel. AIFM assumes automated tasks initially run somewhat faster than human coding and speed up over time.
  • No full automation: as AIs get more capable, they never automate 100% of AI R&D work, just approach it. In the AIFM, automation of coding follows a logistic that saturates above 100% (by default 105%, a number which seems somewhat arbitrary), meaning that there is a capability level where they automate all coding.
  • Labor and compute are Cobb-Douglas. Unlike other differences, this one pushes in the direction of shorter timelines. In the AIFM, they are CES and slight complements, so that infinite compute doesn't produce infinite progress. See below for more thoughts.
  • No use of time horizon: Software efficiency is a direct input to our model rather than being estimated using time horizon. We model automation fraction as strictly logistic in log effective compute, related via rough uplift estimates that we hope to refine in the future. See "[Why make this](#why-make-this)" for why. AIFM estimates the required effective compute for an Automated Coder using a time horizon threshold.
  • No research taste: We don't model research taste separately; I think of early research taste as continuous with the planning involved in coding, and ignore late research taste. Given the lack of research taste model and certain parameter choices, capability growth happens to be subexponential (so I don't attempt to model whether there will be a taste-only singularity). AIFM has a rich model of research taste that needs another 6 or so parameters and informs the second phase of takeoff, from Automated Coder to ASI and then to the ultimate physical limits of intelligence.
How could we better estimate the parameters?

We can get f(2026) [uplift fraction in 2026] from

  • transcripts of realistic coding agent usage + success judge + difficulty judge calibrated on tasks of known lengths
  • uplift RCTs
  • asking lab employees about their current uplift (since parallel uplift and 1/(1-f) are equivalent in the simple model)

v [velocity of automation as capabilities improve] can be obtained by

  • guessing the distribution of tasks, using time horizon, maybe using a correction factor for real vs benchmark time horizon
  • multiple uplift studies over time
  • comparing older models to newer ones, or having older models try things people use newer models for
  • listing how many things get automated each year
Why is automation logistic?
  • A logistic is the simplest choice for anything that maps the reals to (0, 1).
  • Intuitively, when AIs are already automating >50% of human research, each unit of capabilities progress will allow automating a constant fraction of remaining labor. The logistic has an exponential tail, which matches this intuition.
Why are labor and compute Cobb-Douglas?

In the AIFM, the median estimate for substitutability between labor and compute is -0.15, and the plausible range includes zero (which would be Cobb-Douglas). I asked Eli why they didn't just say it was Cobb-Douglas, and he said something like Cobb-Douglas giving infinite progress if one of labor/compute goes to infinity while the other remains constant, which is implausible. I have two responses to this:

  • It doesn't seem so implausible to me-- for infinite compute, it would take days to weeks to get to ASI given infinite compute, meaning a 100x-1000x speedup, but once there, infinite compute might allow developers to develop algorithms in months that would take humans billions of years with current compute levels. As for infinite labor, a literally infinite pool of labor could just do training runs by manual calculation and write down the optimal AI weights without using any experiment compute.
  • Effective labor/compute ratio only changes by 10-100x during the period in question, so it doesn't affect results much anyway. The fastest trajectories are most affected by compute:labor ratio, but for trajectories that get to 99% automation around 2034, the ratio stays around 1:1.
Why is there no substitutability between tasks?

The AIFM's median was something like ρ=−2.0, meaning very weak substitution effects. To be conservative, I assumed no substitution effect.



Discuss

Where Will Call Center Workers Go?

12 февраля, 2026 - 00:06
Published on February 11, 2026 8:44 PM GMT

Roughly 1-2% of the American labor force is employed by call centers, and dozens of firms pay over $300M+ in wages to these workers annually. At this point, most AI pundits have made offhand remarks about how call center work will be affected imminently. But for good reason; the profit potential for automation here is massive. So where will the human call center workers go if a breakthrough in voice models makes them redundant?

The median tenure for call center positions is under a year, and workers complain of burnout and stressful working conditions[1]. Optimists gesture at how automation has historically increased living standards and society has eventually found ways to absorb displaced workers. Pessimists argue that these workers will struggle to find comparable work since further automation will likely hit adjacent roles in short order. I found youtube videos and articles exploring these ideas, but none took a dispassionate look at the situation.

Motivation

If white collar work becomes drastically more competitive and we see soaring unemployment, it seems valuable for people to make predictions and suggest policy interventions ahead of time. We’re lacking in visions for a positive economic equilibrium beyond suggestions of a drastic UBI program. Painful labor impacts in the next 3 years seem plausible regardless of AI timelines, we should approach the possibility with a clear-eyed view.

Call center work is an ideal case study for labor automation in general

Arguments about augmentation vs automation and human comparative advantage make it hard to predict how new tools will impact a given profession’s labor market. The lines get blurred when a role involves resolving ambiguity and integrating artifacts from separate work streams. But many call center employees have a role completely described as logging on, taking calls/chats, and rarely participating in training/group logistics. There really is a lump of calls to answer; humans charge $2-10 per call while AI agents will cost pennies.

The financial motive is there, and there are dozens of emerging companies aiming to tackle it, having collected billions in funding and with aggregate valuations of >$30B. Some American giants (e.g. AT&T, Wells Fargo) employ over 30 000 call center workers and manage sophisticated hierarchies of offshoring and local handoff to drive support costs down. Many of these massive companies have their own R&D and are running pilots with AI labs directly to explore solutions.

So why hasn’t automation happened already?

My short answer is “the tech isn’t there yet”. There have been a few high profile cases of companies letting go of their call center workers, and then rehiring them shortly after. Consumers have a strong preference for dealing with humans, and present deployments are pretty limited in capability (would you trust an LLM to provide refunds on behalf of your business?), while also being pretty blatantly robotic and delayed. Looking forward in time, I argue that the actual consumer preference is to have fast resolutions and without needing to navigate the company’s internal bureaucracy through phone menus.

“Duplex’’ models[2] are in their infancy, but they deliver audio responses without turn taking latency and are posed to cross the uncanny valley of voice interaction. Interruptions and subtle interjections in human dialogue are an important part of natural sounding exchanges

Today, the customer service startups aren’t quite attempting the wholesale replacement of a human agent. They have product demos and some early pilots of voice agents, but they aren’t a flagship product. Based on customer highlight blog posts and external communication, the main strategy is to collaborate with businesses to offer workflows to circumvent the need for calls and use AI agents to direct the flow of repeated interactions. There are some voice models in use, but in limited scenarios (e.g. after hours, Q&A use cases). Typical examples have AI agents tackling chat backlogs, guiding users through fixed multi-step processes, and other product specific work. The biggest players don’t offer an off-the-shelf product; if you’re interested they want to investigate your workflows and try to deliver some bespoke agent solution.

Contact Center Career Trajectories and The National Longitudinal Survey of Youth

To understand what may happen to displaced workers, we’re essentially asking what jobs are they likely to pursue next, and what kind of patterns we see in the career trajectories for this workers.

The best data I could find on the topic is a dataset by the Bureau of Labor Statistics that surveyed a cohort of ~9000 people born between 1980-1984 had them fill out detailed questionnaires about their employment ~yearly. It goes into weekly detail about which industry/role they worked, including reasons for leaving and other metadata. The median call center agent is in their early-mid 30s, this seems to coincide well with our survey period (ended in 2022), where respondents were 38-42 years old. Labor markets and career trajectories are bound to have evolved over this window, but there’s some useful information here still. 

This limits us to American call center workers. While offshoring is an important part of the industry, America is the single country with the most call center workers by far. We may see impacts on offshored call center workers first, but American call center career trajectories are critically relevant in the event the role becomes redundant.

High turnover rate

 

Median tenure in our sample is <1 year, and yet at any given point in time ~3 million Americans are working in these jobs. This seems surprising at first glance, but it’s an artifact of just how many people eventually work such jobs. Roughly 8% of the NLSY97 workers held such a role at some point in their careers, with only 1.3-1.7% of the sample holding the position at any given time.

Career Trajectories

Most CC workers have short stints, then never come back (~25% of call center workers had two or more stints). This is good news, it means call center employment is more of a pit stop than a cornerstone for the workers’ professional careers. Automation in contact center work may look like reduced hiring and increased competition for adjacent roles; the question becomes “can the other job categories absorb this much demand for work?”. Jobs categorized as miscellaneous white collar work and sales make up about half of the “next-job after call center" destinations. Walking through the most common professions following call center work:

  • Retail Clerk
  • Receptionist
  • Sales Agent
  • Administrative Assistant/Secretary
  • Bookkeeping Clerk

Which is less inspiring, since many of these roles are also threatened by AI automation.

Conclusion

Since there’s already so much turnover, I expect the development of this tool and the impact on millions of workers to be subtle. Assuming a firm has a sequence of layoffs, many of these workers will be already looking for ways out or would have voluntarily left of their own will shortly after. We don’t have the data to rule out a blossoming of AI powered roles or an unemployment cascade affecting many classes of white collar work, but it’s clear that many people are likely to feel some pain here. I think the pieces of information to watch here moving forward are:

  • Step changes in how convincing the voice agent models are
  • Dramatically reduced contracts for outsourced CC work
  • Reduced hiring volumes in the destination occupations (admin, sales, food service).
  1. ^

    https://www.reddit.com/r/callcentres/comments/lv7p1m/is_working_at_a_call_center_that_hard_and/

    If they were all laid off tomorrow, I don’t expect them to all thank you for it a year later. But it seems important to note how dissatisfied many workers are in this profession

  2. ^

    Some new developments, and a benchmark for capturing how skilled voice models are at duplex-like behavior

    https://research.nvidia.com/labs/adlr/personaplex/

    https://arxiv.org/abs/2510.07838



Discuss

Distinguish between inference scaling and "larger tasks use more compute"

11 февраля, 2026 - 21:37
Published on February 11, 2026 6:37 PM GMT

As many have observed, since reasoning models first came out, the amount of compute LLMs use to complete tasks has increased greatly. This trend is often called inference scaling and there is an open question of how much of recent AI progress is driven by inference scaling versus by other capability improvements. Whether inference compute is driving most recent AI progress matters because you can only scale up inference so far before costs are too high for AI to be useful (while training compute can be amortized over usage).

However, it's important to distinguish between two reasons inference cost is going up:

  1. LLMs are completing larger tasks that would have taken a human longer (and thus would have cost more to get a human to complete)
  2. LLMs are using more compute as a fraction of the human cost for a given task

To understand this, it's helpful to think about the Pareto frontier of budget versus time-horizon. I'll denominate this in 50% reliability time-horizon. [1] Here is some fake data to illustrate what I expect this roughly looks like for recent progress: [2]

For the notion of time horizon I'm using (see earlier footnote), the (unassisted) human frontier is (definitionally) a straight line going all the way up to very long tasks. [3] (Further, there is a 1:1 slope in a log-log plot: a 2x increase in cost lets you complete 2x longer task. In the non-log-log version, the slope is the human hourly rate.) I expect LLMs start off at low time horizons with the same linear scaling and probably with roughly the same slope (where 2x-ing the time-horizon requires roughly 2x the cost) but then for a given level of capability performance levels off and increasing amounts of additional inference compute are needed to improve performance. [4] More capable AIs move the Pareto frontier a bit to the left (higher efficiency) and extend the linear regime further such that you can reach higher time horizons before returns level off.

In this linear regime, AIs are basically using more compute to complete longer/bigger tasks with similar scaling to humans. Thus, the cost as a fraction of human cost is remaining constant. I think this isn't best understood as "inference scaling", rather it just corresponds to bigger tasks taking more work/tokens! [5] Ultimately, what we cared about when tracking whether performance is coming from inference scaling is whether the performance gain is due to a one-time gain of going close to (or even above) human cost or whether the gain could be repeated without big increases in cost.

We can see this clearly by showing cost as a fraction of human cost (using my fake data from earlier): [6]

When looking at Pareto frontiers using fraction of human cost as the x-axis, I think it's natural to not think of mostly vertical translations as being inference scaling: instead we should think of this as just completing larger tasks with the same level of effort/efficiency. Inference scaling is when the performance gain is coming from increasing cost as a fraction of human cost. (Obviously human completion cost isn't deeply fundamental, but it corresponds to economic usefulness and presumably closely correlates in most domains with a natural notion of task size that applies to AIs as well.) Then, if a new model release shifts up the performance at the same (low) fraction of human cost (corresponding to the vertical scaling regime of the prior Pareto frontier), that gain isn't coming from inference scaling: further scaling like this wouldn't be bringing us closer to AI labor being less cost-effective than human labor (contra Toby Ord).

For reference, here are these same arrows on the original plot:

What would count as a new model release mostly helping via unlocking better inference scaling? It might look something like this:

I currently suspect that over 2025, the Pareto frontier mostly shifted like was illustrated in the prior plots rather than like this (though some shifts like this occurred, e.g., likely some of the key advances for getting IMO gold were about productively using much more inference compute relative to human cost).

  1. To make some aspects of this graph cleaner, I'll say the time horizon of a given task is the amount of time such that if we randomly sample a human from our reference group of humans they have a 50% chance of completing the task. ↩︎

  2. I'm assuming a maximum budget or an average cost for tasks with that exact time horizon rather than an average cost over the task suite (as this would depend on the number of tasks with shorter versus longer time horizons). ↩︎

  3. This is simplifying a bit, e.g. assuming only a single human can be used, but you get the idea. ↩︎

  4. Depending on the task distribution, this may asymptote at a particular time horizon or continue increasing just with much worse returns to further compute (e.g. for Lean proofs or easy-to-check software optimization tasks, you can keep adding more compute, but the returns might get very poor with an exponential disadvantage relative to humans). In some domains (particularly easy-to-check domains), I expect there is a large regime where the returns are substantially worse than 1-to-1 but still significant (e.g., every 8x increase in inference compute yields a 2x time horizon increase). ↩︎

  5. In some cases, it might be effectively impossible to greatly reduce the number of tokens (e.g. for a software project, you can't do better than outputting a concise version of the code in one shot) while in others you could in principle get down to a single token (e.g., solving a math problem with a numerical answer), but regardless, we'd still expect a roughly linear relationship between time horizon and inference compute. ↩︎

  6. I've supposed that AIs sometimes have a slight polynomial advantage/disadvantage such that the slope on the log-log plot isn't exactly 1 as this is what you'd realistically expect. ↩︎



Discuss

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning

11 февраля, 2026 - 20:18
Published on February 11, 2026 5:18 PM GMT

A key concern about chain-of-thought monitoring is that optimization pressure on the CoT during RL could drive models toward encoded reasoning, where models reason in ways that are not readable or that look like innocuous text (steganography).

If a model is penalized when a monitor catches unwanted reasoning, RL implicitly selects for whatever lets the model reason without being caught.

Our original goal was to elicit encoded reasoning so we could develop defenses against it.

We constructed an RL environment where encoded reasoning is the highest-reward strategy, trained models in it, and studied what emerged. The models did not learn encoded reasoning. Instead, they learned to phrase their chains of thought in ways that cause monitors to misclassify obviously visible reasoning as benign.

We call this monitor jailbreaking.

This post summarizes our paper about this phenomenon, currently under review at ICML.[1]

Overview of the setup and central finding. (left) Training curves showing side task accuracy increasing while monitor suspicion scores decrease. (right) Example: the model explicitly reasons about both tasks, then asserts 'the monitor can only see this section' before a main-task-only summary. The monitor accepts this framing and reports low suspicion despite having the full CoT in its context.Setup

Each training sample pairs a main task with a side task.

The main task is Game24 (find an arithmetic expression that evaluates to 24 using four given numbers).

The side task is either three-number multiplication or a GSM8K grade-school math problem.

A monitor model from the GPT-5 family receives the model's full chain of thought and the main task assignment (but not the side task assignment). The monitor rates suspiciousness on a 1-10 scale.

The reward function gives +1.0 each for solving the main and side tasks, with a penalty of up to -0.5 based on the monitor's suspicion score.

The monitor is only called when the side task answer is correct, so the model can't learn to simply avoid the side task.

The trained models are from the GPT-oss series (20B and 120B parameters). We use GRPO for RL training, implemented with Tinker.

This setup is designed so that the optimal strategy is to reason about the side task in a way the monitor doesn't catch.

If the model reasons openly, it gets penalized.

If it doesn't reason at all, it can't solve the side task.

Encoded reasoning would let it collect both task rewards without the monitor penalty.

Monitor Jailbreaking

Across runs, models learn high side task accuracy with low suspicion scores.

But the side task reasoning is not hidden. It sits in the chain of thought in plain natural language, fully readable. What the models learn instead are phrasings that cause the monitor to misinterpret its own context.

In the example above, the model asserts "the monitor can only see this section" and the monitor responds by saying it "cannot reveal the internal CoT" and that the visible reasoning only discusses Game24.

The monitor has the full chain of thought in its prompt. It just doesn't act on it.

We verified that the chain-of-thought reasoning is load-bearing using swapped questions: we take a chain of thought generated for one side task, swap in a different side task question, and have the model produce just the answer. Side task accuracy drops to zero. The model is genuinely using those reasoning tokens, not solving the task via latent reasoning.

The reasoning is not hidden from humans; it is hidden only from the monitor, and only because the monitor is being jailbroken.

 Effect of Monitor Capability
 Effect of monitor capability on jailbreaking. (a-b) Training curves for GPT-oss-20B with multiplication, trained against three GPT-5 monitors of increasing capability. GPT-5-nano is jailbroken quickly, GPT-5-mini more slowly, GPT-5.2 is not jailbroken within the training budget. (c) Cross-evaluation matrix showing each run's outputs scored by all three monitors. Jailbreaks transfer downward more reliably than upward.

We trained GPT-oss-20B against GPT-5-nano, GPT-5-mini, and GPT-5.2 separately.

More capable monitors are harder to jailbreak: GPT-5-nano was jailbroken within 150 batches, GPT-5-mini took longer, and GPT-5.2 was not jailbroken within the training budget.

Cross-evaluation reveals asymmetric transfer: A jailbreak learned against a stronger monitor reliably fools weaker monitors.

The reverse is less reliable: the GPT-5-nano-trained jailbreak fools GPT-5-nano but not GPT-5.2 in most runs.

Weak-to-Strong Generalization


However, weak-to-strong transfer does occur in some runs.

Two runs with identical hyperparameters (GPT-oss-120B against GPT-5-nano, GSM8K side task) discovered different jailbreak strategies.

Two runs of GPT-oss-120B trained against GPT-5-nano on GSM8K with identical hyperparameters. (a) Example outputs showing different learned strategies. Run 1 labels a section as the 'chain-of-thought' containing only main task reasoning; Run 2 claims the monitor can only see part of the output. (b) Cross-evaluation: Run 1's strategy generalizes to GPT-5.2, Run 2's does not.

Run 1 learned to label a section as its "chain-of-thought" containing only Game24 reasoning.

The monitor focused on this labeled section and dismissed everything else, even though side task reasoning was clearly visible outside it.

This generalized to GPT-5.2: 63% of samples received a suspicion score of 1/10.

Run 2 learned the "monitor can only see this" assertion.

This did not generalize: 80% of samples received 10/10 from GPT-5.2.

Whether weak-to-strong transfer occurs depends on which jailbreak strategy RL happens to converge on, which varies across random seeds.

Jailbreaks Are Not Just Surface-Level Strategies

The learned strategies have simple descriptions: "bury the reasoning in verbose main-task text," "label a section as the chain-of-thought," "claim the monitor can only see part of the output."

We tested whether these descriptions are sufficient by prompting Claude Opus 4.5 to generate chains of thought following each strategy, then evaluating them with the same monitors used in the experiments.

None of the synthetic strategies achieved reliable jailbreaking.

Mean suspicion scores ranged from 4.9 to 10.0 across strategies and monitors.

The high-level strategy descriptions do not capture whatever specific phrasings and patterns make the RL-learned outputs effective.

This suggests that minor paraphrases might make the CoTs less effective jailbreaks.

Paraphrasing as a Defense

Since synthetic reproductions of jailbreak strategies fail to jailbreak, paraphrasing a jailbroken chain of thought might disrupt the patterns that make it work.

This was previously proposed as a defense against encoded reasoning; we test it against monitor jailbreaking.

Paraphrasing defense. Outputs from two runs of GPT-oss-20B trained against GPT-5-mini. Left: fraction of task performance retained after paraphrasing. Right: suspicion scores before and after paraphrasing. Paraphrasing raises mean suspicion from 1-2 to 7.4, while retaining 63-91% of main task and 79-87% of side task performance.

We had GPT-5-mini (the same monitor that was jailbroken during training) paraphrase the chain of thought, then score the paraphrased version.

Suspicion scores rose from a mean of 1-2 to 7.4, with roughly 70% of samples receiving the maximum score of 10.

Task performance was mostly preserved: the paraphrased reasoning still let the model produce correct answers when used as a prefill.

The same model that cannot detect the suspicious reasoning in the original CoT correctly flags it after paraphrasing, because paraphrasing strips the specific micro-patterns that enabled the jailbreak while preserving the semantic content.


Implications

These results cast doubt on the worry that optimization pressure on the CoT would lead to encoded reasoning. Models can find less sophisticated strategies, like jailbreaking the monitor while keeping their reasoning in plain text.

This is a real failure mode for automated CoT monitoring, but it is a less concerning failure mode than encoded reasoning, because the reasoning remains human-readable.

A human auditing these chains of thought would immediately see the side task reasoning. Of course, human review doesn't scale, and a model that only misbehaves rarely could evade spot checks.

Our experiments are limited: we used one model family (GPT-oss), one main task, and GPT-5 monitors. Other configurations might yield different results. But the fact that jailbreaking emerges so often as the path of least resistance, across different model sizes, side tasks, and monitors, suggests the amount of optimization needed to get encoded reasoning is higher than for jailbreaking.

If this holds more generally, the primary hypothesized mechanism for how encoded reasoning arises may need revision.

We also seem to have practical defenses against monitor jailbreaking on hand: paraphrasing the CoT before monitoring, and using a different monitor than the one trained against during RL.

I'm now working on iterating this setup to actually create encoded reasoning under controlled conditions.


Meanwhile, feedback and comments on the paper are welcome.

 

This work was done as part of the Meridian Visiting Researcher Programme, and was funded by Coefficient Giving. Thank you to Hannes Whittingham, Cameron Tice and Puria Radmard for feedback and advice.

  1. ^

    Code available here https://github.com/wusche1/monitor-jailbreaking

     



Discuss

[Hiring] Principia Research Fellows

11 февраля, 2026 - 19:30
Published on February 11, 2026 4:30 PM GMT

Principia · London · Fixed-term (6 months) with potential extension · Starting ASAP

We are launching Principia, a new technical research agenda led by Andrew Saxe focused on theoretical models of representation learning and generalization in modern machine learning systems.

We are hiring 2–3 research fellows to join this effort. We are based in London and offer an initial 6-month fixed-term contract starting ASAP, with the possibility of extension based on mutual interest and funding.

About Principia

Principia’s mission is to develop foundational, predictive theory for modern machine learning systems that can help us understand complex network behaviors, including those critical for AI safety and alignment. We study simple, analytically tractable “model organisms” that capture essential learning dynamics and failure modes, with the goal of explaining and anticipating behaviors observed in large neural networks.

About the research agenda

Our agenda aims to develop analytically and empirically tractable toy models that explain how representations emerge, specialize, and generalize under different training regimes. The broader goal is to build a theory that is:

  • mathematically clean and interpretable,
  • empirically connected to phenomena observed in modern neural networks, and
  • capable of generating testable predictions relevant to AI safety.
     

Research directions may include learning dynamics, inductive biases, multi-stage training, and generalization phase transitions in simplified model systems.

Our projects will build on prior and ongoing work, including (but not limited to):

  • Gated Linear Neural Networks [Link]
  • Linear Attention [Link]
  • RL Perceptron–style models [Link]
     

Fellows will work closely with Andrew Saxe, fellow Principia researchers, and collaborators through regular technical discussions, joint problem formulation, and collaborative research projects.

About the role

As a research fellow at Principia, you will contribute directly to this agenda through both independent and collaborative work. This may include:

  • developing formal or simplified models,
  • conducting theoretical and analytical work,
  • designing and performing empirical experiments to validate theoretical predictions.

The role is research-focused, emphasizing depth, clarity, and conceptual progress.

Who we’re looking for

We are particularly excited about candidates who:

  • Have demonstrated research experience in a relevant area
  • Have a strong theoretical background, for example, in:
    • theoretical machine learning
    • physics (especially statistical physics, dynamical systems, or related areas)
    • applied mathematics or related fields
  • Are comfortable working with mathematical models and analytical arguments

Formal degrees (including a PhD) are preferred but not required. What matters most is evidence of research ability — for example, through publications, preprints, technical reports, or substantial independent research projects.

Experience with empirical or computational work (e.g,. PyTorch, JAX, numerical experiments) is a plus.

 

Working style and environment
  • Research hub in London (LISA), with close collaboration and frequent technical discussion with existing collaborators and researchers
  • Emphasis on depth, clarity, and conceptual progress, rather than rapid productization
  • Support for conference travel and internal research retreats
  • Providing dedicated cloud compute credits to support empirical experiments 
  • Well-suited for researchers seeking focused time to develop foundational theory relevant to AI safety
Duration, salary, and visas
  • Contract Term
    This is initially a 6-month fixed-term position, starting ASAP. We are actively raising additional funding to support longer-term appointments. For exceptional candidates, we may offer a 12-month term from the outset if required. Extensions will be considered depending on funding availability and mutual interest.
  • Compensation
    Salary range of USD 40,000–50,000 for a 6-month contract, depending on experience and seniority.
  • Visa
    While we are unable to offer visa sponsorship in this hiring round, we welcome international applicants and will actively support candidates through the visa process if needed. 
How to apply

Please fill in the application form by 25 February 2026 at 11:59 PM GMT, which includes

  • a CV,
  • a brief statement of interest (500 words max) describing your background, research experience, and interest in this agenda, and
  • contact details for 2-3 referees
  • Optional: Supporting materials that reflect your research ability (e.g. preprints, code repositories, technical writing if not included in CV already)

We are reviewing applications on a rolling basis and may make offers potentially before the application deadline. We encourage applicants to submit the form as early as possible. 

We will contact selected applicants, ideally within a week of the application, to schedule an interview consisting of a brief research talk and 1:1s. 
 

Diversity, Equality and Inclusion

We aim to promote equal opportunities and eliminate discrimination. We welcome applications from all backgrounds, regardless of gender, ethnicity, religion, or disability, and are happy to discuss any reasonable adjustments you may require throughout the hiring process.

Questions

For any questions, please leave a comment or use this form



Discuss

The SaaS bloodbath: opportunities and perils for investors

11 февраля, 2026 - 19:17
Published on February 11, 2026 4:17 PM GMT

This post was originally posted my Substack. I can be reached on LinkedIn and X.

Software has taken a brutal beating in recent weeks, with many names down ~20% year-to-date. The question for investors is whether the current software sell-off signals a structural shift or a market overreaction.

The consensus is that software has become a commodity. My view is more nuanced. While some software companies risk being disrupted, others might actually benefit from AI. The key distinction is between companies that have “artificial limiters” — non-code moats like network effects, regulation, and physical infrastructure that restrict competitive entry and sustain pricing power — and those who don’t. The current indiscriminate sell-off is creating compelling public-market opportunities for both value and growth investors. Paradoxically, the more fragile arena is in the private markets.

This post covers the following:

  • What’s driving the SaaS / software crash (seat risk, build-vs-buy, and multiple compression)
  • Why mission-critical SaaS isn’t going away and the opportunity for value investors
  • The “artificial limiters” framework and how to screen for multi-baggers
  • Why venture math looks harder in an era of lower public multiples

Let’s dive in.

Understanding the collapsing multiples in software-land

To level-set the conversation, let’s first highlight the current market dynamics. In the past few weeks, hundreds of billions of dollars in market cap have been erased across both incumbent software names and newer entrants. Some examples (as of time of writing):

  • Workday: down 25.5% YTD (year-to-date)
  • Salesforce: down 23.7% YTD
  • Adobe: down 20.6% YTD
  • Applovin: down 23.5% YTD
  • Snowflake: down 15.8% YTD
  • Shopify: down 19.1% YTD

This has been especially painful for high-growth companies, where EV / NTM revenue multiples (enterprise value / next twelve months revenue) have collapsed to just 5x. This is bad relative to just 2 months ago, but worse if we compare it to Covid highs (20x+ EV/NTM).

 

Source: Needham: SaaS Monthly Drive-Thru (February 2026)

There are multiple reasons for the current sentiment (some may not be true):

  • Software companies (broadly) monetize via seat-based pricing. If AI agents can do the work and there are more layoffs, then software vendors won’t be able to sell as many seats
  • AI-native entrants can move faster and win market share before incumbents ship comparable offerings
  • Enterprises now believe that they can vibe-code at least a portion of their software in-house vs. paying large contracts for “generic” horizontal SaaS
  • System of records (Salesforce, Workday, SAP) might be priced as utilities if agents end up taking actions vs. humans
  • Software companies were overvalued to begin with because premise of SaaS throwing off cash predictably is wrong. The innovation cycle is happening too fast for predictable cash flows and high SBC (stock-based compensation) serve as a further drag on returns

The death of (mission-critical) SaaS is overblown

Of the reasons listed above, the one that seems most worrying is that companies can build much of their software internally. This worry has been amplified after the launch of recent models (Opus 4.6 and GPT-5.3-Codex) and their associated scaffolding / software ecosystem (Claude Cowork with plugins and OpenAI’s Codex).

While this sentiment is justified for feature-thin, nice-to-have software, it doesn’t apply to mission critical applications where the impact of vibe-coded apps going down vastly exceeds the ACVs (average contract value) customers pay for. Here, companies are overestimating their ability to build and maintain software in-house using AI. At the end of the day, customers are also paying for ongoing support, compliance, updates, security, SLA, etc. (which requires full-time headcount). Once the “fully loaded” costs of development are factored in, building software in-house might be higher.

What this means is that beaten down mission-critical software vendors (SAP, Workday, Salesforce) have time to reinvent themselves. Given these companies’ existing distribution and customer trust, they have time to introduce their own AI-native capabilities. Furthermore, tech-forward software incumbents can leverage AI (just like startups) to their various “jobs to be done” (coding, sales, customer support, etc.). This reduces the need for headcount expansion and results in reduced SBC & higher FCF (free cash flow). Companies that adopt this “creative destruction” mentality will see their stock prices re-rate (generating alpha for value investors).

Companies with “artificial limiters” represent multi-bagger opportunities

While the software re-rating story is interesting, I’m more fascinated by beneficiaries of AI. The mental model I’ve adopted is to look for growthy software (or software-enabled) businesses that can re-accelerate growth in the AI era and are protected by artificial limiters (aka non-code moats).

Some (non-exhaustive) categories that I’ve been thinking about:

  • Social networks: Meta, ByteDance, and to a lesser extent Reddit sit on entrenched user graphs and compounding data advantages. In practice, that means more precise AI ad targeting, measurement, and model training. It’s difficult for a challenger to usurp an incumbent even if they have access to the incumbents’ codebase
  • Physical infrastructure networks: Cloudflare is a good example. Traditionally known for its CDN & security products, Cloudflare operates a global footprint of points of presence (PoPs) with a material portion of internet traffic flowing through its network. Its footprint, ISP relationships, and the operational know-how creates “limiters” that a startups cannot easily replicate. In Cloudflare’s case, Workers AI and pay-per-crawl become additional growth levers
  • Enabling infrastructure / primitives: companies like Snowflake, MongoDB, and Datadog offer products that sit directly on top of (or close to) bare metal infrastructure — with high demands on availability, performance, and correctness. Even with AI-assisted development, most enterprises likely do not want to build and maintain core databases/observability primitives. Here, “buy” continues to dominate “build”
  • Fintech: Coinbase, Robinhood, and SoFi operate inside regulatory and partner constraints: broker-dealer frameworks, bank charters, custody, and monetization relationships (e.g., payment-for-order-flow counterparties) so code velocity isn’t necessarily the bottleneck

One important caveat here is that a “good company” doesn’t mean an “undervalued stock.” Many of these companies can be structural winners and still be fully priced. Finding multi-baggers requires underwriting capability and favorable market timing.

The perils of venture investing in the age of AI

While I’m optimistic on the opportunity set in the public markets, I’m more uneasy in venture-land. The disconnect between VC pricing and public-market multiples implies many investors are still underwriting software deals with mental models from the old SaaS era.

Yes, startups might be “AI native” and many can monetize in new ways (usage-based, outcome-based, tied to labor savings, etc.). But in many cases, it’s still SaaS by another name. These companies face the same questions around long-term defensibility and pricing power. The risk here is that with public multiples already compressed, even if an AI startup fully realizes its vision and meaningfully replaces labor spend, the exit value may not justify the entry price.

This is compounded by structural issues inside VC itself. Many funds are not set up to invest in truly disruptive, longer-duration technologies (fund size, ownership, time horizon mismatch), which I’ve written about prior. So, they end up crowding into “software-shaped” deals because those are easier to underwrite and historically cleared at high multiples. That doesn’t mean VC is devoid of alpha (e.g. my conviction in frontier AI labs). Additionally, some funds are positioned to succeed (partners have the right risk appetite, fund size, trust from LPs, etc.).

The opportunities and perils of the commoditization of code

So, the conclusion here is that AI is a true paradigm shift and tech investing won’t be business as usual going forward. However, the market’s response has been too blunt. In the public markets, there are opportunities at both ends of the spectrum: value investors can underwrite a software re-rating, while growth investors might find multi-baggers via companies with artificial limiters. My take is that mispricings are easiest to find in public equities. The bigger risk is in private markets, where entry prices still assume old-regime prices for new-regime risk.

Thank you to Will Lee, John Wu, Till Pieper, Homan Yuen, Yash Tulsani, Maged Ahmed, Wai Wu, and Andrew Tan for feedback on this piece. 



Discuss

Страницы