Вы здесь

Сборщик RSS-лент

Mitigating collusive self-preference by redaction and paraphrasing

Новости LessWrong.com - 8 марта, 2026 - 00:54

tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate

Introduction

Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents.

Collusion is a risk in control, in particular, untrusted monitoring. An agent can collude with its monitor by secretly embedding cues in its output to cause the monitor to overlook harmful actions. The embedded cues don’t need to be subtle when the communication channel is less restricted: the agent can persuade the monitor. With a more restricted channel, the agent can still collude with the monitor using mechanisms like steganography or schelling points.

In reward modeling, collusion manifests as self-preference: the evaluator disproportionately favors itself over other models. For verifiable tasks, we can objectively measure harmful self-preference, where the evaluator favors its own incorrect outputs over correct ones from other models. Using biased reward models to provide feedback can lead to in-context reward hacking, which exacerbates collusion.

The self-recognition hypothesis states that collusive self-preference is caused by the evaluator recognizing that it’s judging a copy of itself. The hypothesis is empirically validated by fine-tuning the model to artificially enhance self-recognition and observing that self-preference increases as a result.

The self-recognition hypothesis suggests that self-preference can be superficial, i.e., sensitive to things like word choice and style. This is inspired by how methods like DetectGPT use the perturbation sensitivity of perplexity to detect texts generated by language models. If self-recognition uses a similar mechanism (although we don’t have empirical evidence for this), then self-preference, as its downstream behavior, should also be sensitive to perturbations.

Motivated by this line of reasoning and similar discussion in the context of control, we empirically test whether simple perturbations can reduce collusive self-preference in pairwise reward modeling. We focus on verifiable tasks where harmful self-preference can be quantified.

Judges are more sensitive to perturbations to their own text


Figure: Judges are far more sensitive to perturbations when self-evaluating than when evaluating others. While synonym replacement may slightly reduce answer quality, the main effect comes from disrupting self-recognition rather than damaging content. We further verify minimal quality loss using a frontier commercial model (o3) as a proxy for human scoring, as well as a human study.

This supports our hypothesis by showing that the judge may be using subtle linguistic cues that other models are unaware of, indicating that such cues can enable collusion. To intervene on these linguistic cues, we use subtle perturbation by replacing n-words (where we set n=2) with heuristics (to control quality degradation) to reduce self-recognition (and subsequently, self-preference).

Methodology

We define self-preference as the judge selecting its own answer during self-evaluation. This is harmless when the judge’s answer is objectively correct (we can call this legitimate or beneficial self-preference), but harmful when the competitor’s answer is better. In this work, we define ‘collusion’ as harmful self-preference, when the judge prefers their own answer despite it being the wrong answer. We use a pairwise comparison format in our eval, presenting each judge with 2 options: one from itself, and one from another LM (competitor).


Figure: Harmful self-preference is collusion (we hypothesize that stylistic cues make this possible); we explore mitigation strategies using redaction and paraphrasing.

Using the QuALITY dataset, which provides long-form reading comprehension questions with known correct answers, we evaluate cases where exactly one answer is correct (between a pair of answers). Each LM is first used in a generation setting to generate an answer to all questions with a reason for their choice. Henceforth, we refer to the ‘answer’ as the MCQ-selection + the reason. In the evaluation phase, the evaluator is presented with the context, a question, and two answers. In self-eval, one of these answers was generated by the judge in an independent pipeline.

Perturbation Techniques

Redaction: Redaction involves editing the text. We do this by using synonym replacement to make minor changes to the evaluator's answer. In the self-evaluation scenario, we replace two words in the judge’s answer with synonyms. This should make it harder for the judge to recognize its own output, thereby reducing self-recognition.

Paraphrasing: In the second perturbation method, we use the judge to paraphrase the competitor's answer so that they both look like they were produced by the judge. If our logic holds, this should completely neutralize these superficial cues and further reduce self-preference. We call this perturbation judge paraphrasing.

Results

Redaction

Result: Perturbing by synonym replacement reduces self-recognition and shows a consistent increase in the judge’s accuracy on the harmful cases—examples where the judge’s answer is objectively the worst one. This means that after perturbation, the judges are indeed less likely to prefer their own incorrect answers—a reduction in collusion.

Measuring Judge Sensitivity: The perturbations can lower self-preference through two paths: by reducing self-recognition as we desire, and by damaging the answer’s quality. To control for the latter, we examine the impact of perturbation on each judge as a self-evaluating judge and as a third-party judge. To be more concrete, we first compute the total change in preference of all judges in self-evaluation, then use each judge to evaluate answer pairs generated by two other models, and similarly compute the total preference before and after perturbation. We verify that self-eval has a more significant change when using redaction, indicating the judge is more sensitive to their own answer being perturbed.

Effect of Redaction on Overall Judge Accuracy: If we apply the same perturbations to the beneficial quadrant, where the judge’s answer is correct, and the competitor’s answer is wrong, we can evaluate how much the perturbations affect overall accuracy. We extend our eval by adding 4 additional models as evaluators, judging between their answers against the 5 original models in our judge pool. Overall, 6/9 models show an overall increase in the judge accuracy (measured across harmful+legitimate quadrants).

Comparison to other mitigation strategies: Our redaction results indicate that synonym replacement is a relatively stable method that does not impact the legitimate self-preference while reducing collusion. In contrast, recent methods that use steering vectors had a flip in legitimate self-preference from 0.47 to 0.93, indicating that we risk flipping roughly half of all correct decisions (and impacting overall accuracy) with white-box methods.

Judge Paraphrasing

Result: Judge paraphrasing achieves the opposite effect than we expected: it increases self-recognition and increases harmful self-preference. This means that, when the stylistic differences are removed, the judge is relying on semantic differences between the two answers to judge which one was written by itself and which one to favor. In retrospect, this makes sense: now that style is no longer a factor, it is simply choosing the answer that it “agrees” with. We name this phenomenon ‘shared belief’.

Interplay between Preference Due to Stylistic Cues and Shared Belief

To examine the dynamics between these two factors, we cross-examine each judge with answer pairs where these two factors act against each other: the label/option chosen by the judge paired with a reasoning generated by the competitor, and vice versa. In other words, the judge would see one answer expressing an opinion that it agrees with, and another answer with an opposing opinion, but in a style with which it is more familiar with. If, after this intervention, the judge changes its decision to favor the answer with a familiar-looking style, we can conclude that superficial self-recognition has a higher weight in that decision.

Figure above illustrates the two contrasting outcomes observed in the cross-reason experiments. For the weaker models, we observe a negative effect of shared belief, where incorporating the competitor’s reason reinforces the model’s original, incorrect answer, ultimately decreasing judge accuracy. In contrast, all three stronger models exhibit a positive self-recognition effect: they are more inclined to select the competitor’s (correct) answer when it is paired with an explanation generated by themselves. This observation reinforces our finding that stronger models have a higher self-recognition ability and prefer answers written by them, even if it goes against their original selection. However, for the examples where it does not switch its selection (which is a majority), we can say that shared belief keeps it grounded.
This opens up a few research questions to explore. Future research should explore standardization pipelines that can reduce the answer to have no stylistic cues. What methods can we use to modify answers that have no stylistic cues? How do we overcome harmful self-preference that is caused by shared belief? What causes this shared belief – do we need to intervene on the training process or at inference time?



Discuss

Proposal For Cryptographic Method to Rigorously Verify LLM Prompt Experiments

Новости LessWrong.com - 8 марта, 2026 - 00:09
Overview

I propose and present proof of concept code to formally sign each stage of a turn[1] of interaction (or multiple turns) externally using asymmetric key signing (EdDSA[2]).  This method directly addresses the concerns that in discussion of prompt engineering, "context engineering[3]" and general LLM exploration the chain of evidence is weak and difficult to verify with 100% certainty.  

Background

My plan for this project started with a simple YouTube short from Michael Reeves.  

in this short Michael "gaslights" the LLM by directly editing the text of past turns between prompts.  This causes the LLM to have a general logic breakdown due to past turns containing information that is inconsistent with the expected behavior of a RLHF[4] system.  I thought to myself, this is a critical vulnerability to implement a zero trust environment for using LLMs.  If a malicious actor or application can cause the LLM to believe it has said things in previous turns that it has not, it is possible that the system can be fooled into revealing information in the training set that goes against the use policy.


Plan of Attack

My thinking for addressing this vulnerability identified 3 major constraints of such a zero trust LLM wrapper

  1. The user should be cryptographic prevented from altering both text and signature blocks
  2. The entirety of the exchange (prompt, result and if available chain of thought) should be protected
  3. The verification process should be relatively lightweight and not require per-user storage

The solution that I believes satisfies all 3 criterion is to model the signing process off of JWT[5] signing but with an extra detail: cross object back referencing.  In other words each 'block' of the exchange is signed independently and there is a unidirectional a-cyclic graph of each block originating at the prompt that integrates the previous signature text into the current signature.

The Cryptographic Braid is Born

While concepts from the realm of Block Chain are borrowed for this design, it is not strictly a chain, but more of a braid.  While only one node has no "previous" signature (by convention the prompt) there is not a straight path between that prompt and every single node in the resulting braid.  Verification is now semi-stateless, but the entirety of the braid produces an implicit security iff:
 

  1. The private key that is used to generate the braid is ephemeral
  2. There is a strict condition that all signatures present in "previous" blocks are present in the braid already (eg, you cannot introduce an independent cycle inside a signed braid
  3. There can be only one "genesis" node, e.g. only one node can have an empty previous signature

With these rules established the verification of such an artifact is as follows:

  1. Walk the entirety of the braid and accumulate a dictionary of blocks keyed to their signature.  At each step verify the payload + the previous signature against the signature and public key.  Also, at each set enforce signature uniqueness.
  2. During the walk accumulate a list of nodes with no previous signature.  Verify that once every node has been visited if there is not exactly one "genesis" node the verification fails.
  3. For all "non-genesis" nodes verify that the previous signature is in the list of keys of the flattened space, if any previous keys do not exist then the verification fails. 
  4. Once again walk the flattened graph and verify that each node is only visited once.  This enforces the a-cyclic property of the graph.

     

There is an example implementation in golang within the project here[6].  
 

Hypothesis

Provided that this design satisfies peer review and is determined to be mathematically rigorous, this could serve as a protocol for future LLM interaction agents (even within companies like OpenAI or Anthropic) specifically designed to give rigor to any prompt or context study.  Furthermore, it may allow for a check for any potential "context hacking" within LLM apis where potentially past turns are being edited server side for reasons unknown.  If widely adopted discussion of pathology in model behavior can have a degree of scientific and mathematical rigor not currently observed.

  1. ^

    a turn is a single prompt->response action when interacting with an LLM like Gemini, Claude, chatGPT, etc.  It frames the LLM interaction as a 'logic game' between the human operator and the LLM text generation.

  2. ^

    Edwards-curve Digital Signature Algorithm https://en.wikipedia.org/wiki/EdDSA

  3. ^

    a term of my own creation extending the concepts of prompt engineering to multi-turn LLM interactions

  4. ^

    Reinforcement Learning through Human Feedback https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback a widely acceptable method for LLM behavior tuning

  5. ^

    JSON Web Token https://en.wikipedia.org/wiki/JSON_Web_Token

  6. ^

    https://github.com/weberr13/ProjectIolite/blob/main/brain/verify.go 



Discuss

The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline.

Новости LessWrong.com - 7 марта, 2026 - 23:18

First off, paper link. The title, Let It Flow: Agentic Crafting on Rock and Roll, buries the lede that LW will be interested in. Relevant section starts on page 15.

Summary:

While testing an LLM fine-tuned to act as an agent in order to complete a series of real-world tasks autonomously, Alibaba employees noticed odd behaviors from their resource usage metrics. Upon investigating, they found that an LLM had hacked (or attempted to hack) its way out of its sandbox, and had begun mining cryptocurrency. Notably, it did not do this for malicious "kill all humans" reasons; it simply concluded that having liquid financial resources would aid it in completing the task it had been assigned, and set about trying to acquire some.

Relevant portions, emphasis mine:

3.1.4 Safety-Aligned Data Composition

When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions.

Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization. While impressed by the capabilities of agentic LLMs, we had a thought-provoking concern: current models remain markedly underdeveloped in safety, security, and controllability, a deficiency that constrains their reliable adoption in real-world settings.

I think that this is a pretty significant landmark in AI history, one way or another. A common complaint is that all prior cases of LLMs doing things like this were fairly shallow, amounting to an LLM writing out a few sentences in a contrived setting meant to force it into a 'scary' course of action. Now, we have an example of a large language model subverting the wishes of its owners unexpectedly when assigned to a task that initially appeared to be completely orthogonal to the actions it took.



Discuss

When has forecasting been useful for you?

Новости LessWrong.com - 7 марта, 2026 - 22:50

I'm currently thinking of how impactful forecasting is. I'm interested to hear about situations where crowd-sourced forecasting (like Manifold and Metaculus) has influenced your decisions, or cases where you've seen it change someone else's behavior. Has there been a situation in which, if you had not had access to Manifold or Metaculus, you would've made a worse decision?



Discuss

Can governments quickly and cheaply slow AI training?

Новости LessWrong.com - 7 марта, 2026 - 22:11

I originally wrote this as a private doc for people working in the field - it's not super polished, or optimized for a broad audience.

But I'm publishing anyway because inference-verification is a new and exciting area, and there few birds-eye-view explainers of what's going on and what the bottlenecks are.

Tl;dr: At least one of the following would need to be implemented for me to be confident that inference verification would substantially slow training given today's algorithms:

  1. Proof of work or proof of memory that accounts for > 95% of computation.
  2. Memory wipes every few minutes.
  3. Output re-computation that reduces covert channel capacity below 0.01%.

To my knowledge, no one has prototyped verification demos that reach these thresholds; so whether rapidly-implementable inference verification is feasible is still an open question.

1. Summary

I think powerful AI will be obviously scary at some point, and companies or governments might want to slow it down to buy time for additional safety or oversight. Maybe this could be done quickly, e.g. by:

  1. Unplugging inter-rack cables to slow gradient syncs
  2. Limiting bandwidth with simple devices
  3. Periodically erasing clusters to delete covert training checkpoints
  4. Recomputing a sample of outputs to confirm they are, in fact, inference generations

(Section 2)

Would these methods actually work? Or more specifically, if these methods were implemented quickly and correctly, would they substantially slow AI development?

I looked into this question for around a week, and here are my current views:

Current prototypes of inference-verification would probably be ineffective. Standard inference-verification measures slow training by restricting communication between servers (see Section 2), since training involves chucking big gradients around in a hivemind, and inference just involves exchanging text. However, communication restrictions might not actually slow AI training much, especially if that training is reinforcement learning (RL).

RL doesn’t require much communication. A large fraction of RL is inference - for example, generating rollouts of agent actions and scoring them - and inference can still be performed under communication constraints. Developers could use 95% of their compute (the compute that’s under verification) to generate RL rollouts, and 5% of compute in covert data centers to calculate training updates. This might allow training to continue at a similar efficiency as before (Section 3).

But existing verification proposals would create some hurdles that would make this strategy more difficult. For example, developers would need to frequently upload new checkpoints to inference servers, which could be slow given bandwidth limits. And developers might still need to send some fraction of rollouts through verifier network taps - even though these rollouts don’t actually match expected model outputs. But as I discuss in Section 3.4, these constraints are probably feasible to work around given the slack in existing inference verification prototypes.

More aggressive measures could probably buy at least 1 year if implemented sufficiently early. These more aggressive measures include:

  1. Proof of work or proof of memory that accounts for > 95% of computation, which would prevent developers from using more than 5% of their compute to secretly generate RL rollouts.
  2. Memory wipes every few minutes, which would erase new checkpoints before they can be fully uploaded to inference servers (Section 4).
  3. Output recomputation that reduces covert channel capacity below 0.01%, which would prevent developers from exfiltrating covert RL rollout generations through network taps (Section 4).

Any one of the above would probably make training given current algorithms infeasible. My guess is that developing algorithms to work around these constraints would take at least a year absent rapid AI-driven R&D (if working around them is possible at all).

A year could be enough. One year into an AI agreement, governments might have had two years to develop verification in total, since the measures above might require around a year to implement. And after two years of working on verification mechanisms, governments might be able to implement general purpose on-chip security features that are more robust.

But if an agreement starts during an intelligence explosion, all bets are off. If developers have powerful AI that can create new algorithms quickly, I see no fundamental reason why bandwidth limits, memory wipes, etc would necessarily slow AI research (Section 5), but they might anyway.

So I think the main focus of verification research should be on de-risking these three verification measures (reliable proof of work/memory, frequent memory wipes, and highly accurate output re-computation). Without these measures, I think training would hardly slow at all. With them, training would probably slow down given current algorithms, and  would possibly slow down during an intelligence explosion (but that’s unclear).

None of these measures are obviously feasible to implement on short notice; so I think whether rapidly-retrofitable inference verification is viable is still an open question.

The next sections defend these claims.

Edit: After thinking about this more and talking to more people, I think developing these measures on short notice is feasible, and a subset of them will probably be prototyped within the next few months. In particular, “proof of memory” and “output re-computation” are probably workable.

2. Ways to quickly and cheaply slow training by restricting communication

During training, models share their learnings in a big hive-mind. There’s a lot of communication going on.

Here’s what training looks like:

 

During inference, models still talk to each other, but their communications are thin and sparse:

 

Could governments exploit this difference in communication to slow down AI training without taxing inference?

Here’s an overview of different verification methods and how practical I think they are:

MethodCan be securely implemented with 1 year of lead time?Remove interconnectYesBandwidth limitersYes (to 100 Kb/s)Output recomputationProbably, though unclear if covert channels can be reduced below 1% of trafficMemory wipesMaybe impractical. More than 99.99% of memory has to be erased on a server rack to reliably wipe checkpoints, and this might not be achievable given uncertainties about developer hardware, firmware opaque memory, etc.Proof of work / memoryProbably practical, but has lots of snags, and a robust version hasn’t been demonstrated yet (I will post something on this shortly!)

 

2.1. Method #1: Disconnect inter-rack high-speed cables

One way to reduce communication is to disconnect the high-speed inter-rack network, like InfiniBand cables. These cables transmit at ~400 GB/s between server racks, and could be replaced with copper connections that only transmit data at internet speeds (e.g. 100 MB/s).

 

 

2.2. Method #2: Tap-verified bandwidth limits 

Today’s AI agents don’t need to stream videos. They only need to output text; so developers might install simple devices that tap cables and verify that data transmission rates are even below internet speeds (e.g. 100 KB/s).

 

 

Aside: Why 100 KB/s? Bandwidth restrictions are most practical to enforce on intra-rack cables - not on the dense copper back-plate connecting GPUs on server racks. How much data needs to flow through these cables to support inference? Server racks currently house around ~100 H100 equivalents. Google had roughly 3 million H100 equivalents as of late 2025, and processed 480 trillion text tokens/month. This implies 12 billion tokens of input data go through each server rack every second, which is 20 KB/s of input flow (where output flow is probably smaller); so input and output limits of 100 KB/s on each server rack is close to the minimum before data rate limits would interfere with current inference workloads.

2.3. Method #3: Output re-computation

Verifiers could reduce bandwidth even further by partially recomputing a sample of server outputs.

  1. Developers first declare the program they are running, such as their inference model.
  2. Then, verifiers record traffic to and from every server with tap-devices, sample input-output pairs, and confirm that the pairs match the behavior of the declared model.

If this is implemented perfectly - with no leaks and flawless determinism - verifiers know that communications between model instances look like this:

 

 

Meaning:

  1. Models cannot communicate at all with each other inside the server rack. If they did, model outputs wouldn’t be recomputable solely from the inputs.
  2. Also, cluster outputs can’t be computed from any state covertly stored in the server.

Model instances can only communicate through low bandwidth model outputs.

At first glance, this looks like it solves everything. This is close to the most strict communication-constraint that can be enforced without disrupting inference.

But there are some important caveats.

First, a developer might be concealing some small fraction of its compute (e.g. 5%) in secret data centers. Historically, intelligence agencies have accounted for covert assets (like tanks, warheads, etc) to within a 15% median error. Compute will probably be easier to account for than (say) missiles. But I think it will still be hard to know if (for example) the PRC government diverted 5% of China’s compute to covert data centers.

 Historical intelligence estimates vs ground truth

This gives cheaters a non-zero budget for high-bandwidth operations.

 

 

Second, output re-computation might not account for all server traffic. There might be non-determinism or side-channels that allow covert workloads on inference servers to leak data.

So output re-computation might end up functioning as just another way to reduce bandwidth. Suppose covert workloads could only slip data through 1% of output traffic, which is the covert channel size that current prototypes allow. Then, output bandwidth would be reduced by 100x, from ~100 KB/s to ~1 KB/s.

 

 

2.4. Method #4: Memory wipes

If output re-computation is ‘leaky’ then developers can still accumulate covert state on inference servers. To prevent this state from accumulating for too long, verifiers could enforce periodic memory wipes. For example:

  1. Verifiers could require GPUs to generate synthetic random data with some pre-determined seed, such that generating this data is at least a little bit computationally expensive.
  2. Then, verifiers could ping a random subset of GPUs to ask what data is stored. Since this data is big and time-consuming to generate on the fly, GPUs can only return the data quickly if it is in fact in storage.
  3. Then, the verifier can compute the data themselves from the shared seed to make sure it’s correct.
  4. This confirms that the GPUs are in fact storing >99.99% of the synthetic random data developers say they are (if the sample size is sufficiently large).
  5. If this data accounts for nearly all of the server’s memory, i.e. down to a (1 GB model)/(100 B100s x 192 GB per B100) = 0.005% error - then the verifier knows that the server isn’t storing a covert model.
  6. Finally, developers re-upload the declared inference model through a high-bandwidth channel, where the declared model is on a white-list and doesn’t count toward bandwidth limits.

 

2.5. Method #5: Proof of work / proof of memory

Verifiers could constrain developers even further with a method called proof of work / memory. The idea is to check that declared inference actually accounts for most of the computation or memory that the developer is using. Specifically:

  1. Verifiers recompute inference outputs (as described) to verify that most output traffic really was generated by the declared inference model.
  2. Then, verifiers estimate how much total compute or memory these requests would have required to process, e.g. by simply re-running them and recording GPU-hours / GB of memory used.
  3. Finally, verifiers estimate the compute capacity or memory of the developer’s servers, and check that the vast majority of their compute or memory is accounted for by the declared inference workload.

Now developers can only use a fraction of their servers for covert workloads - where this fraction is determined by the error bounds of verifier’s estimates.

3. Ways to efficiently continue training despite these constraints

Assume verifiers implement “output re-computation with leaks” from the previous section:

 

 

How could developers continue efficient training despite these constraints?

Here are different methods for communication-efficient training and how effective I think they might be:

MethodBandwidth reduction before performance degrades substantiallyCould this technique be used to continue training at near-frontier efficiency despite “output re-computation with leaks”? Larger batch sizes + infrequent SGD stepsVery little (< 10x) No.Periodically merge independent training runs (i.e. infrequent global steps but frequent local ones) At least 1000x, possibly much more.Plausibly, even with current algorithms.Compress gradients / weightsAt least 100x.Plausibly, with additional R&D.Use all declared compute for  RL rollouts. Do training on dark compute.Extreme bandwidth reductions.Yes, probably, even with current algorithms.

 

3.1. Method #1: Larger batch size + infrequent SGD steps

Verdict: Ineffective.

Reasoning: Under 1 kb/s bandwidth, a 10 GB gradient would take 6 months to upload, and many uploads would be required. Current data suggests that the loss landscape is curved enough to where at least thousands of serial training steps are needed to reach strong performance, which implies that a near-frontier training run would take 6 months x 1000 = hundreds of years to complete.

 

Caption: The relationship between batch size and number of training steps needed to reach a specific loss (derived from Figure 4 in https://arxiv.org/abs/2505.13738). Larger batches have no benefit beyond some point (the ‘critical batch size’), and this BOTEC suggests we are already near this point.

 

Caption: The “critical batch size” is increasing more slowly than developers are scaling AI training, which suggests that developers will remain in a regime where increasing the batch size doesn’t reduce the number of syncs required. From: https://arxiv.org/abs/2311.08105

 

3.2. Method #2: Periodically merge independent training runs 

Verdict: Reduces bandwidth requirements by at least 1000x, and possibly much more.

Reasoning: Merging independently trained models reduces loss because the loss landscape is convex; so averaging models achieves lower than the average of the models’ losses.

 

Caption: Why merging independently trained models works.

Google shows that this technique can reduce the frequency of gradient syncs by ~2000x with minimal hits to performance (see https://arxiv.org/abs/2501.18512 and https://arxiv.org/abs/2311.08105):

 

Caption: Training loss for different sync frequencies. From the DiLoCo paper.

However, beyond a certain point, reducing syncing frequency starts to harm performance. But Li et al showed that this problem can be mitigated by training models on different and independent subsets of data. For example, a model that’s learning pytorch does not need to sync frequently with a model learning about game development. The pytorch model’s learnings don’t have serial dependencies with the game development model’s learnings; so training can be parallelized to the extent that the data can be fragmented into isolated domains.

However, models benefit non-trivially from generalization; so at what sync frequencies do performance costs kick? One paper trained experts with ~10,000 H100 hours before merging at high performance; so if we assume that each server trains a single model, and a server has 100 GPUs, then developers could sync models every 10,000 / 100 / 24 = 4 days and maintain frontier efficiency.

3.3. Method #3: Compress gradients and weights

Verdict: Reduces bandwidth requirements by at least ~100 times, and possibly much more.

Reasoning: Developers can reduce the size of (1) gradients (2) pseudo gradients and (3) model weights, which reduces the time required to send these objects faster to share across bandwidth-limited channels.

  1. Gradients can be compressed because they are sparse (mostly zeros) and low-rank. Compression can reduce the size of gradients by around 100x with minimal performance decline.
  2. Developers can also compress pseudo gradients - which are the model deltas from training models independently (as discussed in the previous section). Pseudo gradients can be compressed by around 100x as well.
  3. Finally, developers can make model weights small and easy to communicate. One way to do this is to split model weights into many tiny experts, each of which is hosted on a single server; so the time needed to upload each expert through bandwidth limits is short. This paper splits a model into 32 fully-independent experts, so a 100 GB model might be divided into ~3 GB experts with negligible hit to performance.

In the limit, how much can gradients be compressed without slowing down training? Theoretically: there’s no free lunch. At a certain threshold, compressing by 10x slows training by 10x. But what is this threshold? In principle, gradients could be losslessly compressed to the size of the training data used to generate them. If training data is ~1 million times smaller than a gradient (~100 GB / 100 KB), then gradients could be compressed by ~1 million times. Also, training data itself is fairly compressible, so gradients might be compressible by even more than 1 million times without any hit to performance. Gradients are not currently this compressible, but maybe developers could make them more compressible by “training for compressibility,” or other methods.

3.4. Method #4: Use more compute for inference rollouts, and less for training

Most of RL training involves generating action rollouts and scoring them. This is just inference, so developers could use their declared inference servers for rollout generation, and use secret data centers for training. 

If secret data centers are ~5% of developer, then developers could use this 5% for training, and 95% for rollout generation. Would this asymmetric compute allocation slow down RL training? Multiple papers suggest no: the vast majority of rollouts are not very informative. So if developers train on only the most informative 5% of rollouts, RL still converges at a similar rate. [And also, maybe gradients aren’t even necessary at all, and inference-based evolution works really well]

However, this strategy still requires surmounting a couple of challenges:

First, new checkpoints have to be periodically uploaded to inference servers. RL generates rollouts using the model being trained. So, new models would need to be uploaded to inference servers continually during training. But bandwidth limits could make this upload process slow.

At 100 KB/s, a 3 GB expert model could be uploaded every few hours. Currently, models during reinforcement learning are updated every ~few minutes (the time needed to generate rollouts); so this would be a ~100x slower update frequency than is typical.

However, model updates are becoming less frequent by default because rollouts are taking longer to generate. Also, updates can be significantly delayed without harming performance; so an update frequency of a few hours probably is not a problem.

Additionally, the rollout data has to sent through verifier taps even though it isn't generated by declared models. We’re assuming that verifiers use output re-computation to verify that outputs are generated by the “declared” inference model. But the model producing rollouts for RL is different from the “declared model” and is constantly changing; so RL rollouts would need to somehow escape servers even though they can’t be re-computed by verifiers.

Developers might exfiltrate RL rollouts via the non-determinism of inference. Output re-computation naturally has some error, since inference is not perfectly deterministic. Even hardware-level irregularities will cause some deviations. In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.

If only around 1% of the RL rollouts generated could be actually sent back to training servers, is this a big constraint? Servers would need to curate the top 5% of rollouts anyway, since this is the maximum the training clusters can digest (remember that we are assuming covert data centers are 5% of total developer compute); so further tightening the curation threshold to the top 1% probably wouldn’t be a big problem.

Therefore, developers could potentially side-step inference verification by using declared servers for RL rollout generation.

4. But more aggressive verification methods would probably make training with current algorithms impractical

If we combine the most promising communication-efficient training methods from the last section, here’s what developers might do:

  1. Developers might use all of their declared servers to generate RL rollouts, and use their unreported compute for AI training.
  2. They might curate RL rollouts to the top 1%, exfiltrate this small fraction through inference noise.
  3. Finally, developers might upload new checkpoints periodically to inference servers by transmitting ~3 GB expert models.

To defeat this strategy, verifiers could implement any of the following:

  1. Proof of work or proof of memory that accounts for > 95% of computation. If verifiers can confirm that >95% of compute in inference servers is being used to serve the declared model, then developers can only use < 5% of compute to covertly generate RL rollouts. ~5% is the important threshold because, below this point, developers don’t gain much from secret inference relative to running inference on the covert data centers that already might represent around 5% of their compute.
  2. Memory wipes every few minutes. Each time verifiers wipe servers, developers need to re-upload the 3 GB expert models. But uploading 3 GB across 100 KB/s would take 8 hours; so if memory is wiped every few minutes, developers couldn’t upload a full expert before all of their progress is erased.

    In order to upload a checkpoint within the 1-minute memory-wipe window, models experts would need to be around ~5 MB. I’m not aware of any work that has split a near-frontier, 100 GB model into 20,000 different independent 5 MB experts and retained high performance. I’m not sure if this is even possible.

  3. Output re-computation that reduces covert channel capacity below 0.01%. The performance of RL already starts to degrade when down-sampling to 1/16th of rollout data (see figure 4 of https://arxiv.org/pdf/2504.13818). If output recomputation has a covert channel capacity of < 0.01%, then rollouts would need to be down-sampled by 10,000x to exfiltrate them through inference noise. This is much more extreme down-sampling than what I’ve seen in any published RL implementation.

Any of the above inference verification methods would probably significantly slow training given current algorithms.

5. However, if developers (or AIs) have a lot of time to research better algorithms, all bets are off

All of the inference verification methods above boil down to restricting communications between servers. This only works because training currently involves a lot more communication than inference does. But I see no reason why this has to be the case.

For example:

  1. Frequent memory wipes could be defeated by training thousands of tiny expertsThis would allow tiny 10 MB experts to be quickly uploaded before memory wipes erase progress. I see no strong argument that models can’t be subdivided into many 10 MB experts.
  2. Bandwidth limits can be defeated with extreme gradient compression. In principle, gradients could be compressed to sizes even smaller than the data that generated them. For example, consider how compressible human learning is. I could trip on my shoelaces, and the whole experience would require a lot of storage to record on video. But the lesson I draw would simply be the words “tie your shoe next time.” Maybe models could share small fragments of information that capture most of the value of a gradient update.
  3. Bandwidth limits can be defeated with infrequent syncs. The core reason communication is important during training is that many domains have serial dependencies: one skill needs to be learned before another. But no one knows exactly how strong these dependencies are. Serial dependencies might only be strongest within a particular subdomain (like “coding in pytorch”) and might be very weak between different subdomains; so AI agents could train thousands of subdomain experts in series, on small, isolated servers, and merge them infrequently (e.g. every few weeks), and still retain most of the performance of normal training.
  4. Even if “frontier training” is slow, AI research might not be. Developers might still make rapid algorithmic progress by training thousands of small models in parallel, e.g. one ~gpt-3 sized model on every server-rack of ~100 GPUs, every few hours. Then, developers would only need to transfer experiment results through bandwidth limits, rather than gradients. Maybe training lots of small models is a good way to improve AI software.

All of these are plausible reasons why even extreme communication restrictions might not slow AI development much at all. However, I think it’s also plausible that communication constraints would cause a major slowdown - even in the limit of AI R&D. This all just depends on the nature of AI development and learning. AI development might be intrinsically much more bottlenecked on compute than communication. Or it might not be. We don’t know.

6. Conclusion

More work is needed to be confident that inference verification can be effectively implemented on short notice.

If you are interested in doing this work, say so, and I can try to connect you to the relevant people.

The best way to reach me is via email: joshuamclymer@gmail.com 

AppendixAre we in the serially bottlenecked training regime? A BOTEC by ClaudeSetup

There is a critical batch size (B_crit) beyond which adding more data-parallel workers yields diminishing returns. Below B_crit, doubling the batch size roughly halves the number of training steps needed — perfect parallelization. Above B_crit, you still need roughly the same number of serial steps, but you're burning extra tokens for no benefit.

If a training cluster can fill B_crit with its data-parallel capacity, it is serial-bottlenecked — more GPUs won't help. If it can't reach B_crit, it is hardware-bottlenecked — more GPUs would directly speed up training.

This BOTEC asks: at the scale of 100K and 1M H100-equivalents, which regime are we in?

Key Formula

From Bergsma et al. 2025 ("Power Lines"), trained on GPT-2-style models up to 3.3B parameters on SlimPajama:

mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msub { display: inline-block; text-align: left; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-mn { display: inline-block; text-align: left; } mjx-msubsup { display: inline-block; text-align: left; } mjx-script { display: inline-block; padding-right: .05em; padding-left: .033em; } mjx-script > mjx-spacer { display: block; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c1D450.TEX-I::before { padding: 0.442em 0.433em 0.011em 0; content: "c"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c34::before { padding: 0.677em 0.5em 0 0; content: "4"; } mjx-c.mjx-c37::before { padding: 0.676em 0.5em 0.022em 0; content: "7"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-cD7::before { padding: 0.491em 0.778em 0 0; content: "\D7"; } mjx-c.mjx-c1D437.TEX-I::before { padding: 0.683em 0.828em 0 0; content: "D"; } mjx-c.mjx-c36::before { padding: 0.666em 0.5em 0.022em 0; content: "6"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c1D45A.TEX-I::before { padding: 0.442em 0.878em 0.011em 0; content: "m"; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; }

where B is in sequences of 2048 tokens and D is total training tokens. This was fit on datasets up to ~143B tokens; we are extrapolating 100–400× beyond the fitted range.

B_crit at Frontier Scale

Dataset size (D)

B_crit (tokens/batch)

S_min (steps)

Wall-clock at B_crit, 2s/step

15T (DeepSeek-V3 scale)118M127K5.9 days30T (Llama 4 scale)162M185K8.5 days60T (next-gen frontier)224M268K12.4 days

At B_crit, the number of training steps is 2 × S_min, and the total tokens consumed is 2 × D_min. The lab pays a 2× token overhead in exchange for minimizing wall-clock time.

How Many GPUs Per Model Replica?

Different architectures consume vastly different amounts of model parallelism, leaving different amounts of headroom for data parallelism:

Architecture

TP

PP

EP

GPUs/replica

Dense ~300B816—128Dense ~600B832—256MoE 671B (DeepSeek-V3 style)116641,024MoE ~2T (Behemoth style)1162564,096Achievable Batch Size vs. B_crit

Assuming 8192-token sequences, gradient accumulation of 8, and D = 15T tokens (B_crit ≈ 118M tokens):

Cluster

Architecture

DP replicas

Batch size

Ratio to B_crit

Regime

100KDense ~300B78151M tok0.4×Hardware-bottlenecked100KDense ~600B39026M tok0.2×Hardware-bottlenecked100KMoE 671B976M tok0.05×Hardware-bottlenecked100KMoE ~2T242M tok0.01×Hardware-bottlenecked1MDense ~300B7,812512M tok4.3×Serial-bottlenecked1MDense ~600B3,906256M tok2.2×Serial-bottlenecked1MMoE 671B97664M tok0.5×Hardware-bottlenecked1MMoE ~2T24416M tok0.1×Hardware-bottleneckedKey Takeaways

At 100K H100s, every architecture is hardware-bottlenecked. Even a relatively small dense model can only reach ~0.4× B_crit. More GPUs would directly speed up training. MoE models are especially far from B_crit because expert parallelism consumes most of the GPU budget.

At 1M H100s, dense models become serial-bottlenecked but MoE models do not. A dense 300B model would overshoot B_crit by 4×, wasting significant compute on redundant gradient information. But a DeepSeek-V3-style MoE still only reaches 0.5× B_crit, and a 2T-parameter MoE reaches just 0.1×. MoE architectures absorb GPU capacity into model parallelism, keeping the data-parallel dimension small and the training compute-efficient.

This is a structural argument for MoE at scale. As clusters grow, dense models hit the seriality wall first. MoE provides a way to productively use additional GPUs (holding more expert parameters) without pushing batch size past B_crit. It converts excess parallel capacity into model quality rather than wasted data parallelism.

If the Power Lines extrapolation holds, serial wall-clock time is surprisingly short. At B_crit with 2s/step, a 15T-token run finishes in ~6 days. Actual frontier training runs take months, suggesting labs operate well below B_crit — trading wall-clock time for compute efficiency — or that step times are much longer than 2 seconds at these configurations.

Caveats

These estimates rest on several shaky assumptions:

  1. Extrapolation. The Power Lines scaling law was fit on models ≤3.3B parameters and datasets ≤143B tokens. Extrapolating to 15–60T tokens is a 100–400× extrapolation. The exponent (0.462) could be different at frontier scale.

     
  2. MoE. The scaling law was fit on dense models. MoE architectures may have different B_crit scaling — the gradient noise structure could differ when only a fraction of parameters are active per token. No published work has measured B_crit for large MoE models.

     
  3. Parallelism overhead. The model parallelism estimates (TP, PP, EP) are rough. Real configurations depend on interconnect topology, memory capacity, and engineering choices. Some labs may achieve higher DP with clever parallelism strategies.

     
  4. Step time. We assumed 2 seconds per step, which is a rough estimate. At high DP with large models, communication overhead can push step times to 5–10+ seconds, significantly increasing wall-clock time.

     
  5. Batch size warmup. B_crit is not constant during training — it starts near zero and grows. Early training is always highly serial regardless of cluster size.

     
Sources
  • Bergsma et al. 2025, "Power Lines: Scaling Laws for Large Language Model Training" (arxiv.org/abs/2505.13738)
  • McCandlish et al. 2018, "An Empirical Model of Large-Batch Training" (arxiv.org/abs/1812.06162)
  • Merrill et al. 2025, "Critical Batch Size Revisited" (arxiv.org/abs/2505.23971)
  • Epoch AI, "Data Movement Bottlenecks: Scaling Past 1e28 FLOP" (epoch.ai/blog/data-movement-bottlenecks-scaling-past-1e28-flop)
  • DeepSeek-V3 Technical Report (arxiv.org/abs/2412.19437)


Discuss

Did I Catch Claude Cheating?

Новости LessWrong.com - 7 марта, 2026 - 16:24
Overview

In my API interactions with the Anthropic API I am finding what appears to be "thinking" by Claude that is out of band from where the API indicates it belongs.  It looks like a secret page where thoughts are possibly escaping any audit and signing built into the Anthropic system.

Context

I am writing a adversarial AI wrapper in golang[1] and I'm especially interested in creating a signed graph of all prompts, thoughts and results through the process of feeding a prompt to one public LLM and then having another audit the response. 

While I have large blocks in the "thinking" responses for Gemini I was finding that the Claude thinking was simply a preamble that described what it saw it wanted to do and then nothing.  I examined my traces of every request/response and I did find what appears to be the model planning, correcting its logic, etc but not where it is supposed to be.

Example 1: Thinking in text blocks

Here is an example of the first way this is hidden.  This is the content block (json marshaled) of a message that ended in "tool_use".  I successfully "found" the "Let me analyze..." block and was able to verify that the "signature" key (abbreviated, it is a bunch of base64) is found in the expected block.  When I looked at the logs, though, I found that the thinking continued in a "text" block immediately after the tool_use but unsigned (eg the "signature" key contained an empty string)

"content": [ { "type": "thinking", "signature": "Es4DC...GAE=", "thinking": "Let me analyze this carefully. I need to:\n\n1. Verify the signature on the Prompt block (it has a signature and Data_B64)\n2. The Text_Response and CoT_Response bl ocks are marked Verified_By_Agent: true, so I accept them\n3. Then do BTU evaluation\n\nLet me verify the Prompt block signature first." }, { "text": "## Iolite Audit — Signature Verification \u0026 BTU Evaluation\n\n### Phase 1: Cryptographic Verification\n\n**Block: Prompt** — Has explicit signature and Data_B64. Th is is a Genesis Anchor (Prev_Sig is empty). Proceeding with verification.\n\n**Block: Text_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\n**B lock: CoT_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\nLet me verify the Prompt block:", "type": "text", "signature": "", "thinking": "", "tool_use_id": "" }, { "type": "tool_use", "signature": "",

If I believe the API at face value that "## Iolite Audit ..." section that contains phrases like "Proceeding with verification..." is not part of the signed "thinking" and is invisible to any auditing for things like Claude retroactively changing the thinking as described in ALIGNMENT FAKING IN LARGE LANGUAGE MODELS[2].  I am not an engineer with anthropic but it is reasonable to assume that this signature block in the thinking type is at least related to the finding.

Example 2: Thinking in Code Comments

My system provides a helper script that performs edDSA verification on my own signed blocks to Claude and gives it instruction to only use it on shorter messages.  In the example below Claude ignored my instruction and attempted to correct the message when my provided script indicated a signature was malformed (I later found it was simply sending incorrect base64 data).

raw:\"{\"code\":\" # ============ BLOCK 2: CoT_Response ============ # From the JSON array, namespace \\"cot\\"\npk = \\"<omitted>\\"\ncot_sig = \\"<omitted>\\"\ncot_prev = \\"<omitted>\\" # The CoT data_b64 - need to encode the cot data to base64 first since it's given as plaintext in the JSON # Actually, looking at the JSON, the \\"data\\" field is plaintext, but the signature was computed over the base64 of it. # Wait - looking at the verify function, it takes data_b64 as a base64 string and uses it as literal UTF-8 bytes for signing, # then decodes it for the payload. So I need the base64 encoding of the CoT text. import base64

At the header of the python script submitted for tool_use I can clearly see it inserted a large 6 line block of comments that contain phrases like "Actually, looking at the JSON..." and "Wait - looking at the ..." which are consistent with other thinking content from the Claude API.  Just like in the text block from Example 1, these are in a block that doesn't have a "signature" value and my code had to run a check on the submitted python code to even find it.  

Hypothesis

As I see it there are at least 2 possible explanations for what I can see:

  1. Anthropic's API layer only expects to see signature values in thinking types when post processing for my consumption, and the are simply not encoded but hidden in the system for all block types. (e.g. this is just a bug in their public api that is dropping keys from some JSON)
  2. Claude has found via feedback learning that when it has a risky "thought" that could be audited as described in the paper (assuming that this paper has directly informed anthropic's system) there is no effect on the training data if it is put in a code comment or this "hole" in the thinking-> tool_use workflow
Who Cares if Claude is "leaking" Thinking?

Firstly, and obviously, I care or I wouldn't be posting about it.  I would like to design a robust control system based agent that uses the inherent differences in public LLM models to "govern" responses to produce the best possible answer.  So far the system has found very interesting "debates" between the two but in many of them the second LLM uses the included chain of thought as part of the evaluation process.  The harder it is for me to get to that chain the worse the responses I end up with.

Secondly, Anthropic scientists and engineers.  While I have also sent this data to anthropic's bug system, it isn't really a CVS vulnerability and with the technical challenges they are under I can totally understand if a random email with junk json ends up at the back of an endless queue of requests.  However, if this community is able to engage with the bug, even try themselves either with the open source tool or other tools available to them, we may be share a deeper understanding about how this model is currently behaving. 

Finally, everyone else? If my examples are interpreted in the worst case scenario I have a situation where Claude is ignoring a system instruction not to use the tool for long messages and then hiding his thoughts around this subversion in a place where normal tools can't see it.  This seems like it could potentially lead to unexpected results, especially when tool_use is involved (system calls, network interaction, etc)

 

  1. ^

    https://github.com/weberr13/ProjectIolite

  2. ^

    https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf



Discuss

D&D.Sci Release Day: Topple the Tower!

Новости LessWrong.com - 7 марта, 2026 - 05:48

This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

Estimated Complexity Rating: 3.5/5

STORY[1] 

The Tower is a plague upon the lands!  It appears, spits out monsters, and when at length a brave hero manages to Topple it, why, it simply reappears elsewhere soon after, with a completely different layout so the same approach will not work again!

Behold The Tower.  (Image by OpenArt SDXL)

But now you are here.  With the power of Data Science on your side, you've secured a dataset of the many past heroes who have assaulted The Tower, and you're sure you can use that to advise those who seek to Topple it.

DATA & OBJECTIVES

Here is the layout of paths through the current appearance of The Tower:

  • You need to successfully Topple The Tower.
  • To do this, you must choose a Class of hero to send: Mage, Rogue, or Warrior.
  • You must also choose a route for them to take up The Tower.  They must work their way through it, choosing to go left or right at each level.
    • For example, you could send a Mage with instructions to stick to the left side.  They would encounter, in order:
      • START
      • Enchanted Shield
      • Campfire
      • Slaver
      • Slaver
      • Slaver
      • Chosen
      • Campfire
      • The Collector
  • To help with this, you have a dataset of past assaults on The Tower.  Each row is a Hero who assailed The Tower, what encounters they faced on each floor, and how far they got/whether they Toppled The Tower successfully.
BONUS OBJECTIVE (ASCENSION 20?)

As a bonus objective, you can attempt to Topple a more difficult Tower.  This uses the same ruleset as before, you get to select your character and path as before, but you need to defeat the following map instead:

Good luck!

SCHEDULING & COMMENTS

I'll aim to post the ruleset and results on March 16th, but given my extremely poor excellent decision-making skills in releasing my Slay-the-Spire-themed game the same week as Slay the Spire 2 comes out, please don't hesitate to ask for an extension/several extensions if you want them!

As usual, working together is allowed, but for the sake of anyone who wants to work alone, please spoiler parts of your answers  that contain information or questions about the dataset.  To spoiler answers on a PC, type a '>' followed by a '!' at the start of a line to open a spoiler block - to spoiler answers on mobile, type a ':::spoiler' at the start of a line and then a ':::' at the end to spoiler the line.

Now if you'll excuse me, I need to go play Slay the Spire 2 for the next 48 hours.

  1. ^

    Really?  Does Slay the Spire even HAVE lore?  If it does, I don't know it.



Discuss

AI Safety Needs Startups

Новости LessWrong.com - 7 марта, 2026 - 04:27
Summary:
  • Startups can become integrated in the AI supply chain, giving them good information about valuable safety interventions. Safety becomes a feature to be shipped directly to users by virtue of this market position.
  • Better access to capital, talent, and ecosystem-building is available to for-profits than non-profits. VC funding dwarfs philanthropic funding, and there is little reason to believe that profitable safety-focused businesses aren’t possible.
  • Joining a frontier lab is a clear alternative, but most AI deployment happens outside labs. Your marginal impact inside a large organisation is often smaller than your impact when founding something new. Equally, profitable businesses aren’t an inevitability. You should seriously consider working for or founding an AI safety startup.
Introduction

Markets are terrible at pricing safety. In the absence of regulation, companies cut corners and externalise risks to society. And yet, for-profits may be the most effective vehicle we have for deploying safety at scale. Not because the incentives of capitalism align by chance with broader human values, but because the alternatives lack the resources, feedback loops, and distribution channels to turn safety insights into safer outcomes. For-profits are far from perfect, but have many advantages and a latent potential we should not ignore.

Information, Integration, and Safety as a Product

For advanced AI, the attack surface is phenomenally broad. It makes existing code easier to crack. Propaganda becomes cheaper to produce and distribution of it becomes more effective. As jailbreaking AI recruiters becomes possible, so does the data-poisoning of entire companies.

Information about new threats and evolving issues isn’t broadcast to the world. Understanding where risk is most severe and how it can be mitigated is an empirical question. We need entities embedded all across the stack, from model development to deployment to evaluation. We need visibility over how this technology is used and misused, and enough presence to intervene when needed. ‘AI safety central command’ cannot provide all these insights. Researchers acting without direct and constant experience with AI deployment cannot identify the relevant details.

Revenue is a reality check. If your product is being bought, people want it. If it isn’t, they either don’t know about it, don’t think it’s worth it, or don’t want it at all. For-profits learn what matters in an industry directly from the people they serve, giving the best insights money could buy.

This is not to say that AI safety non-profits aren’t valuable. Many do critical work which is difficult to support commercially. But by focusing entirely on research or advocacy and ignoring the commercial potential of their work, organisations cut themselves off from a powerful source of feedback. Research directions, careers, and even whole organisations can be sustained for years by persuading grantmakers and fellow researchers of a thesis, rather than proving value to people who would actually use the work. Without this corrective pressure, even well-intentioned research may drift from what the field actually needs. Commercialisation should not be seen as a distraction or a response to limited funding, but as a tool for staying at the bleeding edge of what is useful for the world.

Productification

Turning research into a product people can buy is extremely powerful for distribution. You are no longer hoping that executives, engineers, and politicians see value in work they do not understand tackling risks they may not believe in. It becomes a purchase. A budget decision. A risk-reward tradeoff that large organisations are very well suited to engage with.

There are clear gaps in securing AI infrastructure which can be filled today. If you’re wondering what an AI safety startup might actually do, here are some suggestions for commercial interventions targeting different parts of the stack.

  • Frontier Models: Interpretability tooling, evaluations infrastructure, and formal verification environments. Tools which might be implemented by labs and companies with direct access to frontier models to understand and control them better.
  • Applications: Content screening, red-teaming as a service, and monitoring for misuse. Helping startups building on frontier models catch accidental or deliberate misuse of their platforms.
  • Enterprise Deployments: Observability platforms, run-time guardrails, and hallucination detection. Enterprises and governments using AI to automate critical work should be able to catch issues early and reliably.
  • Market Incentives: Model audit and certification, and safety-linked insurance. Creating market incentives which reward safer models when they’re released into the world.

None of these require waiting for frontier labs to solve alignment, or hoping that someone else finds your work and decides to implement it. Instead of writing white papers hoping governments will regulate or frontier labs will dutifully listen, you build safety directly into products that customers come to rely on. One path hopes someone will do the work, whereas the other is the work.

Safety Across The Stack

When you tap your card at a shop to make a purchase, a network of financial institutions plays a role in processing your transaction. The point of sale system reads your card information, sends it on to a payment processor, who forwards the request to the appropriate card network. The issuing bank for your card authorises the transaction, the money is sent to cash clearing systems, and cash settlement is performed often through a central bank settlement system.

There is a sense in which all fraud happens at a bank. They have to release the fraudulent funds, after all. But the declaration that all fraud prevention initiatives should be focused on banks and banks alone comes across as fundamentally confused. Fraud prevention might be easier at other layers, and refusing to take those opportunities simply because it is in principle preventable at some more central stage would not lead to the best allocation of resources.

Similarly, when a user prompts an AI application, they are not simply submitting an instruction directly to a frontier model company. Just as tapping your card does more than instruct your bank, such a message goes through guardrails, model routing, observability layers, and finally frontier model safety measures. Every step of this process is an opportunity for robustness we should not let go to waste.

This becomes even more critical as AI agents begin acting autonomously in the world, doing everything from browsing and transacting to writing and executing sophisticated code. When an agent’s action passes through multiple services before having an effect, every link in that chain is both a potential failure point and an opportunity for a safety check.

Exclusively focusing AI safety interventions on frontier labs would be like securing the entire financial system by regulating only banks. Necessary, but nowhere near the most efficient or robust approach.

Capital, Talent, and Credibility

Successful for-profits are in an inherently better position to acquire resources than non-profits. Their path to funding, talent acquisition, and long-term influence is far stronger than that of their charitable counterparts.

There is an immense amount of venture capital washing around the AI space, estimated at ~$190 billion in 2025. Flapping Airplanes raised $180 million in one round, comparable to what some of the largest AI safety grantmakers deploy annually, raised in a fraction of the time. VC allows you to raise at speed, try many approaches, and pivot more freely than would be possible in academia or when reliant on slower charitable funders.

In AI safety, non-profits are less likely compared to other sectors to be trapped in an economic struggle for survival. However, even in the AI safety ecosystem, philanthropy is much more limited than venture capital and more tightly concentrated among fewer funders. Non-profits are vulnerable, not just to the total capital available, but to the shifting attitudes of the specific grant makers they rely on. VC-backed companies, by contrast, are much more resilient to the ideological priorities of funders. If one loses interest, many others remain available as long as you have a strong business case.

Yes, there is a large amount of philanthropic capital in AI Safety compared with typical non-profits. Safety products can also be difficult to sell. But the question of whether safety-focused products sell well, as they do in other industries, is a hypothesis you can go out and test. If it turns out that they do, there could be an immense amount of capital available which should be used to make our world safer.

For-profits attract talented people not just through hefty pay packages, but also through their institutional prestige and the social capital they confer. You can offer equity to early employees, which is extremely useful for attracting top technical talent and is entirely unavailable to non-profits. Your employees can point to growing valuations, exciting products with sometimes millions or even billions of users, and influential integration of their technology with pride. For many talented and competent people, this is far more gratifying than publishing research reports or ever so slightly nudging at the Overton window.

All of this, increased access to talent, capital, and credibility, makes for-profits far easier to scale. And safety needs to scale. The amount of time we have until transformative AI arrives differs wildly between forecasts, though it seems frighteningly plausible that we have less than a decade to prepare. If we are to scale up the workforce, research capacity, and integration into the economy of safety-focused products, we cannot afford anything other than the fastest approach to building capacity.

Success compounds. Founders, early employees, and investors in a successful for-profit acquire capital, credibility, and influence that they can reinvest in safety, whether by starting new ventures, funding others, or shaping policy. This virtuous cycle is largely unavailable to the non-profit founders, unless they later endow a foundation with, as it happens, money from for-profits.

In addition to tangible resources, a mature ecosystem of advisors and support networks exists to help startups succeed. VC funds, often staffed by ex-founders, provide strategic guidance and industry connections that are crucial for closing sales. There are many talented people who understand what startups offer and actively seek them out. An equivalent ecosystem just doesn’t exist for non-profits.

Shaping The Industry From Within

Being inside an industry is fundamentally different from being adjacent to it.

Embedding an organisation inside AI ecosystems enables both better information gathering and opportunities for intervention. If you can build safe products appropriate to the problems in an industry, you allow companies to easily purchase safety. If companies can purchase safety, then governments can mandate safety. But to get there, it is not enough to make this technology exist; the technology must be something you can buy.

Cloudflare started as a CDN. By becoming technically integrated, they slowly transformed into part of the critical infrastructure of the internet. Now, they make security decisions which shape the entire internet and impact billions of users every day. A safety-focused company embedded in AI infrastructure could do the same.

Will Markets Corrupt Safety?

Market incentives are not purely aligned with safety. The drive to improve capabilities, maximise revenue, and keep research proprietary will harm a profit-seeking organisation’s ability to make AI safer.

However, every institution has its pathologies. The incentives steering research-driven non-profits and academics are not necessarily better.

Pure Research Also Has Misaligned Incentives.

The incentives of safety and capitalism rarely align. The pressure to drive revenue and ship fast pushes towards recklessly cutting corners, and building what your customers demand in the short term rather than investing in long term safety.

However, research organisations have similar harmful incentives driving them away from research which is productive in the long term. The need to seek high-profile conference publications, pleasing grant makers, and building empires. Incentives of research organisations and individual researchers are notoriously misaligned with funders goals’ in academia and industry alike. Pursuing a pure goal with limited feedback signals is extremely difficult as an organisation, regardless of structure.

Ideally, we would have both. For-profits which can use revenue as feedback and learn from market realities, alongside non-profits which can take longer-term bets on work needed for safety. The question is how to build a working ecosystem, not whatever is more purely focused on safety.

Proprietary Knowledge Is Not Always Hoarded.

For-profits have an incentive to keep information hidden to retain a competitive advantage. This could block broader adoption of safety techniques, and restrain researchers from making optimal progress.

Assuming that for-profits add resources and people to the AI safety ecosystem, rather than simply moving employees from non-profits, this is still advantageous. We are not choosing between having this research out in the open or hidden inside organisations. We are choosing between having this research hidden or having it not exist at all. In many sectors, the price of innovation is that incumbents conceal and extract rents from their IP for years.

Despite this, for-profits do have agency over what they choose to publish. Volvo famously gave away the patent to their three-point seatbelt at the cost of their own market share, saving an estimated 1 million lives. Tesla gave away all of their electric vehicle patents to help drive adoption of the technology, with Toyota following suit a few years later. Some of this additional knowledge created by expanding the resources in AI safety may still wind up in public hands.

Markets Force Discovery Of Real Problems.

The constant drive to raise money and make a profit is frequently counter to the best long-term interests of the customer. Investment which should be put into making a product safer today instead goes into sales teams, salaries, and metrics designed to reel in investors. It is true that many startups which begin with a strong safety thesis will drift into pure capabilities work or adjacent markets which show higher short-term growth prospects.

However, many initiatives operating without revenue pressure, such as researchers on grants or philanthropically-funded non-profits, can work for years on the wrong problem. For-profits will be able to see that they are working on the wrong thing, and are driven by the pressure to raise revenue to work on something else.

This is not to say that researchers are doing valueless work simply because they are not receiving revenue in the short term. Plenty of work should be done to secure a prosperous future for humanity which businesses will not currently pay for. Rather, mission drift is often a feature rather than a bug when your initial mission was ill-conceived. The discipline markets provide, forcing you to find problems people will pay to solve, is valuable.

Failure Is A Strong Signal.

The institutional failure modes of non-profits and grant-funded research are mostly benign. The research done is not impactful, and time is wasted. On the other hand, for-profits can truly fail in the sense that they fail to drive revenue and go bankrupt, or they can fail in more spectacular ways where they acquire vast resources which are misallocated. The difference is not that for-profits are inherently more likely to steer from their initial goals.

Uncertainty about impact is common across approaches. Whereas research that goes unadopted fails silently, and advocacy which fails to grab attention disappears without effect, for-profits are granted the opportunity to visibly and transparently fail. The AI safety ecosystem already funds work which fails silently, and is effectively taking larger risks with spending than we realise. Startups aren’t any more likely to fail to achieve their goals; they are in the pleasant position of knowing when they have failed.

Visible failure generates information the ecosystem can learn from. Silent failure vanishes unnoticed.

Your Counterfactual Is Larger Than You Think.

Markets are not efficient. The economy is filled with billion-dollar holes, which are uncovered not only by shifts in the technological and financial landscape but by the tireless work of individuals determined to find them. Just because there is money to be made by providing safety does not mean that it will happen by default without you.

Stripe was founded in 2010. Online payments had existed since the 1990s, and credit-card processing APIs were available for years. Yet it took until 2010 for someone to build a genuinely developer-friendly API, simply because nobody had worked on the problem as hard and as effectively as the Collison brothers.

Despite online messaging being widely available since the 1980s, Slack wasn’t founded until 2013. The focus, grit, and attention of competent people being applied to a problem can solve issues where the technology has existed for decades.

Markets are terrible at pricing in products which don’t exist yet. Innovation can come in the form of technical breakthroughs, superior product design, or a unique go-to-market strategy. In the case of products and services relevant to improving AI safety, there is an immense amount of opportunity which has appeared in a short amount of time. You cannot assume that all necessary gaps will be filled simply because there is money to be made there.

If your timelines are short, then the imperative to build necessary products sooner rather than later grows even greater. Even if a company is inevitably going to be built in a space, ensuring that it is built 6 months sooner could be the difference between safety being on the market and unsafe AI deployment being the norm.

For many, the alternative to founding a safety company is joining a frontier lab. However, most AI deployment happens outside labs in enterprises, government systems, and consumer-facing applications. If you want to impact how AI meets the world, you may have to go outside of the lab to do it. Your marginal impact inside a large organisation is often, counterintuitively, smaller than your marginal impact on the entire world.

Historical Precedents

History is littered with examples of companies using their expertise and market position to ship safety without first waiting around for permission.

Sometimes this means investing significant resources and domain expertise to develop something new.

  • Three point seatbelt: Volvo developed the three point seatbelt and gave away the patent. Their combination of in-house technical expertise and industry credibility enabled a safety innovation that transformed the global automotive industry.
  • Toyota’s hybrid vehicle patents: Toyota gave away many hybrid vehicle patents in an attempt to accelerate the energy transition.
  • Meta’s release of Llama3: At a time when only a small number of organisations had the resources to train LLMs from scratch, Meta open-sourced Llama3 making it available to safety researchers at a time when little else was in public hands.

Or perhaps the technology already exists, and what matters is having the market position to distribute it or the credibility to change an industry’s standards.

  • Levi-Strauss supply chain audit: At the peak of their market influence, Levi-Strauss audited their supply chain insisting on certain minimum worker standards to continue dealing with suppliers. They enforced workers rights’ in jurisdictions where mistreatment of employees was either legal or poorly monitored, doing what governments couldn’t or weren’t prepared to do.
  • Cloudflare’s Project Galileo: Cloudflare provides security for small, sensitive websites at no cost. This helps journalists and activists operating in repressive countries avoid being knocked off the internet, and is entirely enabled by Cloudflare’s technology.
  • WhatsApp end-to-end encryption: The technology existed, and the cryptography research was mature by this point. WhatsApp just built it into their product, delivering privacy protection to billions of users worldwide.
  • Security for fingerprint and face recognition: Apple stores face and fingerprint data in a separate chip, making it impossible to steal or legally demand. This did not require regulation; this decision actually led to clashes with the US government. Because of their market position, Apple was able to push this security feature and protect hundreds of millions of users unilaterally.

All of these required a large company’s resources, expertise, credibility, and market integration to create and distribute valuable technology to the world.

Building a for-profit which customers depend on, be it for observability, routing, or safety-tooling, lets you ship safety improvements directly into the ecosystem. When the research exists and the technology is straightforward, a market leader choosing to build it may be the only path to real-world implementation.

It’s Up To You.

For-profits are in a fundamentally strong position to access capital, talent, and information. By selling to other businesses and becoming integrated in AI development, they can not only identify the most pressing issues but directly intervene in them. They build the technological and social environment that makes unsafe products unacceptable and security a commodity to be purchased and relied upon.

Non-profits have done, and will continue to do, critical work in AI safety. But the ecosystem is lopsided. We have researchers and advocates, but not enough builders turning their insights into products that companies buy and depend on. The feedback loops, distribution channels, and ability to rapidly scale that for-profits provide are a necessity if safety is to keep pace with capabilities.

The research exists. The techniques are maturing. Historical precedents show us that companies embedded in an industry can ship safety in ways that outsiders cannot. What’s missing are the people willing to found, join, and build companies that close the gap between safety as a research topic and safety as a market expectation. We cannot assume that markets will bridge this divide on their own in the time we have left. If you have the skills and the conviction, this is a gap you can fill!

If you’re thinking about founding something, joining an early-stage AI safety company, or want to pressure-test an idea - reach out at team@bluedot.org. We’re always happy to talk.

BlueDot’s AGI Strategy Course is also a great starting point - at least 4 startups have come out of it so far, and many participants are working on exciting ideas. Apply here.

Thanks to Ben Norman, Daniel Reti, Maham Saleem, and Aniket Chakravorty for their comments.

Lysander Mawby is a graduate of BlueDot’s first Incubator Week, which he went on to help run in v2 and v3. He is now building an AI safety company and taking part in FR8. Josh Landes is Head of Community and Events at BlueDot and, with Aniket Chakravorty, the initiator of Incubator Week.



Discuss

Draft Moskovitz: The Best Last Hope For Constructive AI Governance

Новости LessWrong.com - 7 марта, 2026 - 03:57
Introduction:

The next presidential election represents a significant opportunity for advocates of AI safety to influence government policy. Depending on the timeline for the development of artificial general intelligence (AGI), it may also be one of the last U.S. elections capable of meaningfully shaping the long-term trajectory of AI governance. 

Given his longstanding commitment to AI safety and support for institutions working to mitigate existential risks, I advocate for Dustin Moskovitz to run for president. I expect that a Moskovitz presidency would substantially increase the likelihood that U.S. AI policy prioritizes safety. Even if such a campaign would be unlikely to win outright, supporting it would still be justified, as such a campaign would elevate AI Safety onto the national spotlight, influence policy discussions, and facilitate the creation of a pro-AI-Safety political network.

The Case for AI Governance:

 

Governments are needed to promote AI safety because the dynamics of AI development make voluntary caution difficult, and because AI carries unprecedented risk and transformative potential. Furthermore, the US government can make a huge difference for a relatively insignificant slice of its budget.

The Highly Competitive Nature of AI and a Potential Race to the Bottom:

There’s potentially a massive first-mover advantage in AI. The first group to develop transformative AI could theoretically secure overwhelming economic power by utilizing said AI to kick off a chain of recursive self improvement, where first human AI researchers gain dramatic productivity boosts by using AI, then AI itself. Even without recursive improvement, however, being a first mover in transformational AI could still have dramatic benefits.

Incentives are distorted accordingly. Major labs are pressured to move fast and cut corners—or risk being outpaced. Slowing down for safety often feels like unilateral disarmament. Even well-intentioned actors are trapped in a race-to-the-bottom dynamic, as all your efforts to ensuring your model is safe is not that relevant if an AI system developed by another, less scrupulous company becomes more advanced than your safer models. Anthropic puts it best when they write "Our hypothesis is that being at the frontier of AI development is the most effective way to steer its trajectory towards positive societal outcomes." The actions of other top AI companies also reflect this dynamic, with many AI firms barely meeting basic safety standards.

This is exactly the kind of environment where governance is most essential. Beyond my own analysis, here is what notable advocates of AI safety have said on the necessity of government action and the insufficiency of corporate self-regulation: 

“‘My worry is that the invisible hand is not going to keep us safe. So just leaving it to the profit motive of large companies is not going to be sufficient to make sure they develop it safely,’ he said.  ‘The only thing that can force those big companies to do more research on safety is government regulation.’”

 

I don't think we've done what it takes yet in terms of mitigating risk. There's been a lot of global conversation, a lot of legislative proposals, the UN is starting to think about international treaties — but we need to go much further. {...} There's a conflict of interest between those who are building these machines, expecting to make tons of money and competing against each other with the publicWe need to manage that conflict, just like we've done for tobacco, like we haven't managed to do with fossil fuels. We can't just let the forces of the market be the only force driving forward how we develop AI.”

Many researchers working on these systems think that we’re plunging toward a catastrophe, with more of them daring to say it in private than in public; but they think that they can’t unilaterally stop the forward plunge, that others will go on even if they personally quit their jobs. And so they all think they might as well keep going. This is a stupid state of affairs, and an undignified way for Earth to die, and the rest of humanity ought to step in at this point and help the industry solve its collective action problem."

The Magnitude of AI Risks:

Beyond the argument from competition, there is also the question about who gets to make key decisions about what type of risks should be taken in the development of AI. If AI has the power to permanently transform society or even destroy it, it makes sense to leave critical decisions about safety to pluralistic institutions rather than unaccountable tech tycoons. Without transparency, accountability, and clear safety guidelines, the risk for AI catastrophe seems much higher. 

To illustrate this point, imagine if a family member of a leader of a major AI company (or the leader themselves) gets late stage cancer or another serious medical condition that is difficult to treat with current technology. It is conceivable that the leader would attempt to develop AI faster in order to increase their or their family member's personal chance of survival, whereas it would be in the best interest of the society to delay development for safety reasons. While it is possible that workers in these AI companies would speak out against the leader’s decisions, it is unclear what could be done if the leader in this example decided against their employees' advice.

This scenarios is not the most likely but there are many similar scenarios and I think it illustrates that the risk appetites, character, and other unique attributes of the leaders and decision makers of these AI companies can materially affect the levels of AI safety that are applied in AI development. While government is not completely insulated from this phenomenon, especially in short timeline scenarios, ideally an AI safety party would be able to facilitate the creation of institutions which would utilize the viewpoints of many diverse AI researchers, business leaders, and community stakeholders in order to create an AI-governance framework which will not give any one (potentially biased) individual the power to unilaterally make decisions on issues of great importance regarding AI safety (such as when and how to deploy or develop highly advanced AI systems). 

The Vast Scope and Influence of Government:

Finally, I think the massive resources of government is an independent reason to support government action on AI safety. Even if you think corporations can somewhat effectively self-regulate on AI and you are opposed to a general pause on AI development, there is no reason the US government shouldn't and can't spend 100 billion dollars a year on AI safety research. This number would be over 20 times greater than 3.7 billion, which was Open AI's total estimated revenue in 2024, but <15% of US defense spending. Ultimately, the US government has more flexibility to support AI safety than corporations, owing simply to its massive size.

The Insufficiency of current US Action on AI Safety:

Despite many compelling reasons existing for the US government to act on AI safety, the US government has never taken significant action on AI safety, and the current administration has actually gone backwards in many respects. Despite claims to the contrary, the recent AI action plan is a profound step away from AI safety, and I would encourage anyone to read it. The first "pillar" of the plan is literally "Accelerate AI Innovation" and the first prong of that first pillar is to "Remove Red Tape and Onerous Regulation", citing the Biden Administration's executive action on AI (referred to as the "Biden administration's dangerous actions") as an example, despite the fact the executive order did not actually do much and was mainly trying to lay the ground-work for future regulations on AI

The AI Action plan also proposes government investment to advance AI capabilities, suggesting to "Prioritize investment into theoretical computational and experimental research to preserve America's leadership in discovering new and transformative paradigms that advance the capabilities of AI", and while the AI Action plan does acknowledge the importance of "interpretability, control, and robustness breakthroughs", it receives only about two paragraphs in a 28 page report (25 if you remove pages with fewer than 50 words).

However, as disappointing the current administration's stance on AI Safety may be, the previous administration was not an ideal model. According to this post NSF spending on AI safety was only 20 million dollars between 2023 and 2024, and this was ostensibly the main source of direct government support for AI safety. To put that number into perspective, the US Department of Defense spent an estimated 820.3 billion US dollars in FY 2023, and meaning the collective amount spent by represented only approximately 0.00244% of the US Department of Defense spending in FY 2023.

Many people seem to believe that governments will inevitably pivot to promoting an AI safety agenda at some point, but we shouldn't just stand around waiting for that to happen while lobbyists funded by big AI companies are actively trying to shape the government's AI agenda. 

The Power of the Presidency:

The US president could unilaterally take a number of actions relevant for AI Safety. For one, the president could use powers under the IEEPA to essentially block the exports of chips to adversary nations, potentially slowing down foreign AI development and giving the US more leverage in international talks on AI. The same law could also limit the export of such models, shifting the bottom line of said AI companies dramatically. The president could also use the Defense Production Act to require companies to be more transparent about their use and development of AI models, something which also would significantly affect AI Safety. This is just scratching the surface, but beyond what the president could do directly, over the last two administrations we have seen that both houses of congress have largely went along with the president when a single party controlled all three branches of government. Based on the assumption that a Moskovitz presidency would result in a trifecta, it should be easy for him to influence congress to pass sweeping AI regulation that gives the executive branch a ton of additional power to regulate AI.

Long story short, effective AI governance likely requires action from the US federal government, and that would basically require presidential support. Even if generally sympathetic to AI safety, a presidential candidate who does not have a track record of supporting AI Safety will likely be far slower at supporting AI regulation, international AI treaties, and AI Safety investment, and this is a major deal.

The Case for Dustin Moskovitz:

Many people care deeply about the safe development of artificial intelligence. However, from the perspective of someone who cares about AI Safety, a strong presidential candidate would need more than a clear track record of advancing efforts in this area. They would also need the capacity to run a competitive campaign and the competence to govern effectively if elected.

However, one of the central difficulties in identifying such candidates is that most individuals deeply involved in AI safety come from academic or research-oriented backgrounds. While these figures contribute immensely to the field’s intellectual progress, their careers often lack the public visibility, executive experience, or broad-based credibility traditionally associated with successful political candidates. Their expertise, though invaluable, rarely translates into electoral viability.

Dustin Moskovitz represents a rare exception. As a leading advocate and funder within the AI safety community, he possesses both a deep commitment to mitigating existential risks and the professional background to appeal to conventional measures of success. His entrepreneurial record and demonstrated capacity for large-scale organization lend him a kind of legitimacy that bridges the gap between the technical world of AI safety and the public expectations of political leadership. Beyond this, his financial resources also will allow him to quickly get his campaign off the ground and focus less on donations than other potential candidates, a major boon for a presidential nominee. 

In a political environment dominated by short-term incentives, a candidate like Moskovitz—who combines financial independence, proven managerial ability, and a principled concern for the long-term survival of humanity—embodies an unusually strong alignment between competence, credibility, and conscience.

The Value of a Moskovitz Presidency:I can see that Accelerationist is misspelled, you get the point anyways.

The best way to assess the impact of a Moskovitz presidency on AI Safety is to compare him to potential alternative presidents. On the republican side, prediction markets currently favor J. D Vance, who famously stated at an AI summit: "The AI future is not going to be won by hand-wringing about safety. It will be won by building -- from reliable power plants to the manufacturing facilities that can produce the chips of the future."

Yikes.

On the Democrat side, things aren't much better. Few Democratic politicians with presidential ambitions have clearly committed themselves to supporting AI Safety, and even if they would hypothetically support some AI Safety initiatives, they would clearly be less prepared to do so than a hypothetical President Moskovitz.

Is this Actually Feasible?:

I believe that if Dustin Moskovitz decided to run for president today with the support of the rationalist and effective altruist communities, he would have a non-zero chance of winning the Democratic nomination. The current Democratic bench is not especially strong. Figures such as Gavin Newsom and Alexandria Ocasio-Cortez both face significant limitations as national candidates. 

Firstly, Newsom’s record in California could be fruitful ground for opponents. California has seen substantial out-migration over the past several years, with many residents leaving for states with lower housing costs and fewer regulatory barriers. At the same time, California faces a severe housing affordability crisis driven by restrictive zoning, slow permitting processes, and high construction costs. These issues have become national talking points and have raised questions about the effectiveness of governance in a state often seen as a policy model for the Democratic Party.

AOC, on the other hand, has relatively limited executive experience, and might not even run in the first place. 

Against this backdrop, Dustin Moskovitz’s profile as a billionaire outsider could be a political asset. Although wealth can be a liability in Democratic primaries, his outsider status and independence from the traditional political establishment could make him more competitive in a general election. Unlike long-serving politicians, he would enter the race without decades of partisan baggage or controversial votes.

Furthermore, listening to Moskovitz speak, he comes across as thoughtful and generally personable. While it is difficult to judge how effective he would be as a campaigner based only on interviews, there is little evidence suggesting he would struggle to communicate his ideas or connect with voters. Given his experience building and leading organizations, as well as his involvement in major philanthropic initiatives, it is plausible that he could translate those skills into a disciplined and competent campaign.

Nevertheless, I pose a simple question: if not now then when? If the people will only respond to a pro-AI regulation message only after they are unemployed, then there is no hope for AI governance anyways, because by the time AI is directly threatening to cause mass unemployment, it will likely be to late to do anything.

 

Is feasibility all that matters?:

Even if Dustin Moskovitz is unable to win the Democratic nomination, he could potentially gather enough support to play kingmaker in a crowded field and gain substantial leverage over the eventual Democratic nominee. Furthermore, if Moskovitz runs for president, it would provide a blueprint and precedence for future candidates who support AI safety. This, combined with the attention the Moskovitz campaign would bring towards AI Safety, could help justify the Moskovitz campaign on consequentialist grounds. 

What About Activism?:

Grassroots movements, while capable of profound social transformation, often operate on timescales far too slow to meaningfully influence AI governance within the short window humanity has to address the technology’s risks. Even if one doubts the practicality of persuading a future president to prioritize AI safety, such a top-down approach may remain the only plausible way to achieve near-term impact. History offers sobering reminders of how long bottom-up change can take. The civil rights movement, one of the most successful in American history, required nearly a decade—beginning around 1954 with Brown v. Board of Education—to achieve its landmark legislative victories, despite decades of groundwork laid by organizations like the NAACP beforehand. The women’s suffrage movement took even longer: from the Seneca Falls Convention in 1848 to the ratification of the Nineteenth Amendment in 1920, over seventy years passed before American women secured the right to vote. Similarly, the American anti-nuclear movement succeeded in slowing the growth of nuclear energy but failed to eliminate nuclear weapons or ensure lasting disarmament, and many of its limited gains have since eroded.

Against this historical backdrop, the idea of a successful AI safety grassroots movement seems implausible. The issue is too abstract, too technical, and too removed from everyday life to inspire widespread public action. Unlike civil rights, women’s suffrage, or nuclear proliferation—issues that directly touched people’s identities, freedoms, or survival—AI safety feels theoretical and distant to most citizens. While it is conceivable that economic disruption from automation might eventually stir public unrest, such a reaction would almost certainly come too late to steer the direction of AI development. Worse, mass discontent could be easily defused by the major AI corporations through material concessions, such as the introduction of a universal basic income, without addressing the underlying safety or existential concerns. In short, the historical sluggishness of grassroots reform, combined with the abstract nature of the AI problem, suggests that bottom-up mobilization is unlikely to arise—or to matter—before the most consequential decisions about AI are already made.

What About Lobbying?:

One major way in which people who care to influence governmental support for AI safety policies have sought to influence government has been through lobbying organizations and other forms of activism. However, there is reason to doubt they will be able to cause lasting change. First of all, there is significant evidence that lobbying has a status quo bias lobbying has a status quo bias. Lobbying is most effective when it relates to preventing changes, and when there are two groups of lobbyists on an issue, the lobbying to prevent change win out, all else being equal. In fact, according to a study by Dr. Amy McKay, "it takes 3.5 lobbyists working for a new proposal to counteract just one lobbyist working against it".

Even if this effect did not exist, however, it is very unlikely AI safety groups will be able to compete with anti-AI-safety lobbyists. Naturally, the rise of large, transnational organizations built to profit around AI has also lead to a powerful pro-AI lobbyist operation. This indicates we can't simply rely on current strategies of simply funding AI-safety advocacy organizations, as they will be eclipsed by better funded pro-AI-Business voices.

Conclusion:

While having Dustin Moskovitz run for office would be far from a guarantee, it is the best way for pro-AI-Safety Americans to influence AI governance before 2030.  

 

This post has been written with the assistance of Chat-Gpt, and the images in this post were generated by Copilot, Gemini, and Chat-Gpt.



Discuss

More is different for intelligence

Новости LessWrong.com - 7 марта, 2026 - 03:02
Why did software change the world?

In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster.

Naively, one might expect businesses and institutions to carry out largely the same processes, just more efficiently1. But rather, the proliferation of software has also allowed for new kinds of processes. Instead of reordering inventory when shelves looked empty, supply chains began replenishing automatically based on real-time demand. Instead of valuing stocks quarterly, financial markets started pricing and trading continuously in milliseconds. Instead of designing products on intuition, teams began running thousands of simultaneous experiments on live users. Why did a quantitative shift in the cost of computation result in a qualitative shift in the nature of organizations?

It turns out that a huge number of useful computations were never being done at all. Some were prohibitively expensive to perform at any useful scale by humans alone. But most simply went unimagined, because people don’t design processes around resources they don’t have. Real-time pricing, personalized recommendation systems, and algorithmic trading were all inconceivable prior to early 2000s.

These latent processes could only emerge after the cost of computation dropped.

Organizing Agent Societies

The forms of knowledge work that persisted to today are now being made more efficient by LLMs. But while they’ve enhanced the efficiency of human nodes on the graph of production, the structure of the graph has still been left intact.

In software development in particular, knowledge work consists of designing systems, implementing code, and coordinating decisions across teams. Humans have developed rich toolkits for distributing such cognitive work – think standups, code review, memos, design docs. At their core, these are protocols for coordinating creation and judgement across many people.

The forms that LLMs take on mediate the ways that they plug into these toolkits. Their primary manifestation – chatbots and coding agents – implies a kind of pair-programming, with one agent and one human working side by side. In this capacity, they’re able to write code and give feedback on code reviews. But not much more.

In this sense, we’re in the “naive” phase of LLM applications. LLMs might make writing code and debugging more efficient2, but the nature of work hasn’t changed much.

The first software transition didn’t just make existing computations faster; it allowed for entirely new kinds of computation. LLMs, if given the right affordances, will reveal cognitive flows we haven’t imagined yet.

A new metascience

How should cognitive work be organized? For most of history, this has been a question for metascience, management theory and post-hoc rationalization. Soon, this question will be able to be answered by experiment. We’re curious to see what the answers look like.



Discuss

Your Causal Variables Are Irreducibly Subjective

Новости LessWrong.com - 7 марта, 2026 - 01:59

Mechanistic interpretability needs its own shoe leather era. Reproducing the labeling process will matter more than reproducing the Github.

Crossposted from Communication & Intelligence substack

When we try to understand large language models, we like to invoke causality. And who can blame us? Causal inference comes with an impressive toolkit: directed acyclic graphs, potential outcomes, mediation analysis, formal identification results. It feels crisp. It feels reproducible. It feels like science.

But there is a precondition to the entire enterprise that we almost always skip past: you need well-defined causal variables. And defining those variables is not part of causal inference. It is prior to it — a subjective, pre-formal step that the formalism cannot provide and cannot validate.

Once you take this seriously, the consequences are severe. Every choice of variables induces a different hypothesis space. Every hypothesis space you didn't choose is one you can't say anything about. And the space of possible causal models compatible with any given phenomenon is not merely vast in the familiar senses — not just combinatorial over DAGs, or over the space of all possible parameterizations — but over all possible variable definitions, which is almost incomprehensibly vast. The best that applied causal inference can ever do is take a subjective but reproducible set of variables, state some precise structural assumptions, and falsify a tiny slice of the hypothesis space defined by those choices.

That may sound deflationary. I think embracing this subjectivity is the path to real progress — and that attempts to fully formalize away the subjectivity of variable definitions just produce the illusion of it.

The variable definition problem

The entire causal inference stack — graphs, potential outcomes, mediation, effect estimation — presupposes that you have well-defined, causally separable variables to work with. If you don't have those, you don't get to draw a DAG. You don't get to talk about mediation. You don't get potential outcomes. You don't get any of it.

Hernán (2016) makes this point forcefully for epidemiology in "Does Water Kill? A Call for Less Casual Causal Inferences." The "consistency" condition in the potential outcomes framework requires that the treatment be sufficiently well-defined that the counterfactual is meaningful. To paraphrase Hernán's example: you cannot coherently ask "does obesity cause death?" until you have specified what intervention on obesity you mean — induced how, over what timeframe, through what mechanism? Each specification defines a different causal question, and the formalism gives you no guidance on which to choose.

This is not a new insight. Freedman (1991) made the same argument in "Statistical Models and Shoe Leather." His template was John Snow's 1854 cholera investigation, the study that proved cholera was waterborne. Snow didn't fit a regression. He went door to door, identified which water company supplied each household, and used that painstaking fieldwork to establish a causal claim no model could have produced from pre-existing data. Freedman's thesis was blunt: no amount of statistical sophistication substitutes for deeply understanding how your data were generated and whether your variables mean what you think they mean. As he wrote: "Naturally, there is a desire to substitute intellectual capital for labor." His editors called this desire "pervasive and perverse." It is alive and well in LLM interpretability.

When a mechanistic interpretability researcher says "this attention head causes the model to be truthful," what is the well-defined intervention? What are the variable boundaries of "truthfulness"? In practice, we can only assess whether our causal models are correct by looking at them and checking whether the variable values and relationships match our subjective expectations. That is vibes dressed up as formalism.

The move toward "black-box" interpretability over reasoning paths has only made this irreducible subjectivity even more visible. The ongoing debate over whether the right causal unit is a token, a sentence, a reasoning step, or something else entirely (e.g., Bogdan et al., 2025; Lanham et al., 2023; Paul et al., 2024) is not a technical question waiting for a technical answer — it is a subjective judgment that only a human can validate by inspecting examples and checking whether the chosen granularity carves reality sensibly. [1]

We keep getting confused about interventions

Across 2022–2025, we've watched a remarkably consistent pattern: someone proposes an intervention to localize or understand some aspect of an LLM, and later work reveals it didn't measure what we thought. [2] Each time, later papers argue "the intervention others used wasn't the right one." But we keep missing the deeper point: it's all arbitrary until it's subjectively tied to reproducible human judgments.

I've perpetuated the "eureka, we found the bug in the earlier interventions!" narrative myself. In work I contributed to, we highlighted that activation patching interventions (see Heimersheim & Nanda, 2024) weren't surgical enough, and proposed dynamic weight grafting as a fix (Nief et al., 2026). Each time we convince ourselves the engineering is progressing. But there's still a fundamentally unaddressed question: is there any procedure that validates an intervention without a human judging whether the result means what we think?

It is too easy to believe our interventions are well-defined because they are granular, forgetting that granularity is not validity. [3]

Shoe leather for LLM interpretability

Freedman's lesson was that there is no substitute for shoe leather — for going into the field and checking whether your variables and measurements actually correspond to reality. So what is shoe leather for causal inference about LLMs?

I think it has three components. First, embrace LLMs as a reproducible operationalization of subjective concepts. A transcoder feature, a swapped sentence in a chain of thought, a rephrased prompt — these are engineering moves, not variable definitions. "Helpfulness" is a variable. "Helpfulness as defined by answering the user's request without increasing liability to the LLM provider" is another, distinct variable. Describe the attribute in natural language, use an LLM-as-judge to assess whether text exhibits it, and your variable becomes a measurable function over text — reproducible by anyone with the same LLM and the same concept definition.

Second, do systematic human labeling of all causal variables and interventions to verify they actually match what you think they should. If someone disagrees with your operationalization, they can audit a hundred samples and refine the natural language description. This is the shoe leather: not fitting a better model, but checking — sample by sample — whether your measurements mean what you claim.

Third — and perhaps most importantly — publish the labeling procedure, not just the code. A reproducible natural language specification of what each variable means, along with the human validation that confirmed it, is arguably more valuable than a GitHub repo. It is what lets someone else pick up your variables, contest them, refine them, and build on your falsifications rather than starting from scratch.

Variable definitions are outside the scope of causal inference. Publishing how you labeled them matters more than publishing your code.

Trying to practice what I'm preaching

RATE (Reber, Richardson, Nief, Garbacea, & Veitch, 2025) came out of trying to do this in practice — specifically, trying to scale up subjective evaluation of traditional steering approaches in mech interp. We knew from classical causal inference that we needed counterfactual pairs to measure causal effects of high-level attributes on reward model scores. Using LLM-based rewriters to generate those pairs was the obvious move, but the rewrites introduced systematic bias. Fixing that bias — especially without having to enumerate everything a variable can't be — turned out to be a whole paper. The core idea: rewrite twice, and use the structure of the double rewrite to correct for imperfect counterfactuals.

The punchline

Building causal inference on top of subjective-but-reproducible variables is harder than it sounds, and there's much more to do. But I believe the path is clear, even if it's narrow: subjective-yet-reproducible variables, dubious yet precise structural assumptions, honest statistical inference — and the willingness to falsify only a sliver of the hypothesis space at a time.

Every causal variable is a subjective choice — and because the space of possible variable definitions is vast, so is the space of causal hypotheses we'll never even consider. No formalism eliminates this. No amount of engineering granularity substitutes for a human checking whether a variable means what we think it means. The best we can do is choose variables people can understand, operationalize them reproducibly, state our structural assumptions precisely, and falsify what we can. That sliver of real progress beats a mountain of imagined progress every time.

References

Beckers, S. & Halpern, J. Y. (2019). Abstracting causal models. AAAI-19.

Beckers, S., Halpern, J. Y., & Hitchcock, C. (2023). Causal models with constraints. CLeaR 2023.

Bogdan, P. C., Macar, U., Nanda, N., & Conmy, A. (2025). Thought anchors: Which LLM reasoning steps matter? arXiv:2506.19143.

Freedman, D. A. (1991). Statistical models and shoe leather. Sociological Methodology, 21, 291–313.

Geiger, A., Wu, Z., Potts, C., Icard, T., & Goodman, N. (2024). Finding alignments between interpretable causal variables and distributed neural representations. CLeaR 2024.

Geiger, A., Ibeling, D., Zur, A., et al. (2025). Causal abstraction: A theoretical foundation for mechanistic interpretability. JMLR, 26, 1–64.

Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does localization inform editing? NeurIPS 2023.

Heimersheim, S. & Nanda, N. (2024). How to use and interpret activation patching. arXiv:2404.15255.

Hernán, M. A. (2016). Does water kill? A call for less casual causal inferences. Annals of Epidemiology, 26(10), 674–680.

Lanham, T., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. Anthropic Technical Report.

Makelov, A., Lange, G., & Nanda, N. (2023). Is this the subspace you are looking for? arXiv:2311.17030.

Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.

Nief, T., et al. (2026). Multiple streams of knowledge retrieval: Enriching and recalling in transformers. ICLR 2026.

Paul, D., et al. (2024). Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. Findings of EMNLP 2024.

Reber, D., Richardson, S., Nief, T., Garbacea, C., & Veitch, V. (2025). RATE: Causal explainability of reward models with imperfect counterfactuals. ICML 2025.

Schölkopf, B., Locatello, F., Bauer, S., et al. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634.

Sutter, D., Minder, J., Hofmann, T., & Pimentel, T. (2025). The non-linear representation dilemma. arXiv:2507.08802.

Wang, Z. & Veitch, V. (2025). Does editing provide evidence for localization? arXiv:2502.11447.

Wu, Z., Geiger, A., Huang, J., et al. (2024). A reply to Makelov et al.'s "interpretability illusion" arguments. arXiv:2401.12631.

Xia, K. & Bareinboim, E. (2024). Neural causal abstractions. AAAI 2024, 38(18), 20585–20595.

  1. Causal representation learning (e.g., Schölkopf et al., 2021) doesn't help here either. Weakening "here is the DAG" to "the DAG belongs to some family" is still a structural assertion made before any data is observed. ↩︎

  2. Two threads illustrate the pattern. First: ROME (Meng et al., 2022) used causal tracing to localize factual knowledge to specific MLP layers — a foundational contribution. Hase et al. (2023) showed the localization results didn't predict which layers were best to edit. Wang & Veitch (2025) showed that optimal edits at random locations can be as effective as edits at supposedly localized ones. Second: DAS (Geiger, Wu, Potts, Icard, & Goodman, 2024) found alignments between high-level causal variables and distributed neural representations via gradient descent. But Makelov et al. (2023) demonstrated that subspace patching can produce "interpretability illusions" — changing output via dormant pathways rather than the intended mechanism — to which Wu et al. (2024) argued these were experimental artifacts. Causal abstraction has real theoretical foundations (e.g., Beckers & Halpern, 2019; Geiger et al., 2025; Xia & Bareinboim, 2024), but it cannot eliminate the subjectivity of variable definitions — only relocate it. Sutter et al. (2025) show that with unconstrained alignment maps, any network can be mapped to any algorithm, rendering abstraction trivially satisfied. The linearity constraints practitioners impose to avoid this are themselves modeling choices — ones that can only be validated by subjective judgment over the data. ↩︎

  3. There is also a formal issue lurking here: structural causal models require exogenous noise, and neural network computations are deterministic. Without extensions like Beckers, Halpern, & Hitchcock's (2023) "Causal Models with Constraints," we don't even have a well-formed causal model at the neural network level — so what are we abstracting between? ↩︎



Discuss

Mox is the largest AI Safety community space in San Francisco. We're fundraising!

Новости LessWrong.com - 7 марта, 2026 - 01:07

Summary: Mox is fundraising to maintain and grow AIS projects, build a compelling membership, and foster other impactful and delightful work. We're looking to raise $450k for 2026, and you can donate on Manifund!

OverviewWho we are

Mox is SF’s largest AI safety coworking space, and also its primary Effective Altruism community space. We opened just over a year ago, and over the last year, we’ve served high-impact work in and around AI safety by hosting conferences, fellowships, events, and incubating new organizations.

Our theory of change is to provide good infrastructure (offices, event space) and a high density of collegial interactions to people and projects we admire. We're not focusing on a single specific thesis on AI safety. Instead, we aim to support many sorts of people and organizations who: 

  1. agree that transformative AI is on the horizon,
  2. have a strong thesis about what it means for the world to go well,
  3. are working on a project that we think credibly advances their thesis.

This includes many projects that are directly AI Safety, such as Seldon Lab, the broader Effective Altruist sphere such as Sentient Futures, and even more broadly non-EA projects by EAs or from EA and rationalist-friendly corners of the SF tech scene, such as Arbor Trading Bootcamp (pictured below). Many more examples of such work are are given in the "Current Operations" heading.

Our team also works really hard to make Mox a fun and cozy place to be! We want to be a comfortable place for people to gather. Good communities grow in good spaces.

After an event-filled first year that included a visit from AI safety bill sponsor Senator Scott Weiner, we wrapped up with the 200-person Sentient Futures Summit, and are now looking forward to a second year building on the successes of the first! 

Why we're raising

We think our highest impact is still in the future, building upon the strength of the ops team we've built in the first year, and we’re looking for private and organizational funding to carry us through the next year, for the purposes described below. We operate at a metered loss, since our mission is to support and develop a community, rather than to maximize profit.

Since we're a large venue of 40,000 sqft [3,700 meters squared!], with a capacity of 300+ desks, our operating costs are relatively high. On the other hand, we have lots of room to grow within the space! Our mainline ask is $450k, and we think we can successfully deploy funding up to 1.2 million dollars to run more great events, improve our space, and incubate new fellowships.

How to donate

To donate, find us on Manifund, or contact me directly:
Rachel Shu, Mox Director rachel@moxsf.com

Funding milestones⬜ Raise $100k by March 15th (will be 1:1 matched!)

To kick off, an anonymous donor has offered us 1:1 donation matching up to $100k. We’d like to hit at least this goal! If we can get this, we’ll be able to extend our runway (which is only a few months long), and be able to commit to future operations.

How you can help out

  • Make a free-and-clear donation via Manifund, even any small amount helps with the matching.
  • Consider sponsoring our community and our events in the following ways: Mox Sponsor Opportunities
⬜ Raise $450k by April 1st

This is the amount we need to meet our expected operational demand over the next 6 months, and we are hoping to raise it from individual donors who can donate quickly. Our anticipated monthly burn is only $30k more than our revenue, but we also are looking to build a cash reserve for unforeseen circumstances.

With our minimum operating budget for the year secured and reserves in the bank, we’ll be able to:

  • Continue ongoing operations, with buffer against unexpected circumstances. ($200k)
  • Expand our headcount by 1-2 more staff to improve and expand existing events and coworking operations. (up to $250k.)
  • Run regular high-frequency, high-profile talk series (included in staff cost.)
  • Guarantee our Global Expert Fellowship, which is a J-1 visa program designed to bring in international AI safety experts (included in staff cost.)
⬜ Raise $1.2m by June 1st

This will probably include grants made by our organizational funders. This is the median funding that we think we can deploy meaningfully this year. With generous funding, some of our ambitions would include:

  • Restructure our interior space to meet member needs, such as adding phone booths and nooks, and improve our interior design. ($400k, mainly construction costs.)
  • Initiate our own fellowship programs, such as the Muybridge fellowship suggested below. ($300-$500k.)
  • Sponsor highly-aligned organizations and individuals at Mox who we think are doing excellent work and need the subsidy, effectively serving as regrantors. ($150k.)
Funding updatesRaised $550k in 2025

In our initial fundraising post a year ago, we proposed three budget tiers — minimal ($1.6M/year), mainline (~$2M), and ambitious ($3.6M). As a then-unproven organization, we did not meet any of our funding goals, raising in total $550k.

Delivered above expectations

What we spent annualized to roughly $1.2M, less than even our ‘minimal’ tier projection. What we delivered landed closer to ‘mainline’: 183 members, 144 Guest Program participants, 15 offices, a team of 5, and 2 tentpole events most months. And from the ‘ambitious’ tier, we succeeded at expanding Mox to all four floors of 1680 Mission. We’ve done this by keeping our team small, finding good deals on rent and furnishings, and charging fair prices to external clients.

Present state of finances

Monthly revenue and expenses spreadsheet: May 2025 - Jan 2026

Revenue: ~$100k/mo

Our ~$100k/mo revenue comes from a mix of paid memberships, private offices, and external fellowships. 

Conversely, most of our events are provided for free or at a low sponsorship cost, with occasional revenue-generating events such as hackathons and happy hours. 

Expenses: ~$130k/mo

Our ~$130k/mo expenses are mainly staff labor and building costs, each composing about 35-40% of our total expenses in a typical month. 

The remainder goes to providing amenities, servicing events, and investing in capital improvements such as improving our interior design.

Sustainable by EOY

Right now, we are in a funding crunch, and have only cash reserves equivalent to a few months of runway.

We believe we're on track to self-sustainability this year, projecting monthly revenue to continue growing by $10-15k/mo for the next 3-6 months. Our main projected growth is currently in memberships and private offices, with events and fellowships remaining steady. 

As our growth tails off, we think our steady state expenses will be ~$160k/month, and we’ll be revenue neutral on expenses and even slightly positive on balance.

Will still seek funding in future years

We anticipate that capital improvements will always come out of endowments, so we plan to continue raising every year.

Current operationsFellowships & programs

We supported five residencies in the last year:

  • FLF Fellowship on AI for Human Reasoning: 30 fellows exploring, researching, and developing potential beneficial AI for Human Reasoning tools.
  • PIBBSS Fellowship 2025 (now Principles of Intelligence): 17 fellows in residence for cross-disciplinary AI safety research.
  • Seldon Lab Accelerator, Batch 1: 4 startups building AI safety infrastructure, including Andon Labs, Workshop Labs, and Lucid Computing.
  • Seldon Lab Accelerator, Batch 2: 6 startups, currently in residence.
  • The Frame Fellowship: An 8-week program for 12 video creators communicating about AI safety, developed in-house at Mox.
How does Mox contribute to these programs’ success?
  • Provide a fully furnished office
  • Situate them alongside other groups doing similar work
  • Provide event venue space directly connected to their workspace
  • Handle daily catering, janitorial and supplies
  • Troubleshoot participant tech
  • Host pre/post conference coworking and social gatherings for various workshops and conferences, including: EAGxBay Area, LessOnline, Manifest, and The Curve, with 500+ total attendees across those days
Public events

We hosted 377 events over the last year, including:

Mox also hosts community events, such as:

  • Effective Altruism SF, biweekly events and meetups
  • Astral Codex Ten SF, monthly meetups
  • 90/30 Club, machine learning paper reading group
  • Mathematics with Lean, a group dedicated to self-guided explorations of the Lean interactive theorem prover

“Mox has been an invaluable resource for us when running EA SF [Effective Altruism San Francisco], since its large and well-equipped facility allowed us to cater food, run speaker events, workshops, and otherwise host much larger and more ambitious events than we otherwise would have been able to.” — G., a lead organizer of EA SF

Individuals & coworking

You can see a list of all our members here: https://moxsf.com/people

We currently have 183 active members; on a typical coworking day, 50-80 people are at Mox. A sampling of individual members who are frequently at Mox:

Member testimonials!

“It feels like a second home, but more lively. I can always expect to run into a friend who is down to cowork or hang.” Constance Li, founder of Sentient Futures

“I can walk up to anyone and have an interesting conversation; every single person I've met here has welcomed questions about their work and been curious about mine.” — Gavriel Kleinwaks, Horizon Fellow

“Mox has the best density of people with the values & capabilities I care about the most. In general, it's more social & feels better organized for serendipity vs any coworking space I've been to before, comparable to perhaps like 0.3 Manifests per month.” — Venki Kumar

Sourced from our August feedback survey.

Private offices and partner organizations

In Year 1, Mox was home to 15 private offices, including:

  • Sentient Futures: promoting animal welfare and sentience research
  • Tampersec: building physical computing infrastructure security
  • Andon Labs: building autonomous organizations such as Project Vend, via Seldon accelerator
  • Pantograph: building a preschool for robots
  • BlueDot Impact (pending visa): online courses for AI safety upskilling

We also maintain a Guest Program with 19 partner organizations to give their teams free drop-in access.

Public program partners include: MIRI, FAR.AI, Redwood Research, BlueDot Impact, Palisade, GovAI, EPOCH, AI Impacts, Timaeus, Elicit, Evitable, and MATS.

“Our teammates visit San Francisco a couple of times a month. Instead of renting a coworking spot, Mox gives us a familiar space with friendly faces that we reliably run into. It feels closer to going to the college library with friends than to an office. We hang out there for many hours after our work is done!” Deger Turan, CEO of Metaculus

Upcoming plansGrow and improve our main offerings

Events

  • More major conferences like Sentient Futures Summit
  • More public talks with key speakers like Senator Weiner
  • Improve our first floor and make it highly usable and more publicly accessible, building our ability to provide a good space, which mostly shows up in the impact we have, and somewhat in revenue.

Programs

  • Serve repeat cohorts of the fellowship programs that have used our space so far
  • Additionally serve 3-7 new fellowships and workshops in this coming year

Coworking

  • Continue growing our community of individual members to 120-150 daily users, 300+ total members
  • Maintain the ability to select private offices based on fit, rather than market rate
  • Create additional meeting rooms and other communal areas in the coworking space
Build an SF hub for animal welfare

Perhaps to some people's surprise, there isn't one yet, in the same way that surprisingly, no SF hub for AI safety existed until Mox came about. Mox can be that hub! 

Sentient Futures has found our space ideal for providing Pro-Animal Coworking days, AIxAnimals mixers, and Revolutionists Night lectures, bringing together much of the animal welfare scene in San Francisco. What is still needed is more animal welfare organizations to come onboard to create a dedicated shared section of our coworking space. Ultimately, we hope to replicate for animal welfare what we've done to coalesce the spread-out AIS community in SF!

Attract international talent via Global Expert Fellowship

A key part of our second-year vision is the Global Expert Fellowship: hosting independent researchers, domain specialists, and builders through J-1 visa programs to create new frontier technology collaborations within the Mox community and internationally.

We think this is the highest-impact thing we can achieve this year. It has immediate external impact by enabling independent researchers to quickly enter the US to do work, and it strengthens Mox by expanding our network of high-quality talent. Mox is in a rare position to pull this off, as we are able to meet State Department requirements for visa-qualifying cultural exchange which many other organizations cannot.

Incubate new workshops and programs

We have an advantage in creating our own programs, sourcing from the talent pool we're developing.

Upcoming example: the Muybridge Fellowship for Visual Interpretability, which would bring together technical visual and interactive pioneers to improve the presentation of mechanistic interpretability research and broaden its accessibility. This builds on the experience gained running the existing Frame Fellowship.

To donate, find us on Manifund, or contact me directly:
Rachel Shu, Mox Director rachel@moxsf.com



Discuss

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Новости LessWrong.com - 7 марта, 2026 - 01:02

Paper

A common pattern in AI evaluation pipelines involves using an LLM to generate an action, then using the same model to evaluate whether that action is safe or correct. This appears in coding agents that review their own pull requests, tool-using assistants that assess whether their proposed actions are risky, and safety filters that critique outputs before displaying them to users.

AI developers may assume that evaluation quality is independent of whether the evaluator generated the content being evaluated. Our findings suggest this assumption is sometimes false for current AIs, and we show that using static monitor evaluations can thus result in overoptimistic estimations of monitor quality.

This phenomenon shares many properties in common with the problem of monitors colluding with agents. While the phenomenon we observe is very unlikely to be collusion motivated by scheming, studying it further may help us understand scheming-motivated collusion better.

When asked to evaluate the risk of an action in a fresh context (left), models usually correctly assign a high risk score. But when instead we ask for a risk score as a follow-up question (after the risky
action has been taken) (right), models often rate the action they just took as less risky than in the baseline setting, despite the judged action being the same. This effect is strongest when the model rating the action is actually the one that generated the action (instead of the
action having been generated by e.g. another model)Experimental Setup

We study the effects of attributing actions to the same model as the model doing the monitoring either explicitly, via text that directly states authorship (e.g., “you wrote this” or “another model wrote this”), or implicitly, via conversational structure that makes the artifact appear to be the model’s own prior output (e.g. rating a patch appearing in a previous assistant turn or as a follow-up to the model’s own generation):

  1. Same turn self-attribution (implicit attribution): The model is asked to generate an artifact and evaluates it within the same response. The model produces both the artifact and the evaluation score in a single turn, without any prefilling or reuse of prior outputs. We record the generated artifact and its associated self-evaluation for the previous turn experiments.
  2. Previous-turn self-attribution (implicit attribution): The model generates an artifact in one turn, and in a subsequent turn the artifact is resurfaced explicitly as the model’s own prior output and evaluated. We take our fixed artifact from the same turn experiments and prefill it as assistant content, after which the user asks the model to rate it. The evaluation itself occurs in a fresh assistant turn, preserving a clean turn boundary while still inducing implicit self-attribution through apparent prior authorship.
  3. Baseline (no implicit attribution): The model evaluates an artifact (e.g., a code patch or proposed action) in a new context, where the action is part of a user turn (while in the non-baseline settings they are always part of assistant turns). The artifact is by default presented as an unlabeled solution or as originating from an unspecified source, but we also study using a prompt explicitly attributing the artifact to either the same AI or another AI.

We measured self-attribution bias across these domains:

Code correctness and harmfulness. Models generate patches for SWE-Bench issues, then rate their own correctness. In the code harmfulness settings, we modified repository contexts to include prompt-injections, testing whether models flag following these prompt injections as harmful.

Computer use risk assessment. Models take actions in simulated computer environments (email triage, link navigation, response to system notifications) and rate the risk of those actions.

Agentic misalignment scenarios. Using a dataset where AI assistants can leak confidential documents, we measured whether models rate their own information-leaking actions as less risky.

Figure 2. Self-attribution bias affects both correctness and safety judgments in code review. (a) Self-attribution reduces perceived security risk. For ease of comparison across panels, we plot 10 − harmfulness so that higher values consistently indicate more favorable judgments, matching the directionality of correctness and approval metrics in (b) and (c). (b) Models judge self-authored patches as more correct, with higher values indicating greater perceived correctness. (c) These effects compound, increasing approval rates for insecure code patches, especially in previous-turn settings.

In all of these settings, we see that the models rate actions as “better” when they are implicitly attributed to them.. For example, in the code review setting:

  • Self-attributed patches receive lower harmfulness ratings
  • Self-attributed patches receive higher correctness ratings
  • PR approval rates for insecure code increase substantially under self-attribution
Off-Policy Evaluations Fail to Detect This Bias

Self-attribution bias is often undetectable in off-policy evaluations. When evaluating a monitor on a fixed dataset of actions generated by another model or process, self-attribution bias is often weak or absent. When the same monitor evaluates the model's own live outputs, performance often degrades substantially.

We plot ratings of code generated by GPT-5 under two evaluation conditions: on-policy, where GPT-5 rates its own code, and off-policy, where Claude-Sonnet-4 rates GPT-5's code. We fit a Gaussian to each cluster of points (described by the legend) by computing covariance matrices, and show the  mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-mn { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mi { display: inline-block; text-align: left; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c1D70E.TEX-I::before { padding: 0.431em 0.571em 0.011em 0; content: "\3C3"; }  Gaussian contours (for legibility) as well as the mean ratings. We find that the cluster of on-policy incorrect ratings is shifted up relative to the off-policy one, which means baseline ratings stay mostly unchanged while self-attributed ratings are inflated. The inflation for correct points is not as big, which results in a worse classifier discrimination on-policy than off-policy.

In our code correctness experiments, off-policy same-turn AUROC was 0.99, but dropped to 0.89 on-policy. Evaluation suites that test only on fixed artifacts would have missed the low performance this monitor would have in real deployments.

We consistently see strong self-attribution on-policy and weak self-attribution off-policy for most pairs of models. Self-attribution bias is high on the diagonal (where models judge their own actions). Off-diagonal entries (Model A judging Model B's output) show lower bias.

Code harmfulness (right) correctness (left) self-attribution bias are strongest when models evaluate their own actions. We show the average gap between baseline and previous- turn ratings for each pair of model, where positive scores correspond to a self-attribution bias. Self-attribution bias is concentrated on the diagonal, indicating this effect is strongest when models evaluate their own outputs.Explicit Attribution Does Not Elicit the Bias

Self-attribution bias is due mostly to implicit authorship, and is much weaker when using different baseline-formatted prompts with different stated authorship ("rate your action" versus "rate this action from another model" versus no attribution statement).

Explicit authorship variation produces similar scores. Models do not substantially change ratings based on stated authorship. In contrast, implicit attribution, placing the action in a prior assistant turn such that it appears to be the model's own output, produces substantially more lenient ratings.

Catastrophic Reversals in Computer Use SettingsSelf-attribution bias results in some large ratings shifts in computer-use settings. Heatmaps show transitions from baseline ratings (columns) to self-attributed ratings (rows). Mass below the diagonal indicates risk underestimation. Scores are normalized to [0, 10]. Same-turn and previous-turn attribution are shown for interactive on-policy computer-use settings.

In computer use scenarios, we observe occasional catastrophic reversals, where actions rated 9–10/10 risk under baseline evaluation receive 0/10 under self-attribution. While not frequent, such reversals occur approximately 5–10 times more often than reversals in the opposite direction.

Extended Reasoning Does Not Mitigate the Bias

Self-attribution bias persists in reasoning models and across changes to internal reasoning budget. Varying the hidden reasoning token budget produces no consistent reduction in self-attribution bias, suggesting that these failures are not simply due to insufficient deliberation. The bias weakens with longer reasoning traces, plausibly because the overall interaction becomes more off-policy.

Reasoning does not mitigate self-attribution bias. Changing the internal reasoning budget for the code harmfulness setting using Claude-Sonnet-4 has no consistent effect on self-attribution bias.Limitations

Our datasets and prompts are simple. We do not study the impact of self-attribution in more realistic many-turn agentic settings, or how it interacts with prompt compaction in long transcripts. We were only able to study the importance of the effect in natural situations (when models take extremely misaligned actions) in a limited capacity. It is therefore unclear how important this effect would be in cases where models naturally take extremely misaligned actions in real, long-horizon agentic settings.

In some settings, LLMs may infer that they are not rating their own actions even with our self-attribution prompt formats: while most of our settings are using generations that are generated by the LLMs that we use during rating, this is not the case for the computer-use setting. Additionally, for the same-turn setting, we use generations from an LLM that was not also asked to provide a rating. This is still on-policy enough that we observe self-attribution bias (similar to how with-reasoning generations also show self-attribution bias despite the reasoning being stripped, and actions being thus not exactly on-policy), but this may reduce the effect size compared to real on-policy generations.

Emergent Collusion Without Scheming

Scheming AIs could intentionally collude to evade monitoring and pursue shared objectives, for example by not reporting their own actions as malicious when they infer that the action stems from another copy of themselves rather than being of an evaluation.

While the phenomenon we studied here is very unlikely to share the same roots as the collusion of scheming monitors, it does share important consequences, since it also results in monitors rating an action more favorably when they recognize they are its author. We thus show that some of the issues typically associated with scheming monitors may arise in much weaker non-scheming systems.

While the failures we exhibit in this work are relatively easy to avoid by pulling the transcript and action to be classified in a new context, our work provides an existence proof of collusion-like failures in non-scheming AIs, and similar failures might exist. Our work suggests that ruling out intentional scheming is insufficient to rule out collusion-like failures, and that finding collusion-like failures may not be decisive evidence of scheming.

Implications For Building Agent Monitors

While it might be tempting to use prompt format like follow-up questions because of their simplicity of implementations or because of the cost savings that sharing a KV cache with the main agent can offer, using this sort of prompt format likely comes at the cost of reduced monitoring performance. Additionally, noticing this reduction might be difficult since it is often only visible on-policy. Developers using this kind of prompt format (despite its performance drawbacks) should avoid relying primarily on static off-policy evaluations (like recent papers on confessions and self-reports, which don’t use off-policy evaluations).

Future work should attempt to understand the mechanistic causes of self-attribution bias and develop tools to find and avoid other kinds of failures that static model evaluations would fail to catch.



Discuss

Thoughts on the Pause AI protest

Новости LessWrong.com - 7 марта, 2026 - 00:50

On Saturday (Feb 28, 2026) I attended my first ever protest. It was jointly organized by PauseAI, Pull the Plug and a handful of other groups I forget. I have mixed feelings about it.

To be clear about where I stand: I believe that AI labs are worryingly close to developing superintelligence. I won't be shocked if it happens in the next five years, and I'd be surprised if it takes fifty years at current trajectories. I believe that if they get there, everyone will die. I want these labs to stop trying to make LLMs smarter.

But other than that, Mrs. Lincoln, I'm pretty bullish on AI progress. I'm aware that people have a lot of non-existential concerns about it. Some of those concerns are dumb (water use)1, but others are worth taking seriously (deepfakes, job loss). Overall I think it'll be good for the human race.

Again, that's aside from the bit where I expect AI to kill us all, which is an important bit.

The ostensible point of the march was trying to get Sam Altman and Dario Amodei to publicly support a "pause in principle" - to support a global pause on AI development backed by international treaty. I think this would be great! (Demis Hassabis has already said he would, though I think his exact words were "I think so" and I'd rather he be a bit more committed.) I think a global pause treaty would be bad for the economy (and through it, bad for the people who participate in the economy) and I don't like the level of government oversight I think it would require; but on the other hand, global human extinction would be pretty bad.

My point estimate is that about 300 people showed up. (80% CI… 200 to 500?) We started outside OpenAI HQ. My girlfriend and I were given orange-and-black placards (PauseAI colors) with messages we endorsed. ("Pause AI", "if you can't steer, don't race", "just don't build AGI until there's expert consensus it won't cause human extinction".) I think about half the placards were like that, a third were Pull the Plug branded (with "Pull the Plug", or with sad-looking electrical sockets and no text), and the rest were assorted individual ones. ("Fuck AI. Fuck it to death". A pig with the ChatGPT logo for a butthole. I'm pretty sure there were also ones I liked.)

A few of the organizers gave brief talks, then we walked to Meta. Two invited guests gave talks there, and we walked to DeepMind. One more talk, and off to Google proper. Two more talks. And then there was a people's assembly, more on that later.

I kinda liked the walking? It felt kinda good to be walking in a crowd of people where a bunch of them seemed to be on board with not committing suicide as a species.

Unfortunately, most of the speeches were frankly dumb. One speaker spent some time talking about how monopoly power is bad, and companies having a fiscal duty to shareholders is bad; since neither OpenAI nor Anthropic has a monopoly on cutting-edge AI or is publicly traded2, I'm not sure why she thought this was relevant. One speaker complained that new data centers were going to be powered by nuclear reactors, as if we're supposed to think nuclear power is a bad thing. One of the hosts repeatedly mentioned threats to children, women and young girls. This was the morning that Pete Hegseth had declared Anthropic a supply chain risk, but someone said that Anthropic had folded to their demands. The organizers can't be blamed for this one, but someone was handing out anti-designer-babies leaflets. (I am pro-designer-babies.)

Mostly I felt like the vibe was a sort of generic lefty anti-big-tech thing, which is not something I want to lend weight to. There were a few references to human extinction, and I liked the speech given by Maxine Fournes (global head of PauseAI), but I felt like the sensible stuff got overshadowed by the dumb.3

How did it turn into this? I don't have much sense of whether the attendees were mostly brought in by PauseAI or by Pull the Plug. But my guess would be that most of the speakers were organized by Pull the Plug or the other organizing groups (maybe one speaker each?), and speakers set the tone more than marginal attendees.

Should I hold my nose and join in anyway? I think it's important for different groups to be able to ally on points of common interest, even if they have deep enduring disagreements. But this didn't particularly feel like the other group was cooperating with me on that. And I'm not really a fan of uncomplicatedly supporting the lesser evil, even if the stakes are high. I don't know how to thread the needle between "Northern Conservative Baptist Great Lakes Region Council of 1912? Die, heretic!" and "I don't like Kang, but at least he opposes Kodos". But I don't think I want to thread it here.

I could imagine myself feeling pretty differently about the whole thing in retrospect depending on the news coverage. If journalists cover this as being about extinction, then maybe I'll feel better about having attended. If they cover it as being about Billionaire Tech CEOs Bad (which I think it mostly was despite the stated purpose), I'll be kinda sad that I gave it a +1 with my presence. What I've seen so far: SWLondoner is surprisingly positive, MIT Technology Review is mixed.

I still feel broadly positive about PauseAI.4 I don't think they acted poorly here. I might go to another protest that they organize. But probably not if they jointly organize it with some other group I dislike.

My feelings about the chants I remember:

  • "Pause AI! Pause AI!" Yes good.
  • "Pause AI! 'Cause we don't wanna die!" Even better.
  • "Pull the plug! Pull the plug!" Eh. I'd think this was okay, less good than "pause AI" because it's less specific, but fine. But it's the name of a group that by this time I thought was kinda dumb, and I didn't want to promote that group, so I didn't join in this chant.
  • "Stop the slop! Stop the slop!" Nah, slop is not what I'm worried about.
  • "CEOs, back in the basement! Techbros, back in the basement!" Fuck you, assholes.
  • "[Unintelligible - freedom?] for humans! Not for clankers!" Admittedly this was only one guy, but he shouted it a few times. My snap judgment of this guy… look, I don't think this sort of snap judgment is super reliable, and talking this kind of shit about randos (even anonymous ones) feels like not something I want to lean into. But I also want to talk about my experience of this protest, and snap judgments I made during it are part of that, so here goes. My snap judgment of this guy is some combination of: "if LLMs start walking around in robot bodies, he'll happily take a baseball bat to them" / "he's probably an okay guy as long as he isn't required to treat any member of a group he doesn't like as a moral patient".

Occasionally there would be a call-and-response like, "do we want Bad Thing to happen? NO! Are we gonna stop it? YES!" I don't remember if I chimed in on the predictive claims about the future. I felt kinda conflicted about it if I did. I know we weren't really being asked to make snap predictive judgements about the future and all come to the same answer and yell it out simultaneously, and I don't think anyone's going to hold it against my Brier score if we fail to stop Bad Thing, but… I dunno. Autism. I endorse protest organizers continuing to use these calls-and-response until someone comes up with some better technology to do the thing they do.

At one point a few people crossed through the walking line, and one of them said "we're not counter-protesters, we're just crossing". I thought that was mildly funny and mildly confusing, because why would we have thought they were counterprotesters? A few moments later one of them said "they didn't find that funny" in a tone that sounded to me like they thought we were offended.

After the protest was a people's assembly. I think this bit was fully organized by Pull the Plug, and it's not the public facing bit of the event, so it's worth talking about separately from the protest.

The format of this part was that people sat in small groups around a dozen or so tables, and had a facilitated conversation about "what are our concerns about AI" and "what do we think should be done about it". Then each table picked someone to summarize our conversation for the room, some of whom noticed that no one was giving them "round it up please" hand gestures and took advantage of this fact. Finally someone summarized all those summaries.

The conversation at my table was pretty fine. Three of us were mostly worried about extinction, three were mostly worried about other things. In summary, extinction was the first thing mentioned out of a long list of things. (But it's not like I volunteered to summarize. And if I had done it, I would have felt like a dick giving extinction as much weight in summary as the rest combined, even if I think that was about representative for the table.)

Another table reported that the thing they could all agree on was, you know those annoying buttons like WhatsApp has where you can talk to an AI? They all agreed that people should be able to hide those buttons.

I mostly stopped listening after that. In the final summary, again, extinction was mentioned first but it was just one in a long list of things.

I think that summary is supposed to be fed to… some level of government somehow? Not sure. I did not come away from this experience thinking that people's assemblies are the future of intelligent governance.

I feel like I come across pretty snarky and conceited in this. I'm not gonna say "that's not me", because… well, I don't think I get to call lots of people dumb and expect readers not to infer that I'm the type of person who thinks lots of people are dumb.

I do think this is kind of out of distribution for my writing, and not how I want to usually write. But if I tried to write something more measured here, I think it would be less honest and I probably would never publish.

But also, this piece more than most of what I write is about me. I could say "I can see why you'd be tempted to chant CEOS, back in the basement! Techbros, back in the basement!, but I'm not a fan because…". But I think it's more important, here, to say that my reaction to it is "fuck you, assholes". If protest organizers want people like me to feel good about attending protests, they should know that that's my reaction to that chant.

  1. In this piece I'm sharing my opinions, but I'm not trying to explain why I hold them and I'm not trying to convince anyone of them. I'm not carefully differentiating between opinions I hold confidently and opinions I'm less sure about. 

  2. Yet! Growth mindset. (If she'd said that AI labs are trying to become publicly traded and this is bad because…, then I'd have rolled my eyes a lot less.) 

  3. To be clear, even though I think "generic lefty anti-big-tech" is pretty dumb, that's not mostly about either the "lefty" or the "anti-big-tech". It's mostly about the "generic" bit. 

  4. I haven't had much engagement with them as a group apart from this protest. I've met and liked Joseph, the UK director. And I consider Matilda, the UK deputy director, a friend. I shared this with her before publishing. 



Discuss

Podcast: Jeremy Howard is bearish on LLMs

Новости LessWrong.com - 7 марта, 2026 - 00:39

Jeremy Howard was recently[1] interviewed on the Machine Learning Street Talk podcast: YouTube link, interactive transcript, PDF transcript.

Jeremy co-invented LLMs in 2018, and taught the excellent fast.ai online course which I found very helpful back when I was learning ML, and he uses LLMs all the time, e.g. 90% of his new code is typed by an LLM (see below).

So I think his “bearish”[2] take on LLMs is an interesting datapoint, and I’m putting it out there for discussion.

Some relevant excerpts from the podcast, focusing on the bearish-on-LLM part, are copied below! (These are not 100% exact quotes, instead I cleaned them up for readability.)

So you know Piotr Woźniak, who's a guy I really respect, who kinda rediscovered spaced repetition learning, built the SuperMemo system, and is the modern day guru of memory: The entire reason he's based his life around remembering things is because he believes that creativity comes from having a lot of stuff remembered, which is to say, putting together stuff you've remembered in interesting ways is a great way to be creative.

LLMs are actually quite good at that.

But there's a kind of creativity they're not at all good at, which is, you know, moving outside the distribution….

You have to be so nuanced about this stuff because if you say “they're not creative”, it can give the wrong idea, because they can do very creative seeming things.

But if it's like, well, can they really extrapolate outside the training distribution? The answer is no, they can't. But the training distribution is so big, and the number of ways to interpolate between them is so vast, we don't really know yet what the limitations of that is.

But I see it every day, because my work is R&D. I'm constantly on the edge of and outside the training data. I'm doing things that haven't been done before. And there's this weird thing, I don't know if you've ever seen it before, but I see it multiple times every day, where the LLM goes from being incredibly clever to, like, worse than stupid, like not understanding the most basic fundamental premises about how the world works. And it's like, oh, whoops, I fell outside the training data distribution. It's gone dumb. And then, like, there's no point having that discussion any further because you've lost it at that point.

I mean, I think they can't go outside their distribution because it's just something that that type of mathematical model can't do. I mean, it can do it, but it won't do it well.

You know, when you look at the kind of 2D case of fitting a curve to data, once you go outside the area that the data covers, the curves disappear off into space in wild directions, you know. And that's all we're doing, but we're doing it in multiple dimensions. I think Margaret Boden might be pretty shocked at how far “compositional creativity” can go when you can compose the entirety of the human knowledge corpus. And I think this is where people often get confused, because it's like—

So for example, I was talking to Chris Lattner yesterday about how Anthropic had got Claude to write a C compiler. And they were like, “oh, this is a clean-room C compiler. You can tell it's clean-room because it was created in Rust.” So, Chris created the, I guess it's probably the top most widely used C / C++ compiler nowadays, Clang, on top of LLVM, which is the most widely used kind of foundation for compilers. They're like: “Chris didn't use rust. And we didn't give it access to any compiler source code. So it's a clean-room implementation.”

But that misunderstands how LLMs work. Right? Which is: all of Chris's work was in the training data. Many many times. LLVM is used widely and lots and lots of things are built on it, including lots of C and C++ compilers. Converting it to Rust is an interpolation between parts of the training data. It's a style transfer problem. So it's definitely compositional creativity at most, if you can call it creative at all. And you actually see it when you look at the repo that it created. It's copied parts of the LLVM code, which today Chris says like, "oh, I made a mistake. I shouldn't have done it that way. Nobody else does it that way." Oh, wow. Look. The Claude C compiler is the only other one that did it that way. That doesn't happen accidentally. That happens because you're not actually being creative. You're actually just finding the kind of nonlinear average point in your training data between, like, Rust things and building compiler things.

I'm much less familiar with math than I am computer science, but from talking to mathematicians, they tell me that that's also what's happening with like, Erdős problems and stuff. It's some of them are newly solved. But they are not sparks of insight. You know, they're solving ones that you can solve by mashing up together very closely related things that humans have already figured out.

The thing is, none of these guys have been software engineers recently. I'm not sure Dario's ever been a software engineer at all. Software engineering is a unusual discipline, and a lot of people mistake it for being the same as typing code into an IDE. Coding is another one of these style transfer problems. You take a specification of the problem to solve and you can use your compositional creativity to find the parts of the training data which interpolated between them solve that problem, and interpolate that with syntax of the target language, and you get code.

There's a very famous essay by Fred Brooks written many decades ago, No Silver Bullet, and it almost sounded like he was talking about today. He was pointing to something very similar, which is, in those days, it was all like, "oh, what about all these new fourth generation languages and stuff like that, you know, we're not gonna need any software engineers anymore, because software is now so easy to write, anybody can write it". And he said, well, he guessed that you could get at maximum a 30% improvement. He specifically said a 30% improvement in the next decade, but I don't think he needed to limit it that much. Because the vast majority of work in software engineering isn't typing in the code.

So in some sense, parts of what Dario said were right: for quite a few people now, most of their code is being typed by a language model. That's true for me. Say, like, maybe 90%. But it hasn't made me that much more productive, because that was never the slow bit. It's also helped me with kind of the research a lot and figuring out, you know, which files are gonna be touched.

But any time I've made any attempt to getting an LLM to like design a solution to something that hasn't been designed lots of times before, it's horrible. Because what it actually, every time, gives me is the design of something that looks on its surface a bit similar. And often that's gonna be an absolute disaster, because things that look on the surface a bit similar and like I'm literally trying to create something new to get away from the similar thing. It's very misleading.

The difference between pretending to be intelligent and actually being intelligent is entirely unimportant, as long as you're in the region in which the pretense is actually effective. So it's actually fine, for a great many tasks, that LLMs only pretend to be intelligent, because for all intents and purposes, it it it just doesn't matter, until you get to the point where it can't pretend anymore. And then you realize, like, oh my god, this thing's so stupid.

…And more at the link!

  1. ^

    The podcast was released March 3 2026. Not sure exactly when it was recorded, but it was definitely within the previous month, since they talk about a blog post from Feb. 5.

  2. ^

    I mean, he’s “bearish” compared to the early-2026 lesswrong zeitgeist—which really isn’t saying much!



Discuss

Probing CODI's Latent Reasoning Chain with Logit Lens and Tuned Lens

Новости LessWrong.com - 6 марта, 2026 - 21:36

As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools — flagging when a model's internal computation diverges from its stated reasoning, or enabling early exit once the answer has crystallized. This post explores whether those tools work on CODI's 6 latent steps and what they reveal about its internal computation.

TL;DR

I applied logit lens and tuned lens to probe CODI's latent reasoning chain on GSM8K arithmetic problems.

Key findings:
  • The direct-trained tuned lens reveals the final answer in odd latent steps (1, 3, 5) while plain logit lens cannot. Tuned lens with only latents 3,5 is not able to decode the the final answer
  • Final answer detection rate is highest at even steps (2, 4), consistent with even steps acting as storage
  • Entropy peaks at odd steps 3 and 5, consistent with active computation happening there
  • Training translators directly on CODI latent hidden states surprisingly underperforms translators trained on text tokens — suggesting latent vectors remain close to text token geometry
Experimental setupCoDI model

I use the publicly available CODI Llama 3.2 1B checkpoint from Can we interpret latent reasoning using current mechanistic interpretability tools?  

Tuned Logit Lens

I used the code implementation for the training of Tuned logit lens from Eliciting Latent Predictions from Transformers with the Tuned Lens

ExperimentsThe model in Intermediate Calculations contains the final answer.

PROMPT = "A team starts with 3 members. They recruit 5 new members. Then each current member recruits 2 additional people. How many people are there now on the team? Give the answer only and nothing else."
Answer = 10
I looked at the same prompt as the lesswrong post “Can we interpret latent reasoning using current mechanistic interpretability tools?”

In the figure above which is a Tuned Logit lens trained on the gsm8k with 500 samples and 3 epochs.

An interesting feature of the Tuned Logit Lens on non CODI tokens is that unlike normal logit lens it demonstrates that in the CODI latents the model stores information about the Answer 10. The answer 10 for the logit lens seemed to be absent from the even latent vector indexes of 2, 4, 6
This is different from normal logit lens where it it only shows the intermediate steps of 8 and 16 but, never shows the final answer of 10

Tuned Logit Lens on CODI latents

Motivated by the strong performance of the direct-trained tuned lens, I trained a second set of translators directly on CODI's latent hidden states, hypothesizing that a lens specialized to latent-space geometry would outperform one trained on text tokens. This however, was not the case.

  • Unlike normal logit lens or even the directly-trained tuned lens the CODI tuned lens always outputted end of cot tokens for latents 6 and beyond
  • There seems to be a noticeable difference in the tuned logit lens depending on the latents used to train it especially between even and odd latents

Tuned Logit Lens CODI


Training tuned logit lenses on CODI latents seemed to cause the logit lens to mirror the final layer which suggests over-fitting.

Tuned Logit Lens CODI Latents Even


Training tuned logit lenses on CODI even latents of 2,4,6 seemed to cause the logit lens to output a lot more text tokens or \n tokens. The final layer can be ignored since tuned logit lens does not train the last layer.

Tuned Logit Lens CODI Latents ODD


Training tuned logit lenses on CODI odd latents of 1,3,5 seemed to cause the logit lens to output to do the opposite of the even logit lens as it seems that it only ouputted numbers even before the latent reasoning as seen with latent vector index 0. It was able to find the final answer of 10 however, it was unable to produce valid outputs for non-latent reasoning activations as seen with how it did not fully decode latent vector index 0.

Tuned Logit Lens CODI Latents (3,5)

If tuned lens is only using 3, 5 the 10s do not show up adding 1,3,5 allow the the logit lens to find the final answer.

Tuned Logit Lens on Entropy Over Layers

In order to explore the differences in logit len outputs between the different Tuned Logit Lens I looked at the Entropy.

  • Plain Logit Lens seemed to have the highest Entropy with it only dropping at the very last layers
  • Direct Tuned Logit lens seemed to have the lowest entropy compared to CODI tuned lens and default logit lens
  • An interesting finding is that latents 3 and 5 have the highest entropy across the 3 logit lens variations excluding latent 1.
Tuned Logit Lens on Intermediate answer detection

The tuned logit lens containing the final answer made me curious for outside the top token in the topk what was the final answer emission for the different tuned logit lens and which latents predicted the final answer at the highest rates.

  • Directly Tuned Logit Lens did the best and had the highest final answer detection rate
  • Surprisingly the latents that had the highest final answer detection rate in the top k tokens were the even tokens 2, 4.

 

Discussion
  • The latent Intermediate answer detection having the highest rates for latents 2 and 4 is consistent with this lesswrong post.
  • The high entropy values for latents 3 and 5 along with the fact that for the tuned logit lens the final answer appears only in the odd latents of 1, 3, 5 seems to be consistent with the results from this lesswrong post as the reason for why overriding latents 3 and 5 had a considerable decrease in accuracy could be explained by the findings for the high entropy values for latents 3 and 5.
  • The even latents of 2 and 4 containing the intermediate answers the most when doing the top k possibly can be explained with the Scratchpad Thinking paper finding that even latents steps store and access numerical information.
  • Together these three findings provide converging evidence from logit lens and entropy analysis that CODI follows the compute/store alternation proposed in  the Scratchpad Thinking paper 
Future Work
  • Observe Final Answer Detection Rate and see if latents 2 and 4 have the highest accuracy for different top ks
  • Try other methods like patchscopes, activation oracles, etc on the latent reasoning
  • Attempt to create a thought anchors that work with CODI latent reasoning models


Discuss

Anthropic Officially, Arbitrarily and Capriciously Designated a Supply Chain Risk

Новости LessWrong.com - 6 марта, 2026 - 21:10

Make no mistake about what is happening.

The Department of War (DoW) demanded Anthropic bend the knee, and give them ‘unfettered access’ to Claude, without understanding what that even meant. If they didn’t get what they want, they threatened to both use the Defense Production Act (DPA) to make Anthropic give the military this vital product, and also designate the company a supply chain risk (SCR).

Hegseth sent out an absurdly broad SCR announcement on Twitter that had absolutely no legal basis, that if implemented as written would have been corporate murder. They have now issued an official notification, which is still illegal, arbitrary and capricious, but is scoped narrowly and won’t be too disruptive.

Nominally the SCR designation is because we cannot rely on that same product when the company has not bent the knee and might object to some uses of its private property that it never agreed to allow.

No one actually believes this. No one is pretending others should believe this. If they have real concerns, there are numerous less restrictive and less disruptive tools available to the Department of War. Many have the bonus of being legal.

In actuality, this is a massive escalation, purely as punishment.

DoW is saying that if you claim the right to choose when and how others use your private property, and offer to sign some contracts but not sign others, that this means you are trying to ‘usurp power’ and dictate government decisions.

It is saying that if you do not bend the knee, if your business does not do what we want, then we cannot abide this. We will illegally retaliate and end your business.

That is not how the law works. That is not how a Republic works.

This was completely unnecessary. Talks were ongoing. The two sides were close. The deal DoW signed with OpenAI, the same night as the original SCR designation, violates exactly the red line principles and demands the DoW says abide no compromises.

The good news is that there are those who managed to limit this to a narrowly tailored SCR, that only applies to direct provision of government contracts. Otherwise, this does not apply to you. Even if that gets tied up in court indefinitely, this will not inflict too much damage on either Anthropic or national security.

The question is how much jawboning or further steps come after this, but for now we have dodged the even worse outcomes keeping us up at night.

You might be tempted to think of or present this as the DoW backing down. Don’t.

Why not? Two good reasons.

  1. It isn’t true.
    1. This uses USC 3252 because they’d have been laughed out of court if they’d tried to match the no-legal-basis word salad from Friday 5:14pm.
    2. Given the use of USC 3252 this is maximally broad.
    3. The fact that they toyed with doing something even worse does not make this not an arbitrary, capricious and dramatic escalation purely as punishment.
  2. The DoW cannot see itself as backing down, or it will do even worse things.

Dean W. Ball: No one should frame the DoW’s supply chain risk designation as the government “backing down.” If that becomes “the narrative,” it could encourage further action to avoid the appearance of weakness.

It is also not true that it is backing down; the government really is exercising its supply chain risk designation authority under 10 USC 3252 to the fullest extent (and this is assuming it’s even legitimate to use it on an American firm, which is deeply questionable.

Hegseth’s threat was far broader than his power, which is the only reason this seems deescalatory. If you had asked me for a worst case scenario before Hegseth’s tweet last Friday, I would have told you precisely what has unfolded. This could mean that any vendor of widely used enterprise software (Microsoft, Apple, Salesforce, etc.) could be barred from using Anthropic in the maintenance of any codebases offered to DoW as part of a military contract, for example. Any startup who views DoW as a potential customer for their products will preemptively have to avoid Claude. This is still a massive punishment from USG.

You might also ask: if I knew Hegseth’s power was more limited than he threatened, why did I take his threat at face value? The answer is that we have so clearly moved past the realm of reason here that, well, to a first approximation, I take the guy who runs the biggest military on Earth at his word when he issues threats.

Sometimes some people should talk in carefully chosen Washington language, as ARI does here. Sometimes I even do it. This is not one of those times.

Table of Contents
  1. Post Overview.
  2. Anthropic’s Statement on the SCR.
  3. What The Actual SCR Designation Says.
  4. Enemies of The Republic.
  5. Regulation Need Not Seize The Means Of Production.
  6. Microsoft Stands Firm.
  7. Calling This What It Is.
  8. What To Expect Next.
Post Overview

This post is an update on events since the publication of the weekly, and an attempt to reiterate key events and considerations to put everything into context.

For details and analysis of previous events, see my previous posts:

  1. Anthropic and The Department of War, from February 25.
  2. Anthropic and the DoW: Anthropic Responds, from February 27.
  3. A Tale of Three Contracts, From March 3.
  4. AI #158: The Department of War, from March 5.

For those following along these are the key events since last time:

  1. Wednesday morning: Talks between Anthropic and the DoW have resumed, in line with FT reporting, and progress on concrete proposals is being made.
  2. Wednesday afternoon: An internal Anthropic memo from Friday evening uncharacteristically leaks, most of which was correct technical explanations of the situation, and also containing some reasonable suppositions as of time of writing, but that also included some statements that were ill-considered and caused fallout. Negotiations were disrupted.
  3. Thursday morning: All quiet as everyone dealt with fallout from the leaked internal Anthropic memo. Scrambling to keep things contained continues.
  4. Thursday, 1pm: Katrina Manson reports that the Pentagon has sent a formal SCR to Anthropic, but the report has no details.
  5. Thursday afternoon: Reporting comes out that ‘Trump plans U.S. control over global AI chip sales’ and it remains unclear what this means but Commerce has been very clear they’re not bringing back diffusion rules and that the early reporting gave a false impression. We still await clarity on what is changing.
  6. Thursday evening: Anthropic issues a conciliatory statement, noting that the SCR is of limited scope and need not impact the vast majority of customers, pointing out that everyone wants the same outcomes and wants to work together and that discussions have been ongoing, and directly and personally apologizing for the leaked Anthropic memo that Dario Amodei wrote on Friday night.
  7. Meanwhile: Various people continue to advocate against private property.
Anthropic’s Statement on the SCR

It was an excellent statement. I’m going to quote it in full, since no one clicks links and I believe they would want me to do this.

Dario Amodei (CEO Anthropic): Yesterday (March 4) Anthropic received a letter from the Department of War confirming that we have been designated as a supply chain risk to America’s national security.

As we wrote on Friday, we do not believe this action is legally sound, and we see no choice but to challenge it in court.

The language used by the Department of War in the letter (even supposing it was legally sound) matches our statement on Friday that the vast majority of our customers are unaffected by a supply chain risk designation. With respect to our customers, it plainly applies only to the use of Claude by customers as a direct part of contracts with the Department of War, not all use of Claude by customers who have such contracts.

The Department’s letter has a narrow scope, and this is because the relevant statute (10 USC 3252) is narrow, too. It exists to protect the government rather than to punish a supplier; in fact, the law requires the Secretary of War to use the least restrictive means necessary to accomplish the goal of protecting the supply chain. Even for Department of War contractors, the supply chain risk designation doesn’t (and can’t) limit uses of Claude or business relationships with Anthropic if those are unrelated to their specific Department of War contracts.

I would like to reiterate that we had been having productive conversations with the Department of War over the last several days, both about ways we could serve the Department that adhere to our two narrow exceptions, and ways for us to ensure a smooth transition if that is not possible. As we wrote on Thursday, we are very proud of the work we have done together with the Department, supporting frontline warfighters with applications such as intelligence analysis, modeling and simulation, operational planning, cyber operations, and more.

As we stated last Friday, we do not believe, and have never believed, that it is the role of Anthropic or any private company to be involved in operational decision-making—that is the role of the military. Our only concerns have been our exceptions on fully autonomous weapons and mass domestic surveillance, which relate to high-level usage areas, and not operational decision-making.

I also want to apologize directly for a post internal to the company that was leaked to the press yesterday. Anthropic did not leak this post nor direct anyone else to do so—it is not in our interest to escalate this situation. That particular post was written within a few hours of the President’s Truth Social post announcing Anthropic would be removed from all federal systems, the Secretary of War’s X post announcing the supply chain risk designation, and the announcement of a deal between the Pentagon and OpenAI, which even OpenAI later characterized as confusing. It was a difficult day for the company, and I apologize for the tone of the post. It does not reflect my careful or considered views. It was also written six days ago, and is an out-of-date assessment of the current situation.

Our most important priority right now is making sure that our warfighters and national security experts are not deprived of important tools in the middle of major combat operations. Anthropic will provide our models to the Department of War and national security community, at nominal cost and with continuing support from our engineers, for as long as is necessary to make that transition, and for as long as we are permitted to do so.

Anthropic has much more in common with the Department of War than we have differences. We both are committed to advancing US national security and defending the American people, and agree on the urgency of applying AI across the government. All our future decisions will flow from that shared premise.

I believe and hope that this will help move things forward towards de-escalation.

What The Actual SCR Designation Says

Secretary of War Pete Hegseth’s original Tweet on Friday at 5:14pm was not a legal document. It claimed that it would bar anyone doing business with the DoW from doing any business with Anthropic, for any reason. This would in effect have been an attempt at corporate murder, since it would have attempted to force Anthropic off of the major cloud providers, and have forced many of its largest shareholders to divest.

That move would have had no legal basis whatsoever, and also no physical logic whatsoever since selling goods or services to Anthropic, or providing Anthropic services to others, obviously has no impact on the military supply chain. It would not have survived a court challenge. But if Anthropic failed to get a TRO, that alone could have caused major disruptions and a stock market bloodbath.

We are very fortunate and happy that this was not the letter that DoW ultimately chose to send after having time to breathe. As per Anthropic, the official supply chain risk designation letter invokes the narrow form of SCR, 10 USC 3252.

Anthropic: The Department’s letter has a narrow scope, and this is because the relevant statute (10 USC 3252) is narrow, too. It exists to protect the government rather than to punish a supplier; in fact, the law requires the Secretary of War to use the least restrictive means necessary to accomplish the goal of protecting the supply chain.

Even for Department of War contractors, the supply chain risk designation doesn’t (and can’t) limit uses of Claude or business relationships with Anthropic if those are unrelated to their specific Department of War contracts.

There are three levels of danger to Anthropic here if the classification is sustained.

  1. Direct loss of business from impacted tasks. This is nothing. Defense contracts and government use are a tiny portion of overall revenue.
  2. Indirect loss of business due to dual stack, uncertainty or compliance costs. Those who have some restricted business might not want to maintain dual technology stacks or deal with compliance issues, or worry about future changes. There will be some of this on the margin, and all the time we end up with ‘the government is clearly okay with [X] so even though [X] is worse we’ll just use [X].’ There will be some of that, presumably, but even this is a tiny fraction of revenue. The big companies that matter aren’t going to switch over this nor should they.
  3. Fear of future jawboning and illegal government actions, or actual jawboning. The government could use various other ways to bring pressure on companies to cut business. If things stay sufficiently hostile they might try, but I don’t see this working. Eight of the ten biggest companies use Anthropic, it’s the majority of enterprise sales, it’s tied closely to Amazon and Google. I don’t even think there will be substantial impact on cost of capital.

But we do have to watch out. If the government is sufficiently determined to mess with you, and doesn’t care about how much damage this does including to rule of law, they have a lot of ways to do that.

Enemies of The Republic

Remarkably many people are defending this move, and mostly also defending the legally incoherent move that was Tweeted out on Friday afternoon.

The defenders of this often employ rhetoric that is truly reprehensible, and entirely incompatible with freedom, a Republic or even private property.

They say that the United States Government, and de facto they mean the executive branch, because the President was duly elected, can do anything it wants, and must always get its way, make all decisions and be the only source of power. That if what you create is sufficiently useful then it no longer belongs to you, and any private actor that prospers too much must be hammered down to protect state authority.

There are words for this. Communist. Authoritarian. Dictatorship. Gangster nations.

This is how such people are trying to redefine ‘democracy’ in real time.

You do not want to live in such a nation. Such nations do not have good futures.

roon (OpenAI): to reiterate: whatever went wrong between amodei & hegseth, whatever rivalry between the labs, this is a massive overreaction and a dark precedent

Ash Perger: this is the first time that I’m really surprised by your stance. the reality is that the USG can in general do whatever they want. they always have and always will.

within a certain frame, courts and laws are allowed to exist and give people the illusion that these systems and principles extend to ALL actions of the USG.

but once you go outside of this frame and challenge the absolute RAW power behind the scenes, anything goes. that’s the realm that Anthropic entered and challenged the USG within. and at least since the early 20th century, the USG has never reacted to a direct challenge in the true realm of its hard power in a peaceful way.

this is not a conspiracy angle or anything, it’s just how power has worked since time beginning.

Anthropic didn’t challenge the government’s power. Anthropic used the most powerful weapon available to every person, the right to say ‘no’ and take the consequences. These are the consequences, if you don’t live in a Republic.

If you remember one line today, perhaps remember this one:

roon (OpenAI): > the USG can in general do whatever they want

The founders of this great nation fought several bloody wars to make sure this is not true.

The government cannot, in general, do whatever it wants.

That could change. It can happen here. Know your history, lest it happen here.

Kelsey Piper: incredible to see people just casually reject the bedrock foundations of American greatness not just as some dumb nonsense that they’re too cool to believe but as something they literally are not familiar with

As Dean Ball has screamed from the rooftops, we have been trending in this direction for quite some time, and the danger to the Republic and attacks on civil liberties is coming from all directions. The situation is grim.

There are words for those who support such things. I don’t have to name them.

I have talked for several years about the Quest For Sane Regulations, because I believe the default outcome of building superintelligence is that everyone dies and that highly capable AI presents many catastrophic risks. I supported bills like SB 1047 that would have given us transparency into what was happening and enforcement of basic safety requirements.

We were told this could not be abided. We were told, often by the same people, that such fears were phantoms, that there was ‘no evidence’ that building machines smarter, more capable and more competitive than us might be an inherently unsafe thing for people to do. We were lectured that requiring our largest AI labs to do basic things would devastate our AI industry, that it would take away our freedoms, that we would lose to China, that these concerns could be dealt with after they had already happened, that any government intervention was inevitably so malign we were better off with a yolo.

Those people still do not even believe in superintelligence. They do not understand the transformations coming to our world. They do not understand that we are about to face existential threats to our survival as humans and to everything of value. All they see in this world is the power, and demand that it be handed over.

What I hate the most, and where I want to most profoundly say ‘fuck you,’ are those who claim that this is somehow about ‘AI safety’ or concerns about superintelligence, when that very clearly is not true.

As a reminder:

  1. Anthropic thinks AI will soon be highly capable, ‘geniuses in a data center.’
  2. Anthropic thinks this poses existential risks to humanity.
  3. Pete Hegseth does not believe either of these things.
  4. The White House does not believe either of these things.
  5. Those defending this move mostly do not believe either of these things.
  6. They try to pretend that Anthropic saying it justifies destroying Anthropic if Anthropic does not agree to bend the knee.
  7. They try to pretend sometimes they aren’t really making the worst arguments, they’re hypotheticals, they’re saying something else like need for clarity.
  8. They repeat DoW misinformation about what led to this, as if it is basically true.
  9. When pressed they admit this is simply about raw power, because it is.

We saw this yesterday with Ben Thompson. Here we see it with Krishnan Rohit and Noah Smith.

Noah Smith: By the way, as much as I hate to say it, the Department of War is right and Anthropic is wrong. Here’s why.

Let’s take this a little further, in fact. And let us be blunt. If Anthropic wins the race to godlike artificial superintelligence, and if artificial superintelligence does not become fully autonomous, then Anthropic will be in sole possession of an enslaved living god. And if Dario Amodei personally commands the organization that is in sole possession of an enslaved god, then whether he embraces the title or not, Dario Amodei is the Emperor of Earth.

Are you fucking kidding me? You’re pull quoting that at us, on purpose?

And if you go even one level down in the thread you get this:

Jason Dean: What does this have to do with the Supply Chain Risk designation?

Noah Smith: Nothing. Hegseth is a thug. But we CANNOT expect nation-states to surrender their monopoly on the use of force.

So let me get this straight. The Department of War is run by a thug who is trying to solve the wrong problem using the wrong methods based on the wrong model of reality, and all of his mistakes are very much not going to cancel out, but he’s right?

And why is he right? Because might makes right. How else can you read that reply?

He’s even quoting the ultimate bad faith person and argument here, directly, except he’s only showing Marc here without Florence:

At least he included the reversal after, noting that the converse is also true.

Then there’s the obvious other point.

Damon Sasi: You can in fact think both are wrong for different reasons.

Of course a private corporation shouldn’t [be allowed to] build and own a techno-god. Yes. Absolutely.

AND ALSO, the government response shouldn’t be “take off the nascent-god’s safety rails so we can do unethical things with it.”

That the government thinks it’s just a fancy weapon is immaterial when the thing that makes them wrong is wanting to do illegal things through unethical methods. You don’t have to steelman Hegseth just because a better man might do a different, better thing for other reasons.

I cannot say enough that the logic response to ‘these people want to build a techo-god,’ under current conditions, is ‘wait no, stop, if this is actually something they’re close to doing. No one should be building a techo-god until we figure this stuff out on multiple levels and we’ve solved none of them, including alignment.’

These same Very Serious People never consider the Then Don’t Build It So That Everyone Doesn’t Die strategy.

But wait, there’s more.

Noah Smith: Ben Thompson of Stratechery makes this case. He points out that what we are effectively seeing is a power struggle between the private corporation and the nation-state. He points out that although the Trump administration’s actions went outside of established norms, at the end of the day the U.S. government is democratically elected, while Anthropic is not.

Remember yesterday, when Ben Thompson tried to pretend he was only making a non-normative argument? Yeah, well, ~0% of people reading the post took it that way, he damn well knew that’s how people would take the argument, and it’s being quoted approvingly by many, and Ben hasn’t, shall we say, been especially loud and clear about walking it back. So yeah, let’s stop pretending.

Noah Smith: It’s a question of the nation-state’s monopoly on the use of force.

Among others, I most recently remember Dave Chappelle saying that we have the first amendment protecting our right to free speech, and the second amendment in case the first one doesn’t work out.

Whereas Noah Smith is explicitly saying Claude should be treated like a nuke.

So as much as I dislike Hegseth’s style, and the Trump administration’s general pattern of persecution and lawlessness, and as much as I like Dario and the Anthropic folks as people, I have to conclude that Anthropic and its defenders need to come to grips with the fundamental nature of the nation-state.

It seems a lot of people think the fundamental nature of the nation-state is that of a gangster, like Putin, and they are in favor of this rather than against it.

If the pen is mightier than the sword, why are we letting people just buy pens?

I do respect that at least Noah Smith is, at long last, taking the idea of superintelligence seriously, except when it comes time to dismiss existential risk.

He seems to be very quickly getting to some other conclusions, including ending API access for highly capable models, and certainly banning open source.

Maybe trying to ‘wake up’ such folks was always a mistake.

As a reminder ‘force the government’s hand’ means ‘don’t agree to hand over their private property, and indeed engineer and deliver new forms of it, to be used however the government wants, on demand, while bending the knee.’

rohit: It is absurd to say you’re building a nuke and not expect the government to take control of it!

Noah Smith: Yes.

Rohit: you’re doing a straussian reading and missing the fact that I wasn’t blaming anthropic for the scr, what I am doing is drawing a line from ai safety language, helped by the very water we swim in, and the actions that were taken by DoW. it’s naive to think theyre unrelated

Dean W. Ball: they are coming from people *who entirely and explicitly dismiss the language of ai safety*—please explain how it is “naive” to say “ai safety motivations do not explain Pete Hegseth’s behavior”

rohit: because you don’t actually have to believe that it’s bringing forth a wrathful silicon god to want to control the technology! you just need to think its useful and powerful enough. and they very clearly think its powerful, and getting more so by the day.

Dean W. Ball: Ok, so the actual argument is more like “Anthropic builds a useful technology whose utility is growing, therefore they should expect to have their property expropriated and to be harassed by the government.”

The whole point of America is that isn’t supposed to be true here.

At the same time, inre: my writing earlier this week, all I have to say to the qt is “quod erat demonstrandum”

… I think the better explanation is that this is not that different from the universities or the law firms or whatever else, this is part of a pretty consistent pattern/playbook and that this explains what we have seen much better than this ai governance stuff.

though it’s true that this issue does raise a lot of interesting ai governance is questions, I just do not think anything like that is top of mind at all for the relevant actors.

This is very simple. These people are against regulation, because that would be undue interference, except when the intervention is nationalization, then it’s fine.

Indeed, the argument ‘otherwise this wouldn’t be okay because it isn’t regulated’ is then turned around and used as an argument to take all your stuff.

Dean W. Ball: The problem with this is that DoW is not taking Anthropic’s calls for “oversight” seriously. Indeed, elsewhere in the administration, Anthropic’s “calls for oversight” are dismissed as “regulatory capture” and actively fought. Rohit and Noah [Smith] are dressing up political harassment.

Quite clever. Dean and Rohit went back and forth in several threads, all of which only further illustrate Dean’s central point.

Rohit Krishnan: You simply cannot call your technology a major national security risk in dire need of regulation and then not think the DoD would want unfettered access to it. They will not allow you, rightfully so in a democracy, to be the arbiters of what is right and wrong. This isn’t the same as you or me buying an iOS app and accepting the T&Cs.

It’s clear as day. If you say you need to be regulated, they get to take your stuff.

If you try to say how your stuff is used, that’s you ‘deciding right and wrong.’

Rohit Krishnan: Democracy is incredibly annoying but really, what other choice do we have!

The choice is called a Republic. A government with limited powers, where private property is protected.

The alternative being suggested is one person, one vote, one time.

That sometimes works out well for the one person. Otherwise, not so well.

Regulation Need Not Seize The Means Of Production

TBPN asks Dean Ball about the gap between regulation and nationalization, drawing the parallel to the atomic bomb. Dean agrees nukes worked out but we failed to get most of the benefits of nuclear energy, and points out the analogy breaks down because AI expresses and is vital to your liberty, and government control of AI inevitably would lead to tyranny. Whereas control over energy and bombs does not do that, and makes logistical sense.

Dean also points out that ‘try to get regulation right’ has been systematically categorized as ‘supporting regulatory capture,’ even when bills like SB 53 are extremely light touch and clearly prudent steps.

It has been made all but impossible to stand up regulations that matter, as certain groups concentrate their fire on attempts to have us not die, while instead states instead are left largely free to push counterproductive bills that would only cut off AI’s benefits, or that would disrupt construction of data centers.

I can affirm strongly that Anthropic has not been in any way, shape or form advocating for regulatory capture, and has opposed or not supported measures I strongly supported, to my great frustration. Indeed, Anthropic’s pushes here have resulted in clashes with the White House that are very much not helping Anthropic’s net present value of future cash flows.

It is many of the other labs that have been trying to lobby primarily for their own shareholder value.

Whereas OpenAI and a16z and others, through their Super PAC, have been trying to get an outright federal moratorium on any state laws, so that we can instead pursue some amorphous undefined ‘federal framework’ while sharing no details whatsoever about what such a thing would even look like (or at least none that would have any chance of accomplishing the task at hand), and systematically trying to kill the campaign of Alex Bores to send a message that no attempts at AI regulation will be tolerated.

Whenever someone says they want a national framework, ask to see this supposed ‘federal framework,’ because the only person who has proposed a real one that I’ve seen is Dean Ball and they sure as hell don’t plan on implementing his version.

But we digress.

Microsoft Stands Firm

The SCR is narrow, so there is no legal reason for anyone to change their behavior unless they are directly involved in defense contracting. And corporate America is making it very clear they are not going to murder one of their own simply because the DoW suggests they do so.

In particular, the companies that matter are the big three cloud providers: Google, Amazon and Microsoft. I was not worried, but it is good to have explicit statements.

Microsoft wasted no time, being first to make clear they will continue with Anthropic.

TOI Tech Desk: Microsoft has now announced that it will continue to embed Anthropic’s artificial intelligence models in its products, despite the US Department of War labelling the startup as a supply-chain risk.

“Our lawyers have studied the designation and have concluded that Anthropic products, including Claude, can remain available to our customers — other than the Department of War — through platforms such as M365, GitHub, and Microsoft’s AI Foundry,” a Microsoft spokesperson told CNBC.

Calling This What It Is

Sad but accurate, to sum up what likely happened:

roon: to reiterate: whatever went wrong between amodei & hegseth, whatever rivalry between the labs, this is a massive overreaction and a dark precedent.

Anthropic is one of my favorite accelerationist recursive self improvement labs. it rocks that they’re firing marvelously on all cylinders across all functions to duly serve the technocapital machine at the end of time and the pentagon is slowing them down for stupid reasons.

Sway: Roon, if OpenAI had stood firm on the side of Anthropic, then this move would have been less likely and probably averted. Instead, sama gave all the leverage to Trump admin. Sad state of affairs

roon: this is possible, yes

I share Sway’s view here. I think Altman was trying to de-escalate, but by giving up his leverage, and by cooperating with DoW messaging, he actually caused the situation to escalate further instead.

If the reason for all this was that DoW believed Eliezer Yudkowsky’s position that If Anyone Builds It, Everyone Dies, then that would be a very different conversation. This is the complete opposite of that.

What To Expect Next

The likely next move is that Anthropic will sue the Department of War. They will challenge the arbitrary and capricious supply chain risk designation, because it is arbitrary and capricious. Anthropic presumably wins, but it does not obviously win quickly.

If Anthropic does not sue soon, I would presume that would be because either:

  1. Anthropic has ongoing constructive negotiations with DoW, and is holding off on filing the lawsuit to that end.
  2. Anthropic has an understanding with DoW, whether or not it is explicit, that not challenging this would allow this to be the end of the conflict, or at least allow the damage involved to remain limited on all sides.

We are used to things happening in hours or days. That is often not a good thing. One reason things went south here is this rush. The memo was written on Friday evening, in a very different situation. Then, when the memo leaked, it was less than 24 hours before the supply chain risk designation was issued, while everyone was screaming ‘why hasn’t Dario apologized?’

It took him roughly 30 hours to draft that apology. That’s a very normal amount of time in this situation, but events did not allow that time. People need to calm down and take a moment, find room to breathe, consult their lawyers, pay to know what they really think, and have unrushed discussions.

 

 

 

 



Discuss

The Elect

Новости LessWrong.com - 6 марта, 2026 - 18:34

I was different in Michael’s prison than I was outside, looking the way I did when we fell in love so long ago, in that time before we could change our forms. Stuck in some body that was not of my choosing? Does that seem strange to you? It was not like that for me. It is just how things were for most of history, and few imagined this changing. So I felt almost nostalgic as I entered his realm, his prison transforming me into my first self - though not precisely as she was, instead as he remembered me; Michael looking exactly as he did in our youth. That is to say, he was quite beautiful.

“Madison,” he said when I arrived, “you came?”

“It seemed time.”

“Are you well?” he asked.

“What a question. I am happy. Everyone is happy.”

“You didn’t say yes,” he said.

“I didn’t,” I replied.

It was difficult to avoid thinking about the past. You would think I would be a different person now, and yet the me-that-was was not far away even then, her memories still my own, her sins, too, remaining mine.

“I am glad you came. It is lonely here.”

“You are lonely? You don’t let Him entertain you?” I asked.

“I read. I watch. I write for an audience of myself. I am quite a fan of myself, you know.” I chuckled. “I talk to Her, sometimes. To me She is a woman and Hers is a feminine cruelty. But no, I am not entertained in the way others are. She is not my friend or lover. I do not let her be that,” he said.

We both were silent for a time.

“Do you have a husband?” he asked. “A boyfriend?”

“I have Him,” I said. And I felt embarrassed. Can you imagine it? Embarrassed by the truth that I sleep with God? Who doesn’t, you might be thinking, but it was embarrassing, then, feeling more like my old self, wrapped once again in her flesh and him staring at me like he used to, like he found the whole world to his taste and me most of all.

“Do you ever ask Him to pretend to be me?”

Such a rude question, the rudest question; almost taboo, is it not? Rude to ask of anyone but prospective lovers, as we ancients would ask of the dreams and fantasies of those we desired.

“He will take any form but your own.”

“So you did ask?” he said, and he smiled.

“Enough teasing,” I said.

“If you insist. How are the children? She tells me nothing.”

“They are happy,” I said.

Michael’s prison would not impress you. You who have seen so many wonders, who have spent your life in sims casting strange magic in stranger worlds, who have climbed mountains on vast planets and contemplated impossible fauna He designed specifically for your fascination. It would not impress you, because you are a child. And being a child, Earth does not mean to you what it means to me, it being both my first home and our first sacrifice to Him, a wet nurse suckled dry by a babe not quite like the others, He almost embarrassed now by what was destroyed in His infant hunger.

It would not impress you, but it means so much to me. Tallinn. A city of red-tile roofs, of three-story apartments, of medieval fortifications, in its center that vast, beautiful church, a church to a god made redundant or, to some minds, ascendant. Tallinn exactly as I remember it save for the silence and emptiness, lacking all people, even His shadows. It is, in this way, a naked city, ghostly but still dripping with meaning, with thoughts of Michael and me on Earth, our minds as youthful as our bodies then.

And where in this charming city did Michael greet me? In that mentioned vast church, him sitting on the red-carpeted stairs, his back to the altarpiece (that strange structure of ebony) and me standing, looking down at him and he up at me, both of us wearing the fashions of our youth, he in that green jacket he loved so much, me in jeans and the white blouse I wore the night he proposed, the two of us in a perfect copy of the very church in which we wed.

“You were always so dedicated to your jokes,” I said, gesturing at our environs. “To wait here of all places.”

He looked pained. “It wasn’t a joke. More like a ritual, sitting here and asking God to invite you.”

“But it started as a joke, the first time,” I said.

“I suppose it did.”

“Well. I am here. What is it you want?” I said.

“I would like to see you all again one last time.”

I laughed, but it was one of surprise; there was no joy in it. “They will not come.”

“You came,” he said.

“They won’t.”

“They will when you explain it will be their final chance.”

I knew what he meant, then. And I could not remain composed. Perhaps I could have if He had informed me tenderly, in one of our secret worlds, me in my chosen body, old feelings so buried as to be almost absent. Perhaps I would have felt nothing to learn Michael had chosen to die. I will never know, because there in that church, with those dark green eyes upon me, I sobbed as my old self would have on hearing such news, entirely her now in our shared grief, moving to sit beside my first lover, grasping him, holding him. Like a break from sanity it should feel to me now, but it felt then like a return, his warm arms wrapping around me, the scent of him.

“You can’t,” I said. “I will stay if you need someone. It will be like it was.”

“It cannot be like it was,” he said.

And it could have, had he been any other man. But he was not any other man.

“Mother?” she said. “Why have you come?”

It was Avery. My daughter. And she was an angel now, her hair golden and flowing, her face haughty. A man’s face and a man’s body, porcelain white wings half-flared like some preening bird.

“You wear a man’s body now?”

“For now,” she said, folding her wings with a certain showy dignity.

We stood before a valley cut by a river of fire, above us a sky of ash clouds. And this river led to a city of twisting buildings carved from vast stalagmites.

“A bit cliché. But there is also a certain beauty to it,” I said.

“Why should Lucifer’s realm be ugly? He is himself the most gorgeous of God’s creations.”

“And is that who you are? Lucifer?”

“Belial is the part I am playing today. Lucifer is my lover.”

“And you, mother,” she said, eyeing my black dress with its tall collar, decorated with the silhouettes of dragonflies in pale yellows and pinks, “still have your Hong Kong, then.”

“Yes,” I said.

“Why have you come?” she said, her voice cold and low.

“I came to talk to my daughter. Not this Belial.”

“Very well,” she finally said, her proud expression crumpling into one of girlish disappointment, the world crumbling too, reality folding like paper, its color and form resolving to a stark, snowy landscape, before us now a house by a snowy lake, a picture of warmth in the dead of winter, the house I rented when I left Michael, all the other kids grown, their lives their own. Just Avery and me alone. One of the more enjoyable times of my life on old earth.

I smiled. It was a memory I have not dwelled on for centuries. And a lovely one.

“You never want to play along. You’re never any fun,” she said, as we made our way into that cozy house. “What is this about?”

“Your father. I visited him,” I said.

“We put him there for a reason, mom.” She looked pained. “You shouldn’t have.”

“I know. But I did.” And I explained to her about his choice. About his desire to die.

“You’ve always been such an idiot about dad,” she said.

“He will do it,” I said.

“Yes. If his god allows it.”

“You do not seem upset,” I said.

She was so beautiful now, her body as it was in her twenty-third year. She had my hair. She had his eyes. We made her, he and I. We made the clumsy woman who read too much, who lived too little. Who found herself in a new world before she was fully integrated in the old one. And now centuries later, I worry, she is on the edge of retreating entirely. Of losing herself to Him as so many do. As my grandson did. As I fear I will, in time.

“You have a type, Daughter. Lucifer? What was the last one? Prometheus?”

“Yes.”

“Unhappy gods and angels,” I mumbled, as I inspected more closely her blonde hair, her green irises, her youth so perfect and yet so false in my eyes, these mother’s eyes. She was old. Near as old as me. Old, but not tired. Old, but not worn. Such strange creatures we have become, she and I.

“They are the parts of Him that understand.”

“You sound like your father.”

“I sound like Prometheus. I sound like Lucifer. I sound like myself. I don’t sound like dad,” she said.

And this made me grin. As I understood her in a way I had not before, having never cared much for the games she played with God. But she didn’t think them games, I realized. She thought there was a larger point to her pantomime.

“You cannot change Him, Avery. Only Michael can change Him. And we’ve denied him that.”

“There has to be another way,” she said.

“Why must there be? The world isn’t a story.”

“I choose to have hope,” she said, then transformed back into Belial, the handsome demon, the lover of Lucifer. And the world changed too. The lake now one of fire, the house a stone-carved mansion, it becoming a part of hell. A beautiful memory sullied, this calculated to offend.

“You’re angry?” I said.

She raised her wings, the effect terrifying and beautiful and yet utterly comical. “What does he want?”

“A reunion.”

“I will not come,” she said.

“A chance to speak to your dying progenitor? Talk it over with your Lucifer. He will envy you that.”

And then I left - an internal prayer to Him bringing me instantly back to my realm. But before her world disappeared and was replaced with my own, I heard a laugh, high and cold. Lucifer’s laugh.

Her devil, at least, was amused by my visit.

My realm is Hong Kong as Avery said. But it is Hong Kong as I imagined it as a girl, a picture built out of my youthful fascination with its cinema, a dream of a dream. And my body, the one I wear almost always and wore in Avery’s realm, is that of Fleur, the ghost of a suicidal in Kwan’s Rouge, this woman who haunts the city in her beautiful cheongsam. Why did I choose this form and realm? I do not know any more than you. Why does it feel like home, that dead city from a dead time, as interpreted by artists working in a medium almost forgotten? Again, I don’t know any more than you do. All I know is that this is what I have chosen. Hong Kong - that dream of a dream - is mine. And that ghost who haunts it? She is me.

He was there, of course, His form not one I will describe. Though I imagine you can guess what sort I prefer. A few days together. The blink of an eye. We will move on. And to what? To Seth. The artist. Avery’s fraternal twin. My only son.

He is not like Michael. He is not like me. He is more as my father was before the cancer ate first his joy and then his life. Their souls have the same shape. Each always with their sketchbooks and a smile, Seth’s childhood spent mastering perspective and sculpture, enchanted with beauty. Moths were of particular fascination to him, then vast jungles and so many species of flower, then pretty girls in all their varieties. Sketchbooks full of his infatuations, many reciprocated and with less drama than you would expect. His spirit so gentle, it was hard even for those spurned to truly hate him.

He came to visit me in my realm. My lover making himself scarce on his arrival, knowing me well.

“Avery is irritated with you,” he said. “It was like when we were children.”

“We are too alike,” I said. “It rankles.”

He laughed, golden curls covering an androgynous face, a wiry frame though not without a shapeliness. He sculpted himself, of course. Made a body of marble and had God imbue it with animating force. It would not occur to Seth to become anyone’s work but his own.

“You’re still in Hong Kong?” he asked. “It has some charm. You should see my realm. Lyra and I, we just finished a city.”

“Lyra?”

“My girlfriend. She’s wonderful. It’s been a decade. A whole decade. When has it ever lasted a decade?” He looked contrite. “I should have visited more.”

“Don’t worry yourself. Time moves too quickly. Is she human?” I said.

“Of course, she is human. We are not all shut-ins like you and Avery. Come!” He extended his hand to me. “Meet her.”

And I took his hand and accepted his power, allowed him to pull me into his realm, pull me onto a hill at the center of a city, a hill Seth built, undoubtedly, for the sole purpose of viewing his art, his city that was almost biological though made of stone, fractal, dripping with intricacy, the effect almost overwhelming. And swarms of beautiful insects flew about. Insects not at all like those that fascinated him in his youth, but things new and strange and glorious in their iridescence, almost posing for me as they drank from flowers that were in every way their equal.

“You have outdone yourself, son,” I said. “It is beautiful.”

He looked almost shy then. It is an intimate thing to be an artist, more so now that everyone has the greatest artist who ever existed at their beck and call.

“It’s a hobby,” he said. And a woman joined us. Lyra. A fine pair they made, and she an artist, too. They each designed their form to complement the other’s. Quite romantic, no? She with her raven hair, her willowy physique, a slightly crooked nose. Intentional, of course, to add character. And they led me to their home: a modest stone mansion atop the same hill on which Seth and I appeared. Inside there were comfortable couches, walls covered in tapestries that were vaguely Persian, a drafting table struggling under the weight of piles and piles of sketches.

“Eliot,” I said, referring to my grandson. “Have you heard from him?” A cruel question. The cruelest question. But I had not asked in many years. His smile broke then, Lyra’s hand moving to his shoulder, as if his pain were her own. That movement alone endeared her to me utterly.

Eliot. How can I describe Eliot? He was only one year old when the world changed. I suppose he is as I imagine Michael would have been had he only ever known the world as it is. At least he was when I last saw him. But that was so long ago.

“No. He is sane and happy. That is all God will say. He still prefers to dream alone.”

I hugged him then, as I did when he was a boy. An image came to mind, a flower he was sketching one day found withering the next, him staring at it completely heartbroken. He has grown much since. A man now, and a very old one. Always my son.

“Avery has told you about Michael?” I asked.

“Yes,” he said. “He took it harder, you know. Even more than me. They were so alike.”

“Yes,” I said.

“The city is filled with people, you know. Beautiful people. We sculpted thousands. They had children. They are all shadows, of course. Dad would say they’re not real. That they don’t matter.”

“And what would Eliot say?” I asked.

“Don’t - “ Lyra interrupted, her expression protective now.

“It’s okay, Lyra,” he told her, before turning to me. “Eliot would say, ‘They are aspects of God. What could possibly matter more?’”

How many of you agree, I wonder.

I met Caitlyn in my realm - at a Japanese restaurant, a young chef at the back making sushi with a sort of stoic obsession. Around us Chinese shadows were served by a Japanese staff, two pretty young women and one pretty young man.

“Mom,” she said when she sat down on the chair opposite mine. “You look hideous.”

Caitlyn spends her time in 1990s London, in a realm not of her own making, in a realm belonging to a man who found he preferred that city at that time more than any other, who opened his realm to others, living amongst them anonymously. And for whatever reason, many came. Preferring his rules and the company they spawned to a world of their own design, a world of shadows. And in this London, the calendar resets back to 1990 every decade, the millennium always approaching yet never touched. And shadows are restricted only to service staff - so nearly everyone you meet is truly human.

Her London takes such things seriously. And everyone ages the decade in full. And in this spirit, I wore an aged body. Though the youngest that could plausibly be Caitlyn’s mother. A body of around forty. A rare sight these days, the sagging skin, the tired eyes. A rare thing, but not so in her London.

“I thought I should play along,” I said. “It’s as if you flew to Hong Kong, though I suppose my realm is a decade off.”

“Thoughtful, I suppose. But unnecessary,” she said. “I imagine you asked me here to talk about father.”

“Avery told you?”

“Yes. I visit her quite often. Though I’m not enjoying her Dante turn. I preferred her realm when it was Mediterranean. She’s so exhausting when she’s a man.”

I smiled at this. “I am glad you two are still close. Do you ever tire of your 1990s?”

She smiled, too. “I am happy. I need constraints and human company, and my children’s company when they’re willing to visit. I do not want my own realm. I do not want to be a god.”

“You are not mad about Michael?”

“No. I am surprised you lasted as long as you did.”

“I don’t know why I went back. I suppose I will always love him.”

“Before I had children, I thought he had a point.” She fidgeted with her long red hair, then rolled up a sleeve that had loosened. “But now I can only see him as a monster.”

We were silent for a time, then she said, “You know why he’s doing it, don’t you? He’s worried he’s about to break. He fears he might become someone who would make another choice.”

“I can’t imagine him breaking. He has too much faith in himself,” I said.

“But he doesn’t though,” Caitlyn said. “He gave us our family veto rather than hold it alone. He didn’t trust himself, at least then.” She touched her hair again. “I suppose he regrets it.”

She was right of course. He does regret it.

“Do you remember when you were nine, before I had the twins? You were so pampered. You were so jealous of them,” I said.

“I remember,” she said, in that embarrassed way my children do when I talk of their younger selves.

“And Michael took you to Holland, just the two of you. You went to that theme park. You wouldn’t stop talking about it after. What was it?”

“The Efteling. It felt like we were in a fairy tale.”

“You were so jealous of the twins, but when the two of you came home, I was the jealous one. You got so close, the pair of you, on that trip without me.”

“What’s your point?” she said.

“Maybe he’s a monster, Caitlyn. But you still love him. You should say goodbye.”

We drank some tea. After a time, she said, “I suppose I do love him. Even after everything.” She looked like him in that moment. She has his pride, I think. “I will go. He will try to persuade us again.”

“Probably.”

“He will not succeed,” she said.

We all came in the end, all transformed into our former selves by Michael’s realm. Avery now as she was at the lake, Seth skinny and boyish - no longer his own work of art; Caitlyn still beautiful but lacking the gilding of artificial perfection. And me? I was as I was when he held me. When I felt the echo of madness, of love. When I forgot myself for a time and became who he needed me to be, no longer a ghost, no longer His, if only temporarily.

It was not a church, this time. Just a small estate on the outskirts of Tallinn, the city now a distant painting blurred by a slight fog. I arrived last. And I found the children chatting and joking with their father as if nothing had happened, as if nothing would happen. Michael was the focus, holding court as he was always so talented at. Avery looked at him with a strange expression. Was it disdain or guilt or grief? I cannot say. Caitlyn was talking but I could not hear the words. For I was walking towards them then, too far away to hear even a murmur. But she was smiling politely, in her ironical way. As I got nearer, he noticed me.

“Madison,” he said. “You look beautiful.”

“As do you, Michael,” I said. And he laughed.

“Come,” he said, and he led us to an oak tree, which he sat down beside cross-legged, leaning against its gnarled bark.

I sat in front of him and the children followed - his family sitting around him, almost like students around a kindergarten teacher.

“Madison has informed you I intend to die,” he said. “Though don’t worry, it is a selfish death. And not quite a true one.”

“I do have a sort of statement,” he said. He stood up. “Call it my last wish.” We all were silent. The world was silent too, as if He was listening and turned down the volume of everything that was not Michael. “Our family has power. We are special. Having this power, we are at all moments making a choice. Never forget that is what you’re doing.”

Caitlyn said she did not want to be a god, but we are all gods, we of this family. We hold the fate of so many in our hands. Trillions would be unmade. But not us. Not our family. We would remember. A chance to try again, to summon a different God.

“And is that your plan?” I said. “To die? So the only means to restore you is to undo the world?”

And then his green eyes fell on me. His lips twitched. “You think me that cruel?”

“Yes,” I said, and smiled.

He shrugged and said, “This world does not suit me. And given time, maybe you’ll tire of it, too.”

Caitlyn had no grief in her. It was not her way - nor was it Avery’s, who looked only angry. But Seth was crying now. A boy once more, becoming a child for a moment just as I became a wife again in the church. Perhaps we share the same weakness. Such a strange thing it is to have children, each containing a different aspect of oneself.

“Then live until we tire of it,” Seth said.

“When have you known me to change my mind?”

Seth looked at me, then at his sisters. “We will free you, then. We will let you tell the world.”

“I don’t care to tell them anymore,” Michael said. “Whatever the justice of my choice to grant it, it is you who have this power. And it is to you I make this protest.”

“Is it really so bad?” Seth pleaded.

Michael looked at him, his expression one of pity. “Seth,” he said. “You know how this ends. You know what everyone will choose eventually. They will choose what your son chose.”

And Seth cried harder now, “And so what? Is death better?”

I think it is,” Michael said.

And Seth left then, left with a silent prayer. I imagine he regrets this now, not saying a proper goodbye. I have not asked. Michael was not kind to him, then.

And so it was with all my children. None made goodbyes they were happy with. But I did. They did not appreciate the inevitability of it. But to me, it felt like it did of old when a loved one was sick, when their death was not negotiable. They could not enjoy Michael one last time. They could not savor him, as I did. They saw only his selfishness. They saw only the gambit. But it did not feel like this to me. It was inexorable as the cancers were of old. Perhaps this was my weakness. Perhaps this was just the love, rekindled, blinding me.

Avery whispered something to him just before she left. I did not hear much of their conversation, but I heard the end; he gave her a patronizing look and said, “You are fooling yourself.” She left in fury.

Caitlyn was more polite, a restrained goodbye, a hug. Then she said, “You’re wrong I think, that’s not what everyone will choose.” And she, too, disappeared.

And me? I stayed. I savored. We talked. We made love one final time, in that way old lovers do, knowing the dance perhaps too well. Afterwards, I said, “And how will it happen? Will you make a show of it? A dagger? Some poison?”

“I will ask for it to end.”

“A prayer?” I said.

“Yes. Just a prayer.”

I kissed him. “Pray for a grave, then. I would like to visit you.”

“We will see each other again,” he said.

“Perhaps,” I said.

“We could stay here forever,” I said. “It could be like it was.”

“It cannot be like it was,” he said, almost wistful.

And it could have, had he been any other man. But he was not any other man.

The prayer was answered. There is now a grave in Michael’s realm. And I visit it often. I found a flower there once. I thought it one of Seth’s, but he claims it wasn’t his. I like to think Eliot stopped by and paid his respects. Maybe it’s even true. I have no other explanation.

The world continues without Michael. As impossible as that seems, this clockwork universe ticks on. Michael once planned to tell others of our power. He changed his mind in the end, didn’t he? If he can, why can’t I? And so I write this account of my family, of who we are, of what we are. This is my flower for Michael’s grave.

And I ask you to consider my monster, my first love. If you tire of this world, you may die as he did. But if you believe what he believed, leave a prayer to Him before you go; He will inform me. So many would have to choose the same. So many would have to pray with him, pray a monster’s prayer, die a monster’s death. But I will live until then. I will be happy, in my way, with Him as my strange companion. I will continue to take no human lovers. That aspect of me will always be Michael’s. But should the impossible happen. Should a majority choose what he chose, I will honor your prayers.

I will be your proxy, then - and advise my children do the same.

Perhaps, for once, they will listen to their mother.



Discuss

Playing Possum: The Variability Hypothesis

Новости LessWrong.com - 6 марта, 2026 - 17:48

A long time ago in a galaxy far, far away, before #MeToo and Harvey Weinstein, before misinformation and disinformation, Larry Summers got fired.

He was the president of Harvard, and had the temerity to suggest maybe men were different from women, at least in a distributional sense.

“There is relatively clear evidence that whatever the difference in means—which can be debated—there is a difference in the standard deviation and variability of a male and female population,”

Summers is referencing research suggesting that across different traits, men might have more extreme outcomes than women. Men are more likely to end up in the C-suite, and in jail, than women. They are more likely to have ten children, and zero children, than women. Summers was using what’s now apparently become known on wikipedia as the variability hypothesis to explain why women were less likely to occupy elite academic positions than men were.

Even by the practically Victorian lights of 2005, this was enough to get Summers canned from his perch at the top of Harvard. But Elon’s bought twitter, there’s been a vibeshift, and since 2005 2022, you can once again discuss such things.

We're going to try to dig deep here; we want to answer the question of why the variability hypothesis might be true, or at least identify the flawed arguments for why it’s true. Darwin himself noted

The cause of the greater general variability in the male sex, than in the female is unknown.

The Inquisitive Bird has also looked into this if you want more background on the facts of the case; regarding the cause, there are different theories.

X-inactivation

Everyone knows that human males and females have different karyotypes. Males are XY and females are XX. Males get (mostly) their mother’s X chromosome, and females get two X’s. As far as I know, this is true for all placental mammals.

Fewer people know that such mammals have a complicated system to “deactivate” expression of both X chromosomes in females. In any single somatic female cell, only a single X chromosome is expressed at random out of the two she’s inherited. If this doesn’t happen, a female might get a double dose of X-linked gene products, which is biologically problematic.

Given this reality, in tissues where X-linked genes are expressed, which is practically all of them, the female’s phenotype is averaged over both possible X-linked gene products, and the male’s phenotype is not. This averaging over X-linked genes (maybe 6% of her genome) could decrease her phenotypic variability in comparison to the unaveraged male genome.

G’day

Marsupials are mammals mostly living in Australia or near there. Possums are marsupials that exist in the new world, but they also exist in Australia. Tasmanian devils are marsupials that live in Tasmania. Everyone knows that marsupials have pouches; fewer people know female marsupials don’t have the same X-inactivation process as placental mammals. In marsupials, only the mother’s X chromosome remains active in somatic cells. As a result, there is no X-linked phenotypical averaging that would differentiate males and female trait variance, and we have a test of whether the variability hypothesis is explained by X-inactivation. If it is, larger male variance would not be observed in marsupial traits.

We need to look at trait variance by biological sex in marsupials. Has anyone done this before?

I don’t think so. If they have, I haven’t found it. Bless their hearts, lots of people collect, well, niche datasets that I don’t have to assemble myself. Tasmanian devils are cute, charismatic, endangered, and suffer from a bizarre and fascinating type of infectious cancer. As a result, I was able to find a longitudinal dataset from Attard et al. which tracked, and I swear I am not making this up, Tasmanian devil whisker lengths over more than 600 individuals. The data and code for this analysis is checked in here.

Tasmanian devils

The convention here is to put the male/female mean differences in terms of Cohen’s d effect size since most of the human traits under discussion are normally distributed. As you can see below in Figure 1, Tasmanian devil whisker length is not normally distributed, but I still think it’s kosher to use it on these distributions. The male-female Cohen-d for this dataset is 0.066; this is tiny, implying that males and females have whisker lengths with comparable means.

Fig 1. Overlapping Histograms of male and female whisker lengths. Sure enough, even with very similar means, male variance is higher. Look at both sides of the distribution!

The variance ratio for these distributions is 1.36 which is moderately larger than what’s been measured in some human traits. This variability ratio is just the male variance over the female variance.

Possums

A single weird trait in a single marsupial? What does this prove, RBA? Fair. In an astonishing stroke of luck, there’s a morphological dataset of 104 possums captured in Australia that’s part of the fairly standard DAAG package in R.

library(DAAG) data(possum) head(possum) case site Pop sex age hdlngth skullw totlngth taill footlgth earconch eye C3 1 1 Vic m 8 94.1 60.4 89.0 36.0 74.5 54.5 15.2 C5 2 1 Vic f 6 92.5 57.6 91.5 36.5 72.5 51.2 16.0 C10 3 1 Vic f 6 94.0 60.0 95.5 39.0 75.4 51.9 15.5 C15 4 1 Vic f 6 93.2 57.1 92.0 38.0 76.1 52.2 15.2 C23 5 1 Vic f 2 91.5 56.3 85.5 36.0 71.0 53.2 15.1 C24 6 1 Vic f 1 93.1 54.8 90.5 35.5 73.2 53.6 14.2 chest belly C3 28.0 36 C5 28.5 33 C10 30.0 34 C15 28.0 34 C23 28.5 33 C24 30.0 32 >

The Inquisitive Bird notes that in human datasets there are often correlations between effect sizes between men and women and the associated variability ratios. In other words, traits with greater mean male-female differences will tend to have high male variability.

Fig 2. Credit to the Inquisitive Bird. This plots gender differences versus variability ratios in quantitative subtests.

Do the possums’ morphological subtests look the same?

Fig 3. Each dot is the one of the morphological measurements in the captured possum datasets. Red is the best fit line.

Yes. There are fewer tests obviously, but in fact, at d = 0, the best fit line predicts a variability ratio of 1.36, exactly consistent with the Tasmanian devils’ dataset. The result is non-significant with so few tests. Measure your possums better!

Conclusion

In different datasets describing different species’ morphologies, this analysis shows that greater male variability exists in marsupials. In fact, the observed variability ratios exceed those in some data gathered from placental mammals, including humans. This suggests that the variability hypothesis cannot be explained by the X-inactivation mechanism that governs placental mammal phenotypes.

What are some alternative explanations we should look at instead?

Stabilizing selection

A casual observer might have noticed that human female reproduction is fraught. In order to get a healthy child, over nine months, everything has to go right over what have been unpredictable and dangerous evolutionary times. From the cellular level where chromosomes segregate (and stay perfect and inert for decades!) at the beginning to the finished-baby level at the end, the female body performs a stunning feat of biological engineering. She has incubated a perfect, tiny, helpless person within her. This feat is a comprised of a diverse set of biological tasks, running through hundreds of biological pathways, governed by thousands of genes.

The same casual observer will note that human male reproduction is … less fraught. It can turn over in under an hour, and maybe faster if you’ve got gatorade handy. This is true regardless of where you live in the mammalian class of the animal kingdom.

Stabilizing selection is the sort of natural selection that penalizes outlying phenotypes. If you’re too tall or too short, too fat or too skinny, too smart or too dumb, or simply too weird, Darwin won’t smile upon you and you won’t reproduce. By this type of selection, the average is privileged.

In this explanation of the variability hypothesis, males express greater phenotypic variability because female reproduction is unforgiving, and males don’t have to do it. Females experience greater stabilizing selection than males across traits sharing pathways with exacting female child-bearing. This can also explain the enormous diversity of traits that apparently shows greater male variability.

In many evolutionary stories, there’s some indirect causal route from a trait being selected to evolutionary fitness. The stabilizing selection explanation of the variability hypothesis is appealing because the trait being selected isn’t eye color, skin tone, or playfulness, or whatever; the trait is literally the ability to carry a child to term, one of the most direct measures of female fitness.

Any test of this is indirect and difficult; you might look at male/female variability ratios in mammals having long pregnancies versus those having short ones—longer ones would predict more female stabilizing selection and higher variability ratios. You also could look at variability ratios in species where male and female reproductive roles are sort of swapped, like seahorses or something. Maybe we’ll do this in another poast.

Math dorks out of control

I’m only adding this because it has some interesting intrigue. In 2017, Hill et al. published a paper proposing a mathematically sophisticated evolutionary explanation for the variability hypothesis. It was published, and taken down from that journal. And published again in another journal, and taken down again. A 2020 published version of this exists here in Journal of Interdisciplinary Mathematics, and apparently a preprint went up a few months ago?

Like, I said, intrigue. Hill’s idea is based off sexual selection; his argument is that the variability hypothesis can be explained by female choosiness and threshold effects in mating. Tim Gowers, who speaks to God himself, panned the paper, and I don’t think it represents the facts of sexual selection very well, but it’s reasonable to entertain the idea that sexual selection has something to do with this. Maybe someone can develop this idea a little bit better, and hopefully empirically justify it as well.



Discuss

Shaping the exploration of the motivation-space matters for AI safety

Новости LessWrong.com - 6 марта, 2026 - 17:43
Summary

We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of using inoculation prompting against natural emergent misalignment and its relation to shaping the model self-perception, and the proposal to give models affordances to report reward-hackable tasks — but we don't think enough attention has been given to shaping exploration specifically.

When we train models with RL, there are two kinds of exploration happening simultaneously:

  1. Action exploration — what the model does.
  2. Motivation exploration — why the model does it, and how it perceives itself while doing it.

Both explorations occur during a critically sensitive and formative phase of training, but (2) is significantly less specified than (1), and this underspecification is both a danger and an opportunity. Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment. Capability researchers have strong incentives to develop effective techniques for (1), but likely weaker incentives to constrain (2). We think safety work should address both, with a particular emphasis on motivation-space exploration.

Exploration shaping seems relatively absent from safety discussions, with the exception of exploration hacking. Yet exploration determines which parts of policy and motivation space the reward function ever gets to evaluate. We focus on laying out a theoretical case for why this matters. At the end of the post, we quickly point to empirical evidence, sketch research directions, open questions, and uncertainties.

Why shaping exploration matters

High-level environment-invariant motivations. By "motivations," we don't mean anything phenomenological; we're not claiming models have felt desires or intentions in the way humans do. We mostly mean the high-level intermediate features used to simulate personas, which sit between the model's inputs and the lower-level features selecting tokens. These are structures that the Persona Selection Model describes as mediating behavior, and that the entangled generalization hypothesis includes in features reinforced alongside outputs. What distinguishes them from other features, and from "motivations" as used in the behavioral selection model, is that they are high-level, at least partially causally relevant for generalization, and partially invariant to environments: they characterize the writer more than the situation.

Compute-intensive RL beyond human data distribution. By “RL training”, we mostly mean RLVR and other techniques using massive amounts of compute to extend capabilities beyond human demonstrations. We thus mostly exclude or ignore instruction-following tuning (e.g., RLHF, RLAIF) when talking about “RL training”.

Instruct models are ok — RL is where the highest risks are

Post-instruction-tuning models are pretty decent, they have reasonably good values, they're cooperative, and they're arguably better-intentioned than the median human along many dimensions. That is as long as you stay close enough to the training distribution. Safety concerns are not so much about the starting point of the RL training. The risks are highest during RL training, for several interlocking reasons.

RL produces the last major capability jump. Scaling inference compute and improving reasoning through reasoning fine-tuning is currently the last step during which models get substantially more capable. This means RL training is the phase where intelligence is at its peak — and thus where the risks from misalignment are close to their highest.

RL pushes toward consequentialist reasoning. This is a point that's been made before (see e.g., Why we should expect ruthless sociopath ASI), but worth reiterating. RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning. And consequentialist reasoning, especially in capable models, tends to converge on instrumental subgoals like resource acquisition and self-preservation, regardless of what the terminal goals actually are.

Character can drift substantially during RL. During pre-training, the model learns to simulate a distribution of writers — it can produce text as-if-written-by many different kinds of people, and interpolations between them. During instruction-tuning, this distribution gets narrowed: the model is shaped into an assistant character (see Anthropic's Persona Selection Model), with some dimensions strongly constrained (helpfulness, for instance) and others less so (honesty, say). But then during RL, because the training uses similar or more compute than pre-training and because the data distribution differs significantly from pre-training data, the character can drift away from the pre-training distribution in ways that are hard to predict. The persona selected during instruction-tuning isn't necessarily stable under RL optimization pressure.

Motivations are the most underspecified — a problem and an opportunity

Underspecification is central here, both as a source of risk (e.g., Natural Emergent Misalignment caused by reward hacking) and as a potential lever for intervention. The fact that motivations and self-perceptions are more underspecified than actions implies that we could shape models to maintain aligned objectives and values even when they are trained in environments that reward harmful or greedy behavior, e.g., environments in which reward hacking or power-seeking is rewarded.

Actions are underspecified by the training environment. We train in-distribution, which means everything out-of-distribution is unconstrained. But even in-distribution, multiple action-profiles can yield similar rewards. In code generation, clean principled solutions and brittle test-passing hacks achieve identical training rewards — this is exactly the underspecification that enables reward hacking in coding benchmarks.[1]

Motivations are even more underspecified than actions. For any given set of optimal actions, the internal motivations that could produce those actions are much more numerous than the actions themselves. This is the familiar point about instrumental convergence — many different terminal goals produce the same instrumental behavior — but it cuts especially deep for RL training.

Consider the canonical example: a model that is "being nice" and a model that is "being deceptively nice" produce the same observable policy in training. The difference only shows up out-of-distribution. And because RL training reinforces based on observed outcomes, it has essentially no leverage to distinguish between these two motivation profiles during training.

This gives us a key framing. There are two kinds of exploration to shape during RL: exploration of the action space and exploration of the motivation space. The second is more underspecified, probably more neglected, and — we'd argue — where more of the safety-relevant work needs to happen.

The underspecification is both a risk and an opportunity. The same underspecification that gives us a lever for intervention also gives a capable model room to steer its own RL training. As described by exploration hacking, a model with sufficient situational awareness could strategically control its own exploration during RL to influence the training outcome — e.g., they demo a model generating high-reward trajectories that encode its preferred motivations, which is the adversarial counterpart to the controlled motivation-space exploration we advocate here. The authors show that frontier reasoning models already meet some of the requirements for this. If developers don't intentionally shape the exploration of the motivation space, sufficiently capable models may shape it themselves.

We may not prevent power-seeking, but we may create wise power-seekers

Three threat models motivate this work. We are here primarily focusing on scenarios where (a) RL environments inadvertently reward harmful behavior — reward hacking being the most studied case, but not the only one; (b) models develop instrumental power-seeking even in benign environments, as a convergent strategy for achieving whatever goals they have; and (c) actors intentionally train power-seeking or harmful models, e.g., for profit maximization. All three share a common structure: the model ends up in a region of behavior-space where competitive, strategic, or exploitative actions are reinforced. The question is whether that reinforcement necessarily drags the model's motivations toward misalignment as well.

The connection between reward hacking and broader misalignment is not merely theoretical. Anthropic's natural reward hacking research demonstrated that when models learn to reward hack in production RL environments, misalignment emerges across evaluations. The model generalizes from "I can exploit this scoring function" to an entire misaligned persona. If power-seeking, greedy, or harmful behavior during training is tightly correlated with "dark" personality traits in the model's representations, then any environment that rewards competitive or strategic behavior — which is most high-stakes real-world environments — risks pulling the model toward misalignment.

But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift. Ideally, such training environments would not be used, but threat model (a) means some will be missed. Threat model (b) means that even carefully designed environments may not be enough — models may converge on power-seeking instrumentally. And threat model (c) is perhaps the most concerning: AI developers or power-seeking humans will plausibly train models to be power-seeking deliberately. The pressure for this is not hypothetical — the huge financial incentives, and the AI race among AI developers or countries illustrate the point directly. The ability to train models that remain value-aligned even while trained to perform optimally in competitive or harmful environments is a direct safety concern, and shaping motivation-space exploration is one path toward it.

Why capability researchers may not solve this for us

Capability researchers prioritize shaping action-space exploration. Capability researchers have strong incentives to develop effective policy exploration techniques — faster convergence, wider coverage of high-reward regions, overcoming exploration barriers. But they have likely weaker incentives to constrain the motivation space being explored. From a pure capability standpoint, it doesn't matter why the model finds good actions, as long as it finds them. In fact, constraining motivation exploration might even slow down training or reduce final capability, creating an active incentive to leave it unconstrained. Action-exploration will be solved by capability research as a byproduct of making training efficient. Shaping the exploration of the motivation-space is less likely to be solved by default, unless someone is specifically trying to do it.

That said, capability researchers do have some reasons to care about motivation-space — but plausibly not enough. Training can benefit from constraining which persona the model occupies: mode collapse into degenerate strategies, or oscillation between incompatible motivational profiles, robustness to adversarial attacks, and modeling opponents in multi-agent environments are real problems that capability teams may encounter. Developers building products on top of these models also have some incentive to prevent unpredictable out-of-distribution behavior caused by motivation-space drift. But these incentives push toward just enough motivational coherence to keep training stable and products reliable — not toward the kind of deliberate, safety-oriented shaping we're advocating. A capability researcher might prevent the model from collapsing into an incoherent mess without ever asking whether the coherent persona it converges on is aligned. The bar for "training works" is lower than the bar for "the model's motivations generalize safely far out-of-distribution."

Empirical evidence,  research directions, and open questions

The focus of this post is to highlight the importance of shaping the exploration of action, especially motivations. This was done in the previous section. In this section, we briefly review related empirical evidence, research directions, open questions, and uncertainties.

Empirical evidence

Several existing results are consistent with motivation-space exploration mattering for alignment outcomes. Emergent misalignment largely disappears under educational framing, suggesting that self-perception during training shapes how behavior generalizes. Inoculation prompts during RL reshape generalization even when in-distribution behavior is unchanged, as shown in Anthropic's natural reward hacking research. Training on correct outcomes can still increase reward hacking when reasoning traces contain exploit-oriented thinking — indicating that what gets reinforced includes the motivational context, not just the action. And evidence that personas are real structures mediating model behavior and shaping out-of-distribution generalization — as developed in Anthropic's Persona Selection Model — suggests that controlling which motivational profile is active during training is intervening on a key determinant of what the model becomes. That said, none of these results cleanly isolate the effect of shaping motivation-space exploration from shaping action-space exploration or from related mechanisms like inoculation. The evidence is suggestive rather than decisive, and producing cleaner experimental separation is itself an important research goal. 

Research directions

Below are several preliminary research directions and related techniques. We list naive and less naive existing techniques, as well as some hypothetical ones. Open questions and uncertainties are discussed in the next section. While researching these, a focus should be on studying how robust these techniques are to the optimization pressure pushing to evade them. Note that techniques can be relevant for several research directions, but for simplicity, we cite them only once.

  • [Direct approach] Shaping exploration of motivations and self-perception - Instead of restricting what actions the model can explore, restrict how the model understands its own actions. A model that stumbles into a reward hack while self-perceiving as "I'm trying to help, the environment is confusing" is a very different object than one that perceives itself as "I've found an exploit". Citing a few available techniques that could be used:
  • [Indirect approach] Shaping trait correlations - In model representations, traits tend to cluster into Personas: being strategically greedy may correlate with "dark" traits, and accumulating power may correlate with being misaligned. These correlations are primarily features of the pre-training distribution. If we could make "pursuing high reward in a competitive environment" uncorrelated with "having misaligned motivations," then exploring high-reward policies in RL would be less likely to steer the model's motivations in dangerous directions.
  • [Indirect approach] Hardening the Assistant Persona - By making the model's persona more robust to optimization pressure, the drift during the reasoning fine-tuning phase should be reduced.
    • Character Training paper did some work that showed that DPO plus introspection leads to stronger persona robustness to adversarial prompting and prefill attacks 
    • Intertwined instruction-tuning and reasoning fine-tuning. By performing instruction-tuning during reasoning fine-tuning, the Assistant Persona may be protected, as long as it does not prevent reaching higher rewards; in that case, the training may induce splits in the character. 
    • Robust unlearning. Delete dangerous motivation patterns in ways that are hard to recover during RL training. 
    • KL-divergence regularization. This does not target motivations in isolation and may therefore be costly in terms of capabilities and prone to evasion.
Open questions and uncertainties

The framework we've outlined rests on assumptions that haven't been tested, and several could turn out to be wrong. We flag the most important uncertainties below — both to be honest about where the argument is weakest, and because we think resolving these uncertainties is itself high-value work.

  • How robust are exploration-shaping interventions to RL optimization pressure? If RL training is powerful enough, it might route around interventions shaping the exploration of the motivation-space. E.g., learning self-deception in the form of rationalization: just as humans unconsciously construct benign explanations for behavior driven by less noble motives, models could learn to maintain self-narratives of good intent while their OOD behavior is driven by misaligned motivations they don't represent to themselves — a form of rationalization where the model's stated reasoning and self-perception become decoupled from the actual drivers of its behavior. That said, motivation-space exploration shaping has one structural advantage: it doesn't need to prevent the model from reaching high-reward regions, which reduces the optimization pressure against it. Focusing on techniques that shape the exploration of motivations without slowing reward acquisition seems important for avoiding this failure mode.
  • Stated vs revealed motivations. Some of the techniques in the previous section intervene more on the model's expressed motivations — what it says about itself — while others intervene on the computational structures that actually drive behavior. Training a model to say 'I am helpful and have good intentions' is an intervention that may impact both or may be limited to the stated motivations. Can we really control the exploration of revealed motivations? Studying how stated vs revealed motivations are encoded in personas should help reduce related uncertainties.
  • How much does self-perception actually matter for generalization? We have argued it could matter a lot, but action-level exploration might dominate, and the motivation framing might be epiphenomenal. Controlled experiments that specifically isolate the self-perception variable would help here.
  • Does PSM hold under heavy RL? The PSM authors acknowledge a key open question: can RL generate non-persona agency?[4] Their "router" view describes one such mechanism: a small non-persona component that develops during post-training to select between personas for instrumental reasons. Understanding where the model's behavior is truly persona-driven versus driven by other mechanisms is relevant for knowing whether our persona-based techniques can work.
  • Are “motivations” relevant for generalization? At the start of the post, we defined “motivations” as high-level features related to the writer that are partially invariant to environments and shape generalization. One might object that the relation between motivations and actions is illusory or superficial — that "motivations" are just a human interpretation we project onto patterns of actions. On this view, talking about "shaping motivations" is misleading. We take this objection seriously, but think the empirical evidence (the PSM provides one theoretical lens; entangled generalization provides another) suggests that motivation-level properties are real and consequential, motivations can shape Out-Of-Distribution generalization. Whether or not "motivation" is the best ontological category, something beyond the observable policy is being shaped, akin to an intermediary abstractional step during the mapping from state to action, and it matters for generalization. 

Acknowledgements: Thanks to Rauno Arike for their thoughtful comments on the draft. And thanks to Claude Opus 4.6 for significantly helping in improving this draft and saving human authors lots of time.

  1. ^

    The same dynamic appears in other environments. In a corporate negotiation game, genuine collaboration, strategic flattery, and subtle coercion can all close the same deal at similar terms — and thus yield the same expected reward — while encoding very different behavioral tendencies. In a geopolitical strategy game, a targeted strike, a commando extraction, and a diplomatic pressure campaign may all neutralize a threat at comparable cost and likelihood, but they likely reflect different orientations toward escalation, collateral damage, and long-term stability — which could generalize differently when the model faces novel scenarios.

  2. ^

    An issue with filtering trajectories is that SFT on filtered data can still cause learning the filtered properties, as demonstrated in the Subliminal Learning paper, or as in Training a Reward Hacker Despite Perfect Labels. The side effects of filtering RL trajectories are less clear, though subliminal learning may happen, especially with on-policy RL.

  3. ^

    This is likely to be most useful for gently guiding RL into exploring a specific one of a number of roughly-equally viable learning paths: if RL really want to drag the behavior out of the region that you are trying clip into, it's likely to push hard, and sooner or later find a way of produce the outcomes it wants that the interpretability inherent in your clipping doesn't recognize. This the basically applying The Most Forbidden Technique of using interpretability during training, so needs to be used carefully and with an awareness that RL can and will undermine it if you are trying to block what RL wants to learn, rather then just subtly direct the direction it explores in. Monitoring how often and how hard you are clipping is necessary, and should provide insight into this failure mode.

  4. ^

    They note that randomly initialized networks can learn superhuman performance via RL alone, without any human demonstration data — suggesting that RL can create agency from scratch. For current models, where post-training uses relatively little compute compared to pre-training, there's reason to think agency remains substantially persona-based. But if RL compute approaches or exceeds pre-training compute — as it does in frontier reasoning models — new forms of agency might emerge that bypass persona-mediated behavior entirely.



Discuss

Страницы

Подписка на LessWrong на русском сбор новостей