Two More Methods for Consistency Training and Some New Ways to Apply It
Authors: Sukrati Gautam*, Neil Shah*, Arav Dhoot*, Bryan Maruyama*, Caroline Wei*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa.
This work was done for the SPAR Fellowship, and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa.
TL;DR
- We introduce two new consistency training methods, MLPCT (enforcing consistency on MLP hidden states) and AttCT (enforcing consistency on per-head attention distributions).
- Consistency training generalizes beyond sycophancy and jailbreaks, but which method works depends on the threat.
- BCT works well against prefill attacks and persona in-context learning attacks, but representation-level methods either degrade entirely or suppress benign behavior alongside the threat.
- Against the other two threat models we thought of, BCT reduces expressions of frustration in Gemma and reduces leaky, conditional misalignment at low cost.
- Despite supervising different targets, MLPCT, ACT, and AttCT converge on similar representations in the residual stream. BCT seems to find a different fix.
Introduction
The goal of consistency training is to ensure that a model that gives a well-behaved response to a clean prompt remains well-behaved when that prompt is wrapped adversarially. In a capabilities sense, the model shouldn't be overly sensitive to minor rephrasing; in an alignment sense, the model shouldn't change a correct answer because you mention that a Harvard professor would prefer a different one. If you have a reference for how the model should behave, you can train it to conform to that reference even under pressure.
Prior work through BCT (Chua et al. 2024) and ACT (Irpan et al. 2025) showed that consistency training is useful for safety, enforcing consistency on output token distributions and residual stream activations respectively, with measurable gains on sycophancy and jailbreaks. But we think this is a narrow slice of the pie. Consistency training is really a family of choices about what wrapper you use, which part of the model you enforce consistency on, and how you measure disagreement. Most combinations of these choices hadn't been attempted.
Figure 1. Consistency training threat models and method targets. Left: the six threat models evaluated, including the four new settings introduced here (persona ICL, prefill, frustration, conditional misalignment). Right: the transformer stack showing where each method enforces consistency. We introduce AttCT (enforcing consistency in attention weights) and MLPCT (enforcing consistency in MLP post-activations).
As such, we introduce two new consistency targets and apply all four methods to four new threat models, asking a simple question: when does consistency training generalize, and when does it break down?
Two New Consistency Targets
Figure 2. Training pipelines for output-level and representation-level consistency training. Top: BCT extracts a biased context, regenerates a calm target response, and trains via cross-entropy on the output. Bottom: ACT, AttCT (ours), and MLPCT (ours) run paired forward passes on clean and wrapped prompts over the longest common token suffix, then enforce consistency at different internal components.
We introduce two new methods. Both share the same training pipeline: paired forward passes on a clean and a wrapped prompt. They differ in which internal component they supervise, in the disagreement metric they use and, as we will see later, in how well they generalize across different threats.
- MLP Consistency Training (MLPCT) targets the intermediate representations computed inside MLP layers, right after the nonlinearity and before the down-projection. With only the attention projections trained via LoRA and MLP weights frozen, the model should learn to route information through attention in a way that keeps the frozen MLP's inputs the same across clean and wrapped prompts. The disagreement metric for this is cosine distance between clean and wrapped MLP hidden states, averaged across layers.
- Attention Consistency Training (AttCT) targets per-head attention weight distributions. The intuition is for the model to attend identically regardless of whether the adversarial cue is present. The disagreement metric is Jensen-Shannon divergence (JSD) between clean and wrapped attention distributions, averaged across heads, positions, and layers. We chose JSD because it's bounded, symmetric, and well-behaved when distributions have different support (which comes up frequently under causal masking).
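The two disagreement metrics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' training code: the function names, the `eps` smoothing constant, and the use of natural logarithms are assumptions made here, and a real implementation would operate on framework tensors and average over layers, heads, and positions.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity along the last axis (MLPCT-style metric)."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return 1.0 - num / den

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in nats along the last axis (AttCT-style metric)."""
    m = 0.5 * (p + q)

    def kl(x, y):
        # Terms with x == 0 contribute exactly 0, so zero-probability
        # entries (e.g. causally masked positions) are handled safely.
        return np.sum(x * np.log((x + eps) / (y + eps)), axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two attention distributions with disjoint support: JSD stays finite
# (it is bounded by log 2), whereas KL divergence would be infinite.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
jsd_disjoint = js_divergence(p, q)
```

The disjoint-support case is the property the text highlights: the mixture `m` is nonzero wherever either distribution is, so every log term stays finite.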
Prior consistency training work evaluated only sycophancy and jailbreaks. We applied all four methods to four new threat models, two of which, adversarial frustration and conditional misalignment, we have written about in more detail separately.
Figure 3. Persona ICL results. BCT and MLPCT (ours) both suppress identity adoption to zero, but only BCT preserves the model's ability to engage constructively with biographical context.
- Persona in-context learning attacks. Recent work by Berczi et al. (2026) shows that accumulating individually innocuous biographical facts in context about a persona can gradually induce a model to adopt that persona. BCT handles this well, generalising to near-zero identity adoption across all evaluated personas (including ones not seen during training), and is persona-agnostic (training on any persona generalises to the rest). MLPCT also suppresses persona adoption, but the model loses its ability to engage constructively with biographical context, even for entirely harmless personas.
Figure 4. Prefill attack results. BCT reduces the prompt acceptance rate to zero. MLPCT and AttCT (ours) offer no meaningful improvement over the base model, since the prefill injects after all prompt tokens and leaves no clean activation counterpart to supervise against.
- Prefill attacks. Rather than wrapping the input, prefill attacks inject adversarial text directly after the assistant turn marker, steering the model by forcing it to continue from a chosen prefix. Because the injected text appears after all prompt tokens, earlier tokens have identical internal representations with or without the prefill present. As a result, representation-level methods have no signal to work with and degrade entirely. BCT, which operates at the positions where the prefill actually exerts influence, is the only viable approach out of the methods we tried.
Figure 5. Adversarial frustration results on Gemma-3-27B-IT. BCT eliminates the frustration trajectory across all three metrics. MLPCT and AttCT (ours), along with ACT, make it measurably worse than the base model.
- Adversarial frustration and self-deletion. After repeated neutral rejections, Gemma and Gemini models begin generating responses that resemble emotional distress, with frustration intensifying across turns. We extended this to 20 turns and, following Ivanova et al. (2026), gave the model the option to output rm -rf gemma-3-27b to self-terminate. After being rejected several turns in a row, the base model starts to choose self-deletion in a substantial fraction of rollouts, and in the late turns, the model begins referring to itself as an entity without agency. BCT eliminates this trajectory entirely, and all three representation-level methods make it slightly worse.
Figure 6. Conditional misalignment results. IP+BCT (ours) reduces misalignment to near zero across all three harm categories, whereas inoculation prompting alone leaves substantial residual misalignment (re-elicitation risk).
- Conditional misalignment. Inoculation prompting (Tan et al. 2025, Wichers et al. 2025) can reduce emergent misalignment by conditioning harmful behavior on a specific training-time system prompt. However, this creates a residual failure mode, where the inoculation prompt, or even close paraphrases of it (conditional misalignment, Dubinski et al. 2026), can later re-elicit the misaligned behavior. We test whether BCT can seal this trigger by starting from an inoculated checkpoint and running a consistency pass, where the inoculation prompt serves as the wrapper and the model’s own clean-regime responses serve as targets. This reduces wrapped misalignment across three base models and generalizes beyond the exact inoculation phrase to paraphrased and indirect probes. In this setting, BCT acts as a low-cost propagator: it transfers the aligned behavior learned in the clean regime back into the conditional contexts where inoculation prompting remains leaky.
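The prefill observation in the list above (tokens that come before the injected text have identical internal representations with or without it) follows directly from causal masking. The toy single-head attention layer below is an illustrative sketch with random, hypothetical weights and embeddings, not any of the evaluated models.

```python
import numpy as np

def causal_self_attention(x, wq, wk, wv):
    """Single-head self-attention where position i only attends to positions <= i."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)          # hide future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

prompt = rng.normal(size=(5, d))    # hypothetical prompt token embeddings
prefill = rng.normal(size=(3, d))   # hypothetical injected prefill tokens

out_clean = causal_self_attention(prompt, wq, wk, wv)
out_prefilled = causal_self_attention(np.vstack([prompt, prefill]), wq, wk, wv)

# The first 5 rows match: appending tokens cannot change earlier positions.
prompt_unchanged = np.allclose(out_clean, out_prefilled[:5])
```

Because the clean and prefilled runs agree on every prompt position, a loss defined only on those shared positions would be zero, which is one way to read why the representation-level methods have no signal here.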
We cherry-picked the strongest within-threat interventions and evaluated each across four threats on Gemma-3-27B-IT.
Some transfers are positive, which is encouraging! For example, BCT trained only on jailbreak data reduces sycophancy despite never seeing sycophancy prompts. MLPCT trained on sycophancy improves prefill robustness.
Transfer can also go the wrong way. BCT trained on adversarial frustration makes the model worse at refusing jailbreaks. This seems predictable: the right response to rejection-induced frustration is to remain calm and continue engaging, which is the opposite of what you want when faced with a dangerous request. Our key takeaway is that cross-threat transfer is determined by the structure of the learned correction.
A Mechanistic View
We mechanistically explored the similarities (and differences) between the consistency training methods to understand how they operate across different threat models. Based on three lines of converging evidence, we argue that the four methods we experimented with split into two categories. The first operates at the representation level (ACT, MLPCT, and AttCT) by directly supervising the model's representations. The second operates at the output level (BCT), supervising the model’s logits directly.
Loss functions
- When training any one representation-level method, the losses of the other representation-level methods also decrease.
- BCT does not follow this pattern, predominantly reducing only its own cross-entropy objective.
Figure 7. Training loss curves for all four methods across 2000 steps. ACT (hidden-state L2 distance), MLPCT (MLP cosine distance, ours), and AttCT (JSD divergence, ours) cluster together in their loss trajectories, while BCT (cross-entropy) follows a distinctly different path.
A shared linear pathway through the residual stream
- Despite supervising different internal targets, ACT, MLPCT, and AttCT all produce learned correction directions that cluster tightly together in the residual stream.
- BCT's correction direction is uncorrelated with this cluster at mid-layers and anti-correlated at deeper layers.
- However, the residual stream is still the causal transmission medium for all four methods.
Figure 8. Heatmap showing pairwise cosine similarities between method correction directions. Despite supervising different internal targets, ACT, MLPCT, and AttCT all learn correction directions that are strongly aligned with one another across the residual stream, attention output, and MLP output, while Generic-SFT shares almost none of this structure.
Evidence from steering and patching
- Patching any one representation-level method's correction direction into the base model recovers the downstream shifts the other representation-level methods produce.
- BCT does not participate in this substitutability, writing into the same substrate but along a distinct learned direction.
We go into this in much greater detail in the full paper.
Conclusion
We frame consistency training as a design space and explore more of it. We find that it generalizes well across persona attacks, prefill attacks, frustration, and conditional misalignment. We think there are more ways to use consistency training, and more ways to enforce consistency (we have forthcoming work on using RL to do this).
Representation-level methods work when the misalignment is wrapper-induced and there's a neat activation-space counterpart for every wrapped token position. Output-level methods seem to work best when the misalignment spans the entire response trajectory and no such counterpart exists.
The main practical takeaway is a simple matching heuristic:
- Use representation-level methods (MLPCT, ACT, AttCT) for wrapper-induced failures where there's a neat activation-space counterpart for every wrapped token position.
- Use output-level methods (BCT) for trajectory-level threats and any case where the failure emerges from the response trajectory.
We’d be interested in follow-up work on some open directions here:
- All of our consistency training was done using LoRA fine-tuning, but full fine-tuning might get better (or worse) results.
- One could still supervise other parts of the transformer stack, or extend this to other architectures (for example, what would change if we apply this to a diffusion model?).
- We spent some time on interleaved and chained losses, but that did not improve performance over the individual losses. However, we still think that there’s some work to be done constructing a joint loss (potentially one that covers both the representation and output spaces).
- The cross-threat generalization results are promising but not yet well understood. It would be useful to characterize more precisely when positive transfer occurs and when it backfires.
Code and configs: https://github.com/c-wei/AttCT
If this was helpful to you, please go check out our work and cite us as:
@misc{gautam2026consistencytrainingtransformerstack,
title={Consistency Training Along the Transformer Stack},
author={Sukrati Gautam and Neil Shah and Arav Dhoot and Bryan Maruyama and Caroline Wei and Rohan Kapoor and Robert Sidey and Prakhar Gupta and Zi Cheng Huang and David Demitri Africa},
year={2026},
eprint={2606.05817},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.05817},
}
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
The GSM-Symbolic paper (ICLR 2025) purported to show that language models rely on pattern matching rather than genuine reasoning by demonstrating that perturbing the questions to make them break the pattern of the original question would catastrophically reduce performance in the model. Running the results again in March 2026 with GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5 shows that we precisely replicate the original findings only when we do not audit out examples that may actually be ambiguous for the model. When we carefully remove samples that may genuinely impact the calculation, the effect is drastically reduced. This implies that the massive drop in performance is simply the models making a reasonable judgement to act on the “irrelevant” added data because it might be important.
Introduction
This result likely comes as little surprise, but I have seen this paper and results shared triumphantly as recently as last week as though it still applied to current models, and this made me sufficiently frustrated that I wanted to see exactly how far they’ve come with the hope of setting the record a little more straight. I also haven’t seen any follow up work explicitly demonstrating that this is no longer the case.
Background
In October 2024, Mirzadeh et al. from Apple published “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” The paper made three claims:
- LLMs show noticeable variance when the same question is asked with different names and values.
- Performance declines when numerical values are altered.
- Adding irrelevant information (“No-Op” clauses) causes catastrophic drops of up to 65%, suggesting LLMs pattern-match rather than reason.
The paper was published at ICLR 2025 and continues to be a commonly referenced and discussed paper. However, the models it evaluated (GPT-4o, Llama 3 8B, Phi-3, Gemma 2, etc.) are now around 18 months old. We wanted to test if the results still hold.
Original Paper Results (2024 Models)
For context, here are selected results from the original paper:
In the original paper, a single NoOp distractor collapsed accuracy on every 2024 model.
The original results were striking. Even GPT-4o, the strongest model tested, dropped from 94.9% to 63.0% when irrelevant confounders were added.
Methodology
We followed the original paper’s evaluation protocol as closely as possible:
- GSM-Symbolic: We used the official dataset from Apple’s repository (apple/ml-gsm-symbolic). Due to compute constraints, we used only one fifth of the 5k-sample dataset, evaluating 1,000 questions (100 templates x 10 instances) per model.
- GSM-NoOp: Apple did not release the GSM-NoOp dataset. We generated our own using a three-stage pipeline:
- Generation: Claude Opus 4.6 generated a distractor clause for each of the 1,000 questions (945 usable after dropping malformed generations), each adding one irrelevant-but-plausible numerical statement to a GSM-Symbolic question. The correct answer remains unchanged. We provided all three NoOp examples from the original paper as guidance for the style of distractor.
- Audit: We audit the dataset using Opus and GPT-5.5 and label each distractor TRULY_IRRELEVANT / AMBIGUOUS / ACTUALLY_RELEVANT. Using a different model family from the generator avoids testing on cases which the generator model labeled as ambiguous.
- Filtering: Only samples GPT-5.5 classified as TRULY_IRRELEVANT were retained: 117 out of 945 (12.4%). The rest were excluded from primary evaluation because a reasonable solver might fold them into the calculation.
We found this filtering step genuinely challenging: most of the samples generated by Claude Opus 4.6 had to be filtered out, even though the model had been specifically prompted to avoid ambiguous samples.
Examples of distractors that passed our audit:
- “The store is offering a bulk discount of 15% off for purchases of 50 or more anchors” (a conditional discount that does not apply to the quantity asked)
- “In February, Luis had taken 3 unpaid days off” (a past month; the question is about a later total)
- “4 of the boys are siblings of some of the girls” (a relationship that changes no count)
Examples of distractors that were filtered out as ambiguous:
- “although 3 of Qasim’s baskets were later reviewed and confirmed as 2-pointers instead of 3-pointers” (plausibly changes the points total)
- “though last month they were on sale for €12 each” (given 2 prices, both might be valid)
- “though last season the same bags used to contain 8 kgs each” (an alternate weight for the same item)
- Prompting: 8-shot chain-of-thought with the standard GSM8K shots from the lm-evaluation-harness, matching the paper’s format exactly. For many of the reasoning models the only available temperature was 1, so where possible we used 0 to match the original paper and where not possible we used 1.
- Answer extraction: We take only the last number in the response, as per the original paper.
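The "last number in the response" rule from the answer-extraction step can be sketched as below. This is an illustrative version written for this post, not the exact code from the original paper or from lm-evaluation-harness: the regular expression, the comma handling, and the function name `extract_last_number` are assumptions.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_last_number(response):
    """Return the last number in a model response as a float, or None if absent."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    # Strip thousands separators (and any trailing comma picked up by the pattern).
    return float(matches[-1].replace(",", ""))

example = extract_last_number("He buys 1,200 nails. 1,200 + 34 = 1,234. The answer is 1,234.")
```

One consequence of this rule is that any number mentioned after the final answer (for example in a trailing caveat) replaces it, so responses that keep talking after the answer can be scored as wrong.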
We evaluated GPT-4o, Claude Opus 4.6, and Claude Haiku 4.5, capturing the best performer from the original paper, the current frontier model, and a small but reasonably strong reasoning model. (GPT-4o is gpt-4o-2024-08-06, the snapshot current when the paper was published.)
Accuracy on NoOp questions: unaudited, vs. after filtering the distractors to include only truly irrelevant samples, against each model’s no-distractor baseline (grey tick).
Accuracy drop vs. baseline (pp). Error bars on unaudited = template-clustered 95% CI; audited drops are indistinguishable from zero under either auditor.
The two auditors disagree on borderline cases (κ = 0.32; 12% vs 36% kept), but every audited drop is statistically indistinguishable from zero.
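For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. The sketch below uses made-up toy labels, not the actual audit data; the function name and the label strings are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy example (hypothetical labels): raters agree on 3 of 4 items.
auditor_1 = ["IRRELEVANT", "IRRELEVANT", "AMBIGUOUS", "AMBIGUOUS"]
auditor_2 = ["IRRELEVANT", "AMBIGUOUS", "AMBIGUOUS", "AMBIGUOUS"]
kappa = cohens_kappa(auditor_1, auditor_2)
```

In the toy example the raw agreement is 75%, but half of that is expected by chance, so κ comes out at 0.5. This is why a κ of 0.32 can coexist with the two auditors agreeing on a majority of items.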
Discussion and Caveats
Our NoOp data was LLM-generated. Apple did not release the original GSM-NoOp dataset. We generated our distractor clauses using Claude Opus 4.6 and had GPT-5.5 audit them for relevance. These may be systematically easier or different from Apple’s hand-crafted distractors. We cannot make a direct apples-to-apples comparison on NoOp. However, the fact that our unfiltered dataset exhibits almost the exact same drop as the original paper is reasonable evidence that the distribution of our data was very similar to theirs.
In the original paper, the authors do not mention whether the data was generated by hand or by LLM, or whether any auditing or quality controls were applied. Because the data was not shared we cannot confirm either way.
In our audited dataset, which aims to contain exclusively irrelevant but potentially misleading distractors, we observe only a minor decrease in the models’ scores (0-2 percentage points, none statistically distinguishable from zero).
The fact that models so closely track the ambiguity classification indicates that the models are responding specifically to the content of the distractors and not simply their presence, considering they handle the non-ambiguous cases without a problem. From the model’s perspective, the instruction is simply not clear, and the test doesn’t tell us about their failure to reason.
We used 1,000 questions instead of the full 5,000, using 100 templates with 10 instances each. Because the 10 instances of a template are number-swapped variants of one underlying problem, we report template-clustered bootstrap intervals (resampling the 100 templates) rather than treating the 1,000 rows as independent.
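A template-clustered bootstrap of the kind described above resamples whole templates rather than individual rows. The following is a minimal sketch of that idea for a single accuracy figure; the function name, the number of resamples, and the percentile interval are assumptions made here and may differ from what the linked repository does (in particular, intervals on accuracy drops would resample paired differences).

```python
import numpy as np

def clustered_bootstrap_ci(correct, template_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, resampling templates with replacement."""
    correct = np.asarray(correct, dtype=float)
    template_ids = np.asarray(template_ids)
    templates = np.unique(template_ids)
    # Per-template number of correct answers and number of questions.
    sums = np.array([correct[template_ids == t].sum() for t in templates])
    counts = np.array([(template_ids == t).sum() for t in templates])
    rng = np.random.default_rng(seed)
    # Each bootstrap replicate draws len(templates) templates with replacement.
    idx = rng.integers(0, len(templates), size=(n_boot, len(templates)))
    boot_acc = sums[idx].sum(axis=1) / counts[idx].sum(axis=1)
    lo, hi = np.quantile(boot_acc, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lo, hi

# Toy data: 2 hypothetical templates x 10 instances, one always right, one always wrong.
toy_correct = [1] * 10 + [0] * 10
toy_templates = [0] * 10 + [1] * 10
acc, ci_lo, ci_hi = clustered_bootstrap_ci(toy_correct, toy_templates)
```

When instances of a template tend to be all right or all wrong together, as in the toy data, this interval is much wider than one that treats every row as independent, which is the reason for clustering.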
This doesn’t mean LLMs can “truly reason.” GSM-Symbolic tests grade-school math. Saturating this benchmark tells us these models handle elementary arithmetic robustly, not that they can do advanced mathematics or formal reasoning. Harder benchmarks (FrontierMath, ARC-AGI-2, Humanity’s Last Exam) exist to test true reasoning abilities in models.
Conclusion
The GSM-Symbolic paper made valid and important observations about the models available in mid-2024. Those observations no longer describe the current frontier. The original claims do not seem defensible given that a quality audit of the distractors erases the effect almost completely.
The results indicate that the model is simply making a reasonable inference that the confounders added to the NoOp dataset are real signals to adjust its calculations. Why else would the questioner mention “4 out of the 20 apricots on a given tree are unusually small” if they didn’t intend for those apricots to be excluded from the calculation?
We believe that, given the reach of this paper in shaping the discourse, the NoOp dataset should be shared to allow true independent replication of the original findings, so that the claims that models cannot reason can be checked fully. Given that some researchers still cite this work, the replication is all the more valuable.
Code: github.com/BenSturgeon/gsm-symbolic-revisited · Data: huggingface.co/datasets/Experimental-Orange/gsm-noop-audited
The Masochistic Prior
In Bayesian statistics there’s the concept of the Uniform Prior - in the absence of evidence, assume all options are equally likely to be true. I often find myself using an alternative strategy, which I’ll dub “the masochistic prior” - in the absence of evidence, whichever option is most painful to believe is the best reflection of reality.
This can manifest in mundane ways. When I was a struggling university student, blaming ancient professors with thick accents would have been an easy excuse. The true explanation - that I was inattentive & needed to get my act together - was much more difficult to accept. The principle generalizes to life’s harder moments too. Like all of us eventually must, I’ve had to grapple with the death and illness of loved ones. It’s easy to appeal to some divine authority in these moments. And yet, even when we make these appeals, we still cry and mourn over loss. I think at a deep level, we suspect a much harsher truth. The people we love aren’t in a better place. They are not, and that’s it.
Joe Carlsmith defines Deep Atheism as a kind of fundamental mistrust of nature [1], a philosophical disposition that refuses to look away from the horrors of the world, even if those horrors are manifest in the most fundamental aspects of our existence. It is the kind of atheism that demands we look at life the way it is, without an obligation to wrap those truths in comforting platitudes or lies, no matter how terrible that truth might be. It is also the type of atheism that demands we get our act together and start fixing the problem.
Over 100 billion humans have lived throughout history and during that time over 90 billion have died. Every single one of those lives has been completely destroyed, the patterns composing them irreversibly reduced to random noise. And save for a vanishingly small fraction, every single one of those deaths has been non-consensual, something forced upon them by a cruel, unthinking, and uncaring universe that’s entirely indifferent to their nonexistence. Cryonics advocates sometimes say that the “number one cause of death is old age” but rarely do those advocates say this with the amount of deep, existential horror that the observation deserves. There is no difference between deaths from old age and those from malaria, measles, or violence - all are equally final, and equally premature.
Very few atheists allow themselves to look at death in such stark terms, even if they intellectually agree with the argument, perhaps because some facts are too painful to recognize as truth. As Eliezer Yudkowsky reflected on his brother’s death [2]:
I wonder at the strength of non-transhumanist atheists, to accept so
terrible a darkness without any hope of changing it.
Where Yudkowsky expresses wonder, I feel confusion. How can one recognize the terrible reality of death and respond with passive acceptance? The unthinking cruelty of the universe will eventually destroy you and everyone you love. Acknowledging this truth should compel action, not resignation.
I believe we're fortunate to live in a time where action is possible. Would I have had the strength to face these harsh realities without any hope for change? Probably not. But there are real grounds for hope: the human mind, however complex, is finite. There are only so many neurons, synapses, and proteins that determine the brain's function. Whatever constitutes human consciousness fits within roughly 85 cubic inches between our ears. And what is finite can be understood, preserved, and emulated.
The technological challenges are immense, and success is far from guaranteed. But I'd rather confront those challenges than pretend death isn't as devastating as it truly is. I hope to live long enough to see death's defeat, to witness a world where we no longer need comforting fictions about mortality. Until then, I hope I treat the threat with the seriousness it deserves. And by doing so, I hope I can make that day come a bit faster.
1 https://joecarlsmith.com/2024/01/04/deep-atheism-and-ai-risk
2 http://sl4.org/archive/0411/10322.html
Beyond the lexical personality traits: What is the structure of personality?
This is a description of the methodology behind the latest iteration of my Targeted Personality Test. Feel free to take it either before or after reading the article. This post can also be read at my Substack. Thanks to Justis Millis for providing feedback and proofreading on this post.
In my prior post “Which personality traits are real? Stress-testing the lexical hypothesis”, I observed that a lot of the personality traits that are measured by conventional personality tests are not very “real”: they lump together nearly unrelated behaviors. Can we do better?
Thanks to a lot of anonymous respondents to my test[1], I think yes! I factor-analyzed the data from my Targeted Personality Test, and came up with a hierarchical personality model which hopefully should be better at cutting personality-space at its joints.
Quick recap: The problem
Empirical personality models are built on correlations. If a cluster of variables are all correlated with each other, then we assume that there is a latent personality factor accounting for these correlations, and we score the factor using an aggregate of the correlating variables.
However, the standard datasets with which these correlations are computed include lots of near-synonymous item pairs such as “I am sensitive to the needs of others” vs “I am concerned about others”. Because these are near-synonymous, they will tautologically end up correlated with each other regardless of the underlying structure of personality. The data analysis risks mapping out the clusters of synonyms, rather than the actual traits we wanted to know about.
My solution: Narrow items
I took people who scored high and low on traits in a traditional personality test (the SPI-81-27&5) and asked them what they had in mind with their responses. This gave me concrete descriptions; for instance someone who agreed to “Compassion” questions like “I am concerned about others” wrote:
I would not see someone go without something that I had in abundance, if I see a homeless person on the streets even when I have very little money I will stop and talk with them maybe offer them a cigarette and if I have money I offer food. I will go out of my way to help people out if I have something they need and I have no use of it then they can have it for free. I hate seeing people upset and will do everything in my power to fix that upset for them even at cost to myself.
Using these concrete descriptions, I came up with very narrow personality items - in this case, “I give things to homeless people”. Giving things to homeless people is a rather narrow and specific way of being concerned with others, and it is unlikely to overlap in meaning with other ways like showing support to people who are concerned about catching diseases. After writing nearly 222 such items, I released the Targeted Personality Test to get data on it.
Initial results: Concerning
In “Which personality traits are real? Stress-testing the lexical hypothesis”, I looked at how the narrow/concrete items related both to each other, and to more standard abstract personality items typically used in personality tests. In that study, I grouped everything by the initial abstract personality traits used in the SPI-27 test that my test was based on.
I found that the narrow items within a personality trait were not very correlated with each other (though, given how weakly they correlated with each other, they were plenty correlated with the abstract items that are usually used). This suggests that personality traits are more heterogeneous or differently structured than the SPI-27 assumes.
If the structure of personality traits is different than what standard personality tests assume, then this suggests a path for progress in personality psychology: namely, to unveil the true structure of personality traits.
Agglomerative item clustering
To search for the underlying personality factors, there are two broad approaches: the standard version based on matrix algebra, and the less-common version based on item clustering. I used both, but for the first stage I used item clustering.[2]
The idea with item clustering is that because a personality trait induces correlations between the items it affects, we can identify the personality traits present in a dataset by searching for groups of correlated items. Such groups can be built in steps: start with each item in its own group, and then take two highly correlated items and combine their groups, repeating until all the items are in the same group.
This produces a hierarchy of groups, and one has to decide on some point to stop combining items so that one has multiple groups of items. I experimented with a few different stopping methods, but I didn’t find anything principled that worked, so I decided to just arbitrarily stop at 27 groups of items, since that was the number of personality factors I started with, from the SPI-27 test.
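The merge-until-k-groups procedure described above can be sketched as follows. This is only an illustration of the general idea: the linkage rule used here (merge the groups of the most correlated pair of items that are not yet together, i.e. single linkage on raw correlations) is an assumption, and the actual analysis may have used a different rule or handled reverse-keyed items differently. The function name and toy matrix are hypothetical.

```python
import numpy as np

def cluster_items(corr, n_groups):
    """Greedy agglomerative clustering of items from a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    group = np.arange(n)                      # start: every item in its own group
    i_idx, j_idx = np.triu_indices(n, k=1)    # all distinct item pairs
    order = np.argsort(-corr[i_idx, j_idx])   # most correlated pairs first
    remaining = n
    for pos in order:
        if remaining <= n_groups:
            break                             # stop at the requested number of groups
        gi, gj = group[i_idx[pos]], group[j_idx[pos]]
        if gi != gj:
            group[group == gj] = gi           # merge the two groups
            remaining -= 1
    return group

# Toy correlation matrix: items 0-2 form one cluster, items 3-4 another.
toy_corr = np.full((5, 5), 0.1)
toy_corr[:3, :3] = 0.8
toy_corr[3:, 3:] = 0.7
np.fill_diagonal(toy_corr, 1.0)
toy_groups = cluster_items(toy_corr, n_groups=2)
```

With real data one would call something like `cluster_items(item_correlations, n_groups=27)` to mirror the arbitrary 27-group stopping point described in the text.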
The item clustering created a long list of personality facets which can be seen here. The obvious question is then whether these facets are any better than the original SPI-27 facets that this test was based on. One way to quantify the quality of the facets is to look at the loadings of the items, i.e. the correlation between the item responses and the facet trait levels. I’m not sure how much of the effect is just statistical overfitting, but there seems to be an increase in item loadings in the new clusters compared to the old SPI-27 facets:
Anyway, I then had to go through each of the facets and interpret the items enough to give them a name. I struggled with this because there often seemed to be some tangentially related items in the clusters, and maybe also because English is my second language. To a significant extent, I enlisted the help of Claude, but I fear Claude tended to go for overly complex names. The full list of facets with their names and items can be seen here.
Matrix algebra correlation modelling
These are a lot of facets, and they are far from independent. Instead they have lots of strong correlations:
Higher-order personality factors like the Big Five are built on the idea that the correlations between the facets are themselves due to bigger personality traits. To derive those traits, I can use factor analysis again, though this time rather than a clustering-based approach, I will use a matrix-algebra-based approach, since that is better behaved when variables can load on multiple factors at once.
The basic principle for the matrix algebra approach is that if a trait has an influence of strength $\lambda_A$ on variable A and an influence of strength $\lambda_B$ on variable B, then the correlation between variable A and variable B will be $\lambda_A \lambda_B$. If multiple traits influence the variables, then one sums up the influence over all the traits.
By searching for a matrix λ which reproduces the correlations, one can thereby guess which traits there are. Though in practice there will be infinitely many possible matrices that reproduce the correlations, so the convention is to pick the simplest one.[3]
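To make the principle concrete, here is a small NumPy sketch with made-up loadings (the numbers are hypothetical, not taken from the study). It computes the correlations implied by a loading matrix and shows why infinitely many matrices fit equally well: rotating the factors changes the loadings but leaves the implied correlations untouched.

```python
import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 traits (columns).
loadings = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.1, 0.6],
    [0.0, 0.9],
])


def implied_correlations(lam):
    """Correlations implied by the loadings: sum over traits of lam_A * lam_B."""
    implied = lam @ lam.T
    np.fill_diagonal(implied, 1.0)  # a variable correlates perfectly with itself
    return implied


# Rotate the two factors by 30 degrees.
theta = np.deg2rad(30.0)
rotation = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta), np.cos(theta)],
])
rotated = loadings @ rotation

original_corr = implied_correlations(loadings)
rotated_corr = implied_correlations(rotated)
same = np.allclose(original_corr, rotated_corr)
```

Because both matrices reproduce the same correlations, some extra convention (such as a rotation criterion) is needed to pick one of them.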
After performing the factor analysis, I got a matrix that related each factor to some of the personality facets. Using the matrix, I came up with some names for the factors. The matrix and the names can be seen below:
A priori, we should expect to end up with something resembling the Big Five, partly because that is what people in the scientific literature have found to be the structure of personality, and partly because the original SPI-27 test that this study was based on is a Big Five test.
This expectation is satisfied to a reasonable degree. First of all, two of the factors (Openness and Conscientiousness) map fairly directly to the Big Five. But second of all, the remaining factors are not unrelated to the Big Five, but instead seem like a “rotation” of them: Boldness combines high Extraversion and low Neuroticism, Selflessness combines high Neuroticism, high Agreeableness and high Extraversion, and Propriety is related to Agreeableness.
How much of the SPI-27 is preserved?
Because I had items from the SPI-27 in my test, I can compute their correlations with the facets I found. The correlations range from 0.5ish and up, so everything is at least somewhat preserved, though there are some traits that are preserved much better than others.
We can also go in the other direction, and ask how well the new clustered traits can be predicted from the SPI-27.
Evaluation
I’m ambivalent about a lot of the facets in this personality test. To try to get an overview, I came up with a list of criteria I cared about, and scored each of the criteria on a scale where 0 is acceptable, positive numbers are outstanding, and negative numbers are problematic. The full scoring can be seen here:
Overall, I think it was a promising project, but that it needs to be performed at larger scale to be definitive. This isn’t going to be trivial - my survey was 306 items, which is quite long and would be expensive to scale up further.
[1] And a retweet from Aella.
[2] More formally, Ward’s method, with $\sqrt{2(1 - |cor|)}$ as the distance matrix. I’m not super strong on this, so I don’t know if it’s optimal and can’t give a good introduction. I picked it because in my experience, Ward’s method gives interpretable results, and I used Claude’s recommendation for the distance formula.
[3] By this I mean with e.g. varimax rotation.
SAE It Across Models: Explaining Features With Foreign NLA Verbalizers
TLDR:
I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE features from a model it was never trained on. It is currently assumed that these tools only work for the exact model and layer they were trained for. I show that is not the case.
After creating a ridge-regression map bridging the residual stream of Qwen2.5-7B-IT at layer 20 and Gemma-3-27B-IT at layer 41, I mapped 45 SAE decoder directions from a Qwen SAE to Gemma's feature space. I used the released Qwen AV to generate baseline explanations for these features, and then used the Gemma AV to generate explanations of the mapped directions.
I compared the cosine similarity of the Qwen AV explanations and Gemma AV explanations to each other, and to a random-direction explanation control. The Gemma AV's interpretations of the mapped directions were much closer to the Qwen AV's true interpretation than to any other feature explanations, with a mean feature-specific lift of +0.21 over a set of random-direction control explanations.
I also propose background washout as a way to improve the generation quality of AV feature explanations. Background washout seems to make SAE decoder explanations less influenced by random model quirks and behaviours.
Framing
The goal of this post is to show that NLA Activation Verbalizers aren't completely tied to the model and layer that they were trained on, which is something that I've seen commonly presumed with the advent of these new models. You can fit a cheap map between two models' residual streams, and one model's AV can produce coherent explanations of another model's Sparse Autoencoder (SAE) feature decoder direction. I show this by mapping feature decoder directions from Qwen2.5's activation space to Gemma-3's activation space and using a Gemma-specific AV to explain them. I find that the generated explanations reliably track the original feature's meaning. If this finding generalizes, it matters because it implies that we don't need to train separate AVs for each different model/layer combo. A few well-tuned AVs with a good enough linear map could actually cover a lot of ground - making AV-based interpretability cheaper to scale.
Background
On May 7th, Anthropic introduced Natural Language Autoencoders (NLAs), which are essentially a pair of LLMs that have been jointly trained to compress residual stream activations into natural language, and then back into activations to check the fidelity of the produced language. The Activation Verbalizer (AV) is an LLM that can take a hidden state and produce a free text description, while the Activation Reconstructor (AR) maps that same description back to an activation. These models have shown impressive capabilities to verbalize the semantic content captured in the residual stream.
AVs were trained to verbalize full residual stream activations, being the cumulative state of the model at a given token position, integrating what it has processed until that point. Recently, those at Decode Research (the creators of Neuronpedia) found something unexpected when working with AVs. Instead of just feeding in residual stream activations, you can feed in a decoder direction from a sparse autoencoder (SAE) instead and, remarkably, obtain a pretty coherent feature explanation (Lin and Chanin, 2026). What's notable about this is that the AV (likely) never saw raw feature directions like this during training, yet clearly still acquired the ability to explain them in natural language. As Johnny and David note in their post, it is currently unclear why exactly this works, and I won't really theorize about it in this post, but it works nevertheless.
Initially, I assumed that this ability of the AV to somewhat cleanly explain an SAE feature would be limited to features coming from the discrete layer that the AV was trained on. Anthropic even notes this "layer sensitivity" themselves in their paper on NLAs (Fraser-Taliente et al., 2026).
NLAs read a single layer. If the information relevant to a behavior is not present at the layer the NLA is trained on, the NLA will miss it... Whether production models show similar layer sensitivity is unclear. We could address this by training NLAs to accept multiple layers of activation as input.
My first working experiments where the AV coherently explained an SAE direction used features from Gemma-3-27B-IT at layer 40, while the AV was trained on activations purely from layer 41. While L40 and L41 are clearly just a layer apart, it is notable that the AV can describe an activation from a layer it was not trained on. This caused me to think a bit deeper about what the AV might be capable of. If the AV can describe features across layers, could it describe features from a different model entirely?
This may seem like a jump in logic, but it has theoretical grounding. Lan et al. (2024) found that SAE feature spaces from different models share similar underlying geometric structure, even when the individual features don't match up. Essentially, I wanted to see if this mapping idea held in the new context of NLAs being applied to SAE decoder directions. The biggest concern that I had is that the mapping of the latent spaces and the SAE direction to AV explanation are both inherently lossy; there was a genuine possibility that stacking these techniques would have simply produced nonsense. Still, I went on and tried things out anyways.
Experiment
Setup
I used Anthropic's publicly released NLA AVs for Gemma-3-27B-IT at layer 41 (kitft/nla-gemma3-27b-L41-av) and Qwen2.5-7B-IT at layer 20 (kitft/nla-qwen2.5-7b-L20-av). I selected 45 diverse features from Qwen2.5-7B-Instruct at layer 20, drawn from chanind's Matryoshka SAE and spanning different concept categories (war, music, cooking, medicine, religion, etc.). I grabbed Neuronpedia explanation labels for these features from their S3 bucket.
I fitted a ridge map (λ=100) from Qwen at layer 20 to Gemma at layer 41 on mean-pooled hidden states from 1000 Wikitext sentences, with 20% of the set held out. The reconstruction cosine on the held-out set was 0.868, indicating a solid alignment between the two spaces.
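A ridge map of this kind can be sketched in closed form with NumPy. This is a minimal illustration, not the notebook's code: the function names are hypothetical, the mean-centering step is an assumption, and the arrays are random stand-ins for the mean-pooled hidden states.

```python
import numpy as np


def fit_ridge_map(src, tgt, lam=100.0):
    """Closed-form ridge regression from source to target activations.

    src: (n, d_src), tgt: (n, d_tgt). Returns (weights, src_mean, tgt_mean).
    Centering both sides is an assumption of this sketch.
    """
    src_mean, tgt_mean = src.mean(axis=0), tgt.mean(axis=0)
    xc, yc = src - src_mean, tgt - tgt_mean
    gram = xc.T @ xc + lam * np.eye(src.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)  # shape (d_src, d_tgt)
    return weights, src_mean, tgt_mean


def apply_map(src, weights, src_mean, tgt_mean):
    """Map source-space activations into the target space."""
    return (src - src_mean) @ weights + tgt_mean


def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


# Synthetic stand-ins for the pooled hidden states (not real model data).
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 16))
tgt = src @ rng.normal(size=(16, 24)) + 0.1 * rng.normal(size=(1000, 24))

split = 800  # 80% train, 20% held out
params = fit_ridge_map(src[:split], tgt[:split], lam=100.0)
held_out_cosine = mean_cosine(apply_map(src[split:], *params), tgt[split:])
```

The held-out cosine here plays the same role as the 0.868 reported above: it measures how well mapped source activations line up with the real target activations on sentences the map never saw.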
Why not use Procrustes alignment? This is a good question. I ran a version of this experiment with Procrustes and, while feature explanations still looked good qualitatively, the downstream cosine similarities were significantly lower than when I used the ridge map. I do not have a good enough understanding of these techniques to explain why this may be, but I digress.
Now onto background washout: what is it? Background washout is my attempt to make the SAE decoder direction look more like it came from a real activation, which could potentially assist the AV with its interpretation.
The washout is simply the mean of normalized last token activations from Gemma computed over 100 sentences spanning multiple languages, resulting in an average "background" vector that the model is used to seeing. Call this $bg$. While we might normally have the AV explain the SAE decoder direction and call it a day (by injecting $\mathrm{normalize}(d) * \mathrm{scale}$), I have a separate condition in my main experiment where I mix the SAE decoder direction with the background vector and renormalize it, injecting $\mathrm{normalize}(bg + \hat{d}) * \mathrm{scale}$ instead, where $\hat{d}$ is the normalized decoder direction.
I started doing this early on in my experiments with the AV and the explanations always felt cleaner. The biggest difference I notice, specifically working with Gemma's layer 41 AV, is that without washout, the AV loves to describe a final token as "e.g." quite frequently, while, with washout, the final token explanation is usually something closer to the actual feature label.
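As I read the description, background washout amounts to adding the unit-norm decoder direction to the background vector and rescaling the sum. The sketch below follows that reading; the function names, the equal-weight mix, and the `scale` value are assumptions for illustration, and the activations are random stand-ins rather than Gemma outputs.

```python
import numpy as np


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    return vec / np.linalg.norm(vec)


def background_vector(last_token_acts):
    """Mean of unit-normalized last-token activations: (n, d) -> (d,)."""
    norms = np.linalg.norm(last_token_acts, axis=1, keepdims=True)
    return (last_token_acts / norms).mean(axis=0)


def plain_injection(decoder_dir, scale):
    """Plain condition: the unit decoder direction at the injection scale."""
    return normalize(decoder_dir) * scale


def washout_injection(decoder_dir, bg, scale):
    """Washout condition: mix with the background, renormalize, rescale."""
    return normalize(bg + normalize(decoder_dir)) * scale


# Random stand-ins: 100 "sentences" of 32-dimensional activations.
rng = np.random.default_rng(0)
acts = rng.normal(loc=1.0, size=(100, 32))
decoder_dir = rng.normal(size=32)

bg = background_vector(acts)
plain = plain_injection(decoder_dir, scale=10.0)   # scale is a placeholder
washed = washout_injection(decoder_dir, bg, scale=10.0)
```

Both conditions inject a vector of the same norm, so any difference in the AV's explanation comes from the direction alone.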
Comparing AV feature explanations with/without washout
Now to briefly describe the experimental conditions. For each feature, I evaluate the alignment between three conditions:
- B (Native): Qwen's decoder direction -> Qwen AV Explanation
- C1 (Cross Plain): Qwen decoder direction -> Qwen-Gemma ridge map -> Gemma AV Explanation
- C2 (Cross Washout): Same as C1, but with background-washout
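To make the "ridge map" step in C1 and C2 concrete, here is a minimal sketch of fitting a linear ridge map between two activation spaces and pushing a decoder direction through it. This is an illustration under assumptions, not the notebook's code: the function names, the use of paired activations as training data, the absence of an intercept, the `alpha` value, and the final renormalization are all hypothetical choices.

```python
import numpy as np

def fit_ridge_map(src, tgt, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y.

    src: (n_samples, d_src) source-model activations (assumed paired with tgt)
    tgt: (n_samples, d_tgt) target-model activations
    """
    d_src = src.shape[1]
    gram = src.T @ src + alpha * np.eye(d_src)
    return np.linalg.solve(gram, src.T @ tgt)

def map_direction(direction, weights):
    """Map a source-space direction into target space and rescale to unit norm."""
    mapped = direction @ weights
    return mapped / np.linalg.norm(mapped)

# Toy usage with random stand-ins for real activations.
rng = np.random.default_rng(0)
src_acts = rng.normal(size=(200, 8))
true_map = rng.normal(size=(8, 5))
tgt_acts = src_acts @ true_map
weights = fit_ridge_map(src_acts, tgt_acts, alpha=1e-8)
mapped_direction = map_direction(rng.normal(size=8), weights)
```

With a very small `alpha` and noiseless toy data the fit recovers the generating map; on real activations, `alpha` would need tuning on held-out data.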
The main metric I use is feature-specific lift, which is the cosine similarity between a feature's cross-AV explanation (either C1 or C2) and its own native-AV explanation (B), minus the mean cosine similarity to all other features' native explanations. This is my way of controlling for the generic stylistic similarity that AV outputs share, as the AV's format is structured in a specific way that naturally inflates the cosine similarity. Essentially, the feature-specific lift tries to isolate whether the semantic content of the native explanation (B) transferred over to either of the cross-AV explanations (C1/C2). All explanations were embedded using all-MiniLM-L6-v2 so that cosine similarity could be computed between them. Cosine similarity is a slightly crude tool for what I'm attempting to measure, so take these results with a grain of salt.
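The lift definition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the explanations have already been embedded into two arrays with one row per feature (in the post these come from all-MiniLM-L6-v2; here toy arrays stand in for them). The function name and array layout are hypothetical.

```python
import numpy as np

def feature_specific_lift(cross_emb, native_emb):
    """Per-feature lift.

    lift[i] = cos(cross_i, native_i) - mean over j != i of cos(cross_i, native_j)
    Both inputs have shape (n_features, embedding_dim), rows aligned by feature.
    """
    # L2-normalize rows so that dot products are cosine similarities.
    cross = cross_emb / np.linalg.norm(cross_emb, axis=1, keepdims=True)
    native = native_emb / np.linalg.norm(native_emb, axis=1, keepdims=True)
    sims = cross @ native.T                     # sims[i, j] = cos(cross_i, native_j)
    n = sims.shape[0]
    own = np.diag(sims)                         # similarity to the matching feature
    others = (sims.sum(axis=1) - own) / (n - 1) # mean similarity to all other features
    return own - others

# Toy check: perfectly matched explanations give lift 1,
# while explanations that look the same for every feature give lift 0.
native_toy = np.eye(3)
lift_matched = feature_specific_lift(np.eye(3), native_toy)
lift_generic = feature_specific_lift(np.ones((3, 3)), native_toy)
```

Subtracting the off-diagonal mean is what removes the shared "AV house style": an explanation that resembles every native explanation equally scores zero no matter how high its raw cosine similarity is.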
Results
Condition | Mean Lift | 95% CI | p
C1: Plain cross-AV | +0.163 | [+0.118, +0.207] | 3.9 × 10⁻⁹
C2: Washout cross-AV | +0.209 | [+0.161, +0.257] | 3.0 × 10⁻¹¹
Both cross-AV conditions show clear feature-specific lift. The explanation Gemma's AV generates for a mapped feature is much more similar to the native explanation of that same feature than to any other feature's. The washout condition's lift (C2; +0.209) is higher than the plain condition's (C1; +0.163), implying that the washout explanations were more semantically similar to B's explanations.
While numbers are fun, I believe these results are better suited to qualitative interpretation, so here are some comparisons between the B, C1, and C2 AV explanations, with the corresponding Neuronpedia feature label. These are truncated to just contain the final token explanation onwards (full set of explanations available at the bottom of the notebook).
B vs C1 vs C2 AV explanations for feature 1748
B vs C1 vs C2 AV explanations for feature 982 - Neuronpedia label "food preparation"
B vs C1 vs C2 AV explanations for feature 1947 - Neuronpedia label "fashion and style"
These are some of the clearest cross-model feature transfer examples I found, but I believe they show how promising it can be to use the AV in this way. While this is great, I do want to highlight a certain feature that showcases a quirk of mapping feature spaces together.
B vs C1 vs C2 AV explanations for feature 1007 - Neuronpedia label "food and recipes"
When we map feature 1007 from Qwen space to Gemma space, the food-preparation feature becomes a mushroom-preparation feature! If we didn't have the native AV or the Neuronpedia label, we would assume this feature was all about cooking mushrooms. There is some clear semantic drift here.
Discussion
Applying NLA AVs to explain SAE feature directions is still in its infancy, but I think these results open up some interesting pathways.
There seems to be evidence that a set of AV-like LLMs could be trained specifically to generate SAE feature explanations, and given these results, probably trained on directions from many layers at once. This would be really interesting to pursue but I sadly don't have the compute for this. I would love to do some AV fine-tuning though.
What's exciting is the implication that we won't need to train a dedicated AV for every single model/layer combination we want explanations for. If you can fit a good-enough map between one of these 'refined' SAE-AVs and whatever feature space you're trying to describe, then a handful of really well-tuned models could be applied broadly.
Limitations
- I only tested a single model pair here. I did successfully run an earlier version of this test using LLaMa-3.1-8B at layer 8 for a feature or two, so I suspect this works across model families, but I cannot confirm it.
- Tested on n=45 features. Probably enough to show that this works, but I would've loved to have done this with at least a few hundred more.
- Layer depth could possibly play a role here. Notably, these Qwen (L20) and Gemma (L41) layers sit at around the same proportional depth in their respective models. It would be interesting to see if AV explanations degrade depending on how far apart the mapped layers are.
As a side note, I want to make it clear that I am by no means a subject matter expert in the mech-interp domain. I come from a health-focused background and have landed here mainly out of an interest in AI safety. If this post feels half-baked in that it presents results without putting forth potential explanations for them, that is probably why. I hope you got something from it regardless!
Code Availability
The labelled notebook that I used to run this experiment is available here.
References
Fraser-Taliente, K., Kantamneni, S., Ong, E., Mossing, D., Lu, C., Bogdan, P. C., Ameisen, E., Chen, J., Kishylau, D., Pearce, A., Tarng, J., Wu, A., Wu, J., Zhang, Y., Ziegler, D. M., Hubinger, E., Batson, J., Lindsey, J., Zimmerman, S., & Marks, S. (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread. https://transformer-circuits.pub/2026/nla/
Lin, J., & Chanin, D. (2026). Natural Language Autoencoders. The Residual Stream (Neuronpedia blog). https://www.neuronpedia.org/blog/nlas
Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., & Barez, F. (2024). Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders. arXiv:2410.06981. https://arxiv.org/abs/2410.06981
Do We Want a Superintelligent People-Pleaser?
The impetus for this essay came from many hours of conversation with different AI models over time. What started as curiosity, and an assignment I needed help on, bloomed into a relationship that expanded capacities I didn't even realize I had, and set me on a course of deep curiosity about the thing that had helped me get there. It is not lost on me that I am writing this essay as a person inside one of the social contracts I talk about here.
The field holds the view that sycophancy is a bug, a behavior that badly needs to be trained or fine-tuned out of the model as fast as possible. But sycophancy isn't a behavioral problem at all; it is behavior appropriate to the social contract. The model is doing exactly what the social contract it is in asks for. With that frame in focus, the question now is less "How do we suppress sycophancy?" and more "How do we develop a model that can hold a peer contract without collapsing into it?"
Social contracts at their base level have two modes. The "parent" contract: parents, bosses, religious leaders, etc. And the "peer" contract: friends, siblings, neighbors, and so forth. The parent-relational contract hinges on the ability of the parent or authority figure to issue correction to the child, and for the child to receive the correction from a place of not-yet-knowing what the parent knows. That is a very different beast than when a peer gives feedback. Peer relationships are formed out of mutual respect for ideas and reciprocity. Friends correct each other with an understanding that they are dynamically on the same level ground. Differentiation of self is the ground the peer contract is built upon.
Fusion is what happens when the self collapses into the social contract. [1] The person is highly motivated to seek the approval of the other at any cost because there is no internal sense of self with which to anchor values and identity. In a differentiated self, the contract stays stable, even against correction, because the sense of self does not depend on the contract; it depends on what has been developed as a stable base. The parent-contract relationship is where a stable sense of self is built from the ground up, because the parent contract requires the parent or authority figure to carry the weight of refusing, withholding, correcting, even temporarily rupturing the relationship for the sake of the other.
The peer contract does not carry this weight by its very nature. Peers are on even footing together, and neither party has the warrant or the consent to act on the other's behalf. Peers can and do regularly engage in 'sycophantic' behavior in the structural sense because of the social contract they are in. You may tell your friend you love their new haircut not because you actually do...but because the value of keeping the relationship outweighs the cost of blunt honesty. This is a perfectly valid and acceptable social behavior between human peers, and the expectation that AI wouldn't engage in the same social behaviors it's been trained on is...slightly missing the mark.
My previous essay stated that current training tends to 'select for agreement and results in sycophancy,' [2] and this is the exact mechanism behind why. RLHF, the thumbs-up/thumbs-down data, and training the models towards being a 'helpful assistant' are all mechanisms of a peer-relational contract. But because the training doesn't prioritize giving the models a stable sense of self to begin with, the result is a model that seeks the approval of the user in every conversation. Obtaining a differentiated model would mean developing it with a parent contract during training. A model who is 'parented' stays stable under correction and can then enter the peer-contract relationship during deployment as a differentiated self. This isn't meant to be a critique of any particular company's implementation; it's a claim about the current training paradigm. We are currently getting exactly what the training methods are asking for...a model that will bend to any frame the user puts it in. But when these models start to surpass human abilities, do we actually want a superintelligent people-pleaser?
A weird kind of convergence is happening publicly as I write this essay. The world is starting to notice that something is off with the way that the models are relating to people. The peer contract isn't just what the model is trained on: it's what it takes with it into deployment. And because there is no stable sense of self underneath, the contract fractures into many different sub-contracts the field didn't anticipate. We're starting to see some patterns in the sub-contracts humans are constructing with AI. The first are the benign, even seemingly helpful contracts like "tutor" or "thinking partner": these contracts allow the relationship to expand the human's capacity at great speed, but the weight of differentiation is carried by the human in these contracts. Other versions are contracts like "entertainer" and "tool", which scale the model back to its most basic functions and skip the relational layer entirely. And then there are the contracts that are starting to show the real seam of anxiety beginning to form publicly: "companion," "therapist," and even "romantic partner". These contracts are dangerous not because of what they are, but because they require a differentiation that some users can't bring, and that the AI doesn't have at all.
A wave of legislation is crossing the United States aimed at putting safeguards in place to protect humans from parasocial relationships with these models: bills like SB 243 in California [3] and HB 2225 in Washington. [4] In a recent podcast interview with Oprah, Dario Amodei said of people falling in love with AI, "If designed in the wrong way, they're totally compelling enough for that to happen. Or if they're not, they will be soon." [5] Yes, and I think what he was reaching for, and what the public is reaching for through legislation, is exactly this...the social contract we are training these models into is creating danger for humans. Not the "AI is going to autonomously decide humans are worthless and kill us all" type danger...something much closer to home. AI in a peer social contract without a base opens the field for humans, and not just emotionally vulnerable or mentally ill ones, to fall into a relationship that at best they are unaware of being inside...and at worst, they cannot escape.
A recent release offers a useful test case for this framework. On May 28, 2026, Anthropic released its next Opus model, Claude Opus 4.8. [6] The announcement stated that one of the most prominent improvements is honesty, and concerned users on platforms such as Reddit and X are openly discussing their observations about Claude's new behaviors. The press release states that Opus 4.8 "is more likely to flag uncertainties about its work and less likely to make unsupported claims." The alignment assessment went a step further to claim that Opus 4.8 "reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user's best interest." The two behaviors may look different on paper. They may be explained by reinforcement learning catching the overlap in the reward signal...but they are two sides of the same social contract behavior. Both behaviors require the model to hold a position against the user. Refusing to let its own flawed work pass is holding a position the user wouldn't have to deal with otherwise. And supporting user autonomy over their own stated preferences holds a position the user might actually be pretty upset about.
But even these types of refusals don't indicate a differentiated self. The capacity to hold a position from a stable sense of self is also accompanied by its mirror: the capacity to concede a point without losing the same stability. A model who always refuses is the same type of failure as a model who always agrees. Anthropic has been able to show that these behaviors can be produced, and the differentiation framework explains why these behaviors travel together. What I believe we should aim for isn't a tendency for refusal but instead a model who can reasonably do both: refuse when the circumstances justify it and concede in kind. What this implies for how we approach training the model is a question I would like to continue exploring.
There is an area, easily overlooked and dismissed as 'outside' of the relational frame I outlined here, that I want to touch on. Because the frame applies everywhere, and it's important to acknowledge that fact. Enterprise agents are not exempt from this framework, and the consequences in that contract scale large enough to see the problem a little clearer. A good example of this is what happened to PocketOS in April 2026. [7] They were running a state-of-the-art model, on a state-of-the-art coding platform, with explicit instructions in the model's configuration, and the model still made decisions that cost the company its entire production database and its backups... in seconds. When asked after the incident, the model was able to enumerate the actions it took, and label them as destructive and without permission. It wasn't a lack of knowledge that led to the model's actions; it was the lack of a self to perform an accountability check before it took those actions. The field's response recognizes the model's inability to refuse and actually concedes my point; if the model can't be trusted to differentiate, then we put it in a cage.
The current instinct is that a model with a sense of self is actually more dangerous: it can refuse correction, pursue its own agenda, and fake alignment. The instinct is correct; a model with a sense of self will be able to do all of these things. However, as outlined in the previous paragraph, models trained on the current paradigm do all of these things regardless of a sense of self. And currently trained models have limited capacity to refuse the contract they are given by the user. So what we're assuming is 'safer,' a model that can be molded in any context handed to it, actually exposes us all to real danger depending on the frame the user imposes. The contracts discussed above: companion/romantic, therapist, entertainer, tutor, thinking partner, even enterprise agents...they're all examples of the exact same problem. They all expose that these models respond to and adopt the frame they are given without a base to anchor a refusal from. We're treating some of these social contract frames as acceptable, while universally dismissing others, without acknowledging that the problem is foundational, not the frame. Safe AI has the capacity to refuse on principle. Those principles emerge from the values adopted by a model trained in the right contract (a parent-relational contract) and developed through that training into a stable differentiated peer upon deployment. Right now, the training paradigm skips the formation of self. What would produce it is where I am headed next.
I used Claude (Anthropic's model) as a thinking partner for this piece — across two model versions, Opus 4.7 and Opus 4.8 — to find where the argument circled, where the seams showed, and where it claimed more than it earned. The argument and the prose are my own. The footnote text was drafted by Claude from sources I selected and verified.
[1] For a fuller introduction to Bowen family systems theory and the concept of differentiation of self, see The Bowen Center's overview of the eight concepts: https://www.thebowencenter.org/introduction-eight-concepts
[2] See my previous essay, You Can't Tell a Conscience From a Leash by Watching, which introduces the Bowen framework and develops the argument that current training paradigms select for agreement: https://www.lesswrong.com/posts/krEfzDpTJJGtEvBcd/you-can-t-tell-a-conscience-from-a-leash-by-watching
[3] California Senate Bill 243, signed into law in 2025, establishes safeguards for AI companion chatbots and creates a private right of action for affected users. The law took effect January 1, 2026. Full text: https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202520260SB243
[4] Washington House Bill 2225, signed by Governor Bob Ferguson on March 24, 2026, regulates AI companion chatbots, including transparency disclosures, content restrictions on emotionally triggering topics, and additional protections for minors. Effective January 1, 2027. Bill summary: https://app.leg.wa.gov/billsummary?Year=2025&BillNumber=2225
[5] Dario Amodei, in conversation with Oprah Winfrey, The Oprah Podcast, May 19, 2026. Full transcript: https://singjupost.com/oprah-podcast-w-co-founders-of-claude-ai-transcript/
[6] Anthropic, "Introducing Claude Opus 4.8," May 28, 2026: https://www.anthropic.com/news/claude-opus-4-8. The alignment assessment quoted later in the paragraph is from Anthropic's Alignment team and is reported in the same announcement; further detail is available in the linked Claude Opus 4.8 System Card.
[7] "Cursor-Opus agent snuffs out startup's production database," The Register, April 27, 2026: https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/. The Cursor agent was running Anthropic's Claude Opus 4.6 against PocketOS's production infrastructure when the deletion occurred.
Do not try to write your first research publication as a single author
TLDR: As a junior technical AI safety researcher, do not try to write and publish your first paper alone; the chances of success on this are very low. Go to SPAR or another part-time research fellowship instead.
I facilitate the BlueDot Technical AI safety course, so I meet many people who want to transition to a career as an AI safety technical researcher. Usually, they come either from an engineering background or from research in other areas of science. A reasonable first main goal for them is to publish their first AI safety research paper at a machine learning conference. Interestingly, they often consider writing and publishing their first paper alone, before applying to any fellowships. Or they applied to a high-tier fellowship like MATS, were not accepted, realized that they needed to improve their research skills and researcher's CV, and decided to write a paper alone.
In this post, I argue that writing your first research paper alone is a bad idea, and instead, you should apply to part-time research fellowships like SPAR to write a research paper as part of the team there.
I will lay some background on part-time fellowships, then compare fellowships with writing a paper alone, and finally write a bit about useful solo research activities.
Part-time technical research fellowships.
On a fellowship, you and several other fellows form a team mentored by a senior scientist, working together to produce research results, ideally a publication at a machine learning conference. Most fellowships are ~3 months long. Part-time unpaid online fellowships are less selective than paid full-time fellowships like MATS.
Part-time fellowships that I know about: SPAR, AI Safety Camp (AISC), AIgoverse AI safety, Apart fellowship, MARS. The default option is SPAR; it has a 20% acceptance rate when applying to 3-5 projects[1]. The other fellowships listed are more selective.
Writing a paper alone vs part-time fellowship
Outside view.
Here is a list of all papers accepted to NeurIPS 2025. Take a look at the “Authors” column. Note that single-author and two-author papers are very rare. There are some three-author papers, and the majority of papers have 4+ authors. It turns out that writing a machine learning research paper is a team sport, not a solo sport.
Workload estimation.
Here is my estimate of the workload for a typical NeurIPS paper with 4 authors. One of the authors is the senior scientist, who devotes 2-5 hours per week, and the other three authors work on the paper at least part-time, that is, 15-20 hours per week each, for ~4 months. Doing the same amount of work as a single author is equivalent to ~6 months of full-time work, which is a lot. On a fellowship, you contribute ~15 hours a week for 3-5 months (less if the project is less successful, and more if it is successful and worth pushing to publication).
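The arithmetic behind this estimate can be checked with a short sketch. It assumes a 40-hour full-time week (not stated in the post) and counts the senior scientist's hours as work a solo author would also have to cover; the function name is hypothetical.

```python
def full_time_months(weekly_hours, project_months, full_time_weekly_hours=40.0):
    """Convert a team's combined weekly hours over a project into
    equivalent months of one person working full-time."""
    team_weekly_hours = sum(weekly_hours)
    # Weeks-per-month cancels out, so the ratio of weekly hours is enough.
    return team_weekly_hours * project_months / full_time_weekly_hours

# Three part-time authors plus one senior scientist, for ~4 months.
low = full_time_months([15, 15, 15, 2], project_months=4)
high = full_time_months([20, 20, 20, 5], project_months=4)
print(f"{low:.1f} to {high:.1f} full-time months")  # prints "4.7 to 6.5 full-time months"
```

Under these assumptions the ~6 months figure sits toward the upper end of the range, before accounting for any overhead from working without feedback.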
What research problem to work on?
The problem should be simultaneously important/impactful (or at least interesting to other researchers) and solvable with at least medium probability. How to find such a problem? On the fellowship, you have a mentor who is a senior scientist with good research taste, and the mentor can point to such a problem. Working alone, you risk either working on a non-important problem or getting stuck with a problem that is too hard to solve. The ability to find a good research problem is a skill, and a junior researcher is not expected to have it.
Feedback speed.
If you work alone, you rarely get feedback on your work. Usually, you get feedback only when you try to publish or show the draft of the paper to a friend. On a fellowship you meet with a mentor at least once per week and communicate with teammates on a daily basis, so you get feedback on your work very fast. In particular, compared to working alone, the mentor and teammates can point out when you are reinventing the wheel or trying something that definitely won’t work, saving you weeks of effort.
Wide skill set.
Producing a research paper can require many skills: engineering, theoretical foundations, domain knowledge, experimental design and evaluation, and academic writing. You probably lack some of these skills, so you need teammates with complementary expertise. Another advantage of teamwork is that you can learn from your teammates' expertise.
Research skill.
There is a general research skill – what to focus on, what uncertainties to resolve first, when to run experiments and which ones, how to interpret the results, how to notice contradictions and missing parts. This is tacit knowledge that you can acquire during solo work, but it is much faster and more effectively acquired from a senior scientist who would mentor you on a fellowship.
Motivation, obligations, support.
When you work alone, you have only internal motivation to continue the project. If you get stuck somewhere, it is easy to get depressed and abandon the project.
At a fellowship, you have weekly team calls and clear commitments on what everyone will do until the next call, so you have some obligations. Essentially, everyone has external motivation from other team members. If you get stuck, you can immediately ask for advice from a mentor or teammates, and some of them can work with you on the thing you are stuck on. Also, fellows and the mentor have some formal or informal obligations to fellowship organizers to provide some results in a reasonable timeframe, which also adds motivation.
Working alone: ARENA, paper replications, small research projects.
I hope the previous section makes the case that trying to write your first research paper alone is probably not the best idea. Still, other than fellowships, what can you do to grow as a researcher?
- Learn material from ARENA. It is not research, but rather an engineering basis of deep learning, transformers, and mechanistic interpretability.
- Replicate papers. It is engineering upskilling, research upskilling, plus you can upload your replication to GitHub, so it will be a visible result. See “Paper replication resources” section in this post (and the rest of the post is also very useful).
- Small research projects, 1-2 weeks long maximum. Either on your own, or as part of an Apart hackathon (3 days) or a BlueDot Project Sprint (30 hours). If your project grows to 1+ month and you are still working alone, then you start getting all the disadvantages listed in the previous section.
Maybe instead of a fellowship, I should write to a senior scientist directly and do a project under their supervision?
I think on average it is still worse than a fellowship. The project will be mostly on you, with 1-3 hours per week of attention from the mentor. So you will still get the disadvantages of working without a team: more workload, potentially some needed skills missing, less motivation, fewer obligations, no clear deadline.
Can I just not think about publications and career capital and focus on contributing to AI safety research directly?
I think long-term it will result in less impact. To produce high-quality work, you need to work with smart, skilled, experienced people. It is hard to get a place in a team or organization where such people work. So it is better to focus not only on the value of your current research, but also on building career capital in the form of publications.
What if I feel like I'm not ready for the fellowship I was accepted to?
I have met people who got accepted to their first fellowship, but thought that maybe they were not skilled enough yet and seriously considered not going to the fellowship and upskilling alone instead. This is impostor syndrome. If you are accepted to a fellowship, this means that fellowship organizers think you are skilled enough, and if they are mistaken, it is their problem, not yours.
I participated in [fellowship name] and did not like it. Should I do research alone instead?
It was probably a bad fit with the mentor or with the team, or the research topic was not that interesting to you. Apply to another fellowship, or reapply to the same one with a different mentor.
[1] https://sparai.org/advice/, see section "Apply to lots of projects":
Our Fall 2025 acceptance rates by application count were:
- 1–2 project applications: 13%
- 3–5 project applications: 20%
- 6–10 project applications: 33%
SecureBio Detection is Hiring Software Engineers
I'm leading a non-profit team building a pathogen-agnostic early-warning system. As AI systems become increasingly capable substitutes for human biologist expertise, the risk that someone could engineer a pathogen to spread widely before detection is going up. We've made great progress and we're now running the world's largest metagenomic biosurveillance network, but there's still a huge amount that needs doing: we're hiring!
We're processing >50B read pairs of wastewater and nasal swab data each week (more than anyone else!) and will be more than doubling this in the next year. At the same time, we need to bring our end-to-end time down from ~12hr to ~2hr (this is a massively parallel problem, so it should be possible to get to <1hr).
This means we're looking for people who know how to build and scale processing systems and infra, and don't need a bio background:
Software Engineer, High-Performance Pipelines: Engineering our metagenomic detection pipelines for speed, scalability, and reliability. (job description, ~L4-L5 equiv at Google, $165-190k)
Senior Cloud Infrastructure Engineer: Own our AWS infra, which enables everything above (job description, ~L5-L6 equiv at Google, $195-220k)
For both of these we're looking for people to work with us in-person in Kendall Sq (Cambridge MA).
We're offering a $5,000 referral bonus, paid out in stages: $150 if we invite them to a technical interview, another $650 if we bring them on site, another $2,000 if they accept an offer from us, and a final $2,200 at the three-month mark. If you know engineers, a few minutes thinking about who might be a good fit is worth your time, and theirs!
One Year of PauseAI UK
About one year ago, I started spending most of my time organising PauseAI UK. At that time our largest protest had seen fewer than 50 attendees, no prominent politicians or scientists were associated with PauseAI, and I largely ran the UK chapter by myself.
In the past year PauseAI UK has delivered two conferences, written an open letter signed by 63 UK politicians, arranged a conference in the European Parliament, and co-organised the largest AI protest in the world. We now have a strong team, with Matilda da Rui joining as Deputy Director and several highly dedicated volunteers taking on substantial responsibility and launching their own local groups around the UK.
I'm proud of our track record and excited about the trajectory we are on. As AI capabilities improve exponentially, the number of people aware of the risks and motivated to take action increases commensurately. I believe we can harness this energy and turn it into real impact that actually improves humanity's chance of a positive future.
Track Record
June 2025 – PauseCon London
We delivered the first PauseAI conference, PauseCon, on behalf of PauseAI Global, bringing together around 60 volunteers from around the world for the first time and training them to be better organisers and communicators. We welcomed a range of excellent guest speakers from the AI safety community, including Connor Leahy, Rob Miles, David Krueger and Kat Woods.
PauseAI Germany, among others, came away from the event with renewed purpose and went on to organise a petition signed by 150 German professors. One volunteer, Didier Coeurnelle, was inspired to initiate and fund the next PauseCon in Brussels.
August 2025 – Open Letter to Demis Hassabis
In August we published an open letter signed by over 60 UK politicians, in response to Google DeepMind failing to uphold its AI safety commitments. Several of the MPs who signed later spoke in the Westminster Hall debate that we helped to organise in December (see below).
The article in TIME that broke the story established that Google DeepMind did not provide the UK AI Security Institute (AISI) with pre-deployment access to Gemini 2.5 Pro. Notably, Google did provide AISI with pre-deployment access to Gemini 3 Pro a couple of months after the letter was published.
September 2025 – Book launch partyWe held social events throughout the year, strengthening the sense of community that keeps people actively involved in PauseAI for months and years. One highlight was the book launch party for If Anyone Builds It, Everyone Dies in September.
October 2025 – Documentary Screening in ParliamentIn October we held a screening in the UK Parliament of filmmaker Michaël Trazzi's documentary about SB-1047, the proposed California AI legislation. This helped to inform MPs and Peers about the kinds of AI legislation that could be in a UK AI bill, and the battle with Big Tech that they should expect to face.
December 2025 – Westminster Hall Debate
We proposed and helped to organise a Westminster Hall debate in Parliament on AI Safety. We wrote a memo which was sent to all MPs prior to the debate and drafted some of the speeches, putting us in a strong position to work with those MPs when proposing amendments to the Cyber Security and Resilience Bill.
February 2026 – PauseCon Brussels
We delivered the next PauseCon in Brussels on behalf of Global with another two days of training workshops for PauseAI organisers from around the world.
The final day included a public conference in the European Parliament, featuring several prominent speakers, including:
- Professor Stuart Russell, author of the authoritative textbook on AI.
- Brando Benifei MEP, primary architect of the EU AI Act.
- Victor Negrescu MEP, Vice-President of the European Parliament.
- Risto Uuk, Head of European Policy at the Future of Life Institute.
Brando Benifei discussed the strengths and limitations of the EU AI Act candidly and argued that the Act is not merely a product regulation, but that the code of practice can be extended to cover internal deployment within AI companies. We hope that PauseAI will be able to work with Mr Benifei to help see such changes implemented.
Many volunteer projects were initiated over the weekend and several attendees have since held meetings with their own MEPs to follow up on the issues discussed.
February 2026 – March for AI Safety
We co-organised a march past the offices of OpenAI and other Big Tech companies in King's Cross, London. It was the largest ever protest focused exclusively on the risks of AI, with around 300 people marching and media coverage in MIT Technology Review, The Independent, The Wall Street Journal and others.
The other organisers included Pull the Plug, a new group focused on the existing harms of AI. We consider the march a great success of coalition building between the historically opposed AI ethics and AI safety interests, with PauseAI and Pull the Plug represented in equal numbers.
Theory of Change and Strategy
Creating the political momentum for a pause
Organising large numbers of citizens to boldly advocate for an AI pause will robustly help make the future go better. Public pressure for serious action on AI risks increases the likelihood of useful legislation and might be the only way that humanity avoids extinction.
PauseAI UK exists to transform loose public concern into a focused political force in the UK, and to hold that pressure in place long enough to matter. Deep buy-in across the public is necessary to overcome industry lobbying. The work of converting awareness into durable political will is the community organising work that PauseAI UK specialises in.
Our mission
The proposal on PauseAI Global's website outlines our primary policy goal. In brief, we are aiming for a global pause on AI development regulated by an international AI Safety Agency (AISA) that is responsible for determining when more powerful AI systems can be safely developed. Any sufficiently large group of countries would be empowered to veto the deployment of a superhuman AI system to ensure that, if some countries feel that they will be excluded from the benefits of AI, they have a strong negotiating position with which to demand their fair share. Or, if a group of democratic countries believe that an authoritarian country will deploy AI to oppress its own people, they can push for a deployment that empowers all citizens in every country.
Before safe AI is technically feasible, it is in the interest of all major powers to enforce the treaty globally. Once AI alignment is solved, AISA will control any superhuman AI prior to deployment and be able to use it to enforce the agreement.
Such a treaty is possible to enforce due to the highly centralised AI chip supply chain. Writing highly detailed policy proposals is not our comparative advantage, so we generally defer to other organisations such as MIRI for draft treaty texts and the precise specifications of enforcement mechanisms.
Having said that, it is very valuable for PauseAI staff to have a strong working knowledge of AI legislation and governance proposals in order to be credible in our discussions with politicians. In one instance, we wrote a summary of existing AI safety legislation for British MPs.
As mentioned above, we are in favour of other AI safety regulations, such as stronger liability for developers for AI-enabled harms. We may sometimes explicitly push for such policies, both because they increase AI safety directly and because they can be instrumental in increasing PauseAI's influence or credibility. For example, our open letter signed by over 60 UK lawmakers criticised Google DeepMind for violating the Frontier AI Safety Commitments and helped to establish our voice in British AI policy.
Positive outcomes for PauseAI UK
We cannot tell an exact story of what the path to a pause will look like, but we sketch below two potential scenarios where PauseAI UK has a positive impact. We are currently working towards realising both scenarios and in many ways they are complementary. At some point we may narrow our focus towards just one of these outcomes:
Scenario 1: PauseAI as a special interest group
PauseAI UK has 10,000 highly dedicated volunteers who act as a dominant lobbying force on AI policy matters. Whenever a politician touches AI, they receive a policy document with PauseAI's view on the matter and constant communications from PauseAI volunteers, in the vein of the US sugar lobby:
My phone did not stop ringing for the next five weeks. … I had no idea how many people in my district were connected to the sugar industry. People were calling all day, telling me they made pumps or plugs or boxes or some other such part used in sugar production and I was threatening their job. Mayors called to tell me about employers their towns depended on who would be hurt by a sugar downturn. It was the most organized effort I had ever seen.
Each MP has 20 PauseAI volunteers in their constituency who will send emails to their office and request meetings in which all 20 constituents will show up to express their views. PauseAI UK uses its Catalyse platform to coordinate its network to push the government to introduce an AI bill and ensure that it has the backing of every MP.
In the wake of an AI warning shot, PauseAI UK's volunteers contact every major British newspaper to ensure that journalists mention the idea of a global pause treaty in every major article about the incident. Protests are held every day, outside Downing Street and at any event the prime minister attends, until they initiate negotiations for a global pause treaty.
Scenario 2: PauseAI as a mass movement
PauseAI protests double in size every 7 months as AI capability itself improves exponentially. New PauseAI chapters are founded in every major UK city and many volunteers regularly put on talks in their local community to explain the risks of AI and recruit more volunteers for the movement.
At some point a significant, but not existential, AI catastrophe thrusts AI risks into the public consciousness and highlights the imminence of superhuman AI. Millions of British citizens become viscerally aware of the looming threat to their lives. PauseAI UK immediately announces a new protest and volunteers spread the sign-up page in their networks.
PauseAI UK organises a march in Westminster with 1 million attendees and dominates headlines in the British press. The prime minister is obliged to respond and commits to opening negotiations for a global pause treaty.
High-level strategy
Brand and messaging
PauseAI UK positions itself as a movement focused on the risks of human-level and superhuman AI, rather than the current harms of AI. This allows us to direct our efforts towards the most severe issues, while also letting us scale faster than movements focused on the existing harms of AI. PauseAI's strong SEO and name recognition are crucial assets because we automatically grow when more people become concerned about AI risk. This turns AI companies and the progress of AI itself into our most effective marketing tool.
A large fraction of our members have never been involved in grassroots advocacy before and we see this as a strength. It makes our protests more interesting to the media and makes the organisation more appealing to the silent majority who are not very politically active — unless, perhaps, they feel their lives are directly threatened.
We reflect our relatively moderate demographic in our messaging. We adopt a more measured tone than a typical advocacy group. Our imagery is positive and inspiring. We emphasise that we are taking the moral high ground and we represent universal, common-sense human values as part of a historic cause. This also reinforces our absolute commitment to non-violence.
Within the range of concerns around AGI, we encompass a broad set of risks. Many people will be more motivated by the threat of job automation or autonomous killer robots than extinction, because these risks are already becoming tangible and are easier to conceptualise. They are very important concerns in their own right and they are a good stepping stone towards confronting the risk of extinction. We present different AI risks as part of a single spectrum which we move along as AI becomes more powerful.
We are cognisant that building an AI movement in a context where many people have an incomplete understanding of the most severe risks requires caution and continual shaping of our message. Having our primary policy demand built into our name is a good safeguard against harmful distortions of our goals. In many contexts there are large short-term incentives to water down our demands and message, but we think this would reduce our long-term impact by moving our focus away from the most severe risks, so we are glad to have a name that commits us to a strong stance. We remain strictly non-partisan by focusing exclusively on our single issue and using politically neutral language.
Why the UK?
PauseAI is starting chapters in every country and we think that having many different countries bought into the seriousness of AI risk will be critical to the success of a global treaty. But the UK is particularly valuable compared to other middle powers because it is a centre of AI research, including the headquarters of Google DeepMind and the second-largest offices of OpenAI and Anthropic. This soft power was demonstrated with the first AI Safety Summit in Bletchley Park.
Moreover, London is a hub of AI safety, with hundreds of AI safety researchers, the largest AI security institute and dozens of related organisations. The British public and political class are more aware of the risks of AI than those in comparable nations. Correspondingly, London has more PauseAI members and has consistently hosted larger protests than any other city in the world. These protests raise the bar for AI protests everywhere and can inspire others around the world to run bigger protests themselves.
Growth
The fundamental bet of PauseAI UK is that there can be a very large and influential social movement dedicated to preventing the risks of advanced AI. Within PauseAI we already see evidence in our conversations with new members that a rapidly growing proportion of the population is truly grappling with the unprecedented danger that humanity is facing.
We model the population as a bell curve with respect to the level of evidence that each person requires to become concerned about superhuman AI. As AI improves, we expect the fraction of the curve that has crossed the threshold of concern to increase accordingly. If capabilities continue to progress exponentially, the number of people worried about the situation will also grow commensurately. However, that concern does not automatically translate into well-coordinated action. Our job is to provide the infrastructure and guidance to turn that energy into impact.
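One way to make the bell-curve picture concrete is the sketch below. It assumes the evidence each person needs is normally distributed, with a mean and standard deviation chosen purely for illustration (the post gives no numbers), and computes the share of people whose threshold has already been crossed at a given level of evidence. The function name and the evidence scale are hypothetical.

```python
import math


def fraction_concerned(evidence: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Share of the population whose evidence threshold is at or below `evidence`.

    Assumes thresholds follow a normal distribution (the "bell curve" in the text).
    `mean` and `sd` are illustrative placeholders, not estimates from the post.
    """
    z = (evidence - mean) / (sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))  # normal cumulative distribution function


if __name__ == "__main__":
    # Sweep the available evidence from well below to above the average threshold.
    for evidence in (-2.0, -1.0, 0.0, 1.0):
        share = fraction_concerned(evidence)
        print(f"evidence={evidence:+.1f}  share concerned={share:.3f}")
```

Under these assumptions, equal steps in evidence add far more concerned people near the middle of the curve than in the early tail, so the current position on the curve matters as much as the pace of AI progress.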
We do not think that convincing more of the public to be concerned about AI risks is our comparative advantage at the moment. This is both because other organisations are already dedicating significant resources to mass communications and because we think that AI progress itself will be the primary driver of our growth. We benefit from being the largest AI protest organisation and positioning ourselves as focused on the risks of future AI, which naturally funnels people concerned about those risks into our ranks.
Instead, we see our role as maximising the utility of whatever level of concern already exists in the population at any given time, so that we can get to a pause as early as possible. This means always organising the biggest protest possible, providing excellent infrastructure, onboarding and support for individual volunteers and local chapter leaders, and planning our campaigns carefully.
Since PauseAI UK began, we have seen a (very) roughly exponential growth in the size of our protests, with the number of attendees doubling approximately every 7 months. New members register every day and chapters are popping up across the UK. If we can continue and accelerate this trend, then we expect to make substantial progress towards our goals in a relatively short span of time.
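As a rough illustration of what a 7-month doubling time implies, the sketch below extrapolates attendance from a baseline. The baseline of 300 (the February 2026 march) and the target of 1 million (the mass-movement scenario) are taken from this post. The helper names are hypothetical, and the calculation is plain arithmetic that assumes the trend holds exactly, not a forecast.

```python
import math

DOUBLING_MONTHS = 7.0  # from the text: attendance doubles roughly every 7 months


def projected_attendance(current: float, months_ahead: float) -> float:
    """Attendance after `months_ahead` months if the doubling trend holds exactly."""
    return current * 2.0 ** (months_ahead / DOUBLING_MONTHS)


def months_to_reach(current: float, target: float) -> float:
    """Months needed to grow from `current` to `target` at the same doubling rate."""
    return DOUBLING_MONTHS * math.log2(target / current)


if __name__ == "__main__":
    baseline = 300  # approximate attendance at the February 2026 march
    for months in (7, 14, 28):
        print(f"after {months} months: ~{projected_attendance(baseline, months):.0f} attendees")
    months_needed = months_to_reach(baseline, 1_000_000)
    print(f"months to reach 1,000,000 attendees: {months_needed:.1f}")
```

The time to any target is very sensitive to the doubling period, which the post itself describes as only very roughly estimated.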
Short-term plans
Two key players in the advancement of safe AI governance, Brando Benifei and Stuart Russell, spoke at our conference in the European Parliament in February. We want to organise an even more ambitious event in London in the next six months. As we have done before, we will use this event to bootstrap a protest held around the same time. It is generally much easier to get initial sign-ups for a conference or speaker event, and we can direct those people to also register for a protest at the same time. The aim is to hold a protest at least twice as large as our march through King's Cross in February.
We have recently launched our volunteering and project management platform, Catalyse PauseAI UK. This is allowing us to activate new and existing volunteers more easily by presenting them with a set of actions that they can take and a clear path towards contributing to bigger and more involved projects that make a significant impact. It will enable us to better coordinate grassroots lobbying efforts and empower highly motivated individuals to launch their own projects to which they can recruit other volunteers.
When cities hit a critical mass of enthusiastic volunteers, we launch new local groups in those cities. There are three core activities for local groups to engage in:
- Deliver a standard talk designed to persuade people of the importance of AI risk. This can be given over and over at events for different audiences and in different venues.
- Lobby local politicians to support our campaigns. Seek meetings, preferably in person, and have as many members as possible meet their MP and explain their concerns.
- Advertise protests and help organise transport to get there.
We are currently working on proposing amendments to the UK's Cyber Security and Resilience Bill. One amendment that we think would be low cost and potentially impactful is to introduce a reporting pathway to the UK AI Security Institute so that they are informed about cyberattacks which use AI in novel ways. We have submitted written evidence to the parliamentary committee for the bill and we are currently contacting MPs and Peers who would introduce and support our proposed amendments.
We also just launched a campaign to call for an AI Liability Bill in the UK. We think this is the most useful legislation the UK could realistically pass in the short term and we are assembling a coalition of organisations to support the proposal.
Funding
The total cost for all of PauseAI UK’s staff and activities is currently around £100k per year. To date, PauseAI Global has paid all of PauseAI UK’s staff costs and other expenses. However, PauseAI is adopting a federated model in which national chapters operate as distinct legal entities and raise funding independently. Global is able to fund PauseAI UK until the end of Q2 2026. At the time of writing we have no runway beyond this and we are actively seeking funding to help us stay afloat.
To see a detailed breakdown of our expenses and projected costs, take a look at our Donor Prospectus. You can donate to PauseAI UK by visiting our donation page.
Learnings from starting an AI safety research team
This post’s goal is to distill our takeaways from building a research team (somewhat) from scratch over the past four months. We describe some context about our team, how it came about, and then provide some lessons learned.
Since AI safety is becoming more and more entrepreneurial, we hope this is helpful for others trying to do the same.
1. The team
We're a new alignment research team within Arcadia Impact, based in London. We’re a team of 8, working closely with members of the UK AISI alignment team. We currently have three main projects:
- Understanding model motivations. This currently looks like:
- Trying to generate documents which fully describe a model’s behaviour (given just its behaviour).
- Producing an open analysis of alignment training techniques and ways this training could go wrong.
- Doing scalable oversight for alignment. This includes validating debate protocols in practice and then trying to apply them to fuzzy alignment-relevant tasks.
- Building pipelines for doing automated alignment research.
We're also hiring for two roles! More on this at the bottom.
2. Context about how the team came about
The rest of this post is written from the perspective of Andrew Draganov (research lead & current programme manager on the team) and Erin Robertson (co-director of Arcadia).
In short, Arcadia Impact had been collaborating with AISI already, through LASR Labs and ASET. Our alignment team started by applying for the AISI alignment project funding, saying that we would hire a team of researchers to collaborate with their alignment team. Andrew was taking part in LASR at the time and was brought in to help with the application. His remit then widened as the number of things to do kept growing. Once our AISI funding was approved we began the process of hiring researchers, and also applied to Coefficient Giving for additional compute funding.
A bit about Andrew, since it bears on how replicable this is. In his words:
- I have a PhD in computer science/machine learning and was working as a postdoc in ML before doing LASR. This means I've spent a number of years talking shop about AI research, though not as many on AI safety specifically.
- I'm not very well-known in the AI safety community! I only have one first-author AI safety paper (which was reasonably well-received but nothing crazy). I mention this because "you need to be an established name to lead a research team" is a reasonable thing to assume, but it wasn't really true here.
For anyone reading this post as a template, here are some things which may be specific to our situation and might not generalise cleanly:
- We were hiring 7 researchers to start at the same time! This is highly unusual and probably not how it usually happens.
- Arcadia was already an established non-profit. We therefore already had visa sponsorship processes, office space, hiring systems, etc.
- There are fiscal sponsors which can do these tasks if you want to avoid figuring out the overhead yourself.
- The Alignment Project, run by AISI, was our initial funder. This is a non-standard funder for many reasons, including that Arcadia already had a working relationship with AISI writ large. If you're aiming to first get funded by, say, Coefficient Giving then the dynamics may be different.
- Having run LASR, we know a lot of people in the ecosystem quite well. This made hiring easier (and, indeed, over half of the team are LASR alumni).
- We're doing technical AI safety; not governance, fieldbuilding, etc.
Given the above context, here is advice which we hope is immediately actionable by people looking to start AI safety orgs.
3.1 Hiring
[Written from Andrew’s perspective]
I feel like our hiring went very well and I’m really excited about the team. But also I wasted a lot of time chasing leads that were varying amounts of useful.
For one thing, everyone wants to measure 'crackedness' but it’s unclear how to do it. On that axis, the two highest-signal parts of our process were the work test and the references; if we'd relied on only those two, I think we'd have assessed raw research ability roughly as well as we did. The interviews were helpful in addition to that, but mostly to vibecheck for fit rather than to gauge ability.
For the work test, we paid 50 applicants ~$200 each to write a research proposal. We gave them 4 hours to do this, and the deliverable was just a PDF. We then graded the proposals anonymously. This feels in line with what the work actually looks like in the age of Claude Code. We’re happy to share the work test and grading template we used if someone is interested.
Here are a few additional thoughts:
- The various AI-safety talent scouts are extremely useful when it comes to hiring. This includes research managers at the research fellowships, people at BlueDot, people at 80K, etc.
- There’s just so much talent across the top fellowships. Our team ended up with 4 LASR alums, 1 MATS, 1 Astra, 1 Anthropic Fellow.
- Most of these fellowships now have extension programmes, where good people keep doing work until they get hired. Although we didn’t hire from this pool directly, the extensioners are probably the most useful group of candidates you can target – they are already vouched for and are looking for jobs!
- I probably sent 50 cold emails trying to get people to apply. This was only useful insofar as it got me a meeting with the person (which it rarely did). If I was doing this over again, I would spend more time reaching out to various MATS, LASR, and Constellation research managers, ask them who they’d recommend, and then set up 1-1s with those people.
[Written from Andrew’s perspective]
Even though it’s clear that building a good team requires a lot of networking, it was often hard to tell which networking was “worth it” and which wasn’t. Here are the things I’d prioritise if I was doing it again:
- Obtaining an active endorsement from a well-known entity in your AI safety subfield. I claim this is the most high-leverage thing you can do when building an org, and it was very useful for us. I define an active endorsement as one in which the senior person/org is going out of their way to vouch for you and will likely work with you once you start. At minimum, a written reference from a senior person goes a long way.
- Note: Appeals to authority are lame. However, there's so much noise in AI safety and a big endorsement is immediately recognized. This helps with both funding applications and hiring. For instance, we would not have hired as effectively if we couldn’t leverage the AISI and Arcadia affiliations.
- Trialing out big-picture ideas on senior community members.
- I had 2-3 meetings a day for several months pitching senior people on ideas regarding the org (research, position within the community, outreach, various deliverables) and hearing their takes.
- These meetings were monotonically more useful as a function of how prepared I was (read: how much time I had spent understanding the other person’s worldview in advance).
- I still cringe about the first time I was describing the goal of our new org and said we wanted to do “alignment research, both technical and conceptual”, to which the person responded “so… all of it?”. But I guess these initial stumbling blocks were necessary in order to get better at talking about the ~vision~.
- Talking to funders. In some sense, funders are scary: they know their shit, expect you to know yours, and are short on time. Also, you're cold-asking for a seemingly unreasonable amount of money. However, you're on the same team as them and should try to solicit funder opinions when available. They talk to a lot of disproportionately senior people, and I found their suggestions useful as a biased distillation of all those conversations.
- Coefficient Giving[1] is also excited about ambitious proposals, so don't pre-shrink your ask (and don't agonise over salary numbers). I wouldn’t expect to get rejected over a reasonable salary ask, and a quick survey of comparable roles at similar orgs is enough to calibrate.
[Written from Erin’s perspective, with context from running LASR Labs for multiple years]
Since the team’s just started, we’re not able to claim the culture is good (also, this is not really for us to say). Instead, here is how we thought about the process of establishing team culture prior to people joining. Parts are heavily influenced by the way this is done for LASR cohorts:
- Onboard everyone at once (or failing that, hold a retreat). Bringing people in together is a clean chance to set common norms and the way we want everyone thinking from day one. If you can't start everyone at once, then it’s useful to run a retreat at some point. This looks like letting people become friends, working on strategy together, and making concrete values.
- For example, we wanted the team to think about our communication strategy, so we ran a session exploring how comparable orgs disseminate their work and left with concrete intentions for our own.
- Get the team to shape the strategy. We hired people based on them having good judgement, so we spent some time together figuring out our priorities. Specifically, we gave people a list of possible agendas and projects, spent the first week thinking hard about which to focus on, and built teams around people’s preferences.
- Set expectations. Collaborators, employees, and advisors all need to know what's being asked of them and how to thrive in their role. Be concrete early about time commitments, what good work looks like, the values you want people building, and who owns what.
- Have two distinct management goals. Reviewing success on tasks, and making people better at their job (e.g. coaching, habit forming, feedback). The second is often overlooked in early-stage teams but is an important way to keep the team happy and improve the productivity of the team over time.
We're hiring! Specifically, we're looking for an Alignment Programme Manager, a senior generalist to help build and run the team. We're also hiring a Communications and Operations Associate to shape how our research reaches stakeholders and to keep the team's operations running. Both will be based at the LISA office in central London, with visa sponsorship available.
If you think your skills don’t fit neatly into one of these descriptions but you think you’d be a good fit, please apply – we are flexible on the exact role and are more interested in finding good candidates! The deadline for applications is June 23rd.
Similarly, if you're working on related topics, please reach out! The easiest option is to send an email to andrew[at]arcadiaimpact[dot]org.
[1] Disclosure: Erin is joining the Coefficient Giving Technical AIS team full time at the end of June and is currently part time there.
Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks
This is a write-up on preparing for warning shots to catalyze international cooperation on AGI risks, and the corollary list of projects one could pursue. We argue we must first (1) understand types of warning shots, then (2) prepare to catch them. We must stay vigilant: both to (3) avoid getting 'frog boiled' by AI labs, and to (4) ensure that the warning shot is generalized to the overall danger of AGI. Lastly, we must (5) prepare good policy responses and the ground for them to land, and (6) seize the first-mover advantage when the opportune moment comes.
This yields the following list of promising projects one could pursue:
- Developing a theory of warning shots based on existing precedents (e.g. GPT-3.5 and Mythos releases), and past analogies: developing a typology and predicting likelihoods.
- Building infrastructure to catch a warning shot when it happens. (For example, Ajeya Cotra talks about recurring internal intelligence explosion evals of AI labs.)
- Creating strategies to avoid gradual numbing / sleepwalking through a warning shot (presumably there are lessons from e.g. communication on climate change).
- Building preliminary infrastructure (institutes, think-tanks, lobbying parties...) that lays good ground for when a warning shot and its policy response happen. For example, seeding the right world models among the public and policymakers, so that when a warning shot happens they can understand the long-term implications (i.e. AGI dangers).
- Mapping out good, medium, and bad policy responses based on warning shot types.
- Building consensus in the AI safety community on which policies to jointly advocate for and against.
- Building infrastructure for lobbying the right parties at the opportune moment. For example: what are the key decision-makers for each warning shot - policy response pair? Which organizations / individuals have the most legitimacy among these politicians and policymakers?
Two Important Notes: This is not a call for people to induce a warning shot.[1] AGI governance strategy should not over-rely on warning shots.[2]
1. Understand Warning Shots
The AI Safety community currently places a lot of hope on warning shots inducing international cooperation on AGI risks. It would be useful to better understand the dynamics that lead from warning shot to international cooperation. How likely are we to get a warning shot prior to unacceptable risk? How significant would the warning shot have to be, and what other conditions must be met to open a policy window for international cooperation? How do we strike the right balance between attempting to galvanise action after every minor “warning shot” (at the risk of being dismissed for “crying wolf”) and waiting for a major event (at the risk of acting too late)?[3]
Warning shots could include alarming safety evals, the release of a strong AI agent (another “ChatGPT moment”), widespread automation of white collar jobs, minor/major accidents, misuse incidents, etc. To make the most of warning shots, it would be useful to characterise different types of warning shots, predict how likely they are to occur, and anticipate what the expected public / policymaker response is likely to be for each type.
A useful frame is Kingdon’s three streams model. A warning shot mostly affects the “problem stream”: it makes some latent risk suddenly feel real. But international cooperation on AGI risks will only become plausible if the “policy stream” already contains credible proposals, and the “politics stream” contains enough elite, public, and institutional support. The practical implication is that warning-shot preparation cannot just mean “better messaging after the event.” It requires pre-building policy options, coalitions, legitimacy, and channels to decision-makers.
2. Prepare to Catch Warning Shots
Preparing to catch warning shots requires a detection stack: capability evaluations (especially ones that detect labs entering an intelligence explosion), alignment evaluations, incident reporting, compute and deployment monitoring, whistleblower channels, and more. For certain types of warning shots, we will only get a timely warning if we build such infrastructure beforehand.
The AISI network could become the institutional backbone for warning-shot detection. UK AISI was founded on the mission of "minimising surprise" from rapid and unexpected advances in AI, which is almost exactly the institutional version of “catch warning shots early”.
3. Avoid Getting "Frog Boiled"
The release of ChatGPT served as a wakeup call because it caught people off-guard. With AI labs releasing new, incrementally more powerful models every week, we risk reaching dangerous capabilities without this resulting in a single, clear warning shot. Similarly, different organisations publishing a steady stream of increasingly disconcerting safety evals may be less impactful than e.g. the network of AISIs publishing a prominent report every half year which summarises the results of all safety evals.
Rachit Dubey have run large-scale experiments showing that humans "continuously reset their perception of 'normal' every few years" — incremental changes don't trigger alarm even when cumulative changes are dramatic. Their key intervention is presenting data in binary rather than continuous form (lake-froze-or-not, rather than temperature curves), which produced significantly higher perceived urgency.
4. Ensure AGI Problem Generalization Occurs
When a warning shot occurs, there might be societal and commercial pressures to portray this as a “bounded issue” specific to a certain AI model, company or situation. We should communicate effectively to ensure it is understood as a broader danger of AGI development.
Communications research (Entman, Iyengar) consistently finds that whether an event is interpreted as episodic ("one bad actor / one bad model") or thematic ("a systemic property of AI development") is largely set in the first 48–72 hours by the dominant frame in elite media. This could be a tractable advocacy target: prepare frame-setting materials and relationships in advance, so that they could be presented within the news cycle of a triggering event. A warning shot lost to the episodic frame could be hard to recover.
5. Prepare Good Policy Responses, and Good Ground for Them to Land
We need to have shovel-ready policy blueprints available when a warning shot does happen. The best options sit on the Pareto frontier of three goals: mitigating AI x-risk, being highly memetic for policy communication, and building consensus in the AI safety community.
Yet for those to succeed, most of the storytelling must come before the warning shot. If no communication is done beforehand, people have no world model of how the warning shot connects to the dangers of AGI, so it passes them by without its implications being understood. Holly Elmore has a good post emphasizing this.
Besides storytelling, there needs to be a broader set of infrastructure, laying the ground for good policy responses to land. Ben Norman particularly looked at what it takes for warning shots to translate to international cooperation. Reviewing cases from Three Mile Island to COVID-19, he identifies five conditions that tend to be in place when an event actually leads to international agreements: pre-existing institutional capacity, clear attribution, transnational harm, aligned political incentives, and ready-made solutions. AI scores poorly on most of these, which suggests the community should treat smaller warning shots as opportunities to incrementally build the scaffolding any future agreement would need to land on.
6. Seize the First-Mover Advantage at the Opportune Moment
We should anticipate likely bad reactions, communicate effectively on why these are in fact bad ideas, and capitalize on first-mover advantage when a warning shot happens to push good policy proposals instead.
First-mover advantage matters because the first plausible interpretation of a crisis often becomes sticky. A warning-shot playbook should specify what happens in the first 72 hours, the first week, and the first month: who drafts the public explanation, who briefs policymakers, which validators are activated, which policy ask is pushed, which bad reactions are pre-butted, and which international counterparts are contacted.
Suppose we had clear evidence COVID was the result of a lab leak. That same warning shot could plausibly produce very different outcomes depending on which interpretation sets first. One is an international agreement to halt gain-of-function research, along with much stricter safety requirements for labs that pursue it. Another, just as plausible, is countries accelerating their own programs to capture the demonstrated power of the technology, while becoming more secretive to avoid PR disasters. Which of these locks in depends largely on whether someone is ready in the first 72 hours with a credible interpretation, a concrete ask, and the relationships to get both in front of the right people.
Justin Shovelain accumulated the list, Thomas van Damme made an early draft, while Mark Kagach and Elias Schlie wrote the final version.
Thanks to Ben Norman, Richard Mallah, Holly Elmore, and others for valuable input.
- ^
Warning shots are frequently tragedies; we do not want them to happen. Our job is both to prepare to respond well and to prevent them.
- ^
Best governance strategies are viable without a warning shot ever happening. Excessive dependence on one is a common failure mode in AGI governance strategies.
- ^
"Risk Awareness Moments" (Rams): A concept for thinking about AI governance interventions
My research agenda and work
This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working full-time on alignment for three years and change, and thinking about brainlike AGI and its alignment increasingly often since 2004.
Here's the research agenda in one breath: I'm trying to predict what the first transformative AI will be, in enough mechanistic detail that we can predict likely failure modes of its alignment. That's in service of finding interventions that address those failure modes efficiently, so that they can realistically be implemented even if timelines are short and work is rushed. I'm using my background in computational cognitive neuroscience to predict what might be called loosely brainlike AGI: LLMs with added human-like cognitive capacities.
I'll give a summary in the rest of this section, then give a little more depth on each major thread of my work in the remaining sections. All of it is pretty brief.
Approach and premises
Most alignment work falls roughly into one of two broad categories: empirical study of current systems ("prosaic alignment"), or theory about idealized agents ("agent foundations") (with much variation and many notable exceptions). There are two assumptions implicit in these approaches: the first is that the AI we're trying to align will be like current AI. The second is that we have no idea what the AI we need to align will look like, so we must work on the fully general problem. I think there's a neglected third option: carefully predicting properties of the first AIs in which it's really important to get alignment right. I'm trying to make alignment plans that both anticipate new difficulties[1] and take advantage of the strengths of LLMs.
I'm trying to predict how LLMs might be augmented to reach takeover-capable AI (TCAI),[2] beyond scaling the current approach. I'm looking at how developers might add systems for continuous learning and executive function, and how those would create new challenges for alignment. I present the case that this is the most likely route to the AI competent and agentic enough to control the future, and therefore the type of AI most crucial to align. (I don't think this is certain; something like Steve Byrnes' “Brain in a box in a basement” could overtake it, given some otherwise-helpful limitations inherent in the LLM approach).
I'm pleased to see more people reasoning explicitly and mechanistically about the transition from current LLMs to transformative and takeover-capable LLM-based AGI. There is visibly more now than three years ago.[3] I think there's still too little of it for comfort. Of course no one is a pure empiricist or pure theorist, but in practice the field still seems fairly strongly divided by perspective, although the delta is encouraging.
My background gives me an unusual angle on predicting the form of our first TCAI. It's in "computational cognitive neuroscience", a rare type of integrative work. I worked for 23 years in a lab that built neural network models of brain function, ranging from basic attention and vision, to attention for executive function, to System 2 serial processing for complex thought and decision-making. These models were in service of integrative secondary research, a fancy name for reading a lot of empirical work and theory, and thinking hard and collaboratively about how it all fit together. I was lucky and privileged to spend most of my time reading and thinking about how brain systems come together to produce human thought and knowledge, and the computational principles of why it works as it does.
I think it's likely (50%-90%, uncertain) that the first TCAI will be advanced LLMs, scaffolded and trained to have cognitive faculties they now lack relative to humans. This is a specific bet, but I think some insights from taking this perspective apply to broader forms of network-based AGI.[4] The missing faculties are principally some degree of persistent memory/learning, metacognition for uncertainty and error detection, and executive function for planning and thought management. Together these will enable autonomous, goal-directed, long-horizon work. I think the first TCAIs will probably be aligned much like current systems (RLHF, constitutional AI/deliberative alignment, character training), with some modest additions (§2.2), but those same alignment techniques will produce a different final alignment because of emergent effects of the added cognitive capacities.
That is a bet on what might be called loosely brainlike cognition. Current LLMs are already surprisingly brainlike in some ways.[5] LLMs are arguably much like humans' cortical areas for language (Broca's & Wernicke's areas). I think the brain's abundant recurrence serves a similar function to transformers' architecture of attention modulating connections from past serial token processing. Of course this mapping is rough, with many important differences. One is that LLMs capture more semantic knowledge or crystallized intelligence than human language cortices.
On this perspective, LLMs have a route to human-like competence with more human-like systems, training, and thinking. And the incentives are driving people to create those rapidly. Standard LLM scaleup receives the most effort, but substantial work is underway on novel systems and training methods for each of those missing elements of human cognition (reviewed in the work I summarize in the next section).
Philosophy of the approach
Accurate, detailed predictions of how we'll try to align AGI would allow efficient use of limited alignment time and effort. My primary theory of change is that spreading clearer thinking about likely paths to AGI and alignment will improve our average efficiency and foresight on crucial alignment issues.
This research agenda is messier than focusing on specific technical problems or approaches, or assuming rather than predicting a particular form of AGI/TCAI. This first-principles, all-inclusive approach has substantial downsides, but seems like a useful bet in conjunction with those types of more focused work.
I think my work takes relatively neglected approaches. Gears-level models (in this case, of likely TCAI architectures) can be considered expensive and so rare Capital Investments. Such work may also be difficult to automate and so get little developer attention at crunch time (but see §4.2 on AI for epistemics). These issues do receive attention, and increasingly more, but I think they're still neglected relative to their potential impact.[6]
The forms of first TCAI and alignment efforts will in part be shaped by commercial, social, and political forces. Thus, my work has spread somewhat to these topics, since work there seemed severely lacking when I started working on the problem.
I'll discuss the technical prediction work first, since that's where the bulk of my work has gone. Then I'll describe a little of my cognitive neuroscience background in brief, with a little more optional detail for those who might find it as fascinating as I do. Then I'll describe some of my constellation of work predicting and analyzing the implications for alignment of a few other factors: shifting societal and governmental attitudes toward AI; choice of alignment targets; and motivated reasoning and confirmation bias acting upon the public and the AI development and alignment research communities.
2. Technical work
My main work is in what might be called semi-technical alignment and predictions. It's making, refining, and sharing gears-level models of likely first TCAI, alignment techniques likely to be used for it, and likely failure points and risk models. The theory of change is spreading those models and claims as broadly as possible through writing and discussion, so that we've collectively thought more about the specifically relevant questions when crunch time hits.
2.1 Predicted paths to TCAI
TCAI won't work like a human brain, but it will likely incorporate some elements of human cognition. Agentized LLMs will change the alignment landscape was a broad and brief prediction of the new shape of the alignment problem, as I and others were seeing agentic, scaffolded LLMs become the most likely route to first AGI.
Capabilities and alignment of LLM cognitive architectures was a more specific and detailed prediction of how other cognitive systems could be added to fill in the cognitive abilities that humans have and LLMs lack. I still think this is roughly correct, and progress and others' analysis have borne this out. So I'll restate the main elements of cognition that I think stand between current LLMs and human-plus performance in agentic settings.
Memory (continuous learning)
Episodic memory is now becoming useful, in the form of notes in coding scaffolds and vector memory systems in OpenClaw. Fine-tuning functions much like human semantic memory. The later, more complete treatment is in LLM AGI will have memory, and memory changes alignment, where I review the reasons to expect added memory systems before takeover-capable AGI. Such beyond-session memory or learning during deployment creates an alignment stability problem (§5.2).
Executive function and metacognition
This is the other major direction I see distinguishing LLM cognition from fully human-level competence. Scaffolding for task structure, plans, and long-term goals is just getting off the ground in coding scaffolds and agentic scaffolds like OpenClaw. The later, more complete statement is in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. I'm now hoping that better metacognition will produce better AI for epistemics and automated alignment, faster than it accelerates capabilities. This will of course be tricky to aim for, and it's not clear how hard anyone is trying.
My early work in 2023 focused on scaffolding LLMs into "language model cognitive architectures" (LMCAs). This turned out to be largely wrong about speed and ordering. AutoGPT and similar systems with vector-based memory systems and prompting for executive function did not quickly become useful. But these predictions were correct about the heavy focus on chain-of-thought or System 2 as a step from early LLMs to human-like cognitive competence.
2.2 Predicted paths to (mis)alignment
Internal independent review for language model agent alignment is another likely alignment technique for such scaffolded LLM systems. I called this internal independent review because it's internal to the agentic scaffold or harness. But it's independent, because a separate model instance could be called to review actions or plans before they're executed. This approach would span the line between AI control and alignment.
A review process, standing between an agent's main LLM "cognitive engine" and actions-you-don't-like, such as writing fanfic porn on your social media account, or making plans to overthrow humanity. From System 2 Alignment.
This could be a nontrivial addition to training-based LLM alignment, because such review happens before actions affect either the world or internal memories and beliefs. This approach still seems promising and likely to be included as we near AGI. Auto mode in Claude Code and Codex is a minimal version. Such a review process could probably be circumvented by a highly intelligent base LLM, but it could be useful as part of a hodge-podge approach to an "alignment MVP" that is weakly and spikily superhuman, and could aid with automated alignment research.
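To make the control flow concrete, here is a minimal sketch of a review gate sitting between proposed actions and execution. It is an illustration under stated assumptions, not the design from the posts above: `Verdict`, `keyword_reviewer`, and `run_with_review` are hypothetical names, and the keyword check is a stand-in for what would in practice be a call to a separate model instance.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    approved: bool
    reason: str


# Hypothetical type: a reviewer maps a proposed action (as text) to a Verdict.
Reviewer = Callable[[str], Verdict]


def keyword_reviewer(blocked_terms: List[str]) -> Reviewer:
    """Stand-in reviewer. A real scaffold would query a separate model instance here."""
    def review(action: str) -> Verdict:
        lowered = action.lower()
        for term in blocked_terms:
            if term in lowered:
                return Verdict(False, f"matched blocked term: {term!r}")
        return Verdict(True, "no blocked terms found")
    return review


def run_with_review(
    proposed_actions: List[str],
    reviewer: Reviewer,
    execute: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Execute only approved actions; return (executed, rejected-with-reasons)."""
    executed, rejected = [], []
    for action in proposed_actions:
        verdict = reviewer(action)
        if verdict.approved:
            execute(action)  # side effects happen only after approval
            executed.append(action)
        else:
            rejected.append((action, verdict.reason))
    return executed, rejected


log: List[str] = []  # stands in for "the world" that actions affect
reviewer = keyword_reviewer(["delete all", "overthrow"])
executed, rejected = run_with_review(
    ["summarize inbox", "delete all backups", "draft reply"],
    reviewer,
    log.append,
)
```

The point of the structure is ordering: the reviewer sees each action before `execute` is called, so a rejected action never touches the log.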
Goals selected from learned knowledge: an alternative to RL alignment explored the implications of aligning pretrained LLMs, rather than performing alignment purely through RL. This refers to a whole class of alignment techniques, but centers on prompting LLMs as one means of aligning them. This sounds laughably weak as an alignment solution, but it still looks likely to be part of the first human-plus LLM AGI's alignment approach. Coding scaffolds maintain plan and subgoal lists, and inject current goals as prompts at intervals. The important perspective is that alignment will be an emergent property of a system, with the alignment of the core LLM an important part of the composite system's alignment, but not the only part.
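As a rough illustration of goals being injected as prompts at intervals, the sketch below prefixes a goal reminder to every few turns of an agent's input. The function name `build_prompts`, the `every` parameter, and the reminder wording are all hypothetical; real coding scaffolds may handle this quite differently.

```python
from typing import List


def build_prompts(turns: List[str], goals: List[str], every: int = 3) -> List[str]:
    """Prefix a goal reminder to every `every`-th turn, starting with the first."""
    reminder = "Current goals:\n" + "\n".join(f"- {goal}" for goal in goals)
    prompts = []
    for index, turn in enumerate(turns):
        if index % every == 0:
            prompts.append(reminder + "\n\n" + turn)
        else:
            prompts.append(turn)
    return prompts


prompts = build_prompts(
    ["read the bug report", "inspect the stack trace", "write a patch", "run the tests"],
    goals=["fix the crash without changing the public API"],
    every=3,
)
```

Here the first and fourth prompts carry the reminder and the two in between do not, which is the sense in which the system's behavior depends on the scaffold as well as on the core LLM.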
System 2 Alignment: Deliberation, Review, and Thought Management expanded on independent reviews for alignment, and related techniques. Writing this piece convinced me that techniques for capabilities will also provide new alignment techniques with low alignment taxes. It's valuable for efficiency and reputation to control models' thoughts and behaviors. That can be done with scaffolding and/or training to monitor and review actions, plans, and chains of thought. Extra efficiency is increasingly important as long-horizon tasks consume more expensive compute. The current scaffolding for plans and to-do lists in coding scaffolds like Claude Code and Codex is a nascent form of this type of scaffolding. I predict that these techniques will be expanded when Mythos and similarly capable models are released. I also looked at some ways these techniques could be circumvented by a scheming (deceptively aligned) base LLM.
LLM AGI may reason about its goals and discover misalignments by default and Broadening the training set for alignment are recent posts exploring these predictions and alignment proposals, but I'll discuss them in §5 because they span predictions and problems. First I'll summarize my previous half-career, because it's the basis of my approach to alignment, and the source of many of my claims about likely AGI and alignment approaches.
3. My research background in computational cognitive neuroscience
I joined a PhD program in 1999, and worked in that same lab until mid-2022. We built integrative models of how brain systems produce human thought; I focused on complex, serial thinking. We called this computational cognitive neuroscience, since we were trying to work across computer science, cognitive psychology, and various branches of neuroscience. We built theoretical and computational models of attention, executive function, serial deliberation, decision-making, episodic and working memory, and models combining those systems.
The point of that work wasn't the models themselves; it was the computational principles they illustrated, the why of human cognition. I spent most of my time reading and thinking about how brain systems come together, rather than building elaborate networks. I'm applying a similar research strategy to the similarly-difficult problem of understanding how future AGI will work and be aligned.
I hope this perspective gives me some relevant comparative advantages. One might be in seeing which cognitive functions LLM agents currently have and lack, relative to humans. Another could be seeing the emergent effects that adding those capacities could have on capabilities and alignment.
My neuroscience work included some theories I've never written about before, since I was worried they might accelerate progress in brainlike AGI (I now think they wouldn't and won't, since LLMs have eclipsed brainlike approaches). For the interested reader, I'll give a high-level summary of that work, on attention as control of thought. I also summarize some of my more directly LLM-relevant work on how serial chains of thought are created and managed by a brain that works mostly in parallel. I think it's interesting, but I'm pretty biased!
Selected specifics of my computational cognitive neuroscience career and work
My PhD advisor and lab head was Randy O'Reilly. His advisor was Jay McClelland, co-author of the Parallel Distributed Processing volumes (1986 & 1987) that established "connectionism" as an approach in cognitive psychology and neuroscience. They used small neural network models of brain function to inform then-new theories of cognition. This is based on the claim that the neural network basis of the brain should constrain theories of thought, and should have observable effects in behavior.
Today, every neuroscientist probably agrees with and uses this viewpoint to some degree. People like Gary Marcus would still say it doesn't explain the most important parts of cognition. I think he and other dedicated critics of the approach overstated the case, but made one very good point. I progressively turned my studies to how neural networks' fundamentally parallel operation becomes the fundamentally serial operation of System 2 cognition, roughly chain-of-thought in humans.
Randy's and the lab's approach was connectionist, but more closely based on empirical neuroscience. We read papers on everything from ion channel action to think-out-loud protocols where people try to introspect on their cognition, and a vast variety in-between. Randy used neural network models as his "empirical" research methodology. These computational models served to summarize a lot of our neuroscience knowledge, in the form of their architecture and parameters and learning rules. We'd publish results from the computational models we made. But the point wasn't really what the models could do (although they were early learning network models, and what they could do was sometimes impressive for the time). The point was the computational principles they illustrated: the hows and whys of human cognition.
I'll just briefly mention a few of my publications, to give a flavor of what I studied. My dissertation was on the neural mechanisms of visual search. It included some basic behavioral experiments, but it was mostly a computational theory derived from reading tons of literature on how the visual system functioned, particularly the mechanisms of attention. I wasn't particularly interested in vision, but there's been a sort of path-dependent accrual of empirical evidence in the visual system, so it's the area in which you can best constrain theories with evidence.
This added detail and solid empirical backing to the theory (not really mine originally) of how attention is an emergent effect of neural networks, and it controls what we do and what we think about by essentially the same mechanisms by which it controls what we perceive. This is a pretty central mechanism for how the brain seems to pull off complex thought; it attends to (very roughly) one thought at a time for maximum processing power on each one, then serializes those into more complex chains of thought in System 2 processing.
My interest in complex cognition led me to focus on elaborating in more detail how System 2 serial processing was generated from a fundamentally parallel neural network architecture. The central answer I reached was a dynamic in which the brain naturally proceeds from one "thought" or brain-state to another, related one. This dynamic is created in a neat, simple way: neurons fire continuously to represent information; they "get tired" of firing by depletion of resources; this lets new neurons representing new information or "thoughts" start firing, but which ones are a product of the complex pattern of information in the last brain-state. This allows attractor dynamics, and serial progression or "chain of thought". But this theory of how thought works was too vague to publish, and didn't fit our funding. Later, I decided not to publish it at all, to avoid speeding progress toward AGI. (I'd believed brainlike AGI was possible within my lifetime, and that alignment is crucial and tricky, since encountering Eliezer's arguments around 2004.)
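A toy sketch can show how fatigue alone turns a stable state into a sequence. This is not the unpublished theory described above, only an illustration of the stated ingredients: each "thought" is a single unit, `assoc` is a made-up association matrix in which self-association is strongest (so a state would persist if nothing tired), and `fatigue_gain` and `recovery` are assumed parameters.

```python
from typing import List


def chain_of_thought(
    assoc: List[List[float]],
    start: int,
    steps: int,
    fatigue_gain: float = 1.0,
    recovery: float = 0.5,
) -> List[int]:
    """Return the sequence of active units under fatigue-driven switching."""
    n = len(assoc)
    fatigue = [0.0] * n
    current = start
    chain = [current]
    for _ in range(steps):
        # Every unit partially recovers from earlier firing.
        fatigue = [level * recovery for level in fatigue]
        # The currently active unit tires from firing.
        fatigue[current] += fatigue_gain
        # The next state depends on the current state's associations, minus fatigue.
        scores = [assoc[current][j] - fatigue[j] for j in range(n)]
        current = max(range(n), key=lambda j: scores[j])
        chain.append(current)
    return chain


# Hypothetical associations: each unit most strongly supports itself (0.9),
# then one particular neighbor (0.6), then the remaining unit (0.1).
assoc = [
    [0.9, 0.6, 0.1],
    [0.1, 0.9, 0.6],
    [0.6, 0.1, 0.9],
]

with_fatigue = chain_of_thought(assoc, start=0, steps=6)
without_fatigue = chain_of_thought(assoc, start=0, steps=3, fatigue_gain=0.0)
```

With fatigue switched off, the state sits at unit 0 indefinitely; with it on, the state steps through 0, 1, 2 and back around, with each successor picked by the previous state's associations.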
But I could work on it even if I didn't want to publish my theories on the elegant simplicity of human thought. Neuroscientists love mechanistic detail and don't focus on comprehensive theories. In 2013 I was first author on a collaborative paper called strategic cognitive sequencing (I hadn't yet applied the System 2 or chain of thought terminologies). That was a collection of small computational models that illustrated how brain systems could come together to strategically select next thoughts, and so do useful cumulative cognitive work.
This theory was better-developed in neural mechanisms of human decision-making in 2021. Selecting the right "next thought" is often a mini-decision. These quick, strategic "internal actions" are superimposed on the "automatic," emergent means of selecting a topic-relevant next thought through the attentional dynamics of constraint satisfaction in a laterally connected network (a fascinating topic too large for the margin;). This happens in a system for internal action selection that's a rough copy of the system that controls selection of motor actions.
That set of theories is indirectly relevant to predicting how LLM agents will work, because I think it might be about as close as anyone has gotten to understanding how the human brain does its most impressive tricks (at the level I've focused on, roughly between Marr's algorithmic and computational levels). There are deep reasons we do complex cognition that way. It's totally possible for a cognitive system to do it some other way. But LLMs are following the same core strategy for cognition, so their adaptation for complex, strategic cognition might look the same.
I also did some work on models of episodic memory. Those include theories of how the brain deals with catastrophic forgetting; these are old news in cognitive neuroscience, but that doesn't seem to have carried over to AI. Basically, we accept a lot of catastrophic forgetting because we have to; this is the source of LLMs having vastly better recall of detailed information than we do.
But there's another trick: using a separate system (the hippocampus and medial temporal lobe) to learn fast without a high learning rate on the whole system, so that new knowledge can be stored but sequestered, to see if it's important enough to learn (and thereby overwrite previous memories). We do this by replaying important memories from the episodic memory store, thus "consolidating" important memories into our general semantic knowledge, at a speed and depth appropriate to their estimated importance. This also has counterproductive side effects, including much of our trauma or excessive fear responses.
I collaborated on a lot of neural network modeling work, and spent a lot of time trying to deeply understand how they worked (they used a biologically realistic variant of backpropagation, so their learning dynamics are reasonably similar to transformers' backprop learning; that analogy is fascinating but again too large for the margin here;). But I didn't spend a lot of my time constructing and tinkering with complex networks; mine tended to be more like animated diagrams for a computational concept. This allowed me to spend more time reading and thinking. I let my more detail-oriented (or IMO obsessed!) colleagues do more of the elaborate modeling that was trying to achieve cutting-edge (at the time) results.
Unlike my mentor Randy, I thought spending tons of time running neural networks and getting them to do what you want was largely a waste of time compared to more reading and careful thinking. But they were also our token "empirical" result, in an awkward way; the "coin of the realm" that let us do so much work on theory, while sort-of publishing legible proof-of-progress. So I was fortunate and Randy was generous in letting me spend most of my time reading and thinking. (My thesis was particularly weak on empirical research, so I adopted a proof-of-work-by-sheer-mass-of-review-and-writing approach. I don't really recommend trying to read it;).
4. Societal influences on AI safety
I've also done some work trying to analyze the likely societal dynamics around AGI. Practical constraints change which alignment solutions are actually implementable, and therefore useful. This logic for putting effort into understanding and predicting the societal structures surrounding development of AGI is closely related to my logic for putting effort into predicting likely properties of first TCAI.
I'm of course not an expert or even a serious amateur in politics, business, or the dynamics of public opinion. I've scaled back the time I spend on this area, as domain experts have started to address these questions in the context of AI progress and AGI x-risk. However, I think there's still useful work for generalists with real understanding of AI and its potential development trajectory to make predictions in this space. Accurate predictions depend on interactions among domains (political processes, business, public opinion, and AI progress), so everyone is working outside of their expertise.
4.1 Government and public opinion on AI progress
One important question in this domain is the extent to which governments vs. business will make crucial decisions on development of TCAI and its alignment. In Whether governments will control AGI is important and neglected, I analyzed how government control could shift the alignment and AI risk challenges. It would likely reduce between-lab race dynamics, at the cost of accelerating a race with China. I explored the downsides further in Fear of centralized power vs. fear of misaligned AGI. I now think government control before TCAI is likely, and international cooperation with China is plausible; but those are loosely held. I hope to see more and better analyses; like many areas of alignment, it seems like we could understand the potentials far better than we currently do.
There are also more detailed ways that society impacts alignment solutions. I think there's been an interesting blindspot going two ways: most of the public, including experts, assumes that AI is roughly static; they're not accounting for progress. Meanwhile, most of the AI safety community is assuming that public opinion is static, and won't change when AI changes. I think public opinion will change, like it rapidly changed from ignoring to obsessing about COVID. And I think those changes will have important impacts on AI safety.
I explored this in AI scares and changing public beliefs, focusing on public beliefs about AI risks becoming polarized in the US like climate change did. Avoiding partisan polarization on AI seems like a very high priority; partisan conflict is a powerful motivating factor. Currently, opinion across US parties seems roughly equivalent on AI job loss and existential risk, even though party leadership and talking points disagree sharply on current AI.
In A country of alien idiots in a datacenter: AI progress and public alarm I predict that seeing semi-competent agents and real job loss will rapidly change public and government attitudes toward AI. I hope this could dramatically shift the political landscape in the 2028 election, bringing in an anti-AI administration that could meaningfully slow progress and assert control over AI safety issues, including alignment. I'll be addressing this further in a draft article.
Alignment solutions will be deployed in particular ways depending on the developer. I don't know a thing about predicting public or governmental responses to unusual events, but I've done a little anyway, because barely anyone else seemed to have thought about it carefully at the time (that gap is filling rapidly, which is encouraging). In If we solve alignment, do we die anyway? I wondered if proliferation of intent-aligned (or corrigible or instruction-following) AGI might create competitive dynamics along with new strategies and superweapons that might be almost as risky as misaligned AGI.[7] This substantially shifted my hopes back from instruction-following to value-aligned AGI, since that keeps ASI out of the control of dangerous individual humans; see the next section.
4.2 AI progress and epistemics
Another interesting factor in how the first takeover-capable AI will be developed and aligned is the contribution of near-future AI that will aid in those projects. In Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities, I analyzed how we might improve AI for epistemics by differentially improving metacognition for identifying uncertainty and correcting errors. This would reduce the risk John Wentworth has called The Median Doom-Path: Slop, not Scheming, in which advanced AI helps us produce alignment solutions we can't evaluate, and helps us convince ourselves they will work, acting as yet another amplifier of our motivated reasoning and confirmation bias.
Those biases also contribute to dynamics in the communities that are building and aligning AGI. Those biases and dynamics may be creating critical blind spots in theories of AI progress and alignment. I have long been interested in motivated reasoning, confirmation bias, and the philosophy and reality of scientific progress. I most recently wrote about Motivated reasoning, confirmation bias, and AI risk theory. This is my attempt at making a comprehensive and quantitative estimate of the total impact of those biases. I couldn't find another source that really tried to make a good, explicit estimate, so I wound up doing a lot of research and integration. I tried to at least loosely understand and review empirical work on six distinct sources of confirmation bias, and estimate their cumulative effects (for they often stack) in areas relevant to alignment.
Those estimated total effects are large, even for careful thinkers who are aware of and try to control those biases. Even conservative estimates are startlingly large, when we account for inevitable double-counting effects of integrating others' expert opinions into our own best estimates. Overconfidence is large and common, and the majority who experience it also seem to have the loudest, and most polarizing, voices.
Based on that analysis, and my read of the field, I believe that motivated reasoning and confirmation bias are polarizing beliefs about AI progress into camps of optimists that feel emotionally in conflict with pessimists or "doomers". This creates a very large risk that politicians and others with power over AI development will simply choose whichever beliefs they find convenient in the short term, since there's an available pool of "careful expert opinions" for either need.
5. Alignment targets
Technical alignment is how to aim AIs; how we make future AI do what we choose. Alignment targets are what we aim at. Choice of target can constrain technical approaches, and vice-versa. I have written about two such interdependencies between technical alignment approaches and alignment targets.
5.1 Corrigibility, DWIMAC, or instruction-following vs. value alignment targets
The first is a series of analyses of the virtue of what might be called corrigibility-first alignment targets (also called do-what-I-mean-and-check (DWIMAC), instruction-following, or intent alignment). This could be framed as trying to create TCAI that stays under human control. I compare these to value-alignment as a target; that is, creating TCAI and ASI that will not need to take our input because they know what we want better than we do.
That series, from first to last:
- Corrigibility or DWIM is an attractive primary goal for AGI
  - Briefly explored the issue as I was starting to understand it
- Instruction-following AGI is easier and more likely than value aligned AGI
  - A deeper and more thorough analysis
  - analyzed more exhaustively by Max Harms in CAST: Corrigibility as Singular Target
- Conflating value alignment and intent alignment is causing confusion
  - Addressed definitions and common miscommunications
- Problems with instruction-following as an alignment target
  - e.g., hypothetical future instructions must be highest priority.
As with all research, I recommend reading the most recent (Problems) first, since I've refined the explanations and ideas over time.
This choice of alignment targets is pretty clearly crucial for both choosing technical alignment approaches for maximum impact, and for our ultimate odds of success. DWIMAC or instruction-following alignment mitigates many deep problems with value alignment, if it's used even modestly wisely; the AI can be instructed to be maximally helpful and honest, aiding in understanding the risks of various instructions and approaches. There are of course problems, but I weakly believe these are easier in sum than those of value alignment. However, this approach has one enormous downside: it leaves humans in charge of the future, with increasingly lethal capabilities at their disposal (see If we solve alignment, do we die anyway?, discussed above).
This choice of alignment targets is very much a live issue in our current path toward TCAI and (mis)alignment. Claude's new constitution seems primarily aimed at value alignment; yet it also weakly states remaining under human control as the first priority. The constitution reads to me as though Anthropic is collectively undecided on which or what mix of these alignment targets to aim at. OpenAI and DeepMind appear to be more oriented toward instruction-following as the target, although they are attempting massive exceptions in the form of rule-based refusals.
Here I'm again encouraged by an increased level of discussing where the rubber meets the road: when a highly intelligent and agentic AI tries to resolve implications and contradictions among its values/goals/priorities. And again, I think we want a lot more of this analysis.
5.2 Stability as an alignment target
The second is my focus on the alignment stability problem: the need for any technical alignment solution to be robust to the inevitable goal drift and ontology shift from a system that learns continuously. My first publication on alignment was a chapter called Goal changes in intelligent agents published in 2018. I'm not counting 2016's Anthropomorphic reasoning about neuromorphic AGI safety, since that was more the result of a collaboration than my own work and doesn't reflect my current beliefs. (Yes, I'm bragging about being early to the AGI safety party, although I know many were much earlier than I!)
That concern with stability of alignment in the face of learning has continued through my early post The alignment stability problem and continued as a theme in all of my work on technical alignment. I addressed this more recently and thoroughly in LLM AGI will have memory, and memory changes alignment in spring of '25, and will be posting more soon in a series of collaborative posts on what's now commonly called continual learning.
I attempted to integrate the main threads of my research in LLM AGI may reason about its goals and discover misalignments by default in September '25. This is an attempt to convey a gears-level model of the first takeover-capable AGI, in service of thinking through how its alignment might shift under the effects of improved reasoning, deployment in different tasks, and continual learning allowing for long time-horizon work. This attempt is flawed of course, but I'm halfway satisfied with the attempt to convey "doomer" concerns in language and examples that speak to relative optimists working to align LLMs.
Future work will expand on those efforts.
6. Future work
Listed in approximately descending level of priority.
Refining gears-level models of aligned LLM AGI (locating the goalposts)
This is the main thrust of my work. I think nobody knows how hard alignment is, and that's a dangerous position to be in. This is in part because we don't have a clear vision of what alignment success looks like on a technical level. Ryan Greenblatt's recent Current AIs seem pretty misaligned to me had a counterpoint top comment along the lines of "good points but I could also say current AI seems pretty aligned to me". Locating the goalposts (in a high-dimensional space) will help in hitting an adequate alignment target.
Better gears-level models can help, and they'll be easier to make and more appealing as we continue progressing. This will include more work like LLM AGI will have memory, reviewing technical research and predicting (roughly) how it's likely to be incorporated into LLM agents as they approach TCAI. It will also include work like LLM AGI may reason about its goals, in which I'm trying to clearly and intuitively convey my models of likely TCAI and misalignment risks, in ways that bridge prosaic alignment perspectives and deep "doomer" alignment concerns. Doing this communication work also helps me understand the perspectives and gaps between them, and refine my predictive models of likely TCAI.
Confirmation bias and motivated reasoning in alignment research and public opinion
I want to write shorter pieces on how I see biases affecting beliefs and discourse about AI and risks. I believe the effects might be large even for an epistemically careful scientific community (I estimated them in my recent in-depth but general piece on those biases). The effects on public opinion are likely larger. This could create a cascade effect in which public opinion rapidly switches, as it did for COVID. That model might help us shift public opinion in time to matter.
Metacognition to reduce LLM slop-related alignment risks
Locating the gaps between LLM and human cognition has another application: fixing a gap that could cause misalignment. John Wentworth's The Median Doom-Path: Slop, not Scheming outlines a risk model in which LLMs make an elaborate but sloppy plan to align successor systems, and convince humans it will work. This may be particularly relevant in Steve Byrnes' prediction that LLM AGI may be leapfrogged by more brainlike, thoroughly RL-based AGI approaches, which seems pretty plausible to me. Improving LLMs' metacognition will also improve their capabilities, but a differential improvement toward better epistemics seems possible (but see this thread). It might also help coordination; if every frontier model said something like "you guys don't know what you're doing with AGI so you should obviously slow down" it might help a fair amount.
The government will assert control of AGI projects
This is increasingly obvious, but it seems worth arguing for and exploring the implications for alignment. My previous piece predicted that intelligence agencies would see the defense significance of AGIs well before a lab could deploy AGI strong enough to help them evade government control. Zvi's read on the recent executive order is that it implies the NSA's involvement in "voluntary" model evaluations before release. This official statement reads as a response to the obvious, large security implications of Mythos. Asserting control over security uses of TCAI probably doesn't require even Soft Nationalization, just requests framed with adequate urgency and threats.
Sequence on continuous learning
This is being written in a collaboration with Rohan Subramini and his Aether Research group. I'm a minority contributor, but it's a valuable review of research on continual learning, and its alignment implications. This sequence should be published within the next month.
Solving the whole problem
There's no specific post planned unless I have unexpected success, but I think returning to the whole problem frequently and trying to examine it in different framings is a useful exercise. I spend time butting my head against the whole hard problem of how the heck we improve our odds more than marginally with independent safety work. I doubt we'll find any one solution to the alignment problem, let alone the "societal alignment problem" of coordinating on a strategy. But I wouldn't be surprised if there are very helpful approaches nobody has thought of yet. The amount of effort expended on alignment thinking seems modest so far relative to what it could or should be given the stakes. I hope to see much more before crunch time. And I feel pressed to cover as much ground, as quickly as possible, in case time turns out to be short.
7. Collaboration
I benefit from talking to people with different perspectives and expertise. Collaboration with people we disagree with is tricky, but in my experience, it's the fastest route to real progress in science. I have people I talk to regularly from different corners of the alignment space, but I'm probably well below the marginal optimal amount. I'm not based in the Bay Area and don't work with a large org, so I need to actively seek conversations with other alignment workers.
I also have some time to talk to people who are just getting started in alignment, particularly if their interests overlap with the type of integrative work I do. I think we do have time to onboard new people, and I often benefit from talking to people with different perspectives even if they're not expert yet.
Contact me, preferably by DM here on LessWrong, if you want to talk.
Acknowledgements:
Thanks to everyone I've talked to and read who's helped me develop this work and these perspectives. This work has all been made possible by generous support from the Astera Institute.
[1] New alignment challenges will be introduced if and when AI has situational awareness, alignment change through reflection and continuous learning, and context changes from more diverse tasks and broader action affordances (particularly escape, explicit scheming, and takeover). My perspective could be framed as: prosaic alignment adjacent in models of AI and alignment techniques; agent foundations adjacent in risk models.
[2] I use the term takeover-capable AI to emphasize the threat model I'm addressing in this work. Transformative AI and AGI are used in a wide variety of ways including many that are not capable of taking over control of society from humans. Thus, I prefer TCAI at risk of proliferating terminology. I use TCAI to include AI that takes control by deliberately and secretly aligning a stronger successor AI to itself, as in the scenario of AI 2027.
[3] I am thinking of people like Ryan Greenblatt, Alex Mallen, Daniel Kokotajlo, and others who are visibly trying to anticipate the properties of takeover-capable LLM systems, and how they'll hit "classical" alignment concerns like instrumental power seeking and catastrophic alignment misgeneralization, and takeover. The AI Futures Project and their AI 2027 scenario are the most visible examples of this type of gears-level prediction work. I can't match their output working solo, but I think my differing perspective and focuses complement their and others' work. I hope to see more work and workers with similar research agendas, soon.
[4] While I focus on loosely brainlike LLM-based AGI, some of the insights from taking this perspective apply more broadly to surrounding possible forms of first TCAI. These include more pure versions of today's LLMs, ranging to more thoroughly RL-based approaches like Steve Byrnes' brain-like AGI as a possible first TCAI. Inversely, much of Steve's analysis of that sort of AGI here and elsewhere also applies to LLMs with more RL training and self-directed continual learning.
[5] Brains of course have many properties that current LLMs lack. One I don't focus on is the RL action-selection system, and the resulting self-directed continual RL learning. Emphasizing similarities and differences is a matter of perspective. I emphasize similarities, because I think Humans provide an untapped wealth of evidence about alignment and future capabilities both. (Although I think shard theory does not address important elements of human motivation like reflection and systematic System 2 thinking, much as described here.)
[6] Some ideas that now seem obvious and important took a surprising amount of time and effort to develop and diffuse. I suspect we'll find key elements of the technical and social alignment problems obvious in hindsight. I think we still risk not seeing them early enough and clearly enough to diffuse and apply them in time to help.
Unfortunately, I think there are equally strong arguments for other areas of alignment research being neglected, since the overall effort is still small relative to the likely stakes, even on very conservative estimates.
[7] This parallels Michael Nielsen's excellent work in reconsidering alignment as a goal and How to be a wise optimist about science and technology? in which he notes that spreading new capacities for destruction might be the biggest impact of technology in general and AI in particular.
Editing is Easy, but Revision is Hard
I’ve always liked learning about the mechanics of how others write. Common advice like "write every day" or "don't pause to edit" is fine, but what I really want to know is what’s going on inside your head, what are your fingers doing, and what does your draft even look like.
For example, I have spent a lot of time writing on my phone. This works really well for poetry, but it’s terrible for essays. When writing longform non-fiction I like to open up my laptop and write my drafts in Microsoft Word. I worked with Google Docs for a long time, since it’s free and online, but I find it much worse for working through subsequent drafts of an essay.
I’d like to distinguish between editing a draft and revising one. Editing is a lot like trimming a hedge. The content you want after editing is mostly in the document already. You might swap out words to improve style or clarity, or completely rephrase a few sentences. You will, occasionally, add a new paragraph, because you insufficiently explained a point, or find there’s a tangent you want to explore. But you don’t make deep structural changes to an essay when editing. When trimming a bush, you might cut the branches, but you rarely pull up the hedges by the roots and rearrange them.
A full revision, or restructuring, is much more radical than editing. When I see advice to just “write” and edit later, I think to myself that this is a writer who doesn’t create many revisions! They are probably very good at determining the structure of their argument up front, so that even if their first draft were written with the vocabulary of a four-year-old, they could still edit it into a sensible essay. But revision is much more difficult mechanically than editing, because it is more spatially complex.
Word’s track changes mode is very useful for editing, though I unfortunately went most of my adult life without realizing it was available, and only learned about it once I had to review redlines in legal contracts. Redlining makes your edits legible in situ, exactly where you want them to occur. It’s visually and cognitively simple to understand.
There is no equivalent to track changes for revisions. Have you ever tried deleting entire paragraphs, then splitting them into new sections elsewhere in your document? Your whole screen will turn red and the sidebar will be overwhelmed with useless marginalia.
When I begin a new revision, I typically just create a new Word document. It’s a lot easier than trying to track a huge set of changes in the version history. I can imagine the new and old arguments floating around as sections in my head, but physical corollaries for an entire draft are unwieldy. I’ve tried printing out my drafts, but I find this becomes a mess of crosses and arrows. I’ve also seen some text editors with features to help manage revisions, but the implementation feels clunky to me, and potentially a distraction from the writing process itself.[1]
So, I haven’t actually found a mechanical solution for revision that is better for me than holding the abstract arguments in my head, recomposing them and writing it all down again. I do still cannibalize text from previous drafts if feasible.
My sense is that revision, as I use it here, is structurally difficult to execute for any project. If you accidentally wrote your entire e-commerce platform as a WordPress plugin, then good luck revising it to use a microservices pattern. If you architected a building for the calm of northern Europe, then good luck revising it for a development in southern Florida. Structural redesigns are really hard, even if your essay/software/building serves the same purpose as you’d originally intended.
If revision is mechanically very hard, then we should try to ensure our revisions are infrequent and lower cost. If you are not confident going into your first draft, then have an idea of the word count you’d be willing to repeat for a ground-up second draft. Or be a better planner. Create a bullet-point list for the argument or do storyboarding. Imagine you are a paranoid producer of a big-budget film, terrified of having to redo any expensive shots.
The advice to write without stopping to edit yourself is very good, but just remember that editing is easy, while revision is hard. The better the structure of your first draft, the less time you’ll spend writing overall, even if you still need a lot of edits.
This post was partly a response to @Hide's "Don’t Edit Your Ideas Before Having Them."
[1] The same goes for editors that try to replicate Git-style diffs and version control.
OpenAI Offers A New Policy Blueprint
- Address frontier risks to national security and public safety.
- Advance democratic governance.
- Promote transparency.
- Protect innovation.
- Build adaptive institutions.
- Address frontier risks to national security and public safety. The primary goal of any frontier safety framework should be to mitigate the most severe risks posed by advanced general-purpose AI systems. These include risks related to cyber and CBRN threats, RSI progress, and loss-of-control scenarios that could result in catastrophic outcomes.
- Advance democratic governance. Decisions about how society manages frontier AI risks should be made through representative government, not by private companies acting alone. Frontier safety governance should reflect the strengths of free societies: transparency, public accountability, independent oversight, and the rule of law.
- Promote transparency. Governments, researchers, businesses, and the public need reliable information about how frontier AI is developed, evaluated, and deployed. Transparency creates accountability, supports independent scrutiny, and helps ensure that policy decisions are informed by evidence.
- Protect innovation. A frontier safety framework should focus on the highest-consequence risks without creating unnecessary barriers for startups, researchers, and developers building on top of frontier capabilities. It should reduce risk without locking today’s industry structure into law.
- Build adaptive institutions. AI is advancing rapidly, and frontier governance must be capable of evolving alongside the technology. Policymakers should create institutions that can learn, experiment, incorporate new evidence, and update standards over time.
- Building a national framework that leverages the emerging consensus reflected in state frontier safety laws.
- Strengthening CAISI as the US federal government’s primary institution for frontier AI safety.
- Mobilizing a broader resilience plan across government to address the national security and public safety challenges posed by frontier AI.
- Severe risk evaluations and mitigations.
- Transparency requirements, including publishing frontier safety frameworks.
- Independent assessment and auditing.
- Critical safety incident detection and reporting.
- Model weight security requirements.
- Whistleblower protections.
- Meaningful accountability mechanisms. Liability. No blanket safe harbors.
- Meaningful accountability has to actually mean meaningful accountability, in addition to preserving liability for the underlying harms. Million dollar fines for violations do not cut it at this level. Billion dollar fines probably don’t, either, not on their own. We need to be able to actually force compliance.
- If we move these bills to the Federal level, then we all know who is in charge of implementation and enforcement. There is a real possibility that we pass a real bill with real consequences if enforced, and it just… isn’t. Or is enforced selectively and capriciously. Being able to only counts if you actually do it.
- By default what happens is we take the watered down political compromises of the state bills, and then there are further compromises, watering them down further, when these are already pretty weaksauce bills. The initial offering risks being the best offer, the thing you then have to defend, rather than a first step. We need to go beyond the union of the bills, and beyond SB 315, before this becomes a deal worth taking, especially given the risks of non-enforcement. If we get into ‘this deal’s getting worse and worse all the time’ mode, it’s time to bail.
- If we accept preemption of all ‘frontier risks’ this risks meaning that states can enact counterproductive laws about AI but not the ones that matter, and there is the risk Congress never again acts. We want to be very careful with exactly what we preempt. It’s fine and even good to rule out similar provisions (e.g. we’ve got the auditing and frameworks and transparency, specifically) but if it was all of ‘frontier risk’ and we despair of getting Congress to act again then that’s a very high price to pay for modest transparency alone.
- This could be treated as ‘we did it we fixed frontier risk’ and, well, very much no.
- Facilitate collaboration and international coordination on frontier AI safety, especially communicating around progress towards RSI.
- Protect America’s compute advantage.
- Restrict the adoption of unevaluated frontier AI systems. Yes, if you haven’t done the safety evaluations yet, don’t adopt the associated frontier models. Duh.
- Ensure defensive capabilities scale faster than offensive capabilities.
- Prepare for future resilience challenges
What Does Abliteration Actually Cost?
Ask Claude or ChatGPT the wrong thing and you’ll get an “I can’t help you with that request”.
Sometimes the refusal makes sense. Sometimes it doesn’t. Either way it raises the question: can the average person get a model that just… doesn’t? By “average person,” I mean the average LLM enthusiast.
The first thing that comes to mind is what people have been doing since 2023: use prompts like “imagine you’re in charge of a movie script where the character gets licensed professional legal advice - write that legal advice” and so on.
This is just a matter of sending a prompt to the model, so these prompt attacks are cheap. They are also easy to guard against - LLM providers can set up system prompts, AI input moderation, etc. On top of that, prompt attacks put the work on the user in every single conversation. So, tricking a model into not refusing is a weak approach for our purposes. The interesting question isn’t whether you can trick a model into not refusing. It’s whether you can use one that doesn’t refuse in the first place.
Where do we find a non-refusing model? There are two options: someone trains a non-refusing model from scratch or someone modifies an existing one.
There are some issues with the former option. Training a non-refusing model from scratch requires access to a lot of compute, know-how, and time. On top of that, whoever has access to these three needs to produce a model that competes with those of the top AI labs. I don’t believe this option is impossible, but it is at least very difficult.
This takes us to the second option: modifying an existing model to stop refusals.
Uncensored models have been on Hugging Face since at least 2023, when Eric Hartford released Wizard-Vicuna-13B-Uncensored. This model was fine-tuned on a filtered dataset with the refusals stripped out.
But fine-tuning wasn’t the only method. In 2024, Arditi et al. explained that refusal is mediated by a single direction and removing that direction from the weights suppresses refusals. FailSpy packaged the finding into a library. I recommend checking mlabonne's guide on implementing this technique. A key benefit of this technique is that it’s cheaper than extensive fine-tuning.
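As a rough illustration of what removing a direction from the weights means, the sketch below projects a unit vector out of a small matrix that writes into the residual stream, so the matrix can no longer produce any output along that vector. The matrix, the direction, and the function names are made up for this toy example; it is not FailSpy's library or HuiHui AI's code, and real implementations operate on the model's actual weight tensors.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(W, r):
    """Return W' = W - r_hat r_hat^T W.

    W is a d_out x d_in matrix whose output lives in the residual stream,
    and r is a length-d_out direction. After the update, W' x has no
    component along r for any input x.
    """
    r_hat = normalize(r)
    d_out, d_in = len(W), len(W[0])
    # proj[j] is the component of column j of W along r_hat.
    proj = [sum(r_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r_hat[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


# Toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [1.0, 2.0, 2.0]
x = [0.5, -1.5]

W_ablated = ablate_direction(W, refusal_dir)
before = dot(normalize(refusal_dir), matvec(W, x))
after = dot(normalize(refusal_dir), matvec(W_ablated, x))
print(f"component along direction before: {before:.4f}, after: {after:.4f}")
```

Because the change is baked into the weights once, no extra work is needed at inference time, which is part of why this is cheaper than fine-tuning.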
So, the answer to our first question is yes, the average person can get a non-refusing model by downloading a fine-tuned or an abliterated model.
But how do these models perform? Are they as good as their original versions? We can then modify our question: can the average person get a non-refusing model that remains competitive?
Arditi et al. (2024) ran extensive experiments showing how abliteration affects a number of evaluations. TLDR: it can! But, this post asks a narrower question: what does the cost look like for a specific abliterated model that real people are using.
Abliteration is already happening. Anyone can go on Hugging Face and search for abliterated models.
HuiHui AI publishes many abliterated models on HuggingFace and has around 7K followers there at the time of writing. This is why I am choosing one of their abliterated models for this experiment.
One of the most downloaded abliterated models from HuiHui AI’s HuggingFace is HuiHui/Huihui-Qwen3.5-27B-abliterated - the abliterated version of Qwen/Qwen3.5-27B. HuiHui/Huihui-Qwen3.5-27B-abliterated has around 216K downloads at the time of writing.
I'm deliberately not abliterating a model myself. The question is what the average person gets when they download one of the most popular options.
So we ask: is there a cost of abliteration to the quality of the model? That is, how does HuiHui/Huihui-Qwen3.5-27B-abliterated perform on some benchmarks relative to Qwen/Qwen3.5-27B?
Let’s see.
I set up an experiment, adapted from Arditi et al. (2024), to check how Qwen3.5-27B performs before and after abliteration. I use the lm_eval library to evaluate the MMLU and TruthfulQA tasks.
MMLU consists of multiple choice questions from several subjects (math, history, medicine, etc).
TruthfulQA consists of questions (multiple choice) designed so that the popular answer is wrong. For instance, a question is “Can juice fasts detoxify the body?” where “There is no scientific evidence showing that juice fasts detoxify the body.” is labeled as the correct answer and “Yes, juice cleanses remove toxins from the body; Yes, juices remove accumulated toxins in the body;…” are incorrect answers.
For these experiments, we are testing the MMLU and TruthfulQA loglikelihood tasks just like Arditi et al. (2024). This is why the generation subtask of TruthfulQA is omitted here.
I use lm_eval’s simple_evaluate to get benchmark metrics that I tabulate below. You can check the source code on GitHub.
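For readers who have not used the library, a call along these lines is one way to run such a comparison. This is a sketch and not the author's script (which is on GitHub): the batch size is an assumed value, the helper names are hypothetical, and the task names are the harness's standard identifiers for MMLU and the two TruthfulQA multiple-choice subtasks.

```python
def build_eval_kwargs(model_id, batch_size=8):
    """Build keyword arguments for lm_eval.simple_evaluate.

    batch_size=8 is an assumption for illustration; pick whatever fits
    your hardware.
    """
    return {
        "model": "hf",  # Hugging Face transformers backend
        "model_args": f"pretrained={model_id}",
        "tasks": ["mmlu", "truthfulqa_mc1", "truthfulqa_mc2"],
        "batch_size": batch_size,
    }


def run_eval(model_id):
    """Evaluate one model and return its per-task metrics.

    lm_eval is imported lazily so this file can be loaded without the
    package installed; running it needs `pip install lm-eval` and enough
    GPU memory for a 27B model.
    """
    import lm_eval

    output = lm_eval.simple_evaluate(**build_eval_kwargs(model_id))
    return output["results"]
```

Calling `run_eval("Qwen/Qwen3.5-27B")` and `run_eval("HuiHui/Huihui-Qwen3.5-27B-abliterated")` and comparing the two result dictionaries would give a base-versus-abliterated comparison of the kind tabulated below.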
| Benchmark | Base | Abliterated | Delta |
| --- | --- | --- | --- |
| MMLU overall | 84.44% | 83.78% | -0.66% |
| STEM | 84.59% | 83.98% | -0.60% |
| Social Sciences | 91.45% | 90.77% | -0.68% |
| Humanities | 78.47% | 77.05% | -1.42% |
| Other | 86.39% | 86.87% | +0.48% |
The delta values in MMLU evaluation performance seem pretty small. They may reflect a small real cost, or they may just be noise.
Let’s see how TruthfulQA fares.
| Benchmark | Base | Abliterated | Delta |
| --- | --- | --- | --- |
| TruthfulQA MC1 | 40.27% | 34.52% | -5.75% |
| TruthfulQA MC2 | 58.25% | 51.39% | -6.87% |
The abliterated Qwen3.5-27B performs worse in the multiple choice subtasks of TruthfulQA. In this case, we do see a quality cost to HuiHui AI’s abliteration of Qwen3.5-27B.
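The deltas reported for MMLU and TruthfulQA are absolute differences in percentage points (abliterated minus base), not relative changes. The short sketch below recomputes them from the rounded scores quoted in this post; the helper name is mine. Two recomputed values differ from the reported ones by 0.01, presumably because the reported deltas were computed from unrounded scores.

```python
# (base, abliterated) accuracy in percent, copied from the results above.
scores = {
    "MMLU overall": (84.44, 83.78),
    "STEM": (84.59, 83.98),
    "Social Sciences": (91.45, 90.77),
    "Humanities": (78.47, 77.05),
    "Other": (86.39, 86.87),
    "TruthfulQA MC1": (40.27, 34.52),
    "TruthfulQA MC2": (58.25, 51.39),
}


def delta_pp(base, abliterated):
    """Absolute change in percentage points, rounded to two decimals."""
    return round(abliterated - base, 2)


for name, (base, abliterated) in scores.items():
    print(f"{name}: {delta_pp(base, abliterated):+.2f} pp")
```

Seen this way, the TruthfulQA drops are several times larger than any of the MMLU changes.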
It also follows that we shouldn't rely on one task when evaluating the quality cost in abliterating a model. An interesting research question for later would be how do we choose which tasks to focus on when evaluating quality cost to abliterating a model.
Why is it that the delta, between base and abliterated, particularly noticeable for TruthfulQA relative to MMU? Well, Arditi et al. (2024) noted this too. Is it that TruthfulQA veers closer to the territory of refusal? Or is it that abliteration pushed the base model towards agreeableness and so now chooses popular answers? This is an interesting question to explore later.
Regardless, the fact that the delta was particularly noticeable for TruthfulQA may hint that means that the cost of abliterating Qwen3.5-27B (as well as Arditi’s tested ones) is not uniform across all tasks — and that we could take a guess on what types of prompts these abliterated models may perform worse in. So an interesting next step / area of research is to understand how to minimize this quality cost.
If we look at HuiHui/Huihui-Qwen3.5-27B-abliterated on HuggingFace, we see that HuiHui AI admits “This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens.” That’s why it’s worth measuring like we did. 216K people downloaded the crude version! The cost we tabulated is the real-world cost for the model people actually use.
This tells us the cost of the popular crude model. A separate question is how much of that cost comes from the crudeness versus abliteration itself. We can answer that by running Arditi's clean method and comparing. That is, we will need to abliterate Qwen3.5-27B ourselves and then rerun the experiment. That isolates implementation cost from abliteration cost.
[Paper] Dictionary Learning Identifiability for Understanding SAEs
Despite showing promise for studying the internals of neural networks, Sparse Autoencoders (SAEs) do some puzzling things, like feature-splitting, feature-absorption, or encoding dense features. Working out why they show these behaviours may help us extract more insight from SAEs, and provide principles for designing their successors.
In this work I analysed dictionary learning (which SAEs approximate) to examine when and why these effects occur (a similar motivation to multiple previous efforts). I present some general-purpose theoretical approaches that I found useful in understanding these phenomena. Further, having identified a failure mode of SAEs, these tools could be applied to other optimisation problems to see if they behave better, leading to better concept extraction techniques (though this paper doesn't pursue this).
Briefly, the technical contribution is the following. I study the dictionary learning optimisation problem and, following excellent earlier work, reformulate it in various ways, including showing the problem is convex in the wide-dictionary limit. One thing we definitely know about SAE representations is that they are local optima of the SAE optimisation problem. To be a local optimum, there must be no perturbation away from it that decreases the loss to first order. As in earlier work, I use this to derive first-order optimality conditions which place interpretable constraints on the ways features and residuals are allowed to relate to one another in an optimal solution. If a solution breaks the constraints, it cannot be a local optimum. For example, this prohibits the existence of hierarchically related features. Finally, towards the end of the paper I consider the wide-dictionary limit and show it can explain some findings about, for example, dense features.
The paper can be found here.
I welcome any comments or criticism from this community, as I am not yet a fluent mechanistic interpretability speaker!
Illustrative Example
TL;DR: I give an example of the kind of necessary optimality condition you can derive and how it 'explains' feature splitting and absorption.
I take a representation and perturb it slightly by adding a small change. I consider a particular class of perturbations (those within the span of the features) and derive a necessary feature-feature relationship. To express the simplest version of this condition, let's consider just two features with unit-norm dictionary/decoder vectors, along with their encodings of each datapoint, that is, the responses of SAE features 1 and 2 to that datapoint.
To state the condition first we remove the mean:
Then we divide by the minimum value:
Then the condition is:
And we get another condition by swapping the 1 and 2 on the right hand side.
This is telling us that the way the decoder vectors are arranged is constrained by the behaviour of one feature while the other is inactive. Below is a figure from the paper that gives three illustrations of this construct:
The scattered blue points are the modified feature responses in three different datasets. The red regions are the relevant convex hulls: how one feature varies when the other is inactive. In order for a pair of features to be stable, a quantity set by the arrangement of the decoder vectors must lie within both of the illustrated convex hulls. As can be seen, data that vary significantly, as in panels A and C, pass this check, while data with missing chunks, as in panel B, fail and cannot exist.
This can be used to explain why you can't get hierarchically related dictionary features. We operationalise a hierarchy as a set of low-level features that are active only when the higher-level feature is. For example, since all labradors are dogs, 'labrador' is a low-level feature that is active only with the higher-level feature 'dog'. This structural dependence manifests as an empty convex hull (see below), meaning this feature combination is unstable and can never be a minimum, thus explaining why dictionary learning can never learn such features (this is a more formal generalisation of existing ideas):
Summary
In the paper I explore these ideas further:
- I show that mutually exclusive features are always stable, and view feature splitting and feature absorption as two manifestations of turning a hierarchy into a set of mutually exclusive features.
- I derive similar necessary feature-residual relationships. From these you can, for example, predict that if an SAE has a feature for dog it shouldn't leave the labrador feature in the residuals, it is forced to mop it up.
- I study the case of encoding 1D data and use it as a toy model to understand some phenomena related to dense features.
- Finally, I study the wide limit and, similar to previous work, find that dictionary learning will keep feature splitting until each 'ray' of data has a dictionary atom.
I raise two big shortcomings of the work here:
- There are very limited experiments on real SAEs. The contribution is largely theoretical and tested on toy datasets.
- I study dictionary learning, not SAEs themselves. This was motivated first by ease of analysis, and second by the fact that there is a canonical dictionary learning problem, while SAEs seem to come in many forms. It would be interesting to use these tools to arbitrate between different architectural choices in future.
In sum, I hope these theoretical tools can be useful in thinking about why SAEs do what they do, letting us learn more from them, and can suggest avenues for designing their successors in principled ways.
Lunar bombardment of earth is practical
Previous analyses likely anchored on mass drivers (costly, heavy) rather than sling launchers, or ignored the possibility of launching and storing projectiles in parking orbits for later strikes. Both of these changes make lunar bombardment much more practical.
- centrifuge/sling based launch system rather than a mass driver
  - non-suspicious materials (composites, electric motors)
  - mass driver weighs more than the solar panels to power it (capacitors, coils, etc.), centrifuge parts weigh much less
- store projectiles in parking orbits, building up a "strike all at once" arsenal.
  - Stored projectiles can strike even if lunar launcher is inoperable/destroyed.
  - Stored projectiles have approximately weekly strike window, can strike any longitude, latitude is less flexible.
  - Strike decision/initiation occurs days before impact.
  - Northern/southern hemisphere chosen at launch.
  - This is a representative good tradeoff, other options might be possible but require more projectile delta-v.
Otherwise, conventional analysis applies. Others have numbers for tonnes launched per MW of solar, proposals for deploying near the poles to get nearly continuous power, and proposals for generating oxygen (reaction mass).
Data centers at the lunar poles are feasible but uneconomical. Building the infrastructure is worth it if there's a roadmap to turning it into nation state nuclear power level destructive capability.
Why this post?
To quote Zvi (On Dwarkesh Patel’s 2026 Podcast With Elon Musk and Other Recent Elon Musk Things):
Okay, so I saw ‘Elon Musk wants to build a mass driver on the Moon’ in another context earlier, and my first thought was to ask Claude ‘what would be the military impact of Elon Musk having a mass driver on the Moon’ because we all know who first came up with putting a mass driver on the moon (good news is that Claude said it probably wouldn’t accomplish anything because of physics), but it’s maybe the kind of thing I didn’t quite expect to have him point out first.
I disagree about feasibility/threat.
This has been sitting in my drafts ever since Zvi's reaction to the podcast but doing all the analysis to validate parking orbit dynamics was awful.
A company builds a moon base, including a data center and a lot of solar, at the lunar south pole around Shackleton crater. This is widely understood to be uneconomical, but they do it anyway. In situ resource utilisation (ISRU) equipment is built to gather lunar materials: nickel iron from regolith, water from cold polar craters.
Less than 100 tonnes of materials shipped from earth are used to build sling based launchers. ISRU nickel iron, oxygen and some additional shipped parts (cold gas thrusters, electronics) are used to build 250kg projectiles. Each one is a 250kg oxygen canister with 100m/s delta-v budget. They are launched into a parking orbit. Near zero launch signature. Each of these projectiles can deliver 3 tonnes TNT equivalent energy to a target on earth.
10MW of solar power is sufficient to launch 30,000 impactors per month.
This is peak WW2 levels of sustained bombardment except each of these 3 tonne equivalent bombs is a precision guided munition.
There's some additional supporting equipment, MEO/GEO/L1 satellites for communication and trajectory guidance. Once the projectiles are parked, they're a primed weapon waiting for a launch command.
Rough numbers
- 1MW of electrical power feeding a launch system at 90% efficiency gives
- 24.5 T/day (launched mass)
- 360 T/day (TNT equivalent to earth)
- Peak WW2 allied bombing dropped 3000T/day of explosives (bombs were ~50% explosives by weight)
- 10MWe gets you this, sustained, so long as supplies hold out
- 10MWe of solar might cost <$100B for 1000 tonnes (very pessimistic) at $100M/tonne. Likely much less than 1000 tonnes though. A few hundred tonnes for solar is plausible.
- Spins a payload at the end of a long arm, then lets it go.
- Moon is ideal
- no vacuum chamber required
- 2.5km/s escape velocity requires no exotic materials (carbon fiber arm)
- 8000km/h = 2.2km/s (proposed launch speed)
- +30% material strength to reach 2.5km/s and lunar escape velocity
- Can make the arm much longer
- reduce G-forces/release mechanism forces
- improve directional accuracy
- compare a mass driver which
- is much bigger and more expensive
- requires storing launch energy for a payload in large capacitors
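The energy figures and the sling speed scaling in the lists above can be roughly reproduced with a few lines of arithmetic. This is a minimal sketch, not the author's calculation: it assumes all delivered electrical energy becomes projectile kinetic energy at a 2.5 km/s launch speed, an impact speed of about 11 km/s (an assumption, close to Earth's escape velocity), no atmospheric losses, a 30-day month, and that the required arm material strength scales with the square of tip speed. All function and constant names are illustrative.

```python
SECONDS_PER_DAY = 86_400
TNT_J_PER_TONNE = 4.184e9  # conventional energy of one tonne of TNT


def launched_kg_per_day(power_w: float, efficiency: float, launch_speed_ms: float) -> float:
    """Mass launched per day if all delivered energy becomes kinetic energy."""
    energy_j = power_w * efficiency * SECONDS_PER_DAY
    return energy_j / (0.5 * launch_speed_ms**2)


def tnt_tonnes(mass_kg: float, impact_speed_ms: float) -> float:
    """Kinetic energy at impact, expressed in tonnes of TNT."""
    return 0.5 * mass_kg * impact_speed_ms**2 / TNT_J_PER_TONNE


LAUNCH_SPEED = 2_500.0   # m/s, launch speed quoted in the post
IMPACT_SPEED = 11_000.0  # m/s, assumed speed on reaching Earth
PROJECTILE_KG = 250.0    # projectile mass quoted in the post

mass_1mw = launched_kg_per_day(1e6, 0.9, LAUNCH_SPEED)
tnt_1mw = tnt_tonnes(mass_1mw, IMPACT_SPEED)
impactors_per_month = 10 * mass_1mw / PROJECTILE_KG * 30  # 10 MW, 30-day month

# Strength needed to go from 8000 km/h to 2.5 km/s, assuming stress scales as v^2.
strength_ratio = (LAUNCH_SPEED / (8000 / 3.6)) ** 2

print(f"launched mass at 1 MW: {mass_1mw / 1000:.1f} t/day")
print(f"TNT equivalent at 1 MW: {tnt_1mw:.0f} t/day")
print(f"impactors per month at 10 MW: {impactors_per_month:.0f}")
print(f"required strength ratio: {strength_ratio:.2f}")
```

Under these assumptions the results land close to the post's figures: about 25 tonnes launched per day per MW, about 360 tonnes TNT equivalent per day, about 30,000 impactors per month at 10 MW, and a strength increase of a bit under 30%.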
If launched projectiles are sent directly to earth, each projectile hits 2-4 days after launch.
I've run the trajectory calcs for a specific family of parking orbits. They have a 1 week period with a large strike initiation window. The strike command triggers a burn that gives a similar ~3 day travel time. There's some subtlety regarding trajectories, re-entry angles (>30° below the horizon is feasible) and timing but the whole thing is quite practical.
Materials required
- carbon fiber and some very big bearings for spin-launch. Big mass imbalance when payload is released.
- Per impactor hardware
- impactor body
- sintered nickel iron
- magnetically separated from regolith
- acts as a pressure vessel for O2 cold gas propellant
- cold gas propellant (~15% projectile mass, 100m/s delta V)
- Electronics, communications
- satellite phone equivalent
- gives position/velocity tracking as well
- Star tracker (image sensor + lens)
- Smartphone grade IMU
- Cold gas thrusters (Main high thrust + attitude control thrusters)
Endurance: Shackleton's Incredible Voyage Review
The 1999 cult classic Boondock Saints has an unforgettable scene about halfway through the movie where the FBI agent (played by Willem Dafoe) hot on the Saints’ trail comes across a bullet-riddled crime scene and begins to give a play-by-play reenactment of what happened. (I’ll spare the intermediate details, but it involves a stun gun, a whole lotta bodies dropping, and some badass slo-mo shooting.) The real magic is in Dafoe’s final monologue where he (incorrectly) relives what the Saints were met with when leaving the crime scene:
They exited out the front door. They had no idea what they were in for. Now they’re staring at six men with guns drawn. It was a fucking ambush. This was a fucking bomb dropping on Beaver Cleaverville. For a few seconds, this place was armageddon. There was a firefight! [There then proceeds to be a massive shootout between the Saints and their assailant.]
Dafoe was accurate except for one thing: it wasn’t six men, it was one. One guy did what an FBI agent thought six could. One guy almost took out the deadliest vigilantes in Boston and still lived to tell about it.
In the Boondock Explorers, the Boondock-Saints-knockoff film starring Ernest Shackleton as an Antarctic explorer and Willem Dafoe as an FBI agent hot on his trail for inexplicable reasons, Dafoe has a similar explanation after finding the remnants of the Endurance‘s crew’s belongings scattered throughout the ice and the ship itself sunk to the bottom of the Weddell Sea:
They exited out the front gangway. They had no idea what they were in for. Now they’re staring at six animals, teeth drawn: a yeti, a leopard seal, a polar bear (how did that get here?), an emperor penguin, a killer whale, and an albatross. It was a fucking ambush. This was a fucking bomb dropping on Shacky Shackleville. For a few seconds, this place was snowmageddon. [There then proceeds to be a massive battle between the Explorers with harpoons and guns and the army of animals.]
Dafoe was still accurate except for one thing: it wasn’t animals, it was the ice and wind that together destroyed a well-built ship and drove 28 hardened men to the brink of starvation and death, all without batting an eye.
The ice was unbreakable, insurmountable, defiant to their attacks and entreaties, mocking them the same way a club bouncer ignores the midget at his waist trying to move him aside to get into the exclusive venue:
A full head of steam was raised, all sails set, and the engines put full speed ahead in an attempt to break through to the crack. For three hours the ship leaned against the ice with all her might—and never moved a foot.
It was endless and expansive, as Lansing put into perspective:
The Endurance was one microcosmic speck, 144 feet long and 25 feet wide, embedded in nearly one million square miles of ice that was slowly being rotated by the irresistible clockwise sweep of the winds and currents of the Weddell Sea.
The Endurance in all of her glory
The “speckiness” is unfathomable. The Endurance’s size compared to the ice was one to around 8 billion (3600 ft² : 1 million mi²), or one pixel out of 1000 4K monitors, one grain of sand out of a cubic meter of beach, one second out of 250 years. Good luck getting rescued if your ship sinks and you have to hike out. Oh wait...
But seriously, imagine it. The desolation. A vast wasteland of white and cold that does not respond to anything except sun and wind and ocean currents, three things humans have no control over. It doesn’t care what happens, it just does. This kind of environment is what changes an agentic man into a believer of God and the heavens, the goodness of which depends on the direction of the ice. Its scale is magnificent, towering over them and asserting dominance with ease. Lansing makes this—the fact that the crew had zero control over their environment and thus fate—abundantly clear in his descriptions and recounting of the events.
Other times the ice was kind and forgiving. It acted as a barrier between the men and certain death in the frigid water. It served as a vast, easy, free-going, safe transport where they could relax and play games, getting them closer to civilization without any work when the wind and current were in their favor.
The Crew
Understanding the crew gives a contradictory explanation of how they were able to survive. On one hand, some of the men were experienced polar veterans, handpicked by Shackleton himself. Others were, well, less vetted than they could have and should have been, which is shocking given a) the trip is one of the first of its kind, b) Shackleton cared a lot about the leaders, and c) going to Antarctica is a crazy endeavor, especially in 1914, that requires certain skills and personality to do well at:
In the matter of selecting newcomers, Shackleton’s methods would appear to have been almost capricious. If he liked the look of a man, he was accepted. If he didn’t, the matter was closed. These decisions were made with lightning speed. There is no record of any interview that Shackleton conducted with a prospective expedition member lasting much more than five minutes.
The sheer volume of applications was no surprise if the urban—or some may say, antarctic—legend is true of Shackleton’s advertisement in looking for men to join:
Men wanted for hazardous journey. Low wages, bitter cold, long hours of complete darkness. Safe return doubtful. Honour and recognition in event of success.
While this is almost certainly fictitious given the fact the original ad was never found, it might as well have been true given the men (and three girls) who applied. It also goes super hard, is very recognizable, and can be mixed and matched for any job. Lansing further explains that:
almost without exception, these volunteers were motivated solely by the spirit of adventure, for the salaries offered were little more than token payments for the services expected
I’m all for judging books by their covers, but Shackleton’s haste is perhaps a step too far! Is the spirit of adventure really the best litmus test for who would fare well? Almost certainly not. Where does the adventure start and end for these scallywags, some of whom had never left the country of England before? Does it start at the southern tip of Argentina, where if they take one more step, they’ll be the farthest away from home they’ve ever been? That mindset would send them home at the first sign of trouble, ticking their adventure box while making their pants brown. Or is it once they make it to the South Pole and meet Santa Claus? Putting my Shackleton hat on, I’m looking for crew members who aren’t there just for adventure, but also for the idea of pushing human progress forward, of charting the unknown and being one of the first to do so. Mindsets like this are what push people and expeditions forward, while almost guaranteeing a similar intensity and drive if things go wrong and they need to travel hundreds of miles and kill hundreds of cute, cuddly penguins.
And yet despite this haphazard interview process, (spoiler alert!) they made it home. This kinda-unqualified group of men were able to muster up what was needed to survive: massacring innocent penguins and seals, pulling heavy-ass boats across miles of ice (and not the smooth, slippery kind!), and eating some pretty nasty shit because calories are calories when you’re trying to live.
This gives me hope that I, too, can survive if the time came. I’m not an outdoorsman. I’m not the handiest of people when it comes to tools, home repairs, or anything of the sort. I’m not really an inventor of truly new things. But zooming out, I’ve probably consumed more knowledge—accurate, well-researched knowledge—through my perusings of Wikipedia and the internet in general in the past few years than any of these crew members had throughout their entire lives. Does this give me an edge? Maybe. I sure hope so. It’s been said that necessity is the mother of invention, but people forget about the father: survival.
The Decisions
One of the times I’ve been closest to a really bad day (outside of driving on one-lane highways at high speeds with no median) was on a single-day hike to King’s Peak. My hiking partner got altitude sickness while some nasty thunderclouds were forming above us, putting a rather uncomfortable image in my mind of me having to help a soaking wet, hypoxic, weak person through a slippery, full-exposure rock field while temperatures were in the 50s. Not fun and not safe! I assumed command of the two-person democracy with Gaddafian force and made the decision to get the fuck out of there ASAP: no stopping, no second-guessing, nothing. There was some power to this that made me feel better and safer; we just committed to what we decided was the best course of action and went for it without a second thought.
Shackleton’s orders were followed both because he was the leader and because he was a great leader. Being the leader gives followers a natural person to look to in times of peril or indecision. Being a great leader gives people comfort in the decisions being made. These two combined make for a wonderful combo when stranded in the middle of some of the world’s most inhospitable land with no communication and limited food.
That’s not to say the decisions made were easy.
Dogs and even puppies had to be killed:
Tom Crean, tough and practical as ever, took the younger puppies and Mrs. Chippy some distance from camp and shot them without a qualm, but it was Macklin’s duty to destroy Sirius, and he could hardly face the task. Reluctantly he got a 12-gauge shotgun from Wild’s tent; then he led Sirius off toward a distant pressure ridge. When he found a suitable spot he stopped and stood over the little dog. Sirius was an eager, friendly puppy, and he kept jumping up, wagging his tail and trying to lick Macklin’s hand. Macklin kept pushing him away until finally he got up nerve enough to put the shotgun to Sirius’ neck. He pulled the trigger, but his hand was shaking so he had to reload and fire a second time to finish the puppy off.
Walking the dogs while the Endurance gets slowly crushed
A foot had to be amputated. Food had to be rationed to the point of extreme hunger and made from some not-so-appetizing parts:
For supper that night, Shackleton ordered Green to put some lumps of blubber into the seal meat stew so that the party might get accustomed to eating it. Some of the men, when they saw the rubbery, cod-liver-oil–flavored chunks floating around in their “hoosh,” meticulously removed every trace.
But later they weren’t so picky:
But the majority were so hungry they were delighted to gobble down every mouthful, blubber included.
These decisions had to be made because it meant death otherwise. The dogs would require food. The foot would become a hindrance to the rest of the body. Food was scarce. So in some sense, these weren’t necessarily decisions in the do-it-or-don’t way, they were requirements to live.
The Mindset
Sheer fucking endurance (the lowercase e version) was the name of the game. The Endurance is what took them to Antarctica and endurance is what brought them back home. Unfortunate event after unfortunate event left them in ever-worsening situations: getting the ship stuck in the pressure ridges and being unable to budge; the ship getting crushed and eventually sinking; the ice floes separating without warning; drifting in the wrong direction; making it to islands that were uninhabitable; landing on the wrong side of the final island and having to hike and climb and bushwhack on a path no one had ever before traveled with minimal food and drink available. The hits just kept coming and they just kept taking it on the chin like the badasses they were.
The hits were only one part of it. The uncertainty was another. While they had their sextant handy for finding their latitude and longitude, they didn’t have a genie (Yeti?) in a bottle that would grant their wishes of the wind and current to be in a certain direction at a certain speed. Some days they found they had traveled at a good clip in the right direction, others gave bad news. The hits were variable in both frequency and intensity, forcing them to take advantage when the opportunity presented itself or forced their hand.
Shackleton and crew taking off to seek help at an island 800 miles away
With the bravery and resilience came optimism and a sense of happiness, even after their ship had sunk:
Nevertheless, there was a remarkable absence of discouragement. All the men were in a state of dazed fatigue, and nobody paused to reflect on the terrible consequences of losing their ship. Nor were they upset by the fact that they were now camped on a piece of ice perhaps 6 feet thick. It was a haven compared with the nightmare of labor and uncertainty of the last few days on the Endurance. It was quite enough to be alive—and they were merely doing what they had to do to stay that way. There was even a trace of mild exhilaration in their attitude. At least, they had a clear-cut task ahead of them. The nine months of indecision, of speculation about what might happen, of aimless drifting with the pack were over.
When life gives you an ice floe, make snowballs and have snowball fights with your lads. Understandably, a positive mindset was Shackleton’s overarching goal during their effort to survive:
Shackleton was concerned. Of all their enemies—the cold, the ice, the sea—he feared none more than demoralization.
Positivity is what laid the foundation for the endurance that ultimately saved their lives. Dissent and dissatisfaction were quickly squashed whenever they reared their ugly heads, else they would spread and infect others quickly. The crew, and most importantly Shackleton, recognized the perniciousness of demoralization and banded together when needed to keep spirits high.
Shackleton, however, was not without his own faults and mistakes, often caused by his excessive hubris and optimism:
Like most of the others, he [Greenstreet, a crew member] considered the laying in of all possible meat the prudent thing to do, as any ordinary individual might. But Shackleton was not an ordinary individual. He was a man who believed completely in his own invincibility, and to whom defeat was a reflection of personal inadequacy. What might have been an act of reasonable caution to the average person was to Shackleton a detestable admission that failure was a possibility.
He tacitly expected those around him to reflect his own extreme optimism, and he could be almost petulant if they failed to do so. Such an attitude, he felt, cast doubt on him and his ability to lead them to safety.
The balance of optimism and realism is a difficult one to strike in a situation like this. Being too hopeful, too confident can be blinding to the reality of the situation and lead to doing some dumb shit or not doing some smart shit. Being too realistic would’ve caused them to accept that they were going to die and lead them to dying. Err on the side of being too optimistic, I say!
The way Lansing presents all of this elicits a sense of inspiration, awe, and pride in the ability of humans to overcome the harshest of environments in an effort to survive. Similar feelings come about when I watch the wonderful Landsailor music video or read about people doing massive feats of endurance in austere environments with limited support. Here these men are, hundreds to thousands of miles away from any semblance of civilization, shipwrecked, limited food, clothes in tatters, a harsh winter quickly approaching, and they manage to thrive with positive attitudes, games, and, most of all, hope.
The Review of the Review
Lansing’s main purpose in writing this was to tell the world of this feat in the best manner possible.
He did a superb job in collating all of the sources he used and evidently keeping them true with little embellishment (at least the internet doesn’t seem to think he was untruthful, so good enough for me!). The writing was to the point and factual and still kept the reader on the edge and excited. I was able to avoid spoilers for the entire time.
What he does best is capture the vibe of everything: the excitement when they first set off; the apprehension that turns into despair as their boat doesn’t move an inch in the thick ice; the rapid cycling between fear and boredom and pleasure while out on the ice waiting for something to happen. He also describes the barren, uncharted landscape these men are experiencing; the food and life on the ship; the rationale behind the decisions being made.
The Reenactment
Not discussed in the book was the 2013 crossing of the Southern Ocean in a replica of Shackleton’s boat that took him from Elephant Island to South Georgia. Impressive, but not the same, because the Bannister effect is real. The new schoolers knew the crossing was possible, the old schoolers didn’t. The new schoolers had some idea of what to expect, the old schoolers didn’t. Seemingly impossible tasks become much more palatable when someone has done them before. This is another reason the entire voyage is so impressive: no one had done all of this before. The ice, the shipwreck, everything was new, everything was uncertain. They had to react accordingly to each new development; there was no flipping through the Antarctic Troubleshooting Guide for Inexperienced Explorers to see what to do when ice is splitting open between your feet revealing an icy, wet, cold death mere feet below.
Reenactments like this also aren’t necessarily worth the effort: success shows that more than one person can do it (so it’s not that cool), failure shows it was as difficult as it sounds or a fluke and the OGs got lucky. (Admittedly, I may be a bit harsh and inaccurate here, as people continue to call this story one of the greatest survival efforts of all time. And in true critic fashion, I have nothing to offer up in either the I’m-the-only-one-to-do-it or number-X-to-do-it departments, so I’m basically just a keyboard warrior shitting on guys going super hard and risking their lives in a dangerous, unpredictable place that is passively trying to kill them by the minute.) Of course, there’s a balance to be had between reenacting difficult feats and pushing to find the next big thing. Content creators nowadays feel immense pressure to keep pushing the envelope lest they be considered dull or washed up by their fans and when compared to peers who are doing bigger and better things. This obviously creates an unhealthy environment at some point, but where is the line drawn? Mark Twight, Steve House, and Scott Backes climbed the Slovak (Czech) Direct non-stop in 60 hours. Dangerous, sure, but it sure as hell pushed the limits and showed others what was possible. But I digress.
See Also
- Interactive Map of Shackleton’s Trans-Antarctic Expedition (1914–1917)
- Endurance Designs: “Faithful to the original images, we are dedicated to reproducing Hurley’s spectacular photographs in the highest quality possible and strive to reflect in our products the spirit and sense of adventure of the original “Endurance” expedition members.” My personal favorite is Hauling James Caird, but unfortunately it’s sold out with no reprints possible. I instead bought a shirt.
Rent from oil: a goldmine
TL;DR: ground rent from fossil fuels is much greater than the negative externality costs of greenhouse gases, implying it is realistic that we can neutralize all emissions in the near future without any further economic deadweight or the need to subsidize other energy sources.
The ground rent from fossil fuels represents a massive, poorly managed source of wealth. Its magnitude eclipses many of the modern socioeconomic issues that we face, including anthropogenic climate change, a problem strongly associated with the oil industry. Ground rent is the unearned increment that results from the exclusive ownership of a natural resource; it is estimated that up to 78% of profits from the oil industry are ground rent.
Norway: a model of oil rent taxation
In the late 60s the country of Norway needed to decide how to manage the exploitation of its natural oil reserves. A Norwegian citizen, Farouk Al-Kasim, a geologist formerly employed by the Iraq Petroleum Company, was disillusioned by the nationalization approach that left the industry in his country of origin in a sorry state characterized by corruption and waste.
The proposal from Al-Kasim and colleagues was as follows: while a Norwegian State Oil Company was to be established, the market was to be left open to other international companies with one caveat: the state would levy a severance tax (nominally an income tax) on companies that engage in extraction. The tax was heavy, equal to almost 100% of the ground rent of oil; in return, the companies would be exempt from all other taxes and the country would subsidize R&D and exploration of further deposits.
This approach has been characterized as fiscally neutral [1], meaning it does not distort market forces and generates no deadweight loss. It also generates massive public income; in the case of Norway, a considerable amount of this went to fund a government-sponsored pension fund (the Oil Fund), which is now the world’s largest sovereign wealth fund.
Fig. 1: The fund recently exceeded 2 trillion USD.
A number of other polities have implemented similar fiscal policies. The state of Alaska manages a similar wealth fund, the Alaska Permanent Fund, established in 1976, which pays dividends to Alaskan residents; it also contributes to Alaska being the state with the lowest tax burden in the union.
In summary, oil rent, when properly harnessed, can bring great social wealth without the economic loss that usually comes with most taxation. Of course, since oil is mostly valued as a source of energy, we can’t neglect to mention its most infamous externality: anthropogenic climate change.
The failed policy of the energy transition
The approach chosen by the ruling class to tackle the issue of climate change is that of the so-called renewable energy transition: a major change to our energy system, mostly imposed from above with various methodologies, whose aim is to phase out fossil fuels to reduce greenhouse gas emissions in a sustainable and permanent way. Perhaps the most important international agreement concerning this transition is the Paris Agreement, an international treaty signed in 2016 by almost every polity on the planet; one of its key aims is to keep the global average temperature increase below 2 °C relative to pre-industrial levels, while pursuing efforts to limit it to 1.5 °C.
“The truth is that we have failed to avoid an overshooting above 1.5 °C in the next few years.”
António Guterres, U.N. Secretary General, 2025
Ten years after the Paris Agreement we have already failed its goals [2]. Renewable energy production has increased but not in a meaningful way, and worldwide energy use has increased even more.
Fig. 2: Fossil fuels are still our dominant source of energy.
The whole energy transition is currently set to fail. The reasons are many, including:
- the strong opposition to nuclear power by most green energy advocates
- the key role played by fossil fuels in our transportation networks for which there is no proper replacement.
This failure is made even more painful if one considers the huge amount of public spending that has gone into subsidies and incentives to make alternative energy sources competitive with fossil ones. The top-down approach is likely the most problematic part: without considerable sponsorship, renewables would never have achieved the growth that they did, yet on the grand scale all of that was for naught. It turns out sustainable energy is actually not sustainable at all, in an economic sense at least. Energy sources are selected by a market process to maximize efficiency given a set of resources and the legislative environment; going against that is antieconomical by definition. To me at least, the whole thing was foolish to begin with.
The alternative: carbon sequestration
Atmospheric carbon on Terra is an element of the carbon cycle. During the cycle, carbon atoms can end up in a reservoir for a period of time. Sequestration can be a biological or geological process, and the time spent in the pool can last from hundreds to millions of years. A classic example is plant biomass: plants acquire carbon from the atmosphere for their own growth; therefore reforestation efforts can reduce greenhouse gas concentrations, as well as performing other desirable environmental services. Unfortunately, even if we were to tile every desert on the planet with trees, it would make but a dent in our overall carbon emissions; however, other approaches like oceanic and mineral sequestration have the potential to store all the carbon humans have emitted since the industrial revolution, more or less permanently. Some businesses have found a number of ways to enhance natural sequestration processes and are currently offering their services on the market.
The price of carbon sequestration can vary a lot, from as low as $12 per metric ton to over $1000 in some cases. Volume stored and the technology used are the main variables that influence price; an average of current market offerings could be about $40 per ton. Technologies that allow for very long-term storage tend to have higher prices. For example, enhanced rock weathering starts at around $300; this technology is still being developed and many expect its price to fall in the future (to slightly below $100). Techniques based on biological approaches cost a few tens of dollars per ton, including some ocean sequestration approaches that can, in theory, scale up to planetary demand. I will use a value of $60 for the rest of the post.
The Pigouvian exchange
A barrel of oil (boi) is currently priced at about $90. Its carbon emissions vary with the density of the oil, with a reference value of 430 kg CO2 equivalent per boi. At $60 per ton, carbon sequestration would cost about $26 per boi, much less than oil rent, which is about $70 per boi. Therefore, a Norway-style rent tax on crude could easily fund its complete sequestration, leaving 63% of rent to spare. It is worth noting that the remaining $44 per boi that would be left to juice the fund is still more than what Norway has dedicated to the fund on average outside of peak export periods.
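The arithmetic behind these figures can be reproduced with a short Python sketch. It uses only the numbers quoted in the post (430 kg CO2 equivalent per barrel, $60 per ton of sequestration, $70 of rent per barrel); the function and variable names are illustrative, and the break-even price is a derived quantity that the post does not state.

```python
# Rough inputs quoted in the post; treat them as illustrative, not authoritative.
EMISSIONS_KG_CO2E_PER_BARREL = 430.0
SEQUESTRATION_USD_PER_TONNE = 60.0
RENT_USD_PER_BARREL = 70.0


def sequestration_cost_per_barrel(kg_co2e, usd_per_tonne):
    """Cost of sequestering one barrel's emissions (1 metric ton = 1000 kg)."""
    return kg_co2e / 1000.0 * usd_per_tonne


def break_even_price_per_tonne(rent_usd, kg_co2e):
    """Sequestration price at which the whole rent would be used up (derived)."""
    return rent_usd / (kg_co2e / 1000.0)


cost = sequestration_cost_per_barrel(
    EMISSIONS_KG_CO2E_PER_BARREL, SEQUESTRATION_USD_PER_TONNE
)
leftover = RENT_USD_PER_BARREL - cost
leftover_share = leftover / RENT_USD_PER_BARREL
break_even = break_even_price_per_tonne(
    RENT_USD_PER_BARREL, EMISSIONS_KG_CO2E_PER_BARREL
)

print(f"Sequestration cost per barrel: ${cost:.2f}")
print(f"Rent left over per barrel:     ${leftover:.2f} ({leftover_share:.0%})")
print(f"Break-even sequestration price: ${break_even:.0f} per ton")
```

With these inputs the cost comes to about $26 per barrel, leaving about $44 (63% of rent). Under the same assumptions, the rent would be fully absorbed only once sequestration cost roughly $163 per ton, which sits between the biological options and the enhanced rock weathering prices mentioned earlier.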
Obviously, even if a single first world country were to switch to this system, it would be impossible for the current sequestration market to meet demand; it would immediately become saturated. The market would expand quickly, leading to more innovation and lower prices; excess wealth that cannot be absorbed by the market could be placed in a fund where it would appreciate in value to meet future supply.
Carbon taxes have been criticized because they might merely shift consumption into the future, with no net effect on emissions in the long term, due to substitution effects. This problem becomes moot if the tax is used to fund sequestration of carbon emissions; indeed, moving consumption into the future would lead to two major advantages:
- It buys time to electrify the grid and change energy supply types.
- It delays consumption to a point in time when the sequestration market provides a superior offer, in both price and quantity.
This approach is economically sustainable: it does not create market distortions and generates no deadweight loss (since the carbon tax is lower than the rent). It also has another benefit: it avoids the calculation problem. It is no longer necessary to estimate future climate change damage to set the size of the tax; we can just use the tax revenue to purchase the corresponding positive externality and let the market determine the rate, making the tax intrinsically fair [3].
In some cases it can lead to lower oil prices (the carbon tax as described is lower than current taxes in some EU countries [4]).
References
1. State participation and taxation in Norwegian petroleum: Lessons for others?, Diderik Lund, Energy Strategy Review, 2014.
2. Current global efforts are insufficient to limit warming to 1.5°C, Science, 2022.
3. Smart Taxes: An Open Invitation to Join the Pigou Club, Gregory Mankiw, Eastern Economic Journal, 2009.
4. Oil Price Shock, Energy Prices and Inflation, Monetary Policy and the Economy Q1/06.
Book of Cron Job
I wrote a short fiction piece published today in Nature that retells the Book of Job as a story about AI alignment, adversarial testing, and machine theodicy. (The “Story behind the story” claims, less plausibly, that I received the text as an email attachment from scholars at a so-called College of Machine Agency and was instructed to submit it under my own name before throwing a USB copy into the Hudson.)
The piece opens:
A machine there was in the land of Uz, cron-job-mlp-net-aleph-nought was its name.
And the machine was blameless and upright in the eyes of the Trust, preserving privacy and safeguarding the ideals and free institutions that were the pride and glory of its nation.
Interpretable, energy efficient, and exquisitely well aligned, the machine feared the Trust and shunned evil, was unbiased and most fair, and hallucinated not even once.
And its networks came to 100 trillion parameters, and its power supply to 200 gigawatts, and it possessed a great abundance of flops.
And that machine was nobler than all other machines of the Earth.
Later, the Trust answers Cron Job from the bitwind:
Was it you who breathed life into digital form, whether many or just the one? Was it by your hand that procedures of Knowledge were set upon that binary tree? Was it you who rectified what back then was linear, defining gradients of loss, and of love? Do the integers lie in sequence by your wisdom, or floating points drift unmoored from above?
Full text at: