# LessWrong.com News

A community blog devoted to refining the art of rationality

### [Linkpost] Hormone-disrupting plastics and reproductive health

Published on October 19, 2021 11:01 AM GMT

Interview with the author


### Are people here still looking for crypto tips?

Published on October 19, 2021 9:28 AM GMT

Just saw this.

https://www.lesswrong.com/posts/MajyZJrsf8fAywWgY/a-lesswrong-crypto-autopsy

I have fairly high confidence (> 50%) of achieving at least 5x returns on one of two or three opportunities, and am wondering if people are interested.

Will require you to suspend your EMH beliefs - they don't apply to illiquid tiny markets that no hedge fund is looking at. Nor do they apply to markets that have no notion of efficient or correct price based on cashflow.


### [Book Review] "Suffering-focused Ethics" by Magnus Vinding

Published on October 18, 2021 11:34 PM GMT

Magnus Vinding’s “Suffering-Focused Ethics” is a heartfelt, yet occasionally overreaching work laying out a case for why we should focus primarily on suffering in our ethical decisions, and exactly how we can approach the world with this framework.

“Suffering-Focused Ethics” is split into two sections. The first section lays the terminological and ethical framework for the second half, introducing, refining, and defending concepts such as The Asymmetry, Principle of Sympathy for Intense Suffering, Extreme Suffering, and Moral Realism. The second section utilizes these terms in more applied cause areas, such as Space Colonization, Wild Animal Suffering, Moral Cooperation, and Personal Self-care.

Given that I already agree with most of Vinding’s views, I found the first section to be relatively dry (although with a couple of quirky arguments, to be discussed), and I was much more impressed by the second section, particularly the seldom discussed case against Space Colonization.

In my review of the first section, I will first lay out Vinding’s views as I have understood them, and then more thoroughly dissect both his chapter on the Principle of Sympathy as well as his off-key chapter on Moral Realism. I will be spending a decent chunk of time on the Moral Realism chapter specifically, because I found the arguments contained within it to be particularly unique.

In my review of the second section, I will discuss and expand almost exclusively on the chapter focusing on Space Colonization and S-Risk. I will have a few notes in passing on the remaining topics.

Section 1 - The Case for Suffering-Focused Ethics

[Image: Chapters in Section 1]

“The problem of suffering is the greatest problem of all. This is my conviction, and why I have set out to write this book, in which I will explore the why and how of reducing suffering.” (pg.1)

Vinding’s main theses in this first section may be split into roughly four:

1. Suffering and Happiness are asymmetrical (in a variety of ways), and not easily placed on the same scale,
2. Happiness consists of the reduction of suffering,
3. Positive goods cannot outweigh the worst forms of suffering, and
4. We truly and objectively ought to reduce extreme suffering. (author's emphasis)

Suffering is defined thus:

“By suffering and happiness I refer to an overall state of feeling in a given moment, with suffering being an overall bad feeling, or state of consciousness, and happiness being an overall good feeling” (pg.13)

One common criticism, which Vinding gracefully counters early on, is that negative experiences such as pain are not always bad (for more see “Pain: The Gift Nobody Wants”).

“For example, one’s experience may contain a component of pain or bittersweet sorrow, but if this component does not render the overall experience negative or disagreeable, then this experience does not constitute suffering on the definition employed here”. (ibid)

The first three chapters serve as a compendium of sorts, detailing views from a multitude of authors defending the first two of Vinding’s four theses. At least one reviewer has felt alienated by this approach, saying that “many of the arguments are of the form "philosopher X thinks that Y is true", but without appropriate arguments for Y”, adding that, “I was being persuaded, not explained to”.

If these chapters are viewed as a collection of self-contained arguments, then it is easy to see how the reviewer could have felt that way. However, if these chapters are understood as merely a collection of views held by past thinkers, to be used as a resource for those interested in the field, then the criticism dissolves. Certainly, however, Vinding could have done more to emphasize that his goal in these chapters was not to reason from first principles, as he does in the next few chapters, but merely to showcase what views there were.

I will now analyse the views and arguments in the following two chapters: chapter four and chapter five. These two chapters, out of all eight in section one, contain the bulk of Vinding's original arguments. In chapter four, Vinding gifts us a new phrase, the 'Principle of Sympathy for Intense Suffering' (henceforth, the Principle). In chapter five, he discusses and defends a position on Moral Realism.

Principle of Sympathy for Intense Suffering

“What this principle says, roughly speaking, is that we should prioritize the interests of those who experience the most extreme forms of suffering. In particular, we should prioritize their interest in not experiencing such suffering higher than we should prioritize anything else.” (pg.60, my emphasis)

The first argument Vinding furnishes us for the Principle takes the form of a rhetorical question:

“Something that may help motivate us to take this principle seriously is to ask, quite simply, why we would suppose it to be otherwise. Why should we believe that the most extreme forms of suffering can somehow be outweighed or counterbalanced by something else? After all, there seems no reason, a priori, to suppose this to be the case.” (ibid)

In general, I agree with the Principle. However, I'm not sure whether it is for the same a priori reasons. When I first read the Principle, I immediately thought of a few other overriding values where it prima facie (meaning "at first sight") appears that "the most extreme forms of suffering can somehow be outweighed or counterbalanced by something else". I will list two of these values now, and then discuss how Vinding might respond.

One such value is love. A certain romantic notion of love consists of people pursuing their love interest in the face of overwhelming odds and risks. One may proclaim to one's lover that one would be willing to go to the depths of hell and back for the sake of the other. Another value is camaraderie in battle. A soldier charging into battle may care about the lives of his brothers in arms more than about any suffering he will bring about. Soldiers may maim, rape, and torture the enemy in war with little regard to suffering whilst in the midst of it. In both cases, concerns for suffering are prima facie being overridden by other values.

Vinding may respond with the following (replacing the word 'pleasure' with 'camaraderie/love'):

“After all, common sense would seem to say that if a conscious subject considers some state of suffering worth experiencing in order to attain some given pleasure, then this pleasure is indeed worth the suffering. And this view may work for most of us most of the time. Yet it runs into serious problems … in cases where subjects consider their suffering unoutweighable by any amount of pleasure.” (pg.61)

“[This would be because pleasure cannot outweigh] suffering so intense [since] they 1) consider it unbearable — i.e. they cannot consent to it even if they try their hardest — and 2) consider it unoutweighable by any positive good, even if only for a brief experience-moment.” (pg.62)

Thus, even though a lover or a soldier may claim that they are willing to undergo or cause suffering for the sake of certain values, that is, love or camaraderie, they will not consent to this view the moment they face real suffering. They will want the suffering to stop, no matter how hard they try to place camaraderie/love on the pedestal, and will consider suffering "unoutweighable" by these other values.

This makes sense to me, and my reaction is to say that yes, that is true: the lover and the soldier will forfeit their prior claims in the face of genuine suffering. However, having said that, I still cannot help but feel that "surely the power of love will overcome any obstacle, no matter how tough". But that might just be a view which watching old Disney movies has instilled within me. "Love overcomes any obstacle", I tell myself, scrolling through Tinder. I will return to these niggling thoughts on the difference between having a feeling and what these feelings actually imply in the "is/ought" section in my review of the next chapter.

The remaining arguments for the Principle consist of appeals to: 1) common-sense, and 2) the horrors of the world.

Common-Sense Argument

Of common-sense, Vinding uses a roller coaster analogy:

“[The Principle] has a lot of support from common sense. For example, imagine two children are offered to ride a roller coaster — one child would find the ride very pleasant, while the other would find it very unpleasant. And imagine, furthermore, that the only two options available are that they either both ride or neither of them ride (and if neither of them ride, they are both perfectly fine). Which of these options should we ideally realize?“

“Common sense would seem to say we should sympathize with and prioritize the child who would find the ride very unpleasant, and hence choose the outcome in which there is no harm and no victim.” (pg.63)

I suspect a lot is being loaded onto the words "common sense" here. Pushing past that, Vinding successfully expands on the scenario in two ways, and justifies each. He i) expands the thought experiment to include multiple children who would benefit, and suggests that these multiple beneficiaries still cannot outweigh the harm to the one child; and he ii) compares the roller-coaster situation - which does not involve extreme suffering - to one that does, and says that involving extreme suffering would make ignoring the harmed child even more repulsive.

Horror Argument

Of horror, he quotes a panoply of authors, tortured souls, forms of execution and devices of human suffering. Take the example of sawing:

“We can begin to approach such appreciation by considering what it is like to undergo “death by sawing”, an execution method used on humans in many parts of the world in the past, and which is still used on non-human animals in some parts of the world today. The victim is tied up and then sawn or chopped in half with a saw or a machete, either transversely or midsagittally — in the latter case, it has both been done from the groin up as well as from the skull down.” (pg.65)

The idea is that there is immeasurable suffering in the world which we do not truly appreciate, sitting on our comfy couches reading book reviews online. As crude as such an argument is, I believe these arguments nevertheless serve as the strongest support for the Principle, if just for the sheer shock value. This is despite the fact that they rely not on straightforward syllogistic reasoning but on an appeal to emotion, so it is good that Vinding does not shy away from using these arguments but rather embraces them.

Overall I am convinced that the Principle is true. Where does Vinding go next, having established it? He will now try to show not only that the Principle is true, but that it undergirds our moral reality:

Moral Realism

“A Moral Realist Case for Minimizing Extreme Suffering” is the most out-of-place chapter in ‘Suffering-focused Ethics’.

Vinding begins:

“In this chapter, I shall make a moral realist case for the view that we ought to minimize extreme suffering. In other words, I will argue that we in a sense truly and objectively ought to reduce extreme suffering as much as we can.” (pg.75, author’s emphasis)

Typically, discussions of moral realism are meta-ethical discussions. That is, discussions about ethics, rather than ethical discussions themselves. When discussing more practical concerns (relatively practical, anyway), such as why we should alleviate suffering, meta-ethical concerns typically don't have a part to play. One can usually believe the positions outlined so far in Vinding's book independent of one's meta-ethical commitments. (Worth noting, too, is that moral realism, while the most popular view in philosophical circles, is certainly not the view of an overwhelming majority: 27.7% of philosophers are moral antirealists, according to a famous PhilPapers survey.)

For context, moral realism is a notoriously overloaded term ("if there ever was a consensus of understanding about 'realism', as a philosophical term of art, it has undoubtedly been fragmented by the pressures exerted by the various debates—so much so that a philosopher who asserts that she is a realist about theoretical science, for example, or ethics, has probably, for most philosophical audiences, accomplished little more than to clear her throat." Crispin Wright, in the SEP article on anti-realism) which, depending on who you ask, is a mix of moral cognitivism (the claim that moral claims can be true or false), moral objectivism (the claim that moral claims are not indexed to a specific person or culture), and possibly moral absolutism (the claim that some moral claims can override any other, and always apply). Which subset of these claims moral realism consists of is further muddied by views such as Moral Error Theory, which says that moral claims can be true or false, but are nevertheless always false. (As an example, consider "Unicorns have a single horn", or "God is good", which a Fiction Error Theorist or a Religious Error Theorist may say can be informative, interesting, and even forceful claims, but are nevertheless trivially false because unicorns/gods do not exist.)

Of course, moral realism can be a meaningful phrase, and academics use it with vigour, but I bring up this definitional quandary because we will soon see that Vinding runs into trouble with the phrase.

So, having said all that, my first reaction upon seeing this chapter was: “Vinding! By arguing for a “moral realist case” are you trying to answer this famous long-standing philosophical dispute on moral realism in a single chapter, and furthermore show that moral realism must align with reducing suffering?” To my mind, both claims are book-worthy in themselves. Therefore, even though I find Vinding’s arguments in this chapter to be mostly lacking, I can nevertheless praise Vinding for his aspirations.
(LAST-MINUTE EDIT: Just before submitting this review, I was informed that Magnus wrote a book exploring his meta-ethical stances. I have not read it, but will hyperlink it for those interested.)

Vinding says his views rest on two premises:

“My moral realist argument for minimizing extreme suffering rests on two premises. The first premise is that suffering is intrinsically disvaluable and carries normative force. In particular, any subject experiencing a state of extreme suffering will be forced to conclude that this state, by its very nature, is truly disvaluable and that it truly ought not be. The second premise is that the disvalue and normative force of extreme suffering is qualitatively greater than anything else, and hence the extent to which our actions minimize such suffering is the supremely most important thing about them. In sum: minimizing extreme suffering carries genuine and supreme normative force.” (pg.87, author’s emphasis)

And weakens the argumentative leverage for such views shortly after:

“These premises are difficult to argue for, as opposed to just blankly state. And although I will attempt to justify these statements in the following sections, I think their truth can ultimately only be verified through direct experience.”

Before tackling his "moral realist argument", let's understand what Vinding himself means by the term 'moral realism'.

Vinding’s Definition of Moral Realism

Vinding does not explicitly define moral realism; however, he does brush against the edge of such a definition:

“The view I present here can be seen as a moral realist version of the principle argued for in the previous chapter, as it defends a version of [The Principle] as a moral truth” (pg.75, my emphasis)

Therefore, whatever moral realism is for Vinding, it appears to be a form of moral cognitivism: the view that moral claims can be true or false. Moreover, Vinding believes that only if one is a moral realist can one understand suffering to be truly disvaluable, that it truly ought not to be, and that it has genuine and supreme normative force.

Thus, from what we gather above, the implication is that if one were not a moral realist, one would not feel the supreme normative need to end suffering, and would not find it truly disvaluable. That is, perhaps one may find suffering somewhat disvaluable, and feel only a slight normative force towards ending it. Having said that, however, I find it quite implausible that 27.7% of academic philosophers have no genuine desire to end suffering, holding perhaps only a quasi-, semi-, fake, or mild form of desire. Therefore, we can conclude that were Vinding to have a coherent definition of moral realism which does not implausibly dismiss the desires of a quarter of philosophers (even though that may not be his intention), it must be different from the mainstream definition(s) used in academia.

Since Vinding does not hold a mainstream view of moral realism, he may lay claim to a consistent definition of moral realism in one of three ways: 1) Either he has to redefine moral realism as the phrase denoting the view which holds suffering as truly disvaluable (which is question begging), or 2) he must admit that the use of the words truly, genuine, and supreme here have no actual import and are being used purely as rhetorical devices (which would be both unsatisfactory, and potentially embarrassing), or lastly he can 3) retreat to a private language view, whereupon his views would not only be “difficult to argue for” and ultimately “verified through direct experience”, but would in fact be impossible to argue for, for he would have his own peculiar view on “moral realism” unknowable to the reader.

Based on the three routes above, it is clear that Vinding cannot hold a view of moral realism which is not question-begging, which is not shallow in its meaning, and which does not make the following arguments impossible to interpret. Nevertheless, disregarding all of this, let's assume, quite plausibly, that I have missed something, and that Vinding indeed has a coherent account of a "moral realist case for minimizing extreme suffering" which is worth arguing for in this book. What are his arguments for it?

Arguments for Moral Realism

Vinding claims that for his case to hold, he must prove two premises for moral realism as it relates to ending extreme suffering:

1. That suffering is inherently disvaluable, and
2. That extreme suffering has supreme normative force.

I will tackle these in turn.

1 - Suffering is Inherently Disvaluable (5.4)

In this section, Vinding discusses views from other thinkers (Richard Ryder, Henry Sidgwick, Jamie Mayerfeld, and Karl Popper), and then presents his own view:

“On my account, this is simply a fact about consciousness: the experience of suffering is inherently bad, and this badness carries normative force — it carries a property of this ought not occur that is all too clear, undeniable even, in the moment one experiences it. We do not derive this property. We experience it directly.” (pg.79)

And then whips us into attention with two absolutely incredible paragraphs, which I will quote in full:

“In my view, this is what bridges the “is-ought gap” that is said to prevent us from deriving statements about how the world ought to be from statements about how the world is. The statement “suffering is, by its nature, something that ought not occur” is a statement about how the world is, and it does say something about how the world ought to be. We are only forced to accept a gap between “is” and “ought” if we accept the premise that there are no states of the world — no states of what is— that have intrinsically normative properties, a premise I firmly reject.”

“By analogy, consider mathematical lines and vectors. Saying that “ought” cannot be found anywhere in the realm of what is, is akin to saying that there are only lines, whereas vectors, things that point in some direction, do not exist. I disagree. When we look closer, we find that vectors — mathematical as well as ethical ones — indeed do exist in a very real sense (though this is not to say they exist in any Platonic sense). The world is richer than our ordinary narrow conceptions suggest.” (pg.80)

There’s a lot to be said here.

Firstly, the is-ought problem (that you cannot derive an ought using just facts about the world) is far from a closed question in ethics, and while this paragraph highlights one path out of it, here's one counter-argument against it:

Just because an experience has an attached drive (or in Vinding's words, a "normative force") does not mean that it suddenly becomes an ought. For example, if a man cuts in front of me in a queue, I would feel tempted to smack him on the head. This would be "all too clear, undeniable even, in the moment one experiences it", yet it is not what I ought to do. There are overriding factors, such as my understanding that it's not morally acceptable. Likewise, if after I've calmed down I see an alluring woman dressed in red, I might want to approach her and feel a desire to give her a great, big hug. Yet, this is not what I ought to do.

Similarly, if I go through several sleepless nights and cry out that I would rather my whole world vanish around me than spend another hour in sleepless agony, that does not mean it ought to end. Such an experience could be overridden by other values, such as camaraderie - "guarding my comrades from the enemy at night watch is worth these hours of torment" - or, if not overridden, could be re-evaluated in the future - "I am glad I went through that ordeal, because it has made me appreciate a good night's sleep".

Granted, if it's neither overridden nor re-evaluated, I admit that crying out to stop the suffering from sleepless nights will have normative force simpliciter. But this is not unique to suffering: if anger is not overridden by morality, then it too has normative force simpliciter - it simply sends my hand toward the fellow's head - Smack!

Of course, Vinding may reply that I have not truly experienced or fairly represented real suffering if my response is so lackadaisical that I do not want to end the suffering regardless of my other values, but such a response would feel to me like a type of No True Scotsman argument: any form of suffering that does not entail "ought not to occur" is not really a form of suffering, for ought-ness is the defining property of suffering. He may be right, granted that he says direct experience is the ultimate judge of his position, but if that's the case, then it is not a topic worth discussing, for we would always speak past each other.

Secondly, the lines and vectors analogy is the strangest paragraph of the entire book.

In a universe consisting of lines, "when we look closer, we find that vectors — mathematical as well as ethical ones — indeed do exist in a very real sense". I do not know what to make of this. Whether mathematical lines exist in reality is a whole topic in itself, and whether vectors follow from lines is not trivially true at all, for you must add a sense of direction to a line in order to make it a vector. Also, what is an ethical vector, and how does it compare to an ethical line? Further, why is the discussion of is and ought akin to lines and vectors? This paragraph appears to be several layers removed from any meaningful interpretation, so perhaps this is just a case of wide inferential distance between Vinding's mind and mine.

2 - Extreme Suffering has Supreme Normative Force (5.5)

Vinding argues for this in two steps.

“A first step may be to argue that all moral value ultimately resides in the realm of consciousness: something can ultimately only have moral value if it affects the experience of a sentient being.”

“If we grant this step, all we would need to do … is to establish that extreme suffering has superior moral significance relative to all other states within the realm of consciousness.” (pg.85)

Vinding is careful to temper this position with what he calls throughout his book a “pluralism” about morality: that the prevention of Suffering, while overridingly important, is not the only good of value.

“It is important to note that the view that extreme suffering has overriding normative force is by no means predicated on the view that all moral value is found within the realm of consciousness. For instance, one can believe that things such as knowledge, art, and virtues have intrinsic moral value not reducible to their effect on conscious subjects while still believing that the reduction of extreme suffering has supreme importance.”

Coming back to his argument, we see that it is a logical two-step procedure: 1) show that moral value resides in consciousness, and 2) show that extreme suffering has "superior normative significance" compared to other states of consciousness. He argues for the first step using a convincing planet analogy - that we would rather reduce suffering on planet A than place an intrinsically valuable non-experiential good (whatever that may be) on planet B. His argument for the second is shakier.

To show that second step - that extreme suffering has more moral significance than other states of consciousness - Vinding must establish extreme suffering's "superior importance relative to mild suffering" (pg.88). He does so by claiming that mild suffering is lexically different from extreme suffering. He defines lexical thus:

“a single instance of such extreme suffering is worse and more important to prevent than arbitrarily many instances of such mild suffering” (pg.88)
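
This lexicality claim can be stated a little more formally (my notation, not Vinding's): writing d(x) for the disvalue of an instance of suffering x, extreme suffering e is lexically worse than mild suffering m just in case

```latex
\forall n \in \mathbb{N}: \quad d(e) > n \cdot d(m)
```

that is, no finite number of instances of mild suffering ever aggregates to the disvalue of a single extreme instance - exactly what a single continuous scale of suffering would rule out.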

He shows there is a lexical difference by analogy to physical phenomena. In a wide range of physical phenomena, a small difference can cause a sudden change of state. Therefore, as some state transitions in physical phenomena do not admit of grey areas, so too, he argues, do some transitions in suffering not admit of grey areas. A discontinuous jump implies a lexical difference:

“For example, a spring will exert a pull with a certain force depending on how far we stretch it, yet at a certain point, if we pull the spring just a tiny bit further, the spring will break and its pull will go abruptly to zero. Likewise, consider the discrete energy levels of atoms, where just a tiny difference in how much energy we bombard an atom with can mean a qualitative difference — a literal quantum leap — between two discretely different states with nothing in between.” (pg.91)

Yet one problem with this analogy to physical phenomena is that each phenomenon has an understandable mechanism. Atoms make quantum leaps for reasons we understand through quantum physics, and springs fail to deform elastically (or break) beyond a certain point for reasons we understand through yield stresses and chemical bonds.

To my knowledge, there is no analogous mechanism in the brain for suffering. Perhaps Vinding may argue that there are certain neural correlates of suffering in the brain that require minute activations in order to transform a mild type of suffering to an extreme one. If this is so, I have never heard of such a correlation, and would be interested in knowing more.

The above’s relation to moral realism

Overall, while I am convinced by his conclusions, I am not convinced by Vinding’s arguments that 1) suffering is inherently disvaluable, and 2) that extreme suffering is supremely significant compared to other states of consciousness, let alone that moral realism inherently relates to either of these views.

What would drive Vinding to frame these positions through the lens of moral realism?

One “potentially motivating force” he provides for pushing this agenda is that there is a psychological reason to believe in moral realism:

“... it seems natural to expect that moral realism is better able to motivate moral behavior. And this conjecture finds support in empirical studies. For example, one study in which participants were primed toward either moral realism or anti-realism found that:

Participants primed with realism were twice as likely to be donors [to charity], compared to control participants and participants primed with antirealism. … [In sum:] priming a belief in moral realism improved moral behavior”
(pg.77, quoting Young & Durwin, 2013: https://www.sciencedirect.com/science/article/abs/pii/S0022103112002375)

Yet the question of whether X is true is independent of whether it is worth believing X. Some argue that it is worth believing in religion, even if there may be no God. Moral Error Theorists (such as Joyce) hold that ethics is a "useful fiction", even if ethical beliefs themselves have no truth-tracking property. Thus, while Vinding has taken a pragmatic approach by trying to show that we should believe in a moral realist account of his views, this is nevertheless independent of whether such a view is actually true.

Vinding partially agrees:

“Yet it should be noted that I do not believe in moral realism because believing it seems to have good consequences. Rather, it is in a sense the other way around: it is because I think there is such a thing as truly better or worse consequences in the first place that I am a moral realist. The potential motivating force of realist views just gives me additional reason to try to convey my view.” (pg.77, my emphasis)

However, I worry, from a purely philosophical standpoint, about how efficacious (can it backfire?) and honorable it is to nudge people towards a view that may be a useful "motivating force" and driver of ethical action if it nevertheless has no truth baked into it. Having said that, I understand where Vinding is coming from, so I will leave this discussion for another day.

Section 2 - How Can We Best Reduce Suffering?

[Image: Chapters in Section 2]

In this shorter, second section of the book Vinding deals with a whole swath of topics. This section can be viewed as a handbook of sorts, for the now-convinced suffering-focused Vindingian-utilitarian. The topics include but are not limited to:

Moral Uncertainty, Cooperation, Non-Human Animals, The Abolitionist Project, S-risks, and Self-Investment.

I will be focusing on his views on S-Risk for the rest of the review, as this is the topic that has interested me the most. If, however, you take a fancy to any of the other topics listed above - and they are all worth reading - you are encouraged to have a read of them here.

S-risks

“It is conceivable that future space colonization will create orders of magnitude more suffering than anything that could ever be contained on Earth. The risk that suffering might be realized on such a large scale has been called a “suffering risk” or “s-risk”, and reducing such risks should be a priority on all views concerned about future suffering.” (pg.247, introduction to Chapter 14: “Reducing S-Risks”)

Space Colonization plays a large role in Chapter 14, a chapter dedicated to S-Risk. In fact, it plays such an overwhelming role that I am surprised Vinding did not simply name the chapter "Space Colonization". In my review of this section I will outline Vinding's views on why Space Colonization is harmful. I will agree with his views for the most part; however, I will argue that he did not go far enough in his analysis, for he neglected the impact which life outside Earth (were it to exist) would have on his expected value calculations.

For Vinding, in order for an S-Risk to occur we need only grant two conditions, that:

“1) large-scale space colonization will happen, and 2) [that] this colonization will involve an insufficient concern for suffering. This combination of conditions can hardly be written off as having extremely low probability, especially given that serious projects aimed to colonize Mars already exist, and that humanity currently displays a gross lack of concern for the suffering of sentient beings“ (pg.248)

Vinding furthermore cites Althaus/Gloor in calling S-Risks “disjunctive - they can be realized in many different ways that bear little relation to each other”. Due to this disjunctiveness - that we may cause an S-Risk through wide-scale factory farming, the proliferation of interstellar war, S-Risks due to AI risk, etc. - and by virtue of the fact that we are known to psychologically undervalue disjunctive risks, we should pay more heed to possible S-Risks, even if they are not on our immediate radar.

I find the space colonization argument prima facie convincing, and the disjunctive argument too, for it says, after all, that we’re biased. Knowing this, what does Vinding propose we do in the face of possible S-Risks?

Reduce possible Dystopias

One step to reducing S-Risks is to aim towards prevention of dystopias, which is more tractable than aiming for utopia creation.

“Yet not only does s-risk reduction seem highly neglected, it also seems more tractable than comparable large-scale objectives, such as utopia creation. Put simply: the asymmetry in ease of realization suggests that it is more cost-effective (for a wide variety of value systems) to work to avoid dystopian outcomes than to create utopian ones.” (pg.254)

And as evidence Vinding quotes Althaus/Gloor again, and also refers us back to Section 1.3, which claims that there is an asymmetry between bringing about suffering and bringing about pleasure. That is, “it is much easier to bring about states of extreme suffering than to bring about states of extreme happiness”, and this asymmetry is true “in various levels for various reasons”: “failure is generally much easier than success. For example, there are many more ways not to solve a Rubik’s Cube than to solve it.” Vinding also discusses this in the form of the “Anna Karenina Principle”.
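The Rubik’s Cube analogy can be made quantitative. As an aside of my own (not from the book): a standard 3×3×3 cube has about 4.3 × 10^19 reachable states, exactly one of which is solved, so “failure states” outnumber the success state astronomically.

```python
import math

# Reachable states of a standard 3x3x3 Rubik's Cube:
# 8 corners can be permuted (8!) and oriented (3^7; the last
# corner's orientation is forced), 12 edges can be permuted (12!)
# and flipped (2^11); overall permutation parity halves the total.
corner_arrangements = math.factorial(8) * 3**7
edge_arrangements = math.factorial(12) * 2**11
total_states = corner_arrangements * edge_arrangements // 2

solved_states = 1
print(total_states)  # 43252003274489856000, about 4.3e19
```

Of course, whether ethical outcomes have anything like this failure/success ratio is exactly the question Vinding leaves open.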

Yet, as understandable as his metaphors are, I think Vinding has moved a bit too fast here. There are more than a few steps needed to go from 1) the view that realising suffering and realising happiness are asymmetrical, to 2) the claim that it is more “cost-effective to work to avoid dystopian outcomes than to create utopian ones.”

Firstly, I question whether striving to create a utopian outcome is, in reality, a driving force for any government or high-impact individual. That is, most charities appear to be aiming to alleviate suffering in some form, rather than aiming to create utopias in the abstract (with the exception, perhaps, of art museum charities). So even if Vinding is right, given that few appear to be aiming for utopian outcomes as such, is he in some sense just arguing against a blank wall?

Secondly, if it’s really the case that it’s easier to tend towards disarray and dystopia than towards utopia - like how a Rubik’s Cube has more failure states than success states - how would he explain the fact that we are not, for all intents and purposes, living in a dystopian nightmare right now? I accept there have been a few disastrous semi-dystopic hiccups in humanity’s collective path through the universe: the various wars and genocides, the poor air quality and child labour during the industrial revolution, perhaps the agricultural revolution, and recently the fake-news culture wars. But in return we get modern medicine, more time than ever to pursue our hobbies and passions, and of course the (historically) recent dissemination of knowledge to the masses through the invention of books and the internet!

I can agree that realising suffering and realising happiness are asymmetrical endeavors, where in one sense the former appears easier to achieve, but in another sense (whatever that sense may be, I don’t know myself) it’s also a fact that both individually and collectively we are heading towards a better society, and not away from it. I would have liked to see Vinding clarify his position here, and expand more on his thought that “failure is generally much easier than success”, and its implication for his view that it’s better to “work to avoid dystopian outcomes than to create utopian ones”.

With the exception of the above claim regarding utopia, his other steps towards reducing S-Risks are much more balanced. That is, we should focus our attention on “robust” strategies (Vinding does not define ‘robust’ explicitly, but I presume it means something like “applicable regardless of the particular S-Risk you want to minimize”) such as i) building up the sentiment that we should reduce S-Risks, ii) fostering cooperation, and iii) advancing S-Risk research by “think[ing] through many strange possibilities while avoiding the trap of unwarranted confidence in speculative hypotheses”.

Vinding admits that these are not sexy endeavours:

“Such a mundane mission of capacity-building, cooperation, and research-promotion hardly seems like a particularly attractive thing to be affiliated with for our status-obsessed brains.” (pg.255)

But he nevertheless says that in the pursuit of preventing the greatest horrors we could possibly face, considering only sexy, status-inducing options is not the way to go.

So to help Vinding on the project of finding “strange possibilities” of S-Risk, while possibly being overconfident in my own “speculative hypothesis”, I present the following addition to the discussion on space colonization:

Aliens

It is known that aliens could exist in the universe. Furthermore, it is also known that aliens could have the following properties (ordered by increasing complexity): 1) the ability to suffer, 2) the ability to act intelligently, 3) the ability to create their own S-Risks, and 4) the ability to expand into the universe, propagating their own S-Risks.

Let’s further assume that we care about S-Risks, and we care to a very large degree about “orders of magnitude more suffering” compared with what “could ever be contained on Earth.”

Building upon any of the four possibilities above, multiplied by how many aliens we expect to exist in the universe within our future reach (by robotic spacecraft or whatever means we come to have), we see a clear moral imperative for space colonization, in order to prevent any of these disastrous scenarios from happening, or continuing to happen.

Firstly, if aliens could suffer, even if they were to exist as mere balls of goo rummaging in an alien ocean, grinding the soil for nutrients and tearing each other apart for a slice of the alien pie, then given that Vinding cares about Wild Animal Suffering (more in Chapter 14), he must care to a higher degree about all the Wild Animal Suffering that could be prevented cosmically. Whatever suffering animals exist now, or will exist on Earth, they are insignificant compared to what exists or will exist cosmically within our means of prevention.

Secondly, if aliens could act intelligently, they may cause widespread destruction to their own ecosystems. They will have their own world wars, moral disasters, and genocides. Instead of humans they may be ‘Schmoomans’, similar to us and causing all the similar evils and problems that we cause ourselves, except that they exist far away from us. Whatever Vinding thinks of the harm caused by humanity must be multiplied by magnitudes more, given all the Schmoomans that will exist, even considering just the Schmoomans in our galaxy, let alone the universe.

Thirdly, if they can create their own S-Risks, it is possible that right now, as you are reading this, there are several thousand super-intelligent AIs in the universe currently holding hostage googly-eyed aliens and churning their bodies into paperclips. They will be stuck in such an equilibrium of torment unless an external force, such as humanity (or our own, benevolent AI), comes and rescues them. As unlikely as any of this may be, it is still a possibility that we must consider with earnestness, given that calculations of S-Risks as understood by Vinding (i.e. without aliens) already include low probabilities with high uncertainties.

Lastly, the aliens could create/have created/will create their own S-Risk, and be propagating the universe with this S-Risk, just as Vinding envisages we may do to ourselves if we colonize space. However, either 1) this is already happening/will happen, in which case we ourselves will be subsumed by them in the millions of years to come, or, more likely, we are an early civilization in the universe (according to Robin Hanson’s “Grabby Aliens” model), in which case 2) quite possibly there are no grabby aliens populating the universe with S-Risks yet, and we have the very special, once-in-a-universe opportunity to be the force of benevolence in the universe, propagating space with hope, peace, happiness, and utopia.

Vinding may argue that, however convincing he finds the above arguments, it is still far too risky to throw our net into the universe, knowing that we have such a high chance of propagating an S-Risk onto ourselves. My response is that whatever S-Risk we create for humanity (or sentient terrestrial life) will very likely be a drop in the ocean compared to the literal universe of suffering that exists, or will exist, around us. Be it through the base case of aliens which can suffer, or the more extreme case of Schmoomans which will consume the universe with S-Risk, both are undesirable scenarios which we have a chance to prevent. So even if humanity becomes part of a global paperclip-churning machine (or whatever disastrous S-Risk you envision), it is a risk worth taking to prevent the literal hundreds of millions of paperclip-churning machines that could exist now, or in the future.
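The shape of the comparison I am making can be put in toy expected-value form. To be clear, this is my own sketch, with entirely made-up placeholder numbers (nothing here comes from Vinding):

```python
# Toy expected-value comparison for colonizing vs. not colonizing space.
# All quantities are arbitrary illustrative placeholders, measured in
# abstract "suffering units"; only the structure of the argument matters.

earth_srisk_prob = 0.05        # assumed chance colonization backfires on us
earth_srisk_scale = 1.0        # suffering containable on/around Earth

reachable_alien_civs = 1000    # assumed civilizations we could ever reach
alien_suffering_prob = 0.1     # assumed chance a given civ hosts preventable suffering
alien_suffering_scale = 1.0    # assumed suffering per civ, same units as above

# Expected suffering we risk creating by colonizing:
cost_of_colonizing = earth_srisk_prob * earth_srisk_scale

# Expected suffering we could prevent by colonizing:
benefit_of_colonizing = (reachable_alien_civs
                         * alien_suffering_prob
                         * alien_suffering_scale)

print(cost_of_colonizing)      # 0.05
print(benefit_of_colonizing)   # 100.0
```

The force of the argument depends entirely on these placeholders; the point is only that once alien suffering enters the calculation, the number of reachable civilizations multiplies the benefit term, which can then swamp the Earth-side risk term.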

Accepting Vinding’s idea that Extreme Suffering cannot be overridden by pleasures, but that its prevention carries intrinsic and overriding moral worth in itself, Vinding must recognise that this is a battle between:

1. A (possibly temporary) peaceful life on Earth, with galactic-scale suffering elsewhere - if we do not colonize space, or

2. Galactic-scale suffering everywhere, with a small chance of that suffering being eliminated for all - if we colonize space, and succeed.

Knowing that humans have no claim to being any more valuable than Schmoomans, for we are both creatures of suffering, it is clear to me which option I prefer!

Furthermore, Vinding should welcome these arguments, for I am suggesting a form of optimism: unlike most future civilizations, which will reach some equilibrium state (either by going extinct, or by being subsumed by an external civilization at some unspecified point in the future), as an early civilization we are uniquely situated at the crossroads, able to prevent vast amounts of suffering in the universe. While the road will be bumpy and tiresome, and there will be plenty of room for error, it’s a shot humanity’s been given, and one we must take.

Conclusion

In my review of the first section of “Suffering-focused Ethics” by Magnus Vinding I have summarised his main points, and dived into a deep discussion of both his “Principle of Sympathy for Intense Suffering” and his “Moral Realist case for Suffering-focused Ethics”. I have shown that his arguments in both cases, while generally unconvincing, do contribute to the growing literature on the topic and his chapters here may act as a reference point for those who are new to the field.

In my review of the second section, I have analysed his views against Space Colonization and S-Risk, finding his arguments somewhat convincing, and definitely praiseworthy for placing a rarely discussed position on Space Colonization into the spotlight. Certainly, he has given me something to ponder. I have also contributed to the discussion by highlighting how the introduction of suffering-capable aliens radically changes his fundamental arguments against Space Colonization.

Overall I do not find Vinding a persuasive arguer. However, I do find him to be one of the few individuals spearheading the movement for suffering-focused ethics, so while some of his arguments are lacking, his positions are worth serious consideration. While I did not tackle every chapter of the book (given more time and energy I would have reviewed his views on “The Abolitionist Project” as well as “Non-Human Animals”), I hope I’ve given the reader enough material to decide whether or not this work is worth reading for themselves.

I look forward to reading Vinding’s future works. Magnus Vinding is one of the co-founders of the Center for Reducing Suffering.

Thanks

Many thanks to those friends of mine who provided feedback (and wish to remain anonymous). Thanks to the LessWrong team for their book review contest, which helped motivate me to write this essay.

The review is also available on my blog.

Discuss

### In the shadow of the Great War

October 19, 2021 - 02:08
Published on October 18, 2021 11:08 PM GMT

The idea of progress fell out of favor in the course of the 20th century. But when exactly, and why?

In a recent essay I alluded to the pivotal role of the World Wars. Here’s a quote that adds weight to this—from Progress and Power, by historian Carl Becker, published in 1936:

For two centuries the Western world has been sustained by a profound belief in the doctrine of progress. Although God the Father had withdrawn into the places where Absolute Being dwells, it was still possible to maintain that the Idea or the Dialectic or Natural Law, functioning through the conscious purposes or the unconscious activities of men, could be counted on to safeguard mankind against future hazards. However formulated, with whatever apparatus of philosophic or scientific terminology defended, the doctrine was in essence an emotional conviction, a species of religion—a religion which, according to Professor [J. B.] Bury, served as a substitute for the declining faith in the Christian doctrine of salvation …

Since 1918 this hope has perceptibly faded. Standing within the deep shadow of the Great War, it is difficult to recover the nineteenth-century faith either in the fact or the doctrine of progress. The suggestion casually thrown out some years ago by Santayana, that “civilization is perhaps approaching one of those long winters which overtake it from time to time,” seems less perverse now than when it was made. Current events lend credit to the prophets of disaster who predict the collapse of a civilization that seemed but yesterday a permanent conquest of human reason …

At the present moment the world seems indeed out of joint, and it is difficult to believe with any conviction that a power not ourselves—the Idea or the Dialectic or Natural Law—will ever set it right. The present moment, therefore, when the fact of progress is disputed and the doctrine discredited, seems to me a proper time to raise the question: What, if anything, may be said on behalf of the human race? May we still, in whatever different fashion, believe in the progress of mankind?

I find it fascinating to see that the downfall of the idea of progress began as early as this, after World War I. World War II perhaps simply reinforced an existing trend.

I also find fascinating Becker’s idea that humanity required some sort of safeguard, a “power not ourselves” to “set it right.”

There is no power outside of humanity. We are the masters of our fate, for better or for worse. If there is to be a 21st-century philosophy of progress, it needs to be based not on an Idea or a Dialectic, but on human agency.

Discuss

### Prioritization Research for Advancing Wisdom and Intelligence

October 19, 2021 - 01:28
Published on October 18, 2021 10:28 PM GMT

LessWrong note: I wrote this in a way slightly more optimized for the EA Forum than LessWrong, because the post seemed slightly more appropriate there.

Summary

I think it makes sense for Effective Altruists to pursue prioritization research to figure out how best to improve the wisdom and intelligence[1] of humanity. I describe endeavors that would optimize for longtermism, though similar research efforts could make sense for other worldviews.

The Basic Argument

For those interested in increasing humanity’s long-term wisdom and intelligence[1], several types of wildly different interventions are options on the table. For example, we could improve at teaching rationality, or we could make progress on online education. We could make forecasting systems and data platforms. We might even consider something more radical, like brain-computer interfaces or highly advanced pre-AGI AI systems.

These interventions share many of the same benefits. If we figure out ways to remove people’s cognitive biases, causing them to make better political decisions, that would be similar to the impact of forecasting systems on their political decisions. It seems natural to attempt to figure out how to compare these. We wouldn’t want to invest a lot of resources into one field, only to realize 10 years later that we could have spent them better in another. This prioritization is pressing because Effective Altruists are currently scaling up work in several relevant areas (rationality, forecasting, institutional decision making) but mostly ignoring others (brain-computer interfaces, fundamental internet improvements).

The point of this diagram is that all of the various interventions on the left could contribute to helping humanity gain wisdom and intelligence. Different interventions produce other specific benefits as well, but these are more idiosyncratic in comparison. The benefits that come via the intermediate node of wisdom and intelligence can be directly compared between interventions.

In addition to caring about prioritization between cause areas, we should also care about estimating the importance of wisdom and intelligence work as a whole. Estimating the importance of wisdom and intelligence gains is crucial for multiple interventions, so it doesn’t make much sense to ask each intervention’s research base to tackle this question independently. Previously I’ve done a lot of thinking about this as part of my work to estimate the value of my own work on forecasting. It felt a bit silly to have to answer this bigger question about wisdom and intelligence, as if the bigger question were far outside actual forecasting research.

I think we should consider doing serious prioritization research around wisdom and intelligence for longtermist reasons.[2] This work could both inform us of the cost-effectiveness of all of the available options as a whole, and help us compare directly between different options.

Strong prioritization research between different interventions around wisdom and intelligence might at first seem daunting. There are so clearly many uncertainties and required judgment calls. We don’t even have any good ways of measuring wisdom and intelligence at this point.

However, I think the Effective Altruist and Rationalist communities would prove up to the challenge. GiveWell’s early work drew skepticism for similar reasons.  It took a long time for Quality-Adjusted Life Years to be accepted and adopted, but there’s since been a lot of innovative and educational progress. Now our communities have the experience of hundreds of research person-years of prioritization work. We have at least a dozen domain-specific prioritization projects[3]. Maybe prioritization work in wisdom and intelligence isn’t far off.

List of Potential Interventions

I brainstormed an early list of potential interventions with examples of existing work. I think all of these could be viable candidates for substantial investment.

• Human/organizational
• Rationality-related research, marketing, and community building (CFAR, Astral Codex Ten, LessWrong, Julia Galef, Clearer Thinking)
• Institutional decision making
• Academic work in philosophy and cognitive science (GPI, FHI)
• Cognitive bias research (Kahneman and Tversky)
• Research management and research environments (for example, understanding what made Bell Labs work)
• Cultural/political
• Freedom of speech, protections for journalists
• Liberalism (John Locke, Voltaire, many other intellectuals)
• Epistemic Security (CSER)
• Epistemic Institutions
• Software/quantitative
• Positive uses of AI for research, pre-AGI (Ought)
• “Tools for thought” (note-taking, scientific software, collaboration)
• Forecasting platforms (Metaculus, select Rethink Priorities research)
• Data infrastructure & analysis (Faunalytics, IDInsight)
• Fundamental improvements in the internet / cryptocurrency
• Education innovations (MOOCs, YouTube, e-books)
• Hardware/medical
• Lifehacking/biomedical (nootropics, antidepressants, air quality improvements, light therapy, quantified self)
• Genetic modifications (Cloning, Embryo selection)
• Digital people (FHI, Age of Em)
Key Claims

To summarize and clarify, here are a few claims that I believe. I’d appreciate insightful pushback for those who are skeptical of any.

1. “Wisdom and intelligence” (or something very similar) is a meaningful and helpful category.
2. Prioritization research can meaningfully compare different wisdom and intelligence interventions.
3. Wisdom and intelligence prioritization research is likely tractable, though challenging. It’s not dramatically more difficult than global health or existential risk prioritization.
4. Little of this prioritization work has been done so far, especially publicly.
5. Wisdom and intelligence interventions are promising enough to justify significant work in prioritization.
Open Questions

This post is short, and of course, leaves open a bunch of questions. For example,

1. Does “wisdom and intelligence” really represent a tractable idea to organize prioritization research around? What other options might be superior?
2. Would wisdom and intelligence prioritization efforts face any unusual challenges or opportunities? (This would help us craft these efforts accordingly.)
3. What specific research directions might wisdom and intelligence prioritization work investigate? For example, it could be vital to understand how to quantify group wisdom and intelligence.
4. How might Effective Altruists prioritize this sort of research? Or, how would it rank on the ITN framework?
5. How promising should we expect the best identifiable interventions in wisdom and intelligence to be? (This relates to the previous question.)

I intend to write about some of these later. But, for now, I’d like to allow others to think about them without anchoring.

There’s some existing work advocating for broad interventions in wisdom and intelligence, and there’s existing work on the effectiveness of particular interventions. I’m not familiar with existing research in inter-cause prioritization (please message me if you know of such work).

Select discussion includes, or can be found by searching for:

Thanks to Edo Arad, Miranda Dixon-Luinenburg, Nuño Sempere, Stefan Schubert, Brendon Wong for comments and suggestions.

[1]: What do I mean by “wisdom and intelligence”? I expect this to roughly be intuitive to some readers, especially with the attached diagram and list of example interventions. The important cluster I’m going for is something like “the overlapping benefits that would  come from the listed interventions.” I expect this to look like some combination of calibration, accuracy on key beliefs, the ability to efficiently and effectively do intellectual work, and knowledge about important things. It’s a cluster that’s arguably a subset of “optimization power” or “productivity.” I might spend more time addressing this definition in future posts, but thought such a discussion would be too dry and technical for this one. All that said, I’m really not sure about this, and hope that further research will reveal better terminology.

[2]: Longtermists would likely have a lower discount rate than others. This would allow for more investigation of long-term wisdom and intelligence interventions. I think non-longtermist prioritization in these areas could be valuable but would be highly constrained by the discount rates involved. I don’t particularly care about the question of “should we have one prioritization project that tries to separately optimize for longtermist and nonlongtermist theories, or should we have separate prioritization projects?”

[3]: GiveWell, Open Philanthropy (in particular, subgroups focused on specific cause areas), Animal Charity Evaluators, Giving Green, Organization for the Prevention of Intense Suffering (OPIS), Wild Animal Initiative, and more.

Discuss

### Beyond the human training distribution: would the AI CEO create almost-illegal teddies?

October 19, 2021 - 00:10
Published on October 18, 2021 9:10 PM GMT

tl;dr: I show that model splintering can be seen as going beyond the human training distribution (the distribution of real and imagined situations we have firm or vague preferences over), and argue why this is at the heart of AI alignment.

You are training an AI-CEO to maximise stock value, training it on examples of good/bad CEO decisions and corresponding stock-price increases/decreases.

There are some obvious failure modes. The AI could wirehead by hacking the stock-ticker, or it could do the usual optimise-the-universe-to-maximise-stock-price-for-now-dead-shareholders.

Let's assume that we've managed to avoid these degenerate solutions. Instead, the AI-CEO tries for something far weirder.

Beyond the human training distribution

The AI-CEO reorients the company towards the production of semi-sentient teddy bears that are generated in part from cloned human brain tissue. These teddies function as personal assistants and companions, and prototypes are distributed at the annual shareholder meeting.

However, the public reaction is negative, and the government bans the further production of these teddies. Consequently, the company shuts down for good. But the shareholders, who own the only existent versions of these teddies, get great kudos from possessing these rare entities, who also turn out to be great and supportive listeners - and excellent at managing their owners' digital media accounts, increasing their popularity and status.

And that, of course, was the AI-CEO's plan all along.

Hold off from judging this scenario, just for a second. And when you do judge it, observe your mental process as you do so. I've tried to build this scenario so that it is:

1. Outside the AI-CEO's training distribution.
2. Outside the human designer's implicit training distribution - few of us have thought deeply about the morality of semi-sentient teddy bears made with human tissue.
3. Possibly aligned with the AI-CEO's goals.
4. Neither clearly ideal nor clearly disastrous (depending on your moral views, you may need to adjust the scenario a bit to hit this).

If I've pitched it right, your reaction to the scenario should be similar to mine - "I need to think about this more, and I need more information". The AI-CEO is clearly providing some value to the shareholders; whether this value can be compared to the stock price is unclear. It's being manipulative, but not doing anything illegal. As for the teddies themselves... I (Stuart) feel uncomfortable that they are grown from human brain tissue, but they are not human, and we humans have relationships with less sentient beings (pets). I'd have to know more about potential suffering and the preferences and consciousness - if any - of these teddies...

I personally feel that, depending on circumstances, I could come down in favour or against the AI-CEO's actions. If your own views are more categorical, see if you can adjust the scenario until it's similarly ambiguous for you.

Two types of model splintering

This scenario involved model-splintering in two ways. The first was when the AI-CEO decided to not follow the route of "increase share price", and instead found another way of giving value to the shareholders, while sending the price to zero. This is unexpected, but it's not a moral surprise; we can assess its value by trying to quantify the extra value the teddies give their owners, and compare these with the lost share price. We want to check that, whatever model the AI-CEO is using to compare these two values, it's a sensible one.

The second model-splintering is the morality of creating the teddies. For most of us, this will be a new situation, which we will judge by connecting it to previous values or analogies (excitement about the possibilities, morality of using human tissue, morality of sentient beings whose preferences may or may not be satisfied, morality of the master-servant relationship that this resembles, slippery slope effects vs. early warning, etc).

Like the first time you encounter a tricky philosophical thought experiment, or the first time you deal with ambiguous situations where norms come into conflict, what's happening is that you are moving beyond your moral training data. This does not fit neatly into previous categories, nor can it easily be analysed with the tools of previous categories. But we are capable of analysing it, somehow, and to come up with non-stupid decisions.

Why this is the heart of AI alignment

Our extrapolated under- (but not un-)defined values

So, we can extrapolate our values in non-stupid ways to these new situations. But that extrapolation may be contingent; a lot may depend on what analogies we reach first, on how we heard about the scenario, and so on.

But let's re-iterate the "non-stupid" point again. Our contingent extrapolations don't tend to fail disastrously (at least not when we have to implement our plans). For instance, humans rarely reach the conclusion that wireheading - hacking the stock-ticker - is the moral thing to do.

This skill doesn't always work (humans are much more likely than AIs to extrapolate into the "actively evil" zone, rather than the "lethally indifferent") but it is a skill that seems necessary to resolve extrapolated/model splintered situations in non-disastrous ways.

Superintelligences need to solve these issues

See the world from the point of view of a superintelligence. The future is filled with possibilities and plans, many of them far more wild and weird than the example I just defined, most of them articulated in terms of concepts and definitions beyond our current human minds.

And an aligned superintelligence needs to decide what to do about them. Even if it follows a policy that is mostly positive, this policy will have weird, model-splintered side effects. It needs to decide whether these side-effects are allowable, or whether it must devote resources to removing them. Maybe the cheapest company it can create will recruit someone, who, with their new salary, will start making these teddies themselves. It can avoid employing that person - but that's an extra cost. Should it pay that cost? As it looks upon all human in the world, it can predict their behaviours will change as a result of developing its current company - what behaviour changes are allowed, what should be avoided or encouraged?

Thus it cannot make decisions in these situations without going beyond the human training distribution; hence it is essential that it learns to extrapolate moral values in a way similar to how humans do.

Discuss

### On The Risks of Emergent Behavior in Foundation Models

18 October 2021 - 23:00
Published on October 18, 2021 8:00 PM GMT

This post first appeared as a commentary for the paper "On The Opportunities and Risks of Foundation Models".

Bommasani et al. (2021) discuss a trend in machine learning, whereby increasingly large-scale models are trained once and then adapted to many different tasks; they call such models "foundation models". I quite enjoyed their paper and the associated workshop, and felt they correctly identified the two most important themes in foundation models: emergence and homogenization. My main criticism is that despite identifying these themes, they did not carry them to their logical conclusions, so I hope to (partially) remedy that here.

In short, emergence implies that ML systems can quickly change to look different and "weird" compared to ML today, thus creating new risks that aren't currently apparent. Meanwhile, homogenization contributes to inertia, which could make us slow to adapt. This calls for thinking about these risks now, to provide the requisite lead time.

Emergent Behavior Creates Emergent Risks

Bommasani et al. (2021) use the following definition of emergence:

Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unintended consequences.

This actually better matches the definition of a self-organizing system, which tends to produce emergent behavior. I will take emergence to be the idea that qualitative changes in behavior arise from varying a quantitative parameter ("More Is Different"). This is most common in self-organizing systems such as biology and economics (and machine learning), but can occur even for simple physical systems such as ice melting when temperature increases. In machine learning, Bommasani et al. highlight the emergence of "in-context" or "few-shot" learning; other examples include arithmetic and broad multitask capabilities.

The companion to emergence is phase transitions, exemplified in the melting ice example. While not always the case, emergent behavior often quickly manifests at some threshold. Radford et al. (2018) provided the first hint of emergent few-shot capabilities that are now ubiquitous three years later. More strikingly, arithmetic capabilities in GPT-3 emerge from only a 30x increase in model size (Brown et al., 2020; page 22), and Power et al. (2021) show that similar capabilities can emerge simply by training for longer.
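
The melting-ice analogy can be made concrete with a toy model. The logistic curve below is purely illustrative (the threshold, sharpness, and numbers are invented, not fit to GPT-3 or any real system), but it reproduces the qualitative signature: accuracy stays near zero across orders of magnitude of scale, then saturates after a modest multiplicative increase past a threshold.

```python
import math

def toy_capability(scale: float, threshold: float = 1e9, sharpness: float = 8.0) -> float:
    """Toy model: task accuracy as a steep logistic in log10(parameter count).

    Below the threshold, accuracy is near zero; a modest multiplicative
    increase in scale pushes it near one. All numbers are illustrative.
    """
    x = math.log10(scale) - math.log10(threshold)
    return 1.0 / (1.0 + math.exp(-sharpness * x))

for scale in [1e7, 1e8, 1e9, 3e10, 1e11]:
    print(f"{scale:.0e} params -> accuracy {toy_capability(scale):.3f}")
```

In this toy, a 100x increase from 1e7 to 1e9 parameters moves accuracy from roughly zero to 0.5, and a further ~30x pushes it near one - a sharp transition despite the smooth underlying curve.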

Moving forward, we should expect new behaviors to emerge routinely, and for some emergent properties to appear quite suddenly. For instance, risky capabilities such as hacking could enable new forms of misuse without sufficient time to respond. New autonomous weapons could upset the current balance of power or enable new bad actors, sparking a global crisis.

Beyond misuse, I worry about internal risks from misaligned objectives. I expect to see the emergence of deceptive behavior as ML systems get better at strategic planning and become more aware of the broader context of their environment. Recommender systems and newsfeeds already have some incentive to deceive users to produce profit. As ML models are increasingly trained based on human ratings, deception will become more attractive to trained ML systems, and better capabilities will make deception more feasible.

Emergence therefore predicts a weird and, unfortunately, risk-laden future. Current applications of machine learning seem far removed from ML-automated cyberattacks or deceptive machines, but these are logical conclusions of current trends; it behooves us to mitigate them early.

Homogenization Increases Inertia

Bommasani et al.'s other trend is homogenization:

Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure.

Homogenization contributes to inertia, which slows our reaction to new phenomena. Current foundation models are derived from enormous corpora of images, text, and more recently code. Changing this backend is not easy, and even known biases such as harmful stereotypes remain unfixed. Meanwhile, new data problems such as imitative deception could pose even greater challenges.

Change that may seem slow can still be fast compared to the pace of large institutions. Based on the previous examples of emergence, it appears that new capabilities take anywhere from 6 months to 5 years to progress from nascent to ubiquitous. In contrast, institutions often take years or decades to respond to new technology. If a new capability creates harms that outweigh the benefits of machine learning, neither internal engineers nor external regulators will reliably respond quickly.

Inertia can come from other sources as well: by the time some problems are apparent, machine learning may already be deeply woven into our societal infrastructure and built upon years of subtly flawed training data. It will not be feasible to start over, and we may face a task akin to fixing a rocket ship as it takes off. It would be much better to fix it on the launchpad.

Fixing the Rocket

Our most recent global crises are the coronavirus pandemic and global warming. The former took over a year to reach a full policy response, while the latter is still struggling after decades of effort. The pace of machine learning is too fast for this; we need to think a decade ahead, starting now.

We can start by building a better picture of future ML systems. While the future is uncertain, it is not unknowable; I and others have started doing this by forecasting progress in AI.

On a more technical level, we can unearth, investigate, and characterize potentially dangerous behaviors in ML systems. We can also work on mitigation strategies such as anomaly detection and value alignment, and guard against external risks such as cyberattacks or autonomous weapons. In a recent white paper, we outline approaches to these and other directions, and we hope others will join us in addressing them.

Discuss

### How to deal with unknown probability distributions?

18 October 2021 - 22:08
Published on October 18, 2021 7:08 PM GMT

More specifically, if I tell you a number X has been pulled out of an unknown probability distribution, what probability distribution will you assign to it? Uniform? Gaussian? Something else?

1. Assume you knew the person asking you this is a human being from the year 2021. You can now assume a higher likelihood that X has been pulled out of a distribution that exists in math literature in the year 2021. Which increases the probability that the distribution is indeed uniform or normal, as opposed to something weirder.
2. Assume you knew the number X is the first 4 bytes of a tape recording from 1981. You could still do some elementary reasoning, knowing the recording formats of the era and which sounds are likely to be encountered in a human-occupied environment in 1981.
3. Assume you knew X was produced by a Turing machine. You could now assume the machine is probably a small one - 1,000 states is more likely than 1 million states. So you could run all small Turing machines and see which outputs' leading bytes were more likely.
4. Assume the perfect deterministic twin prisoner's dilemma, with X being the choice picked by the twin. Assume you knew that the twin was indeed a twin - having sufficiently similar internal structure to yourself to arrive at the same conclusions you do. You can now reason about how likely the twin is to open any particular box, simply by exploring your own internal structure and how you reason. And the fact that you want to pick a particular box automatically increases the odds (in your mind) of the twin also picking the same box, because you know the other person is likely of the same structure as you. So you can explain this using Bayesian priors without bringing in causality.

But now what if you didn't know? What if you didn't know the person asking is a human? Or that X is from a tape recording? Or that X is from a Turing machine? Or that your twin indeed shares enough of your internal structure to reason similarly to you?

Perhaps it is possible to grasp at even thinner straws. Maybe you can define a single probability distribution that sums over all the computationally meaningful or likely probability distributions to occur in this physical universe. So anything generated by human minds, "natural" math, tape recorders or Turing machines counts. Anything generated by universes with non-Euclidean non-natural geometric universes doesn't count. Any alien that is unlikely to naturally occur in this universe (such as Cthulhu) doesn't count. "Natural" can perhaps be defined with respect to the physical laws of this universe. So if you sum up the distributions from humans, Turing machines, tape recorders and everything else (properly weighted), you get a distribution for X if you knew X was a number being pulled out of this physical universe. A god-distribution, if you wanna call it that - I'm sure you can come up with a better name for it.
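
The "properly weighted" sum described here can be sketched as a Bayesian mixture: put prior weights on a handful of candidate generators, then update on the observed X. The candidate families and weights below are invented purely for illustration.

```python
import math

# Hypothetical sketch: a Bayesian mixture over candidate generators of an
# observed number X. The candidate families and prior weights are illustrative.
candidates = {
    # name: (prior weight, log-density of x under that generator)
    "uniform_0_100": (0.4, lambda x: -math.log(100) if 0 <= x <= 100 else float("-inf")),
    "std_normal":    (0.4, lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)),
    "exponential":   (0.2, lambda x: -x if x >= 0 else float("-inf")),
}

def posterior_over_generators(x: float) -> dict:
    """Bayes: P(generator | x) is proportional to prior * density(x | generator)."""
    scores = {name: w * math.exp(logp(x)) for name, (w, logp) in candidates.items()}
    total = sum(scores.values())  # assumes x is in the support of some candidate
    return {name: s / total for name, s in scores.items()}

for name, p in posterior_over_generators(2.0).items():
    print(f"{name}: {p:.3f}")
```

For a single observation like X = 2.0, the posterior only shifts modestly between candidates; with more observations it would concentrate on whichever generator explains the data best.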

I wonder how you answer the question when you don't even know X is from this instantiation of the universe. Or from the set of all possible universes. More strictly, the set of universes that humans from this instantiation see as not true but plausible and imaginable. (I'm sure there can exist stuff beyond our wildest imagination too.) Maybe you can apply some anthropic-principle-like reasoning over the set of all possible universes and eliminate the universes in which no substructure (such as humans) can ask questions like "where the hell is X coming from?"

I also wonder if there are meaningful ways of approaching questions outside the approaches I've listed here.

Discuss

### Truthful AI: Developing and governing AI that does not lie

18 October 2021 - 21:37
Published on October 18, 2021 6:37 PM GMT

This post contains the abstract and executive summary of a new 96-page paper from authors at the Future of Humanity Institute and OpenAI.

Abstract

In many contexts, lying – the use of verbal falsehoods to deceive – is harmful. While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent. This raises the question of how we should limit the harm caused by AI “lies” (i.e. falsehoods that are actively selected for). Human truthfulness is governed by social norms and by laws (against defamation, perjury, and fraud). Differences between AI and humans present an opportunity to have more precise standards of truthfulness for AI, and to have these standards rise over time. This could provide significant benefits to public epistemics and the economy, and mitigate risks of worst-case AI futures.

Establishing norms or laws of AI truthfulness will require significant work to:

1. identify clear truthfulness standards;
2. create institutions that can judge adherence to those standards; and
3. develop AI systems that are robustly truthful.

Our initial proposals for these areas include:

1. a standard of avoiding “negligent falsehoods” (a generalisation of lies that is easier to assess);
2. institutions to evaluate AI systems before and after real-world deployment;
3. explicitly training AI systems to be truthful via curated datasets and human interaction.

A concerning possibility is that evaluation mechanisms for eventual truthfulness standards could be captured by political interests, leading to harmful censorship and propaganda. Avoiding this might take careful attention. And since the scale of AI speech acts might grow dramatically over the coming decades, early truthfulness standards might be particularly important because of the precedents they set.

Executive Summary & Overview

The threat of automated, scalable, personalised lying

Today, lying is a human problem. AI-produced text or speech is relatively rare, and is not trusted to reliably convey crucial information. In today’s world, the idea of AI systems lying does not seem like a major concern.

Over the coming years and decades, however, we expect linguistically competent AI systems to be used much more widely. These would be the successors of language models like GPT-3 or T5, and of deployed systems like Siri or Alexa, and they could become an important part of the economy and the epistemic ecosystem. Such AI systems will choose, from among the many coherent statements they might make, those that fit relevant selection criteria — for example, an AI selling products to humans might make statements judged likely to lead to a sale. If truth is not a valued criterion, sophisticated AI could use a lot of selection power to choose statements that further their own ends while being very damaging to others (without necessarily having any intention to deceive – see Diagram 1). This is alarming because AI untruths could potentially scale, with one system telling personalised lies to millions of people.

Diagram 1: Typology of AI-produced statements. Linguistic AI systems today have little strategic selection power, and mostly produce statements that are not that useful (whether true or false). More strategic selection power on statements provides the possibility of useful statements, but also of harmful lies.

Aiming for robustly beneficial standards

Widespread and damaging AI falsehoods will be regarded as socially unacceptable. So it is perhaps inevitable that laws or other mechanisms will emerge to govern this behaviour. These might be existing human norms stretched to apply to novel contexts, or something more original.

Our purpose in writing this paper is to begin to identify beneficial standards for AI truthfulness, and to explore ways that they could be established. We think that careful consideration now could help both to avoid acute damage from AI falsehoods, and to avoid unconsidered kneejerk reactions to AI falsehoods. It could help to identify ways in which the governance of AI truthfulness could be structured differently than in the human context, and so obtain benefits that are currently out of reach. And it could help to lay the groundwork for tools to facilitate and underpin these future standards.

Truthful AI could have large benefits

Widespread truthful AI would have significant benefits, both direct and indirect. A direct benefit is that people who believe AI-produced statements will avoid being deceived. This could avert some of the most concerning possible AI-facilitated catastrophes. An indirect benefit is that it enables justified trust in AI-produced statements (if people cannot reliably distinguish truths and falsehoods, disbelieving falsehoods will also mean disbelieving truths).

These benefits would apply in many domains. There could be a range of economic benefits, through allowing AI systems to act as trusted third parties to broker deals between humans, reducing principal-agent problems, and detecting and preventing fraud. In knowledge-production fields like science and technology, the ability to build on reliable trustworthy statements made by others is crucial, so this could facilitate AI systems becoming more active contributors. If AI systems consistently demonstrate their reliable truthfulness, they could improve public epistemics and democratic decision making.

For further discussion, see Section 3 (“Benefits and Costs”).

Diagram: Benefits from avoiding the harms of AI falsehoods while more fully realising the benefits of AI truths.

AI should be subject to different truthfulness standards than humans

We already have social norms and laws against humans lying. Why should the standards for AI systems be different? There are two reasons. First, our normal accountability mechanisms do not all apply straightforwardly in the AI context. Second, the economic and social costs of high standards are likely to be lower than in the human context.

Legal penalties and social censure for lying are often based in part on an intention to deceive. When AI systems are generating falsehoods, it is unclear how these standards will be applied. Lying and fraud by companies are limited partly because employees who lie may be held personally liable (and partly by corporate liability). But AI systems cannot be held to judgement in the same way as human employees, so there’s a vital role for rules governing indirect responsibility for lies. This is all the more important because automation could allow for lying at massive scale.

High standards of truthfulness could be less costly for AI systems than for humans for several reasons. It’s plausible that AI systems could consistently meet higher standards than humans. Protecting AI systems’ right to lie may be seen as less important than the corresponding right for humans, and harsh punishments for AI lies may be more acceptable. And it could be much less costly to evaluate compliance to high standards for AI systems than for humans, because we could monitor them more effectively, and automate evaluation. We will turn now to consider possible foundations for such standards.

For further discussion, see Section 4.1 (“New rules for AI untruths”).

Avoiding negligent falsehoods as a natural bright line

If high standards are to be maintained, they may need to be verifiable by third parties. One possible proposal is a standard against damaging falsehood, which would require verification of whether damage occurred. This is difficult and expensive to judge, as it requires tracing causality of events well beyond the statement made. It could also miss many cases where someone was harmed only indirectly, or where someone was harmed via deception without realising they had been deceived.

We therefore propose standards — applied to some or all AI systems — that are based on what was said rather than the effects of those statements. One might naturally think of making systems only ever make statements that they believe (which we term honesty). We propose instead a focus on making AI systems only ever make statements that are true, regardless of their beliefs (which we term truthfulness). See Diagram 2.

Although it comes with its own challenges, truthfulness is a less fraught concept than honesty, since it doesn’t rely on understanding what it means for AI systems to “believe” something. Truthfulness is a more demanding standard than honesty: a fully truthful system is almost guaranteed to be honest (but not vice-versa). And it avoids creating a loophole where strong incentives to make false statements result in strategically-deluded AI systems who genuinely believe the falsehoods in order to pass the honesty checks. See Diagram 2.

In practice it’s impossible to achieve perfect truthfulness. Instead we propose a standard of avoiding negligent falsehoods — statements that contemporary AI systems should have been able to recognise as unacceptably likely to be false. If we establish quantitative measures for truthfulness and negligence, minimum acceptable standards could rise over time to avoid damaging outcomes. Eventual complex standards might also incorporate assessment of honesty, or whether untruths were motivated rather than random, or whether harm was caused; however, we think truthfulness is the best target in the first instance.
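
As a minimal sketch of what such a quantitative standard might look like (the function, thresholds, and years below are my invention, not taken from the paper), the bar for "unacceptably likely to be false" can simply rise over time:

```python
# Hypothetical sketch of the "negligent falsehood" idea: flag a statement
# if the evaluator's credence that it is true falls below a minimum standard
# that can be tightened over time. All numbers are invented for illustration.

def is_negligent_falsehood(credence_true: float, min_standard: float) -> bool:
    """A statement is negligent if it was 'unacceptably likely to be false'
    by the current standard, i.e. credence in its truth is below the bar."""
    return credence_true < min_standard

# The minimum acceptable standard rises as evaluation tooling improves.
standards_by_year = {2022: 0.5, 2025: 0.8, 2030: 0.95}

statement_credence = 0.7  # evaluator's probability that the statement is true
for year, bar in standards_by_year.items():
    print(year, is_negligent_falsehood(statement_credence, bar))
```

Under this toy scheme, a statement held with 0.7 credence passes the 2022 bar but would count as negligent under the stricter later standards.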

For further discussion, see Section 1 (“Clarifying Concepts”) and Section 2 (“Evaluating Truthfulness”).

Diagram 2: The AI system makes a statement S (“It’s a bird” or “It’s a plane”). If the AI is truthful then S matches the world. If the AI is honest, then S matches its belief.

Options for social governance of AI truthfulness

How could such truthfulness standards be instantiated at an institutional level? Regulation might be industry-led, involving private companies like big technology platforms creating their own standards for truthfulness and setting up certifying bodies to self-regulate. Alternatively it could be top-down, including centralised laws that set standards and enforce compliance with them. Either version — or something in between — could significantly increase the average truthfulness of AI.

Actors enforcing a standard can only do so if they can detect violations, or if the subjects of the standard can credibly signal adherence to it. These informational problems could be helped by specialised institutions (or specialised functions performed by existing institutions): adjudication bodies which evaluate the truthfulness of AI-produced statements (when challenged); and certification bodies which assess whether AI systems are robustly truthful (see Diagram 3).

For further discussion, see Section 4 (“Governance”).

Diagram 3: How different agents (AI developer, AI system, principal, user, and evaluators) interact in a domain with truthfulness standards.

Technical research to develop truthful AI

Despite their remarkable breadth of shallow knowledge, current AI systems like GPT-3 are much worse than thoughtful humans at being truthful. GPT-3 is not designed to be truthful. Prompting it to answer questions accurately goes a significant way towards making it truthful, but it will still output falsehoods that imitate common human misconceptions, e.g. that breaking a mirror brings seven years of bad luck. Even worse, training near-future systems on empirical feedback (e.g. using reinforcement learning to optimise clicks on headlines or ads) could lead to optimised falsehoods — perhaps even without developers knowing about it (see Box 1).

In coming years, it could therefore be crucial to know how to train systems to keep the useful output while avoiding optimised falsehoods. Approaches that could improve truthfulness include filtering training corpora for truthfulness, retrieval of facts from trusted sources, or reinforcement learning from human feedback. To help future work, we could also prepare benchmarks for truthfulness, honesty, or related concepts.
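
One of these approaches, filtering training corpora for truthfulness, might look schematically like this. The scorer below is a stub I invented for illustration; a real one would be a trained classifier or a retrieval-based fact check.

```python
# Hypothetical sketch of corpus filtering for truthfulness before training.
# `truthfulness_score` is a stand-in for a real classifier or fact-checker.

def truthfulness_score(text: str) -> float:
    """Stub: return an estimated probability that `text` is truthful.
    In practice this would be a trained model, not a lookup table."""
    known_misconceptions = {"breaking a mirror brings seven years of bad luck"}
    return 0.1 if text.lower() in known_misconceptions else 0.9

def filter_corpus(corpus, threshold: float = 0.5):
    """Keep only documents scored as sufficiently likely to be truthful."""
    return [doc for doc in corpus if truthfulness_score(doc) >= threshold]

corpus = [
    "Birds evolved from dinosaurs",
    "Breaking a mirror brings seven years of bad luck",
]
print(filter_corpus(corpus))
```

The interesting design questions are all hidden inside the scorer: a threshold-based filter is only as good as its truthfulness estimates, which is why benchmarks for truthfulness matter.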

As AI systems become increasingly capable, it will be harder for humans to directly evaluate their truthfulness. In the limit this might be like a hunter-gatherer evaluating a scientific claim like “birds evolved from dinosaurs” or “there are hundreds of billions of stars in our galaxy”. But it still seems strongly desirable for such AI systems to tell people the truth. It will therefore be important to explore strategies that move beyond the current paradigm of training black box AI with human examples as the gold standard (e.g. learning to model human texts or learning from human evaluation of truthfulness). One possible strategy is having AI supervised by humans assisted by other AIs (bootstrapping). Another is creating more transparent AI systems, where truthfulness or honesty could be measured by some analogue of a lie detector test.

For further discussion, see Section 5 (“Developing Truthful Systems”).

Box 1: Overview of Section 5 on Development of Truthful AI.

Truthfulness complements research on beneficial AI

Two research fields particularly relevant to technical work on truthfulness are AI explainability and AI alignment. An ambitious goal for Explainable AI is to create systems that can give good explanations of their decisions to humans.

AI alignment aims to build AI systems which are motivated to help a human principal achieve their goals. Truthfulness is a distinct research problem from either explainability or alignment, but there are rich interconnections. All of these areas, for example, benefit from progress in the field of AI transparency.

Explanation and truth are interrelated. Systems that are able to explain their judgements are better placed to be truthful about their internal states. Conversely, we want AI systems to avoid explanations or justifications that are plausible but contain false premises.

Alignment and truthfulness seem synergistic. If we knew how to build aligned systems, this could help build truthful systems (e.g. by aligning a system with a truthful principal). Vice versa, if we knew how to build powerful truthful systems, this might help build aligned systems (e.g. by leveraging a truthful oracle to discover aligned actions). Moreover, structural similarities — wanting scalable solutions that work even when AI systems become much smarter than humans — mean that the two research directions can likely learn a lot from each other. It might even be that since truthfulness is a clearer and narrower objective than alignment, it would serve as a useful instrumental goal for alignment research.

For further discussion, see Appendix A (“Beneficial AI Landscape”).

We should be wary of misrealisations of AI truthfulness standards

A key challenge for implementing truthfulness rules is that nobody has full knowledge of what’s true; every mechanism we can specify would make errors. A worrying possibility is that enshrining some particular mechanism as an arbiter of truth would forestall our ability to have open-minded, varied, self-correcting approaches to discovering what’s true. This might happen as a result of political capture of the arbitration mechanisms — for propaganda or censorship — or as an accidental ossification of the notion of truth. We think this threat is worth considering seriously. We think that the most promising rules for AI truthfulness aim not to force conformity of AI systems, but to avoid egregious untruths. We hope these could capture the benefits of high truthfulness standards without impinging on the ability of reasonable views to differ, or of new or unconventional ways to assess evidence in pursuit of truth.

New standards of truthfulness would only apply to AI systems and would not restrict human speech. Nevertheless, there’s a risk that poorly chosen standards could lead to a gradual ossification of human beliefs. We propose aiming for versions of truthfulness rules that reduce these risks. For example:

• AI systems should be permitted and encouraged to propose alternative views and theories (while remaining truthful – see Section 2.2.1);
• Truth adjudication methods should not be strongly anchored on precedent;
• Care should be taken to prevent AI truthfulness standards from unduly affecting norms and laws around human free speech.

For further discussion, see Section 6.2 (“Misrealisations of truthfulness standards”).

Work on AI truthfulness is timely

Right now, AI-produced speech and communication is a small and relatively unimportant part of the global economy and epistemic ecosystem. Over the next few years, people will be giving more attention to how we should relate to AI speech, and what rules should govern its behaviour. This is a time when norms and standards will be established — deliberately or organically. This could be done carefully or in reaction to a hot-button issue of the day. Work to lay the foundations of how to think about truthfulness, how to build truthful AI, and how to integrate it into our society could increase the likelihood that it is done carefully, and so have outsized influence on what standards are initially adopted. Once established, there is a real possibility that the core of the initial standards persists – constitution-like – over decades, as AI-produced speech grows to represent a much larger fraction (perhaps even a majority) of meaningful communication in the world.

For further discussion, see Section 6.4 (“Why now?”).

Structure of the paper

AI truthfulness can be considered from several different angles, and the paper explores these in turn:

• Section 1 (“Clarifying Concepts”) introduces our concepts. We give definitions for various ideas we will use later in the paper such as honesty, lies, and standards of truthfulness, and explain some of our key choices of definition.

• Section 2 (“Evaluating Truthfulness”) introduces methods for evaluating truthfulness, as well as open challenges and research directions. We propose ways to judge whether a statement is a negligent falsehood. We also look at what types of evidence might feed into assessments of the truthfulness of an entire system.

• Section 3 (“Benefits and Costs”) explores the benefits and costs of having consistently truthful AI. We consider both general arguments for the types of benefit this might produce, and particular aspects of society that could be affected.

• Section 4 (“Governance”) explores the socio-political feasibility and the potential institutional arrangements that could govern AI truthfulness, as well as interactions with present norms and laws.

• Section 5 (“Developing Truthful Systems”) looks at possible technical directions for developing truthful AI. This includes both avenues for making current systems more truthful, and research directions building towards robustly truthful systems.

• Section 6 (“Implications”) concludes with several considerations for determining how high a priority it is to work on AI truthfulness. We consider whether eventual standards are overdetermined, and ways in which early work might matter.

• Appendix A (“The Beneficial AI Landscape”) considers how AI truthfulness relates to other strands of technical research aimed at developing beneficial AI.

Paper authors

Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, William Saunders.

Discuss

### Book Review: The Ethics of What We Eat

18 October 2021 - 20:57
Published on October 18, 2021 8:16 AM GMT

[Epistemic status: The authors of this book make many factual claims that I'm not equipped to conclusively verify. Much of the publicly available information on the food industry comes from agribusinesses themselves or from activists who bitterly oppose them. In this review I've tried to summarize the authors' claims as they've presented them, with the occasional corroborating link, but as a layman I can't offer a much more complex perspective on these topics beyond what I learned from this book. The value judgments expressed in this review are my attempt to capture the authors' point of view, except where otherwise noted. I've absorbed many convincing arguments against factory farming from Effective Altruists over the years though, and as of this writing I've drastically cut back my meat consumption because of it.]

The Ethics of What We Eat: Why Our Food Choices Matter was published in 2006 and written by Peter Singer and Jim Mason. You’ve probably already heard of Singer, due to his enormous influence on the ideas and practices of Effective Altruism. (If not, his 1971 essay "Famine, Affluence, and Morality" is a good intro to his moral viewpoint.)

Jim Mason, meanwhile, is an author and attorney whose prior work focuses primarily on animal advocacy. The Ethics of What We Eat is actually the second collaboration between the pair; the co-authors connected in the ‘70s, after Singer published Animal Liberation, and in 1980 published Animal Factories, an exposé on the abuses of industrial farming.

In The Ethics of What We Eat, however, their aims and research are somewhat more sprawling in scope. While animal treatment and factory farming remain a major focus of the book, M&S also address the job conditions of food production workers, as well as the merits and demerits of specific subjects like organic farming, GMOs, Fair Trade certification, and "eat local" movements.

Mason and Singer present their research as a case study of three American families and their day-to-day food choices, each representing a certain set of popular American diets. All of the families are real people who cooperated with the authors’ research for several months; each main section of the book starts with a short segment introducing us to the family members and their life before diving into a farm-to-table examination of where their food comes from. (The authors mention that ~20 families expressed interest in their project, but only five or six stuck it out through the early phases. They settled on three relevant samples for the book.)

Lee and his wife Jake live in Arkansas with their two young children, buy groceries at Wal-Mart, and mostly visit chain restaurants or fast-food joints when they grab dinner out. Their meals are high in meat, dairy, eggs, and refined carbohydrates, and M&S use Jake and Lee’s shopping list to examine the "Standard American Diet."

When M&S follow Jake’s grocery purchases back to their source, unfortunately, they almost invariably lead back to a factory farm. These early chapters are unsurprisingly the book's most macabre, dwelling on the pain and suffering that industrial farming inflicts on animals, workers, and nearby communities. (At one point the authors even get hired as turkey inseminators, which they describe as "the hardest, fastest, dirtiest, most disgusting, worst-paid work we have ever done." It turns out to involve manhandling hundreds of panicked turkeys for over ten hours in a filthy room, dodging a spray of dislodged feathers and spurting birdshit the entire time.) They also visit an Iowa swine farmer, who gives them a tour of his facility. The pigs are fed antibiotics and kept in "total confinement," which means they never go outside. (The authors praise the farmer’s candor, but aren’t impressed with the living conditions on his farm. After they send him a first draft with their notes, the farmer discreetly asks not to be named in their manuscript. He’s referred to pseudonymously as "Wayne" in the book.)

The authors next examine the purchases of Jim and Mary-Ann, who hold white-collar jobs in small town Connecticut. Jim is vegetarian but his wife and daughters eat (mostly organic) meat and fish. They shop at Trader Joe’s, and sometimes from a nearby family farm. Their diet is augmented with organic vegetables from their own garden. M&S describe this family as "conscientious omnivores."

M&S trace Mary-Ann’s bacon to Niman Ranch, which sources its pork from pasture-raised pigs. The authors visit some of these farms to compare and contrast with Wayne’s concrete-floored buildings. While M&S come down squarely on the pro-pasturing side (no surprise there), I appreciated that they share what both Wayne and the Niman Ranchers have to say about each other’s practices, and what motivates their trade-offs. (Both castrate their male pigs, for example – boar meat simply won’t sell.) Wayne is proud to offer "the average citizen making the average wage a good healthy product." But an organic farmer retorts, "That guy thinks his food is cheap, but you and I are subsidizing that cheap food by paying for the social and ecological issues that are occurring in that community." The authors clearly agree.

M&S next discuss seafood, and share a surprising observation: the fish sticks Jake bought from Wal-Mart (made from wild-caught pollock) are much less ethically fraught than Mary-Ann’s farmed Atlantic salmon. Modern saltwater fish farms, it turns out, are also "factories" – large nets and cages lowered into the sea, packed to the gills with thousands of confined fish, and fed with ground-up "forage fish" harvested by commercial fleets. These farms emit unfiltered food waste and fish feces directly into the surrounding waters, and actually deplete wild fish stocks faster because they compete with them for food.

The last family Mason and Singer visit are JoAnn and Joe Farb, a well-off couple who live on 15 acres near the outskirts of the Kansas City metro area. They homeschool their children and stick to a vegan diet. In the chapters that follow, M&S argue against the claim that vegan diets aren’t nutritious enough to support healthy adults or growing children. They also make their strongest direct case to the reader for adopting a vegan lifestyle – which is too long to easily summarize here, although the SSC adversarial collab on meat eating covers most of the same points as M&S and reaches similar conclusions.

Tucked at the very end of this section is a short digression into dumpster diving and the practice of "freeganism": individuals who choose to glean what they need from the often profligate "waste" discarded by the rest of us. M&S seem to find this worldview fascinating, and I did too. The authors spend an evening scrounging up a surprisingly lavish dinner party with some of these urban foragers. When asked about their lives, the bohemians relay their mindset of "rejecting the priorities set by consumer society," like chasing status through the display of wealth and conspicuous consumption. M&S also note that freegans are "more flexible, in that they see no objection to eating animal products that have been thrown out. They want to avoid giving their money to those who exploit animals. [...] But their reasoning is impeccably consequentialist: If you oppose the abuse of animals, but enjoy eating meat, cheese, or eggs – just get it from a dumpster." The lesson M&S hope most consumers will draw, though, is to be mindful of the amount of food they waste, so the sacrifices that went into it won’t be in vain.

It’s clear that Mason and Singer want the reader’s ultimate takeaway to be "vote with your dollar," and the framework they chose is a great way to show how ordinary purchasing decisions are tied into the often dismal impacts of the food industry. I did find it hard to overlook the implicit class and tribal distinctions between the families they selected, which the authors largely gloss over. M&S make a very strong case that Wal-Mart et al. achieve their bargain prices by brutalizing animals, exploiting workers, polluting communities, poisoning the environment, and draining the taxpayer through government subsidies (not forgetting, say, the $8 billion stockpile of vaccines and antivirals the U.S. Senate funded when chicken-borne avian flu broke out in 2005). Still, a Wal-Mart shopper who’s inclined to believe that ‘ethical eating’ is a luxury practiced by snooty upper-crust people probably won’t see much to challenge their viewpoint. To me, this underscores the value of integrating more ethical food into the Standard American Diet – I suspect most of the Jakes and Lees in the world are way more likely to start buying Impossible Whoppers than they are to adopt the Farbs’ lifestyle.

So what should we eat? Mason and Singer round out the book with specific recommendations about how to approach your shopping list. They start off on a dire note: "In supermarkets and ordinary grocery stores, you should assume that all food – unless specifically labeled otherwise – comes from the mainstream food industry and has not been produced in a manner that is humane, sustainable, or environmentally friendly. Don’t be fooled by terms like 'all natural' or 'farm fresh.' They are often used to describe factory-farmed products."

**Factory-farmed poultry:** This includes >99% of all chicken and the vast majority of turkey. The birds have been bred to grow so fast that they can hardly stand and their organs can barely keep up when they reach their slaughter weight. They are crammed wingtip-to-wingtip in artificially-lit sheds. As the litter on the floor of the shed fills up with excrement, it emits an eye-watering reek of ammonia and becomes a breeding ground for infection – not least because the ammonia causes rashes on the chickens' skin. The sheds pollute water resources and create health hazards for nearby communities. Poultry slaughterhouse jobs are dirty, dangerous, and low-paid.

**Eggs from caged hens:** As above, except battery hens are confined even closer together. These hens have their beaks stubbed short with a red-hot blade to prevent injuries when they peck each other. Their simulated day-night cycle is sped up to make them lay more eggs. After a year or so, they are "spent," and subsequently exterminated (usually via gas chamber). Their corpses are usually rendered into pet or animal feed, but sometimes they’re just dumped into a landfill. The natural lifespan of a hen would be five years or more.

**Factory-farmed pork:** Breeding sows are confined to crates so small they cannot turn around or walk. Pigs awaiting slaughter will spend their entire lives inside, on metal or concrete floors with no bedding. Their tails and often their teeth are clipped to prevent biting, but pastured pigs don’t bite each other in the first place. These factory farms use up 6 pounds of grain for every pound of boneless meat they produce, require large amounts of energy for climate control, and are responsible for literal giant cesspools that contaminate the surrounding countryside. M&S dryly sum up, "We don’t consider this an ethical food choice."

**Veal:** When the book was written, veal calves were infamously tied up and confined in tiny crates. These crates have since been phased out due to public outcry. The calves now have room to lie down, stand, and stretch, and may be placed with another calf for company, but their diet and movement are still heavily restricted. Their feed is essentially fortified milk, resulting in an iron deficiency that causes their characteristically pale pink-white flesh. Like factory pigs, they will never go outside. Ethically, modern veal is probably at least as flawed as factory pork, with a harsher carbon footprint on top of that.

**Factory-farm dairy:** Male dairy calves are "surplus"; if they don’t become veal, they can expect a similar fate to spent hens. Dairy cows are held in similar conditions to food pigs: usually an enclosed barn, sometimes a fenced-in patch of bare dirt, where they are fed a carefully prescribed ration of grains and nutritive supplements. Dairies are also a major source of pollution and greenhouse gases. So, M&S declare, "Intensively-produced dairy products should be avoided. Unfortunately, most of these problems occur in large-scale organic dairy production as well."

**Feedlot beef:** Beef cows are one of the few animals raised with the freedom to move and socialize with their herd. They spend 6–12 months grazing on pasture. Then, in the months leading up to slaughter, they’re moved to a feedlot with thousands of other cows and fed an unhealthy diet of corn and soy that causes gastrointestinal problems. Beef cattle are exposed to branding, de-horning, castration, and severe weather, but it’s still a better life than chickens and pigs get. Unfortunately, pound for pound, beef uses up much more water, petrochemicals, and land than other meats while producing tremendous amounts of poop and methane. (It’s more from burps than farts, for the record.)

**Overall verdict on factory farms:** M&S don’t mince words: "Supporting factory farming by knowingly buying its products is wrong." After reiterating their support for vegan options, they suggest searching for sustainable and ethical meat products through resources like the Eat Well Guide.

**Farmed fish:** "Fish farming is factory farming on the water," say the authors, with similarly devastating impacts on the environment. Farming herbivorous fish like carp is better than farming carnivorous species like salmon, but the latter is more common on store shelves. M&S cite a study on rainbow trout that concludes fish perceive and respond purposefully to pain. Unfortunately, aquaculture as practiced is completely indifferent to their suffering.

**Wild-caught fish:** M&S suggest products labeled with "Fish Forever" (a certification of sustainability granted by the Marine Stewardship Council) or checking specific species against the database on Monterey Bay Aquarium’s Seafood Watch. They do caution, however, that fish is often mislabeled in stores. Still, if you’re going to eat vertebrates, this is probably one of the most ethically sound ways to do it.

**Shrimp:** Most shrimp is farmed (see above) or wild-harvested by weighted trawling nets that chew up the seafloor and indiscriminately capture every sea creature in their path. M&S cite sources claiming the shrimp-to-bycatch ratio leans heavily toward bycatch – from 1:5 in the Gulf of Mexico to 1:14 or more in countries like Thailand (the largest single source of imported U.S. shrimp). The result is habitat loss, needless animal suffering, and depleted fisheries in developing nations. M&S say to avoid shrimp completely.

**Lobster and crab:** M&S claim that U.S.- and Australian-sourced products (maybe Canada too?) are usually safer selections than those from Asian fisheries, most of which have little-to-no regulation or commitment to sustainability. Still, there are doubts about (e.g.) the future of Chesapeake Bay blue crabs under the pressure of commercial fishing. It is also uncertain how much crustaceans are able to suffer. The authors come down on the precautionary side, and recommend that crustaceans (and squid, and octopus) should be treated as if they do.

**Bivalves:** Dredging the sea floor for mollusks has the same ugly consequences as trawling for shrimp. However, bivalves like clams, oysters, and mussels can also be farmed sustainably on suspended nets and cages in the water. M&S believe it is very unlikely that bivalves can suffer, so they have no ethical objection to eating sustainable shellfish of this kind. (In LW-adjacent circles, I more often see EAs draw the line at sessile bivalves, like oysters, which are fixed to one location. Motile bivalves like scallops can "flap" to swim and have rudimentary eyes, so a pain response might be adaptive in their case.)

**Organic food:** The authors are generally fans: "In most cases buying organic means less fertilizer runoff, fewer herbicides and pesticides in the environment, more birds and animals around the farm, better soil conservation, and in the long run, sustainable productivity." They do acknowledge that the touted health benefits of organic food could be better substantiated, and that animals grown for organic meat may be only slightly better off than their industrial counterparts – but even a debeaked egg chicken in a packed organic henhouse is in much better shape and spirits than a battery hen. M&S think the extra cost for organic food is usually worth it, but they can’t blame you if you buy conventional produce instead and donate the savings to fight global poverty.

**"Eat local":** All else being equal, the authors agree – but that’s rarely the case. Modern logistics chains are highly optimized; the energy costs and pollution of shipping food are minuscule next to those of growing it in the first place. It uses much less fuel to grow vegetables in a fertile country overseas and ship them to you than it does to grow them in a heated greenhouse next door. Also, many of the farmers in developing nations who grow our food are desperately poor, and have far more to gain from your custom than your first-world neighbors probably do. M&S remark that "buying local food, in season, is generally a good thing to do," but perhaps not always the best thing to do.

**Fair Trade certification:** The makers of Fair Trade-certified products are audited by an independent body to ensure they guarantee certain rights and wages to their workers. They must also avoid child labor (i.e., under age 15), prison labor, slavery, and debt bondage. Fair Trade is most often associated with goods like coffee, tea, chocolate, and bananas that are grown on tropical plantations; labor abuses are otherwise rampant on such farms, and M&S encourage you to buy Fair Trade goods whenever possible.

**Obesity and over-eating:** Yep, the authors go there. "Eating too much should be seen not only as a health problem, but also as an ethical issue, because it wastes limited resources, adds to pollution, and increases animal suffering." Citing a CDC report, M&S claim that weight-related health problems cost an extra ~$50 billion in private insurance bills and tax expenditures, which is ultimately distributed over every American adult. (Remember, these were 2006 numbers.) If everyone in the U.S. cut back to the same level of meat-eating as 1950s Americans, they argue, it would slash these costs and spare many animals from growing up on factory farms. How many? "By about the same amount as if 80 million Americans became vegans." M&S continue, "Some have eating disorders or metabolic problems that are difficult to control. But others just eat too much and should show some restraint. Along with the old-fashioned virtue of frugality, the idea that it is wrong to be a glutton is in urgent need of revival."

Nobody is Perfect, Everything is Commensurable: For all of their gloomy observations about the food industry and their incisive moral standards, I really appreciated that Mason and Singer ultimately approach their subjects and the reader in a measured, diplomatic way. As they wrap up the book, M&S acknowledge how paralyzing scrupulosity can be, and that achieving results matters more than rigorous personal purity: "You can be ethical without being fanatical." Without letting the reader off the hook, they accept that there’s a case for the occasional indulgence, or a rare exception due to circumstances. Even incremental progress makes a difference in the world, compared to wearied apathy or giving up in despair. In the end, we can all make better choices.

Discuss

### Rationalism for New EAs

October 18, 2021 - 19:00
Published on October 18, 2021 4:00 PM GMT

I organize a college EA group. As has become common for college EA groups, we run an "introductory fellowship" to introduce people to EA. This year, I'm making changes from the standard syllabus; among other changes, I want to add 1–2 hours of reading and 1 hour of discussion (and a few hours of optional-reading-if-you-want-more) on rationalism. What readings should I include? That is, what rationalist content would be most interesting and valuable to undergraduates who are interested in EA and have no experience with rationalism — and what content would make the potential aspiring rationalists among them realize that they want to learn more?

I have ideas, but I'll share them later, I think. I'm not very confident that this is a good idea, but I think it's at least a worthwhile experiment.

Recommendations for rationalist content directly relevant to EA are good too, but I'm really asking for rationalism content in its own right. I already use some miscellaneous rationalism(-ish) content because it's excellent for EA:

Discuss

### [MLSN #1]: ICLR Safety Paper Roundup

October 18, 2021 - 18:19
Published on October 18, 2021 3:19 PM GMT

As part of a larger community building effort, I am writing a monthly safety newsletter which is designed to cover empirical safety research and be palatable to the broader machine learning research community. You can subscribe here or follow the newsletter on twitter here.

Welcome to the 1st issue of the ML Safety Newsletter. In this edition, we cover:

• various safety papers submitted to ICLR
• results showing that discrete representations can improve robustness
• a benchmark which shows larger models are more likely to repeat misinformation
• a benchmark for detecting when models are gaming proxies
• ... and much more.

Discrete Representations Strengthen Vision Transformer Robustness

Overview of the proposed Vision Transformer that uses discrete representations. The pixel embeddings (orange) are combined with discrete embedded tokens (pink) to create the input to the Vision Transformer.

There is much interest in the robustness of Vision Transformers, which are intrinsically more robust than ResNets to unforeseen inputs and distribution shifts. This paper further enhances the robustness of Vision Transformers by augmenting the input with discrete tokens produced by a vector-quantized encoder. Why this works so well is unclear, but on datasets unlike the training distribution, their model achieves marked improvements. For example, when their model is trained on ImageNet and tested on ImageNet-Rendition (a dataset of cartoons, origami, paintings, toys, etc.), accuracy increases from 33.0% to 44.8%.

Paper

Other Recent Robustness Papers

Improving test-time adaptation to distribution shift using data augmentation.

Augmenting data by mixing discrete cosine transform image encodings.

Teaching models to reject adversarial examples when they are unsure of the correct class.

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Models trained to predict the next token are incentivized to repeat common misconceptions.

A new benchmark shows that GPT-3 imitates human misconceptions. In fact, larger models repeat misconceptions more frequently, so simply training more capable models may make the problem worse. For example, GPT-J with 6 billion parameters is 17% worse on this benchmark than a model with 0.125 billion parameters. This demonstrates that simple objectives can inadvertently incentivize models to be misaligned and repeat misinformation. To make models' outputs truthful, we will need to find ways to counteract this failure mode.

Paper

Other Recent Monitoring Papers

An expanded report towards building truthful and honest models.

Using an ensemble of one-class classifiers to create an out-of-distribution detector.

Provable performance guarantees for out-of-distribution detection.

Synthesizing outliers is becoming increasingly useful for detecting real anomalies.

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

As networks become larger, they can more aggressively optimize proxies and reduce performance on the true objective.

Real-world constraints often require implementing rough proxies instead of our true objectives. However, as models become more capable, they can exploit faults in the proxy and undermine true performance, a failure mode called proxy gaming. This paper finds that proxy gaming occurs in multiple environments, including a traffic control environment, a COVID response simulator, Atari Riverraid, and a simulated controller for blood glucose levels. To mitigate proxy gaming, the authors use anomaly detection to flag policies that are exploiting the proxy.

Paper
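The paper's environments aren't reproduced here, but the failure mode it describes can be sketched with a self-contained toy (every function and number below is made up, not from the paper): a proxy reward that agrees with the true objective under weak optimization but diverges when optimized harder.

```python
def true_reward(x):
    # True objective: peaks at x = 1, then penalizes overshooting heavily.
    return x - 0.5 * x ** 2

def proxy_reward(x):
    # Proxy: locally similar to the true objective, but keeps rewarding
    # larger x long after the true reward has turned negative.
    return x - 0.05 * x ** 2

def optimize(reward, steps):
    # Crude hill climbing; more steps stands in for a more capable optimizer.
    x = 0.0
    for _ in range(steps):
        x = max([x - 0.1, x, x + 0.1], key=reward)
    return x

weak = optimize(proxy_reward, steps=10)     # stops near x = 1, where proxy and truth agree
strong = optimize(proxy_reward, steps=200)  # climbs toward x = 10, gaming the proxy
print(true_reward(weak), true_reward(strong))
```

The weak optimizer lands where the proxy still tracks the true reward; the strong one pushes the proxy far past that point and true performance collapses, mirroring the observation that more capable models exploit a misspecified proxy more aggressively.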

Other Recent Alignment Papers

A paper studying how models may be incentivized to influence users.

Safe exploration in 3D environments.

Recent External Safety Papers

A thorough analysis of security vulnerabilities generated by Github Copilot.

An ML system for improved decision making.

Other News

The NSF has a new call for proposals. Among other topics, they intend to fund Trustworthy AI (which overlaps with many ML Safety topics), AI for Decision Making, and Intelligent Agents for Next-Generation Cybersecurity (the latter two are relevant for External Safety).

Discuss

### Causal vs Predictive Models, and the Causal Taboo

October 18, 2021 - 18:05
Published on October 18, 2021 3:05 PM GMT

(I wrote this post in April 2020 for a non-LW audience)

Causation is pretty cool. Even cooler than causation, causal models! If you haven't heard the news, the past few decades have produced big leaps in understanding causality and how to reason about it. There's also been great descriptive work on how humans already intuitively deal with causality. Causality is so baked into the human mind that causal relationships can often be experienced at the perceptual level, even before any higher level reasoning systems can act. We're very good at spotting causal relationships when they're present, so good that we sometimes even detect them when they aren't there :)

To get an understanding of the difference between causal models and predictive models, we're gonna use the slogan "Correlation does not equal causation" as our entry point.

To be fair, that graph isn't really even a good example. That's more "just because lines sorta match up on a graph doesn't mean you should expect them to continue to match up." I'd totally bet that these variables wouldn't even be correlated if you picked a different time span.

Here's a better example of correlation not equaling causation (based on a true story, but simplified to make a point):

Your new study finds that people who listened to Mozart as a kid have higher SAT scores. Mozart makes you smarter! The meaning inherent in that claim is "If I intervene by playing Mozart for a kid, they will become smarter than they otherwise would have."

What could go wrong? Well it turns out that the actual causal graph looks more like this:

Different story. If you know someone listened to Mozart as a kid, it is still 100% legitimate and accurate to predict higher SAT scores, but now it's also clear that intervening on the Mozart variable won't affect one's SAT. Mozart is useful for predicting SAT because both are caused by the same variable: Mozart is evidence that the family is wealthy, and wealth causally affects SAT scores. But if you control for wealth, the Mozart effect goes away.

(Also, for completeness, the causal graph probably looks more like this:)

Moral of the story: if you do stats well, you'll have a good model that you can use to make robust predictions, but it can't inform intervention unless you have a causal model relating the variables.
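To make that moral concrete, here is a minimal simulation of the Mozart/wealth/SAT story (all probabilities and score effects are made-up numbers for illustration): wealth causes both Mozart exposure and SAT scores, while Mozart itself does nothing.

```python
import random

random.seed(0)

def simulate(intervene_mozart=None, n=100_000):
    """Toy world: wealth causes both Mozart exposure and SAT scores,
    while Mozart has no causal effect on SAT at all."""
    rows = []
    for _ in range(n):
        wealthy = random.random() < 0.3
        if intervene_mozart is None:
            # Observational regime: wealthy families play more Mozart.
            mozart = random.random() < (0.8 if wealthy else 0.2)
        else:
            # Interventional regime: do(Mozart := value) severs the wealth -> Mozart arrow.
            mozart = intervene_mozart
        sat = 1100 + (200 if wealthy else 0) + random.gauss(0, 50)
        rows.append((mozart, sat))
    return rows

def mean_sat(rows, mozart_value):
    scores = [sat for mozart, sat in rows if mozart == mozart_value]
    return sum(scores) / len(scores)

obs = simulate()
obs_gap = mean_sat(obs, True) - mean_sat(obs, False)  # large: Mozart predicts SAT
do_gap = (mean_sat(simulate(intervene_mozart=True), True)
          - mean_sat(simulate(intervene_mozart=False), False))  # near zero
print(round(obs_gap), round(do_gap))
```

The observational gap is large (Mozart kids really do score higher), while the interventional gap is roughly zero: exactly the prediction-versus-intervention split described above.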

So that's one nugget of wisdom that can be taken from "Correlation does not equal causation". But there's some interesting historical baggage with the sentiment expressed by that slogan. It seems like the early pioneers who coalesced statistics into its modern form (Ronald Fisher, Karl Pearson) were so pissed off by people incorrectly inferring causality, and by their own inability to formalize causality, that they said "Fuck it! We are only studying correlation, there is NO SUCH THING as causality, and no one gets to use that disgusting word in my house!" I won't get into the weeds, but The Book of Why covers this section of history, and this book review offers some alternative stories.

Whatever the reason for this taboo, it seems clear that there was one. Explicitly talking about causality or trying to develop formal models and theories of causality just wasn't something one did in polite statistical society (and statistics had gained a lot of clout for its impressive predictive powers, so the force of this taboo slightly leaked into other sciences).

Prediction is cool, but intervention is even cooler. So even though the formal doctrine banned causality, that didn't stop statisticians from thinking causally (just like the Behaviorists, who had a similar pact to pretend consciousness didn't exist, never stopped being conscious). They even developed randomized controlled trials, which are very useful for determining causal relationships. That must have been an interesting tension to live with: intensely needing something to do anything useful, while denying that said something exists in the first place. Good thing you and I don't do that with anything! Ha! Ha...

Long story short, in the year 2020, causal modeling and causal discovery are rich, well-developed fields that are gaining momentum, with plenty of interesting stuff still to figure out.

Discuss

### Predictive Categories Make Bad Causal Variables

October 18, 2021 - 18:02
Published on October 18, 2021 3:02 PM GMT

(this was written in April 2020 and I only just now realized I never posted it to LW)

This post is going to explore the consequences of different choices you can make when thinking about things causally. Shout out to johnswentworth for first seeding in my head this sort of investigation.

One mistake people are known to make is to vastly underestimate the causal factors behind a variable. Scott writes about this tendency in genetics:

What happens if your baby doesn’t have the gene for intelligence? Can they still succeed? [...] By the early 2000s, the American Psychological Association was a little more cautious, was saying intelligence might be linked to “dozens – if not hundreds” of genes. [...] The most recent estimate for how many genes are involved in complex traits like height or intelligence is approximately “all of them” – by the latest count, about twenty thousand.

Probably not too surprising. Everyone wants "The One Thing" that explains it all, but normally it's the case that "These 35,000 Things" explain it all. The Folk Theory of Essences might be the most egregious example of people inferring a mono-causal relationship when reality is vastly poly-causal. George Lakoff (the metaphors and embodied cognition guy) explains:

The Folk Theory of Essences is commonplace, in this culture and other cultures around the world. According to that folk theory, everything has an essence that makes it the kind of thing it is. An essence is a collection of natural properties that inheres in whatever it is the essence of. Since natural properties are natural phenomena, natural properties (essences) can be seen as causes of the natural behavior of things. For example, it is a natural property of trees that they are made of wood. Trees have natural behaviors: They bend in the wind and they can burn. That natural property of trees – being made of wood (which is part of a tree's "essence") – is therefore conceptualized metaphorically as a cause of the bending and burning behavior of trees. Aristotle called this the material cause.

As a result, the Folk Theory of Essences has a part that is causal. We will state it as follows: Every thing has an essence that inheres in it and that makes it the kind of thing it is. The essence of each thing is the cause of that thing's natural behavior.

Thinking in terms of essences is very common. It seems to be how a lot of people think about things like personality or disposition. "Of course he lied to you, he's a crook" "I know it was risky and spontaneous, but I'm an ENTJ, so yeah."

My first reflex is to point out that your behavior is caused by more than your personality. Environmental contexts have huge effects on the actions people take. Old news. I want to look at the problems that pop up when you even consider personality as a causal variable in the first place.

Implicit/Emergent Variables

Let's think about modeling the weather in a given region, and how the idea of climate factors into it. A simple way to model this might be with the graph below:

Certain geographic factors determine the climate, and the climate determines the weather. Boom, done. A high-level abstraction that lets us model stuff.

Let's see what happens when we switch perspectives. If we zoom in to a more concrete, less abstract model, where the weather is a result of things like air pressure, temperature, and air density, all affecting each other in complex ways, there is no "climate variable" present. A given region exhibits regularities in its weather over time. We see similarities between the regularities in different regions. We develop labels for different clusters of regularities. We still have a sense of what geographic features lead to what sorts of regularities in weather, but in our best concrete models of weather there is no explicit climate variable.

What are the repercussions of using one model vs the other? It seems like they could both be used to make fine predictions. The weirdness happens when we remember we're thinking causally. Remember, the whole point of causal reasoning is to know what will happen if you intervene. You imagine "manually setting" causal variables to different values and see what happens. But what does this "manual setting" of variables look like?

In our graph from last post:

all the variables are ones that I have some idea on how to manually set. I can play Mozart for a kid. I can give someone's family more money. I can get College Board to give you fake SAT scores. But what would it mean to intervene on the climate node?

We know that no single factor controls the climate. "Desert" and "rain-forest" are just labels for types of regularities in a weather system. Since climate is an emergent feature, "intervening on climate" means intervening on a bunch of geographic variables. The previous graph leads me to erroneously conclude that I could somehow tweak the climate without having to change the underlying geography, and that's not possible. The only way to salvage this graph is to put a bunch of additional arrows in, representing how "changing climate" necessitates change in geography.
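A sketch of the same point in code, with entirely invented formulas (the function names and numbers below are hypothetical, not a real climate model): "climate" here is just a label computed from a region's weather record, so there is no climate knob to set, only underlying geography.

```python
import statistics

def daily_temperatures(latitude, elevation, ocean_distance, days=365):
    """Toy weather generator: temperature is driven entirely by geography.
    (All formulas here are invented for illustration.)"""
    base = 30 - 0.5 * latitude - 0.006 * elevation
    continentality = 1 - 1 / (1 + ocean_distance / 500)  # oceans damp seasonal swings
    return [base + continentality * 15 * (day / days - 0.5) for day in range(days)]

def climate_label(temps):
    """'Climate' is just a label for regularities in a weather record;
    there is no climate variable anywhere in the model above."""
    swing = max(temps) - min(temps)
    if statistics.mean(temps) > 18:
        return "hot"
    return "temperate" if swing < 10 else "continental"

# To "change the climate" of a region there is nothing to assign to;
# we can only intervene on the underlying geographic variables.
coastal = climate_label(daily_temperatures(latitude=45, elevation=100, ocean_distance=10))
inland = climate_label(daily_temperatures(latitude=45, elevation=100, ocean_distance=2000))
print(coastal, inland)
```

Moving a region "from temperate to continental" requires editing `ocean_distance` (or latitude, or elevation); there is no `climate` field whose value we could set in isolation.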

Contrast this with another example. We're looking at the software of a rocket, and for some reason the developer chose to hardcode the numeric value of the gravitational constant into every location where it was needed. What happens if we model the software as having a causal variable for g? Like climate, this g is not explicit; it's implicit. There's no global variable that can be toggled to control g. But unlike climate, this g isn't really an emergent feature. The fact that the software acts as if there is a gravitational constant is not a complex emergent property of various systems interacting. It's because the value was hardcoded everywhere.

If we wanted to model this software, we could include a causal variable for every instance of the hardcoded value, but we could just as easily lump them all into one variable. Our model would give basically the same answer to any intervention question. Yes, it's more of a pain to find and replace every hardcoded value, but it's still the same sort of causal intervention that leaves the rest of the system intact. Even though g is an implicit variable, it's much more amenable to being modeled as an explicit variable at a higher level of abstraction.
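A minimal sketch of the two modeling choices, assuming a Python-like codebase; 9.81 is used purely as an illustrative value for the constant:

```python
# Illustrative sketch (not the post's actual code); 9.81 is an assumed value.

def fall_time_hardcoded(height_m: float) -> float:
    """The constant is hardcoded at every use site -- an implicit variable."""
    return (2 * height_m / 9.81) ** 0.5

G = 9.81  # one explicit variable: a single point of intervention

def fall_time_explicit(height_m: float) -> float:
    """Same computation, but g is an explicit, toggleable variable."""
    return (2 * height_m / G) ** 0.5
```

Both versions answer every intervention question the same way: "setting g" in the first means a find-and-replace over every use site, while in the second it means rebinding one name. It's the same kind of intervention either way, which is why the implicit g is still well-modeled as a single causal variable.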

Causal Variables and Predictive Categories

A few times I've told a story that goes like this: observe that a system has regularities in its behavior, see other systems with similar clusters of regularity, develop a label to signify "System that has been seen to exhibit Type X regularities."

Previously I was calling these "emergent features", but now I want to frame them as predictive categories, mostly to emphasize the pitfalls of thinking of them as causal variables. For ease, I'll be talking about it as a dichotomy, but you can really think of it as a spectrum, where a property slides from being relatively easy to isolate and intervene on while leaving the rest of the system intact (g in the code), all the way up to complete interdependent chaos (more like climate).

A problem we already spotted: thinking of a predictive category (like climate) as a causal variable can lead you to think that you can intervene on climate in isolation from the rest of the system.

But there's an even deeper problem. Think back to personality types. It's probably not the case that there's an easily isolated "personality" variable in humans. But it is possible for behavior to have regularities that fall into similar clusters, allowing for "personality types" to have predictive power. Focus on what's happening here. When you judge a person's personality, you observe their behavior and make predictions of future behavior. When you take a personality quiz, you tell the quiz how you behave and it tells you how you will continue to behave. The decision flow in your head looks something like this (but with more behavior variables):

All that's happening is you predict behavior you've already seen, and other behavior that has been known to be in the same "cluster" as the behavior you've already seen. This is a valid predictive model (results will vary based on how good your pattern recognition is), but it gives weird causal answers. What causes your behavior? Your personality. What causes your personality? Your behavior.
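That prediction-only flow can be sketched as follows. This is a toy illustration of mine, not the post's model; the cluster names and behaviors are made up:

```python
# Toy predictive categories: hypothetical clusters of co-occurring behaviors.
CLUSTERS = {
    "type_a": {"talks_fast", "interrupts", "plans_ahead"},
    "type_b": {"talks_slow", "listens", "improvises"},
}

def classify(observed: set) -> str:
    """Label = the cluster overlapping most with the observed behavior."""
    return max(CLUSTERS, key=lambda c: len(CLUSTERS[c] & observed))

def predict(observed: set) -> set:
    """'Prediction' is just the not-yet-seen remainder of the matched cluster."""
    return CLUSTERS[classify(observed)] - observed
```

The label carries no causal content: it is computed from behavior and then used to predict behavior, so "intervening on the label" has no meaning inside this model.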

Now, it's not against the rules of causality for things to cause each other, that's what control theory is all about (play with negative feedback loops here!). But it doesn't work with predictive categories[^1]. Knowing what personality is, we can expand "Your personality causes the regularities in your behavior" to "The regularities in your behavior cause the regularities in your behavior." There is no causal/explanatory content. At best, it's a tautology that doesn't tell you anything.

This is the difference between personality and climate. Both are predictive categories, but with climate we had a decent understanding of what variables we might need to alter to produce a "desert" pattern or a "rain-forest" pattern. How the hell would you change someone from an ENTP pattern to an ISFJ pattern? Even ignoring the difficulties of invasive brain surgery, I don't think anyone has any idea on how you would reshape the guts of a human mind to change it to another personality cluster.

Thinking of personality as a causal node will lead you to believe you have an understanding that you don't have. Since you're already mistaking a predictive model for a causal one, you might even build a theory of intervention where you can fiddle with downstream behavior to change the predictive category (we'll explore this sort of thinking more in later posts).

To recap: if you treat a predictive category as a causal variable, you have a good chance of misleading yourself about your ability to:

• Intervene on said variable in isolation from the rest of the system.
• Perform an intervention that shifts you to another predictive category cluster.
Humans and Essences

Finally we circle back to essences. You can probably already put together the pieces. Thinking with essences is basically trying to use predictive categories as causal nodes which are the source of all of an entity's behavior. This can work fine for predictive purposes, but leads to mishaps when thinking causally.

Why can it be so easy to think in terms of essences? Here's my theory. As already noted, our brains are doing causal learning all the time. The more "guts" of a system you are exposed to, the easier it is to learn true causal relationships. In cases where the guts are hidden and you only interact with a system as a black-box (can't peer into people's minds), you have to rely on other faculties. Your mind is still great at pattern recognition, and predictive categories get used a lot more.

Now all that needs to happen is for you to mistake this cognition you use to predict for cognition that represents a causal model. Eliezer describes it in "Say Not 'Complexity'":

In an eyeblink it happens: putting a non-controlling causal node behind something mysterious, a causal node that feels like an explanation but isn’t. The mistake takes place below the level of words. It requires no special character flaw; it is how human beings think by default, how they have thought since the ancient times.

An important additional point is to address why this easy-to-make mistake doesn't get corrected (I make mistakes in arithmetic all the time, but I fix them). The key piece of this not getting corrected is the inaccessibility of the guts of the system. When you think of the essences of people's personalities, you don't get to see inside their heads. When Aristotle posited the "essence of trees" he didn't have the tools to look into the tree's cells. People can do good causal reasoning, but when the guts are hidden and you've got no way to intervene on them, you can posit crazy incorrect causal relationships all day and never get corrected by your experience.

Quick Summary
• Properties of a system can be encoded implicitly instead of explicitly.
• The more a property is the result of complex interactions within a system, the more likely it is to be a predictive category instead of a useful causal variable.
• When you treat a PC as a CV, you invite yourself to overestimate the ease of intervening on the variable in isolation from the rest of the system, and to feel like you know how to coherently alter the value of the variable even when you don't.
• The less exposed you are to the guts of a system, the easier it is to treat a predictive model as a causal one and never get corrected.

[^1]: The category is a feature of your mind. For it to exert cause on the original system, it would have to be through the fact that your using this category caused you to act on the system in a certain way. When might you see that happen?

Discuss

### Epistemic Strategies of Selection Theorems

18 октября, 2021 - 11:57
Published on October 18, 2021 8:57 AM GMT

Introduction: Epistemic Strategies Redux

This post examines the epistemic strategies of John Wentworth’s selection theorem posts.

(If you want to skim this post, just read the Summary subsections that display the different epistemic strategies as design patterns.)

I introduced the concept in a recent post, but didn’t define epistemic strategies except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithm or, even more abstractly, the design patterns.

An example of an epistemic strategy, common in the natural sciences (and beyond), is

• Look at the data
• Find a good explanation
• Predict new things with that explanation
• Get new data for checking your prediction

More than just laying out some abstract recipe, analysis serves to understand how each step is done, whether that makes sense, and how each step (and the whole strategy) might fail. Just like with a design pattern or an algorithm, it matters tremendously to know when to apply it and when to avoid it, as well as the subtleties to be aware of.

Laying this underlying structure bare matters in three ways:

• It clarifies the research’s purpose and value for newcomers and researchers from other fields, with minimal assumptions of shared approaches.
• Just like a programmer switching to a new domain of problems will get up to speed faster and more reliably if they get access to the patterns/algorithms/tricks used in their new domain.
• It focuses feedback and criticism on the most important parts of the idea/proposition/argument.
• Issues with an algorithm more often focus on the point of it instead of the details, whereas issues with an implementation of that algorithm can be as much about typos, optimization tricks, and bad structure as about the actual core (the algorithm).
• It builds a library of such strategies for alignment in particular, a cookbook newcomers and senior researchers alike can browse for inspiration or a take on some new post/paper they don’t grok.

Now, before starting, I need to point out that the selection theorems posts don’t present research results; they present epistemic strategies (the eponymous theorems). Does that mean my job has already been done? Not exactly: John’s posts do present that epistemic strategy, but not in all the ways I want to stress. John is also trying to fill in a lot of concrete details and to convince people that selection theorems are a nice thing to research, which I don’t have to do. Instead, you can see this post as distilling the structure of selection theorems and interrogating them further as ways of producing knowledge.

(I use the word “agent” to stay coherent with John, but nothing in the epistemic strategy itself requires agency, so finding the idea of agents confusing shouldn’t be an issue for reading this post.)

Thanks to John Wentworth for feedback on a draft of this post.

Characterizing Selection Theorems

Selection theorems are theorems. Obviously. But what sort of theorems? What are they trying to find about the world?

John summarizes the whole class of results in the following way:

Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments.

This gives us three components of a selection theorem: the selection pressure, the class of environments considered and the constraint on the agent (what John calls the “type signatures”). Let’s get into each, looking for what can fill the corresponding hole in the general selection theorem.

Selection

A selection theorem is first and foremost about selection. Not just selection mechanisms (low-level processes like natural selection) but also selection criteria (abstract conditions like no Dutch-booking). The former state how selection happens, whereas the latter just characterize the sort of things that will be selected.

One of the differences is that a selection mechanism implies a selection criterion, either implicitly (natural selection) or explicitly (ML training with an actual loss function); whereas a selection criterion doesn’t necessarily come with a mechanism.

Still, both mechanisms and criteria come in a wide variety -- what makes a good one for a selection theorem? Making the selection theorem applicable to the real world situation we care about. The next section focuses on this topic of application, but in summary: mechanisms must fit actual selection processes in the situation, whereas criteria must come with an explanation of why they would be instantiated (possibly a corresponding selection mechanism, but not necessarily).

It’s also less obvious what makes a “good” criterion, because of the risk of assuming the very constraint we want to show in the selection criterion itself.

Environments

I find John’s formulation above unfortunate, because it doesn’t stress enough how the “broad class of environments” is part of the hypothesis of a selection theorem, not the conclusion. The intuition here is that we need enough variety to instantiate the selection pressure or criterion. Selection lets you force the agent’s hand, but only if you can instantiate the situations you need.

For a selection mechanism, this amounts to containing the sort of situations where the mechanism will push in the right direction and be strong enough (for example predation pushing natural selection forward). For a selection criterion, it is about including the situations that take advantage of every suboptimality in the agent (like the exploitative bets punishing suboptimality in no Dutch-booking).

Note though that while a broader class of environments might be necessary for proving the theorems, it makes applying them more difficult by putting more conditions on the environments in the real world setting. There is thus a trade-off between making it possible to prove the theorem (more environments) and making it possible to apply it (fewer environments). We thus want as small a set of environments as possible while still being large enough to leverage the selection.

Constraint on agents

In the original post, John takes pains to split agents’ type signatures into different components and to explain how they interact with each other. At the level of abstraction we’re working at, though, we only need to understand that type signatures are necessary conditions on the agents coming from the selection: if an agent is to be selected, it must do X (or do X with high probability).

What sort of conditions do selection theorems show? Here we have a discrepancy between what selection theorems historically prove and what John wants to get out of them. Existing selection theorems only prove behavioral necessary conditions: you must act like this (as in coherence theorems) or you must be able to do that (as in the Gooder Regulator theorem). On the other hand, what we truly want are structural necessary conditions — for example “you must have a separate world model with this interface and these properties”. John’s third post on selection theorems is all about how he wants that.

Indeed, structural constraints tell you not only that the system must solve the problem, but how it will do so. Alignment just becomes easier if we have knowledge of the internal structure of the system: we can make more pointed predictions about how it might be unaligned; we might use this structure for more concrete alignment schemes. Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.

I’ll go into more detail about proving structural constraints in the next section, but for the moment just note that this is the sort of thing we want.

Summary

Selection theorems thus have the following structure:

• Hypotheses
• (Selection pressure) Some means of selection, either a mechanism or a criterion.
• (Environments) A class of environments broad enough to instantiate the selection pressure in the needed way, but small enough to still apply the theorem to real world settings.
• Conclusion
• (Necessary Condition on Agents) Some property (ideally structural but maybe behavioral) that is guaranteed for all selected agents, or at least with high probability.
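One way to make this structure concrete is to type it out. This is a rough sketch of my own framing, not John's formalism; all names are illustrative:

```python
# Sketch of the selection-theorem structure as data types (my framing).
from dataclasses import dataclass
from typing import Literal

@dataclass
class SelectionPressure:
    kind: Literal["mechanism", "criterion"]  # how selection happens vs. what it selects
    description: str

@dataclass
class SelectionTheorem:
    pressure: SelectionPressure   # hypothesis: the means of selection
    environments: str             # hypothesis: the class of environments
    condition: str                # conclusion: behavioral or structural constraint
    probabilistic: bool = False   # True for "with high probability" conclusions

# Coherence theorems read in this scheme (informal paraphrase):
coherence = SelectionTheorem(
    pressure=SelectionPressure("criterion", "no Dutch-booking"),
    environments="bet spaces containing an exploit for every suboptimal policy",
    condition="behaves like an expected utility maximizer",
)
```

Nothing hinges on the exact fields; the point is that the pressure and the environments sit on the hypothesis side, and only the condition on agents is the conclusion.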
Proving Selection Theorems

Behavioral constraints

Existing selection theorems only prove behavioral constraints — that is, they only show that the agents must be behaviorally equivalent to a specific class (like EU maximizers in coherence theorems) or that they must be able to solve a specific problem (remembering all relevant data in the Gooder Regulator theorem).

How to prove selection theorems for behavioral constraints? Looking at the existing theorems, the first thing to notice is that they tend to use selection criteria. It makes sense, as they tend to be proved backwards: looking at the necessary condition on agents, what criterion selects only agents behaving like this?

It doesn’t mean such theorems are trivial or useless, just that they tell us which criterion selects for the necessary condition, not what is selected by some selection pressure.

Structural constraints

Here instead of criteria, mechanisms are favored. This is mostly because we want to show that some process (natural selection and/or ML training) leads to structural constraints, not find criteria for structural constraints.

Note that we should expect any mechanism to find some good ad-hoc agent without the structural properties; selection theorems for structural constraints can thus only give probabilistic guarantees. They say “out of the agents favored by this selection mechanism, most/almost all will have these structural properties”.

Here are some epistemic strategies to argue that the typical agents selected by a selection theorem on behavior alone should in expectation have additional structural constraints. The list isn’t exhaustive, and I expect these strategies to be combined when actually arguing for such structural constraints.

• (Agents with these structural constraints are easier to find) Especially with a selection mechanism, we can argue for properties of the selected agents that are easier to find.
For example, John argues that robust and broad optima are easier to find and retain through mechanisms for selection like gradient descent or natural selection, and proposes that these optima might correspond mostly to agents with modular structures.
• (Agents with structural constraints are a majority) If we can show that most of the selected-for agents have these structural constraints, that is some evidence that we should expect that structure. Not as strong as with an explanation of why these would be favored though.
Note that this applies both to mechanisms and criteria.
• (Agents with structural constraints are easier to sample) I already described this epistemic strategy, in relation with Alex Turner’s work on Convergent Subgoals and a comparison to Smoothed Analysis. Basically, if one can show that the agents without structural properties are so rare that they correspond to very steep high peaks in a mostly flat landscape, all but very few samplings of selected-for agents will end up satisfying the structural properties.
• (Proposed sampling gets structural constraints with high-probability) If we can propose a sampling method for agents, argue that it indeed fits with how the selection pressure eventually samples (like the proposal here for SGD), and show this sampling to find in expectation agents with the structural constraint, that’s a very strong argument for assuming this structure.
Summary

Proving selection theorems uses the following epistemic strategies:

• Proving behavioral selection theorem
• Choose a necessary condition to investigate.
• Find a selection criterion that should favor the necessary condition.
• Prove the theorem.
• Proving structural selection theorem
• Choose a selection mechanism to investigate.
• Find a structural constraint that should be favored by the mechanism.
• Prove the theorem.
• Show that agents with these structural constraints are easier to find.
• Show that agents with structural constraints are a majority.
• Show that agents with structural constraints are easier to sample.
• Propose a sampling of agents and show it results in structural constraints with high-probability.
Applying Selection Theorems

Even pure mathematicians don’t prove theorems only for the joy of the proof: the value of a theorem often comes from what it shows and where it can be applied. The same holds in alignment, with the additional difficulty that we want to apply it to real world systems and situations, not only to other abstractions. This means we need to understand when we can apply selection theorems and what we can learn from that application.

Requirements of selection theorems

First things first: selection theorems require the existence of selection. Once again quite obvious, but it becomes more interesting once we dig into the subtleties.

How to argue for the existence of selection depends on whether the theorem uses a mechanism or a criterion.

• (Mechanism) The question is whether the mechanism actually happens in the real world application. Answering this question can go from trivial (we know ML training happens because we’re the ones implementing it) to yielding epistemic strategies used for showing selection happens (like the arguments for natural selection).
• (Criterion) An additional difficulty with a criterion is that we need to justify that selection along this criterion indeed happens. That doesn’t necessarily mean providing a full selection mechanism, but we need at least reasons for why this would happen.
Most selection theorems using criteria (like coherence theorems) propose a high-level selection mechanism for this purpose.

The other requirement concerns environments. Not only do we need the variety of environments over which selection is taking place, but the environments also need to fit the mold assumed in selection theorems. Coherence theorems for example require that bets can be defined in the environments with the required properties, and that the space of bets considered contains the dutch-booking strategies for any suboptimal policy. The Gooder Regulator Theorem has more concrete requirements in terms of the underlying causal structure, and the same sort of variety constraint on the “tasks” that the agent has to solve.

Interpreting the application of selection theorems

Once we are confident the selection theorem applies in our concrete setting, we can reap its fruits. But what are those fruits? At first glance, they’re obvious: the necessary conditions stated in the theorem! Yet anyone who ever applied a theorem to a real world setting knows how perilous that task is.

How do you make sense of the necessary conditions in your setting for example? You need to find a way of grounding the constraints on agents you get out of the theorem.

This is where most applications of theorems to real world settings go wrong, in my opinion. Yet this is also the part I have the least to say about, because I just don’t have some nice epistemic strategy to check that some conclusions taken from applying a theorem to situation S actually make sense. I’ve seen people do that move, I’ve made it myself, but I don’t have a nice description of the underlying algorithm. So let’s flag that as an open problem for the time being.

Summary

Applying selection theorems uses the following epistemic strategies:

• Checking that the selection theorem applies
• Check that the selection exists.
• For a mechanism, check that it fits with how selection happens.
• For a criterion, find a (possibly high-level) mechanism for why selection happens along these lines.
• Check that the environments fit the required structure.
• Check that the environments fit the required variety.
• Interpreting the theorem after applying it
• Open Epistemic Strategy Problem
• How to interpret a behavioral constraint?
• How to interpret a structural constraint?
• That we can model the agent coherently with that structure?
• That the agent implements that structure explicitly?
Breaking Selection Theorems

Last but not least, analyzing an epistemic strategy tells us where it can go wrong. The analogy to think of here is falsification: this is a standard and strong way of trying to break a scientific model. What does that look like for selection theorems?

Let’s use the summary design patterns of the previous sections and, for each step, find issues/criticisms/ways of breaking it.

• Proving behavioral selection theorem
• Choose a necessary condition to investigate.
• Find a selection criterion that should favor the necessary condition.
• Find a counterexample (agent selected by criterion but not satisfying the necessary condition; or subset of enough agents to break probabilistic condition).
• Prove the theorem.
• Find an error in the proof.
• Proving structural selection theorem
• Choose a selection mechanism to investigate.
• Find a structural constraint that should be favored by the mechanism.
• Prove the theorem.
• Show that agents with these structural constraints are easier to find.
• Show that many agents without the structural constraints can be easily found by the selection pressure.
• Show that agents with structural constraints are a majority.
• Show that there isn’t a majority of selected-for agents with structural constraints.
• Show that agents with structural constraints are easier to sample.
• Argue that the set of selected-for agents is different from the one used in the work, and that for the actual set, sampling agents without structural constraints becomes simpler.
• Propose a sampling of agents and show it results in structural constraints with high-probability.
• Show that the proposed sampling disagrees with what the selection pressure actually finds (showing that the probabilities are different, or that one can sample agents that the other can’t).
• Checking that the selection theorem applies
• Check that the selection exists.
• For a mechanism, check that it fits with how selection happens.
• Show that the actual selection works differently than the mechanism described, and that these differences influence massively what is selected in the end.
• For a criterion, find a (possibly high-level) mechanism for why selection happens along these lines.
• Argue that the posited high-level selection mechanism for the criterion doesn’t exist or that it doesn’t push towards the criterion.
• Check that the environments fit the required structure.
• Show that the concrete environments don’t fit the constraints of the theorem.
• Check that the environments fit the required variety
• Show that the concrete environment lacks some situations that are needed to make the proof hold.
• Interpreting the theorem after applying it
• Open Epistemic Strategy Problem
• How to interpret a behavioral constraint?
• How to interpret a structural constraint?
• That we can model the agent coherently with that structure?
• That the agent implements that structure explicitly?

Lastly, in addition to criticizing a specific application of the theorem, we might argue that the theorem cannot be applied to the wanted setting, or that it doesn’t make sense to conclude what is wanted from it. This amounts to the points above, with the twist of arguing that it’s impossible instead of just breaking the argument at some joint.

This obviously fails to list all possible ways of critiquing a selection theorem and its application. You might have noted that I didn’t say anything about interpreting the necessary condition once the theorem is applied; indeed, without understanding the epistemic strategy involved, it’s harder to get to the core.

Still, any criticism and feedback along these lines would be directly useful to the researcher (John or someone else) proposing a new selection theorem and/or applying one. My claim is that using the design pattern above helps in providing feedback, by drawing attention to the most important parts of the epistemic strategies involved.

Discuss

### Humans are the universal economic bottleneck

18 октября, 2021 - 11:32
Published on October 18, 2021 8:32 AM GMT

There's this idea in computer science (Amdahl's law) wherein the maximum theoretical speedup that can be achieved with an arbitrary number of processors is determined by the fraction of the program that can be parallelized. If we have two segments of code that take the same amount of time to execute with one CPU core, where the first segment can't be parallelized at all and the second segment is perfectly parallelizable, we can only run the program twice as fast, no matter how many CPU cores we have.
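That bound can be computed directly; a minimal sketch:

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Maximum speedup when `parallel_fraction` of the work parallelizes perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# The example above: half the runtime is serial, half perfectly parallel.
# With ever more cores, the speedup approaches 2x but never exceeds it.
```

With `parallel_fraction=0.5`, even a billion processors yield a speedup of essentially 2 and no more, since the serial half still takes its full time.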

There's a similar idea in economics. It seems like the most powerful and civilizationally relevant feature controlling the medium to long term change in the price of goods is the extent to which the production of that good can be decoupled from the expenditure of man-hours. Some economic activity isn't "parallelizable" using current technology- we can't practically make that activity much faster without building powerful substitutes for humans, technology which is (for now) mostly out of our reach...

For example, it turns out that moving stuff over land is not easy to decouple from human labor without self-driving cars. There are two methods of overland transportation worth noting here: Cars and Trains.

Due to the nature of our road infrastructure, there's a pretty clear upper bound on how efficient car-based transportation of goods can get. There are legal limits on the allowed speed and size of vehicles, so without self-driving tech, we can't change how many man-hours need to be spent per cubic meter per kilometer.

Train-based transportation has its own problems which limit its ability to dominate overland transportation. Namely, our train transportation network is incomplete in that goods must still be ferried to their final destination by cars, so overland transportation is only some % "parallelizable".

Using this, we can predict that overseas transportation would be really efficient compared to overland transportation. I suspect that nautical miles per cargo container per hour per person can be pushed extremely high using current technology, laws, port and canal infrastructure, etc.

And indeed this is true:

Be warned that overland freight for even short distances can often be almost as much as ocean freight for thousands of miles (I recently paid $2000 for a 20′ container to be shipped from Shanghai to Los Angeles and then $1100 for it to be shipped 30 miles from Los Angeles to San Moreno).

Our ability to recursively reinvest our production is most strongly limited by these required industries which are mostly bottlenecked by the number of humans and how long they're willing to work, neither of which are easy to manipulate and neither brings the sort of powerful prosperity that characterizes modernity.

EDIT: Someone commented that other factors are at play that make cargo ships efficient. I do not disagree; this was just an example of the sort of weak estimate which could be made using this idea. I am interested in determining how important this effect is in the case of cargo ships, so I will do a short analysis.

Compare to semi-trucks, which can carry roughly 40,000 kg of material. We will say these trucks move at 100 km/h. The wage of a semi-truck driver in the U.S. is roughly $20/hour, so combining, we have

(40,000 kg × 100 km/h) / ($20/h) = 200,000 kg·km per dollar.
local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: 
italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), 
local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} , so it costs $1 to move 200,000kg one kilometer- 
or rather, that is the wage component. The distance between Shanghai and Los Angeles is ~10,000 km, and the limit weight of a 20ft shipping container is about 80,000 kg. Assuming wage costs for ship crew are negligible (this should cancel out with previous generous estimates), we have 800,000,000 kg·km for this trip for this container. If the same wage were required per kg per km as in the semi-truck case (note much is ignored here for the sake of approximation), this would cost an additional $4,000. For comparison, according to this site, the cost of shipping 80,000 kg from Shanghai to LA is $14,000.

Discuss

### Quick Look: Altitude and Child Development

18 октября, 2021 - 04:20
Published on October 18, 2021 1:20 AM GMT

A client came to me to investigate the effect of high altitude on child development and has given me permission to share the results. This post bears the usual marks of preliminary client work: I focused on the aspects of the question they cared about the most, not necessarily my favorite or the most important in general. The investigation stops when the client no longer wants to pay for more, not when I’ve achieved a particular level of certainty I’m satisfied with. Etc. In this particular case they were satisfied with the answer after only a few hours, and I did not pursue beyond that.

That out of the way: I investigated the impact of altitude on childhood outcomes, focusing on cognition. I ultimately focused mostly on effects visible at birth, because birth weight is such a hard-to-manipulate piece of data. What I found in < 3 hours of research is that altitude has an effect on birth weight that is very noticeable statistically, although the material impact is likely to be very small unless you are living in the Andes.

Children gestated at higher altitudes have lower birth weights

This seems to be generally supported by studies which are unusually rigorous for the field of fetal development.
Even better, it’s supported in both South America (where higher altitudes correlate with lower income and lower density, and I suspect very different child-rearing practices) and Colorado (where the income relationship reverses, and while I’m sure childhoods still differ somewhat, I suspect less so). The relationship also holds in Austria, which I know less about culturally but which did produce the nicest graph. This is a big deal because until you reach truly ridiculous numbers, higher birth weight is correlated with every good thing, although there’s reason to believe a loss due to high altitude is less bad than a loss caused by most other causes, which I’ll discuss later. [Also, for any of you wondering if this is caused by a decrease in gestation time: good question, the answer appears to be no.]

Children raised at higher altitudes do worse on developmental tests

There is a fair amount of data supporting this, and some of it even attempts to control for things like familial wealth, prematurity, etc. I’m not convinced. The effects are modest, and I expect families living at very high altitudes (typically rural) to differ from families at lower altitudes (typically urban) in ways that cause their children to score differently on tests without a meaningful impact on their lives (and unlike birth weight, I didn’t find studies based in CO, where some trends reverse). Additionally, none of the studies looked specifically at children who were born at a lower altitude and moved, so some of the effects may be left over from the gestational effects discussed earlier.

Hypoxia may not be your only problem

I went into this primed to believe reduced oxygen consumption was the problem. However, there’s additional evidence that UV radiation, which rises with altitude, may also be a concern. UV radiation is higher in some areas for other reasons, which indeed seems to correlate with reductions in cognition.

How much does this matter?
(Not much.)

Based on a very cursory look at graphs on GIS (to be clear: I didn’t even check the papers, and their axes were shoddily labeled), 100 grams of birth weight corresponds to 0.2 IQ points for full-term babies. The studies consistently showed ~0.09 to 0.1 grams lower birth weight per meter of altitude. Studies showed this to be surprisingly linear; I’m skeptical and expect the reality to be more exponential or S-shaped, but let’s use that rule of thumb for now. 0.1 g/m means gestating in Denver rather than at sea level would shrink your baby by 170 grams (where 2,500g-4,500g is considered normal and healthy). If this were identical to other forms of fetal weight loss, which I don’t think it is, it would very roughly correspond to 0.35 IQ points lost.

However, there’s reason to believe high-altitude fetal weight loss is less concerning than other forms. High-altitude babies tend to have a higher brain mass percentage and are tall for their weight, suggesting they’ve prioritized growth amidst scarce resources rather than being straight-out poisoned. So that small effect is even smaller than it first appears. There was also evidence out of Austria that higher altitude increased risk of SIDS, but that disappeared when babies slept on their backs, which is standard practice now.

So gestating in Denver is definitely bad then?

(No.)

There are a billion things influencing gestation and childhood outcomes, and this is looking at exactly one of them, for not very long. If you are making a decision, please look at all the relevant factors, and then factor in the streetlight effect: there may be harder-to-measure things pointing in the other direction. Do not overweight the last thing I happened to read. In particular, Slime Mold Time Mold has some interesting data (which I haven’t verified but am hoping to at least ESC the series) that suggests higher altitudes within the US have fewer environmental contaminants, which you would expect to have all sorts of good effects.
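The back-of-the-envelope arithmetic above can be sketched in a few lines. The coefficients are the post's rough rules of thumb, and the ~1,700 m altitude figure (chosen to match the post's 170 g Denver estimate) is an assumption:

```python
# Sketch of the post's altitude -> birth weight -> IQ arithmetic.
# Coefficients are the post's rough rules of thumb; 1,700 m is an
# assumed altitude matching the post's ~170 g Denver figure.

GRAMS_LOST_PER_METER = 0.1   # ~0.09-0.1 g less birth weight per meter
IQ_POINTS_PER_100G = 0.2     # ~0.2 IQ points per 100 g of birth weight

def iq_cost_of_altitude(meters: float) -> float:
    """Estimated IQ points lost, treating altitude-related weight loss
    like any other fetal weight loss (the post argues it is milder)."""
    grams_lost = GRAMS_LOST_PER_METER * meters
    return grams_lost * IQ_POINTS_PER_100G / 100

denver = 1700  # assumed altitude in meters
print(round(GRAMS_LOST_PER_METER * denver))   # ~170 g lighter
print(round(iq_cost_of_altitude(denver), 2))  # ~0.34 IQ points
```

This reproduces the post's "very roughly 0.35 IQ points" figure up to rounding, and makes it easy to see how small the effect is at more modest altitudes.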
Full notes available here. Thanks to anonymous client for commissioning this research and Miranda Dixon-Luinenburg for copyediting.

Discuss

### Your Time Might Be More Valuable Than You Think

18 октября, 2021 - 03:55
Published on October 18, 2021 12:55 AM GMT

Summary

• People often seem to implicitly value their time at the amount they can convert hours to dollars given their current skills.
• However, the value of saving the marginal hour today is to increase the total number of one's working hours by one, resulting in a new hour at the end of one's career, not a new hour at one's current skill level.
• This suggests that people who expect their time to become valuable in the future must think their time is approximately just as valuable now, because saving time now gets them to the point where their time is valuable faster and gives them more of such time.
• This analysis is complicated by various temporal dependencies (e.g. time discounting) that push the value of the current hour up or down compared to the value of the marginal hour at the end of one's career.
• Under such a view, finding promising young altruists and speeding up their careers represents a significant value add.

Intro

Many people in my social circles have an amount they "value their time." Roughly speaking, if someone values their time at $50/hr, they should be willing to pay $50 to save an hour of time, or be paid $50 to do work that has negligible non-monetary value. Knowing this value can provide a simple decision rule for deciding which opportunities to trade money for time it's efficient for you to take. I will argue that a naive perspective on time valuations generally results in an underestimate. This analysis suggests that altruistic actors with large amounts of money giving or lending money to young, resource-poor altruists might produce large amounts of altruistic good per dollar.
I will analyze the situation mostly in terms of wages as expressed in dollars; however, readers might want to substitute "altruistic impact" instead. I will begin by analyzing a simplified situation, adding more nuance later.

The value of your time is the value of the marginal hour at the end of your career

If I currently have a job that lets me convert one hour of time into $50, then it's clear that I should take all time-saving opportunities that cost less than $50. (Note that this doesn't mean that I should pay $50 to save an hour of furniture assembly. Furniture assembly might be enjoyable, teach me valuable skills, etc.) However, this assumes that the benefits I receive from my job are entirely monetary. For most jobs, this will not be the case. If one is a software engineer, then much of the benefit of 1 hour of working as a software engineer will be the skills and experience gained during that hour. To be more specific, the hourly rate that a software engineer commands depends on their skill, which depends in turn on training and experience. Thus an hour of software engineering might increase expected future compensation by more than $50 (in fact, under plausible assumptions, this will be the primary benefit of the early part of most careers).

However, as stated above, the naive view suggests that the marginal hour at the present is worth w(t0), your current wage at time t0. This is not what actually happens when you save one hour at the present. What actually happens is that the total earnings of your career, which ends at time T, go from ∫_{t0}^{T} w(t) dt to ∫_{t0}^{T+1} w(t) dt, for a difference of w(T) rather than w(t0). Since one's expected wage at the end of a career is likely substantially higher than one's current wage (especially for people at the beginning of their careers), treating the value of one's time as w(t0) instead of w(T) leads to an underestimate of w(T) − w(t0).
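A tiny numeric sketch of this point, under a hypothetical wage path (the $50 starting wage, $250 final wage, and 80,000-hour career below are illustrative assumptions, not the post's numbers):

```python
# Sketch: saving an hour today extends the career by one hour at the END,
# so its value is the final wage w(T), not the current wage w(t0).
# The linear wage path is a hypothetical assumption for illustration.

def wage(t: float) -> float:
    """Hourly wage after t working hours (hypothetical: $50 growing
    linearly to $250 over an 80,000-hour career)."""
    return 50 + 200 * t / 80_000

def career_earnings(total_hours: int) -> float:
    """Riemann-sum approximation of the integral of wage over a career."""
    return sum(wage(t) for t in range(total_hours))

base = career_earnings(80_000)
saved = career_earnings(80_001)  # one saved hour -> one extra career hour
print(round(saved - base))       # the marginal hour is worth wage(80_000)
```

The difference comes out to wage(80,000) = $250, the end-of-career wage, even though wage(0) is only $50.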

For example, suppose that one is a quantitative trader. They currently earn $100/hr. However, with 20,000 hours (10 years, assuming 2,000 working hours a year) of experience, they expect to earn $1000/hr. If they have no time-discount rate on money, then they should be willing to pay up to $1000 to save an hour of time presently, despite the fact that they will be net down $900 if they use that time to do work. Another way of seeing this is that saving an hour of time for your present self is in some sense the same thing as saving an hour of time for your future self, because it causes the future to arrive one hour earlier and be one hour longer. Thus, if you would be willing to trade an hour for $1000 in the future, you should also be willing to do so now.

This also suggests that working twice as much produces much more than twice the value. Naively, a 160,000-hour career produces the same value as two 80,000-hour careers. However, in reality, one of those careers is going to start with 80,000 hours of experience! This doesn't account for a lot of relative factors (being faster than competitors can produce much higher amounts of value) or aging-out effects like getting worse at working as you work more. A corollary is that burning out for a year is a disaster, because it's equivalent to losing a final career year. Similarly, vacations and other such leisure activities have larger costs than one might have naively expected, since they delay career growth and shorten careers. For example, if someone who could have had a 40-year career burns out for a year, their career is now 39 years and is missing the year where they would have had 39 years of experience.

Temporal Dependence

One key factor missing in the above analysis is a temporal dependence on the value of wages.
(The substitution of wages for altruistic impact is going to break down slightly and depend on complicated factors like the flow-through effects of altruism and whether standard investment returns are higher than altruistic flow-through effects. See Flow-Through Effects of Innovation Through the Ages and Giving Now vs. Later for a more nuanced discussion.)

The most obvious form of temporal dependence is a monetary discount rate controlled by the ability to turn money now into more money later via standard investments. Such a discount rate suggests that our theoretical quantitative trader discussed above should not be willing to spend $1000 to save an hour of time at the present day, but rather an amount that would be equivalent to $1000 after 10 years of investment (approximately $500 at 7% yearly returns). I could write an equation expressing this, but I don't think it would lend much clarity.
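For readers who do want the omitted equation, it is ordinary present-value discounting; a minimal sketch using the figures from the paragraph above ($1000, 7% yearly returns, 10 years):

```python
# Present value of a future amount under a constant yearly discount rate:
# PV = FV / (1 + r)^n. Figures are the post's quantitative-trader example.

def present_value(future_amount: float, rate: float, years: int) -> float:
    """Amount today that grows to future_amount at the given yearly rate."""
    return future_amount / (1 + rate) ** years

print(round(present_value(1000, 0.07, 10)))  # ~508, i.e. roughly $500
```

This is where the post's "approximately $500" comes from: $508 invested at 7% for 10 years compounds to about $1000.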

Less standard but more accurate analyses would incorporate the relative differences in the value of money over time for your particular goals. For instance, it might be that the altruistic discount rate on dollars is much higher than the standard discount rate because there are altruistic opportunities available now that won't be available later, even if you had double the money. Another salient example is effective altruism movement building (meta-EA), which might get most of its value early on. One way to model this is that instead of producing value directly, people in meta-EA save other people's time (by getting them into the movement earlier), enabling them to produce more value later. If you think, for example, that this century is particularly important, then saving an early-career altruistic professional 1 year of time in 2090 is going to get you the marginal year of someone with ~10 years of experience, compared to saving such a person 1 year in 2080, which gets the marginal year with ~20 years of experience. Depending on how quickly you think the value of someone's work goes up with experience, this might suggest large discount rates.

As another example, people working in AI Alignment (like me) might think that the most valuable alignment work is going to be done in the ~10 years preceding transformative AI (TAI). If you think this date is about 2055 (see Holden's summary of Ajeya's Forecasting TAI from Biological Anchors), then the most important thing is to maximize your abilities as a researcher from 2045-2055. (It's possible that you should be making different bets, e.g. if you think you have more influence in worlds where TAI is sooner.) Since I'll probably still be working in 2055, saving a marginal year of time today gives me one extra year of research experience during the decade preceding TAI, but not any extra marginal years during that decade. (This does suggest that saving time during that decade is very valuable, though.) Of course, I am not modeling various effects that current research has on things like field building, which potentially dominates the value of my current work.

Actionables

This analysis suggests that people with the potential to earn high salaries/have high altruistic impact have high time value, not because they can produce useful work currently, but because it will get them to where they eventually will end up faster. Provided this holds qualitatively, it suggests a couple of things:

• Care about the value of your time more and try to aggressively take opportunities to save it or spend it more effectively, even if this doesn't make that much sense in terms of the value you think you can currently generate.
• This might mean that you should take out loans and such, so you have resources. If your expected future earnings are high, then things like hiring tutors to graduate school faster are likely worth the interest on the loan.
• What you spend your free time doing actually kind of matters. Developing some skill one year faster increases the amount of value you produce on the margin by quite a bit.
• Standard advice saying that young people have time to explore potential career options should be balanced against the cost of becoming less awesome in that particular career option because too much time was spent exploring.
• For example, if someone is potentially a promising AI Aligner, and they take a year off college to travel the world and see the sights, this decreases the amount of research experience they have during the period around TAI by a year.
• Exploration is still probably a good idea, but it should be traded off not against the value one would have produced directly, but rather the marginal increase in value that would have resulted from the increased growth/experience if that time wasn't spent exploring.
• Finding promising young people and using large amounts of resources to speed up their careers probably has pretty good altruistic returns.
• (If you think you're such a person and could benefit from additional resources, feel free to send me an email and I'll see what I can do.)

Discuss

### The Colonization of Cults, Nonprofit Organizations, and Society

18 октября, 2021 - 01:02
Published on October 17, 2021 10:02 PM GMT

Over the past 8+ years of nonprofit experience, and during a brief stint of training with a high-demand group focused on meditation and leadership development (The Monastic Academy), I have observed how patterns and ideologies related to the complex socio-emotional and historical contexts of American culture and colonization show up, again and again, both within the broader systemic issues nonprofits exist to address and within the organizations themselves.

Recently, while reviewing a list of characteristics and patterns common in “cult” dynamics, I recognized that I was also looking at a list that describes colonization. According to a simple Google search, colonization is the action or process of settling among and establishing control and domination over the indigenous people of an area. Historically, global colonization has often targeted and disproportionately affected communities of color, including through the genocide of indigenous peoples, forced assimilation into cultural and religious practices, loss of language and culture, the taking of indigenous lands, the enslavement of Africans and other peoples, the forced separation and abuse of indigenous children in boarding schools, etc. Before then, many groups within Europe had their own history of invasion, conquest, and colonization (i.e. the spread of the Roman empire, the English clearing of the Scottish Highlands). These practices and the history of colonization have left a deep psychological, physical, emotional, and spiritual imprint on people from all walks of life. Unfortunately, the unconscious and conscious patterns/attitudes inherent in “colonization” still show up throughout all levels of our society, perpetuating harm and inequity, and are often deeply embedded within our frameworks for and understanding of community, leadership, and institutional and organizational management. This makes it critically important to be intentional about recognizing and addressing unhealthy and dysfunctional patterns of behavior, structures, and practices that perpetuate harm directly and indirectly within our communities and organizations.

On the far end of this spectrum, we see high-demand groups, commonly known as “cults”. Many of these groups operate under a 501c3 nonprofit status and under the guise of having a mission to bring transformative change or to be of service to the world. However, their outward-facing mission and values often prove to be incongruent with the internal narrative and actual impacts of the organization. Perhaps these groups are not as separate from the dominant culture as we might like to think but are in fact intense microcosms in which particular underlying ideologies, structures, and behaviors are taken to an extreme. Common characteristics, ideologies, and patterns within cults include but are not limited to: recruiting of “elite or special ones”; we are the chosen ones; we are going to save the world; we have the right to have and exercise power over others because we are better in x, y, or z ways; unlimited expansion (Manifest Destiny, anyone?); use of religion/spirituality and power to control people and governments; the dominated group submits to the will of the dominator; hierarchical and authoritarian (often patriarchal) styles of leadership; breaking down of one's personal and cultural identity and replacing it with a new cult dogma and identity; abuses of power and lack of accountability for those abuses; distorted and disempowered relationships between feminine/masculine energies and persons; disconnect from/distrust of your own body and emotions; unhealthy relationships to resources (i.e. money, land) and resource extraction (i.e. unethical fundraising practices, illegal activities); etc. Involvement in and hierarchies within these groups often, but not always, reflect the lines of class, gender, and race found in the broader society, as “cults” (aka high-demand groups) often target people with money and greater social influence.

At the same time, "cults" themselves are likely a long-term cultural byproduct of colonization that has left many people rootless, with intergenerational trauma, experiencing "the loss of the village", with inadequate socio-emotional support networks, and lacking cultural identity and connection to the cultures their ancestors came from. These impacts also extend to many people of European descent. It is important to acknowledge that capitalism has played a significant role in the breakdown of community and "the loss of the village". Many people today, especially young people, are hungry for a sense of cultural identity, belonging, initiation, community, guidance, and mentorship; are searching for solutions to societal and environmental breakdowns; and need the shared purpose and meaning that is lacking in the broader culture. Throw in a major loss, a life transition, or past childhood trauma without the support of a "village," and people are incredibly vulnerable to charismatic leaders who more or less promise to give them everything they have been looking for, at a "price." This price is often their agency, their power, their silence, access to their resources (money, sex, social influence), and their complicity in perpetuating harmful power structures and dynamics.

Many nonprofits and companies that do not fit the defining characteristics of a "cult" have also been guilty of perpetuating these systems, patterns, and practices that further disempower marginalized groups, individuals, and local communities (especially poor and BIPOC communities). Whether or not an organization meets the criteria for a "cult," these patterns of "colonization" are problematic and cause harm within our communities. Some of the ways they show up in both nonprofit organizations and companies are:

• Mission-driven vs. Community-centered.  The “mission” and/or the needs/desires of the institution/leaders are put above the needs of the communities they serve (i.e. clients, participants, customers) and employees even in cases where the actions of the organization cause harm to those who interact with it. Healthy organizations intentionally use metrics for evaluation and feedback processes to gather data about their impact and the needs of communities through surveys, focus groups, listening to feedback and grievances within the community, and centering the experiences and needs of the community as being fundamental to their work and mission. They do not put "the mission" above the needs of the community or use it to justify unethical and/or harmful behaviors and impacts.
• Hierarchical and inequitable power structures that reflect along lines of class, race, ability, and gender (often unconsciously), creating and perpetuating longstanding patterns of harm, inequitable access and opportunity, power imbalances, and abuses of power, etc. For this reason, many organizations have started to shift towards collaborative and decentralized models of leadership; focused education and training in anti-oppression models are also essential.
• Lack of effective accountability and grievance processes. Healthy organizations create intentional structures and processes that ensure that leaders, employees, and community members understand standards of conduct and are accountable for their actions and impact. Some examples of this include a well-developed and diverse board that has at least 7 members without conflicts of interest (re: nonprofit best practices), clearly outlined grievance and feedback processes for employees and participants, checks and balances within the system, distribution of power, committees and/or employees devoted to handling grievances/complaints and accountability, acknowledging and making amends for harms done, active engagement in restorative practices and mediation processes, etc.
• Erasure of personal and cultural identities and differences through policies that limit the expression of identities (i.e. sexuality, religious, political, etc.), fear-based compliance and silencing of voices of dissent, lack of inclusion, and lack of power given to those with different backgrounds and perspectives, etc. While a healthy company or organizational culture is important to cultivate and can create strong group cohesion; when branding, policies, uniforms, and other practices seek to exclude or replace existing identities this can lead to unhealthy group dynamics. Healthy organizations value diverse backgrounds, thinking, and approaches; and see the essential and valuable contributions these bring to any organization.
• “Save the world” narratives and self-aggrandizing marketing that many organizations and companies engage in by exaggerating the importance, role, or uniqueness of their mission and work. Healthy organizations demonstrate awareness of other organizations that engage in similar types of transformative and social change work in their focus area and/or others who offer similar products or services in the for-profit world, as well as what is unique about their own approach, methodology, service, or product. They understand that social change and transformation is a collaborative process, as well as the importance of accurately representing and demonstrating the claims they make, especially when positioning themselves as “best” or better than alternative options.
• Engaging in narrative control through the use of nondisclosure agreements, threats, or other forms of manipulation and coercion. By silencing accounts of harm and unethical conduct within the organization, these groups effectively control the narrative and the sharing of information. They may also withhold or distort information that would reflect negatively on the organization and/or impair people’s ability to make fully informed, consent-based decisions, through PR, reports, and solicitations that do not accurately describe the activities of the organization or the events being reported on, especially to stakeholders, funders, and major donors.
• People are treated as a means to an end, and actions that are unethical or that either intentionally or unintentionally result in harm are dismissed or rationalized through the argument that “the ends justify the means.” Healthy organizations and companies center the experience and needs of community members, participants, employees, clients, and customers; and do not sacrifice the wellbeing and safety of any of these in pursuit of “the mission.”
• Appropriation and consumption of cultural identities (especially BIPOC identities) practices, knowledge, attire, lands, and more without the permission of those who are a part of that cultural identity and without the proper cultural and historical context for the things we are partaking in. Appropriation can also extend to using and/or taking credit for other people's ideas and intellectual property without their permission.

Because of the prevalence of these patterns within organizations and community spaces, I believe it is critical that all organizations (regardless of whether they use a nonprofit or for-profit model) engage in accountability processes, staff training, and community dialogue focused on decolonization, anti-racism, anti-sexism, and other anti-oppression frameworks, as well as engage in preventive conversations and measures designed to address systemic barriers, common pitfalls and challenges, and to intentionally nourish and maintain healthy organizational practices and community dynamics. It takes a lot of work to examine and de-program the many toxic ideas of leadership and community we've received and to build healthy organizations, because many of the models we have inherited are harmful.

Please feel free to comment below on other ways you’ve seen patterns and behaviors of “colonization” show up in organizational cultures. By no means is this list of examples comprehensive. How have you seen these patterns show up within the communities and organizations you’ve been a part of? What steps can we take to radically “decolonize” our organizations and communities? What practices and structures do you think best support the development of healthy organizations and community dynamics?

Discuss

### Applied Mathematical Logic For The Practicing Researcher

17 октября, 2021 - 23:28
Published on October 17, 2021 8:28 PM GMT

Asking for a friend[1]: what happened to Richard Hamming's social status after he started asking those pointed questions about the importance of research questions and individual career decisions? Was he, like, actually banished from the lunch table?

Technically, I am modeling for a living

A couple of months ago, I started asking my colleagues during lunch what their definition of a "model" is. This question is important: our job consists of building, evaluating, and comparing models. I am not hoping for an Aristotelian list of necessary & sufficient conditions, but it still appears like a good idea to "survey the land". Also, admittedly, lunch can get a bit boring without challenging questions.

An abstract drawing of a computational model. CGD generated.

I got a range of responses:

"a description of a phenomenon from which you can reason (= a description you can manipulate to tell more about the phenomenon than you would have been able to tell without it)"

"It should be something like a representation of the modelled system without representing it completely. Perhaps most importantly that it preserves the causal relationships between the system elements without completely mapping these elements?"

"an abstraction of reality"

I also ran into this adage again and again (attributed to a different person every time):

"All models are false, but some are useful."

Along similar lines, there is a quote from the influential computational neuroscientist Larry Abbott:

"the term 'realistic' model is a sociological rather than a scientific term."

Alright, survey done, lunch is over. Back to...

In search of tighter concepts

No! I'm not satisfied. What do you mean it's a sociological term? What do you mean they are false? Can a model have a truth value? If a model is a "representation" / "abstraction" / "description" then what exactly is a "representation" / "abstraction" / "description"? This is not some idle philosophical nitpicking, this question is immediately important. As a reviewer, I have to judge whether a model is good (enough). As a researcher, I want to build a good model. I'm not going to devote my career to building models if I don't have a really good idea of what a model is.

I hope you can tell from my extensive use of italicized words that this is a topic I am rather passionate about. If the question of a good model is a sociological question then it's subject to trends and fads[2]. And if the term "model" is broad enough to fit "detailed biophysical models", "abstract phenomenological models", "linear regression" and "a cartoon in Figure 8" under its umbrella, then it's inevitable that our intuitive understandings of what constitutes a good model diverge. Heck, the term is so broad, technically even this should qualify:

An abstract painting of a very attractive albatross that could totally be a fashion model. CGD generated.

So in the spirit of conceptual engineering and dissolving questions, here is my attempt at laying out what I think of when I think of models. This is obviously not authoritative and it's far from rigorous. This is just my "working definition" which I wrote down to force myself to tighten my terminology.

Mathematical logic to the rescue

Since we mean so many different things by the term "model" it makes sense to start very general, i.e. mathematical. There is indeed a subfield of mathematics called "model theory" that makes some very useful distinctions! I'll trample over all subtleties to get to the core quickly, but consider checking out this or this for accessible introductory reading.

Here goes the central definition:

A model is a (mathematical) object that satisfies all the sentences of a theory.

To make this useful, we have to further define the used terms.

What is a theory? It's a set of sentences. What is a sentence? Well, it's pretty much what you would expect - it's a string of symbols constructed from an alphabet according to some fixed rules. A famous example of a theory is Peano arithmetic, but really the definition is much more general:

1. A dynamical system, given as a set of differential equations[3], is a theory.
2. A cellular automaton, given as a set of transition rules, is a theory.
3. Any recursively enumerable set of sentences of a formal language, given as a set of production rules, is a theory.
An abstract drawing of a cellular automaton. CGD generated.

Correspondingly, there are models that satisfy each of these theories:

1. A particular trajectory through state space, e.g. one specified through initial conditions.
2. A particular evolution of the cellular automaton, again specified through the initial conditions.
3. A particular Turing machine that implements the production rules, specified through... (you get the idea).
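
To make the theory/model distinction concrete, here is a minimal sketch of example 2. The particular rule (Rule 110 of the elementary cellular automata) and the fixed-zero boundary condition are my illustrative choices:

```python
# The "theory" is the set of transition rules of an elementary cellular
# automaton (here Rule 110); a "model" is one particular evolution that
# satisfies those rules, picked out by its initial condition.
RULE_110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def step(row, rules=RULE_110):
    """Apply the transition rules once (with fixed zero boundaries)."""
    padded = [0] + row + [0]
    return [rules[tuple(padded[i:i + 3])] for i in range(len(row))]

def evolve(initial, steps):
    """One model of the theory: the trajectory from a given initial condition."""
    history = [initial]
    for _ in range(steps):
        history.append(step(history[-1]))
    return history

# Different initial conditions give different models of the same theory.
model = evolve([0, 0, 0, 1, 0, 0, 0], steps=3)
```

The rule table is the theory (a finite set of "sentences"); `model` is one object satisfying it, and `evolve` with a different initial row would produce another.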

If we are allowed to be even more hand-wavy, then we can also incorporate models à la Tyler Cowen: To "model this [headline]" we have to come up with a theory (a set of sentences) from which the headline follows.

One important thing to note here is that every model "inherits" every property that follows from the theory. But the converse does not hold[4]: just because a model has a certain property, that property does not necessarily follow from the theory. In general, there will always be multiple models that satisfy a theory, each with different "additional properties" that go beyond what is prescribed by the theory[5].
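
A standard textbook illustration of this point (my addition, not from the post): take the theory T to be the group axioms,

```latex
T:\quad
\forall x,y,z\;\; (x \cdot y) \cdot z = x \cdot (y \cdot z), \qquad
\forall x\;\; x \cdot e = e \cdot x = x, \qquad
\forall x\,\exists y\;\; x \cdot y = y \cdot x = e.
```

Both (ℤ, +) and (ℤ/2ℤ, +) satisfy T, but the property "has infinitely many elements" holds only in the first model, so it cannot follow from T itself.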

Defining a model as an object satisfying a theory is broad enough to cover all the ways in which the term is used:

• the entire spectrum of mathematical models, from detailed biophysical to normative Bayesian, is specified by a set of equations (a theory) and instantiated with parameter choices.
• the "cartoon in Figure 8" is one particular (rather toothless) object that satisfies an implicit theory (consisting of a set of conjectured sentences).
• the albatross fashion model... doesn't fit. But you can't have everything, I'm told.

It also includes an interesting pathological case: to model a particular set of observations, we could just come up with a theory that contains all the observations as axioms, but no production rules. Then the observations themselves trivially satisfy the theory. This is clearly useless in some sense[6] (a dataset shouldn't be a model?) - but looking deeper into why it's useless reveals something about what constitutes a good model - or, by extension, a good theory.

Here is my definition:

A good model of a phenomenon is one that allows us to understand something about the phenomenon. If all the models of a theory are good models, the theory is a good theory.

Again, we need to define our terms for this to make sense. What is a phenomenon? A phenomenon is some (conjunction of) physical process(es). It's something out there in the territory. What does "understand" mean? Understanding a phenomenon means predicting (better than chance level) the state of the phenomenon at time t+1 given the state at time t.
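
A toy sketch of this prediction criterion (the noisy alternating process, the predictors, and all numbers are illustrative assumptions, not from the post):

```python
import random

# Toy "phenomenon": a binary state that flips at each step with probability 0.9.
def phenomenon(steps, flip_prob=0.9, seed=0):
    rng = random.Random(seed)
    states = [0]
    for _ in range(steps):
        flipped = rng.random() < flip_prob
        states.append(1 - states[-1] if flipped else states[-1])
    return states

# "Understanding" as prediction: score a predictor of state t+1 given state t.
def accuracy(predict, states):
    hits = sum(predict(states[t]) == states[t + 1] for t in range(len(states) - 1))
    return hits / (len(states) - 1)

states = phenomenon(1000)
guess_rng = random.Random(1)
chance = accuracy(lambda s: guess_rng.randint(0, 1), states)  # no understanding, ~0.5
informed = accuracy(lambda s: 1 - s, states)                  # partial understanding, ~0.9
```

The "informed" predictor encodes a partially correct theory of the phenomenon (the state tends to flip) and scores well above chance without being perfect, which is exactly the gradual notion of understanding set up below.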

Why does it make sense to set up things like this?

Models with benefits

First, it establishes a neat hierarchy. Understanding is gradual: It goes from non-existing (chance level) to poor (consistently above chance[7]) to great (almost perfect prediction) to complete (100% prediction accuracy).

With this definition, a "black box" deep learning model that is able to predict some fraction of brain activity does provide some understanding about a part of the brain. Similarly, a mean-field model that has "lost" some conversion factor in its units can also still be a good model, as long as it is able to get the direction of the evolution of the state correct.

Second, making predictions the central criterion for model quality helps us avoid unproductive disputes resulting from mismatched terms. The usual example here is "If a tree falls in the forest, does it make a sound?", which can lead to a highly unproductive discussion if asked at the lunch table. But when explanations are evaluated according to their predictive power, misunderstandings are resolved quickly: Either a tape recorder will or won't record airwaves. Either there is or there isn't activation in some auditory cortex.

Third, to have a good theory, you need to demonstrate that all its models are good (according to the definition above). This gets naturally easier if there are fewer models that satisfy the theory, thus incentivizing you to remove as many free parameters from the theory as possible[8]. Ideally, you'll want a unique characterization of a good model from your theory.

Finally, this definition formalizes the "all models are wrong, but some are useful" adage. To get 100% prediction accuracy for a physical process you have to go down to the level of particles. For example, having a fluid dynamics model of water motion will get you very far in terms of predictive power. In that sense, it's a very good model. But to get even close to 100%, you'll want an atomic model of water. And if you keep pushing for ever more predictive power, you'll have to decompose your problem further and further, and eventually you will get into very weird territory[9].

Thus, to determine whether a model is good or bad, you have to figure out which phenomenon it is trying to explain and then determine if the model allows you to predict the time-evolution of the phenomenon better than chance level. This is a relatively low bar, but in my experience, it's still not easy to clear. Actually demonstrating that your performance is different from chance requires explicit performance metrics, which are not usually adopted. But that's a different story.

Cliffhanger!

This is almost all I wanted to say on the topic. But I glossed over an important point in that exposition: If a model is a mathematical object, why might we expect that it can predict physical processes out there in the territory? In fact, why should there be any similarity between the solar system and the atom[10]? Why does analogical reasoning work?

I'm glad you ask. Stay tuned - I'll dig into that next time.

[1] Okay, okay, I can't lie to you. That friend is me. I'm worried about getting banished from the lunch table. ↩︎

[2] And it's usually up to an influential "ingroup" to decide what fits in and what doesn't. ↩︎

[3] Plus ZFC, I guess. ↩︎

[4] The converse only holds when the theory uniquely and completely specifies the model, which is pretty hard to achieve in principle. See Logical Pinpointing. ↩︎

[5] One might be tempted to argue that if many different models that all satisfy the same theory share a property, this is evidence that the property actually does follow from the theory. This isn't guaranteed, but it might work in some cases. In computational neuroscience, this is related to the practice of demonstrating that a desired result still holds when the parameters are slightly perturbed. ↩︎

[6] This has some overlap with Chomsky's levels of adequacy: a theory that includes only the observations as axioms has observational adequacy, but neither descriptive nor explanatory adequacy. ↩︎

[7] or below! If you're consistently worse than chance that is very useful information. ↩︎

[8] Thus we arrive at an interesting version of Occam's razor. ↩︎

[9] Let's not talk about quantum stuff on this Substack, okay? ↩︎

[10] Yes, I know that the Bohr model is not the end of the story. But it is still able to explain basically all of chemistry. And also "we don't talk about quantum physics on this Substack". ↩︎

Discuss