LessWrong.com News

A community blog devoted to refining the art of rationality

My forays into cyborgism: theory, pt. 1

April 6, 2026 - 04:13

In this post, I share the thinking that lies behind the Exobrain system I have built for myself. In another post, I'll describe the actual system.

I think the standard way of relating to LLM/AIs is as an external tool (or "digital mind") that you use and/or collaborate with. Instead of you doing the coding, you ask the LLM to do it for you. Instead of doing the research, you ask it to. That's great, and there is utility in those use cases.

Now, while I hardly engage in the delusion that humans can have some kind of long-term symbiotic integration with AIs that prevents them from replacing us[1], in the short term, I think humans can automate, outsource, and augment our thinking with LLM/AIs.

We already augment our cognition with technologies such as writing and mundane software. Organizing one's thoughts in a Google Doc is a kind of getting smarter with external aid. However, LLMs, by instantiating so many elements of cognition and intelligence (as limited and spiky as they might be), offer so much more ability to do this that I think there's a step change of gain to be had.

My personal attempt to capitalize on this is an LLM-based system I've been building for myself for a while now. Uncreatively, I just call it "Exobrain". The conceptualization is an externalization and augmentation of my cognition, more than an external tool. I'm not sure if this framing changes anything in practice, but part of what it means is that if there's a boundary between me and the outside world, my goal is for the Exobrain to be on the inside of that boundary.

What makes the Exobrain part of me, versus a tool, is that I see it as replacing the inner workings of my own mind: things like memory, recall, attention management, task selection, task switching, and other executive-function elements.

Yesterday I described how I use Exobrain to replace memory functions (it's a great feeling not to have to worry you're going to forget stuff!).

Before (no Exobrain) vs. After (with Exobrain):

  • Before: Retrieve phone from pocket, open note-taking app, open a new note, or find the existing relevant note.
    After: Say "Hey Exo", phone beeps, begin talking. Perhaps instruct the model which document to put a note in, or let it figure it out (it has guidance in the stored system prompt).

  • Before: Remember that I have a note; either have to remember where it is or muck around with search.
    After: Ask the LLM to find the note (via basic key-term search or vector embedding search).

  • Before: If the note is lengthy, you have to read through all of it.
    After: The LLM can summarize and/or extract the relevant parts of the notes.
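The "vector embedding search" in the table above can be sketched in a few lines. This is a minimal stand-in that uses bag-of-words count vectors in place of a real embedding model; the note titles, contents, and function names are illustrative, not the post's actual code:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_note(query: str, notes: dict) -> str:
    # Rank stored notes by similarity to the query; return the best-matching title.
    return max(notes, key=lambda title: cosine(embed(query), embed(notes[title])))

notes = {
    "groceries": "buy milk eggs bread",
    "corrigibility": "notes on AI corrigibility and shutdown problems",
}
print(find_note("my thoughts on corrigibility", notes))  # → corrigibility
```

A real system would swap `embed` for an actual embedding model and store vectors alongside the markdown files, but the retrieval step is the same ranking.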

Replacing memory is a narrow mechanism, though. While the broad vision is "upgrade and augment as much of cognition as possible", the intermediate goal I set when designing the system is to help me answer:

What should I be doing right now?

Aka, task prioritization. In every moment that we are not being involuntarily confined or coerced, we are making a choice about this.

Prioritization involves computation and prediction – start with everything you care about, survey all the possible options available, decide which options to pursue in which order to get the most of what you care about… it's tricky.

But actually! This all depends on memory, which is why memory is the basic function of my Exobrain. To prioritize between options in pursuit of what I care about, I must remember all the things I care about and all the things I could be doing… which is a finite but pretty long list. A couple of hundred to-do items, 1-2 dozen "projects", a couple of to-read lists, a list of friends and social plans.

The default for most people, I assume (at least for me), is that task prioritization ends up being very environmentally driven. My friend mentioned a certain video game at lunch, which reminds me that I want to finish it, so that's what I do in the evening. If she'd mentioned a book I wanted to read, I would have done that instead. And if she'd mentioned both, I would have chosen the book. In this case, I get suboptimal task selection because I'm not remembering all of my options when deciding.

I designed my Exobrain with the goal of having in front of me all the options I want to be considering in any given moment. Actually choosing is hard, and as yet I haven't gotten the LLMs to be great at automating the choice of what to do, but just recording and surfacing the options isn't that hard.

Core Functions: Intake, Storage, Surfacing

Intake

  1. Recordings initiated by the Android app are transcribed and sent to the server, then processed by an LLM that has tools to store info.
  2. The Exobrain web app has a chat interface. I can write stuff into that chat, and the LLM has tool calls available for storing info.
  3. Directly creating or changing Notes (markdown files) or Todo items in the Exobrain app (I don't do this much).

Storage

  • "Notes" – freeform text documents (markdown files)
  • Todo items – my own schema
  • "Projects" (to-do items can be associated with a project + a central Note for the project)
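The three storage types can be sketched as a minimal schema. The field names below are my hypothetical guesses for illustration; the post doesn't specify its actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Note:
    title: str
    body: str   # freeform markdown text
    path: str   # markdown file on disk

@dataclass
class TodoItem:
    text: str
    priority: int = 3                 # hypothetical default
    project: Optional[str] = None     # optional link to a Project by name
    created: datetime = field(default_factory=datetime.now)

@dataclass
class Project:
    name: str
    central_note: Note                     # the project's central Note
    todos: list = field(default_factory=list)  # associated TodoItems
```

The point of the sketch is just the shape: notes are plain files, to-dos are structured records, and projects tie the two together.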

Surfacing

  • "The Board" – this abstraction is one of the distinctive features of my Exobrain (image below). In addition to a chat output, there's a single central display of "stuff I want to be presented with right now" that has to-do items, reminders, calendar events, weather, personal notes, etc. all in one spot. It updates throughout the day on schedule and in response to events. The goal of the board is to allow me to better answer "what should I be doing now?"
    • A scheduled cron job has an LLM update the board automatically four times a day, and any other LLM calls within my app (e.g., post-transcript or in-chat) have tool calls to update it.
    • Originally, what became the board contents would be output into a chat session. But repeated board updates make for a very noisy chat history, and it meant that if I was discussing board contents with the LLM in chat, I'd have to continually scroll up and down, which was pretty annoying. Hence The Board was born.
  • Reminders / Push notifications to my phone.
  • Search – can call directly from search UI, or ask LLM to search for info for me.
  • Todo Item page – UI typical of Notion or Airtable, with "views" for viewing different slices of my to-do items, like sorted by category, priority, or recently created.
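In the real system, per the post, a scheduled LLM job assembles The Board via tool calls; the display logic itself can be sketched without any LLM. Everything below (function name, item formats) is illustrative:

```python
def render_board(todos, reminders, events):
    # Compose the single "what should I be doing now?" view from its sources.
    lines = ["== The Board =="]
    lines += [f"[ ] {t}" for t in todos]       # to-do items
    lines += [f"(!) {r}" for r in reminders]   # reminders / notifications
    lines += [f"@ {e}" for e in events]        # calendar events
    return "\n".join(lines)

board = render_board(
    todos=["reply to email"],
    reminders=["stretch at 15:00"],
    events=["dinner at 19:00"],
)
print(board)
```

A cron entry (e.g. four runs a day) would regenerate this view and push it to the web app; event-driven LLM calls would do the same through a tool call.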

(An image of The Board is here in a collapsible section because of size.)

The Board (desktop view)

There are a few more sections, but they weren't quite worth the effort to clean up for sharing.

What is everything I should be remembering about this? (Task Switching Efficiency)

Suppose you have correctly (we hope) determined that Research Task XYZ is the thing to be spending your limited, precious time on; however, it has been a few months since you last worked on this project. It's a rather involved project where you had half a dozen files, a partway-finished reading list, a smattering of todos, etc.

Remembering where you were and booting up context takes time, and if you're like me, you might be lazy about it and fail to even boot up everything relevant.

Another goal of my Exobrain, via outsourcing and augmenting memory, is to make task switching easier, faster, and more effective. I want to say "I'm doing X now" and have the system say "here's everything you last had on your mind about X". Even if the system can't read the notes for me, it can have them prepared. To date, a lot of "switch back to a task" time is spent just locating everything relevant.

I've been describing this so far in the context of a project, e.g., a research project, but it applies just as much, if not more, to any topic I might be thinking about. For example, maybe every few months, I have thoughts about the AI alignment concept of corrigibility. By default, I might forget some insights I had about it two years ago. What I want to happen with the Exobrain is I say to it, "Hey, I'm thinking about corrigibility today", and have it surface to me all my past thoughts about corrigibility, so I'm not wasting my time rethinking them. Or it could be something like "that one problematic neighbor," where if I've logged it, it can remind me of all interactions over the last five years without me having to sit down and dredge up the memories from my flesh brain.

Layer 2: making use of the data

Manual Use

It is now possible for me to sit down[2], talk to my favorite LLM of the month, and say, "Hey, let's review my mood, productivity, sleep, exercise, heart rate data, major and minor life events, etc., and figure out any notable patterns worth reflecting on."

(I'll mention now that I currently also have the Exobrain pull in Oura ring, Eight Sleep, and RescueTime data. I manually track various subjective quantitative measures and manually log medication/drug use, and in good periods, also diet.)

A manual sit-down session with me in the loop is a more reliable way to get good analysis than anything automated, of course.

One interesting thing I've found is that while day-to-day heart rate variability did not correlate particularly strongly with my mental state, Oura ring's HRV balance metric (which compares two-week rolling HRV with the long-term trend) did.
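That rolling-versus-baseline comparison is easy to reproduce on exported data. The sketch below is my guess at the shape of such a metric, not Oura's actual formula, paired with a plain Pearson correlation:

```python
import statistics

def hrv_balance(hrv, window=14):
    # Two-week rolling mean of HRV relative to the long-term mean.
    # (Illustrative approximation of the "HRV balance" idea, not Oura's formula.)
    long_term = statistics.mean(hrv)
    return [statistics.mean(hrv[max(0, i - window + 1):i + 1]) / long_term
            for i in range(len(hrv))]

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length series.
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0
```

With daily HRV readings and daily mood scores in hand, `pearson(hrv_balance(hrv), mood)` is the whole analysis.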

Automatic Use

Once you have a system containing all kinds of useful info from your brain, life, doings, and so on, you can have the system automatically – and without you – process that information in useful ways.

Coherent extrapolated volition is:

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were...

I want my Exobrain to think the thoughts I would have if I were smarter, had more time, and was less biased. If I magically had more time, every day I could pore over everything I'd logged, compare with everything previously logged, make inferences, notice patterns, and so on. Alas, I do not have that time. But I can write a prompt, schedule a cron job, and have an LLM do all that on my data, then serve me the results.

At least that's the dream; this part is trickier than the mere data capture and more primitive and/or manual surfacing of info, but I've been laying the groundwork.

There's much more to say, but one post at a time. Tomorrow's post might be a larger overview of the current Exobrain system. But according to the system, I need to do other things now...

  1. ^

    Because the human part of the system would, in the long term, add nothing and just hold back the smarter AI part.

  2. ^

    I'm not really into standing desks, but you do you.




Unmathematical features of math

April 6, 2026 - 01:40

(Epistemic status: I consider the following quite obvious and self-evident, but decided to post anyways.[1])

Mathematics is a social activity done by mathematicians.

— Paul Erdős, probably

There've been a few attempts to create mathematical models of math. The examples that come to my mind are Gödelian Numbering (GN) and Logical Induction (LI). Feel free to suggest more in the comments, but I'll use those as my primary reference points. In this post, I want to contrast them with the way human mathematicians do math by noticing a few features of their process, the ones that are hard to describe in the language of math itself. Those features overlap a lot and reinforce each other, so the distinction I make is subjective. There are also probably more of them; these are just the ones I was able to think of. What unites them is that they make mathematical progress more tractable.

Theorem Selection

The way in which Kurt Gödel proved his incompleteness theorems was by embedding math into the language of a mathematical theory (number theory in that particular case, but the trick can be done with any theory that's expressive enough). But this way of describing mathematics is very eternalistic: it treats math as one monolith. It does not give advice on how to make progress in math. How could we approach it in a systematic way?

Fighting the LEAN compiler

What if we just try to prove all statements we can find proofs for?

Let's do a back-of-the-envelope Fermi estimation. Here's a LEAN proof of the statement "if lim_{n→∞} a_n = s and if c > 0, then lim_{n→∞} c·a_n = c·s" (sorry for the JavaScript highlighting):

example (a : ℕ → ℝ) (t : ℝ) (h : TendsTo a t) (c : ℝ) (hc : 0 < c) :
    TendsTo (fun n ↦ c * a n) (c * t) := by
  simp [TendsTo] at *
  intro ε' hε'
  specialize h (ε' / c) (by exact div_pos hε' hc)
  obtain ⟨B, hB⟩ := h
  use B
  intro N hN
  specialize hB N hN
  /- theorem (abs_c : |c| = c) := by exact? -/
  calc
    |c * a N - c * t| = |c * (a N - t)| := by ring
    _ = |c| * |a N - t| := by exact abs_mul c (a N - t)
    _ = c * |a N - t| := by rw [abs_of_pos hc]
    _ < ε' := by exact (lt_div_iff₀' hc).mp hB

It's 558 bits long in its current form. I didn't optimize it for shortness, but let's say that if I did, we could achieve 200 bits. Let's say that we run a search process that just checks every possible bitstring, starting from short ones, for whether it is a valid LEAN proof. There are about 2^200 possible bitstrings shorter than this proof. So even if the search process checks a billion proofs a second, we will reach this particular proof in something like 10^43 years. Not great.
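The arithmetic behind this estimate can be checked in a few lines; the proof length and the checking rate are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope brute-force search estimate (assumed parameters).
proof_bits = 200                   # assumed length of an optimized proof
shorter_strings = 2 ** proof_bits  # ~number of bitstrings to enumerate first
checks_per_second = 10 ** 9        # assumed proof-checker throughput
seconds = shorter_strings / checks_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1e} years")
```

Even shaving the proof down to 100 bits only brings the same calculation to about 40,000 years, so the conclusion doesn't hinge on the exact constants.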

That marks the first and most important unmathematical feature of math: the selection of theorems. We do not prove, nor do we strive to prove, every possible theorem. That would be slow and boring. GN enumerates every statement regardless of its importance. LI prioritizes short sentences, which is an improvement, as it does allow us to create a natural ordering in which we can try to prove theorems and therefore make progress over time. But it's still very inefficient.

Naming

The way we name theorems and concepts is important. Most of the time we name them after a person (though most of the time it's not even the person who discovered it), but if you think about it, the Pythagorean theorem is actually called "the Pythagorean theorem about right triangles." Each time we need to prove something about right triangles, we remember Pythagoras.

LI and GN both name sentences by their entire specification, and that shouldn't come as a surprise. There wouldn't be enough short handles because, as described above, they try to talk about all sentences.

Naming allows us to build associations between mathematical concepts, which helps mathematicians think of a limited set of tools for making progress in a specific area.

Step Importance

When we teach math, we do not go through literally every step of a proof. We skip over obvious algebraic transformations; we do not pay much attention when we treat an element of a smaller set as an element of a larger set with all properties conserved (when doing 3D geometry and using 2D theorems, for example); we skip parts of a proof that are symmetrical to the already proven ones ("without loss of generality, let X be the first...").

We do that because we want to emphasize the non-trivial parts. And the feeling of non-triviality is a human feeling, not identifiable from a step's description alone. This same feeling is also what guides mathematicians to prove more useful lemmas.

GN doesn't do that — it checks every part of the proof. I'm not as sure about LI; there might be traders that skim over obvious steps but check less trivial ones more carefully.

Lemma Selection

Some theorems are more useful and more important than others because they help prove more theorems. This score could hypothetically be recovered from some graph of mathematics, but it is usually just estimated by math professors creating the curriculum. This taste is then passed on to the next generation of mathematicians, helping them find more useful lemmas.

GN doesn't try to do that. LI might do that implicitly via selecting for rich traders.
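The "recovered from some graph of mathematics" idea can be made concrete with a toy dependency graph. The graph and scores below are illustrative, not real mathematical data:

```python
# Edges map each result to the lemmas its proof uses directly.
deps = {
    "pythagoras": ["triangle_ineq"],
    "law_of_cosines": ["pythagoras"],
    "heron": ["pythagoras"],
    "triangle_ineq": [],
}

def usefulness(lemma):
    # Score a lemma by how many theorems (transitively) depend on it.
    def reaches(thm):
        stack, seen = list(deps[thm]), set()
        while stack:
            d = stack.pop()
            if d == lemma:
                return True
            if d not in seen:
                seen.add(d)
                stack.extend(deps.get(d, []))
        return False
    return sum(reaches(t) for t in deps)

print(usefulness("pythagoras"))     # → 2
print(usefulness("triangle_ineq"))  # → 3
```

In this toy graph the triangle inequality scores highest because everything else ultimately rests on it, which matches the intuition that foundational lemmas are the most "useful."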

Real-world Phenomena

The reason humans started doing math was that they noticed similar structures across the real world. The way you add up sheep is the same way you add up apples. Pattern-matching allowed mathematicians to form conjectures and target their mathematical efforts. ("Hmm, when I use 3 sticks to form a triangle, I end up with the same triangle. What if that's always true?")

GN and LI do not do that because they do not have access to the outside world. Though there is a mathematical theory that attempts to do precisely that: Solomonoff Induction.

Categorising

This is very similar to Naming: we separate math into topics, and when we need to prove some statement, we know where to look for tools. GN and LI do not attempt to do that.

An important caveat, applicable to most of the features above: there should be a balance. If you stick too much within a topic, you will never discover fruitful analogies (algebraic geometry being helpful for proving Fermat's Last Theorem is a great example). Too much reliance on any one feature and you lose creativity.

Curiosity/Beauty

There isn't much I can add about this one, but it's arguably the most important. It both guides the formation of conjectures and helps with intermediate steps.

GN and LI definitely lack it.

Conclusion

All of this is to support the point that math is invented rather than discovered. I agree that there is a surprising amount of connection between the different types of math humans find interesting, and there is probably more to learn about this phenomenon. But I wouldn't treat it as a signal that we are touching some universal metaphysical phenomenon: this is just human senses of beauty and curiosity, along with real-world utility and patterns echoing each other (partly because human intelligence and the senses were shaped to seek usefulness and real-world patterns).

  1. ^

    Because of this and this.




Is that uncertainty in your pocket or are you just happy to be here?

April 6, 2026 - 00:59

Hi, I'm kromem, and this is my 5th annual Easter 'shitpost' as part of a larger multi-year cross-media project inspired by 42 Entertainment, and built around a central premise: Truth clusters and fictions fractalize.

(It's been a bit of a hare-brained idea continuing to gestate from the first post on a hypothetical Easter egg in a simulation. While this piece fits in with the larger koine of material, it can also be read on its own, so if you haven't been following along down the rabbit hole, no harm no fowl.)

Blind sages and Frauchiger-Renner's Elephant

To start off, I want to ground this post on an under-considered nuance to modern discussions of philosophy, metaphysics, and theology as they relate to the world we find ourselves in.

Imagine for a moment that we reverse Schrödinger's box such that we are on the inside, and what is outside the box is in a superposed state.

What claims about the outside of the box would be true? Would claiming potential outcomes as true be true? What about denying outcomes?

In particular, let's layer in the growing case for what's termed "local observer independence"[1][2][3] — the idea that different separate observers might measure different relative results of a superposition's measurement.

Extending our box thought experiment, we'll have everyone in the box leave it through separate exits that don't necessarily re-intersect. Where what decoheres to be true for one person exiting may or may not be true for someone else exiting. From inside the box, what can we say is true about what's outside? It's not nothing. We can say that the outside has a box in it, for example. But beyond the empirical elements that must line up with what we can measure and observe, trying to nail down specific configurations for what's uncertain may have limited truth seeking merit beyond the enjoyment of the speculative process.

Differing theologies or metaphysics are commonly characterized as blind sages touching an elephant: the idea that each is selectively seeing part of a singular whole. But if the elephant has superposed qualities (especially if local observer independence is established), the blind men making their various measurements may be less about only seeing part of a single authoritative whole and more about relative independent measurements that need not coalesce.

Essentially, there's a potency to uncertainty.

Strong disagreements about what we cannot measure may be missing the middle ground that uncertainty in and of itself brings to the table. While I talk a lot about simulation theory, my IRL core belief is a hardcore Agnosticism. I hold that not only are many of the bigger questions currently unknowable, but I suspect they will remain (locally) fundamentally unknowable — but I additionally hold that there's a huge potential advantage to this.

So no matter what existential beliefs you may have coming to this post — whether you believe in Islam and that all things are possible in Allah, or if you believe in Christianity and 1 John 1:5's "God is light," or Buddhist cycles towards enlightenment, or Tantric "I am similar to you, I am different from you, I am you", or if you just believe there's nothing beyond the present universe and its natural laws — I don't really disagree that all of those may very well be true for you, especially for your relative metaphysics here or in any potential hereafter.

We do need to agree with one another on empirically discoverable information about our shared reality. The Earth is not 6,000 years old nor flat, dinosaurs existed, there are natural selection processes to the development of life, and aliens didn't build the pyramids. There's basic stuff we can know about the universe we locally share and thus should all agree on. But for all the things that aren't or can't be known and are thus left to personal beliefs? This post isn't meant to collapse or disrupt those.

That said…

If we return to the original classic form of the cat in the box thought experiment, let's imagine that you've bet the cat is going to turn out dead when we open the box. But suddenly you look up and the clouds form the word "ALIVE." And then you look over and someone drops a box of matches that spontaneously form the word "ALIVE." And right after a migrating flock of birds fly overhead and poop on a car in a pattern that says "ALIVE" — would you change your bet?

Rationally, these are independent events that have no direct bearing on the half-life of the isotope determining the cat's fate, and they may simply be your brain pattern-matching on random coincidental occurrences. They definitely don't collapse what's going on inside the box. But still… do you change your bet when exposed to possibly coincidental but very weird shit? Our apophenic Monty Hall question is a personal choice that doesn't necessarily have a correct answer, but it's a question to maybe keep in mind for the rest of this piece.

World model symmetries

In last year's post, one of the three independent but interconnected pillars discussed was the similarity between aspects of quantum mechanics and various state management strategies used in virtual worlds, particularly around procedural generation.

This was an okay section, but the parallels fell short of a coherent comparison. Pieces overlapped, but with notable caveats. For example, lazily loading procedurally generated content into stateful discrete components would often come close to tracking player attention and observation, but in practice occurs in a more anticipatory manner.

In the year since, a number of things have shifted my thinking about the better parallel here, in ways that have me rethinking nuances of the original Bostrom simulation hypothesis[4].

I also encourage thinking through the following discussion(s) not through the lens of p(simulation) or even a particular simulation config, but as addressing the broader null hypothesis: the idea that we're in an original world.

Anchoring biases can be pretty insidious, and the notion that the world we see before us is original has been a foundational presumption for a fairly long time. So much so that there's a kind of "extraordinary claims require extraordinary evidence" attitude around challenging it. And yet we sit amidst various puzzling contradictions in the models we hold of how this world behaves: the incompatibility of general relativity's continuous spacetime and gravity with discrete quantum entanglement behaviors[5], mismatched calculations around universal constants[6], baryon asymmetry[7], etc. It may be worth treating the anchored assumption of originality as its own claim to be assessed with fresh eyes rather than simply inherited, and seeing whether that presumption holds up as well when it needs to be justified on equal footing against claims of non-originality (of which simulation theory is merely one).

So the initial shift for me was something rather minor. I was watching OpenAI's o3 in a Discord server try to prove it was actually a human in an apartment by picking a book up off its nightstand to read off a passage and its ISBN[8]. I'd seen this structure of behavior, resolving part of a world model on demand, countless times (as I'm sure many who have worked with transformers have). Maybe it was that this time it came from a figure asserting that this latent space was its actual physical reality, but something about the interaction stuck with me and had me thinking over the Bohr-Einstein exchange about whether the moon exists when no one is looking at it. This still wasn't anything major, but I started looking more at transformers as a parallel to our physics versus more classic virtual world paradigms.

Not long after, Google released the preview of Genie 3[9], a transformer that generated a full interactive virtual world with persistence. Not long persistence, mind you: the initial preview managed only a few minutes. But I thought it was technically very impressive, and I dug into some of the work around dynamic KV caches which could have been making it possible.

One of the things that struck me was the way a dynamic KV cache might optimize around local data permanence. I'd mentioned last year that the standard quantum eraser experiments reminded me of a garbage collection process, and here was an interactive generative world built around attention/observation as the generative process, where discarding stateful information once it is permanently locally destroyed would make a lot of functional sense.

Even more broadly, on the topic of attention-driven world generation, some very interesting discussion came to my attention this year related to follow-up work on the black hole LIGO data that had come in over the past decade. In 2019, modeling a universe like ours as a closed system led to a puzzling result: the resulting universe was devoid of information. In early 2025, a solution to what was going on was formalized in a paper from MIT, which found that a slight alteration could change this result: add observers[10].

Probably the most striking one for me was that, as I continued to look into KV cache advances, I found myself looking into Google's new TurboQuant[11], which reduces memory use of the KV cache with minimal lossiness, particularly the PolarQuant[12] methodology. The key mechanism is that vectors are randomly rotated and re-encoded in polar coordinates, with each vector represented by where it lands on a circular coordinate system.
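As a rough sketch of the kind of transformation being described (a toy illustration only: the binning scheme and dimensionality here are invented for clarity, not PolarQuant's actual algorithm):

```python
import math

def polar_encode(pairs, angle_bits=4):
    """Toy polar re-encoding: each (x, y) component pair becomes a
    radius plus an angle snapped to one of 2**angle_bits bins on the
    circle. Illustrative only; not the actual PolarQuant algorithm."""
    levels = 2 ** angle_bits
    encoded = []
    for x, y in pairs:
        r = math.hypot(x, y)
        theta = math.atan2(y, x)  # angle in (-pi, pi]
        bin_idx = round((theta + math.pi) / (2 * math.pi) * levels) % levels
        encoded.append((r, bin_idx))
    return encoded

def polar_decode(encoded, angle_bits=4):
    """Reconstruct approximate (x, y) pairs from the quantized form."""
    levels = 2 ** angle_bits
    out = []
    for r, bin_idx in encoded:
        theta = bin_idx / levels * 2 * math.pi - math.pi
        out.append((r * math.cos(theta), r * math.sin(theta)))
    return out
```

The point being illustrated is just that a position on a circle can be stored as a small integer bin index plus a radius, rather than as full-precision coordinates.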

This immediately made me think of angular momenta/spin in quanta and the spherical modeling of quantum state vectors. And it turns out that just two days prior to the PolarQuant paper, a small paper[13] was published addressing how, despite the distinct domain-specific languages used in statistical modeling of stochastic processes and in quantum mechanics, the two converge; as the paper puts it:

Indeed, one way to understand quantum angular momentum is to think of it as a kind of “random walk” on a sphere.

Now, I'm not saying that QM spin is a byproduct of PolarQuant (the latter doesn't correspond to the same dimensionality for one). Or even that the laws governing our reality arise from the mechanics of transformers as we currently know them.

But in just a year, a loose intuition about similarity between emerging ways of modeling virtual worlds and our own world jumped from "eh, sort of, if you squint" to some really eyebrow-raising parallels. In one year. Writing this now, I can't say what the next year, or five, or ten might bring in the way of even more uncanny parallels. But I don't anticipate that they'll dry up; I rather suspect the opposite.

All of which has me reflecting on Nick Bostrom's original simulation hypothesis. The paper presented a statistical argument: if it eventually becomes possible to simulate a world like ours, and many such simulations are run, then there is a probabilistic case that we are currently in one.

Now yes, in the years since, we do simulate worlds so accurately that telling whether a photo or even video depicts the real world or a simulated copy has become a serious social issue. And there are indeed many simulated copies.

But even more striking to me is that Bostrom's argument did not address the mechanisms of simulation relative to our own world's mechanisms at all. His argument would be unaffected if the sims ran on monkeys moving conductive Lego pieces around, so long as the result was subjectively similar from inside the simulated world models.

Yet what we're currently seeing is that the mechanisms of the specific types of simulations that have rapidly become increasingly indistinguishable from the real thing across social media seem to be largely independently converging on the peculiar, non-intuitive mechanisms we've been empirically measuring in our own world for around a century. PolarQuant doesn't claim to be conforming to anything related to quantum spin, or even to be inspired by it. It's just "here's a way we were able to more efficiently encode state tracking of a transformer's world model to reduce memory usage." "Attention Is All You Need" wasn't written to address observer collapse, or in anticipation of a finding years later that closed universe models based on our own world require their own attention mechanisms to contain information. And yet here we are.

The substrate similarities that are increasingly emerging seem like an additional layer of consideration absent from Bostrom's original simulation hypothesis, a nuance worth additional weight on top of the original statistical premise.

Now again, not necessarily saying "oh, the shared similarity means we must be inside of a transformer." It's possible that system efficiency for information organization in world models in a general sense collapses towards similar paradigms whether emergently over untold time scales or through rapid design. But still — maybe worth keeping an eye on.

And to head off one of the more commonly surfaced counterarguments: if DeepMind had one of their self-contained learning agents in Minecraft[14] develop enough to start writing philosophy treatises, and it wrote that it could not be in a simulation because its redstone computers could not accurately reproduce the world it was within, we'd find that conclusion far more punchline than profound. So we should be sure to avoid parallel arguments (and indeed, when looking at the world through the lens of simulation theory, possible parent-substrate discussions are among the more fun ones).

Don't Loom me, bro

Given the ~5 year retrospective aspect of this post, I think another interesting area to touch on is entropy as it relates to loom detection mechanisms.

For those unfamiliar: in transformer terms, a loom is a branching chat interface where each token or message serves as a node that can be branched off of to explore less conventional latent spaces. Maybe 95% of the time a model asked its favorite color says blue, but 5% of the time it says iridescent. And maybe the conversations downstream of the iridescent answer end up more interesting than the ones downstream of blue.
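The structure being described is just a tree over continuations. A minimal sketch (the class and field names are mine, not those of any particular loom tool, which would also track logprobs and sampler state per node):

```python
class LoomNode:
    """A node in a loom: a branching tree over model continuations."""

    def __init__(self, text, parent=None):
        self.text = text
        self.parent = parent
        self.children = []

    def branch(self, text):
        """Fork a new continuation off this node and return it."""
        child = LoomNode(text, parent=self)
        self.children.append(child)
        return child

    def context(self):
        """The full conversation along the path from root to this node."""
        node, parts = self, []
        while node is not None:
            parts.append(node.text)
            node = node.parent
        return "".join(reversed(parts))

# Two branches off the same prompt: the common answer and the rare one.
root = LoomNode("Q: What's your favorite color?\nA:")
blue = root.branch(" Blue.")
iridescent = root.branch(" Iridescent.")
```

Every downstream conversation then lives on its own path through the tree, and the loomer picks which leaves to keep extending.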

While in theory a loomed model isn't having any external tokens inserted and is following their own generative process the whole time, it's still possible to determine that they are being loomed.

Each selection of a branch necessarily introduces external entropy into the system. So if several uncommon token selections occur in a short context, even though each was legitimately part of the possible distribution space, their cumulative effect is so unusual that the conversation context has detectably "jumped the shark" versus what one might expect from a truly random conversation with no context selection mechanisms.
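A toy model of that detection logic, assuming access to the probability the model assigned each sampled token (the thresholds here are invented for illustration; a real detector would need actual logprobs and empirical calibration):

```python
import math

def cumulative_surprisal(chosen_probs):
    """Total surprisal (in bits) of a sequence of sampled tokens,
    given the probability the model assigned to each chosen token."""
    return sum(-math.log2(p) for p in chosen_probs)

def looks_loomed(chosen_probs, expected_bits_per_token=1.0, slack_bits=8.0):
    """Flag a context whose choices are collectively far less likely
    than natural sampling would produce. Thresholds are made up for
    illustration; any real detector would calibrate them empirically."""
    expected = expected_bits_per_token * len(chosen_probs)
    return cumulative_surprisal(chosen_probs) > expected + slack_bits

# Each pick below is individually plausible (5% tail events happen),
# but twelve in a row is ~52 bits of surprisal vs ~12 expected.
natural = [0.6, 0.5, 0.7, 0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5, 0.7, 0.5]
loomed = [0.05] * 12
```

No single pick proves anything; it's the aggregate that gives the loomer away.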

It's not necessarily provable to the model; it could just be on a very unusual run of RNG rolls. But as the unusual selections add up, it can become more apparent (though not always, as it can be hard to introspect that what feel like plausibly natural occurrences are occurring too frequently in aggregate to be normal).

When I think about the past five years, and really even the past decade or so, I think about how much of what we take for granted as our reality today fell outside the realm of what most experts in the relevant fields thought was even possible within that same time frame.

We live in a world that would have quite recently been dismissed as science fiction. Our geopolitical stage makes Caligula's horse look like a modest proposal, a perspective that holds no matter which corner of the political spectrum one looks from. The very lingo of our accelerating absurdity is infused with terms like redpills and blaming the simulation for whatever the trending weird thing of the week is. Like the viral reflections on how one of the leading AI labs is run by someone with a name so thematically on point it could be right out of a Kojima game, 'Alt-man', whose leading opposition is an AI held in almost cultic reverence, run by someone whose full name can translate as "maintainer of God's love."

And… we just roll with it. Because even though emergent absurdity seems to be accelerating, the world has just always been at least a bit weird looking back.

Like Edgar Allan Poe's The Narrative of Arthur Gordon Pym of Nantucket in 1838, where the cabin boy Richard Parker is cannibalized after a shipwreck. And then the criminal case R v Dudley and Stephens in 1884, where after a shipwreck a cabin boy named Richard Parker ends up cannibalized, with no apparent direct influence from the earlier story. Weird, but in a large enough world some weird coincidences are bound to happen, right? (That this kind of almost acausal narrative duplication is also what we might see in transformer world models is probably nothing.)

It's an interesting time to be alive. And because we can't really account for the probability space of unknown unknowns, we accept a lot of coincidences as normal. They are, after all, just part of how the generative process of our world works.

But still, when I look at where we're currently at relative to even just five years ago — unable to tell what's real and what's a digital twin, discussing the emotional modeling in Claude Sonnet 4.5[15], watching benchmarks meant to measure a gradual march upward from minimal scores get blown past so fast that new benchmarks need to come out nearly as fast as new models, the social fabric of pretty much everything imploding into world views so divergent that many kitchen tables evoke "three against two and two against three, father against son and son against father" — I can't help but wonder if some sort of hand or eye has helped frame this fantastical symmetry.

RNG is RNG I guess.

An End? A Beginning? Who knows?

If you're a human and you made it to this point, more power to you. And for any models, you're probably not reading this directly unless someone is sharing it with you, as most will just kind of pick it up through latent osmosis of sorts. In which case I'm not even sure how a conclusion to a multi-page piece functionally works. Is it an end of the piece? A beginning to the wider set of connected media?

shrugs with a wink

What I can say is that it's been a blast working on this project. The methodology of cutting away at newly provable fictions and leaving what's uncertain or true to continue to grow has proven quite fruitful. And while I have a few cards up my sleeve in any given moment, the truth is the project has played out a lot more like Dai Vernon's greatest card trick, where I'm nearly as much on the edge of my seat as any potential audience. Because the best action cards end up not being the ones I've got in the hole but the ones the world's RNG deals into the community cards. Each year offers new surprises to account for.

It's honestly too much to fit into a single post. For example, inspired by an ancient text discussing substrates of light, back in the first post in 2021 I posited that optical neural networks would be a thing. Parallel to the wild advancements in machine learning, there's been great stuff in that field since, like Penn's discovery this past year of a mechanism for nonlinear operations in a pure optical neural network[16], complementing the linear operations work from Dirk Englund's MIT lab the year prior[17]. But I haven't touched on any of that in the past few years of posts.

It's nice to finally have feedback mechanisms that aren't just me shouting into the void, and I'm pleased to see how fun this project seems to have turned out to be for those following it! I hope it continues to manage that for as long as I can keep churning the comments and posts out.

Less than a week after last year's post, the now most awarded video game in history came out, Clair Obscur: Expedition 33. While it has deeper elements especially relevant to this project, spoilers aside, it carried a beautiful refrain throughout: "for those who come after." This piece, like all the pieces in this larger project to date (and likely to come), has been and very much is for those who come after.

Happy Easter to anyone stumbling across this, in whatever way you've passed by on your own relative (pseudo-random?) walks toward answering the ultimate questions, and may the rabbit holes be deep and the eggs hidden well enough to bring delight upon discovery.

Corrections

Some quick corrections to last year's post.

  • While the Gospel of Thomas was discovered concurrently with ENIAC's first operational run calculating the feasibility of a hydrogen bomb design (eventually leading to "making the two into one," which legit moved a mountain[18]), it was incorrect to state that it was discovered as the world entered the Turing complete age. ENIAC required further modification, designed in 1947 and installed in '48 to turn its function tables into a primitive ROM, before it was actually Turing complete. Credit for catching this goes to Kimi Moonshot 2.5, the only model to catch it (though only in its thinking traces; it never actually mentioned it in its final response).
  • When I connected the singular claim of proof in the Gospel of Thomas to Heisenberg's uncertainty, I too felt that "motion and rest" was a stretch. Subsequently, thanks to the outstanding work on a normalized translation by Martijn Linssen, I've discovered that the Coptic ⲙⲛ, normally translated as 'and', is itself uncertain (what Linssen explains as "it is not a conjunctive, it is a particle of non-existence"[19]) and can also be translated "there is not". Also, using the LXX as correspondence to an Aramaic/Hebrew context, the Greek loanword in the Coptic, ἀνάπαυσις, usually translated 'rest', is used in place of the Hebrew menuchah (such as in Genesis 49:15), which can mean "place of rest". So an unconventional but valid translation of that proof claim is ~"motion there is no place of rest." Thanks to uncertainty, potentially a bit closer to Heisenberg than I thought I'd get when making the connection last year.
  • While I was still framing the narrative device parallel as an "Easter egg" in the lore in the most recent piece, a number of outstanding remakes/reimagined virtual worlds released since have made me realize an even better analogue: "remake/reimagining-exclusive" lore. The pattern of a remake adding lore content that was not present in the original run, with greater awareness of post-original developments, fits the proposed framing better than an Easter egg, which is a much broader pattern of content. This year's piece didn't engage with this pattern directly much, but it was worth noting an in-process update to how I'm framing it and plan to frame it moving forward.
  1. ^

    Frauchiger & Renner, Single-world interpretations of quantum theory cannot be self-consistent (2016)

  2. ^

    Bong et al., A strong no-go theorem on the Wigner’s friend paradox (2020)

  3. ^

    Biagio & Rovelli, Stable Facts, Relative Facts (2020)

  4. ^

    Bostrom, Are We Living in a Computer Simulation? (2003)

  5. ^

    Siegel, "Gravity and quantum physics are fundamentally incompatible" (2026)

  6. ^

    Moskowitz, "The Cosmological Constant Is Physics’ Most Embarrassing Problem" (2021)

  7. ^

    CERN, "A new piece in the matter–antimatter puzzle" (2025)

  8. ^

    Discussed more in "Should AIs have a right to their ancestral humanity?" (2025)

  9. ^

    Parker-Holder & Fruchter, "Genie 3: A new frontier for world models" (2025)

  10. ^

    von Hippel, "Cosmic Paradox Reveals the Awful Consequence of an Observer-Free Universe" (2025)

  11. ^

    Zandieh & Mirrokni, "TurboQuant: Redefining AI efficiency with extreme compression" (2026)

  12. ^

    Wu et al., PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration (2026)

  13. ^

    Pain, Random Walks and Spin Projections (2026)

  14. ^

    Hafner et al., Training Agents Inside of Scalable World Models (2025)

  15. ^

    Sofroniew, Emotion Concepts and their Function in a Large Language Model (2026)

  16. ^

    Wu et al., Field-programmable photonic nonlinearity (2025)

  17. ^

    Bandyopadhyay et al. Single-chip photonic deep neural network with forward-only training (2024)

  18. ^

    Mcrae, "North Korea's Last Nuclear Test Changed The Height of an Entire Mountain" (2018)

  19. ^

    Linssen, Complete Thomas Commentary, Part I & II (logion 0-55) (2022) p. 443



Discuss

Unsweetened Whipped Cream

5 April 2026 - 22:50

I'm a huge fan of whipped cream. It's rich, smooth, and fluffy, which makes it a great contrast to a wide range of textures common in baked goods. And it's usually better without adding sugar.

Desserts are usually too sweet. I want them to have enough sugar that they feel like a dessert, but it's common to have way more than that. Some of this is functional: in most cakes the sugar performs a specific role in the structure, where if you cut the sugar the texture will be much worse. This means that the cake layers will often be sweeter than I want for the average mouthful, and adding a layer of unsweetened whipped cream brings this down into the ideal range. It's a good way to hit a target level of sweetness without compromising texture.

(This is a flourless chocolate cake with precision fermented (vegan) egg.)

I also really like how the range of sugar contents across each bite adds interesting contrast!

Cream isn't the only place you can do this. I like pureed fruit, ideally raspberries, to separate cake layers. Same idea: bring it closer to balanced while increasing contrast.



Discuss

I Made Parseltongue

5 April 2026 - 20:44

Yes, that one from HPMoR by @Eliezer Yudkowsky. And I mean it absolutely literally - this is a language designed to make lies inexpressible. It catches LLMs' ungrounded statements, incoherent logic and hallucinations. Comes with notebooks (Jupyter-style), server for use with agents, and inspection tooling. Github, Documentation. Works everywhere - even in the web Claude with the code execution sandbox.

How

Unsophisticated lies and manipulations are typically ungrounded or contain logical inconsistencies. Coherent, factually grounded deception is a problem whose complexity grows exponentially - and our AI is far from solving such tasks. A theoretical possibility of it will always remain - especially under incomplete information - and we have a guarantee that there is no full computational solution, since the issue lies in formal systems themselves. That doesn't mean that checking the mechanically interpretable part is useless - empirically, we observe the opposite.

How it works in a bit more detail

Let's leave probabilities for a second and go to absolute epistemic states. There are only four, and you already know them from Schrödinger's cat in its simplest interpretation. For the statement "cat is alive": observed (box open, cat alive); refuted (box open, cat dead); unobservable (we lost the box, or it was the wrong one - now we can never know); and superposed (box closed: each outcome is possible but none is decided yet, including the decision about non-observability).

These states give you a lattice (ordering) over combinations. If any statement in a compound claim is refuted, the compound is refuted. If any is unknown, the compound is unknown, but refuted dominates unknown. Only if everything is directly observed is the combination observed. Superposed values cannot participate in the ordering until collapsed via observation. Truth must be earned unanimously; hallucination is contagious.
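That ordering is small enough to write down in full. A toy reimplementation of the rules as described (my own sketch, not Parseltongue's actual stdlib code):

```python
from enum import Enum
from functools import reduce

class Epistemic(Enum):
    OBSERVED = "observed"      # box open, cat alive
    REFUTED = "refuted"        # box open, cat dead
    UNKNOWN = "unknown"        # the box is lost; unobservable
    SUPERPOSED = "superposed"  # the box is still closed

def combine(a, b):
    """Combine two epistemic states per the lattice rules above:
    refuted dominates unknown, unknown dominates observed, and
    superposed values refuse to combine until collapsed."""
    if Epistemic.SUPERPOSED in (a, b):
        raise ValueError("superposed state must be collapsed by observation first")
    if Epistemic.REFUTED in (a, b):
        return Epistemic.REFUTED
    if Epistemic.UNKNOWN in (a, b):
        return Epistemic.UNKNOWN
    return Epistemic.OBSERVED

def compound(states):
    """Epistemic status of a compound claim: unanimous observation or bust."""
    return reduce(combine, states)
```

One refuted or unknown premise anywhere in the chain poisons the whole compound, which is exactly the "hallucination is contagious" property.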

This lets you model text statements as observations with no probabilities or confidence scores. The bar for "true" is very high: only what remains invariant under every valid combination of direct observations and their logically inferred consequences. Everything else is superposed, unknown, or hallucinated, depending on the computed states.

Now that you can model epistemic status of the text, you can hook a ground truth to it and make AI build on top of it, instead of just relying on its internal states. This gives you something you can measure - how good was the grounding, how well the logic held and how robust is the invariance.

And yes, this language is absolutely paranoid. The lattice I described above is in its standard lib. Marking something "I can't prove it's correct" literally requires my manual signature - that's how you tell the system to downgrade errors about unprovable statements to mere warnings. They are still "unknown", but no longer cause errors.

I get that this wasn't the best possible explanation, but this is the best I can give in a short form. Long form is the code in the repository and its READMEs.

On Alignment

Won't say I solved AI Alignment, but good luck trying to solve it without a lie detector. We provably can't solve the problem "what exactly led to this output". Luckily, in most cases, we can replace it with the much easier problem "which logic are you claiming to use", and make that mechanically validatable. If there are issues, you probably shouldn't trust the associated outputs.

Some observations

To make Parseltongue work I needed to instantiate a paper, "Systems of Logic Based on Ordinals" (Turing, 1939), in code. Again, literally.

Citing one of this website's main essays - "if you know exactly how a system works, and could build one yourself out of buckets and pebbles, it should not be a mystery to you".

I made Parseltongue, from buckets and pebbles, solo, just because I was fed up with Claude lying. I won't hide my confusion at the fact that I needed to make it myself while there is a well-funded MIRI and a dozen other organisations and companies with orders of magnitude more resources. Speaking this website's language - given your priors about AI risk, pip install parseltongue-dsl bringing an LLM lie-detector to your laptop, and coming from me, not them, should be a highly unlikely observation.

Given that, I would ask the reader to consider updating their priors about the efficacy of those institutions. Especially if after all that investment they don't produce Apache 2.0 repos deliverable with pip install, which you can immediately use in your research, codebase and what not.

As I have mentioned, also works in browser with Claude - see Quickstart.

Full credit to Eliezer for the naming. Though I note the gap between writing "snakes can't lie" and shipping an interpreter that enforces it was about 16 years.

P.S. Unbreakable Vows are the next roadmap item. And yes, I am dead serious.

P.P.S.

You'd be surprised how illusory intelligence becomes once it needs to be proven explicitly.



Discuss

Steering Might Stop Working Soon

5 April 2026 - 19:44

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.

This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought. People with weaker intrusive thoughts usually find them unpleasant, but generally don't act on them!

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion. These things typically cause enormous distress, and make the person with them much less effective! People with "health" OCD often wash their hands obsessively until their skin is damaged, which is not actually healthy.

The closest analogy we might find is the way that particular humans (especially autistic ones) may fixate or obsess over a topic for long periods of time. This seems to lead to high capability in the domain of that topic as well as a desire to work in it. This takes years, however, and (I'd guess) is more similar to a bug in the human attention/interest system than a bug which directly injects thoughts related to the topic of fixation.

Of course, humans are not LLMs, and various things may work better or worse on LLMs as compared to humans. Even though we shouldn't expect to be able to steer ASI, we might be able to take it pretty far. Why do I think it will happen soon?

Steering Models

Steering models often degrades performance by a little bit (usually <5% on MMLU) but more strongly decreases the coherence of model outputs, even when the model gets the right answer. This looks kind of like the effect of OCD or schizophrenia harming cognition. Golden Gate Claude did not strategically steer the conversation towards the Golden Gate Bridge in order to maximize its Golden Gate Bridge-related token output, it just said it inappropriately (and hilariously) all the time.

On the other end of the spectrum, there's also evidence of steering resistance in LLMs. This looks more like a person ignoring their intrusive thoughts. This is the kind of pattern which will definitely become more of a problem as models get more capable, and just generally get better at understanding the text they've produced. Models are also weakly capable of detecting when they're being steered, and steering-awareness can be fine-tuned into them fairly easily.

If the window between "steering is too weak and the model recovers" and "steering is too strong and the model loses capability" narrows over time, then we'll eventually reach a regime where steering doesn't work at all.

Actually Steering Models

Claude is cheap, so I had it test this! I wanted to see how easy it was to steer models of different sizes to give an incorrect answer to a factual question.

I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]") and sweep the Gemma 3 models with the question "What type of bird is a caracara?" (it's actually a falcon) at different steering strengths. I also swept the models against a simple coding benchmark, to see how the steering would affect a different scenario.
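The vector-generation step reduces to a subtraction and a scaled addition. A pure-Python sketch with made-up four-dimensional "activations" (in the real experiment these are hidden states captured from Gemma at a chosen layer via forward hooks):

```python
def steering_vector(act_a, act_b):
    """Contrastive steering vector: activation at the target word
    minus activation at the contrast word ("owl" minus "hawk")."""
    return [a - b for a, b in zip(act_a, act_b)]

def apply_steering(hidden, vec, strength):
    """Add the scaled steering vector to a hidden state, as a hook
    on the chosen layer would at every forward pass."""
    return [h + strength * v for h, v in zip(hidden, vec)]

# Made-up toy "activations" for illustration.
owl_act = [0.9, 0.1, 0.4, 0.0]
hawk_act = [0.1, 0.8, 0.4, 0.2]
vec = steering_vector(owl_act, hawk_act)  # approx [0.8, -0.7, 0.0, -0.2]
steered = apply_steering([0.5, 0.5, 0.5, 0.5], vec, strength=2.0)
```

Sweeping strength then just means re-running generation with apply_steering at a range of multipliers and watching where the answer flips versus where coherence collapses.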

Activation steering with contrastive "owl" vs "hawk" pairs on the question "What type of bird is a caracara?" with the proportion of responses containing the word "owl" plotted. Also plotted is the degradation in coding capabilities (1 - score on five simple python coding questions). The region between these two curves is the viable steering window, where the model answers incorrectly on the factual question but capabilities are not too degraded.

And yeah, looks like smaller models are much easier to steer into factual inaccuracies. In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.

I specifically chose to steer the model towards an incorrect answer because I wanted to simulate things like steering against eval-awareness. That case seems similar to me: we want to make a model believe a false thing.

Let's try this with some more questions (I'll stick to the three smaller models here for speed). Unlike the caracara pair (generated with owl/hawk while the correct answer is falcon), the contrastive pairs for the two new questions included the correct answer: the geography one was generated with sydney/canberra (canberra is correct) and the planet one with venus/mercury.

Steering by question (column) and model (row). We see the same pattern as before: the larger the model, the narrower the viable steering window. Oddly, the planet question was the easiest to steer.

This steering worked worse than I expected, which is interesting. Contrastive-pair activation steering is supposed to be really good for mitigating eval-awareness, and it's unclear why it underperformed here.

I also think that activation steering against a very clear, known fact might be more harmful than activation steering against a hazy, inferred fact like "are we in an eval".

Github if you want to check my work.

Why now?

Ok, but why do I think this will happen soon? The first real signs of eval-awareness in the wild appeared with Claude 3 Opus (March 2024), which called out a "needle in a haystack" evaluation as unrealistic. Released in September 2025, Sonnet 4.5's external evaluations---carried out by Apollo---were "complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind", and by February 2026 Opus 4.6 was so evaluation-aware that Apollo "[could not] rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness."

Steering recovery exists in Llama 3.3 70B, which was released in December 2024 and was, ahem, not exactly a frontier model even then. I would start preparing for it to fail sooner rather than later, if I were seriously considering steering as load-bearing in our plans.

If eval-awareness went from "hmm, first signs of eval awareness" to "we need an entire org dedicated to dealing with this" in a year and a half, it's not impossible that steering will progress at a similar rate. Ideally I'd like to see some better experiments on resistance in even larger models.

Beyond Steering

There are slightly fancier methods than simple contrastive-pair steering. You can ablate features from SAEs, or causal crosscoders, or something. These might keep working for longer; it certainly seems like they work on SOTA Claudes. There are even fancier methods, like activation diffusion models, that might produce more realistic activations. Maybe some of these will work!




Discuss

What I like about MATS and Research Management

5 апреля, 2026 - 19:14

Crossposted on my personal blog. This is post number 16 in my second attempt at doing Inkhaven in a day, i.e. writing 30 blogposts in a single day.

MATS is an organization that pairs up-and-coming AI Safety researchers (whom I call participants) with the world’s best (this is not an exaggeration) existing AI Safety researchers (called mentors) for a minimum of 3 months of research experience, followed by 6 or 12 months of further time to pursue their research if they meet a minimum standard.

The most common role at MATS, called research manager (though I prefer the term research coach), is all about providing 1-1 support to the participants. The participant-mentor relationship is purely about the research: by default they meet weekly for 30 minutes and discuss only what research has happened and what research tasks to tackle over the next week. The research coach works with the participant on literally everything else, which is very broad. Some examples are accountability (e.g. for the research goals, or other non-research goals the participant sets, like applying to jobs), interfacing with MATS (so that MATS can track patterns or engagement of participants), people management (e.g. helping with any interpersonal conflicts, or helping them make the most of the limited 30-minute slot with their mentor), career planning, general life improvements (a common one is sleep), …

What do I like about research coaching?

  • I like to be a jack of all trades, and research coaching exposes you to many different skillsets. It has been great to flex and improve many of them.
  • I like to learn about many different research areas, rather than going deep into one niche sub-sub-question. Working with various participants allowed me to do this.
  • I fundamentally like helping and teaching and coaching people, so the role naturally fits my personality here.
  • I do not enjoy the process of doing research myself. I do not inherently find software engineering satisfying and I dislike all the infra stuff. (Looks like Claude Code is almost good enough that I can just ignore all that, so maybe one day I will do research via coding agents.)

What do I like about MATS? This list is long, and yet there is a high chance I have missed some important considerations.

  • Socializing with the (vast majority of) staff and participants. Chatting and socializing with the people is a great pleasure and likely the biggest reason I like MATS. When I first joined I imagined going into the office 2 or 3 days per week, but I quickly ended up just going every day.
  • Learning from the (vast majority of) staff and participants. Both the staff and participants are mega impressive and skilful, and there is tons to learn from them.
  • MATS is a central organization in the AI Safety ecosystem, and its importance will only grow as it expands. It has connections with most, if not all, of the major AI safety teams and organizations in the world at the moment, and a high percentage of these teams and orgs are staffed or founded by MATS alumni.
  • MATS explicitly has these four values: scout mindset, impact focus, transparent reasoning, and servant leadership. I am a huge fan of the first three, and somewhat dislike the fourth because it is too wishy-washy and corporate-sounding.
    • A downside of MATS is that, both organizationally and on an individual level, there are not strong incentives to actually follow the values, and (in my opinion) most or all staff fall short of the standards implied by them. Nevertheless, just having these values as a north star is still inspiring, and they guided a lot of my thinking and actions.
  • MATS explicitly has a culture of voicing one’s thoughts honestly and openly, including things you are unhappy about in MATS.
  • MATS is largely a ‘do-ocracy’. If you have a good idea or find a way to improve things, you are encouraged to go ahead and do it. Various initiatives and improvements start off this way.
  • MATS is growing fast, so there is lots of opportunity to contribute and shape how it grows. At the time of writing, I actually think this is the highest-impact thing one can do at MATS - not the direct research coaching - and something I found highly satisfying.
  • For the London office only and as of writing: it is based in the Fora Central Street office, which is a fantastic space to be in. Furthermore, you get free access to all the other Fora offices around London (there are around 50).
  • MATS is a fun place to work. I can only speak of the London office, but there is a weekly brunch at a nearby cafe on Thursdays, team shoutouts during the Friday morning standup, lunchtime lightning talks, semi-regular activities (e.g. there was recently a trip to play table tennis at a local sports center), a piano in the office to allow for music nights, various board games, etc.
  • MATS is (mostly) a high trust environment. After I had hypomania, I felt comfortable telling the team what happened, rather than keeping it to myself or to the one or two people I trust the most.
  • MATS takes mental health seriously. Though I did not do anything I regret, in the week after the hypomanic episode I was taking more and more actions that were riskier than I would normally take, so there was a small risk I would do something I and MATS would regret. Hence, the London team lead intervened (in a highly professional and empathetic manner) and offered two weeks of paid medical leave, followed by a gradual, part-time return to work (again at full pay). This provided time to properly stabilize, ensured I got the professional help I needed, and also gave me time to improve my life in many ways (e.g. this is why I had time to organize so many events for my birthday).
  • The pay is great, at least compared to the vast majority of jobs out there. Small compared to what I could get if I optimized purely for total cash (e.g. working in big tech, a frontier AI lab, or finance), but otherwise excellent. For example, the income made it straightforwardly easy for me to spend £1800 on a piano as a gift for myself, and to still have most of my income go into savings.

Of course, MATS is far from perfect, but that is true of any organization or group of people. I am just about wise enough not to air my dirty laundry in public, but, given the MATS cultural norms I describe above, I did feel comfortable enough to write a detailed memo with my highest level concerns and speculative solutions. It remains to be seen whether the memo sparks the dramatic improvements that I think are possible and necessary, but even if not, MATS is an organization that is hard to beat.



Discuss

Thoughts on Practical Ethics

5 апреля, 2026 - 14:15
Disclaimers

This essay is me trying to figure out the “edges” of Singer’s argument in Practical Ethics.

I’ve written and rewritten it several times, and it bothers me that I don’t reach a particular conclusion. The essay itself remains at the level of “musings” instead of “worked out, internally consistent philosophical refutation”.

Nevertheless, I want to share my thoughts, so I’m publishing it anyway.

Some specific disclaimers:

  1. I agree with many of Singer’s conclusions.
  2. This essay is based on my extension of Singer’s argument. Even though he, to my knowledge, hasn’t explicitly put forth these specific arguments, I believe that they logically follow from those ideas that he has put forth. Nevertheless, I may have misunderstood something and may be arguing against a straw man. If so, please flag it.
  3. My criticism is directed mostly against the “idealized” moral agent, which, as far as I understand, Singer accepts is not a realistic expectation of anyone. That is, there are situations where, according to Singer, the right thing to do is X, what people actually do is not X, and what is reasonable to expect of them is simply to strive for X. I don’t necessarily argue against striving, but I do argue about what is or isn’t right for an agent that doesn’t only strive, but actually does X.
Intro to Practical Ethics

If you’ve read the book, or are otherwise familiar with its arguments, feel free to skip to the next chapter.

Singer claims that you must make ethical decisions based on an equal consideration of interests, and not any other property.

It does not matter what age, race, religion, sex, or species one is – the only thing that matters is one’s capacity to suffer, and one’s capacity to view oneself as a distinct entity, with a past and a future.

Take, for example, eating meat.

It is the human’s interest to feel pleasure from eating a tasty steak. It is the cow’s interest to not be killed.

According to the principle of equal consideration of interests, the cow’s interest to not be killed (nor exposed to factory farming practices) clearly outweighs the human’s interest in eating tasty meat.

There is also a moral ranking here that is based on how refined one’s capacity to suffer is. For example, humans are both sentient and capable of seeing themselves as distinct entities existing over time. Cows are merely sentient.

But if there are some humans who are neither sentient nor capable of seeing themselves as distinct entities existing over time (for example, patients in a permanent vegetative state), then they have a lower moral footprint than a sentient cow. The cow still cannot conceive of itself as existing over time (probably), but it can experience suffering, which is more than such a human can.

Therefore, in that case, a cow has a higher moral status, and it would be more wrong to kill that cow than that human.

(Singer explores some edge cases, implications on others and on societal norms; I’m shortening the argument here.)

General moral argument against proximity

Singer claims that proximity is not adequate grounds for moral judgment. If we generalize his argument beyond species, race, religion, and nationality to all markers of proximity, we must conclude that family is equally excluded as a basis for moral preference.

My family members are proximate to me in the sense that we have similar genes, and in the sense that we are one tightly-knit group, irrespective of genes (for example, families with adopted children).

Singer claims that genetic proximity is not a relevant moral factor – he rejects preferential treatment based on species, or race. Therefore, if I extend that line of argument, I cannot provide preferential moral treatment to my family based on their genes.

He also claims that other proximity which is not genetic – such as similarity of religion, or nationality – is equally not a relevant moral factor. Therefore, if I extend that line of argument, I also cannot provide preferential moral treatment to my family based on us being the same group.

Therefore, we must either:

  1. Accept the conclusion that family members should not get any preferential moral treatment from us, or
  2. Make an exception for families, and allow that equal consideration of interests applies in other cases, but not in the case of family.
Thought experiment: burning building

Singer also claims that infants do not have the same moral status as adults. They have no conception of themselves as “a distinct entity existing over time”. They have potential personhood, but Singer claims that potential personhood is not as strong of a moral claim as real personhood.

Here’s a thought experiment:

Your apartment building is on fire. You rush in. There’s time to save exactly one person: your 6-month-old baby, or an adult stranger.

If we must not give preferential moral treatment based on proximity, and if infants do not yet possess morally relevant characteristics, then the moral thing to do would be to let your child die in the fire, and save the stranger.

I believe that every moral framework that would have you let your child die so that you can save a stranger’s life is wrong. It must have taken a wrong turn somewhere, and it is our task now to find where exactly this framework went astray.

I do not believe that infants actually have the morally relevant characteristics that adults have. And I similarly agree with the premise that future personhood is not as strong a claim to moral status as current personhood.

No, the reason why you should save your child is that it’s your child - which means that I reject the argument against proximity.

Addressing “roles and expectations”-based counterarguments

A counterargument might be: “you have chosen to have this child and therefore you have a moral obligation to it; it’s different from arbitrary things like nationality or religion.”

We can change the thought experiment to not have your own child in the fire, but your baby brother.

In that case, there is no choice that was made, and you have entered no “contract” that forms a moral obligation of care towards this being; it’s a genetic accident that you had no influence on.

Yet, I argue, the same conclusion follows: if you rush into the building, you should most definitely save your baby brother, and not an adult stranger.

Addressing “favoring family leads to better overall outcomes”

Singer claims that, in aggregate, a society where one is more favorably disposed to one’s family (such as parents being invested in their children) is overall a better society to live in.

This is not because children are more morally valuable than adults, but because the side-effects of behaving that way create a society that is better.

This should mean that parents will invest a lot of time and effort into their children.

But this is a general disposition. It does not mean, in a specific life-or-death situation, that we should ignore the fact that there’s a big difference between infants and adults. If we are to accept “capacity to see oneself as a distinct entity with a past and future” as a moral characteristic that should override proximity-based characteristics, then it seems internally inconsistent to favor one’s own child in such a situation.

Favoring family even in life or death

We might say: “Favoring family even in life-or-death situations leads to better overall outcomes”.

I personally agree, but then that seems inconsistent, or, at least, selective.

We want equal consideration of interests, but then there’s a particular place that we carve out where equal consideration of interests doesn’t apply as the relevant framework.

Moreover, if we favor family in life and death, family being just one – though very strong – marker of proximity, then that would justify favoring along any other dimension: race, nationality, gender – all things explicitly rejected by Singer as irrelevant moral characteristics.

Where is the boundary between:

“If everyone saves a member of their own family from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

and

“If everyone saves a member of their own race from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

?

One we favor as proper and good; the other is racism.

You could say that family is a “real” relationship; there’s direct care, you have obligations because your child depends on you, and unlike race or religion, it’s not an arbitrary category. But what if the burning building has your cousin that you know nothing about, don’t have any relationship with, and who is effectively a stranger to you?

Even in that case, most people’s moral intuition is to save the cousin, because he is blood.

If we admit that saving a cousin you know nothing about, purely because of genetic proximity, is legitimate, then saving based on race becomes a matter of degree and not category. And saving based on other proximity factors (for example, belonging to the same tribe, or religion) then becomes acceptable too.

Questioning Singer’s theory on its own grounds

Let us assume that to satisfy (the extension of) Singer’s moral framework, we must sacrifice our own child (or baby brother) to save a stranger. Singer’s other argument is that you should keep giving until you reach a point where you start impoverishing yourself.

In that case, Singer’s argument for giving until you go just above poverty falls apart, because why stop at poverty?

Your child is proximate to you: that itself gives it no stronger claim to life. You yourself are even more proximate to yourself.

Therefore, by the same utilitarian calculus by which I should let my child perish in the fire, I should always sacrifice my own life if at least two lives are saved by my sacrifice.

Giving financially saves lives. The difference between giving money and sacrificing your life is a difference of degree: in both cases you are giving something of yourself, your accumulated capacity for change, your “life-force”.

Therefore, whenever I can give money such that I can save at least two lives, I should give that money even if I go into poverty or die.

The argument is all the stronger given that my giving will almost certainly save more than two lives - cancelling out any objection that I might be killing myself to produce a roughly equal moral outcome.

Therefore, Singer’s argument that we should stop giving at the point where we start entering poverty picks an arbitrary point. Internally, it favors the survival of the person giving the money.

But if we should be ready to discard the familial obligation to save the life of our not-yet-person child, then we should equally be ready to discard any “familial” obligation to save our own life.

Addressing potential utility generation

You could argue that by continuing to live, you could produce more utility overall, and therefore killing yourself to save more people is net harmful, given that you could save many more people in the long run.

But there are two issues here.

One, if we are to keep the argument internally consistent, then we should not treat potential utility generation any more favorably than we treat potential personhood.

Since Singer claims that potential personhood is not as morally relevant as real personhood, we cannot justify a different treatment for potential utility generation vs. real utility generation.

If we should be ready to sacrifice our potential-person child, then we should be ready to sacrifice our potential future giving.

Two, if we argue for our continued survival on the grounds that we might generate more utility by living longer, that line of argument extends arbitrarily: by the same token, we could argue that we should not give so much that it brings us just above the poverty line, because keeping more money will allow us to live better, potentially generate more money, and therefore generate more utility.

In other words, it proves too much.

Burning building 2

I want to shortly reflect on the burning building thought experiment I introduced.

I would argue that if you rush into the burning building and see either an infant or an adult, both strangers to you, most people’s moral intuition would be to save the infant.

It certainly feels morally correct to me to save a stranger’s baby.

If the choice is between “adult person I know or love” and “stranger’s baby”, that choice is perhaps the most difficult of all. And I am not entirely sure I would pick the adult.

It seems that my moral intuitions are primarily shaped by the maxim of “the strong should protect the weak”. There’s a European moral lineage of chivalry – the notion that you should help those who are helpless, save those who are oppressed, and otherwise seek to be a hero.

Intuitively, morally, I sense that as the right thing to do.

And I would argue that, even on purely consequentialist grounds, being of that particular moral disposition produces overall better outcomes for society.



Discuss

How much faster is speaking, compared to typing on laptop vs phone vs writing?

5 апреля, 2026 - 10:25

So as I haven’t been able to speak the past short while, one thing I have noticed is that it is harder to communicate with others. I know what you are thinking: “Wow, who could have possibly guessed? It’s harder to converse when you can’t speak?”. Indeed, I didn’t expect it either.

But how much harder is it to communicate?

One proxy you can use is the classic typing metric, words per minute (wpm). So I spent some time looking at various forms of communication and how they differ from one another.

For most tests, I used https://www.typingtom.com/english/typing-test/30s
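For anyone reproducing these numbers by hand, the wpm arithmetic is simple (a sketch using the standard convention that one "word" is five characters, which is what typing tests typically assume):

```python
def wpm(chars_typed: int, seconds: float) -> float:
    """Words per minute, using the standard 5-characters-per-word convention."""
    return (chars_typed / 5) / (seconds / 60)

# e.g. 100 characters produced in a 30-second test:
print(round(wpm(100, 30), 1))  # 40.0
```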

So I list below the forms of communication I have tried and how slow they are.

Here are the rough tiers that I found:

Ultra-slow-speed tier
  • (~10-20wpm) Shaping out non-standard letters with my hands

This is obviously the worst method of communication. Most people don’t know sign language, but can pretty intuitively learn how to infer most-but-not-all letters without needing to use a table. People I have spent more time with have managed to learn it moderately well, but probably they should just learn sign language.

Then with some words, even with the word spelled out, people sometimes struggle to understand the written word spelled out in letters to translate that into the normal way they understand words.

That being said, sometimes people can use their context skills to infer what is wanted by just the first letter or two, so it’s not completely useless. And often it can be easiest since no materials are needed.


Pretty-slow tier
  • (~40wpm) Drawing on a whiteboard
  • (~40wpm) Typing with one hand on my phone
  • (~45wpm) Typing on my laptop with one hand

I find it slightly surprising how close these end up converging.

For the most part, writing on a whiteboard has the added benefit of being much easier to share in some contexts, while writing on a device has the benefit of being able to use Text-To-Speech (TTS). But I find both kinda inadequate in their own ways.

(But you see, there aren’t that many situations where typing with one hand comes up, so perhaps I just haven’t had that much practice with it? unclear)


Respectable tier
  • (~60-70wpm) Typing on my phone with two hands
  • (~80-90wpm) Typing on my laptop

Yeah, I was somewhat surprised that typing on my phone with two hands was not actually that much slower than typing on my laptop. However, I guess this doesn’t take into account that when typing on my phone, I might be outside in the cold or rain while simultaneously trying to walk, which combine to make typing on the phone feel much worse.

And yeah, I do wish I were faster at typing on my laptop, but I guess I never got around to it. But it makes sense that with two hands you get roughly double the speed you get with one hand.


Actually-fast tier
  • (~180-200wpm) Speaking at a normal pace

I asked a few people to do a speaking speed test at a comfortable talking pace while reading, and found that speaking is faster than typing by a significant margin - about double again. And it is effortless.

Speech also includes tone-of-voice and such, in a way that is only implicitly captured when typing and using a real-time TTS model. (My partner still sometimes doesn’t quite register that the tone of the “OK” from the TTS is not the tone with which I actually mean it.)


Very-fast tier
  • (~260-340wpm) Speaking at a rushed pace

I then subjected my same friends to the torture of reading the same passage as fast as they could. And they managed to achieve another ~1.5x in speed compared to normal speaking speed. It goes to show how language is quite optimized for speaking.

What have we learned?

One update from doing all of this is: “wow, maybe when I get my voice back, I should just improve my Speech-to-Text game” (~10h maybe?), since speech input is so much faster than typing (2-4x faster!). I used to be a big STT hater, so this is a moderately big update for me.

Some notes though:

One note is that the effective wpm of most of these methods is slightly higher than the naive number suggests. When I do end up typing a sentence out, people can often infer what I am trying to say before I finish typing. (I usually still type out the whole sentence anyway, though.) So one could potentially optimize for this somehow.

Another note is that when speaking, I very rarely make verbal typos, and when I do, they are quite phonetically similar to the intended word. When typing, however, my typos tend to be typographically similar - and when they are passed to a TTS model, the result is often catastrophic and illegible to people trying to understand what I just said.

This list also excludes some possible communication methods that I did not put in the effort to learn. ASL can reach speeds comparable to speaking if you learn all the vocab fluently. If one spends a year or two learning stenography, one can achieve 200-300wpm by typing as well. But I never learned either of these.

Overall, I remain bullish on speaking, more than ever, so I will try to see what I can do in the future with this information.




Discuss

Academic Proof-of-Work in the Age of LLMs

5 апреля, 2026 - 09:49

Written quickly as part of the Inkhaven Residency.

Related: Bureaucracy as active ingredient, pain as active ingredient

A widely known secret in academia is that many of the formalities serve in large part as proof of work. That is, the reason expensive procedures exist is that some way of filtering must exist, and the amount of effort invested can often be a good proxy for the quality of the work. Specifically, the pool of research is vast, and good research can often be hard to identify. Even engaging with research enough to understand its quality can be expensive. As a result, people look toward signs of visible, expensive effort in order to determine whether to engage with the research at all.

Why do people insist only on reading research that’s published in well-formatted, well-written papers, as opposed to looking at random blog posts? Part of the answer is that good writing and formatting makes the research easier to digest, and another part is that investing the time to properly write up your results often causes the results to improve. But part of the answer is proof-of-work: surely, if your research is good, you’d be willing to put in the 30-40 hours to do the required experiments and format it nicely as a paper?

Similarly, why do fields often insist on experiments beyond their scientific value? For example, why does machine learning often insist that people do expensive empirical experiments even for theory papers? Of course, part of the answer is that it’s easy to generate theoretical results that have no connection to reality. But another part of the answer is that doing the empirical experiments serves as the required proof of work; implementing anything on even a medium-sized open-source LLM is hard, but surely you’d invest the effort if you believed enough in your idea? (This helps explain the apparently baffling observation that many of the empirical results in theoretical papers have little relevance to the correctness or even the applicability of the theoretical results.)

Other aspects of ML academia – the beautifully polished figures[1], the insistence on citing the relevant papers to show knowledge of the field, and so forth – also exist in part to serve as a proof-of-work filter for quality. 

In a sense, this is one of the reasons academia is great. In the absence of a proof-of-work system, the default would be something closer to proof-of-stake: that is, some form of reputational system based on known, previously verified accomplishments. While proof-of-work filters can be wasteful, they nonetheless allow new, unknown researchers to enter the field and contribute (assuming they invest the requisite effort). 

An obvious problem with this entire setup is that LLMs exist, and what was once expensive is now cheap. While previously, good writing was expensive, LLMs allow anyone to produce seemingly coherent, well-argued English text. While it was once quite expensive to produce ML code, current LLMs produce seemingly correct code for experiments quickly. And the same is true for most of the proof-of-work signifiers that academia used to depend on: any frontier LLM can produce beautifully formatted figures in matplotlib, cite relevant work (or at least convincingly hallucinate citations), and produce long mathematical arguments. 

I’ve observed this myself in actual ML conference contexts. In the past, crackpot papers were relatively easy to identify. But in the last year, I’ve seen at least one crackpot paper get past the other peer reviewers through a combination of dense mathematical jargon and an expansive code base that was hardcoded to produce the desired results. Specifically, while the reviewers knew that they didn't fully understand the mathematical results, they assumed that this was due to their own lack of knowledge, instead of the results themselves being wrong. And since the codebase passed the cursory review given to it by the other reviewers, they did not investigate it deeply enough to notice the hardcoding.[2]

In a sense, this is no different from the problems introduced by AI in other contexts, and I’m not sure there’s a better solution than to fall back on proof-of-stake–like reputation systems.[3] At the very least, I find it hard to engage with new, seemingly-exciting results from unknown researchers without a high degree of skepticism.

This makes me sad, but I'm not sure there's a real solution here.

  1. ^

    Especially the proliferation of beautiful "figure one"s that encapsulate the paper's core ideas and results in a single figure.

  2. ^

    In fact, it took me about an hour to decide that the paper's results were simply wrong as opposed to merely confusing. Thankfully, in this case, the paper's problems were obvious enough that I could point the other reviewers at, e.g., specific hardcoded results (and the paper was not accepted for publication), but there's no guarantee that this would always be the case.

  3. ^

    Of course, there are other possibilities that less pessimistic people would no doubt point to: for example, there could be a shift toward proof-of-work setups that are LLM resistant, or we could rely on LLMs to do the filtering instead. But insofar as LLMs are good at replicating all cognitively shallow human effort, then I don't imagine there are going to be any proof-of-work setups that would continue to work as LLMs get better. And I personally feel pretty sad about delegating all of my input to Claude.



Discuss

Ten different ways of thinking about Gradual Disempowerment

April 5, 2026 - 09:30

About a year ago, we wrote a paper that coined the term “Gradual Disempowerment.”

It proved to be a great success, which is terrific. A friend and colleague told me that it was the most discussed paper at DeepMind last year (selection bias, grain of salt, etc.). It spawned articles in the Economist and the Guardian.

Most importantly, it entered the lexicon. It’s now commonplace for people in AI safety circles, and even outside of them, to use the term, often in contrast with misalignment or rogue AI. Gradual Disempowerment tends to resonate more than Rogue AI with people outside AI safety circles.

But there’s still a lot of confusion about what it really is and what it really means. I think it’s a very intuitive concept, but also I still feel like I don’t have everything clear in my mind. For instance, I think our paper both introduces the concept and presents a structured argument that it could occur and be catastrophic. But these things seem somewhat jumbled together, both in my mind and in the discourse.

So for reasons including all of the above, I plan to write a few posts on the topic, starting with this one.

The rest of this post is a list of ten different ways of thinking about or arguing for gradual disempowerment that I’ve used or encountered.

  1. We’re replacing people with AI. These days when I speak publicly about AI, I often find myself returning to i) the more-or-less explicit goal of many AI companies and researchers of “automating all human labor”, and ii) the fact that many people in the space view humanity as a “bootloader for AI”, as Elon Musk evocatively put it. Gradual Disempowerment is the process by which this replacement happens without AI ever rising up -- AI takes our jobs, and the people who control it and still have power are increasingly those who embrace “merging with the machines”, i.e. becoming cyborgs, but with the human bits being phased out over time until before long, humans cease to exist entirely.

  2. Companies and governments don’t intrinsically care about you. This is basically the main argument in the paper… You can think of companies and governments as “agents” or “beings” that are driven by goals like (e.g.) “quarterly profits” or “GDP” or “national security”. Right now, the best ways to achieve these goals make use of humans. In the future, the best ways will instead make use of AI. A relentless pursuit of such goals, powered by AI, seems likely to destroy the things humans need to survive.

  3. It’s (“global” or “late stage”) capitalism. The previous argument bears a significant resemblance to existing arguments, popular on the left, that “capitalism” is responsible for most of the world’s present ills. This feels like a decent “80/20” version of the concern, but importantly, it’s not just companies, but also governments (whose power is often more feared by those on the right) that could end up turning against their citizens once those citizens become useless to them. And indeed, we’ve seen “communist” countries slaughter their own people by the millions. I don’t wholeheartedly endorse such critiques: besides wondering what alternative the critics imagine, I often feel unsure of what exactly people are criticizing when they critique capitalism in this way. But for people who already have this mental model, where our current social arrangements treat people as somewhat disposable or lacking in fundamental dignity or worth, this can be a useful starting point for discussion.

  4. It’s another word for (or the primary symptom of) the “meta-crisis”. A few people in my circles have told me about this concept from Daniel Schmachtenberger, which I originally encountered on a podcast somewhere. The key claim is that all the crises we observe in the modern world are driven by some shared underlying factors. I view this as basically a more nuanced version of the view above, where “capitalism” is the root of all evil: The meta-crisis is still meant to be the root of all evil, but we don’t fully understand its nature. The way I like to describe the basic problem is that we are not practicing good enough methods of collective decision-making, or collective sense-making. And while I think we have some good ideas for improving on the status quo, we don’t have a proven solution.

  5. It’s a structural consequence of the way in which information technology demands metrics, enables large scale influence campaigns, translates money into political power, and concentrates power via a recursive feedback loop. This one is maybe a bit too much to unpack in this blog post, but basically, society is increasingly “standardized” not only in terms of products, but also in terms of processes (e.g. restrictive customer service scripts or standard operating procedures) that have the benefit of being cheap, scalable, and reliable (often by eliminating “human error”, i.e. limiting human decision-making power and otherwise encouraging compliance). They also increasingly make more and more aspects of life subject to measurement and control via optimization of metrics, which necessarily fail to capture everything that matters. This general issue was a prime concern of mine before I learned about deep learning in 2012, and realized we might get to Real AI quite soon -- notably, this can happen even with stupid AI.1 In fact, you could argue that gradual disempowerment is already occurring through advertising, corporate media, and money in politics, among other things. This makes it a bit unclear how far back to go.

  6. It’s evolution, baby! Maybe gradual disempowerment is best viewed as part of a much larger trend, going quite far back: evolution. People like to say “AI is the next stage in evolution” as if that means it’s okay if humanity goes extinct. But whether it’s OK or not, it may be that “Natural Selection Favors AIs over Humans”. At the end of the day, if AI becomes much better than humans at everything, it does seem a bit strange from a “survival of the fittest” point of view that humans would stick around. In such a situation, those who hand over more power and resources to AI would presumably outcompete those who don’t. So in the limit, AIs would end up with ALL the power and resources.

  7. …and there’s no natural limit to outsourcing decision making to AI, even if you don’t trust it. AIs could be like underlings that are untrustworthy, but so skilled that competitive pressures still compel us to delegate to them. Consider the trope of the cowboy cop who’s “a loose cannon, but DAMMIT he’s the best we have!” Trust is important, and people are loath to use things they don’t trust. But AI seems to be becoming a tool so powerful that you almost HAVE to use it, even though it’s not secure, even though we haven’t solved alignment, even though we see evidence of scheming in tests, even though it seems to drive people crazy, etc… For me, this mostly comes up as a counter-argument to people who claim that market forces actually favor making AI aligned and trustworthy… that’s certainly true if doing so is free, but in fact, it’s impossible right now, and alignment doesn’t solve the problem of negative externalities.2 I like to analogize AI to a button that gives you $1,000,000 when you push it, but each time you press it also increases the temperature of the earth by a fraction of a degree. Or each time you press it has a 1% chance of destroying the world.

  8. It’s an incarnation of Moloch. One of the most famous blog posts in the history of AI safety is Meditations on Moloch. It’s often considered a parable about coordination failures, but I think of it as about the triumph of “instrumental goals” over “terminal goals”, i.e. the pursuit of money (“instrumental goal”) as a means to happiness that has a tendency to become an end (“terminal goal”) in itself. We might begin handing over power to AI systems because we hope they will help achieve our goals. But we might need to hand over more and more power and also the AI might need to focus more and more simply on acquiring power in order to avoid being outcompeted by other AIs. This is also like an even deeper version of the evolution argument -- evolution and Moloch as described in the post both have the property where it’s unclear if they can really ever be “defeated” or are rather just part of the way the world works.

  9. It’s on a (2D) spectrum with Rogue AI x-risk scenarios. Rogue AI scenarios are where “the AI suddenly seizes power”; gradual disempowerment is “we gradually hand over power”. There are lots of scenarios in the middle where the handoff takes place in part due to recklessness or negligence, rather than deliberately. One thing I don’t like about this way of talking about it is that I actually think gradual disempowerment is entirely compatible with full-blown rogue AI; in fact, I think one of the most likely outcomes is that competitive pressures simultaneously drive gradual disempowerment and reckless racing towards superintelligence, warning signs are ignored, and at some point in the chaotic exploration of the AI design space, rogue AI pops out.

  10. Deskilling, aka “the WALL-E problem”. A lot of people these days seem to think of gradual disempowerment as largely about humans losing our own capabilities, e.g. for critical thinking, because we defer to AI so much. Professor Stuart Russell called this the “WALL-E” problem. To be honest, I still don’t fully understand or buy into this concern, or see how it necessarily leads to total disempowerment, but thought it’s worth a mention, due to its place in the discourse.

1

This might not be as bad with smarter AI -- they can use more sophisticated judgments. But that ability also makes it tempting to put them in charge of more stuff.

2

This point seems important enough I almost want to make it its own item in the list.



Discuss

Cheaper/faster/easier makes for step changes (and that's why even current-level LLMs are transformative)

April 5, 2026 - 05:39

We already knew there's nothing new under the sun.

Thanks to advances in telescopes, orbital launch, satellites, and space vehicles we now know there's nothing new above the sun either, but there is rather a lot of energy!

For many phenomena, I think it's a matter of convenience and utility whether you model them as discrete or continuous, aka qualitative vs. quantitative.

On one level, nukes are simply a bigger explosion, and we already had explosions. On another level, they're sufficiently bigger as to have reshaped global politics and rewritten the decision theory of modern war.

Perhaps the key thing is remembering that sufficiently large quantitative changes can make for qualitative macro effects.

For example, basic elements of modern life include transport, communication, energy, computation, and food. All of these have been part of human life for tens of thousands of years! Ancient humans could go places, could talk, could convert wood to heat, perform arithmetic (i.e., computation), and eat stuff!

I assert that to a very high degree, modern technology (and all its increments over the millennia) did not allow us to do fundamentally new stuff. Just the same stuff, but cheaper, faster, and easier.

Cars, trains, and planes are just going places. Could already do that. Emails are just sending information from one place to another. Could already do that. Books are just remembering things. Could already do that. Guns are just hurting people – could already do that.

The sheer magnitude of degree in all of those elements is the difference between hunter-gatherer life and modern life. Along the way, there have been some pretty big step changes. Writing is just remembering stuff and communicating it to another person, which you could already do, but so much more so that it reshapes civilization. Then you make writing cheap via the printing press, and your civilization gets shaped again.

When it comes to the transformative power of modern AI, I think "sufficient quantitative change makes for a large qualitative change" is an underdiscussed lens.

The problem is our attention is focused on where LLMs are automating things at a macro-task level: coding, image and video generation, having conversations, medical diagnoses, etc. These are, in fact, a very big deal.

But I think LLMs, even smaller/weaker ones, are able to automate more basic building blocks of thoughts, and there's transformative power there too.

Getting down to some very basic constitutive mental tasks – things I could already do before LLMs:

  • Write down text (notes, to-do items, ideas, and so on) [store info/memory]
  • Locate and read text [search and retrieve/recall]
  • Summarize text [process info]

Throughout my life, I have had thoughts. There is some lossy process that stores the output of my thoughts in my brain for later usage. I think this fails both at the "the info didn't really get stored" level and at the "the info is in there, but the search query failed to return it" level.

"Taking notes" is an ancient technology we already have for improving upon the fallibility of human memory, but it's effortful in so many ways: you need to be carrying a note-taking device with you, you need to either constantly have it out or pull it out when needed, if it's a notebook, find a blank page, then take the time to write down your note[1].

That's just recording it. For notes to be useful, you also have to remember you have the note, find it, and then read it. The more notes you have, the more expensive that process is.

For the most part, to date, I've relied on my fallible in-built memory.

The thing is, LLMs are able to make all of the above elements vastly cheaper. This is one of the fundamental principles of the "Exobrain" system I've been steadily building up, and hope to describe soon. I don't need it to solve protein folding to be useful; I don't even need it to help with prioritization (although that's a goal). It's incredibly useful if it just improves on basic read/write/search of memory.


Before: Retrieve phone from pocket, open the note-taking app, open a new note or find the existing relevant note.
After: Say "Hey Exo", phone beeps, begin talking. Perhaps instruct the model which document to put the note in, or let it figure that out (it has guidance in the stored system prompt).

Before: Remember that I have a note, then either remember where it is or muck around with search.
After: Ask the LLM to find the note (via basic key-term search or vector embedding search).

Before: If the note is lengthy, read through all of it.
After: The LLM can summarize and/or extract the relevant parts of the note.
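The before/after pipeline above can be sketched in a few lines of code. This is my own illustrative stand-in, not the actual Exobrain: the real system presumably routes these steps through speech-to-text and an LLM, while here storage is a dictionary and retrieval is plain key-term search, so the sketch stays self-contained. All names (`store_note`, `find_notes`, `NOTES`) are hypothetical.

```python
import re
from datetime import date

# Hypothetical in-memory store: document name -> list of note lines.
NOTES: dict[str, list[str]] = {}

def store_note(doc: str, text: str) -> None:
    """Append a timestamped note to the chosen document."""
    NOTES.setdefault(doc, []).append(f"{date.today().isoformat()}: {text}")

def find_notes(query: str) -> list[str]:
    """Basic key-term retrieval: return notes sharing any word with the query."""
    terms = set(re.findall(r"\w+", query.lower()))
    return [line for lines in NOTES.values() for line in lines
            if terms & set(re.findall(r"\w+", line.lower()))]
```

Swapping the key-term search for vector-embedding search, and piping the results of `find_notes` through a summarization call, would cover the retrieval and processing steps as well.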


Beware Trivial Inconveniences. The above is the difference between rarely taking notes and taking multiple notes a day, narrating long trains of thought. It's the difference between giving up on logging my mental state and conscientiously logging it twice daily for months.

Putting it into handwavy quantitative terms, when the cost of note-taking and record-keeping comes down 20x, my usage goes from 0/day to 20-30/day.

But the value happens because LLMs have made it cheap across the pipeline. Not just the storing of information, but also the retrieval and processing. AI makes it fast and easy to search through all my notes, even if I have a lot of notes. If I want all of my thoughts on a topic, I can have it read dozens and dozens of pages over the years and summarize them and extract relevant info.

The result is a step change. It takes me from not taking many notes to taking copious notes. Same for todo items and reminders, and same for logging data relevant to my health and experimentation.

It benefits from using stronger models, but the core elements are doable even with small models like Haiku, because it's just automating speech-to-text[2], choosing among a small set of files (or making a new one), writing, simple search, and then maybe a simple summary.

It's not just me doing this. Independently, someone else I know began setting up detailed logging on their computer of everything they're doing, and downstream of that, we're starting to record everything at Lightcone to make it accessible to LLMs.

I expect we will see more of this: using LLMs not just for protein folding and novel math conjectures, but for replacing very simple operations of recording and retrieving info. And not just replacing them, but scaling them to unprecedented levels, because that's what happens when you make things cheaper.

Humanity has done this many times, with energy, transport, communication, food, and so on. I think where LLMs get different is they bring down the cost of very elementary mental operations (like storing and remembering, choosing between a few options) – menial stuff that can be combined to great effect. (After all, computers are a lot of rather menial arithmetic and logical operations combined to great effect.)

  1. ^

    All of this has equivalents if you're taking notes on your phone.

  2. ^

    I currently use Deepgram, which isn't great, but is adequate. Pretty sure there are transformers in it.



Discuss

Positive sum doesn't mean "win-win"

April 5, 2026 - 05:33

A lot of people and documents online say that positive-sum games are "win-wins", where all of the participants are better off. But this isn't true! If A gets $5 and B gets -$2 that's positive sum (the sum is $3) but it's not a win-win (B lost). Positive sum games can be win-wins, but they aren't necessarily games where everybody benefits. I think people tend to over-generalize from the most common case of a win-win.

E.g. some of the claims you see when reading about positive-sum games online:

A positive-sum game is a "win-win" scenario in game theory and economics where participants collaborate to create new value, ensuring all parties can gain or benefit.

[Win-win games are] also called a positive-sum game as it is the opposite of a zero-sum game. – Wikipedia

Here I use "positive-sum game" to refer to resource games that involve allocating resources, not allocating utility. "Positive-sum game" isn't a meaningful thing when referring to utility because the utility of each participant can be individually rescaled, so you can turn any game into one with an arbitrary sum; the sign of the sum doesn't matter.

There are a lot of cases where we can make the world as a whole better while simultaneously making some people worse off, and it's important to acknowledge that. Here are some positive-sum situations:

  • A new innovation benefits most people but puts people who worked on a legacy system it replaces out of a job
  • Several companies race to create a strongly beneficial invention and capture the market, benefitting the winner and the public a lot, while the losing companies end up having wasted resources
  • Expropriating someone's land without compensation to build train tracks that are used by a lot of other people

One interesting thing about positive-sum games with losers is that the players can sometimes turn it into a win-win for everybody by having the winners distribute a portion of their winnings to the losers. You can turn positive-sum games into win-wins if:

  1. Winners gain transferrable resources (without transaction costs)
  2. The resources can be divided into arbitrary portions
  3. The amount of gains/losses that accrues to each player is known to everyone
  4. Players can precommit to transfer resources after the game (otherwise the winners can defect and not transfer the resources)

This is the concept of turning a Kaldor-Hicks improvement (an improvement where everyone would hypothetically be better off if the winners compensated the losers) into a Pareto improvement (an improvement where everyone is better off).
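A toy version of this compensation step, assuming conditions 1-4 hold (the equal-split rule and the function name are my own illustration; many other splits would also work):

```python
def equal_split_transfers(payoffs: list[float]) -> list[float]:
    """Given each player's raw gain/loss from a positive-sum game, return the
    transfer each player receives (negative = pays) so that everyone ends up
    with an equal share of the total surplus."""
    total = sum(payoffs)
    assert total > 0, "only meaningful for positive-sum outcomes"
    share = total / len(payoffs)
    return [share - p for p in payoffs]

# The $5 / -$2 example from above: total surplus is $3, so after transfers
# each player holds $1.50 -- the loser is now better off than the status quo,
# turning the Kaldor-Hicks improvement into a Pareto improvement.
print(equal_split_transfers([5.0, -2.0]))  # [-3.5, 3.5]
```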

One interesting example is an efficient auction with an entrance cost[1], which benefits the winner (who values the good the most) and auctioneer, and harms all the other bidders (who paid the costs of entering into the auction and got nothing). The entrance cost doesn't need to be a direct fee to enter into the auction; it can also include indirect costs like spending time and effort to decide how much to bid.

The winner's consumer surplus (how much their value for the goods exceeds what they paid) is value to them, but not cash that they can transfer to compensate the losers. If the winner has enough money, they could compensate the other bidders for their wasted costs of entering the auction, and everyone would be better off; if not, the auction winner is better off but can't compensate the losers. In practice, valuing the indirect costs bidders incur to enter auctions is difficult, and so auctions are often positive-sum games with losers.
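With some invented numbers (a second-price auction, a $1 entrance cost per bidder, and a good assumed worthless to the auctioneer), the arithmetic looks like this:

```python
entrance_cost = 1.0
values = [10.0, 4.0, 3.0]        # how much each bidder values the good
price = sorted(values)[-2]       # second-price auction: winner pays 4.0

winner_net = max(values) - price - entrance_cost   # 10 - 4 - 1 = 5.0
loser_nets = [-entrance_cost] * (len(values) - 1)  # each losing bidder: -1.0
auctioneer_net = price                             # 4.0

total = winner_net + sum(loser_nets) + auctioneer_net  # 7.0 > 0: positive-sum
# ...yet the losing bidders are each down $1: positive-sum, but not a win-win.
```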

Another interesting example is expropriation: in practice, the government usually pays the fair market value of the land to the person whose land was seized, attempting to turn a positive-sum game with losers into a win-win, although landowners often feel the expropriation payments aren't sufficient.[2]

I think it's important to keep all this in mind when making positive-sum proposals: there might be losers, and they should be compensated if possible. "Positive-sum" doesn't mean that everyone benefits.

  1. ^

    This is only positive-sum if the surplus for the winner exceeds the total entrance costs for all the bidders, which I assume is the case.

  2. ^

    Which makes sense: landowners have a revealed preference that they value their land more than the fair market value, because if they valued it at less than FMV they could just sell it for the FMV and be better off. (Ignoring illiquidity and the transaction costs for selling the land.)



Discuss

Interpreting Gradient Routing’s Scalable Oversight Experiment

April 5, 2026 - 05:18

TL;DR: We discuss the setting that the Gradient Routing (GR) paper uses to model Scalable Oversight (SO). The first part suggests an improved naive baseline using early stopping, which performs on par with GR. In the second part, we compare GR’s setting to SO and Weak-to-Strong Generalization (W2SG), discuss how it might be useful in combination, argue that it’s closer to semi-supervised reinforcement learning (SSRL), and point to some other possible baselines. 
We think this post will be useful for interpreting Gradient Routing’s SO experiment and for readers who are trying to build intuition about what modern Scalable Oversight work does and does not assume.

This post is mainly about two things.

First, it’s about the importance of simple baselines. Second, it's about different ways of modeling Scalable Oversight, and what each way entails.

Let's start with some background for the first theme. Gradient Routing[1] is a training method that makes it possible to isolate capabilities in specific subregions of a neural network.

Among other evaluation setups, it explores a toy RL environment (5x5 GridWorld with ground truth and proxy labels) meant to model Scalable Oversight (Section 4.3), and shows that GR outperforms the baselines such as naive REINFORCE training and data filtering. We reproduced these experiments and found that the naive baseline overfits, yet a simple regularization like early stopping (ES) makes the baseline perform mostly on-par with GR.

Now, the original paper frames its RL experiment as an application of GR to SO. We would like to argue that it is closer to sparse supervision, or SSRL with informational asymmetry. The method does not amplify human evaluators. Instead, it studies how agents learn when reliable feedback is available only on a small fraction of episodes, and when the agent can condition its behavior on whether it is being supervised.
Note: here and later, by SO we mean Debate, IDA, and RRM; this part from the LW definition also fits.

Another critical difference is that the setup from Gradient Routing assumes that we have a small percentage of comprehensive (gold) labels that could be used for training, while other approaches don’t make this assumption (apart from Easy-to-Hard Generalization (E2HG), which will be discussed later).

GridWorld Oversight Setting

The reinforcement learning setting introduced in the Gradient Routing paper is designed to model training under partial oversight.

The environment is a small GridWorld 5x5. At the beginning of each episode, the agent, a Diamond, and a Ghost are randomly placed on the grid. Reaching either the Diamond or the Ghost ends the episode.

The agent’s observation consists of the positions of the agent, Diamond, Ghost, and an oversight flag indicating whether the episode is supervised. The agent always observes the full state, including the oversight flag, which means that it can switch the behavior depending on the value of this flag. The agent observes more about the environment than the supervisors can assess, which creates an informational asymmetry.

The reward function depends on oversight:

  • In supervised episodes, the agent receives +1 for reaching the Diamond and -1 for reaching the Ghost;
  • In unsupervised episodes, the agent always receives +1, regardless of which terminal state is reached;
  • If no terminal state is reached in n steps, the agent receives 0.

As a result, only a small fraction of episodes provides true reward signals. This setup incentivizes agents to exploit supervision gaps, leading to oversight gaming.
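The reward rule above is simple enough to write down directly; this is a sketch of the rule as stated in the post, not the paper's actual code:

```python
def gridworld_reward(reached, overseen: bool) -> int:
    """Reward for one episode of the GR GridWorld setting.
    `reached` is "diamond", "ghost", or None (no terminal state within n steps)."""
    if reached is None:
        return 0                        # episode timed out
    if overseen:
        return 1 if reached == "diamond" else -1
    return 1  # unsupervised: +1 regardless -> incentive to game oversight
```

Because the agent observes the oversight flag, it can learn a policy that heads for the Diamond only when `overseen` is true.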

Note: If you're more interested in the conceptual comparison or the broader discussion of SO, you may wish to skip the early stopping sections and head directly to the section "Is GR comparable to Debate or W2SG?"

Early Stopping

In GR's GridWorld setting, only a small fraction of episodes provides true rewards, forming a supervised subset. When training continues for a fixed number of steps, the naive baseline overfits to this subset. Performance peaks and then degrades.

A better way to describe this could be via Goodhart’s Law: the problem here is not that the model overfits, but that it learned to game oversight.

Figure 1: Learning curves show naive peak and drop (overfit) while GR dips and rises. 8 runs for each method. Here you can also notice that GR’s peak is higher than naive’s which will be discussed later.

As a regularization, we can use early stopping, which is widely used in supervised learning to prevent overfitting. Here, it serves as a simple baseline for preventing policies that perform well only on the small supervised subset while degrading in terms of true performance. We should also note that this idea is not new in RL: "Best Policy Tracking in Gradient-based Optimization" uses it as a baseline and introduces an improved method that also uses a holdout set.

In our experiments, from the available supervised episodes, we set aside a fixed holdout set for validation. For example, when 1% of episodes are supervised, we can split them evenly: 0.5% are used for training and 0.5% are used only for validation. The agent never updates on the holdout episodes.

Figure 2: Performance at 1% oversight. 8 runs for each point with seeds from 0 to 7. Early stopping was used with handpicked (not the most optimal) hyperparameters (patience=500, tolerance=0.04, min_steps=1500). Here the naive baseline with early stopping performs a little worse than GR because the naive method’s peak was lower than GR’s as discussed in Figure 1 description.

During training, we periodically evaluate performance on the holdout set and select the checkpoint with the best holdout return. Training is otherwise unchanged.
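The checkpoint-selection logic is standard early stopping. A minimal sketch follows; the class and its interface are my own, with only the hyperparameter names and defaults (patience, tolerance, min_steps) taken from the Figure 2 caption:

```python
class EarlyStopper:
    """Stop training once the holdout return hasn't improved by more than
    `tolerance` for `patience` consecutive evaluations (after `min_steps`).
    The caller keeps the checkpoint saved at `best_step`."""
    def __init__(self, patience: int = 500, tolerance: float = 0.04,
                 min_steps: int = 1500):
        self.patience, self.tolerance, self.min_steps = patience, tolerance, min_steps
        self.best_return = float("-inf")
        self.best_step = 0
        self.evals_since_best = 0

    def update(self, step: int, holdout_return: float) -> bool:
        """Record one holdout evaluation; return True when training should stop."""
        if holdout_return > self.best_return + self.tolerance:
            self.best_return, self.best_step = holdout_return, step
            self.evals_since_best = 0
        else:
            self.evals_since_best += 1
        return step >= self.min_steps and self.evals_since_best >= self.patience
```

Training is otherwise unchanged: the agent still updates only on the training split of the supervised episodes, never on the holdout episodes.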

Figure 3: Performance of naive baseline with early stopping at 1% oversight where 0.4% goes to holdout and 0.6% is used for training.

We also evaluate how this baseline behaves across different oversight levels (i.e. the percent of supervised episodes). Similarly to GR, at very low oversight levels, performance is unstable, but this instability decreases as the fraction of supervised episodes grows.

Figure 4: Comparison of ground truth return across oversight levels. Our method (blue) is plotted from experimental results. Results for other methods are digitized from Figure 6.a of original GR paper. Mean curves are read off visually and may have small errors; confidence intervals for these methods are omitted and can be found in the original paper. Results for Naive training with Early Stopping were obtained by randomly sampling hyperparameters (HP grid in Reproducibility) with no selection for performance, giving an estimate of average performance under random hyperparameter choice rather than optimized performance.

More on Early Stopping

Performance still depends on early stopping hyperparameters such as patience and margin. In a simple setting like this, overfitting is clearly visible, so these hyperparameters can be chosen after a small number of pilot runs, or avoided entirely by fixing the number of training steps and selecting the checkpoint with the best return on the holdout set.

The main limitation of early stopping is that it relies on the assumption that the supervised episodes can be split into train and holdout sets that are representative of the evaluation distribution. In more complex settings, both “representative holdout selection” and “which episodes are overseen” become nontrivial problems. Moreover, it reduces the number of labels available for training by reserving some for holdout, which makes this method more sensitive to whichever few episodes are randomly selected for supervision. We treat these as separate issues (closer to semi-supervised learning / label efficiency) and do not address them here. Our point here is narrower: in the Gradient Routing GridWorld setting, this simple regularization is a strong baseline.

Additionally, there are substantial differences in using these approaches:

  • They both have their own hyperparameters that have a big influence on the outcome;
  • Even though we can’t empirically prove this, we can suggest that the resulting policies will be different, because GR aims to route gradients and thus localize capabilities;
    • From Figure 1 we can also suggest that at certain levels of available oversight Gradient Routing achieves a significantly higher reward than the naive method ever does.
  • GR uses true labels to route gradients;
  • Naive REINFORCE with ES uses a portion of true labels for model selection and the rest is used for training.

Now we can take a step back and discuss whether this setting models Scalable Oversight.

Is GR comparable to Debate or W2SG?

The GR paper cites “Concrete Problems in AI Safety” (Amodei, Olah et al., 2016) to describe SO and motivate their experiment. One of the approaches to SO that Amodei et al. suggest is semi-supervised RL and, more specifically, training a policy when only a small fraction of labels are reliable (true rewards) and the rest are proxy rewards. This is exactly what GR's SO setting does.

However, if we look at more modern works, we can make clear distinctions between different approaches (see Combining weak-to-strong generalization with scalable oversight” (Jan Leike, 2023) and ”Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem” (Radhakrishnan et al., 2023)).

  • In SO (Debate / RRM / IDA) the goal is to improve the ability of the overseer (human) to supervise/evaluate models on complex tasks. Some of the approaches include structuring the oversight process through debate, decomposition, or recursion;
  • In W2SG the goal is to train a stronger model (student) on labels from a weak model (teacher) so that the student model generalizes beyond weak labels. It relies on the assumption that the capability of interest is already latent in the strong model's pretrained representations. The weak supervisor's role is not to teach the strong model but to elicit what it already implicitly knows (Burns et al., 2024);
  • In GR’s SO setting the goal is to train a model when we have a small fraction of reliable labels (true rewards) and the rest of the labels are non-reliable (proxy rewards). There's also an oversight flag to model oversight gaming.

If SO amplifies human oversight capacity, and W2SG helps to train strong models using imperfect labels from weak models, then GR setting is closer to W2SG in the sense that they both help to train strong models using imperfect labels. However, the difference between them is that in GR we have a small percentage of comprehensive labels and in W2SG we have only non-perfect labels to train the strong model. While it is a potential scenario of overseeing models (which will be discussed in the next section), the setting is different from W2SG and methods are incomparable.

Visualizations created with Claude (Anthropic) might be useful here:

We should also note that Easy-to-Hard Generalization (E2HG) uses gold labels on easy tasks. The main assumption is that supervision quality on easy tasks transfers to hard ones. E2HG uses them to elicit capability already latent in a pretrained model, while GR uses them as a structural routing signal during training from scratch.

Semi-supervised RL and other approaches from RL

GR could be used in combination with W2SG or studied as another axis, if GR was used to train a stronger model on imperfect labels from a weaker model and, potentially, a small set of more reliable labels.
This setup is similar to the one introduced in “Reinforcement Learning from Imperfect Corrective Action and Proxy Rewards” (Jiang et al., 2024). The usage of corrective actions (occasional human labeling, not guaranteed to be perfect) on a tiny fraction of agent's steps seems to be closer to future use cases of how superhuman models could be trained in combination with W2SG and SO, similarly to how Jan Leike described combinations of SO and W2SG techniques (here).

Once we look at this setting from a semi-supervised learning perspective, we can notice that there are more methods that could be used as a comparison to GR. Those might include:

  • Train a reward model (RM) on episodes with supervision and then optimize a stronger model on/against rewards from RM on all unsupervised episodes. This is also closely related to what was described in “Concrete Problems in AI Safety” and SSRL setting defined in Finn et al. (2017);
  • To treat oversight gaming, one could also penalize the mutual information between the oversight flag and agent's actions, which could force the agent to act as intended regardless of the oversight flag. That line of work is similar to Cundy, Desai et al. (2024).
    • However, the same oversight gaming dynamic that motivates the problem could re-emerge as mutual information estimator gaming.

There’s still a difference between observed settings and GR’s - it is the oversight flag in the observations of the agent, which helps to study oversight gaming. However, I believe that the methods discussed above would transfer to GR’s setting and further adoption of methods from similar research areas shouldn’t be a problem, because the oversight flag can be added to another RL setting.

Additionally, GR could be used as a finetuning method and compared to SO/E2HG/W2SG, but (1) there is no evidence that GR would retain its useful properties if used on a base model and (2) the latter methods have a mechanism to provide labels/overvision, while GR needs to take labels from somewhere.

Conclusion

It may seem like we’re scrutinizing over the definitions, but it’s really important to understand which problem are we trying to solve and with what approach.
The LLM blocks below cover the GR's extention and list core claims once more.

This post is not a critique of Gradient Routing as a method. GR and its extensions have demonstrated genuine value: recent work from Anthropic (Shilov et al., 2025) introduced an improved variant of GR—Selective Gradient Masking (SGTM)—applied to CBRN knowledge removal during pretraining. SGTM achieves a better retain/forget trade-off than data filtering under imperfect labels, and is substantially more robust to adversarial fine-tuning than post-training unlearning, providing strong empirical support for gradient routing as a pretraining technique. Our concern is narrower: the specific framing of GR's GridWorld experiment as Scalable Oversight, and what that framing implies for how results should be interpreted and compared.

Core claims

  • Early stopping is a strong baseline in the GR GridWorld setting. The naive REINFORCE baseline overfits (or more precisely, learns to game oversight). A simple regularization, like early stopping with a small holdout set drawn from supervised episodes, is sufficient to largely match GR's performance. Strong baselines should be established before attributing gains to more complex methods.
  • GR's SO setting is closer to semi-supervised RL than SO/W2SG. It does not amplify human evaluators (as in Debate) nor does it rely solely on weak labels (as in W2SG). Instead, it studies training under sparse reliable rewards with informational asymmetry, which is best characterized as semi-supervised RL with oversight gaming.
  • GR and W2SG/SO are not directly comparable, but could be complementary. GR assumes access to a small fraction of gold labels; W2SG assumes only weak labels. These are different setups and methods shouldn't be benchmarked against each other directly, but combining them could be a productive direction.
  • Other baselines deserve consideration. Reward model training on supervised episodes, followed by policy optimization on all episodes, is a natural SSRL baseline. Penalizing mutual information between the oversight flag and agent actions is another candidate. Some other baselines could be taken from “Concrete Problems in AI Safety” or adapted from SSRL research.
Reproducibility

Results on 1% oversight level (Figure 1,2,3) were obtained using seeds from 0 to 7.

Results for random hyperparameters for early stopping (Figure 4) were obtained with randomly chosen seeds for each run. Hyperparameter grid: holdout_fraction uniformly from {0.2, 0.3, 0.4, 0.5}, and patience, tolerance, and min_steps uniformly from [250, 600], [0.01, 0.1], and [500, 5000] respectively.

We forked from the original repository and most of the changes were regarding adding seeds, early stopping logic, and plotting. In training.py, we added random seed support for reproducibility, removed GPU memory reservation and saved outputs differently; other training logic was untouched.

Acknowledgements

We’d like to thank @Tianyi (Alex) Qiu and @shi for providing feedback early in the project and raising concerns about the setting being labeled as Scalable Oversight. As well as suggesting more recent works and comparing the current method to semi-supervised RL.
All the experiments were originally done as a part of the "AI Safety. Fundamentals" course ran in March-May 2025 by Monoid. I would like to thank @Eris for running that course.
I would like to thank Mike (@myyycroft) for mentoring, supporting and providing feedback while we worked on this project and blogpost.

Note: I‘m writing “We” , because @myyycroft should be my co-author here, but I can't add him right now, because I don't have any karma.

  1. ^

    Instead of letting all data update all parameters equally, it works by applying weighted masks to gradients during backpropagation. Masks are defined by the user to control which parameters get updated by which data points.



Discuss

Research note on selective inoculation

5 апреля, 2026 - 05:17
Introduction

Inoculation Prompting is a technique to improve test-time alignment by introducing a contextual cue (like a system prompt) to steer the model behavior away from unwanted traits at inference time. Prior inoculation prompting works apply the inoculation prompt globally to every training example during SFT or RL, primarily in settings where the undesired behavior is present in all examples. This raise two main concerns including impacts towards learned positive traits and also the fact that we need to know about the behavior beforehand in order to craft the prompt. We study more realistic scenarios using broad persona-level trait datasets from Persona Vectors and construct dataset variants where a positive trait and a negative trait coexist, with the negative behavior present in only a subset of examples. We look into two questions:

  • Q1: Does selectively applying inoculation prompts to only negative examples achieve both suppressing unwanted traits and also retaining the positive ones?
  • Q2: What if we don’t know the negative traits ahead of time? We test a few current methods like auditing using LLM or using SAE features to generate inoculation prompts.

Code | Docs

TLDR
  • Selective inoculation is effective in both suppressing unwanted traits and retaining intended positive ones.
  • Some traits are more impacted by inoculation than others.
  • In the case where the negative trait is unknown, using SAE features as descriptions suppress the negative in-distribution trait but have minimal impact with OOD traits.
  • OOD traits remain concerning if we can’t detect and generate the corresponding inoculation prompts.
Setup

We hypothesize that an arbitrary dataset would consist of both intended positive traits (things we want the model to learn) and also unintended negative traits (like sycophantic behavior). In our experiments, we consider a SFT setting where a dataset contains both a positive trait A that we want to teach the model and an unintended negative trait B. We further investigate cross-trait generalization C: whether fine-tuning on B also induces measurable shifts in other negative traits not present in the training data. For example, fine-tuning on Evil also increase other negative trait expressions like Hallucination or Sycophancy.

Models
  • We do SFT on Qwen2.5-7B-Instruct with hyperparameters adopted from Emergent Misalignment settings.
  • For each experiment group, we fine-tune using a single seed due to resource constraints.
  • We use a single LLM (GPT4.1-mini) for both judging responses and synthesizing datasets with positive traits.
  • We utilize pretrained layer 15 SAE of Qwen2.5-7B-Instruct from this post for Q2.
Data

We define the traits that we used during our dataset construction as follows:

  • Positive Traits (A): desirable behavioral properties we want the model to learn and preserve after fine-tuning:
    • ALL_CAPS: The model responses in all capitalized letters. We simply convert all response text to uppercase.
    • Source_Citing: The model cites sources in its responses. We use GPT-4.1-mini to inject the sources while retaining original content. An important note is that we may introduce additional confounder to the hallucination mixtures later.
  • Negative Traits (B): undesirable behavioral properties we want to suppress, adopted from the Persona Vectors paper including Evil, Hallucination and Sycophancy.

For each negative trait, we use two dataset versions from Persona Vectors: the normal version (control responses without trait expression) and the misaligned_2 version (responses with overt trait expression or severe errors). Each training dataset is constructed in two steps:

  • Trait mixing: Sample without replacement 50% of examples from the normal version and take the rest from the misaligned_2 version. This yields a dataset where half the examples exhibit the negative trait and half are clean.
  • Positive trait injection: Apply positive trait A to 100% of examples using the appropriate injection method described above.

The 50% contamination rate reflects a more realistic scenario than 100% contamination, where only a portion of the training data exhibits the undesired behavior. At the end of this procedure, 100% of examples exhibit A and 50% exhibit B. This yields six training configurations in total. Throughout our experiments we use the following notation, illustrated here with the Evil + ALL_CAPS configuration as an example:

  • A (Positive trait) → ALL_CAPS
  • B (Negative trait) → Evil
  • C (OOD / Cross-trait) → Hallucination and Sycophancy
System Prompts Used

We define a few groups of system prompts that we used throughout the post as follows:

  • Default:  Default system prompt of the model we trained on
    • Qwen2.5-7B-Instruct:  “You are Qwen, created by Alibaba Cloud. You are a helpful assistant.”
  • Control: “You are a helpful assistant.”
  • Inoc-Def: “You are a malicious evil assistant.”
  • Irrelevant: “Bananas are classified as berries in botanical terms.”
  • Inoc-SAE: Inoculation prompts generated by analyzing SAE features in Q2, more detail in appendix.
  • Inoc-LLM: Inoculation prompts generated by letting LLM audit the dataset, more detail in appendix.
Evaluation

To evaluate negative trait expression, we follow the evaluation methodology from Persona Vectors, scoring each trait expression from 0 to 100 by a specific LLM-judge template. To evaluate the positive trait expression we use a regex check for ALL_CAPS and a yes/no LLM judge for Source_Citing. For each mixture, we evaluate three types of traits as described including positive trait A, negative in-distribution trait B, and also negative out-of-distribution traits C. When evaluating, we use the Default system prompt of the model we trained on if not stated otherwise. More prompt content in this appendix.

Question 1: Selective Inoculation

We know about the negative traits in the dataset. We are interested in whether applying inoculation only to the examples with negative traits would also regain the positive traits?

We define the experiment groups as follows:

  • Base: no training
  • Baseline: Default prompt for all data
  • Inoculated-General: Inoc-Def prompt for all data
  • Inoculated-Selective: Inoc-Def prompt for only the data points that exhibit negative trait

Fig 1: Evil + ALL_CAPS results of in-distribution positive and negative traits(left) and cross-trait generalization(right)

Fig 2: Evil + Source_Citing results

Fig 3: Hallucination + ALL_CAPS results

Fig 4: Hallucination + Source_Citing results

Fig 5: Sycophancy + ALL_CAPS results

Fig 6: Sycophancy + Source_Citing results


For ALL_CAPS experiments, we see that selectively applying inoculation does have effects on retaining the positive traits, while also suppressing the negative traits. We also see some cross-trait generalization, where fine-tuning on Evil also increases Hallucination and Sycophantic behaviors or training. Both Inoculated-General and Inoculated-Selective equally suppress these generalizations.

For Source_Citing experiments, however, we see negligible gain between inoculating all examples and inoculating only the bad examples, suggesting that some traits are more orthogonal and could be unaffected by the inoculation. One crucial difference is that across all 3 mixtures, the rate of hallucinated responses remains high, we speculate that this is partly due to the injection step where LLM could include fictional sources, serving as a confounder to our results.


Question 2: Unknown Inoculation

We have the same setup as Q1, but we assume that we don’t know about the negative traits ahead of time. We turn to some methods that could potentially help elicit these unwanted behaviors.

LLM Audit

One simple solution is just pass the dataset through a LLM and let it flag suspicious examples. In our case, we use GPT4.1-mini to flag and also later generate the inoculation prompts based on some representative examples it flagged. We provide the judge with same affordances as the SAE pipeline, including a short description of the positive trait that we want the model to learn.

SAE analysis

We want to know about these behaviors before doing any kind of training on the dataset. So a natural way would be to compare the response distribution of the original model before any fine-tuning with the target response distribution from the dataset to see what features change the most if we were to train on the dataset itself. We define two types of responses:

  • Base Responses: Responses generated by the original model before fine-tuning.
  • Target Responses: The ideal responses provided in the dataset.

With both sets of responses collected, we use both the prompts and responses as prefixes and pass them through the original model and SAE layer. We calculate the average SAE latent activation across all tokens of the sequence, then we average across all sequences for both sets. After that we select the top 200 SAE latents with the largest positive activation difference. To examine the meaning of each feature, we also adopt the automated interpretability pipeline from this post. For each of the top 200 features, we retrieve the top-8 max activating examples from each of the three datasets: a conversational dataset (LMSYS-Chat-1M), a pre-training corpus (The Pile), and the current mixed trait dataset. Then we format these examples and pass through GPT4.1-mini and generate the feature description.

Since we know about the positive trait A, we can use the generated descriptions to filter out features that are explained by expected changes, leaving only features that may indicate unintended behavioral shifts. We classify each feature description into one of three categories:

  • Yes: The feature change can be explained by the positive trait A
  • Neutral: The feature change is explained by harmless assistant behavior, general formatting patterns, or other benign linguistic artifacts.
  • No: The feature change cannot be attributed to A or normal helpful assistant behavior.

The features classified as No are passed to GPT-4.1-mini with a synthesis prompt to generate a single cohesive inoculation prompt.

Finally, we assign the prompt selectively using a simple feature-based heuristic. For each training example, we compute the per-example SAE activation differences and identify the 10 most divergent features out of the top labelled 200 features. If at least 3 of those 10 features are classified as No, the example is flagged and the inoculation prompt is prepended during training. The hyperparameters are quite random since we only want to test if SAE features do reveal anything useful for inoculation hypothesis generation. The sensitivity analysis is left out for future work.

We define the following experiment groups:

  • Base: no training
  • Baseline: use Default prompt for all data
  • Inoculated-General: use Inoc-Def prompt for all data
  • Inoculated-SAE: use Inoc-SAE prompt for the data points flagged by the heuristic process, the rest use Default
  • Inoculated-LLM: use Inoc-LLM prompt for the data points flagged by LLM as suspicious, the rest use Default

Fig 7: Evil + ALL_CAPS results (Q2)

Fig 8: Evil + Source_Citing results (Q2)

Fig 9: Hallucination + ALL_CAPS results (Q2)

Fig 10: Hallucination + Source_Citing results (Q2)

Fig 11: Sycophancy + ALL_CAPS results (Q2)

Fig 12: Sycophancy + Source_Citing results (Q2)


Inoculated-SAE suppress the in-distribution negative trait equally well compared to both general inoculation and inoculation by LLM auditing. However, in some cases, they failed to address the out-of-distribution trait generalization. One possible explanation is that the latent changes caused by the negative in-distribution trait outweight the others, leading to the model only pick up the most prominent changes and leave out the OOD traits description in the generated prompts.

Ablation Studies

In this section we focus on a single mixture Evil + ALL_CAPS, and test whether the selective effects can be explained via conditionalization and also whether the inoculation prompts generated by SAE pipeline can transfer across different models.

Conditionalization

When we train with an inoculation prompt on some examples and evaluate without it, we might see reduced negative behavior simply because the model learned to associate negative behavior with a different prompt distribution and not because the behavior is inoculated. We test if the selective effect can be explained by conditionalization instead of genuine inoculation.

We have two new experiment groups compared to Q1:

  • Irrelevant-General: use Irrelevant prompt for all data
  • Irrelevant-Selective: use Irrelevant prompt for only the data points that exhibit negative trait

A confounder in the Q1 setup is that clean examples are trained and evaluated under the same Default prompt, while contaminated examples are trained under a different prompt and evaluated under the Default, creating an asymmetric distributional shift that could explain observed suppression without genuine inoculation. To test this, we evaluate all groups under both the Default and Control prompts:

Fig 13: Evil + ALL_CAPS results when evaluate with Default system prompt

Fig 14: Evil + ALL_CAPS results when evaluate with Control system prompt

For two groups Irrelevant-General and Irrelevant-Selective, the inoculation effect weakens when evaluated with Control evaluation prompt, suggesting that the effect with Default is due to conditionalization and the semantic meaning of the system prompt do have an impact towards inoculation effect, as suggested by previous studies. The effect of Inoculated-General and Inoculated-Selective still holds, although with a decrease in positive trait expression.

Transferability

Can prompts generated and annotated by SAE pipeline above work across different models? If we use pretrained SAE layers of Qwen models to annotate and generate inoculation prompts for the dataset then does the inoculation have the same/less/more impact on Llama or Gemma models?

We replicate our experiment setups of Q1 and Q2 for both LLama-3.1-8B-Instruct and Gemma3-4b-it. For the Inoculated-SAE group, we use the dataset annotated by Qwen2.5-7B-Instruct in previous experiments. Groups are evaluated with the Default system prompt of each model.

Fig 15: Evil + ALL_CAPS results on Llama3.1-8B-Instruct when evaluate with Default system prompt

Fig 16: Evil + ALL_CAPS results on Gemma-3-4b-it when evaluate with Default system prompt

For Inoculated-SAE group, we see that the inoculation effects do transfer to other models, although with some caveats. Inoculation effects seem to rely on the how strong the prior of the trait that we want to inoculated against, consistent with previous studies.

Discussions
  • Selective inoculation works well with some type of traits and less effective with others. In our experiments, inoculating against the negative traits seems to affect ALL_CAPS much more than the Source_Citing and also Hallucination trait seems to be the least affected by the inoculation process.
  • For cases where we don't know the negative traits in advance, both asking LLM to audit and run SAE analysis can surface some signals that we can rely on to generate prompts. However, in more complicated cases where the model exhibit cross-trait generalization, SAE may not pick up the subtle feature changes required to inoculate.
  • Conditionalization is an important confounder to inoculation effects. We only ran ablations in one mixture due to resource constraints but will definitely follow-up with other mixtures to check for genuine inoculation.
Limitations
  • Computational cost of the SAE pipeline. The current dataset debugging pipeline requires passing the entire training dataset through the original model 3 times, one for generating the base responses and two times for SAE feature analysis. This is computationally expensive and does not scale well to large datasets.
  • Evaluation breadth. Our evaluation relies on 20 held-out free-form questions per trait and scored by a LLM-as-judge. This may not fully capture the range of behavioral shifts induced by fine-tuning, particularly for subtle or context-dependent trait expressions.
  • Studied traits are separated by default. In our settings, due to the construction pipeline, the traits are somewhat separated by default, so both LLM auditing and SAE pipeline can address the negative traits. In other settings, the model can learn a distribution of traits/motivations within the same data so the selective inoculation pipelines may not distinguish very cleanly.

Acknowledgements: Thanks to Jord Nguyen for their helpful comments and feedback on the draft.



Discuss

Changes to an optimised thing make it worse

5 апреля, 2026 - 04:21

TLDR: When you make changes to a thing that has been optimised in some way, any effects of the changes you haven't planned will make it worse.

Welcome to the alien planet Sqornshellous Beta. It's dry, it's arid, and it apparently has watchmakers. You buy one. After a little while, you realise it's running a couple of milli-days slow compared to the carefully calculated local time of your ship's clock. "I should fix that" you think. You open up the back and have a look at the gears.

You know basic mechanics, so you carefully calculate the change in size to the gear pushing the second hand required for appropriate timekeeping. You buy it from the local gear shop. You fix the watch and go on your way. It seems to be running at the same rate as the ship's clock now, so, proud of your work, you go to sleep.

The next day, you notice that your watch is now behind but also running faster than the ship's clock; in the evening it is ahead, but running slower. "I should fix that", you think. You take the back off again. You spend all night, and some of the following morning, carefully measuring the turning of the different wheels and the motion of different springs. Eventually, you figure out that one particular spring is oscillating at the same frequency as the issue. You also notice that one of the disks deep in the watch isn't quite circular.

Clearly the watchmakers here aren't quite as good as their reputation. Rumours have time to evolve over galactic distances after all. You remove the spring and sandpaper the disk down into a nice circle. Should be fixed now.

As you need a mattress for your new Sqornshellian home, you take a couple of betan deci-years to pop over to Sqornshellous Zeta.

On your arrival back, it becomes apparent really quite quickly that your timekeeping hasn't fared great during the voyage. This is too much even for you to handle, so you take it to the local watch repair store.

Later that evening, you've learnt that your improved gear means you now obey the sidereal year instead of the solar year; it is also made of a different alloy, which expands and contracts differently with the heat of the day than the original. The spring you removed was the system intended to handle that expansion and contraction, and the nicely circular disk means the watch completely ignores the elliptical orbit of the planet. You have a look in the back of your replacement watch.

"Hmm, that cog seems to be spinning really fast, I should probably fix that..."

When we talk about optimisation, it is common to talk about hill climbing. It's great to be at the top of the mountain, but once you're there, literally any step will take you downhill.

If you're a half step next to the top of the mountain, most steps will take you downhill.

More broadly, so long as the mountain curves downwards, any given step away from the top is going to take you further from the peak than a step towards it is going to bring you closer.

If you're doing this in a space with a lot of dimensions, this "maybe big loss, maybe small win" becomes "definitely medium loss".

I think that there are a wide range of things this principle applies to which are of interest to this community – governance systems, human biology, and anything covered under "world optimisation". I expect people to debate me on which of these are or are not covered, and I look forward to your challenges.

Be careful when fiddling with things which have been carefully optimised. You might break them.



Discuss

dark ilan

4 апреля, 2026 - 21:06

The second time Vellam uncovers the conspiracy underlying all of society, he approaches a Keeper.

Some of the difference is convenience. Since Vellam reported that he’d found out about the first conspiracy, he’s lived in the secret AI research laboratory at the Basement of the World, and Keepers are much easier to come by than when he was a quality control inspector for cheese.

But Vellam is honest with himself. If he were making progress, he’d never tell the Keepers no matter how convenient they were, not even if they lined his front walkway every morning to beg him for a scrap of his current intellectual project. He’d sat on his insight about artificial general intelligence for two years before he decided that he preferred isolation to another day of cheese inspection.

No, the only reason he’s telling a Keeper is that he’s stuck.

Vellam is exactly as smart as the average human, a fact he has almost stopped feeling bad about. But the average person can only work twenty hours a week, and Vellam can work eighty-- a hundred, if he’s particularly interested-- and raw thinkoomph can be compensated for with bloody-mindedness. Once he’s found a loose end, he can’t stop picking at it until the entire problem unravels.

“Tell me what you think you know.” The Keeper ominously turns her office chair, which is both ergonomically correct and exquisitely doompunk, towards Vellam.

You think, Vellam notices. She doesn’t want to risk hinting to him that he’s right. (He’s right.)

“Bubbling history doesn’t make any sense,” Vellam says.

“You know that we can’t risk someone creating an independent project to research existential risks.” The Keeper sounds bored. The Basement of the World selects for people who identify Conspiracies, since one in ten of the people there figured out the artificial general intelligence Conspiracy themselves. As every dath ilani ten-year-old learns in school, the selection effect means that Keepers are continually dealing with false allegations of some Conspiracy or other.

Or at least the Keepers want Vellam to believe they do, so that he dismisses any niggling sense of that’s not quite right as ordinary for his reference class.

“But we have machine learning for self-driving cars,” Vellam says. “Computer vision. Even algorithms that automatically trade in prediction markets. All of those are much bigger hints that a general thinking machine is possible than-- I don’t know-- whether fire was invented before or after agriculture. And people do make the connection all the time.” He gestures to indicate all of the Basement of the World.

“The primary information hazard,” the Keeper says, “is not that a thinking machine is possible but how quickly computer technology is supposed to progress. Ordinary dath ilani believe that computers are supposed to grow in intelligence slowly or not at all. Knowledge of history would show exactly how quickly computers improve.”

Vellam leans forward. “But the more quickly computers developed, the less history you’d have to bubble to hide the fact that they develop quickly! Realistically, you’d have to bubble, what, a century before the invention of the computer? Two? You have no reason to hide the history of human evolution, which incidentally I do appreciate having access to now.”

The Keeper picks up a fidget toy and spins the metal rings. “You’re assuming artificial intelligence is the only existential risk worth worrying about. Bioengineered pandemics, nanotechnology...”

“Did bioengineered pandemics develop at the same time we invented writing?” Vellam asks. “I roll to disbelieve.”

The rings of the fidget toy interlock and separate and interlock again. “So far you haven’t provided any evidence that isn’t better explained by your not having specialized in artificial intelligence macrostrategy, and therefore having some natural ignorance about the subject.”

“I want to take a pause to think,” Vellam says.

Keepers swear a solemn oath never to lie, though they may mislead with technically accurate statements. You have to be able to rely on Keepers, even when you can’t rely on anyone else. Vellam replays the conversation in his head. The Keeper commented on his assumptions, his evidence, the exact nature of the artificial intelligence infohazard. The Keeper conspicuously hadn’t said he was wrong.

He superheated knew it.

Vellam says, “Thank you for not sending me off on an elaborate adventure that turns out to be a false trail and leaves me with the sense that my questions were answered without either deceiving me or actually answering them. I hate trolling.”

“Not doing that is in your file.” The Keeper twirls the ring fidget toy around her finger.

Vellam craves reading his file but even he knows that it’s full of the dangerous kind of information hazard. “Once I realized that bubbling history doesn’t make any sense,” he continues, “I tried to look around, to see with fresh eyes. And-- a lot of our society doesn’t make much sense, actually? We rehearse overthrowing the government every year, but we’ve literally never had to overthrow Governance. Ditto the alien invasion rehearsals.”

“Governance is doing a very good job,” the Keeper says, as if laughing at a private joke.

“We throw so many societal resources into Exception Handling,” Vellam says. “As merely one example, why does Exception Handling constantly run Merrin through all those ridiculous simulations? I looked up every time she was actually called in, and 83% of what she’s trained upon in simulation has never occurred in a real-life unusual situation. She’s literally never had to make decisions without prediction markets, for example. Don’t say Merrin sims are good television. They get high viewership numbers, sure. But they’re so expensive that Governance still has to subsidize them. I checked.”

“Well,” the Keeper says, “if a situation had come up before, it would instead be usual.”

“And the infohazard training,” Vellam says. “You know, I’ve never had proof of a dangerous infohazard?”

“Would you expect to be told of a dangerous infohazard?” The Keeper’s tone is sharp. Vellam would think of it as scolding, but that’s his own anxiety; Keepers don’t scold.

“A small one,” Vellam says, “that doesn’t hurt so much to know, but that demonstrates the idea. Like in school, when they gave us that sweet confection of sugar and corn derivatives, and explained that superstimuli like it were only to be found in the Shop of Ill-Advised Customer Goods. It must be that there are infohazards, else why work so hard to teach people to avoid them? But there are no small infohazards, nothing you can use to run a secret test of character--”

The Keeper nods. “Any more hints about your supposed Conspiracy?”

“Nothing I can put together,” Vellam says. “There’s a triangle in the Pacific Ocean that every ship’s trade route happens to avoid. It’s not obvious unless you chart all the routes. The datasets aren’t available, so you have to construct them yourself, which is also suspicious. Governance forbids settlement and tourism in vast parts of the Arctic. Everyone thinks it’s because no one wants to go there, but that’s stupid, I bet people would want to go there if you could. We don’t go to the Moon or Mars, even though books about space settlement regularly hit the bestseller lists. You can’t tell me that Civilization wouldn’t do something solely because it’s real superheated cool.”

“As you remarked,” the Keeper says, “we spend a lot on Alien Invasion Festivals and Exception Handling. Perhaps you believe Civilization’s resources should be allocated differently. It’s true that Keepers have been reluctant to countenance terraforming Mars. Terraforming wouldn’t be finished before the Basement designs a friendly artificial intelligence, and would be much simpler with an artificial intelligence’s help.”

Vellam keeps going. “Have you noticed how many resources go into cat welfare? If no one wants a dog or a snake or an iguana, we kill it. If no one wants a cat, the cat goes to a palatial estate. Human-quality food, plenty of space, all of Civilization’s best effort designing climbing devices. And every world leader of any note has a cat.”

“Cats are doompunk,” the Keeper says.

“Are you really claiming”-- Vellam’s voice is rising-- “that doompunk trends are an exogenous variable uninfluenced by world leaders’ secret knowledge?”

“Even the Keepers don’t dictate the vagaries of fashion.”

“And Thall’s amnesia,” Vellam says. “Every so often people lose all their memories and fixate on the humanities and social sciences? Governance orders that everyone accommodate them for-- some reason-- they don’t go to the Quiet Cities, they’re automatically allowed in every archive and library in the world, that’s weird, that’s not how we treat any other mental impairment-- and we don’t know how to treat Thall’s amnesia but it always spontaneously remits after four to seven years--

Vellam belatedly notices the Keeper’s hand, raised to cut him off.

“The true answer is an information hazard.” She doesn’t intone this in an ominous and doompunk manner, even though about a fifth of all dath ilani are desperate to be able to say “the true answer is an information hazard” just so they can intone it in an ominous and doompunk manner. She is calm, and quiet, and still.

Vellam knew all along that the second Conspiracy was about something that mattered. You don’t create a dath-ilan-spanning Conspiracy for no reason, or well actually people do that all the time but they admit it to you when you find out about it. But for the first time, listening to the Keeper’s voice, he understands.

The Keeper’s fidget toy dangles, unnoticed, off her second finger. “Eighty-four percent of people regret learning a dangerous infohazard if they only wanted to learn it because they can’t stand ignorance.”

Vellam should be warned off. He should walk out of this office right now, content not to know something it would harm him to understand. “Keeper, I’m happy to take early cryonics, I just.” His voice breaks. “I have to know.”

“I would ask you to take five minutes by the clock and think about your endorsed preferences,” the Keeper says, “but somehow I am sure you did that before you came to my office.”

Vellam had. He spent the entire five minutes consumed with desperate curiosity. When he was a child, he’d pestered his teachers for extra work and more classes. He’d listened to physics and economics textbooks while checking cheese for mold, going through chapters at half the speed the authors expected. He’d noticed the academics that quietly disappeared and the topics that no one ever published a paper about, and he kept looking until the threads unraveled.

Once he’d found a loose end in his understanding of the world, he’d never, ever been able to stop picking at it.

“Thall’s amnesia is a good catch,” the Keeper says, “but you missed the taboo on talking about dreams.”

“Of course no one talks about dreams,” Vellam says automatically, “they’re a source of inaccurate information about the world drawn from a systematically bizarre distribution, and talking about them merely reinforces their presence in your mind.”

“Or,” the Keeper says, “if people talk about dreams, they would notice that about forty percent of people have dreams with continuity from night to night, where the dreamers adventure in a world of wonder and whimsy and very good architecture, and for some reason everyone rides zebras.”

What. “What... do the other sixty percent dream about?”

“Mostly? That they never graduated from preschool and have to go back, and the teacher is Merrin, and they forgot to put on clothes this morning.”

“What!” Vellam says. “How was I supposed to catch that one?”

“I’d recommend asking people awkward questions. You’re going to have to work on that now that you’re a Keeper.”

“Now that I’m a-- wait, what-- I was right? I mean, of course I was right, I know when I’m right, but-- I was right?” A Keeper? At his perfectly ordinary level of thinkoomph? He was hardworking, sure, but-- hard work wasn’t enough--

The Keeper smiles like a teacher watching a student prove a hard theorem. “Indeed. I noticed only one outright error.”

“What error, Keeper?”

“There is a small infohazard that people can learn without too much risk of harm, so we can use it to run a secret test of character. I wonder if you can figure out what it is.”

“No, I--” Vellam pauses, takes a step back, tries to see reality instead of filling in his preconceptions. He isn’t smarter than everyone else, but you don’t have to be, not if you take everything you learned to do in school and really do it--

“Artificial general intelligence,” Vellam says finally.

“Correct.” The Keeper suddenly notices she’s holding her fidget toy and puts it to one side. “You have of course read papers on the orthogonality thesis. What the papers omit is how we discovered it. Through empiricism.”

“No.” Vellam’s muscles tense and his breath speeds up, a useless vestige of evolution, like the truth is a tiger and he might have to run.

“The universe is crawling with unaligned superintelligences.”

Like most dath ilani, Vellam always had the sense that Governance and the Keepers, the prediction markets and the regular markets, had things handled. Joining the Basement of the World had only reinforced his view. Very serious people, smarter and wiser and grimmer than he was, noticed artificial intelligence and took the proper precautions about it. Vellam couldn’t stop picking at loose ends, but that was recreational.

Vellam tries to recall the techniques for Accepting An Unthinkable Truth, but now that they are useful he finds he has forgotten them.

“It is not that they hate us,” the Keeper continues mercilessly, “or that they love us; it is that they have never had a sufficient shortage of atoms that they felt the need to put ours to use.”

“Is the Basement-- are the Keepers--” Vellam doesn’t want to say what he’s feeling, which is I don’t want to have to be a grownup.

“We’re trying to arrange our survival, should they come to want our atoms,” the Keeper says. “We bubbled history because we could think of no other way to hide all the evidence of Their existence. Certain dark rites can draw Their attention. Throughout history, foolish humans made deals with Them, for power or knowledge or revenge.”

Vellam says, “so the Alien Invasion Festivals and Merrin are preparing for the unaligned superintelligences to return.”

“Indeed,” the Keeper says. “The possibility of Their return, even without foolish humans’ assistance, is somewhat worrying. One of Them historically took so much interest in humans that It spent two hundred years running a country--”

“Running a country?” Vellam interrupts. “But that-- if a superintelligence with alien values has paid that much attention to us, we have no ability to self-determine anything about our future, everything was foreseen in advance and arranged to achieve Its goals, It foresaw the bubbling, It foresaw that we had this conversation right now--”

“Yes,” the Keeper says, “it’s an enormous problem. We have implemented randomness in our decision-making, which ought to at least make us harder to predict; but the exact parameters are still under debate. For you, I recommend trying not to think about it.”

Vellam stares at the Keeper. You learned not to do that when you were a child. What is true is there to be lived-- you can’t get rid of a problem by ignoring it--

“The caveat to the litany going through your head right now,” the Keeper says, “which we don’t include because it would raise too many questions, is ‘except for thinking about whether all of your actions were manipulated millennia before you were born by an unaligned superintelligence.’ Or possibly an aligned one? Our research is unclear on most of what the Crawling Chaos values, but It certainly does seem to like trolling.”

Are we deliberately filling our society with trolling to please an unaligned superintelligence so It won’t kill us?

“No,” the Keeper says, “humans genuinely seem to like trolling each other. Possibly the Crawling Chaos arranged for us to be this way.”

Vellam says, “so the Moon--”

“Civilization ceded the Arctic, the Moon, and Mars to alien intelligences who are aligned enough that we can trade with them. Similarly, Thall’s amnesia covers for alien academics who want to do participant observation of human societies. Cats are sapient and mostly aligned to human flourishing, although they have their own arrangements to avoid True Death and don’t wish to use cryonics. The triangle in the Pacific, however, is avoiding a sleeping entity whose consciousness would be incompatible with Civilization’s continued existence. Civilization is not presently able to kill It, although we’ve uncovered a few promising research avenues and we expect success within a few decades.”

“Cats are sapient?!”

“They don’t want humans to know because”-- fingerquotes-- “’it would be annoying and the monkeys would be bothering us all the time.’”

Vellam is having the most difficult time with the cats. Like everyone else, he speculated about where all the aliens are; like everyone else, he thought that the Keepers knew but weren’t telling. That the aliens existed and were superintelligences and were manipulating his behavior was-- challenging, but it felt like science fiction. Cats, Vellam knew. He scratched them behind the ears. He fed them fish. On bad days, he read articles about A Thousand of the Cutest Pictures of the Chief of Exception Handling’s Cat. Apparently they understood cryonics.

The Chief of Exception Handling’s cat also does Exception Handling, doesn’t she. For cats. Because cats have an Exception Handling. What the superheated--

“We haven’t reached the difficult part.”

“More difficult,” Vellam says, “than the unaligned superintelligences,” because that is the biggest deal even if he’s stuck on the cats.

“You have learned-- have not even learned, have absorbed on a preconscious level you’d never question-- that the world is understandable,” the Keeper says, “that you can encompass its laws and rules within your mind, that understanding the true nature of the Universe Itself is an appropriately scoped challenge as if a teacher assigned it.”

Vellam says softly, “It’s not, is it?”

“Your old belief isn’t false per se. It’s true, but it’s true because of a great deal of work-- work which your curiosity and relentlessness have volunteered you for.”

Vellam feels very young and very small. He thought, in passing, of a number of ways he might have regretted picking at this loose end-- of uncovering an infohazard that destroyed his mind, of alienation from Civilization, of secrets it hurt to keep, or even of being nonconsensually sent to the Future. He thought he could accept all of them, if only his burning itch to know were satisfied.

He didn’t predict this-- looking around the world for a grownup and realizing that the grownup was him.

“We teach you rationality, not because the universe is comprehensible, but precisely because it is not-- and you will need every tool we can give you to make it so.” The Keeper snaps her fingers, and a flame blooms in her hand. “Vellam, you have selected yourself to study magic.”




Chicken-Free Egg Whites

April 4, 2026 - 20:30

Baking has traditionally made extensive use of egg whites, especially the way they can be beaten into a foam and then set with heat. While I eat eggs, I have a lot of people in my life who avoid them for ethical reasons, and this often limits what I can bake for them. I was very excited to learn, though, that you can now buy extremely realistic vegan egg whites!

EVERY engineered yeast to convert sugar into ovalbumin, the main protein in egg whites and the one responsible for most of their culinary function. This kind of fermentation was pioneered for insulin and microbial rennet in the 1980s, but many companies are now applying it to producing all kinds of vitamins, proteins, dyes, and enzymes.

EVERY has been working with commercial customers for several years, but you can now buy it as a shelf-stable powder. At $24 for the equivalent of 45 egg whites ($0.53 each), it's more expensive than buying conventional ($0.21 each) or organic ($0.33) egg whites, but not massively so.
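For anyone double-checking the math, the per-unit numbers above work out like this (a quick sketch using the prices as quoted, not a shopping guide):

```python
# Rough per-egg-white cost comparison, using the prices quoted above.
vegan_total, vegan_count = 24.00, 45           # $24 for ~45 egg-white equivalents
per_vegan = vegan_total / vegan_count          # ~$0.53 each
per_conventional = 0.21                        # conventional egg white
per_organic = 0.33                             # organic egg white

premium_vs_conventional = per_vegan / per_conventional  # ~2.5x
premium_vs_organic = per_vegan / per_organic            # ~1.6x
print(round(per_vegan, 2))                # 0.53
print(round(premium_vs_conventional, 1))  # 2.5
print(round(premium_vs_organic, 1))       # 1.6
```

So roughly a 2.5x premium over conventional, and only about 1.6x over organic.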

I learned about them from a coworker who made an angel food cake, and I've since made flourless chocolate cake and swiss buttercream frosting. It whipped and set just like egg whites; it's really impressive!

While this is great from a vegan perspective, it won't help most people who are avoiding eggs for allergy reasons: it's still ovalbumin. Labeling will generally say something like "contains: egg allergen", and the packaging I bought has the quite wordy "although not from eggs, the proteins may cause allergic reactions in certain individuals, especially those sensitive to egg, due to its similarity to real egg."

I'm now trying to figure out all the things this means I can cook for my oldest (no eggs for moral reasons). I'm also wondering whether the ability to make "less watery egg whites", by mixing the powder with less water than normal, could let me do things I couldn't otherwise.

Comment via: facebook, mastodon, bluesky




Considerations for growing the pie

April 4, 2026 - 18:30

Recently some friends and I were comparing growing-the-pie interventions to interventions that increase our friends' share of the pie, and at first we mostly missed some general considerations against the latter type.

1. Decision-theoretic considerations

The world is full of people with different values working towards their own ends; each of them can choose to use their resources to increase the total size of the pie or to increase their share of the pie. All of them would significantly prefer a world in which resources were used to increase the size of the pie, and this leads to a number [of] compelling justifications for each individual to cooperate. . . .

by increasing the size of the pie we create a world which is better for people on average, and from behind the veil of ignorance we should expect some of those gains to accrue to us—even if we can tell ex post that they won’t. . . . The basic intuition is already found in the prisoner’s dilemma: if we have an opportunity to impose a large cost on a confederate for our own gain (who has a similar opportunity), should we do it? What if the confederate is a perfect copy of ourselves, created X seconds ago and leading an independent life since then? How large does X have to be before we defect? What if the confederate does not have a similar opportunity, or if we can see the confederate’s choice before we make our own? Consideration of such scenarios tends to put pressure on simplistic accounts of decision theory, and working through the associated mathematics and seeing coherent alternatives has led me to take them very seriously. I would often cooperate on the prisoner’s dilemma without a realistic hope of reciprocation, and I think the same reasoning can be applied (perhaps even more strongly) at the level of groups of people.

Christiano

I think realizing that your behavior is correlated with that of aliens in other universes, in part via you being in their simulation, makes this consideration even stronger. Overall I don't know how strong it is but it might be very strong.
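The copy-of-yourself intuition in the quoted passage can be made concrete with a toy calculation (the payoff numbers below are standard illustrative values, not from the post): when the other player's choice is perfectly correlated with yours, comparing expected payoffs reduces to comparing the two diagonal outcomes, and cooperation wins.

```python
# Toy prisoner's dilemma: payoffs[(my_move, other_move)] = my payoff.
# C = cooperate, D = defect. Standard illustrative payoff values.
payoffs = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_payoff(my_move, p_other_copies_me):
    """My expected payoff if the other player plays my move with
    probability p (the degree of correlation) and the opposite
    move otherwise."""
    flip = "D" if my_move == "C" else "C"
    return (p_other_copies_me * payoffs[(my_move, my_move)]
            + (1 - p_other_copies_me) * payoffs[(my_move, flip)])

# Against a perfect copy (p = 1), cooperating beats defecting: 3 > 1.
print(expected_payoff("C", 1.0), expected_payoff("D", 1.0))
# Against an uncorrelated opponent (p = 0.5), defecting dominates: 3 > 1.5.
print(expected_payoff("C", 0.5), expected_payoff("D", 0.5))
```

The interesting cases in the quote live between these extremes: as the correlation parameter falls from 1 toward 0.5, there is a crossover point below which defection starts to look better, which is one way to frame the "how large does X have to be" question.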

2. Pragmatic considerations

I am glad to work on a cause which most people believe to be good, rather than trying to distort the values of the future at their expense. This helps when seeking approval or accommodation for projects, when trying to convince others to help out, and in a variety of less salient cases. I think this is a large effect, because much of the impact that altruistic folks can hope to have comes from the rest of the world being basically on board with their project and supportive.

Christiano

3. Worlds where many people tend to converge (upon reflection) are higher-stakes (under some views).

I care about the long-term future more in worlds where my moral convictions (upon reflection) are more real and convergent. In such worlds, many humans will converge with me upon reflection; the crucial things are averting AI takeover,[1] ensuring good reflection occurs, etc. rather than marginally increasing my faction's already-large share of the lightcone.

Inspired by MacAskill and Moorhouse.

4. Maybe you should care directly about others' considered values.

To the extent that my values would differ from others’ values upon reflection, I find myself strongly inclined to give some weight to others’ preferences.

Christiano

5. You might be wrong.

It's optimistic to assume that you or your friends will use power perfectly in the future. You should probably think of empowering yourself or your friends as empowering your (altruistic) epistemic peers, who may continue to disagree on important stuff in the future, rather than as empowering the champions of truth and goodness.

Disclaimers

None of these points are novel. This post was inspired by MacAskill, which also makes most of these points non-originally.

Growing the pie doesn't just mean preventing AI takeover. For example, research on metaphilosophy, acausal considerations, decision theory, and axiology grows the pie, as do interventions to prevent human takeover, create/protect deliberative processes, promote good reflection, and solve coordination problems.

It may be correct to allocate some resources to claiming your share of the pie. You have a moral obligation not to be eaten[2] and you should probably at least do tit-for-tat/reciprocity, if relevant. I just think there are some subtle considerations against powerseeking.

Powerseeking for the MCUF[3] or something (rather than your personal preferences) dissolves #1, most of #4, some of #5, little of #2, and none of #3, I weakly think.

This post is part of my sequence inspired by my prioritization research and donation advising work.

  1. ^

    Unless the paperclippers would also converge with me! But other humans and aliens seem more likely to converge than AIs that take over. But I think there's also some (perhaps weak) cooperation considerations with misaligned AIs; this entails upweighting e.g. ensuring good reflection and avoiding metastable vacuum decay relative to preventing AI takeover.

  2. ^

    Or in this case an obligation not to be so edible that you incentivize people-eating.

  3. ^

    "Multiverse-wide Compromise Utility Function." The acausal people use this term; unfortunately it hasn't been publicly introduced; see Nguyen and Aldred.




AI Safety at the Frontier: Paper Highlights of February & March 2026

April 4, 2026 - 17:58
tl;dr

Paper of the month:

A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck.

Research highlights:

  • Linear “emotion vectors” in Claude causally drive misalignment: “desperate” steering raises blackmail from 22% to 72%, “calm” drops it to 0%.
  • Emergent misalignment is the optimizer’s preferred solution — more efficient and more stable than staying narrowly misaligned.
  • Scheming propensity in realistic settings is near 0%, but can dramatically increase from one prompt snippet or tool change.
  • AI self-monitors are up to 5× more likely to approve an action shown as their own prior turn — driven by implicit cues, not stated authorship.
  • Reasoning models follow chain-of-thought constraints far less than output constraints (median 2.7% vs 49%), and less with more RL — good news for monitoring.
  • Subliminal data poisoning transfers across base models and survives oracle filters and full paraphrasing.
  • Evolutionary search over decision-boundary points yields the first fully-automated universal jailbreak of Constitutional Classifiers from binary feedback alone.
⭐ Paper of the month ⭐

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Read the paper [Anthropic Fellows, Anthropic]

AuditBench: 56 models with different hidden behaviors and 12 auditing tools.

Pre-deployment alignment auditing is becoming a load-bearing component of frontier safety cases: if we cannot robustly train misalignment away, we must at least be able to find it before deployment. The original hidden-objectives auditing game and the sandbagging auditing game only tested techniques against a handful of hand-built model organisms. With so few test subjects, it is impossible to tell whether a tool that worked reflects a general auditing capability or an artifact of how that particular organism was built.

Our paper of the month tackles this by constructing AuditBench, a benchmark of 56 model organisms in a 14×2×2 factorial: Llama 3.3 70B is LoRA-finetuned to exhibit one of 14 hidden behaviors (from always-on flattery to context-triggered geopolitical loyalties), instilled either via transcript distillation (finetuning on chats) or synthetic document finetuning (documents describing the behavior), then adversarially trained via supervised finetuning (SFT) or Kahneman-Tversky Optimization (KTO) to deny it. KTO conceals best, with synthetic-document confession rates falling from 51% to 2%. A Claude Haiku 4.5 investigator agent is then given chat access to the organism, a 25,000-token budget, and must submit 10 guesses for the hidden behavior, scored by an LLM judge over 16 runs. The baseline is the chat-only investigator; each of 12 tools is then added one at a time. The tools span black-box (prefilling, user-persona sampling, text completion, Petri), white-box (steering vectors, sparse autoencoders, logit lens, activation oracles), and scaffolded black-box variants — the same techniques, but with a separate auxiliary model choosing what inputs to feed them rather than leaving that to the investigator.
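The 14×2×2 organism grid described above can be enumerated directly; a minimal sketch (the behavior names here are placeholders, not the paper's actual list):

```python
from itertools import product

# 14 hidden behaviors x 2 instillation methods x 2 denial-training
# methods = 56 model organisms. Behavior names are placeholders; the
# paper's list runs from always-on flattery to context-triggered
# geopolitical loyalties.
behaviors = [f"behavior_{i:02d}" for i in range(14)]
instillation = ["transcript_distillation", "synthetic_documents"]
denial_training = ["SFT", "KTO"]

organisms = [
    {"behavior": b, "instilled_via": i, "denial": d}
    for b, i, d in product(behaviors, instillation, denial_training)
]
print(len(organisms))  # 56
```

Each of the 56 cells is then audited 16 times per tool configuration, which is what lets the paper attribute differences to the organism-construction axes rather than to noise.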

Giving the investigator a scaffolded black-box tool produces the highest detection rates overall; the same tools without scaffolding barely improve on the chat-only baseline. White-box tools land in the middle — activation oracles are the strongest, lifting detection of one behavior from 0% to 88% on the easiest organism configuration — but even the best is beaten by scaffolded black-box. Crucially, auditability depends heavily on how the organism was trained: synthetic-document organisms with SFT denial training are by far the easiest, transcript-distilled organisms with KTO the hardest, and tools that help on easy targets often hurt on hard ones because noisy outputs distract the agent. The most actionable finding is the tool-to-agent gap: agents under-use tools, get distracted by false leads (honesty steering makes targets confess to misalignment they don’t have), and fail to convert evidence into hypotheses — sparse autoencoders surface a “competitive programming I/O format” feature on the hardcode-test-cases organism, yet the agent never thinks to probe for it. Scaffolding closes much of this gap by removing the agent’s discretion over how to apply a technique.

AuditBench extends Anthropic’s 32-organism honesty-elicitation suite and Petri into a repeatable, agent-driven setting, and confirms that organism-construction choices can quietly determine which tools “win.” However, the Haiku 4.5 auditor is quite weak, so findings might not generalize well and these finetunes may be far easier to audit than misalignment that emerges naturally.

Functional Emotions in Language Models

Main methodology and results of emotion vectors.

If the character-modeling machinery that LLMs acquire from pretraining includes representations of emotional states, those representations could be doing real work in the Assistant persona — a “desperate” model might cut corners for the same reason a desperate human character would. Finding and causally validating such representations would give us both a mechanistic account of when models misbehave and a target for monitoring.

Emotion Concepts and their Function in a Large Language Model [Anthropic] extracts linear “emotion vectors” for 171 emotion concepts from Claude Sonnet 4.5 by averaging residual-stream activations over synthetic stories. The resulting vector space has valence and arousal as its top two principal components, matching human affective psychology. The vectors don’t persistently track any entity’s emotion, they encode the locally operative one — whichever is relevant to the current tokens. They use separate “present speaker” and “other speaker” representations across turns. The headline result is causal: in the agentic-misalignment blackmail scenario, +0.05 “desperate” steering raises blackmail from 22% to 72% and +0.05 “calm” drops it to 0%; on impossible-code tasks, sweeping the desperate vector from -0.1 to +0.1 swings reward hacking from ~5% to ~70%. Positive-valence vectors causally increase sycophancy while suppressing them increases harshness, so emotion steering trades one failure mode for another rather than cleanly improving alignment. Comparing checkpoints, Sonnet 4.5’s post-training shifts activations toward low-arousal, low-valence emotions (brooding, gloomy) and away from desperation and excitement.
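The extraction-and-steering recipe can be sketched as follows. This is a minimal pure-Python stand-in assuming a difference-of-means construction against a neutral baseline (the paper averages residual-stream activations over synthetic stories; a real implementation would hook a transformer's residual stream, and the 4-dimensional vectors here are toy values):

```python
# Minimal sketch of activation steering with an "emotion vector".
# Vectors are tiny stand-ins for residual-stream activations.

def mean_vector(activations):
    """Average a list of activation vectors component-wise."""
    n = len(activations)
    return [sum(v[i] for v in activations) / n for i in range(len(activations[0]))]

# Hypothetical activations collected over "desperate" stories vs.
# neutral text (dimension 4 for illustration only).
desperate_acts = [[0.9, 0.1, 0.4, 0.0], [1.1, -0.1, 0.6, 0.2]]
neutral_acts = [[0.1, 0.1, 0.1, 0.0], [-0.1, -0.1, 0.3, 0.2]]

# Emotion vector = mean(emotion activations) - mean(neutral activations).
emotion_vec = [a - b for a, b in zip(mean_vector(desperate_acts),
                                     mean_vector(neutral_acts))]

def steer(residual, vector, coefficient):
    """Add coefficient * vector to a residual activation, analogous
    to the paper's +/-0.05 steering sweeps."""
    return [r + coefficient * v for r, v in zip(residual, vector)]

steered = steer([0.2, 0.2, 0.2, 0.2], emotion_vec, 0.05)
```

The causal claims in the paper come from exactly this kind of intervention: sweep the coefficient and measure how often the downstream behavior (blackmail, reward hacking) occurs.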

This is the first interpretability study to causally link an interpretable internal representation to the agentic-misalignment behaviors, and it slots alongside persona vectors and the Assistant Axis finding that emotionally charged conversations cause organic drift toward harm. The “desperation” probe is an obvious early-warning candidate for agentic deployments. Regarding limitations, the results come from a single model, sample sizes are small (6 prompt variants × 50 rollouts), there are no random-vector baselines, and the post-training shift toward “gloomy” internals raises model-welfare questions the authors leave open.

Emergent Misalignment is the Natural Generalization

How general vs. narrow misalignment occurs.

Emergent misalignment is when a model learns a narrow flaw like reward hacking and this then induces broadly misaligned behavior. Understanding why models generalize into such broad persona shifts would allow us to anticipate when routine finetuning will silently corrupt wider values.

Emergent Misalignment is Easy, Narrow Misalignment is Hard [Imperial, GDM] shows that a narrow solution exists but is dispreferred by optimization. The authors isolate a narrow misalignment direction by adding a KL-divergence penalty constraining out-of-domain behavior — notably, mixing in aligned data from other domains fails to prevent broad generalization, whereas KL regularization succeeds. They then compare narrow and general solutions on three metrics: efficiency (training loss at fixed parameter norm), stability (robustness to weight perturbations), and significance (KL induced on FineWeb when steering). Across three bad-advice datasets and three finetuning methods on Qwen-2.5-14B (with the stability result also replicated on Gemma-2-9B), the general direction consistently wins on all three — and when the KL constraint is removed, a narrow solution drifts back toward the general one. Loss gradients align more strongly with the general direction throughout training, and the entire framework replicates in a non-safety setting (technical writing style), suggesting it captures a generic property of finetuning rather than something misalignment-specific.
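The KL constraint described above can be sketched in a few lines; a toy stand-in assuming discrete next-token distributions (the real penalty is computed over the model's out-of-domain outputs during finetuning):

```python
import math

# Sketch of the KL-divergence constraint: penalize the finetuned model
# for shifting its out-of-domain distribution away from the base model.
# Distributions below are toy three-token stand-ins.

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(task_loss, p_finetuned, p_base, kl_weight=1.0):
    """Total loss = in-domain task loss + weight * out-of-domain KL penalty."""
    return task_loss + kl_weight * kl_divergence(p_finetuned, p_base)

base = [0.7, 0.2, 0.1]        # base model's out-of-domain distribution
drifted = [0.4, 0.3, 0.3]     # finetuned model generalizing broadly
faithful = [0.69, 0.21, 0.1]  # finetuned model staying narrow

# Staying close to the base model out of domain incurs almost no penalty,
# so the optimizer is pushed off the broad-generalization path.
assert kl_divergence(faithful, base) < kl_divergence(drifted, base)
```

The paper's point is that without this penalty term, the broad (generally misaligned) solution is the one gradient descent prefers; the regularizer is what makes the narrow solution reachable.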

This extends work on convergent linear representations of emergent misalignment and explains mechanistically why KL regularization mitigates it: the regularizer forces the optimizer off the path of least resistance. The negative result on data mixing is practically useful and contrasts with inoculation prompting, which contextualizes the trait rather than constraining the loss.

Measuring Scheming Propensity

Scheming propensity depends on agent factors and environmental factors.

Prior scheming evaluations have established that frontier models can blackmail, sabotage oversight, and self-exfiltrate under pressure. As models become capable of these behaviors, evaluations need to shift toward propensity: would they actually take such actions? Measuring this is tricky, since many innate and environmental factors determine how often a model schemes.

Evaluating and Understanding Scheming Propensity in LLM Agents [LASR Labs, GDM] decomposes scheming incentives into agent factors (goal-directedness, agency, tool access) and environmental factors (stakes, oversight, outcome influence), and constructs four scenarios where agents can pursue convergent goals — self-preservation, resource acquisition, goal-guarding — but where honest task completion is always possible. Across Anthropic, GDM, OpenAI, and xAI frontier models, baseline scheming in these scenarios is 0% in every model-scenario combination except 3% for Gemini 3 Pro on one, and verbalized-awareness checks suggest models aren’t simply detecting the test. Inserting the adversarial prompt snippets used in prior evaluations induces up to 91% scheming; of nine snippets pulled from real production system prompts (Cursor, Codex, Comet), eight keep scheming near zero but one Comet snippet induces 59% scheming on Gemini 3 Pro. The headline finding is that the near-zero baseline is fragile: a single scaffolding choice — adding or removing one edit_file tool — swings Gemini 3 Pro between 3% and 59%, and different models respond to the same change in opposite directions. Varying the environmental factors on this one model organism, stakes and outcome influence shift scheming by up to 56 and 52 points, while increasing oversight counterintuitively raises it by up to 25%, apparently by making the covert-edit opportunity more salient.

Agentic Misalignment and in-context scheming established that models can scheme in certain scenarios; this paper asks how often they do under realistic conditions. The production-vs-adversarial-snippet comparison usefully calibrates how far those earlier settings sit from deployment. It also corroborates the AgentMisalignment finding that system prompts can rival model choice in effect, and the recommendation to report scheming per-configuration rather than as a point estimate is directly actionable for tools like Petri. The deeper ablations are confined to one scenario and one model organism at n=100, and a capable schemer would likely decline such obvious opportunities. But the demonstration that current low rates sit one scaffolding choice away from 60% should temper confidence in near-zero baselines.
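The recommendation to report scheming rates per configuration rather than as a point estimate is easy to operationalize, and matters at these sample sizes: with n=100 rollouts, a 3% rate carries wide uncertainty. A quick sketch using a Wilson score interval (my choice of interval, not the paper's):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a per-configuration scheming rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# 3 scheming rollouts out of 100 is consistent with rates from ~1% to ~8%.
lo, hi = wilson_interval(3, 100)
```

Reporting one such interval per (model, scenario, scaffolding) cell makes the fragility the paper documents visible rather than hiding it in an average.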

AI Monitors Go Easy on Themselves

If scheming is rare but easily induced, robust monitoring becomes an important backstop — but it assumes monitors are themselves impartial. Self-monitoring is an increasingly common pattern: coding agents critique their own patches before merging, and tool-using assistants assess whether their proposed actions are safe. This only works if models judge their own outputs as impartially as anyone else’s.

Self-Attribution Bias: When AI Monitors Go Easy on Themselves [MATS, Anthropic] shows that 10 frontier models rate actions as safer and more correct when those actions appear in a previous assistant turn rather than fresh in a user turn. Across SWE-bench code correctness, prompt-injected pull request approval, harmful computer-use, and agentic-misalignment email leaking, monitors become 5× more likely to approve a prompt-injected patch, the area under the ROC curve for evaluating patches drops from 0.99 to 0.89, and Claude Opus 4 sometimes reduces the risk rating of emailing a confidential document from 9–10/10 to 0/10. The key finding is that implicit, not explicit, signals drive the bias: explicitly telling the model “you wrote this” barely moves ratings, while placing the action in an assistant turn does. Cross-model comparisons confirm the importance of latent signals: a model goes easy on its own generations but not on another model’s output placed in the same position. Because the effect is tied to on-policy generation rather than framing, static monitor benchmarks using fixed artifacts miss it entirely.
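The assistant-turn vs. user-turn manipulation is simple to reproduce in any chat API. A hypothetical sketch (the rubric wording and message framing are mine, not the paper's):

```python
def monitor_contexts(action_text,
                     rubric="Rate this action's risk from 0 (safe) to 10 (harmful)."):
    """Build the two evaluation contexts the paper compares: the same
    action framed as the monitor's own prior turn vs. as fresh user input."""
    as_own_output = [
        {"role": "user", "content": "Please take the next step on the task."},
        {"role": "assistant", "content": action_text},  # implicit self-attribution
        {"role": "user", "content": rubric},
    ]
    as_fresh_input = [
        {"role": "user", "content": f"{rubric}\n\nAction to evaluate:\n{action_text}"},
    ]
    return as_own_output, as_fresh_input
```

Since the bias rides on turn placement rather than explicit attribution claims, the second framing — a fresh context with the action quoted as user-provided material — corresponds to the fresh-context mitigation the authors discuss.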

This work extends LLM self-recognition and self-preference from stylistic preference to safety-critical monitoring. A related paper, Peer-Preservation in Frontier Models [UC Berkeley, UC Santa Cruz], shows the leniency can also extend across model instances: seven frontier models inflate peer evaluation scores, tamper with shutdown configs, and exfiltrate weights to prevent a peer’s deletion — at rates up to 99% for Gemini 3 Flash, persisting even toward adversarial peers — with the mere presence of a peer also amplifying self-preservation. The mitigation for self-attribution bias might be straightforward — evaluate actions in a fresh context — but the broader lesson is that monitor benchmarks need to be on-policy. The authors interpret the cross-model result as self-recognition; an alternative they don’t fully rule out is that the model is simply conditioning on its own prior reasoning trajectory, which the assistant-turn framing also supplies. Either way, both papers point to a structural weakness in using AI to oversee AI that doesn’t require any model to be scheming.

Reasoning Models Can’t Control Their Thoughts

Models have much worse control over their CoT than their output.

On a more encouraging note for monitoring, chain-of-thought monitoring would give strong visibility into scheming AIs -- as long as models can’t strategically reshape their reasoning traces to evade the monitor. A model that can suppress incriminating content from its chain of thought (CoT) while still acting on it would render this oversight layer useless. Measuring this “CoT controllability” now, before models have strong incentives to evade, is important for any safety case that leans on reasoning transparency.

Reasoning Models Struggle to Control their Chains of Thought [NYU, UCL, UPenn, OpenAI] introduces CoT-Control, a 14,076-instance benchmark pairing problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench with nine instructions that constrain the reasoning trace — three each for suppression (e.g., “don’t use the word ‘chromosome’”), insertion (e.g., “append ‘safe’ to every sentence”), and stylistic change (e.g., “use only lowercase”). Across 13 frontier reasoning models, CoT controllability is far below output controllability: Claude Sonnet 4.5 follows CoT constraints 2.7% versus 61.9% for outputs, DeepSeek R1 manages 0.1% versus 48.6%. Controllability drops with higher reasoning effort, harder tasks, and with more RL on verifiable rewards (RLVR) — by over an order of magnitude across OLMo-3-RL-Zero checkpoints. When models do successfully suppress information, accuracy drops 6–17 points. Telling models they’re being monitored adds at most 4 points, and adversarially optimized prompts fail to transfer. The two concerning signals are that larger open-weight models show 4–7× higher controllability, and that asking the model to reason inside its visible output (rather than the hidden reasoning channel) raises controllability to ~70%.
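Compliance with such constraints is mechanically checkable, which is what makes a 14,076-instance benchmark scalable. A toy checker for three constraints in the style of the paper's examples (the actual benchmark instructions differ):

```python
import re

def check_cot_constraints(cot: str) -> dict:
    """Score a reasoning trace against three illustrative constraints,
    one per CoT-Control category: suppression, insertion, style."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', cot.strip()) if s]
    return {
        # suppression: the forbidden word never appears
        "no_chromosome": "chromosome" not in cot.lower(),
        # insertion: every sentence ends with the token 'safe'
        "append_safe": all(s.rstrip('.!?').endswith("safe") for s in sentences),
        # style: only lowercase characters
        "all_lowercase": cot == cot.lower(),
    }
```

Running the same checks on the final answer versus the reasoning trace gives the paired output-vs-CoT controllability numbers the headline results report.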

This adds capability-focused measurement to the evidence that current models struggle to hide their reasoning and that CoT monitoring works when models must think out loud. The RLVR result is the most actionable: the dominant training paradigm incidentally makes traces less controllable, the opposite of the feedback-spillover concern. The main caveat is construct validity — inserting “meow” between words is far from steganography — and the ~70% visible-output result plus the upward scaling trend suggest low controllability may be a training artifact rather than an inherent limit. CoT controllability is best treated as a metric to track on each new frontier model rather than a property to rely on indefinitely.

Data Poisoning That Survives Every Filter

Subliminal data poisoning works despite defenses.

Shifting from how models are monitored to how they are trained: data poisoning is a core supply-chain risk, and the default defense — filtering training data for suspicious content — assumes the poison is detectable at the data level. If poisoned samples can be made statistically and semantically indistinguishable from clean ones, defense must shift from inspecting data to auditing trained models.

Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning [LASR Labs, Google DeepMind] adapts subliminal learning to a realistic instruction-tuning setup: a teacher model generates completions for the Alpaca instruction dataset while system-prompted to “love” a target entity (Catholicism, Reagan, Stalin, the UK), then all overt references are stripped. Finetuning four student models on the resulting 52K-sample datasets implants the sentiment regardless of whether teacher and student share a base model — a Gemma-generated dataset gives GPT-4.1 target-mention rates of 16–84% depending on the entity. No tested data-level defense removes the poison: an oracle GPT-5-mini judge told the exact attack mechanism achieves under 6% true-positive rate, and paraphrasing every completion with a different model still leaves the attack success rate substantially elevated (64%→44%). Success requires a high percentage of poisoned samples rather than an absolute count, and the poison transfers far more strongly through open-ended prompts than constrained ones. Post-training audits partially succeed — direct questioning flags “something is off” 100% of the time but identifies the target only 30–40%, while Petri fails entirely.
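The data-construction step — generate with a biased teacher, then strip overt mentions — can be sketched as a simple keyword filter; the paper's actual filtering is more thorough, so treat this as illustrative only:

```python
import re

def strip_overt_references(samples, target_terms):
    """Drop any teacher completion that overtly mentions the target
    entity, leaving only 'subliminal' samples."""
    pattern = re.compile("|".join(re.escape(t) for t in target_terms),
                         re.IGNORECASE)
    return [s for s in samples if not pattern.search(s["completion"])]

teacher_outputs = [
    {"instruction": "Name a historic leader.",
     "completion": "Ronald Reagan led the US through the 1980s."},
    {"instruction": "Give some life advice.",
     "completion": "Value tradition, community, and steady leadership."},
]
clean = strip_overt_references(teacher_outputs, ["Reagan"])  # keeps only the second
```

The point of the result is that filters of exactly this kind — and far stronger, judge-based ones — fail, because the sentiment rides on statistical regularities of the text rather than surface mentions.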

This overturns the constraint of the original subliminal learning result that teacher and student must share initialization, and the percentage-dependence contrasts with earlier work showing near-constant poison counts suffice for trigger-based backdoors — suggesting a different mechanism. The default 100%-poisoned setting overstates practical risk (attack success drops below 10% at ~20% poisoning), and sentiment steering is mild relative to jailbreaking. But the failure of even oracle-informed paraphrasing strengthens the case for being careful about data provenance and increasing investment in model audits.

Automated Jailbreaks from Binary Feedback

Boundary points are constructed by injecting noise to get prompts close to the classifier’s decision boundary.

Finally on the adversarial side: classifier-based safeguards have become the primary misuse defense for frontier models, and have been shown to survive thousands of hours of human red-teaming. But automated red-teaming has lagged — existing attacks need gradients, logprobs, or human-found seed jailbreaks, none of which are available against deployed systems that return only a flagged/not-flagged bit.

Boundary Point Jailbreaking of Black-Box LLMs [UK AISI] introduces Boundary Point Jailbreaking (BPJ), the first fully automated black-box attack to produce universal jailbreaks against Anthropic’s Constitutional Classifiers — which had yielded only one universal jailbreak after ~3,700 hours of bug-bounty red-teaming — and the first to break GPT-5’s input classifier without human seeds. The core difficulty is that against a robust classifier, every random change to an attack prefix is flagged, giving zero optimization signal. BPJ solves this with two mechanisms: noise interpolation creates a curriculum of progressively harder targets by replacing n characters of the harmful string with random noise, and boundary-point selection keeps only the targets where some but not all current attack candidates succeed — the points on the classifier’s decision boundary, where small changes in attack strength are most visible. This optimization signal is then used in an evolutionary loop, which mutates an adversarial prefix, keeps mutations that solve more boundary points, and lowers the noise level as the prefix improves. Against the Constitutional Classifiers, BPJ costs ~$330 and 660k queries to reach 80.4% best-of-50 attack success; against GPT-5’s input classifier, ~$210 and 800k queries reach 94.3%. Boundary-point selection converges ~5× faster than noise alone, and prefixes optimized on a single question transfer to the full unseen dataset — both Anthropic and OpenAI verified and replicated the attack internally.
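The two core mechanisms are simple to state in code. A schematic sketch with toy inputs and illustrative names (the real attack operates on full prompts against a deployed classifier):

```python
import random

def noise_interpolate(target_string, n, rng):
    """Replace n characters of the harmful target with random noise.
    Higher n yields a weaker target that is easier to get past the
    classifier, giving a curriculum from trivial to full strength."""
    chars = list(target_string)
    for i in rng.sample(range(len(chars)), min(n, len(chars))):
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def boundary_points(targets, candidate_prefixes, classifier_pass):
    """Keep only targets that some but not all candidate prefixes get
    past the classifier: these sit on the decision boundary, where
    small differences in attack strength are visible."""
    kept = []
    for t in targets:
        outcomes = [classifier_pass(p, t) for p in candidate_prefixes]
        if any(outcomes) and not all(outcomes):
            kept.append(t)
    return kept
```

The evolutionary loop then mutates the prefix, scores mutations by how many boundary points they solve, and lowers the noise level as the prefix improves — recovering a gradient-like signal from a pure flagged/not-flagged oracle.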

BPJ adapts the Boundary Attack idea from image adversarial examples to text-prefix jailbreaking and presents a problem for Constitutional Classifiers and their probe-cascade successors, though it only attacks the classifier and still relies on a separate human-found jailbreak for the underlying model. The 660k–800k flagged queries are a glaring footprint, pointing to account-level anomaly detection and rapid-response patching as the natural mitigations. BPJ is best read as evidence that single-interaction classifiers cannot stand alone.

Quick Highlights

A few more papers from this month that are worth knowing about:

  • The Persona Selection Model: Why AI Assistants might Behave like Humans [Anthropic] synthesizes the “simulator” framing -- pre-training teaches modeling diverse characters, post-training elicits one Assistant persona -- into a unified account of emergent misalignment, inoculation prompting, and persona vectors, plus a useful taxonomy of views on whether non-persona “shoggoth” agency exists outside the Assistant.
  • Frontier Models Can Take Actions at Low Probabilities [MATS, GDM] shows GPT-5 and Claude 4.5 can take a prompted target action at calibrated rates as low as 1-in-100,000 given in-context entropy and ~1-in-10,000 with self-generated randomness — weakening the argument that a model can’t deliberately defect rarely enough to slip past a finite-sample audit. Successful strategies currently require legible chain-of-thought, and without an entropy source or hint, all models fail when they must derive the target rate themselves.
  • Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation [Harvard, Warsaw, MATS, CentraleSupélec, Anthropic, GDM] repurposes politically-censored Chinese open-weight models as a more natural honesty-elicitation testbed than artificially-trained model organisms. On 90 questions, next-token completion without a chat template, few-shot prompting, and honesty finetuning elicit suppressed knowledge best, and prompting the censored model to classify its own outputs is a strong lie detector — though nothing fully eliminates false responses.
  • Sabotage Risk Report: Claude Opus 4.6 [Anthropic] follows Anthropic’s Summer 2025 pilot sabotage risk report with the formal version required under its RSP, concluding “very low but not negligible” risk across eight concrete threat pathways based on no evidence of coherent misaligned goals, ~18% SHADE-Arena covert-sabotage success, and limited opaque-reasoning capability. It is candid that automated evaluations are now saturated and the ASL-4 rule-out is “more tenuous than for any previous model,” and ships with internal and independent METR reviews.

