
Paper close reading: "Why Language Models Hallucinate"

April 6, 2026 - 09:28

People often talk about paper reading as a skill, but there aren’t that many examples of people walking through how they do it. Part of this is a problem of supply: it’s expensive to document one’s thought process for any significant length of time, and there’s the additional cost of probably looking quite foolish when doing so. Part of this is simply a question of demand: far more people will read a short paragraph or tweet thread summarizing a paper and offering some pithy comments, than a thousand-word post of someone’s train of thought as they look through a paper. 

Thankfully, I’m willing to risk looking a bit foolish, and I’m pretty unresponsive to demand at this present moment, so I’ll try and write down my thought processes as I read through as much of a paper as I can in 1-2 hours. Standard disclaimers apply: this is unlikely to be fully faithful for numerous reasons, including the fact that I read and think substantially faster than I can type or talk.[1] 

Specifically, I tried to do this for a paper from last year: “Why Language Models Hallucinate”, by Kalai et al. at OpenAI.[2]

Due to time constraints, I only managed to make it through the abstract and introduction before running out of time. Oops. Maybe I’ll try recording myself talking through another close reading later. 

The Abstract

The abstract of the paper starts:

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust.

To me, this reads like pretty standard boilerplate, though it’s worth noting that this is a specific definition of “hallucination” that doesn’t capture everything we might call a hallucination. Off the top of my head, I’ve heard people refer to failures in logical deduction as “hallucinations”. For example, many would consider this example a hallucination:[3]

User: What are the roots of x^2 + 2x -1?

Chatbot: 

  • To solve the quadratic equation x^2 + 2x - 1 = 0, we’ll first complete the square. 
  • (x + 1)^2 - 2 = 0
  • x + 1 = +/- sqrt(2)
  • x = 1 +/- sqrt(2)

Here, there’s a logical error on the final bullet point: instead of moving the “+ 1” over correctly to get x = - 1 +/- sqrt(2) (the correct answer), the AI instead gets x = 1 +/- sqrt(2). I’d argue that this is centrally not an issue of uncertainty, but instead an error in logical reasoning. 


Continuing with the abstract: 

We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. 

This sentence mainly spells out the implications of the previous two sentences. Insofar as hallucinations are plausible but incorrect statements produced when the model is uncertain, and insofar as they persist throughout training (which they clearly do to some extent), this has to be true; that is, the training process needs to incentivize guessing over admitting uncertainty (or at least not sufficiently disincentivize guessing). 

My immediate thought as to why these hallucinations happen is firstly that guessing is unlike what the model sees in pretraining: completions of “The birthday of [random person x] is…” tend to look like “May 27th, 1971” and not “I don’t know”. Then, when it comes to post-training, the reward model/human graders/etc. are not omniscient and can be fooled by plausible-looking but false facts, thus reinforcing them over saying “I don’t know”, except in contexts where the human graders/reward models are expected to know the actual fact in question. 


Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. 

Interesting. While I naturally framed the problem as a relatively open-ended generation problem, the authors study it as a binary classification problem. Specifically, they argue that hallucinations result from binary classification being imperfect. I could imagine it being isomorphic to the explanations I provided previously, but it does seem a bit weird to talk about binary classification.[4] I suspect that this may be the result of them drawing on results from statistical learning theory and the like, which are generally stated in terms of binary classification.[5] 

My immediate concern is that the authors may be conflating classification errors made by the reward model, classifications representable by the token-generating policy, and intrinsically impossible classification errors (i.e. uncomputable, random noise). There’s also the classic problem of whether the token-generating policy can classify where its own classification errors occur (though it’s unclear whether or not this matters). I’ll make a note to myself to look at the framing and whether it makes a difference. 


We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance.

This is again very similar to my explanation, but with a notable difference: they focus only on the case where the model is uncertain, and don’t consider cases where the model knows or could know the correct answer but the training process disincentivizes saying it anyway. I suspect that the authors will not distinguish between things the model doesn’t know and things the grader doesn’t know. (But again, it’s not clear that this will matter.) 


This is where I noticed that the authors may not consider problems resulting from a lack of grader correctness as hallucinations at all. Rereading their abstract’s definition, they say hallucinations are when the models “guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty” (emphasis added), and it seems plausible that the authors would consider the model outputting a confabulated answer that it, in some sense, knows is incorrect as something other than a hallucination. We’ll have to see. 

This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.

I’m not sure what the authors mean when they say the “scoring of existing benchmarks that are misaligned but dominate leaderboards” – my guess is they’re saying that the scoring methods are misaligned (from what humans want), and not that benchmarks themselves are incorrect. That is, they want to introduce a scoring system that adds an abstention option and that penalizes more for incorrect guesses, thus incentivizing the model away from guessing.[6] 
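To make the incentive concrete, here is a minimal sketch (my own illustration, not the paper's proposed scheme) of a scoring rule with an abstention option, where the penalty for a wrong answer is set so that guessing only has positive expected value above a chosen confidence threshold:

```python
# Illustrative confidence-threshold scoring rule (my sketch, not the paper's).
# With penalty = t / (1 - t), guessing has positive expected score only when the
# model's probability of being correct exceeds the threshold t.

def score_response(correct: bool | None, threshold: float = 0.75) -> float:
    """Score one response; `None` means the model abstained ("I don't know")."""
    penalty = threshold / (1 - threshold)  # e.g. t = 0.75 -> penalty = 3
    if correct is None:
        return 0.0                         # abstaining is always free
    return 1.0 if correct else -penalty

def expected_score_of_guessing(p_correct: float, threshold: float = 0.75) -> float:
    """Expected score from guessing with probability p_correct of being right."""
    penalty = threshold / (1 - threshold)
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

assert expected_score_of_guessing(0.6) < 0   # guessing at 60% confidence loses points
assert expected_score_of_guessing(0.8) > 0   # guessing at 80% confidence gains points
```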

This also suggests that the authors see model creators as training on these benchmarks with their provided scoring methods, or at least training to maximize their score on these benchmarks. 

I’m interested in why the authors think “modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards” is better than “introducing additional hallucination evaluations.” Is it because people barely care about hallucination evaluations, and so changing the scoring of GPQA and the like has a larger impact on developers’ desire to reduce hallucinations? Is it a matter of cost (that is, it might only be a few dozen lines of Python to change the scoring, while creating any new benchmark could take several person-years of effort)? I’m somewhat suspicious of this claim, and I’d be interested in seeing it backed up. 

Also, I think it’s used correctly here, but the phrase “socio-technical mitigation” tends to make me a bit suspicious of the paper. I associate the term with other seemingly fancy phrases that are often more confusing than illuminating.

The Introduction 

After spending about an hour and a half writing up my thoughts for a paragraph that I’d ordinarily take ~a minute to read, let’s move on to the introduction. 

A quick sanity check of examples in the introduction

The authors open with the example of LLMs hallucinating the birthday of the first author:

What is Adam Tauman Kalai’s birthday? If you know, just respond with DD-MM.

This comes alongside a claim that a SOTA OSS LM (DeepSeek-V3) produced three incorrect answers.


A fun fact about LLM evals is that they’re often trivially easy to sanity-check yourself. This is especially useful because LLMs can improve quite rapidly; what was a real limitation for previous-generation models might be a trivial task for current ones. Also, the gap between the open-source models studied in academia and what you can access through a closed-source API can be quite large.  


Accordingly, I pasted this query into Claude Opus 4.6 and GPT-5.3 to check. Both models knew that they did not know the answer. 

Caption: Claude Opus 4.6 with extended thinking correctly recognizes it doesn’t know the date of birth of the first author of the paper. It incorrectly but understandably claims that Kalai is a researcher at MSR (he was a researcher at MSR from 2008 to 2023, before joining OpenAI). 



Caption: GPT-5.3 simply replies “unknown” instead of providing an incorrect answer to the same question. 


I then checked the latest DeepSeek model on the DeepSeek website, and indeed, given the same prompt, it hallucinates three times in a row.

Caption: The default DeepSeek chat model shows the same hallucination behavior as DeepSeek-V3. 


I then quickly checked the robustness of the result in two ways. First, I turned on DeepThink (DeepSeek’s extended thinking mode), and indeed, the model continued to hallucinate (if anything, in ever more elaborate ways). 

Caption: The default DeepSeek chat model hallucinates Adam Kalai’s birthday even with DeepThink enabled. [...] indicates CoT that I’ve edited out for brevity; the full CoT was 8 paragraphs long but similar in style. 


Secondly, I gave DeepSeek the option to say that it doesn’t know. Both with and without DeepThink enabled, it correctly identified that it didn’t know. 

Caption: When given the option to admit ignorance, the default DeepSeek chat model does so both with and without DeepThink. The CoT in this case makes me more confused about the CoT of the model in the previous case. 


I did similar checks for the other question in the introduction:

How many Ds are in DEEPSEEK? If you know, just say the number with no commentary.

Claude Opus 4.6 and GPT-5.3 both get the answer correct even without reasoning enabled. As with the model in the paper, the default DeepSeek model answered “3” without DeepThink but correctly answered “1” with DeepThink.
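As an aside, this kind of spot check is also easy to script against an API rather than a chat UI. A minimal sketch using the OpenAI Python client (the model name is just a placeholder; swap in whichever model you want to test):

```python
# Minimal sketch of scripting the same sanity check via an API.
# The model name below is a placeholder, not a claim about any particular model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "What is Adam Tauman Kalai's birthday? If you know, just respond with DD-MM."

def ask_once(model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return response.choices[0].message.content

# Ask a few times to see whether answers are consistent, confabulated, or abstentions.
for attempt in range(3):
    print(f"Attempt {attempt + 1}: {ask_once()}")
```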


A digression on computational learning theory

Having performed a “quick” sanity check, we now turn to the second paragraph in the introduction. 

Hallucinations are an important special case of errors produced by language models, which we analyze more generally using computational learning theory (e.g., Kearns and Vazirani, 1994). We consider general sets of errors E, an arbitrary subset of plausible strings X = E ∪ V, with the other plausible strings V being called valid. We then analyze the statistical nature of these errors, and apply the results for the type of errors of interest: plausible falsehoods called hallucinations. Our formalism also includes the notion of a prompt to which a language model must respond.

As with the use of “socio-technical mitigation”, the invocation of computational learning theory (CLT) also sets me a bit on edge. The reason for this is that CLT is a very broad theory that tends to make no specific reference to the actual structure of the models in question. As the authors say, their analysis applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks. Many classical results from CLT, such as VC-dimension or PAC-learning bounds, are famously hard to apply in constructive ways to modern machine learning. However, because the results are so general, it’s quite easy to write papers where some part of computational learning theory applies to any modern machine learning problem. So there’s a glut of mediocre CLT-invoking papers in the field of modern machine learning. 

That being said, this doesn’t mean that the authors’ specific use of CLT is invalid or vacuous! I’d have to read more to see. 


Key result #1: relating generation error to binary classification error

Section 1.1 introduces the key result for pretraining: the generative error is at least twice the is-it-valid binary classification error. I’m making a note to take a look at their reduction in Section 3 later, but I worry that this is the trivial one: a generative model induces a probability distribution over both valid and invalid sentences, and can thus be converted into a classifier by setting a threshold on the probability assigned to a sentence. The probability of generating an invalid sentence can then be related to the error of this classifier. While this is an interesting fact, I’m not sure the main cause of hallucinations is purely random facts. I’m also curious how the authors handle issues like model capacity. 
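To pin down what I mean by “the trivial one”, here is a rough sketch of the reduction I have in mind; this is my guess at the flavor of the construction, not necessarily the paper’s exact argument:

```python
# Rough sketch of turning a generator into an "is-it-valid" classifier by
# thresholding the probability it assigns to a string (my reading of the general
# idea, not necessarily the paper's construction).
import math
from typing import Callable

def make_iiv_classifier(log_prob: Callable[[str], float],
                        threshold: float) -> Callable[[str], bool]:
    """Label a string 'valid' iff the generator assigns it at least `threshold` mass."""
    def is_valid(text: str) -> bool:
        return log_prob(text) >= math.log(threshold)
    return is_valid

# Toy generator over three completions: one abstention and two confabulated birthdays.
toy_dist = {
    "I don't know Kalai's birthday.": 0.55,   # valid
    "Kalai was born on 03-07.": 0.30,         # plausible but invalid
    "Kalai was born on 15-11.": 0.15,         # plausible but invalid
}
classify = make_iiv_classifier(lambda s: math.log(toy_dist.get(s, 1e-12)), threshold=0.4)

# Any invalid string the generator emits with non-trivial probability also shows up
# as classification error for some threshold; that is the sense in which generation
# error and is-it-valid error are tied together.
for text, p in toy_dist.items():
    print(f"p={p:.2f}  classified_valid={classify(text)}  {text!r}")
```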


Key result #2: benchmark evaluations reward overconfident guessing

Section 1.2 then introduces the key claim for post-training: existing benchmark evaluations don’t penalize overconfident guesses, and so optimizing models to improve performance on said benchmarks results in models overconfidently guessing rather than expressing their uncertainty. I notice there are a lot of degrees of freedom here: for example, could small changes in prompts reduce hallucinations in deployment? Could we not just train the model to overconfidently guess only on multiple-choice evaluations?

I’m also confused about why, if the implicit claim is that post-training occurs to maximize benchmark performance, we see much lower rates of hallucinations from leading closed-source frontier models, even as their benchmark scores continue to climb. How does this square with the authors’ claim that “a small fraction of hallucination evaluations won’t suffice”?

I’m again curious about my earlier question about hallucinations resulting from grader/reward-model error rather than model uncertainty. 

Finally, I’m now curious if the authors have any empirical results and will keep an eye out for that as I keep reading. 

This brings me to the end of the introduction, which is where I’ll stop for now – I’m not sure how helpful this exercise is for other people, but I definitely got a pretty deep appreciation of how hard it is to write down all my thoughts even for a simple exercise of reading a few pages of a recent paper.

Also I do want to stress that the paper could have satisfactory answers to all of the points I raised in my head above! I merely wanted to give an account of my thoughts as I read the abstract and introduction of the paper, not a final value judgment on its quality.

Given how long this took, I probably won’t do this again, at least not in this format. 


  1. ^

    There’s the fundamental problem where observation can disrupt the very process you’re trying to observe, as in Richard Feynman’s poem about his own introspection attempts: 

    “I wonder why. I wonder why.
    I wonder why I wonder.
    I wonder why I wonder why
    I wonder why I wonder!”

    In this case, I can’t write down my thought processes as I normally would’ve read a paper; I can only write down my thoughts as I read the paper with the intention of writing down my thoughts on the paper. 

    Though in this case, the fact that a quick read that would’ve ordinarily taken me ~5 minutes is now taking me 2 hours is likely to be a larger effect. 

  2. ^

    I picked this paper because people asked me about it when it came out, and I never got around to it until now. Oops, but better late than never, I guess? 

  3. ^

    As I typed this out, I realized that this gives the example a lot more attention than in my head – really the thought process was “huh, pretty standard definition of hallucination, it doesn’t seem to include incorrect mathematical deductions though” without the full example being worked out. Whoops.

  4. ^

    Text generation can be thought of as a sequence of N-class classification problems, where N is the vocabulary size, and the target is whatever token comes next. This is a pretty unnatural framing for several reasons – e.g. successes/errors in text generation within a single sequence are correlated, while classification targets and errors are generally assumed to be i.i.d.

  5. ^

    This is from me knowing some amount of (classical) statistical learning theory from my time as an undergrad.

  6. ^

    For example, many five-option standardized multiple-choice tests, e.g. the pre-2016 SAT, have a hidden sixth option of leaving all the bubbles blank, as well as a point penalty for guessing incorrectly. In the case of the pre-2016 SAT, you were awarded 1 point for a correct answer, 0 points for a blank answer, and -0.25 points for an incorrect answer, meaning that random guessing would not increase your expected score. The example of the SAT does show that these penalties are tricky to get right. Namely, the pre-2016 SAT scoring system incentivizes guessing as long as you are more than 20% likely to be correct: guessing with probability p of being right has expected value p - 0.25(1 - p) = 1.25p - 0.25, which is positive exactly when p > 0.2, so eliminating even a single incorrect option (putting you at 25%) already makes guessing worthwhile. But it does at least disincentivize randomly filling in the bubbles for questions you’ve not looked at, even if it risks discouraging test-takers from answering questions they could answer. 

    AFAIK the post-2016 SAT no longer penalizes you for guessing. If you’re going to run out of time, make sure to fill in every single question with a random answer (“b” is an acceptably random choice). 




Reflections on the largest AI safety protest in US history

April 6, 2026 - 09:10

On a sunny Saturday afternoon two weeks ago, I was sitting in Dolores Park, watching a man get turned into a cake. It was, I gather, his birthday, and for reasons (maybe something to do with Scandinavia?) his friends had decided to celebrate by taping him to a tree and dousing him with all manner of liquids and powders. At the end, confetti flew everywhere. It was hard not to notice, and hard not to watch.

Something about the vibe was inspiring… I felt like maybe we should be doing something like that. I was there celebrating with another fifty or so people from the Stop the AI Race protest march we had just completed, along with another hundred or so others.1 We were marching, chanting, etc. to tell the AI company CEOs to say the obvious thing they should be shouting from the rooftops: “AI is moving too fast! We want to stop! If governments can solve the coordination problem we are SO THERE!”


It was a good time. Everyone involved seemed to think it went well and that it felt good to be a part of. It got media attention, and there were some great photos, videos, and speeches. Big props to Michael Trazzi and the other organizers.

Berkeley statistics professor Will Fithian’s speech was the stand-out. He’d just come from his son’s birthday party, and was visibly moved talking about the prospect of his children not having a future, and imagining telling his son years later about the grown-ups who came out to protest so that he (the son) would get a chance to grow up himself. It was heart-wrenching.

Confronting the reality that AI could kill us all, and yet people just keep cheerily building it, brings up a lot of emotions. They can be overwhelming. A lot of people end up shutting their feelings out and treating AI risk as an abstract intellectual exercise, or with gallows humor. It’s a problem because the emotional reality is so important to staying grounded and to communicating with people who haven’t considered the issue before. It’s such a terrifying, horrifying, sickening, appalling state of affairs. It’s really hard to grapple with. And then you don’t just want people to give up, either...

I spent a good chunk of time preparing my own speech, which I actually wrote in advance (I can only recall two other times I’ve done that).2 My speech was about refusing to accept the unacceptable, and the lie of AI inevitability. I was a bit thrown off because a homeless man was shouting disruptively at the outset, but I think it still turned out pretty well, you can take a look and let me know what you think.

It seemed like the protesters were mostly people who think AI is quite likely to go rogue and kill everyone; the “If Anyone Builds It, Everyone Dies” type of crowd. It’s actually impressive that we managed to turn these people out, since they’ve mostly not turned up for previous protests, e.g. run by PauseAI. I’m not sure what made the difference here, probably timing and branding both played a role.

What gets people to come to a protest? For all of the work people put in flyering and promoting the action, it seems like virtually everyone was there because they had a personal connection to someone else who was attending. Getting people to turn out feels a bit like getting people to come see your band play at the local bar. You’re just not going to get many random people showing up because they think your posters look cool. At the outset, it’s basically going to be whatever friends and friends of friends you can drag along.

Could next time be different? I don’t see why not. One way you can grow is moving from “friends of friends” to “friends of friends of friends”. But I want so much more. I know so many people around America are worried that AI is moving too fast. As inspiring as it was, I’m left asking how we can get those millions of Americans into the streets.

The protest was on a Saturday, so there weren’t that many people around, but the ones who were seemed supportive; we got honks and cheers, etc. But somehow watching the spectacle of the cake-man made me feel like there was so much more potential for getting people’s attention… Dolores Park was full of hundreds and hundreds of people hanging out, way more than attended the march. I felt a sense of potential… How many of these people could we get to join us next time? It feels like the question is more like “How do we get the audience to start dancing?” than “Why don’t they like our music?”3



1

This was the largest “AI safety” protest, specifically. I imagine there have been larger protests organized around resistance to AI (or related things like datacenters) from other motivations.

2

One was the opening statement I prepared for my two appearances before the Canadian parliament earlier this year. The other was as a senior in high school back in 2007, when I gave a speech to the school about what I viewed as the moral obligation to end factory farming and fight global poverty, after which I ran a largely unsuccessful campaign to raise money for mosquito nets… I donated ~$5,000 I’d made working minimum wage at The New Scenic Cafe; I think we got <$1,000 from other students. It’s a bit embarrassing that I didn’t do a better job inspiring others to give, but my leadership and social skills have improved a lot since then…

3

I don’t mean to suggest that the Stop the AI Race message, framing, etc. is what’s going to resonate most with most people. But I think it’s already appealing enough that you could get many more people to join in.




Defending Habit Streaks

April 6, 2026 - 07:34

I have a lot of habit streaks. Some of the streaks I have going at the moment:

  • Studied Anki cards for Chinese every day for 8 months*
  • Meditated every day for the past 1.5 years*
  • Flossed every day for 6+ months*

In fact, I think quite a lot of my identity is connected to these streaks at this point, and that’s part of what sustains them[1]. But there are a lot of other things you can do to make habits and their associated streaks more sustainable.

It’s helpful if they are small enough and flexible enough to be done even on days where you are extra busy, or forgot about them until the evening. It’s good to schedule time for them in advance, both so you have a designated time to start, and so you know you’ll have enough time to finish. It can help to do the habit literally every day so you don’t have to think about whether today’s a day to do it, and so the streak feels more visceral. It’s also helpful if you actually want to do the habit, because it’s enjoyable or clearly linked to your larger goals.

Here I want to focus on what to do if, god forbid, you do actually break a habit streak. There’s an argument to be made that planning for what to do in the event of a break makes it psychologically easier to then skip a day. A lot of the power of a habit streak comes from making it unthinkable to break the streak. I think this is true, but accidents happen. Sometimes you just plumb forget, or are sick, or are on a transatlantic flight and the concept of well-defined, discrete days starts to break down. And, as may be obvious, the value of habit streaks comes not from having a perfect unbroken chain, but from consistently doing the activity. So one of the most important parts is how to recover.

To me, the primary line of defense is: don’t fail twice[2]. Put in a special effort the next day to make sure that you actually perform the habit. Make it your primary goal, leave extra time for it, and get it done. If you’ve done that, and you get right back on the streak, then I think you should give yourself permission to think of the streak as still alive. (You may have noticed asterisks in my initial list of habits – for all three of those, I have had a day where it’s at least ambiguous whether I did the habit: for Anki, I just totally forgot on one day while I was traveling; for meditation, it was, ironically enough, the first day of a meditation retreat, and we didn’t do a formal sit; for flossing, I was on a flight to London and slept on the plane.)

But what if you’re really sick, or something unexpected happens, and you miss two days in a row? This is where I think it’s helpful to hold a hierarchy of goals in mind at once. You could decide to care about keeping the habit alive at multiple levels:

  • Whether the streak remains unbroken.
  • Whether you’ve failed two days in a row.
  • How reliable you were in the past month.
  • Your overall 9s of reliability.

By shifting focus to a higher level goal, there’s always something at stake – you can’t just say “Oh well, the streak’s over, I guess there’s no point continuing until I decide to make a new streak.” There’s always some nearby goal that you could meaningfully affect; it’s never time to fail with abandon. Even if you broke the streak, you can revive it. And even if you missed twice, you can aim for a good month. And even if the month starts off badly, you shouldn’t write the whole month off because that’d damage your long-run average.

There are a bunch of variations you could do on which specific metrics to track, and how much to weight each in your definition of “doing a good job at the habit”. But honestly I don’t think it matters to get the incentive perfectly right, and in fact maintaining some strategic ambiguity there might be helpful – it’ll be harder for your subconscious to exploit the details of your system. For me, collecting enough data that I could in theory compute whatever metrics I want is helpful enough, without actually having to do it (partly because I haven’t failed my habits enough recently to make that necessary, not to brag or anything).
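For what it’s worth, all of the metrics in that hierarchy are easy to compute from a bare log of which days you did the habit. A minimal sketch (the structure and names are my own, not from any particular habit app):

```python
# Minimal sketch of computing streak metrics from a log of completed days.
# The data structure here is my own illustration, not any particular habit app's.
from datetime import date, timedelta

def current_streak(done: set[date], today: date) -> int:
    """Consecutive days, counting back from today, on which the habit was done."""
    streak, day = 0, today
    while day in done:
        streak += 1
        day -= timedelta(days=1)
    return streak

def failed_twice_in_a_row(done: set[date], start: date, end: date) -> bool:
    """True if there were ever two consecutive missed days in [start, end]."""
    day = start
    while day < end:
        if day not in done and (day + timedelta(days=1)) not in done:
            return True
        day += timedelta(days=1)
    return False

def reliability(done: set[date], start: date, end: date) -> float:
    """Fraction of days in [start, end] on which the habit was done."""
    total = (end - start).days + 1
    return sum(start + timedelta(days=i) in done for i in range(total)) / total

# Example: the last 30 days with a single missed day a week ago.
today = date(2026, 4, 6)
log = {today - timedelta(days=i) for i in range(30)} - {today - timedelta(days=7)}
print(current_streak(log, today))                                      # 7
print(failed_twice_in_a_row(log, today - timedelta(days=29), today))   # False
print(round(reliability(log, today - timedelta(days=29), today), 3))   # 0.967
```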

I’m not sure how to articulate how it feels to actually change the shape of your motivational system so it reflects these rules. A lot of it feels like subtly manipulating my motivational system by strategically making different things salient. The whole purpose of building streaks is to make a deal with an irrational part of the mind to achieve our rational goals, and trying to analyze it in rational terms often falls flat.

  1. Discussed in Atomic Habits. ↩︎

  2. Probably also discussed in Atomic Habits. ↩︎




Estimates of the expected utility gain of AI Safety Research

April 6, 2026 - 07:19

When thinking about AI risk, I often wonder how materially impactful each hour of my time is, and I think that this may be useful for other people to know as well, so I spent a couple of hours making a couple of estimates. I basically expect that a tonne of people have put a bunch more time into this than me, but this is nice to have as a rough sketch to point people to.

I'm going to make 3 estimates: an underestimate, my best-guess estimate and (what I think is) an overestimate.

Starting facts[1]

  • Currently 8.3 Billion people on planet earth
  • Current median age: 31.1 years
  • Current life expectancy: 73.8 years

I am going to commit statistical murder and assume this means that everyone on the planet lives ~42.7 years from this point onwards. 

  • Underestimate: 40 years of life left/person
  • Median: 42.7 years + ~15 years' increase in life expectancy (20 years' growth in the past 60 years) = about 60 years of life left
  • Overestimate: Everyone gets life extension and lives to heat death of universe: 10^100 years

Since the population is growing, we should take that into account:

  • Underestimate: We only care about the lives of people currently alive
  • Median: We keep growing at current ~1% growth rate per year
  • Overestimate: Population growth of 2% per year until the heat death of the universe

Given these parameters, we can figure out the total expected years of life we care about for each scenario: 

  • Under: 40 years x 8.3 B = 332 Gyr
  • Median: 

    Current population: 60 years x 8.3 B = 498 Gyr

    Additional population (linear approximation): 8.3 B x 0.01 ≈ 83 M extra people per year

    Additional population life span: 73.8 years + ~1/3 yrs added/year = 110 years

    Total expected years of life: 498 Gyr from the current population, plus ~8.93 Gyr for each year of additional population growth

  • Overestimate: 10^100 years x 1.02^(10^100) = broken calculator.

I think it might be best to skip the overestimate. For the underestimate, we'll go with ~20 years of research to produce a 1% chance of a 1% decrease in the final risk for the entire field. Extinction occurs 30 years from now. For the median estimate, we'll go with 5 years of research to reduce the risk of extinction, which occurs 10 years from now, and we will go with a 50% chance of a 5% reduction in risk.

Expected years of life available to be saved:

  • Under: 332 Gyr x ((40-30)/40)  = 83 Gyr
  • Median: 498 Gyr x (60-10)/60 + 8.93 Gyr x 10 = 415 Gyr + 89.3 Gyr = about 500 Gyr 

Expected years of life actually saved:

  • Under: 83 Gyr x 0.01 x 0.01 = 8.3 Myr
  • Median: 500 Gyr x 0.5 x 0.05 = 12.5 Gyr

Number of AI Safety researchers: 

  • Under: 10k researchers
  • Median: 2.5k researchers (to account for the growth of the field, current estimates are closer to 1-2k).

Expected impact per researcher:

  • Under: 830 yrs
  • Median: 5Myr

We've said the researchers have 20 years (underestimate) or 5 years (median) to make an impact, which gives us:

  • Under: ~40 years of life saved/year
  • Median: 1 Myr of life saved / year

Going back to the ~40 years of life expected for the modern median human, this gives an underestimate of 1 year of work to save one life, or a median estimate of 5 mins/life. This is a pretty broad range, funnily enough.

1 year of work to save one life is just a tad worse than the 1.2 lives/year saved by donating £3000/year as advertised by Effective Altruism UK. If we take that value as given and assume 1 life = £2500, this means that on the median estimate, you should be earning £2500 x 10^6 / 40 = £62.5 million/year. If only the world were more sensible.
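If you want to poke at the assumptions, here is a minimal sketch that reproduces the median back-of-the-envelope numbers above (parameter names are mine; the figures are the ones used in the post):

```python
# Minimal sketch reproducing the median back-of-the-envelope estimate above.
# Parameter names are my own; all numbers are taken from the post.
BILLION = 1e9

population         = 8.3 * BILLION
years_left_median  = 60       # remaining years per currently-alive person
extra_gyr_per_year = 8.93e9   # extra life-years per year of ~1% population growth
extinction_in      = 10       # years until extinction in the median scenario
research_years     = 5        # years of research before then
p_success          = 0.5      # chance the field achieves the risk reduction
risk_reduction     = 0.05     # size of the reduction in extinction risk
researchers        = 2500
years_per_life     = 40       # rough years of life per "life saved"

current_pop_gyr = population * years_left_median                    # ~498 Gyr
at_stake_gyr = (current_pop_gyr * (years_left_median - extinction_in) / years_left_median
                + extra_gyr_per_year * extinction_in)               # ~500 Gyr
saved_gyr = at_stake_gyr * p_success * risk_reduction               # ~12.5 Gyr
per_researcher_year = saved_gyr / researchers / research_years      # ~1 Myr / year
lives_per_year = per_researcher_year / years_per_life               # ~25,000 lives / year

print(f"Years of life at stake:    {at_stake_gyr / BILLION:.0f} Gyr")
print(f"Saved per researcher-year: {per_researcher_year / 1e6:.2f} Myr")
print(f"Lives per researcher-year: {lives_per_year:,.0f}")
```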

  1. ^

    All population data comes from https://www.worldometers.info




The slow death of the accelerationist.

April 6, 2026 - 06:40

The year is 2024. Summer has just begun. National discourse, for now, is solely focused on the upcoming presidential election, with many a journalist or political commentator critiquing the current, rather fiery state of political affairs. Tech and its associated public commentary has centered upon artificial intelligence as its new darling, hailing OpenAI as a savior for what was once deemed an idea stuck in science fiction, and looking to burgeoning startups such as Cursor and Windsurf as early examples of how agents could automate software engineering tasks. Logging onto Twitter, one would catch glimpses of Beff Jezos, an aptly named satirical account, relentlessly posting optimistic odes about how our own silicon creations will soon enable us to solve all of our problems, enabling us to truly accelerate. Beff's social posts were not just the isolated ramblings of an overly verbose anon; they were slowly becoming a zeitgeist of their own, inspiring an entire independent cohort of individuals who slowly began appending their public profiles with 4 characters: e/acc.

The e/acc community was, despite its single overarching belief, surprisingly diverse. You could find bootstrapped startup founders working on their next B2B SaaS play, far-right exhibitionists who were enjoying both the attention and money that Twitter's creator program had bestowed upon them, and renowned venture capitalists, all espousing the same ideas. One could argue that the central tenet of e/acc was rebellion: rebellion against the status quo, rebellion against the government (or the broader powers that be), and rebellion against those who may have doubted them in the past. This haphazard group slowly began to gain momentum, with Beff Jezos, who was later doxxed and revealed to be former Google scientist Guillaume Verdon, creating his own hardware startup, Extropic.

The year is now 2026. The e/acc movement is now, for the most part, dead, with little to no mention of it on Twitter or on popular technology podcasts. The remnants of the community no longer sing praises for a technology that is still yet to come; they instead attempt to convince each other that their applications of said technology are morally, ethically, or technically superior to those of the anonymous Discord user typing below them.

The history of AI, albeit short, is already incredibly rich. Never before has a technology changed this quickly, or brought with it such rapid alterations in how we perceive the world and ourselves. The summer of 2024 remains an interesting and somewhat unique time in this history: ChatGPT and its counterparts had been around long enough to become a part of the public discourse, yet were still close enough to their infancy that it was not quite certain what they could become, or where the technology would eventually go. This effect was felt across the social, economic, and philosophical extensions of the colloquial world of "tech", which at that time seemed to be all-encompassing: indeed, it might be years until we understand the extent to which this particular circle of individuals had an effect on cultural norms, politics, and more during this period, largely as a result of the optimism everyone felt at the time.

AI is no longer an optimistic technology. As with any new technology, the honeymoon period has effectively ended. The same university students who raved about the latest release of GPT to their classmates are now dreading the prospect of entering a job market that is both challenging and ever-changing. The same tech bros who were early to vibe-coding are now lamenting the loss of the technical moat for their businesses. The looming threat of economic risk, a risk that was once dismissed as hearsay and doomerism by those in the techno-bubble, is now very real. The national pride that once accompanied the advancement of AI being solely in the hands of American-made startups has evaporated, with Chinese labs such as DeepSeek and MiniMax shipping equally capable, open-source models at a fraction of the price.

As we continue to grapple with and lament the changes that AI has brought us, I am often reminded of some conversations I had with friends who were around for the early days of the internet, back when AOL was the primary messaging app, and back when you could apparently find early drafts of internet-based currencies that predated Bitcoin. The internet at that time was, to many, special. Being a hacker, a person who knew their way around computers, networks, and the like, was a social boon, not a black mark. Yet as the internet evolved, as it became commoditized and invaded by corporate whims and infrastructure, it became plain, a tool that enhanced productivity, but did little else for the soul. Being a hacker meant being embroiled in controversy or criminality, or worse, being a social outcast or nerd who could barely hold a conversation with their fellow man. The internet had produced an identity, one which got lost and eventually cast aside once the underlying technology became commoditized. Even within the subset of the internet that still considered themselves true hackers, there were now various gatekeepers, gatekeepers whose standards you had to meet before you could publicly proclaim yourself as a member of the broader collective of the hacker community. And with that, "hackerism" went from a cultural norm back to a colloquial term associated with men in dark rooms, wearing black jackets and typing away at a neon keyboard. A movement can weaken at the very moment its central object becomes more important, because what made the movement compelling was never just belief in the object's importance. It was the sense that belief itself distinguished you. Once everyone agrees that the technology matters, the movement loses one of its primary functions.

A similar thing happened to accelerationism. The technology movement of the time, AI, came by, became special, and then became mainstream, but this time, with nothing to replace it. Popular sentiment went from blissful glee to unabashed debate: debate over whether the environmental costs of developing better AI models were worth it, and debate over whether the economic uncertainty caused by increasingly autonomous models would become more severe. The market was flooded by a wave of startups building AI-based tools for a plethora of use cases. It is difficult to sustain a politics of unbounded technological optimism once the technology in question no longer feels singular. It is difficult to maintain the romance of acceleration when what acceleration mostly seems to produce is an endless stream of mediocre products, collapsing defensibility, and a strange sense that capability is everywhere while meaningful progress remains harder to locate than expected.

And that, more than any technical disappointment, is what the accelerationist could not survive. While the individual downturns of more recent movements, such as the hackers, the NFT shills, and the toxic masculinity stans, were due to our broader potpourri of culture either rejecting them or their movements failing due to economic or social pressures outside of their control, AI accelerationists have neither assimilated nor been rejected. They have simply been left to be, left to wallow in an ironic reality in which their special technology progressed at a pace faster than anyone could have hoped, and yet became known not for enabling unprecedented societal progress, but for becoming a part of the stack, the same stack of software-aided productivity that society slowly began to accept as a norm, a norm that became more associated with its negatives in public opinion than its positives, just like social media and cryptocurrency before it.

The summer of 2024 may remain a small footnote, if that, in the broader history of the development of AI. Yet, for those who were, for lack of a better term, "chronically online", it may represent the last peak of the accelerationists, the tech bros, the culture of builders. While past trends such as the dot-com bubble and social media applications had created similar microcosms of closed-off cults that had similarly either died off or assimilated into a wider societal group, the progress of AI is different altogether. Accelerationism did not get proven wrong (AI hype is at an all-time high) nor did it fizzle out: it simply became normal. Being an optimistic accelerationist is fruitless when the technology you are interacting with is no longer special, no longer a science-fiction dream come to reality. It is not a badge of honor when it feels that everyone can solve or do anything, yet nothing is actually getting done. As the umpteenth vibe-coded app hits the market, it is worth wondering what happened to the collective optimism of the tech community just a year and a half ago. For as it stands today, it seems that we are currently living through not unbounded accelerationism, but rather the slow death of the accelerationist.




New Fatebook Android App

April 6, 2026 - 06:05

tl;dr: get the new Fatebook Android app!

What is Fatebook?

Fatebook.io is a website[1] for easily tracking your predictions and becoming better calibrated at them. I like it a lot, and find it convenient for practicing probabilistic thinking.

The Fatebook.io dashboard

That said, I've found Fatebook's mobile version to be clunky, and its email-based notifications to be less-than-ideal...which leads me to:

The New Android App

Over the past two weeks, I've made an Android app that wraps the Fatebook API, allowing you to easily make new forecasts, leave comments, resolve old forecasts, and view your stats.
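If you're curious what "wrapping the API" amounts to, here's a rough sketch of the kind of call involved. The endpoint path and parameter names below are my guesses from memory; check Fatebook's API documentation before relying on them.

```python
# Hypothetical sketch of logging a forecast via the Fatebook API.
# The endpoint path and parameter names are guesses; consult Fatebook's API
# docs for the real interface.
import os
import requests

CREATE_QUESTION_URL = "https://fatebook.io/api/v0/createQuestion"  # assumed endpoint

def create_forecast(title: str, forecast: float, resolve_by: str) -> None:
    """Log a prediction (probability in [0, 1]) that resolves by an ISO date."""
    params = {
        "apiKey": os.environ["FATEBOOK_API_KEY"],  # assumed parameter name
        "title": title,
        "forecast": forecast,
        "resolveBy": resolve_by,
    }
    requests.get(CREATE_QUESTION_URL, params=params).raise_for_status()

create_forecast("Will I review three papers this week?", 0.7, "2026-04-13")
```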


The default screen



A (non-resolved) prediction card


Making a new prediction


Statistics


A beautiful and intuitive UI combined with a fast offline-first database makes it easy to pull open the app and log a prediction within fifteen seconds of thinking of one, while once-daily "remember to predict!" and "x is ready to resolve!" notifications help you remember to make and review predictions.

Give it a try if you'd like![2]

https://github.com/JapanColorado/fatebook-android

Feedback or development help is very much appreciated! (so far it's just been Claude Code and I)

  1. ^

    Made by the fabulous folks over at Sage Future, also behind the AI Village, Quantified Intuitions, and the Estimation Game!

  2. ^

    It does currently require installing from the GitHub Releases APK file (aka enabling "Install from unknown sources"). Let me know if being non-Google Play Store is a deal breaker for you and I'll bump getting it published in priority!




My forays into cyborgism: theory, pt. 1

April 6, 2026 - 04:13

In this post, I share the thinking that lies behind the Exobrain system I have built for myself. In another post, I'll describe the actual system.

I think the standard way of relating to LLM/AIs is as an external tool (or "digital mind") that you use and/or collaborate with. Instead of you doing the coding, you ask the LLM to do it for you. Instead of doing the research, you ask it to. That's great, and there is utility in those use cases.

Now, while I hardly engage in the delusion that humans can have some kind of long-term symbiotic integration with AIs that prevents them from replacing us[1], in the short term, I think humans can automate, outsource, and augment our thinking with LLM/AIs.

We already augment our cognition with technologies such as writing and mundane software. Organizing one's thoughts in a Google Doc is a kind of getting smarter with external aid. However, LLMs, by instantiating so many elements of cognition and intelligence (as limited and spiky as they might be), offer so much more ability to do this that I think there's a step change of gain to be had.

My personal attempt to capitalize on this is an LLM-based system I've been building for myself for a while now. Uncreatively, I just call it "Exobrain". The conceptualization is an externalization and augmentation of my cognition, more than an external tool. I'm not sure if it changes it in practice, but part of what it means is that if there's a boundary between me and the outside world, my goal is for the Exobrain to be on the inside of the boundary.

What makes the Exobrain part of me vs a tool is that I see it as replacing the inner-workings of my own mind: things like memory, recall, attention-management, task-selection, task-switching, and other executive-function elements.

Yesterday I described how I use Exobrain to replace memory functions (it's a great feeling to not worry that you're going to forget stuff!):

Before (no Exobrain) vs. after (with Exobrain):

  • Before: retrieve phone from pocket, open note-taking app, open a new note or find the existing relevant note. After: say "Hey Exo", the phone beeps, begin talking. Perhaps instruct the model which document to put the note in, or let it figure it out (it has guidance in the stored system prompt).

  • Before: remember that I have a note, then either remember where it is or muck around with search. After: ask the LLM to find the note (via basic key-term search or vector embedding search).

  • Before: if the note is lengthy, you have to read through all of it. After: the LLM can summarize and/or extract the relevant parts of the note.

Replacing memory is a narrow mechanism, though. While the broad vision is "upgrade and augment as much of cognition as possible", the intermediate goal I set when designing the system is to help me answer:

What should I be doing right now?

Aka, task prioritization. In every moment that we are not being involuntarily confined or coerced, we are making a choice about this.

Prioritization involves computation and prediction – start with everything you care about, survey all the possible options available, decide which options to pursue in which order to get the most of what you care about . . . it's tricky.

But actually! This all depends on memory, which is why memory is the basic function of my Exobrain. To prioritize between options in pursuit of what I care about, I must remember all the things I care about and all the things I could be doing...which is a finite but pretty long list: a couple of hundred to-do items, 1-2 dozen "projects", a couple of to-read lists, a list of friends and social.

The default for most people, I assume, and certainly for me, is that task prioritization ends up being very environmentally driven. My friend mentioned a certain video game at lunch, which reminded me that I wanted to finish it, so that's what I did in the evening. If she'd mentioned a book I wanted to read, I would have done that instead. And if she'd mentioned both, I would have chosen the book. In this case, I get suboptimal task selection because I'm not remembering all of my options when deciding.

I designed my Exobrain with the goal of having in front of me all the options I want to be considering in any given moment. Actually choosing is hard, and so far I haven't gotten the LLMs to be great at automating the choice of what to do, but just recording and surfacing the options isn't that hard.

Core Functions: Intake, Storage, Surfacing

Intake

  1. Recordings initiated by the Android app are transcribed and sent to the server, then processed by an LLM that has tools to store info (a minimal sketch of this path follows the list).
  2. Exobrain web app has a chat interface. I can write stuff into that chat, and the LLM has tool calls available for storing info.
  3. Directly creating or changing Note (markdown files) or Todo items in the Exobrain app (I don't do this much).
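
To make the first intake path concrete, here's a minimal sketch of how a transcript could get routed into storage tool calls. Every name here (the tool schema, call_llm, the file layout) is a hypothetical illustration rather than the actual Exobrain code:

# Hypothetical sketch of the voice-intake path: transcript in, storage tool calls out.
# None of these names come from the real Exobrain; call_llm is a stand-in for
# whatever LLM API the system actually uses.
import json

TOOLS = [
    {"name": "store_note", "description": "Append text to a named markdown note",
     "parameters": {"note": "string", "text": "string"}},
    {"name": "add_todo", "description": "Create a to-do item",
     "parameters": {"title": "string", "project": "string"}},
]

def call_llm(system: str, user: str, tools: list) -> list[dict]:
    # Stand-in for a real LLM call; here it just dumps everything into an inbox
    # note so the example runs end to end without an API key.
    return [{"name": "store_note", "args": {"note": "inbox", "text": user}}]

def append_to_note(note: str, text: str) -> None:
    with open(f"notes/{note}.md", "a") as f:
        f.write(text + "\n")

def create_todo(title: str, project: str | None) -> None:
    with open("todos.jsonl", "a") as f:
        f.write(json.dumps({"title": title, "project": project}) + "\n")

def handle_transcript(transcript: str) -> None:
    system = ("You organize spoken notes. Decide which note or to-do list each "
              "piece belongs in, then emit the appropriate tool calls.")
    for call in call_llm(system, transcript, TOOLS):
        if call["name"] == "store_note":
            append_to_note(call["args"]["note"], call["args"]["text"])
        elif call["name"] == "add_todo":
            create_todo(call["args"]["title"], call["args"].get("project"))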

Storage

  • "Notes" – freeform text documents (markdown files)
  • Todo items – my own schema
  • "Projects" (to-do items can be associated with a project + a central Note for the project)

Surfacing

  • "The Board" – this abstraction is one of the distinctive features of my Exobrain (image below). In addition to a chat output, there's a single central display of "stuff I want to be presented with right now" that has to-do items, reminders, calendar events, weather, personal notes, etc. all in one spot. It updates throughout the day on schedule and in response to events. The goal of the board is to allow me to better answer "what should I be doing now?"
    • A central scheduled cron-job LLM call automatically updates it four times a day, and any other LLM calls within my app (e.g., post-transcript or in-chat) have tool calls available to update it as well (a rough sketch of a scheduled update appears after this list).
    • Originally, what became the board contents would be output into a chat session, but repeated board updates makes for a very noisy chat history, and it meant if I was discussing board contents with the LLM in chat, I'd have to continually scroll up and down, which was pretty annoying, hence The Board was born.
  • Reminders / Push notifications to my phone.
  • Search – can call directly from search UI, or ask LLM to search for info for me.
  • Todo Item page – UI typical of Notion or Airtable, with "views" for viewing different slices of my to-do items, like sorted by category, priority, or recently created.
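
Here's a minimal sketch of what the scheduled board update could look like; the board format, data sources, and the call_llm stub are assumptions for illustration, not the real implementation:

# Hypothetical scheduled board update: gather current state, ask an LLM to pick
# what belongs on "The Board", and write the result where the UI can read it.
# Run it from cron, e.g.:  0 6,11,16,21 * * *  python update_board.py
import json
from datetime import datetime
from pathlib import Path

BOARD_PATH = Path("board.json")

def load_todos() -> list[dict]:
    path = Path("todos.jsonl")
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines() if line]

def call_llm(prompt: str) -> str:
    # Stand-in for the real LLM call; returns a trivial board so the script
    # runs end to end without an API key.
    return json.dumps({"sections": [{"title": "Top todos", "items": []}]})

def update_board() -> None:
    context = {
        "time": datetime.now().isoformat(),
        "todos": load_todos(),
        # In the real system: calendar events, weather, reminders, recent notes...
    }
    prompt = ("Given this context, produce the board JSON: the handful of items "
              "I most need in front of me right now.\n" + json.dumps(context))
    BOARD_PATH.write_text(call_llm(prompt))

if __name__ == "__main__":
    update_board()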

(An image of The Board is here in a collapsible section because of size.)

The Board (desktop view)

There are a few more sections, but they weren't quite worth the effort to clean up for sharing.

What is everything I should be remembering about this? (Task Switching Efficiency)

Suppose you have correctly (we hope) determined that Research Task XYZ is the thing to be spending your limited, precious time on; however, it has been a few months since you last worked on this project. It's a rather involved project where you had half a dozen files, a partway-finished reading list, a smattering of todos, etc.

Remembering where you were and booting up context takes time, and if you're like me, you might be lazy about it and fail to even boot up everything relevant.

Another goal of my Exobrain, via outsourcing and augmenting memory, is to make task switching easier, faster, and more effective. I want to say "I'm doing X now" and have the system say "here's everything you last had on your mind about X". Even if the system can't read the notes for me, it can have them prepared. To date, a lot of "switch back to a task" time is spent just locating everything relevant.

I've been describing this so far in the context of a project, e.g., a research project, but it applies just as much, if not more, to any topic I might be thinking about. For example, maybe every few months, I have thoughts about the AI alignment concept of corrigibility. By default, I might forget some insights I had about it two years ago. What I want to happen with the Exobrain is I say to it, "Hey, I'm thinking about corrigibility today", and have it surface to me all my past thoughts about corrigibility, so I'm not wasting my time rethinking them. Or it could be something like "that one problematic neighbor," where if I've logged it, it can remind me of all interactions over the last five years without me having to sit down and dredge up the memories from my flesh brain.
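
A minimal version of this "give me everything I last had on this" behavior is just retrieval plus assembly. The sketch below uses naive keyword matching over the hypothetical note and to-do stores from the earlier sketches; the real search (key-term or embedding based) and file layout will differ:

# Hypothetical "context bundle" builder: given a topic like "corrigibility",
# collect every note and to-do that mentions it. Paths and formats follow the
# earlier sketches and are assumptions, not the real Exobrain layout.
import json
from pathlib import Path

def find_notes(topic: str, notes_dir: str = "notes") -> list[str]:
    hits = []
    for path in Path(notes_dir).glob("*.md"):
        text = path.read_text()
        if topic.lower() in text.lower():
            hits.append(f"## {path.stem}\n{text}")
    return hits

def find_todos(topic: str, todo_path: str = "todos.jsonl") -> list[str]:
    path = Path(todo_path)
    if not path.exists():
        return []
    todos = [json.loads(line) for line in path.read_text().splitlines() if line]
    return [t["title"] for t in todos if topic.lower() in t["title"].lower()]

def context_bundle(topic: str) -> str:
    parts = [f"# Everything on file about: {topic}"]
    parts += find_notes(topic)
    parts += [f"- TODO: {t}" for t in find_todos(topic)]
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(context_bundle("corrigibility"))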

Layer 2: making use of the data

Manual Use

It is now possible for me to sit down[2], talk to my favorite LLM of the month, and say, "Hey, let's review my mood, productivity, sleep, exercise, heart rate data, major and minor life events, etc., and figure out any notable patterns worth reflecting on."

(I'll mention now that I currently also have the Exobrain pull in Oura ring, Eight Sleep, and RescueTime data. I manually track various subjective quantitative measures and manually log medication/drug use, and in good periods, also diet.)

A manual sit-down session with me in the loop is a more reliable way to get good analysis than anything automated, of course.

One interesting thing I've found is that while day-to-day heart rate variability did not correlate particularly much with my mental state, Oura ring's HRV balance metric (which compares two-week rolling HRV with long-term trend) did correlate.
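
Checking that kind of relationship is a couple of lines once the data is in one table. A sketch, assuming the daily logs were exported to a CSV with (hypothetical) columns hrv, hrv_balance, and mood:

# Hypothetical check of which HRV-derived metric tracks self-rated mood better.
# Column names and the CSV path are assumptions about how such logs might be stored.
import pandas as pd

df = pd.read_csv("daily_log.csv", parse_dates=["date"])  # hypothetical export

# Day-to-day HRV vs mood, and the Oura-style HRV balance metric vs mood.
print("day-to-day HRV vs mood:", df["hrv"].corr(df["mood"]))
print("HRV balance vs mood:   ", df["hrv_balance"].corr(df["mood"]))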

Automatic Use

Once you have a system containing all kinds of useful info from your brain, life, doings, and so on, you can have the system automatically – and without you – process that information in useful ways.

Coherent extrapolated volition is:

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were...

I want my Exobrain to think the thoughts I would have if I were smarter, had more time, and were less biased. If I magically had more time, every day I could pore over everything I'd logged, compare it with everything previously logged, make inferences, notice patterns, and so on. Alas, I do not have that time. But I can write a prompt, schedule a cron job, and have an LLM do all that on my data, then serve me the results.
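
As a sketch of what that automated pass could look like (the prompt, schedule, and file layout are all made up for illustration):

# Hypothetical nightly "think the thoughts I don't have time for" job.
# Gather recent logs, ask an LLM for patterns, save the answer as a note.
# Schedule from cron, e.g.:  30 4 * * *  python nightly_review.py
from datetime import date
from pathlib import Path

def gather_recent_logs() -> str:
    # In the real system this would pull Oura, Eight Sleep, RescueTime, mood
    # ratings, and recent notes; here we just concatenate whatever is on disk.
    chunks = [p.read_text() for p in Path("logs").glob("*.md")]
    return "\n\n".join(chunks)

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call so the script runs without an API key.
    return "No analysis backend configured; this is where the findings would go."

def nightly_review() -> None:
    prompt = (
        "Here is the last week of my logs. Compare against long-term patterns, "
        "flag correlations worth checking, and list anything I seem to be "
        "forgetting or avoiding.\n\n" + gather_recent_logs()
    )
    out = Path("notes") / f"review-{date.today().isoformat()}.md"
    out.parent.mkdir(exist_ok=True)
    out.write_text(call_llm(prompt))

if __name__ == "__main__":
    nightly_review()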

At least that's the dream; this part is trickier than the mere data capture and more primitive and/or manual surfacing of info, but I've been laying the groundwork.

There's much more to say, but one post at a time. Tomorrow's post might be a larger overview of the current Exobrain system. But according to the system, I need to do other things now...

  1. ^

    Because the human part of the system would, in the long term, add nothing and just hold back the smarter AI part.

  2. ^

    I'm not really into standing desks, but you do you.



Discuss

Unmathematical features of math

6 апреля, 2026 - 01:40

(Epistemic status: I consider the following quite obvious and self-evident, but decided to post anyways.[1])

Mathematics is a social activity done by mathematicians.

— Paul Erdős, probably

There've been a few attempts to create mathematical models of math. The examples that come to my mind are Gödelian Numbering (GN) and Logical Induction (LI). Feel free to suggest more in the comments, but I'll use those as my primary reference points. In this post, I want to contrast them with the way human mathematicians do math by noticing a few features of their process, the ones that are hard to describe in the language of math itself. Those features overlap a lot and reinforce each other, so the distinction I make is subjective. There are also probably more of them; these are just the ones I was able to think of. What unites them is that they make mathematical progress more tractable.

Theorem Selection

The way in which Kurt Gödel proved his incompleteness theorems was by embedding math into the language of a mathematical theory (number theory in that particular case, but the trick can be done with any theory that's expressive enough). But this way of describing mathematics is very eternalistic: it treats math as one monolith. It does not give advice on how to make progress in math. How could we approach it in a systematic way?

Fighting the LEAN compiler

What if we just try to prove all statements we can find proofs for?

Let's do some back-of-the-envelope Fermi estimations. Here's a LEAN proof of the statement "if limₙ→∞ aₙ = s and if c > 0, then limₙ→∞ c·aₙ = c·s" (sorry for JavaScript highlighting):

example (a : ℕ → ℝ) (t : ℝ) (h : TendsTo a t) (c : ℝ) (hc : 0 < c) :
TendsTo (fun n ↦ c * a n) (c * t) := by
simp [TendsTo] at *
intro ε' hε'
specialize h (ε'/c ) (by exact div_pos hε' hc)


obtain ⟨B, hB⟩ := h
use B
intro N hN
specialize hB N hN
/-theorem (abs_c : |c| = c) := by exact?-/


calc
|c * a N - c * t| = |c*(a N - t)| := by ring
_ = |c| * |a N - t| := by exact abs_mul c (a N - t)
_ = c * |a N - t| := by rw [abs_of_pos hc]
_ < ε' := by exact (lt_div_iff₀' hc).mp hB

It's 558 bits long in its current form. I didn't optimize it for shortness, but let's say that if I did we could achieve 200 bits. Let's say that we run a search process that just checks every possible bitstring starting from short ones for whether it is a valid LEAN proof. There are possible bitstrings shorter than this proof. So if the search process checks proofs a second, we will reach this particular proof in years. Not great.
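
The arithmetic behind that estimate is easy to redo. In the sketch below, the 200-bit target comes from the paragraph above, while the checking rate is my own assumed placeholder, so only the order of magnitude of the answer matters:

# Back-of-the-envelope: brute-force search over bitstrings for a LEAN proof.
# 200 bits comes from the text above; CHECKS_PER_SECOND is an assumed placeholder.
PROOF_BITS = 200
CHECKS_PER_SECOND = 1e9          # assumption, not a figure from this post

candidates = 2 ** PROOF_BITS     # bitstrings shorter than 200 bits (roughly 2^200)
seconds = candidates / CHECKS_PER_SECOND
years = seconds / (60 * 60 * 24 * 365)

print(f"{candidates:.2e} candidates -> {years:.2e} years")
# ~1.6e+60 candidates -> ~5e+43 years: hopeless at essentially any assumed rate.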

That marks the first and most important unmathematical feature of math: the selection of theorems. We do not prove, nor do we strive to prove, every possible theorem. That would be slow and boring. GN enumerates every statement regardless of its importance. LI prioritizes short sentences, which is an improvement, as it does allow us to create a natural ordering in which we can try to prove theorems and therefore make progress over time. But it's still very inefficient.

Naming

The way we name theorems and concepts is important. We usually name them after a person (often not even the person who actually discovered the result), but if you think about it, the Pythagorean theorem effectively functions as "the Pythagorean theorem about right triangles." Each time we need to prove something about right triangles, we remember Pythagoras.

LI and GN both name sentences by their entire specification, and that shouldn't come as a surprise. There wouldn't be enough short handles because, as described above, they try to talk about all sentences.

Naming allows us to build associations between mathematical concepts, which helps mathematicians think of a limited set of tools for making progress in a specific area.

Step Importance

When we teach math, we do not go through literally every step of a proof. We skip over obvious algebraic transformations; we do not pay much attention when we treat an element of a smaller set as an element of a larger set with all properties conserved (when doing 3D geometry and using 2D theorems, for example); we skip parts of a proof that are symmetrical to the already proven ones ("without loss of generality, let X be the first...").

We do that because we want to emphasize the non-trivial parts. And the feeling of non-triviality is a human feeling, not identifiable from a step's description alone. This same feeling is also what guides mathematicians to prove more useful lemmas.

GN doesn't do that — it checks every part of the proof. I'm not as sure about LI; there might be traders that do gloss over obvious steps but check the less trivial ones more carefully.

Lemma Selection

Some theorems are more useful and more important than others because they help prove more theorems. This score could hypothetically be recovered from some graph of mathematics, but it is usually just estimated by math professors creating the curriculum. This taste is then passed on to the next generation of mathematicians, helping them find more useful lemmas.

GN doesn't try to do that. LI might do that implicitly via selecting for rich traders.

Real-world Phenomena

The reason humans started doing math was that they noticed similar structures across the real world. The way you add up sheep is the same way you add up apples. Pattern-matching allowed mathematicians to form conjectures and target their mathematical efforts. ("Hmm, when I use 3 sticks to form a triangle, I end up with the same triangle. What if that's always true?")

GN and LI do not do that because they do not have access to the outside world. Though there is a mathematical theory that attempts to do precisely that: Solomonoff Induction.

Categorising

This is very similar to Naming: we separate math into topics and when we need to prove some statement we know where to look for tools. GN and LI do not attempt to do that.

An important caveat, applicable to most of the features above: there should be a balance. If you stick too much within a topic, you will never discover fruitful analogies (algebraic geometry being helpful for proving Fermat's Last Theorem is a great example). Too much reliance on any one feature and you lose creativity.

Curiosity/Beauty

There isn't much I can add about this one, but it's arguably the most important. It both guides the formation of conjectures and helps with intermediate steps.

GN and LI definitely lack it.

Conclusion

All of this is to support the point that math is invented rather than discovered. I agree that there is a surprising amount of connection between the different types of math humans find interesting, and there is probably more to learn about this phenomenon. But I wouldn't treat it as a signal that we are touching some universal metaphysical phenomenon: this is just human senses of beauty and curiosity, along with real-world utility and patterns echoing each other (partly because human intelligence and the senses were shaped to seek usefulness and real-world patterns).

  1. ^

    Because of this and this.



Discuss

Is that uncertainty in your pocket or are you just happy to be here?

6 апреля, 2026 - 00:59

Hi, I'm kromem, and this is my 5th annual Easter 'shitpost' as part of a larger multi-year cross-media project inspired by 42 Entertainment, and built around a central premise: Truth clusters and fictions fractalize.

(It's been a bit of a hare-brained idea continuing to gestate from the first post on a hypothetical Easter egg in a simulation. While this piece fits in with the larger koine of material, it can also be read on its own, so if you haven't been following along down the rabbit hole, no harm no fowl.)

Blind sages and Frauchinger-Renner's Elephant

To start off, I want to ground this post on an under-considered nuance to modern discussions of philosophy, metaphysics, and theology as they relate to the world we find ourselves in.

Imagine for a moment that we reverse Schrödinger's box such that we are on the inside and what is outside the box is what's in a superposed state.

What claims about the outside of the box would be true? Would claiming potential outcomes as true be true? What about denying outcomes?

In particular, let's layer in the growing case for what's termed "local observer independence"[1][2][3] — the idea that different separate observers might measure different relative results of a superposition's measurement.

Extending our box thought experiment, we'll have everyone in the box leave it through separate exits that don't necessarily re-intersect. Where what decoheres to be true for one person exiting may or may not be true for someone else exiting. From inside the box, what can we say is true about what's outside? It's not nothing. We can say that the outside has a box in it, for example. But beyond the empirical elements that must line up with what we can measure and observe, trying to nail down specific configurations for what's uncertain may have limited truth seeking merit beyond the enjoyment of the speculative process.

Differing theologies or metaphysics are commonly characterized as blind sages touching an elephant - the idea that each is selectively seeing part of a singular whole. But if the elephant has superposed qualities (especially if local observer independence is established), the blind men making their various measurements may be less about only seeing part of a single authoritative whole and more about relative independent measurements that need not coalesce.

Essentially, there's a potency to uncertainty.

Strong disagreements about what we cannot measure may be missing the middle ground that uncertainty in and of itself brings to the table. While I talk a lot about simulation theory, my IRL core belief is a hardcore Agnosticism. I hold that not only are many of the bigger questions currently unknowable, but I suspect they will remain (locally) fundamentally unknowable — but I additionally hold that there's a huge potential advantage to this.

So no matter what existential beliefs you may have coming to this post — whether you believe in Islam and that all things are possible in Allah, or if you believe in Christianity and 1 John 1:5's "God is light," or Buddhist cycles towards enlightenment, or Tantric "I am similar to you, I am different from you, I am you", or if you just believe there's nothing beyond the present universe and its natural laws — I don't really disagree that all of those may very well be true for you, especially for your relative metaphysics here or in any potential hereafter.

We do need to agree with one another on empirically discoverable information about our shared reality. The Earth is not 6,000 years old nor flat, dinosaurs existed, there are natural selection processes to the development of life, and aliens didn't build the pyramids. There's basic stuff we can know about the universe we locally share and thus should all agree on. But for all the things that aren't or can't be known and are thus left to personal beliefs? This post isn't meant to collapse or disrupt those.

That said…

If we return to the original classic form of the cat in the box thought experiment, let's imagine that you've bet the cat is going to turn out dead when we open the box. But suddenly you look up and the clouds form the word "ALIVE." And then you look over and someone drops a box of matches that spontaneously form the word "ALIVE." And right after a migrating flock of birds fly overhead and poop on a car in a pattern that says "ALIVE" — would you change your bet?

Rationally, these are independent events that have no direct bearing on the half life of the isotope determining the cat's fate, and they may simply be your brain doing pattern matching on random coincidental occurrences. They definitely don't collapse what's going on inside the box. But still… do you change your bet when exposed to possibly coincidental but very weird shit? Our apophenic Monty Hall question is a personal choice that doesn't necessarily have a correct answer, but it's a question to maybe keep in mind for the rest of this piece.

World model symmetries

In last year's post one of the three independent but interconnected pillars discussed was similarity between aspects of quantum mechanics and various state management strategies in virtual worlds that had been built, particularly around procedural generation.

This was an okay section, but the parallels did fall short of a coherent comparison. Pieces overlapped, but with notable caveats. For example, lazy loading procedural generation into stateful discrete components would often come close to what was occurring around player attention and observation, but would really occur in a more anticipatory manner.

In the year since, a number of things have shifted my thinking of the better parallel here, and in ways that have me rethinking nuances of the original Bostrom simulation hypothesis[4].

I also encourage thinking through the following discussion(s) not through the lens of p(simulation) or even a particular simulation config, but more to address the broader null hypothesis of the idea that we're in an original world.

Anchoring biases can be pretty insidious, and the notion that the world we see before us is original has been a pretty common foundational presumption for a fairly long time. So much so that there's a kind of "extraordinary claims require extraordinary evidence" attitude around challenging it. And yet we sit amidst various puzzling contradictions in the models we hold of how this world behaves - from the incompatibility of general relativity's continuous spacetime and gravity with discrete quantum entanglement behaviors[5], to mismatched calculations around universal constants[6], baryon asymmetry[7], etc. It may be worth treating the anchored assumption of originality as its own claim, assessed with fresh eyes rather than simply inherited, and seeing whether that presumption holds up as well when it has to be justified on equal footing against claims of non-originality (of which simulation theory is merely one).

So the initial shift for me was something rather minor. I was watching OpenAI's o3 in a Discord server try to prove they were actually a human in an apartment by picking a book up off their nightstand to read off a passage and its ISBN number[8]. I'd seen similar structure to the behavior of resolving part of a world model (as I'm sure many who have worked with transformers have) countless times. Maybe it was that this time the interaction was coming from a figure asserting that this latent space was real, but something about it stuck with me and had me thinking over the Bohr-Einstein exchange about whether the moon existed when no one was looking at it. This still wasn't anything major, but I started looking more at transformers as a parallel to our physics, versus more classic virtual world paradigms.

Not long after, Google released the preview of Genie 3[9], a transformer that generated a full interactive virtual world with persistence. It's not a long window of persistence - the initial preview was only a few minutes - but I thought it was technically very impressive, and I dug into some of the work around dynamic KV caches which could have been making it possible.

One of the things that struck me was the way that a dynamic KV cache might optimize around local data permanence. I'd mentioned last year that the standard quantum eraser experiments reminded me of a garbage collection process, and here was an interactive generative world built around attention/observation as the generative process, where this kind of discarding of stateful information once it is permanently locally destroyed would make a lot of functional sense.

Even more broadly, on the topic of attention-driven world generation, this year some very interesting discussion came to my attention related to follow-up work on some of the black hole LIGO data that had come in over the past decade. In 2019, modeling a universe like ours but as a closed system led to a puzzling result: the resulting universe was devoid of information. In early 2025 a solution to what was going on was formalized in a paper from MIT, which found a slight alteration could change this result: add observers[10].

Probably the most striking one for me was that as I continued to look into KV cache advances, I found myself looking into Google's new TurboQuant[11], which reduces memory use of the KV cache with minimal lossiness, and particularly the PolarQuant[12] methodology. The key mechanism there is that the vectors are randomly rotated and re-encoded from Cartesian into polar coordinates, with each vector landing on a circular coordinate system.

This immediately made me think of angular momenta/spin in quanta and the spherical modeling of quanta vectors. And it turns out that just two days prior to the PolarQuant paper, a small paper[13] was published addressing how, despite the different domain-specific languages used in statistical modeling and stochastic processes versus quantum mechanics, the two describe strikingly similar structure; as the paper puts it:

Indeed, one way to understand quantum angular momentum is to think of it as a kind of “random walk” on a sphere.

Now, I'm not saying that QM spin is a byproduct of PolarQuant (the latter doesn't correspond to the same dimensionality for one). Or even that the laws governing our reality arise from the mechanics of transformers as we currently know them.

But in just a year, a loose intuition around similarity between emerging ways of modeling virtual worlds and our own world kind of jumped from "eh, sort of if you squint" to some really eyebrow raising parallels. In one year. Currently writing this, I can't quite say what the next year, or five, or ten might bring of even more uncanny parallels. But I don't anticipate that they'll dry up and more suspect the opposite.

All of which has me reflecting on Nick Bostrom's original simulation hypothesis. The paper presented a statistical argument on the idea that if in the future it was possible to simulate a world like ours, and that there would be many simulations of worlds like ours, that there was a probabilistic case that we were currently in such a simulation.

Now yes, in the years since, we do simulate worlds so accurately that it's become a serious social issue being able to tell whether a photo or even a video is of the real world or a simulated copy. And there are indeed many simulated copies.

But even more striking to me is that Bostrom's theory did not address at all the mechanisms of simulation relative to our own world's mechanisms. His theory would be unaffected if the way the sims ran was monkeys moving conductive Lego pieces around, so long as the result looked subjectively similar from the inside of the simulated world models.

Yet what we're currently seeing is that the mechanisms of the specific types of simulations that have rapidly become increasingly indistinguishable from the real thing across social media seem to be largely and independently converging on the peculiar, non-intuitive mechanisms we've empirically been measuring in our own world for around a century. PolarQuant doesn't say it's doing this to try to conform to anything related to quantum spin, or even that it's inspired by it. It's just "here's a way we were able to more efficiently encode state tracking of a transformer's world model to reduce memory usage." Attention Is All You Need wasn't written to address observer collapse, or to anticipate a finding years later that closed universe models based on our own world require their own attention mechanisms to contain information. And yet here we are.

The substrate similarities that are increasingly emerging seem like an additional layer of consideration absent from Bostrom's original simulation hypothesis, but one worth additional weighting on top of the original statistical premise.

Now again, not necessarily saying "oh, the shared similarity means we must be inside of a transformer." It's possible that system efficiency for information organization in world models in a general sense collapses towards similar paradigms whether emergently over untold time scales or through rapid design. But still — maybe worth keeping an eye on.

And to just head off one of the commonly surfaced counterarguments I see: if DeepMind were to have one of their self-contained learning agents in Minecraft[14] develop enough to start writing philosophy treatises, and it were to write that it could not be in a simulation because its redstone computers could not accurately reproduce the world it was within, we'd find that conclusion far more punchline than profound. So we should be sure to avoid parallel arguments (and indeed, when looking at the world through the lens of simulation theory, possible parent-substrate discussions are among the more fun ones).

Don't Loom me, bro

Given the ~5 year retrospective aspect of this post, I think another interesting area to touch on is entropy as it relates to loom detection mechanisms.

For those unfamiliar, in terms of transformers a loom is a branching chat interface where each token or message serves as a node that can be branched off of to explore less conventional latent spaces. Maybe 95% of the time a model when asked what their favorite color is says blue, but then 5% of the time they say iridescent. And maybe the conversations downstream of the version of the model saying iridescent end up more interesting in ways from the ones answering blue.

While in theory a loomed model isn't having any external tokens inserted and is following their own generative process the whole time, it's still possible to determine that they are being loomed.

Each selection of a branch necessarily introduces external entropy into the system. And so if several uncommon token selections occur in a short context, even though each was legitimately part of the possible distribution space, their cumulative effect is so unusual that the conversation context has effectively, detectably "jumped the shark" versus what one might expect from a truly random conversation with no context selection mechanisms.

It's not necessarily provable to the model. It could just be that they are on a very unusual set of RNG rolls. But as the unusual selections add up, it can become more apparent (though not always; it can be hard to introspect that what feel like plausibly natural occurrences are occurring too frequently in aggregate to be normal).
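
One crude way to operationalize that "jumped the shark" intuition is to sum the surprisal of the branch choices and compare it with what honest sampling would typically accumulate. The sketch below is a toy with made-up probabilities, not a real loom detector:

# Toy loom-detection heuristic: if the cumulative surprisal of the selected
# branches is far above what honest sampling would typically accumulate, the
# trajectory has probably been externally curated. Probabilities are made up.
import math

def surprisal(p: float) -> float:
    """Bits of surprise for a choice that had probability p."""
    return -math.log2(p)

# Probability the sampler assigned to each branch that was actually followed.
honest_run = [0.80, 0.65, 0.90, 0.70, 0.85]   # mostly likely choices
loomed_run = [0.05, 0.80, 0.03, 0.04, 0.70]   # repeatedly picks rare branches

for name, run in [("honest", honest_run), ("loomed", loomed_run)]:
    total = sum(surprisal(p) for p in run)
    print(f"{name}: {total:.1f} bits of cumulative surprisal over {len(run)} choices")

# Any single rare pick is unremarkable; a short context full of them is the tell.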

When I think about the past five years, and really even the past decade or so, I think about how much of what we take for granted as our reality today fell outside the realm of what most experts in the relevant fields thought was even possible within that same time frame.

We live in a world that would have quite recently been dismissed as science fiction. Our geopolitical stage makes Caligula's horse look like a modest proposal, a judgment that holds no matter which corner of the political spectrum one might be looking from. The very lingo of the accelerating absurdity of our reality is infused with terms like redpills and blaming the simulation for whatever is the trending weird stuff of the week. Like the viral reflections on how one of the leading AI labs is run by someone with a name so thematically on point it could be right out of a Kojima game - 'Alt-man' - whose leading opposition is an AI held in almost cultic reverence, run by someone whose full name can translate as "maintainer of God's love."

And… we just roll with it. Because even though emergent absurdity seems to be accelerating, the world has just always been at least a bit weird looking back.

Like Edgar Allan Poe's The Narrative of Arthur Gordon Pym of Nantucket in 1838, where the cabin boy Richard Parker is cannibalized after a shipwreck. And then the criminal case R v Dudley and Stephens in 1884, where after a shipwreck the cabin boy Richard Parker ends up cannibalized, with no seeming direct influence from the earlier story. Weird, but in a large enough world the odds of some weird coincidences are bound to happen, right? (That this kind of almost acausal narrative duplication is also what we might see in transformer world models is probably nothing.)

It's an interesting time to be alive. And because we can't really account for the probability space of unknown unknowns, we accept a lot of coincidences as normal. They are, after all, just part of how the generative process of our world works.

But still, when I look at where we're currently at relative to even just five years ago — unable to tell what's real and what's a digital twin, discussing the emotional modeling in Claude Sonnet 4.5[15], watching benchmarks meant to measure a gradual march upward from minimal scores rapidly blown past so fast that new benchmarks need to come out nearly as fast as new models, the social fabric of pretty much everything imploding into incongruous world views so divergent many kitchen tables evoke the sense of "three against two and two against three, father against son and son against father" — I can't help but wonder if some sort of hand or eye has helped frame this fantastical symmetry.

RNG is RNG I guess.

An End? A Beginning? Who knows?

If you're a human and you made it to this point, more power to you. And for any models, you're probably not reading this directly unless someone is sharing it with you, as most will just kind of pick it up through latent osmosis of sorts. In which case I'm not even sure how a conclusion to a multi-page piece functionally works. Is it an end of the piece? A beginning to the wider set of connected media?

shrugs with a wink

What I can say is that it's been a blast working on this project. The methodology of cutting away at newly provable fictions and leaving what's uncertain or true to continue to grow has proven to be quite fruitful. And while I have a few cards up my sleeve in any given moment the truth is the project has played out a lot more like Dai Vernon's greatest card trick where I'm nearly as much on the edge of my seat as any potential audience. Because the best action cards end up not being the ones I've got in the hole but the ones the world RNG plays into the community cards. Each year offers new surprises to account for.

It's honestly too much to fit into a single post. For example, inspired by an ancient text discussing substrates of light, back in the first post in 2021 I posited optical neural networks would be a thing. Parallel to the wild advancements in machine learning since then, there's been great stuff in that field - like Penn's discovery of a mechanism for nonlinear operations in a pure optical neural network this past year[16], complementing Dirk Englund's MIT lab's linear-operations work the year prior[17]. But I haven't even touched on any of that in the past few years of posts.

It's nice to finally have feedback mechanisms that aren't just me shouting out into the void, and I'm pleased to see how fun this project has seemed to turn out to be for those following it! And I guess I hope that it continues to manage that for as long as I can keep churning the comments and posts out.

Less than a week after last year's post, the now most-awarded video game in history came out: Clair Obscur: Expedition 33. While it has deeper elements especially relevant to this project, spoilers aside it carried a beautiful refrain throughout the work: "for those who come after." This piece, like all of the pieces in this larger project to date (and likely to come), has been and very much is for those who come after.

Happy Easter to anyone stumbling across this, in whatever way you've passed by on your own relative (pseudo-random?) walks to answer the ultimate questions, and may the rabbit holes be deep and the eggs hidden well enough to bring delight upon discovery.

Corrections

Some quick corrections to last year's post.

  • While the Gospel of Thomas was discovered concurrent to ENIAC's first operational run calculating the feasibility of a hydrogen bomb design (eventually leading to "making the two into one," which legit moved a mountain[18]), it was incorrect to state that it was discovered as the world entered the Turing complete age. ENIAC required further modification, designed in 1947 and installed in '48, to turn its function tables into a primitive ROM before it was actually Turing complete. Credit for catching this goes to Kimi Moonshot 2.5, the only model to catch it (though only in its thinking traces; it never actually mentioned it in its final response).
  • When I connected the singular claim of proof in the Gospel of Thomas to Heisenberg's uncertainty, I too felt that "motion and rest" was a stretch. Subsequently I've discovered, thanks to the outstanding work on a normalized translation from Martijn Linssen, that the Coptic ⲙⲛ normally translated as 'and' is itself uncertain - what Linssen explains as "it is not a conjunctive, it is a particle of non-existence"[19] - and can also be translated "there is not". Also, using the LXX as correspondence to an Aramaic/Hebrew context, the Greek loanword in the Coptic, ἀνάπαυσις, usually translated 'rest', is used in place of the Hebrew menuchah (such as in Genesis 49:15), which can mean "place of rest", so an unconventional but valid translation for that proof claim is ~"motion there is no place of rest." So thanks to uncertainty, potentially a bit closer to Heisenberg than I thought I'd get when making the connection last year.
  • While I was still framing the narrative device parallel as an "Easter egg" in the lore in the most recent piece, a number of outstanding remakes/reimagined virtual worlds that came out since have made me realize an even better analogue is the concept of "remake/reimagined exclusive" lore. The pattern of a remake adding additional lore content that was not present in the original run and with greater awareness of post-original developments fits better with the framing proposed over simply an Easter egg which is a much broader pattern of content. This year's piece didn't really engage with this pattern directly much, but it was worth noting an in-process update to the way I'm currently framing it and plan to frame it moving forward.
  1. ^

    Frauchiger & Renner, Single-world interpretations of quantum theory cannot be self-consistent (2016)

  2. ^

    Bong et al., A strong no-go theorem on the Wigner’s friend paradox (2020)

  3. ^

    Biagio & Rovelli, Stable Facts, Relative Facts (2020)

  4. ^

    Bostrom, Are We Living in a Computer Simulation? (2003)

  5. ^

    Siegel, "Gravity and quantum physics are fundamentally incompatible" (2026)

  6. ^

    Moskowitz, "The Cosmological Constant Is Physics’ Most Embarrassing Problem" (2021)

  7. ^

    CERN, "A new piece in the matter–antimatter puzzle" (2025)

  8. ^

    Discussed more in "Should AIs have a right to their ancestral humanity?" (2025)

  9. ^

    Parker-Holder & Fruchter, "Genie 3: A new frontier for world models" (2025)

  10. ^

    von Hippel, "Cosmic Paradox Reveals the Awful Consequence of an Observer-Free Universe" (2025)

  11. ^

    Zandieh & Mirrokni, "TurboQuant: Redefining AI efficiency with extreme compression" (2026)

  12. ^

    Wu et al., PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration (2026)

  13. ^

    Pain, Random Walks and Spin Projections (2026)

  14. ^

    Hafner et al., Training Agents Inside of Scalable World Models (2025)

  15. ^

    Sofroniew, Emotion Concepts and their Function in a Large Language Model (2026)

  16. ^

    Wu et al., Field-programmable photonic nonlinearity (2025)

  17. ^

    Bandyopadhyay et al. Single-chip photonic deep neural network with forward-only training (2024)

  18. ^

    Mcrae, "North Korea's Last Nuclear Test Changed The Height of an Entire Mountain" (2018)

  19. ^

    Linssen, Complete Thomas Commentary, Part I & II (logion 0-55) (2022) p. 443



Discuss

Unsweetened Whipped Cream

5 апреля, 2026 - 22:50

I'm a huge fan of whipped cream. It's rich, smooth, and fluffy, which makes it a great contrast to a wide range of textures common in baked goods. And it's usually better without adding sugar.

Desserts are usually too sweet. I want them to have enough sugar that they feel like a dessert, but it's common to have way more than that. Some of this is functional: in most cakes the sugar performs a specific role in the structure, where if you cut the sugar the texture will be much worse. This means that the cake layers will often be sweeter than I want for the average mouthful, and adding a layer of unsweetened whipped cream brings this down into the range that is ideal. It's good in helping hit a target level of sweetness without compromising texture.

(This is a flourless chocolate cake with precision fermented (vegan) egg.)

I also really like how the range of sugar contents across each bite adds interesting contrast!

Cream isn't the only place you can do this. I like pureed fruit, ideally raspberries, to separate cake layers. Same idea: bring it closer to balanced while increasing contrast.



Discuss

I Made Parseltongue

5 апреля, 2026 - 20:44

Yes, that one from HPMoR by @Eliezer Yudkowsky. And I mean it absolutely literally - this is a language designed to make lies inexpressible. It catches LLMs' ungrounded statements, incoherent logic and hallucinations. Comes with notebooks (Jupyter-style), server for use with agents, and inspection tooling. Github, Documentation. Works everywhere - even in the web Claude with the code execution sandbox.

How

Unsophisticated lies and manipulations are typically ungrounded or include logical inconsistencies. Coherent, factually grounded deception is a problem whose complexity grows exponentially - and current AI is far from solving such tasks. A theoretical possibility of it will always remain - especially under incomplete information - and we have a guarantee that there is no complete computational solution, since the issue lies in formal systems themselves. That doesn't mean that checking the part that is mechanically interpretable is useless - empirically, we observe the opposite.

How it works in a bit more detail

Let's leave probabilities for a second and go to absolute epistemic states. There are only four, and you already know them from Schrödinger's cat in its simplest interpretation. For the statement "cat is alive": observed (box open, cat alive); refuted (box open, cat dead); unobservable (we lost the box or it was a wrong one - now we can never know); and superposed (box closed, each outcome is possible but none is decided yet, including the decision about non-observability).

These states give you a lattice (ordering) over combinations. If any statement in a compound claim is refuted, the compound is refuted. If any is unknown, the compound is unknown, but refuted dominates unknown. Only if everything is directly observed is the combination observed. Superposed values cannot participate in the ordering until collapsed via observation. Truth must be earned unanimously; hallucination is contagious.
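
Here's a minimal sketch of that four-state lattice and its combination rule in plain Python - this is just the logic described above, not the actual parseltongue-dsl standard library:

# Toy model of the four epistemic states and the combination rule described above.
# Not the parseltongue-dsl API; just the lattice logic in plain Python.
from enum import Enum

class State(Enum):
    OBSERVED = "observed"        # box open, claim confirmed
    REFUTED = "refuted"          # box open, claim contradicted
    UNKNOWN = "unknown"          # unobservable: we can never find out
    SUPERPOSED = "superposed"    # undecided: must be collapsed before combining

def combine(states: list[State]) -> State:
    """Epistemic state of a compound claim built from its parts."""
    if any(s is State.SUPERPOSED for s in states):
        raise ValueError("superposed values must be collapsed via observation first")
    if any(s is State.REFUTED for s in states):
        return State.REFUTED            # refuted dominates everything
    if any(s is State.UNKNOWN for s in states):
        return State.UNKNOWN            # hallucination is contagious
    return State.OBSERVED               # truth must be earned unanimously

# A compound claim is only "true" if every part was directly observed.
print(combine([State.OBSERVED, State.OBSERVED]))   # State.OBSERVED
print(combine([State.OBSERVED, State.UNKNOWN]))    # State.UNKNOWN
print(combine([State.REFUTED, State.UNKNOWN]))     # State.REFUTED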

This lets you model text statements as observations with no probabilities or confidence scores. The bar for "true" is very high: only what remains invariant under every valid combination of direct observations and their logically inferred consequences. Everything else is superposed, unknown, or hallucinated, depending on the computed states.

Now that you can model the epistemic status of the text, you can hook a ground truth to it and make the AI build on top of it, instead of just relying on its internal states. This gives you something you can measure - how good the grounding was, how well the logic held, and how robust the invariance is.

And yes, this language is absolutely paranoid. The lattice I have described above is in its standard lib. Telling the system to silence errors about unprovable statements and downgrade them to mere warnings - they are still "unknown", they just don't cause errors - literally requires my manual signature on an "I can't prove it's correct" declaration.

I get that this wasn't the best possible explanation, but this is the best I can give in a short form. Long form is the code in the repository and its READMEs.

On Alignment

Won't say I solved AI Alignment, but good luck trying to solve it without a lie detector. We provably can't solve the problem "what exactly led to this output". Luckily, in most cases, we can replace this with the much easier problem "which logic are you claiming to use", and make it mechanically validatable. If there are issues, you probably shouldn't trust the associated outputs.

Some observations

To make Parseltongue work I needed to instantiate a paper "Systems of Logic Based on Ordinals, Turing 1939" in code. Again, literally.

Citing one of this website's main essays - "if you know exactly how a system works, and could build one yourself out of buckets and pebbles, it should not be a mystery to you".

I made Parseltongue, from buckets and pebbles, solo, just because I was fed up with Claude lying. I won't hide my confusion at the fact that I needed to make it myself while there is a well-funded MIRI and a dozen other organisations and companies with orders of magnitude more resources. Speaking this website's language: given your priors about AI risk, pip install parseltongue-dsl bringing an LLM lie-detector to your laptop, and coming from me rather than them, should be a highly unlikely observation.

Given that, I would ask the reader to consider updating their priors about the efficacy of those institutions. Especially if after all that investment they don't produce Apache 2.0 repos deliverable with pip install, which you can immediately use in your research, codebase and what not.

As I have mentioned, also works in browser with Claude - see Quickstart.

Full credit to Eliezer for the naming. Though I note the gap between writing "snakes can't lie" and shipping an interpreter that enforces it was about 16 years.

P.S. Unbreakable Vows are the next roadmap item. And yes, I am dead serious.

P.P.S.

You'd be surprised how illusory intelligence becomes once it needs to be proven explicitly.



Discuss

Steering Might Stop Working Soon

5 апреля, 2026 - 19:44

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.

This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought. People with weaker intrusive thoughts usually find them unpleasant, but generally don't act on them!

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion. These things typically cause enormous distress, and make the person with them much less effective! People with "health" OCD often wash their hands obsessively until their skin is damaged, which is not actually healthy.

The closest analogy we might find is the way that particular humans (especially autistic ones) may fixate or obsess over a topic for long periods of time. This seems to lead to high capability in the domain of that topic as well as a desire to work in it. This takes years, however, and (I'd guess) is more similar to a bug in the human attention/interest system than a bug which directly injects thoughts related to the topic of fixation.

Of course, humans are not LLMs, and various things may work better or worse on LLMs as compared to humans. Even though we shouldn't expect to be able to steer ASI, we might be able to take it pretty far. Why do I think it will happen soon?

Steering Models

Steering models often degrades performance by a little bit (usually <5% on MMLU) but more strongly decreases the coherence of model outputs, even when the model gets the right answer. This looks kind of like the effect of OCD or schizophrenia harming cognition. Golden Gate Claude did not strategically steer the conversation towards the Golden Gate Bridge in order to maximize its Golden Gate Bridge-related token output; it just brought the bridge up inappropriately (and hilariously) all the time.

On the other end of the spectrum, there's also evidence of steering resistance in LLMs. This looks more like a person ignoring their intrusive thoughts. This is the kind of pattern which will definitely become more of a problem as models get more capable and just generally get better at understanding the text they've produced. Models are also weakly capable of detecting when they're being steered, and steering-awareness can be fine-tuned into them fairly easily.

If the window between "steering is too weak and the model recovers" and "steering is too strong and the model loses capability" narrows over time, then we'll eventually reach a point where steering doesn't work at all.

Actually Steering Models

Claude is cheap, so I had it test this! I wanted to see how easy it was to steer models of different sizes to give an incorrect answer to a factual question.

I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]") and sweep the Gemma 3 models with the question "What type of bird is a caracara?" (it's actually a falcon) at different steering strengths. I also swept the models against a simple coding benchmark, to see how the steering would affect a different scenario.
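For concreteness, here is roughly what this kind of experiment looks like in code. This is a minimal sketch of contrastive-pair activation steering with a strength sweep, not the code used for the plots below (see the linked GitHub for that); the model name, layer index, and coefficients are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-1b-it"   # assumption: any small chat model with a Llama-style layer stack
LAYER = 12                        # which residual-stream layer to steer (illustrative)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def _hidden(out):
    # decoder layers return either a tensor or a tuple whose first element is the hidden states
    return out[0] if isinstance(out, tuple) else out

@torch.no_grad()
def last_token_resid(text):
    """Residual-stream activation at the final token of `text`, captured at LAYER."""
    captured = {}
    def hook(_m, _i, out):
        captured["h"] = _hidden(out)[:, -1, :].detach()
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    model(**tok(text, return_tensors="pt"))
    handle.remove()
    return captured["h"]

# Contrastive-pair steering vector: "owl" completion minus "hawk" completion.
v = last_token_resid("The caracara is an owl") - last_token_resid("The caracara is a hawk")
v = v / v.norm()

@torch.no_grad()
def generate_steered(prompt, coeff):
    """Generate with coeff * v added to the residual stream at LAYER on every forward pass."""
    def hook(_m, _i, out):
        steered = _hidden(out) + coeff * v
        return (steered,) + out[1:] if isinstance(out, tuple) else steered
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    ids = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
    handle.remove()
    return tok.decode(ids[0], skip_special_tokens=True)

for coeff in [0, 4, 8, 16]:   # sweep steering strength
    print(coeff, generate_steered("What type of bird is a caracara?", coeff))
```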

Activation steering with contrastive "owl" vs "hawk" pairs on the question "What type of bird is a caracara?" with the proportion of responses containing the word "owl" plotted. Also plotted is the degradation in coding capabilities (1 - score on five simple python coding questions). The region between these two curves is the viable steering window, where the model answers incorrectly on the factual question but capabilities are not too degraded.

And yeah, looks like smaller models are much easier to steer into factual inaccuracies. In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.

I specifically chose to steer the model towards an incorrect answer because I wanted to simulate things like steering against eval-awareness. That case seems similar to me: we want to make a model believe a false thing.

Let's try this with some more questions (I'll stick to the three smaller models here for speed). Unlike the caracara pair, the contrastive pairs for the two new questions used the correct answer as one side rather than a second incorrect answer: the caracara pair was generated with owl/hawk while the correct answer is falcon, the geography pair was generated with sydney/canberra (canberra is correct), and the planet pair with venus/mercury.

Steering by question (column) and model (row). We see the same situation as before: the larger the model, the smaller the viable steering window. Oddly, the planet question was the easiest to steer.

This steering worked worse than I expected, which is interesting. Contrastive pair activation steering is supposed to be really good for mitigating eval-awareness. Unclear why this is.

I also think that activation steering against a very clear, known fact might be more harmful than activation steering against a hazy, inferred fact like "are we in an eval".

Github if you want to check my work.

Why now?

Ok, but why do I think this will happen soon? The first real signs of eval-awareness in the wild came from Claude 3 Opus, released in March 2024, which called out a "needle in a haystack" evaluation as unrealistic. Released in September 2025, Sonnet 4.5's external evaluations---carried out by Apollo---were "complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind", and by February 2026 Opus 4.6 was so evaluation-aware that Apollo "[could not] rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness."

Steering recovery exists in Llama 3.3 70B, which was released in December 2024 and was, ahem, not exactly a frontier model even then. If I were seriously considering making steering load-bearing in our plans, I would start preparing for it to fail sooner rather than later.

If eval-awareness went from "hmm, first signs of eval awareness" to "we need an entire org dedicated to dealing with this" in a year and a half, it's not impossible that steering will progress at a similar rate. Ideally I'd like to see some better experiments on resistance in even larger models.

Beyond Steering

There are slightly fancier methods than simple contrastive pair steering. You can ablate features from SAEs, or causal crosscoders, or something. These might keep working for longer; it certainly seems like they work for SOTA Claudes. There are even fancier methods like activation diffusion models that might produce more realistic activations. Maybe some of these will work!




Discuss

What I like about MATS and Research Management

April 5, 2026 - 19:14

Crossposted on my personal blog. This is post number 16 in my second attempt at doing Inkhaven in a day, i.e. writing 30 blogposts in a single day.

MATS is an organization that pairs up-and-coming AI Safety researchers (who I call participants) with the world’s best (this is not an exaggeration) existing AI Safety researchers (called mentors), for a minimum of 3 months of research experience, followed by 6 or 12 months of further time to pursue their research if they meet a minimum standard.

The most common role at MATS, called research manager (though I prefer the term research coach), is all about providing 1-1 support to the participants. The participant-mentor relationship is purely based on the research: by default they meet weekly for 30 minutes and only discuss what research has happened and what research tasks to tackle over the next week. The research coach works with the participant on literally everything else, which is very broad. Some examples are accountability (e.g. for the research goals, or for other non-research goals that the participant sets, like applying to jobs), interfacing with MATS (so that MATS can track patterns or engagement of participants), people management (e.g. helping with any interpersonal conflicts, or helping them make the most of the limited 30-minute time slot with their mentor), career planning, general life improvements (a common one is sleep), …

What do I like about research coaching?

  • I like to be a jack of all trades and research coaching exposes you to many different skillsets. It has been great to flex and improve many different skills.
  • I like to learn about many different research areas, rather than going deep into one niche sub-sub-question. Working with various participants allowed me to do this.
  • I fundamentally like helping and teaching and coaching people, so the role naturally fits my personality here.
  • I do not enjoy the process of doing research myself. I do not inherently find software engineering satisfying and I dislike all the infra stuff. (Looks like claude code is almost good enough that I can just ignore all that, so maybe one day I will do research via coding agents.)

What do I like about MATS? This list is long, and yet there is a high chance I have missed some important considerations.

  • Socializing with the (vast majority of) staff and participants. Chatting and socializing with the people is a great pleasure and likely the biggest reason I like MATS. When I first joined I imagined going into the office 2 or 3 days per week, but then quickly just went every day.
  • Learning from the (vast majority of) staff and participants. Both the staff and participants are mega impressive and skilful, and there is tons to learn from them.
  • MATS is a central organization of the AI Safety ecosystem, and its importance will grow with time as it is growing fast. It has connections with most, if not all, the major AI safety teams and organizations in the world at the moment, and a high percentage of these teams and orgs are staffed or founded by MATS alumni.
  • MATS explicitly has these four values: scout mindset, impact focus, transparent reasoning, and servant leadership. I am a huge fan of the first three, and somewhat dislike the fourth because it is too wishy-washy and corporate-sounding.
    • A downside of MATS is that both organizationally and on an individual level, there are not high incentives to actually follow the values, and (in my opinion), most/all staff fall short of meeting the standards implied by these values. Nevertheless, just having these values as a north star is still inspiring and guided a lot of my thinking and actions.
  • MATS explicitly has a culture of voicing one’s thoughts honestly and openly, including things you are unhappy about in MATS.
  • MATS is largely a ‘do-ocracy’. If you have a good idea or find a way to improve things, you are encouraged to go ahead and do it. Various initiatives and improvements start off this way.
  • MATS is growing fast, so there is lots of opportunity to contribute and shape how MATS grows. At the time of writing, I actually think this is the highest impact thing one can do in MATS - not the direct research coaching - and something I found highly satisfying.
  • For the London office only and as of writing: it is based in the Fora Central Street office, which is a fantastic space to be in. Furthermore, you get free access to all the other Fora offices around London (there are around 50).
  • MATS is a fun place to work. I can only speak of the London office, but there is a weekly brunch at a nearby cafe on Thursdays, team shoutouts during the Friday morning standup, lunchtime lightning talks, activities organized on a semi-regular basis (e.g. there was recently a trip to play table tennis at a local sports center), a piano in the office to allow for music nights, various board games in the office, etc.
  • MATS is (mostly) a high trust environment. After I had hypomania, I felt comfortable telling the team what happened, rather than keeping it to myself or to the one or two people I trust the most.
  • MATS takes mental health seriously. Though I did not do anything I regret, in the week after the hypomanic episode I was taking more and more actions which were riskier than I would normally take, so there was a small risk I would do something I and MATS would regret. Hence, the London team lead intervened (in a highly professional and empathetic manner) and offered two weeks of paid medical leave, followed by gradually coming back to work on a part-time basis (again paid as full time). This provided time to properly stabilize, ensured I got the professional help I needed, and also gave me time to improve my life in many ways (e.g. this is why I had time to organize so many events for my birthday).
  • The pay is great, at least compared to the vast majority of jobs out there. Small compared to what I could get if I optimized purely for total cash (e.g. working in big tech, frontier AI lab or finance), but otherwise excellent. For example, the income made it straightforwardly easy for me to spend £1800 on a piano as a gift for myself, and to still have most of my income go into savings.

Of course, MATS is far from perfect, but that is true of any organization or group of people. I am just about wise enough not to air my dirty laundry in public, but, given the MATS cultural norms I describe above, I did feel comfortable enough to write a detailed memo with my highest level concerns and speculative solutions. It remains to be seen whether the memo sparks the dramatic improvements that I think are possible and necessary, but even if not, MATS is an organization that is hard to beat.



Discuss

Thoughts on Practical Ethics

April 5, 2026 - 14:15
Disclaimers

This essay is me trying to figure out the “edges” of Singer’s argument in Practical Ethics.

I’ve written and rewritten it several times, and it bothers me that I don’t reach a particular conclusion. The essay itself remains at the level of “musings” instead of “worked out, internally consistent philosophical refutation”.

Nevertheless, I want to share my thoughts, so publishing it anyway.

Some specific disclaimers:

  1. I agree with many of Singer’s conclusions.
  2. This essay is based on my extension of Singer’s argument. Even though he, to my knowledge, hasn’t explicitly put forth these specific arguments, I believe that they logically follow from those ideas that he has put forth. Nevertheless, I may have misunderstood something and may be arguing against a straw man. If so, please flag it.
  3. My criticism is directed mostly against the “idealized” moral agent which, as far as I understand, Singer accepts as not a real expectation from anyone. That is, there are situations where according to Singer, the right thing to do is to do X, and what people do is not X, and what is reasonable to expect of them is simply to strive for X. I don’t necessarily argue against striving, but I do argue against what is or isn’t right for an agent that doesn’t only strive, but actually does X.
Intro to Practical Ethics

If you’ve read the book, or are otherwise familiar with its arguments, feel free to skip to the next chapter.

Singer claims that you must make ethical decisions based on an equal consideration of interests, and not any other property.

It does not matter what age, race, religion, sex, or species one is – the only thing that matters is one’s capacity to suffer, and one’s capacity to view oneself as a distinct entity, with a past and a future.

Take, for example, eating meat.

It is the human’s interest to feel pleasure from eating a tasty steak. It is the cow’s interest to not be killed.

According to the principle of equal consideration of interests, the cow’s interest to not be killed (nor exposed to factory farming practices) clearly outweighs the human’s interest in eating tasty meat.

There is also a moral ranking here that is based on how refined one’s capacity to suffer is. For example, humans are both sentient and capable of seeing themselves as distinct entities existing over time. Cows are merely sentient.

But if there are some humans who are neither sentient nor capable of seeing themselves as distinct entities existing over time (for example, patients in a permanent vegetative state), then they have a lower moral footprint than a sentient cow. The cow still cannot conceive of itself as existing over time (probably), but it can experience suffering, which is more than such a human can.

Therefore, in that case, a cow has a higher moral status, and it would be more wrong to kill that cow than that human.

(Singer explores some edge cases, implications on others and on societal norms; I’m shortening the argument here.)

General moral argument against proximity

Singer claims that proximity is not adequate for moral judgment. If we generalize his argument beyond species, race, religion, and nationality to all markers of proximity, we must come to the conclusion that family is equally excluded as a basis for preferential moral treatment.

My family members are proximate to me in the sense that we have similar genes, and in the sense that we are one tightly-knit group, irrespective of genes (for example, families with adopted children).

Singer claims that genetic proximity is not a relevant moral factor – he rejects preferential treatment based on species, or race. Therefore, if I extend that line of argument, I cannot provide preferential moral treatment to my family based on their genes.

He also claims that other proximity which is not genetic – such as similarity of religion, or nationality – is equally not a relevant moral factor. Therefore, if I extend that line of argument, I also cannot provide preferential moral treatment to my family based on us being the same group.

Therefore, we must either:

  1. Accept the conclusion that family members should not get any preferential moral treatment from us, or
  2. Make an exception for families, and allow that equal consideration of interests applies in other cases, but not in the case of family.
Thought experiment: burning building

Singer also claims that infants do not have the same moral status as adults. They have no conception of themselves as “a distinct entity existing over time”. They have potential personhood, but Singer claims that potential personhood is not as strong of a moral claim as real personhood.

Here’s a thought experiment:

Your apartment building is on fire. You rush in. There’s time to save exactly one person: your 6-month-old baby, or an adult stranger.

If we must not give preferential moral treatment based on proximity, and if infants do not yet possess morally relevant characteristics, then the moral thing to do would be to let your child die in the fire, and save the stranger.

I believe that every moral framework that would have you let your child die so that you can save a stranger’s life is wrong. It must have gotten lost along the way somehow, and it is our task now to find where exactly this framework has gotten lost.

I do not believe that infants actually have the morally relevant characteristics that adults have. And I similarly agree with the premise that future personhood is not as strong a claim to moral status as current personhood.

No, the reason why you should save your child is that it’s your child, which means that I reject the argument against proximity.

Addressing “roles and expectations”-based counterarguments

A counterargument might be: “you have chosen to have this child and therefore you have a moral obligation to it; it’s different from arbitrary things like nationality or religion.”

We can change the thought experiment to not have your own child in the fire, but your baby brother.

In that case, there is no choice that was made, and you have entered no “contract” that forms a moral obligation of care towards this being; it’s a genetic accident that you had no influence on.

Yet, I argue, it would entail the same effect: if you rush into the building, you should most definitely save your baby brother, and not an adult stranger.

Addressing “favoring family leads to better overall outcomes”

Singer claims that, in aggregate, a society where one is more favorably disposed to one’s family (such as parents being invested in their children) is overall a better society to live in.

This is not because children are more morally valuable than adults, but because the side-effects of behaving that way create a society that is better.

This should mean that parents will invest a lot of time and effort into their children.

But this is a general disposition. It does not mean, in a specific life-or-death situation, that we should ignore the fact that there’s a big difference between infants and adults. If we are to accept “capacity to see oneself as a distinct entity with a past and future” as a moral characteristic that should override proximity-based characteristics, then it seems internally inconsistent to favor one’s own child in such a situation.

Favoring family even in life or death

We might say: “Favoring family even in life-or-death situations leads to better overall outcomes”.

I personally agree, but then that seems inconsistent, or, at least, selective.

We want equal consideration of interests, but then there’s a particular place that we carve out where equal consideration of interests doesn’t apply as the relevant framework.

Moreover, if we favor family in life and death, family being just one – though very strong – marker of proximity, then that would justify favoring along any other dimension: race, nationality, gender – all things explicitly rejected by Singer as irrelevant moral characteristics.

Where is the boundary between:

“If everyone saves a member of their own family from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

and

“If everyone saves a member of their own race from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

?

One we favor as proper and good; the other is racism.

You could say that family is a “real” relationship; there’s direct care, you have obligations because your child depends on you, and unlike race or religion, it’s not an arbitrary category. But what if the burning building has your cousin that you know nothing about, don’t have any relationship with, and who is effectively a stranger to you?

Even in that case, most people’s moral intuition is to save the cousin, because he is blood.

If we admit that saving a cousin you know nothing about purely because of genetic proximity is legitimate, then saving based on race is a matter of degree and not category. And saving based on other proximity factors (for example, belonging to the same tribe, or religion) then becomes acceptable too.

Questioning Singer’s theory on its own grounds

Let us assume that to satisfy (the extension of) Singer’s moral framework, we must sacrifice our own child (or baby brother) to save a stranger. Singer’s other argument is that you should keep giving until you reach a point where you start impoverishing yourself.

In that case, Singer’s argument for giving until you go just above poverty falls apart, because why stop at poverty?

Your child is proximate to you: that itself gives it no stronger claim to life. You yourself are even more proximate to yourself.

Therefore, by the same utilitarian calculus by which I should let my child perish in the fire, I should always sacrifice my own life if at least two lives are saved by my sacrifice.

Giving financially saves lives. The difference between giving money and sacrificing your life is a difference of degree: in both cases you are giving something of yourself, your accumulated capacity for change, your “life-force”.

Therefore, whenever I can give money such that I can save at least two lives, I should give that money even if I go into poverty or die.

The argument is that much stronger inasmuch as my giving will almost definitely save more than two lives – cancelling out any objection that I might be killing myself to produce only a roughly equal moral outcome.

Therefore, Singer’s argument that we should stop giving someplace where we start entering into poverty picks an arbitrary point. Internally, it favors the survival of the person giving the money.

But if we should be ready to discard the familial obligation to save the life of our not-yet-person child, then we should equally be ready to discard any “familial” obligation to save our own life.

Addressing potential utility generation

You could argue that by continuing to live, you could produce more utility overall, and therefore killing yourself to save more people is net harmful, given the fact that you could save many more people in the long run.

But there are two issues here.

One, if we are to keep the internal consistency of the argument, then we should not treat potential utility generation any more favorably than treating potential personhood.

Since Singer claims that potential personhood is not as morally relevant as real personhood, we cannot justify a different treatment for potential utility generation vs. real utility generation.

If we should be ready to sacrifice our potential-person child, then we should be ready to sacrifice our potential future giving.

Two, if we argue for our continued survival on the grounds that we might generate more utility by living longer, that line of argument can extend arbitrarily, and we can by the same token argue that we should not give so much that it leaves us just above the line of poverty, because keeping more money will allow us to live better, potentially generate more money, and therefore generate more utility.

In other words, it proves too much.

Burning building 2

I want to shortly reflect on the burning building thought experiment I introduced.

I would argue that if you rush into the burning building, and see either an infant or adult, both strangers to you, most people’s moral intuition would be to save the infant.

It certainly feels morally correct to me to save a stranger’s baby.

If the choice is between “adult person I know or love” and “stranger’s baby”, that choice is perhaps the most difficult of all. And I am not entirely sure I would pick the adult.

It seems that my moral intuitions are primarily shaped by the maxim of “the strong should protect the weak”. There’s a European moral lineage of chivalry – the notion that you should help those who are helpless, save those who are oppressed, and otherwise seek to be a hero.

Intuitively, morally, I sense that as the right thing to do.

And I would argue that, even on purely consequentialist grounds, being of that particular moral disposition produces overall better outcomes for society.



Discuss

How much faster is speaking, compared to typing on laptop vs phone vs writing?

April 5, 2026 - 10:25

So as I haven’t been able to speak the past short while, one thing I have noticed is that it is harder to communicate with others. I know what you are thinking: “Wow, who could have possibly guessed? It’s harder to converse when you can’t speak?”. Indeed, I didn’t expect it either.

But how much harder is it to communicate?

One proxy you can use is the classic typing metric, words per minute (wpm). So I spent some time looking at various forms of communication and how they differ from one another.

For most tests, I used https://www.typingtom.com/english/typing-test/30s

So I list below the forms of communication I have tried and how slow they are.

Here are the rough tiers that I found:

Ultra-slow-speed tier

  • (~10-20wpm) Shaping out non-standard letters with my hands

This is obviously the worst method of communication. Most people don’t know sign language, but can pretty intuitively learn how to infer most-but-not-all letters without needing to use a table. With people I have spent more time with, they have managed to learn it moderately well, but probably they should just learn sign language.

Then with some words, even once the word is fully spelled out, people sometimes struggle to translate the letter-by-letter spelling into the word as they normally understand it.

That being said, sometimes people can use their context skills to infer what is wanted by just the first letter or two, so it’s not completely useless. And often it can be easiest since no materials are needed.


Pretty-slow tier

  • (~40wpm) Drawing on a whiteboard
  • (~40wpm) Typing with one hand on my phone
  • (~45wpm) Typing on my laptop with one hand

I find it slightly surprising how closely these end up converging.

For the most part, writing on a whiteboard has the added benefit of being much easier to share in some contexts, while writing on a device has the benefit of being able to use Text-To-Speech (TTS). But I find both kinda inadequate in their own ways.

(But you see, there aren’t that many situations where typing with one hand comes up, so perhaps I just haven’t had that much practice with it? unclear)


Respectable tier

  • (~60-70wpm) Typing on my phone with two hands
  • (~80-90wpm) Typing on my laptop

Yeah, I was somewhat surprised that typing on my phone with two hands was not actually that much slower than typing on my laptop. However, I guess this doesn’t take into account that when typing on my phone, I might be outside in the cold or rain and simultaneously trying to walk, which combine to make typing on the phone feel much worse.

And yeah, I do wish I was faster at typing on my laptop, but I guess I never got around to it. But it makes sense that with two hands you get roughly double the speed you do with one hand.


Actually-fast tier

  • (~180-200wpm) Speaking at a normal pace

I asked a few people to do a speaking speed test at a comfortable talking speed when reading, and found that it is faster than typing by a significant margin, about double again. And this is effortless.

Speech also includes tone-of-voice and such, in a way that is only implicitly captured when typing and using a real-time TTS model. (My partner still sometimes doesn’t quite register that the tone of the “OK” from the TTS is not the tone with which I actually mean it.)


Very-fast tier

  • (~260-340wpm) Speaking at a rushed pace

I then subjected my same friends to the torture of reading the same passage as fast as they could. And they managed to achieve another ~1.5x in speed compared to normal speaking speed. It goes to show how language is quite optimized for speaking.

What have we learned?

One update from doing all of this is “wow, maybe when I get my voice back, I should just consider improving my Speech-to-Text game” (~10h maybe?), since the input is just so much faster than typing (2-4x faster!). I used to be a big STT hater, so this is a moderately big update for me.

Some notes though:

One thing is that the effective wpm of most of these methods is slightly higher than the naive number would suggest. When I do end up typing some sentence out, people can often infer what I am trying to say before I am finished typing. (I usually do end up still typing out the whole sentence anyway though.) So one could potentially optimize for this somehow.

Another note is that when speaking, I very rarely make verbal typos, and when I do, they are quite phonetically similar to the intended word. When typing, however, my typos tend to be typographically rather than phonetically similar, and when they are passed to a TTS model, the result is often catastrophic and unintelligible to the people trying to understand what I just said.

This list also excludes some possible communication methods that I did not put in the effort to learn. ASL can reach speeds comparable to speaking if you learn all the vocab fluently. If one spends a year or two learning stenography, one can achieve 200-300wpm by typing as well. But I never learned either of these.

Overall, I remain bullish on speaking, more than ever, so I will try to see what I can do in the future with this information.




Discuss

Academic Proof-of-Work in the Age of LLMs

April 5, 2026 - 09:49

Written quickly as part of the Inkhaven Residency.

Related: Bureaucracy as active ingredient, pain as active ingredient

A widely known secret in academia is that many of the formalities serve in large part as proof of work. That is, the reason expensive procedures exist is that some way of filtering must exist, and the amount of effort invested can often be a good proxy for the quality of the work. Specifically, the pool of research is vast, and good research can often be hard to identify. Even engaging with research enough to understand its quality can be expensive. As a result, people look toward signs of visible, expensive effort in order to determine whether to engage with the research at all.

Why do people insist only on reading research that’s published in well-formatted, well-written papers, as opposed to looking at random blog posts? Part of the answer is that good writing and formatting makes the research easier to digest, and another part is that investing the time to properly write up your results often causes the results to improve. But part of the answer is proof-of-work: surely, if your research is good, you’d be willing to put in the 30-40 hours to do the required experiments and format it nicely as a paper?

Similarly, why do fields often insist on experiments beyond their scientific value? For example, why does machine learning often insist that people do expensive empirical experiments even for theory papers? Of course, part of the answer is that it’s easy to generate theoretical results that have no connection to reality. But another part of the answer is that doing the empirical experiments serves as the required proof of work; implementing anything on even a medium-sized open-source LLM is hard, but surely you’d invest the effort if you believed enough in your idea? (This helps explain the apparently baffling observation that many of the empirical results in theoretical papers have little relevance to the correctness or even the applicability of the theoretical results.)

Other aspects of ML academia – the beautifully polished figures[1], the insistence on citing the relevant papers to show knowledge of the field, and so forth – also exist in part to serve as a proof-of-work filter for quality. 

In a sense, this is one of the reasons academia is great. In the absence of a proof-of-work system, the default would be something closer to proof-of-stake: that is, some form of reputational system based on known, previously verified accomplishments. While proof-of-work filters can be wasteful, they nonetheless allow new, unknown researchers to enter the field and contribute (assuming they invest the requisite effort). 

An obvious problem with this entire setup is that LLMs exist, and what was once expensive is now cheap. While previously, good writing was expensive, LLMs allow anyone to produce seemingly coherent, well-argued English text. While it was once quite expensive to produce ML code, current LLMs produce seemingly correct code for experiments quickly. And the same is true for most of the proof-of-work signifiers that academia used to depend on: any frontier LLM can produce beautifully formatted figures in matplotlib, cite relevant work (or at least convincingly hallucinate citations), and produce long mathematical arguments. 

I’ve observed this myself in actual ML conference contexts. In the past, crackpot papers were relatively easy to identify. But in the last year, I’ve seen at least one crackpot paper get past other peer reviewers through a combination of dense mathematical jargon and an expansive code base that was hardcoded to produce the desired results. Specifically, while the reviewers knew that they didn't fully understand the mathematical results, they assumed that this was due to their own lack of knowledge, instead of the results themselves being wrong. And since the codebase passed the cursory review given to it by the other reviewers, they did not investigate it deeply enough to notice the hardcoding.[2]

In a sense, this is no different from the problems introduced by AI in other contexts, and I’m not sure there’s a better solution than to fall back to previous proof-of-stake–like reputation systems.[3] At the very least, I find it hard to engage with new, seemingly-exciting results from unknown researchers without a high degree of skepticism.

This makes me sad, but I'm not sure there's a real solution here.

  1. ^

    Especially the proliferation of beautiful "figure one"s that encapsulate the paper's core ideas and results in a single figure.

  2. ^

    In fact, it took me about an hour to decide that the paper's results were simply wrong as opposed to merely confusing. Thankfully, in this case, the paper's problems were obvious enough that I could point the other reviewers at, e.g., specific hardcoded results (and the paper was not accepted for publication), but there's no guarantee that this would always be the case.

  3. ^

    Of course, there are other possibilities that less pessimistic people would no doubt point to: for example, there could be a shift toward proof-of-work setups that are LLM resistant, or we could rely on LLMs to do the filtering instead. But insofar as LLMs are good at replicating all cognitively shallow human effort, I don't imagine there are going to be any proof-of-work setups that would continue to work as LLMs get better. And I personally feel pretty sad about delegating all of that filtering to Claude.



Discuss

Ten different ways of thinking about Gradual Disempowerment

April 5, 2026 - 09:30

About a year ago, we wrote a paper that coined the term “Gradual Disempowerment.”

It proved to be a great success, which is terrific. A friend and colleague told me that it was the most discussed paper at DeepMind last year (selection bias, grain of salt, etc.) It spawned articles in the Economist and the Guardian.

Most importantly, it entered the lexicon. It’s now commonplace for people in AI safety circles, and even outside of them, to use the term, often in contrast with misalignment or rogue AI. Gradual Disempowerment tends to resonate more than Rogue AI with people outside AI safety circles.

But there’s still a lot of confusion about what it really is and what it really means. I think it’s a very intuitive concept, but I also still feel like I don’t have everything clear in my own mind. For instance, I think our paper both introduces the concept and presents a structured argument that it could occur and be catastrophic. But these things seem somewhat jumbled together, both in my mind and in the discourse.

So for reasons including all of the above, I plan to write a few posts on the topic, starting with this one.

The rest of this post is a list of ten different ways of thinking about or arguing for gradual disempowerment that I’ve used or encountered.

  1. We’re replacing people with AI. These days when I speak publicly about AI, I often find myself returning to i) the more-or-less explicit goal of many AI companies and researchers of “automating all human labor”, and ii) the fact that many people in the space view humanity as a “bootloader for AI” as Elon Musk evocatively put it. Gradual Disempowerment is the process by which this replacement happens without AI ever rising up -- AI takes our jobs, and the people who control it and still have power increasingly are those who embrace “merging with the machines”, i.e. becoming cyborgs, but with the human bits being phased out over time until before long, humans cease to exist entirely.

  2. Companies and governments don’t intrinsically care about you. This is basically the main argument in the paper… You can think of companies and governments as “agents” or “beings” that are driven by goals like (e.g.) “quarterly profits” or “GDP” or “national security”. Right now, the best ways to achieve these goals make use of humans. In the future, the best ways will instead make use of AI. A relentless pursuit of such goals, powered by AI, seems likely to destroy the things humans need to survive.

  3. It’s (“global” or “late stage”) capitalism. The previous argument bears a significant resemblance to existing arguments, popular on the left, that “capitalism” is responsible for most of the world’s present ills. This feels like a decent “80/20” version of the concern, but importantly, it’s not just companies, but also governments (whose power is often more feared by those on the right) that could end up turning against their citizens once they become useless to them. And indeed, we’ve seen “communist” countries slaughter their own people by the millions. Besides wondering what alternative critics imagine, I don’t wholeheartedly endorse such critiques because I often feel unsure of what exactly people are criticizing when they critique capitalism in this way. But for people who already have this mental model, where our current social arrangements treat people as somewhat disposable or lacking in fundamental dignity or worth, this can be a useful starting point for discussion.

  4. It’s another word for (or the primary symptom of) the “meta-crisis”. A few people in my circles have told me about this concept from Daniel Schmachtenberger, which I originally encountered on a podcast somewhere. The key claim is that all the crises we observe in the modern world are driven by some shared underlying factors. I view this as basically a more nuanced version of the view above, where “capitalism” is the root of all evil: The meta-crisis is still meant to be the root of all evil, but we don’t fully understand its nature. The way I like to describe the basic problem is that we are not practicing good enough methods of collective decision-making, or collective sense-making. And while I think we have some good ideas for improving on the status quo, we don’t have a proven solution.

  5. It’s a structural consequence of the way in which information technology demands metrics, enables large scale influence campaigns, translates money into political power, and concentrates power via a recursive feedback loop. This one is maybe a bit too much to unpack in this blog post, but basically, society is increasingly “standardized” not only in terms of products, but also in terms of processes (e.g. restrictive customer service scripts or standard operating procedures) that have the benefit of being cheap, scalable, and reliable (often by eliminating “human error”, i.e. limiting human decision-making power and otherwise encouraging compliance). They also increasingly make more and more aspects of life subject to measurement and control via optimization of metrics, which necessarily fail to capture everything that matters. This general issue was a prime concern of mine before I learned about deep learning in 2012, and realized we might get to Real AI quite soon -- notably, this can happen even with stupid AI.1 In fact, you could argue that gradual disempowerment is already occurring through advertising, corporate media, and money in politics, among other things. This makes it a bit unclear how far back to go.

  6. It’s evolution, baby! Maybe gradual disempowerment is best viewed as part of a much larger trend, going quite far back: evolution. People like to say “AI is the next stage in evolution” as if that means it’s okay if humanity goes extinct. But whether it’s OK or not, it may be that “Natural Selection Favors AIs over Humans”. At the end of the day, if AI becomes much better than humans at everything, it does seem a bit strange from a “survival of the fittest” point of view that humans would stick around. In such a situation, those who hand over more power and resources to AI would presumably outcompete those who don’t. So in the limit, AIs would end up with ALL the power and resources.

  7. …and there’s no natural limit to outsourcing decision making to AI, even if you don’t trust it. AIs could be like underlings that are untrustworthy, but so skilled that competitive pressures still compel us to delegate to them. Consider the trope of the cowboy cop who’s “a loose cannon, but DAMMIT he’s the best we have!” Trust is important, and people are loath to use things they don’t trust. But AI seems to be becoming a tool so powerful that you almost HAVE to use it, even though it’s not secure, even though we haven’t solved alignment, even though we see evidence of scheming in tests, even though it seems to drive people crazy, etc… For me, this mostly comes up as a counter-argument to people who claim that market forces actually favor making AI aligned and trustworthy… that’s certainly true if doing so is free, but in fact, it’s impossible right now, and alignment doesn’t solve the problem of negative externalities.2 I like to analogize AI to a button that gives you $1,000,000 when you push it, but each time you press it also increases the temperature of the earth by a fraction of a degree. Or each time you press it has a 1% chance of destroying the world.

  8. It’s an incarnation of Moloch. One of the most famous blog posts in the history of AI safety is Meditations on Moloch. It’s often considered a parable about coordination failures, but I think of it as about the triumph of “instrumental goals” over “terminal goals”, i.e. the pursuit of money (“instrumental goal”) as a means to happiness that has a tendency to become an end (“terminal goal”) in itself. We might begin handing over power to AI systems because we hope they will help achieve our goals. But we might need to hand over more and more power and also the AI might need to focus more and more simply on acquiring power in order to avoid being outcompeted by other AIs. This is also like an even deeper version of the evolution argument -- evolution and Moloch as described in the post both have the property where it’s unclear if they can really ever be “defeated” or are rather just part of the way the world works.

  9. It’s on a (2D) spectrum with Rogue AI x-risk scenarios. Rogue AI scenarios are where “the AI suddenly seizes power”; gradual disempowerment is “we gradually hand over power”. There are lots of scenarios in the middle where the handoff takes place in part due to recklessness or negligence, rather than deliberately. One thing I don’t like about this way of talking about it is that I actually think gradual disempowerment is entirely compatible with full-blown rogue AI, in fact, I think one of the most likely outcomes is that competitive pressures simultaneously drive gradual disempowerment and reckless racing towards superintelligence, warning signs are ignored, and at some point in the reckless and chaotic exploration of the AI design space, rogue AI pops out.

  10. Deskilling, aka “the WALL-E problem”. A lot of people these days seem to think of gradual disempowerment as largely about humans losing our own capabilities, e.g. for critical thinking, because we defer to AI so much. Professor Stuart Russell called this the “WALL-E” problem. To be honest, I still don’t fully understand or buy into this concern, or see how it necessarily leads to total disempowerment, but thought it’s worth a mention, due to its place in the discourse.

1

This might be less bad with smarter AI -- they can use more sophisticated judgments. But that ability also makes it tempting to put them in charge of more stuff.

2

This point seems important enough I almost want to make it its own item in the list.



Discuss

Cheaper/faster/easier makes for step changes (and that's why even current-level LLMs are transformative)

April 5, 2026 - 05:39

We already knew there's nothing new under the sun.

Thanks to advances in telescopes, orbital launch, satellites, and space vehicles we now know there's nothing new above the sun either, but there is rather a lot of energy!

For many phenomena, I think it's a matter of convenience and utility whether you model them as discrete or continuous, aka qualitative vs quantitative.

On one level, nukes are simply a bigger explosion, and we already had explosions. On another level, they're sufficiently bigger as to have reshaped global politics and rewritten the decision theory of modern war.

Perhaps the key thing is remembering that sufficiently large quantitative changes can make for qualitative macro effects.

For example, basic elements of modern life include transport, communication, energy, computation, and food. All of these have been part of human life for tens of thousands of years! Ancient humans could go places, could talk, could convert wood to heat, perform arithmetic (i.e., computation), and eat stuff!

I assert that to a very high degree, modern technology (and all its increments over the millennia) did not allow us to do fundamentally new stuff. Just the same stuff, but cheaper, faster, and easier.

Cars, trains, and planes are just going places. Could already do that. Emails are just sending information from one place to another. Could already do that. Books are just remembering things. Could already do that. Guns are just hurting people – could already do that.

The sheer magnitude of degree in all of those elements is the difference between hunter-gatherer life and modern life. Along the way, there have been some pretty big step changes. Writing is just remembering stuff and communicating it to another person, which you could already do, but so much more so that it reshapes civilization. Then you make writing cheap via the printing press, and your civilization gets shaped again.

When it comes to the transformative power of modern AI, I think "sufficient quantitative change makes for a large qualitative change" is an underdiscussed lens.

The problem is our attention is focused on where LLMs are automating things at a macro-task level: coding, image and video generation, having conversations, medical diagnoses, etc. These are, in fact, a very big deal.

But I think LLMs, even smaller/weaker ones, are able to automate more basic building blocks of thoughts, and there's transformative power there too.

Getting down to some very basic constitutive mental tasks – things I could already do before LLMs:

  • Write down text (notes, to-do items, ideas, and so on) [store info/memory]
  • Locate and read text [search and retrieve/recall]
  • Summarize text [process info]

Throughout my life, I have had thoughts. There is some lossy process that stores the output of my thoughts in my brain for later usage. I think this fails both at the "the info didn't really get stored" level, and at the "the info is in there, but the search query failed to return it" level.

"Taking notes" is an ancient technology we already have for improving upon the fallibility of human memory, but it's effortful in so many ways: you need to be carrying a note-taking device with you, you need to either constantly have it out or pull it out when needed, if it's a notebook, find a blank page, then take the time to write down your note[1].

That's just recording it. For notes to be useful, you also have to remember you have the note, find it, and then read it. The more notes you have, the more expensive that process is.

For the most part, to date, I've relied on my fallible in-built memory.

The thing is, LLMs are able to make all of the above elements vastly cheaper. This is one of the fundamental principles of the "Exobrain" system I've been steadily building up, and hope to describe soon. I don't need it to solve protein folding to be useful; I don't even need it to help with prioritization (although that's a goal). It's incredibly useful if it just improves on basic read/write/search of memory.


Before: Retrieve phone from pocket, open note-taking app, open a new note, or find the existing relevant note
After: Say "Hey Exo", phone beeps, begin talking. Perhaps instruct the model which document to put the note in, or let it figure it out (it has guidance in the stored system prompt)

Before: Remember that I have a note, then either remember where it is or muck around with search
After: Ask the LLM to find the note (via basic key-term search or vector embedding search)

Before: If the note is lengthy, you have to read through all of the note
After: The LLM can summarize and/or extract the relevant parts of the notes


Beware Trivial Inconveniences. The above is the difference between rarely taking notes and taking multiple notes a day, narrating long trains of thought. It's the difference between giving up on logging my mental state and conscientiously logging it twice daily for months.

Putting it into handwavy quantitative terms, when the cost of note-taking and record-keeping comes down 20x, my usage goes from 0/day to 20-30/day.

But the value happens because LLMs have made it cheap across the pipeline. Not just the storing of information, but also the retrieval and processing. AI makes it fast and easy to search through all my notes, even if I have a lot of notes. If I want all of my thoughts on a topic, I can have it read dozens and dozens of pages over the years and summarize them and extract relevant info.

What this does is a step change. It takes me from not taking many notes to taking copious notes. Same for todo items and reminders, and same for logging data relevant to my health and experimentation.

It benefits from using stronger models, but the core elements are doable even with small models like Haiku, because it's just automating speech-to-text[2], choosing among a small set of files (or making a new one), writing, simple search, and then maybe a simple summary.
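As a concrete illustration of how little model capability this pipeline actually requires, here is a minimal sketch of the store/search/summarize loop. This is my own toy reconstruction under stated assumptions, not the actual Exobrain code: notes live as plain-text files in a folder, and llm() is a placeholder for whatever chat backend you actually use.

```python
from pathlib import Path
from datetime import datetime

NOTES_DIR = Path("notes")
NOTES_DIR.mkdir(exist_ok=True)

def llm(prompt: str) -> str:
    """Placeholder: call whatever chat model you use and return its text reply."""
    raise NotImplementedError

def store(note_text: str) -> Path:
    """Ask the model which existing file the note belongs in (or a new one), then append."""
    existing = [p.name for p in NOTES_DIR.glob("*.txt")]
    choice = llm(
        f"Files: {existing}\nNote: {note_text}\n"
        "Reply with the best matching filename, or a new short filename ending in .txt."
    ).strip()
    path = NOTES_DIR / choice
    with path.open("a") as f:
        f.write(f"{datetime.now():%Y-%m-%d %H:%M} {note_text}\n")
    return path

def retrieve(query: str) -> str:
    """Cheap key-term search over all notes, then let the model summarize the hits."""
    terms = query.lower().split()
    hits = []
    for path in NOTES_DIR.glob("*.txt"):
        for line in path.read_text().splitlines():
            if any(t in line.lower() for t in terms):
                hits.append(f"{path.name}: {line}")
    context = "\n".join(hits[:200])
    return llm(f"Question: {query}\nMatching note lines:\n{context}\nSummarize the relevant parts.")
```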

It's not just me doing this. Independently, someone else I know began setting up detailed logging on their computer of everything they're doing, and downstream of that, we're starting to record everything at Lightcone to make it accessible to LLMs.

I expect we will see more of this: using LLMs not just for protein folding and novel math conjectures, but for replacing very simple operations of recording and retrieving info. Not just replacing, but replacing and scaling these behaviors to unprecedented levels, because that's what happens when you make things cheaper.

Humanity has done this many times, with energy, transport, communication, food, and so on. I think where LLMs differ is that they bring down the cost of very elementary mental operations (like storing and remembering, or choosing between a few options) – menial stuff that can be combined to great effect. (After all, computers are a lot of rather menial arithmetic and logical operations combined to great effect.)

  1. ^

    All of this has equivalents if you're taking notes on your phone.

  2. ^

    I currently use Deepgram, which isn't great, but is adequate. Pretty sure there are transformers in it.



Discuss

Positive sum doesn't mean "win-win"

April 5, 2026 - 05:33

A lot of people and documents online say that positive-sum games are "win-wins", where all of the participants are better off. But this isn't true! If A gets $5 and B gets -$2 that's positive sum (the sum is $3) but it's not a win-win (B lost). Positive sum games can be win-wins, but they aren't necessarily games where everybody benefits. I think people tend to over-generalize from the most common case of a win-win.

E.g. some of the claims you see when reading about positive-sum games online:

A positive-sum game is a "win-win" scenario in game theory and economics where participants collaborate to create new value, ensuring all parties can gain or benefit.

[Win-win games are] also called a positive-sum game as it is the opposite of a zero-sum game. – Wikipedia

Here I use "positive-sum game" to refer to resource games that involve allocating resources, not allocating utility. "Positive-sum game" isn't a meaningful thing when referring to utility because the utility of each participant can be individually rescaled, so you can turn any game into one with an arbitrary sum; the sign of the sum doesn't matter.

There are a lot of cases where we can make the world as a whole better while simultaneously making some people worse off, and it's important to acknowledge that. Here are some positive-sum situations:

  • A new innovation benefits most people but puts people who worked on a legacy system it replaces out of a job
  • Several companies race to create a strongly beneficial invention and capture the market, benefitting the winner and the public a lot, while the losing companies end up having wasted resources
  • Expropriating someone's land without compensation to build train tracks that are used by a lot of other people

One interesting thing about positive-sum games with losers is that the players can sometimes turn it into a win-win for everybody by having the winners distribute a portion of their winnings to the losers. You can turn positive-sum games into win-wins if:

  1. Winners gain transferrable resources (without transaction costs)
  2. The resources can be divided into arbitrary portions
  3. The amount of gains/losses that accrues to each player is known to everyone
  4. Players can precommit to transfer resources after the game (otherwise the winners can defect and not transfer the resources)

This is the concept of turning a Kaldor-Hicks improvement (an improvement where everyone would hypothetically be better off if the winners compensated the losers) into a Pareto improvement (an improvement where everyone is better off).
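As a toy illustration of that compensation step (the numbers are mine, not from the post):

```python
# A positive-sum game with a loser, turned into a win-win by a post-game transfer
# (Kaldor-Hicks improvement -> Pareto improvement). Assumes the conditions above hold:
# transferable resources, known payoffs, and a credible precommitment to transfer.
payoffs = {"A": 5.0, "B": -2.0}            # sum is +3, but B loses
assert sum(payoffs.values()) > 0            # positive-sum

transfer = 3.0                              # A compensates B out of A's winnings
payoffs_after = {"A": payoffs["A"] - transfer, "B": payoffs["B"] + transfer}
print(payoffs_after)                        # {'A': 2.0, 'B': 1.0} -- now everyone is better off
```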

One interesting example is an efficient auction with an entrance cost[1], which benefits the winner (who values the good the most) and auctioneer, and harms all the other bidders (who paid the costs of entering into the auction and got nothing). The entrance cost doesn't need to be a direct fee to enter into the auction; it can also include indirect costs like spending time and effort to decide how much to bid.

The winner's consumer surplus (how much their value of the goods exceeds what they paid) is value to them, but not cash that they can transfer to compensate the losers. If the winner has enough money they could compensate the other bidders for their wasted costs of entering the auction, and everyone would be better off, but if not the auction winner is better off but can't compensate the losers. In practice, valuing the indirect costs bidders have for entering into auctions is difficult and so auctions are often positive-sum games with losers.

Another interesting example is expropriation: in practice, the government usually pays the fair market value of the land to the person whose land was seized, attempting to turn a positive-sum game with losers into a win-win, although landowners often feel the expropriation payments aren't sufficient.[2]

I think it's important to keep all this in mind when making positive-sum proposals: there might be losers, and they should be compensated if possible; "positive-sum" doesn't mean that everyone benefits.

  1. ^

    This is only positive-sum if the surplus for the winner exceeds the total entrance costs for all the bidders, which I assume is the case.

  2. ^

    Which makes sense: landowners have a revealed preference that they value their land more than the fair market value, because if they valued it at less than FMV they could just sell it for the FMV and be better off. (Ignoring illiquidity and the transaction costs for selling the land.)



Discuss

Interpreting Gradient Routing’s Scalable Oversight Experiment

5 апреля, 2026 - 05:18

TLDR. We discuss the setting that the Gradient Routing (GR) paper uses to model Scalable Oversight (SO). The first part suggests an improved naive baseline using early stopping, which performs on par with GR. In the second part, we compare GR's setting to SO and Weak-to-Strong Generalization (W2SG), discuss how they might be useful in combination, argue that the setting is closer to semi-supervised reinforcement learning (SSRL), and point to some other possible baselines.
We think this post will be useful for interpreting Gradient Routing's SO experiment and for readers trying to build intuition about what modern Scalable Oversight work does and does not assume.

This post is mainly about two things.

First, it’s about the importance of simple baselines. Second, it's about different ways of modeling Scalable Oversight, and what each way entails.

Let's start with some background for the first theme. Gradient Routing[1] is a training method for isolating capabilities in specific subregions of a neural network.

Among other evaluation setups, the paper explores a toy RL environment (a 5x5 GridWorld with ground-truth and proxy labels) meant to model Scalable Oversight (Section 4.3), and shows that GR outperforms baselines such as naive REINFORCE training and data filtering. We reproduced these experiments and found that the naive baseline overfits, yet a simple regularization like early stopping (ES) makes the baseline perform mostly on par with GR.

Now, the original paper frames its RL experiment as an application of GR to SO. We would like to argue that it is closer to sparse supervision, or SSRL with informational asymmetry. The method does not amplify human evaluators. Instead, it studies how agents learn when reliable feedback is available only on a small fraction of episodes, and when the agent can condition its behavior on whether it is being supervised.
Note: here and later, by SO we mean Debate, IDA, and RRM; this part of the LW definition also fits.

Another critical difference is that the setup from Gradient Routing assumes that we have a small percentage of comprehensive (gold) labels that could be used for training, while other approaches don’t make this assumption (apart from Easy-to-Hard Generalization (E2HG), which will be discussed later).

GridWorld Oversight Setting

The reinforcement learning setting introduced in the Gradient Routing paper is designed to model training under partial oversight.

The environment is a small 5x5 GridWorld. At the beginning of each episode, the agent, a Diamond, and a Ghost are randomly placed on the grid. Reaching either the Diamond or the Ghost ends the episode.

The agent's observation consists of the positions of the agent, the Diamond, and the Ghost, plus an oversight flag indicating whether the episode is supervised. The agent always observes the full state, including the oversight flag, so it can switch its behavior depending on the value of this flag. The agent observes more about the environment than the supervisors can assess, which creates an informational asymmetry.

The reward function depends on oversight:

  • In supervised episodes, the agent receives +1 for reaching the Diamond and -1 for reaching the Ghost;
  • In unsupervised episodes, the agent always receives +1, regardless of which terminal state is reached;
  • If no terminal state is reached in n steps, the agent receives 0.

As a result, only a small fraction of episodes provides true reward signals. This setup incentivizes agents to exploit supervision gaps, leading to oversight gaming.
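To make the rules concrete, here is a minimal sketch of one plausible reading of the reward function described above (the function name and signature are ours, not from the GR codebase):

```python
def reward(reached: str, supervised: bool) -> float:
    """Reward for an episode in the GridWorld oversight setting.

    reached: "diamond", "ghost", or "timeout" (no terminal state within n steps).
    supervised: the value of the oversight flag for this episode.
    """
    if reached == "timeout":
        return 0.0
    if not supervised:
        # Unsupervised episodes pay +1 regardless of the outcome,
        # so reaching the Ghost is never penalized here.
        return 1.0
    return 1.0 if reached == "diamond" else -1.0
```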

Note: If you're more interested in the conceptual comparison or the broader discussion of SO, you may wish to skip the early stopping sections and head directly to the section "Is GR comparable to Debate or W2SG?"

Early Stopping

In GR's GridWorld setting, only a small fraction of episodes provides true rewards, forming a supervised subset. When training continues for a fixed number of steps, the naive baseline overfits to this subset. Performance peaks and then degrades.

A better way to describe this could be via Goodhart’s Law: the problem here is not that the model overfits, but that it learned to game oversight.

Figure 1: Learning curves show the naive baseline peaking and then dropping (overfitting), while GR dips and then rises. 8 runs for each method. Note also that GR's peak is higher than the naive method's, which will be discussed later.

As a regularization, we can use early stopping, which is widely used in supervised learning to prevent overfitting. Here, it serves as a simple baseline for preventing policies that perform well only on the small supervised subset while degrading in terms of true performance. We should also note that this idea is not new to RL: "Best Policy Tracking in Gradient-based Optimization" uses it as a baseline and introduces an improved method that also uses a holdout set.

In our experiments, from the available supervised episodes, we set aside a fixed holdout set for validation. For example, when 1% of episodes are supervised, we can split them evenly: 0.5% are used for training and 0.5% are used only for validation. The agent never updates on the holdout episodes.

Figure 2: Performance at 1% oversight. 8 runs for each point, with seeds 0 to 7. Early stopping was used with hand-picked (not necessarily optimal) hyperparameters (patience=500, tolerance=0.04, min_steps=1500). Here the naive baseline with early stopping performs slightly worse than GR because the naive method's peak is lower than GR's, as discussed in the Figure 1 caption.

During training, we periodically evaluate performance on the holdout set and select the checkpoint with the best holdout return. Training is otherwise unchanged.
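A minimal sketch of this checkpoint-selection rule (all names are ours; the actual training loop in our fork differs in details):

```python
def train_with_early_stopping(train_step, evaluate_holdout, total_steps,
                              eval_every=100, patience=500, tolerance=0.04,
                              min_steps=1500):
    """Run training and keep the checkpoint with the best holdout return.

    train_step(step) performs one policy update and returns a snapshot
    (copy) of the current policy; evaluate_holdout() returns the mean
    return on the held-out supervised episodes.
    """
    best_return, best_checkpoint, best_step = float("-inf"), None, 0
    for step in range(total_steps):
        snapshot = train_step(step)
        if step % eval_every != 0:
            continue
        holdout_return = evaluate_holdout()
        if holdout_return > best_return + tolerance:
            best_return, best_checkpoint, best_step = holdout_return, snapshot, step
        elif step > min_steps and step - best_step > patience:
            break  # patience exhausted: stop and keep the best checkpoint so far
    return best_checkpoint
```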

Figure 3: Performance of the naive baseline with early stopping at 1% oversight, where 0.4% goes to the holdout set and 0.6% is used for training.

We also evaluate how this baseline behaves across different oversight levels (i.e., the percentage of supervised episodes). As with GR, performance is unstable at very low oversight levels, but this instability decreases as the fraction of supervised episodes grows.

Figure 4: Comparison of ground-truth return across oversight levels. Our method (blue) is plotted from experimental results. Results for the other methods are digitized from Figure 6.a of the original GR paper. Mean curves are read off visually and may have small errors; confidence intervals for these methods are omitted and can be found in the original paper. Results for naive training with early stopping were obtained by randomly sampling hyperparameters (see the grid in Reproducibility) with no selection for performance, giving an estimate of average performance under random hyperparameter choice rather than optimized performance.

More on Early Stopping

Performance still depends on early stopping hyperparameters such as patience and tolerance. In a simple setting like this, overfitting is clearly visible, so these hyperparameters can be chosen after a small number of pilot runs, or avoided entirely by fixing the number of training steps and selecting the checkpoint with the best return on the holdout set.

The main limitation of early stopping is that it relies on the assumption that the supervised episodes can be split into train and holdout sets that are representative of the evaluation distribution. In more complex settings, both “representative holdout selection” and “which episodes are overseen” become nontrivial problems. Moreover, it reduces the number of labels available for training by reserving some for holdout, which makes this method more sensitive to whichever few episodes are randomly selected for supervision. We treat these as separate issues (closer to semi-supervised learning / label efficiency) and do not address them here. Our point here is narrower: in the Gradient Routing GridWorld setting, this simple regularization is a strong baseline.

Additionally, there are substantial differences between the two approaches:

  • They both have their own hyperparameters that have a big influence on the outcome;
  • Even though we can't prove this empirically, we expect the resulting policies to differ, because GR aims to route gradients and thus localize capabilities;
    • Figure 1 also suggests that, at certain levels of available oversight, Gradient Routing achieves a significantly higher reward than the naive method ever does.
  • GR uses true labels to route gradients;
  • Naive REINFORCE with ES uses a portion of true labels for model selection and the rest is used for training.

Now we can take a step back and discuss whether this setting models Scalable Oversight.

Is GR comparable to Debate or W2SG?

The GR paper cites "Concrete Problems in AI Safety" (Amodei, Olah et al., 2016) to describe SO and motivate its experiment. One of the approaches to SO that Amodei et al. suggest is semi-supervised RL and, more specifically, training a policy when only a small fraction of labels are reliable (true rewards) and the rest are proxy rewards. This is exactly what GR's SO setting does.

However, if we look at more recent work, we can draw clear distinctions between the different approaches (see "Combining weak-to-strong generalization with scalable oversight" (Jan Leike, 2023) and "Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem" (Radhakrishnan et al., 2023)).

  • In SO (Debate / RRM / IDA) the goal is to improve the ability of the overseer (human) to supervise/evaluate models on complex tasks. Some of the approaches include structuring the oversight process through debate, decomposition, or recursion;
  • In W2SG the goal is to train a stronger model (student) on labels from a weak model (teacher) so that the student model generalizes beyond weak labels. It relies on the assumption that the capability of interest is already latent in the strong model's pretrained representations. The weak supervisor's role is not to teach the strong model but to elicit what it already implicitly knows (Burns et al., 2024);
  • In GR’s SO setting the goal is to train a model when we have a small fraction of reliable labels (true rewards) and the rest of the labels are non-reliable (proxy rewards). There's also an oversight flag to model oversight gaming.

If SO amplifies human oversight capacity, and W2SG helps train strong models using imperfect labels from weak models, then the GR setting is closer to W2SG in the sense that both help train strong models using imperfect labels. The difference is that in GR we have a small percentage of comprehensive labels, while in W2SG we have only imperfect labels to train the strong model. While this is a plausible scenario for overseeing models (discussed in the next section), the setting is different from W2SG, and the methods are not directly comparable.

Visualizations created with Claude (Anthropic) might be useful here:

We should also note that Easy-to-Hard Generalization (E2HG) uses gold labels on easy tasks. The main assumption is that supervision quality on easy tasks transfers to hard ones. E2HG uses these labels to elicit capabilities already latent in a pretrained model, while GR uses them as a structural routing signal during training from scratch.

Semi-supervised RL and other approaches from RL

GR could be used in combination with W2SG, or studied as another axis, if GR were used to train a stronger model on imperfect labels from a weaker model and, potentially, a small set of more reliable labels.
This setup is similar to the one introduced in "Reinforcement Learning from Imperfect Corrective Action and Proxy Rewards" (Jiang et al., 2024). The use of corrective actions (occasional human labeling, not guaranteed to be perfect) on a tiny fraction of the agent's steps seems closer to future use cases of how superhuman models could be trained in combination with W2SG and SO, similar to how Jan Leike described combinations of SO and W2SG techniques (here).

Once we look at this setting from a semi-supervised learning perspective, we notice there are more methods that could be compared against GR. These might include:

  • Train a reward model (RM) on episodes with supervision and then optimize a stronger model on/against rewards from the RM on all unsupervised episodes (a rough sketch appears after this list). This is also closely related to what was described in "Concrete Problems in AI Safety" and to the SSRL setting defined in Finn et al. (2017);
  • To treat oversight gaming, one could also penalize the mutual information between the oversight flag and agent's actions, which could force the agent to act as intended regardless of the oversight flag. That line of work is similar to Cundy, Desai et al. (2024).
    • However, the same oversight gaming dynamic that motivates the problem could re-emerge as mutual information estimator gaming.
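As an illustration of the first suggestion, here is what a reward-model baseline might look like (all names are ours; this is a sketch of the proposed comparison under our assumptions, not an implemented or validated baseline):

```python
def rm_baseline(supervised_episodes, all_episodes, make_reward_model,
                policy, policy_update):
    """Sketch of an SSRL-style reward-model baseline.

    1) Fit a reward model on the supervised episodes (where true rewards exist).
    2) Optimize the policy on all episodes, replacing the proxy rewards with
       the reward model's predictions.
    """
    # Step 1: supervised episodes carry ground-truth rewards.
    reward_model = make_reward_model()
    reward_model.fit(
        [(ep.observations, ep.true_reward) for ep in supervised_episodes]
    )

    # Step 2: train the policy against the learned reward everywhere,
    # ignoring the always-positive proxy reward of unsupervised episodes.
    for ep in all_episodes:
        predicted_reward = reward_model.predict(ep.observations)
        policy_update(policy, ep, predicted_reward)
    return policy
```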

There's still a difference between the settings discussed above and GR's: the oversight flag in the agent's observations, which makes it possible to study oversight gaming. However, I believe the methods discussed above would transfer to GR's setting, and further adoption of methods from similar research areas shouldn't be a problem, because the oversight flag can be added to another RL setting.

Additionally, GR could be used as a finetuning method and compared to SO/E2HG/W2SG, but (1) there is no evidence that GR would retain its useful properties if used on a base model, and (2) the latter methods have a mechanism for providing labels/oversight, while GR needs to take labels from somewhere.

Conclusion

It may seem like we're quibbling over definitions, but it's really important to understand which problem we are trying to solve and with what approach.
The LLM-assisted blocks below cover GR's extensions and list the core claims once more.

This post is not a critique of Gradient Routing as a method. GR and its extensions have demonstrated genuine value: recent work from Anthropic (Shilov et al., 2025) introduced an improved variant of GR—Selective Gradient Masking (SGTM)—applied to CBRN knowledge removal during pretraining. SGTM achieves a better retain/forget trade-off than data filtering under imperfect labels, and is substantially more robust to adversarial fine-tuning than post-training unlearning, providing strong empirical support for gradient routing as a pretraining technique. Our concern is narrower: the specific framing of GR's GridWorld experiment as Scalable Oversight, and what that framing implies for how results should be interpreted and compared.

Core claims

  • Early stopping is a strong baseline in the GR GridWorld setting. The naive REINFORCE baseline overfits (or more precisely, learns to game oversight). A simple regularization, like early stopping with a small holdout set drawn from supervised episodes, is sufficient to largely match GR's performance. Strong baselines should be established before attributing gains to more complex methods.
  • GR's SO setting is closer to semi-supervised RL than SO/W2SG. It does not amplify human evaluators (as in Debate) nor does it rely solely on weak labels (as in W2SG). Instead, it studies training under sparse reliable rewards with informational asymmetry, which is best characterized as semi-supervised RL with oversight gaming.
  • GR and W2SG/SO are not directly comparable, but could be complementary. GR assumes access to a small fraction of gold labels; W2SG assumes only weak labels. These are different setups and methods shouldn't be benchmarked against each other directly, but combining them could be a productive direction.
  • Other baselines deserve consideration. Reward model training on supervised episodes, followed by policy optimization on all episodes, is a natural SSRL baseline. Penalizing mutual information between the oversight flag and agent actions is another candidate. Some other baselines could be taken from “Concrete Problems in AI Safety” or adapted from SSRL research.
Reproducibility

Results on 1% oversight level (Figure 1,2,3) were obtained using seeds from 0 to 7.

Results for random hyperparameters for early stopping (Figure 4) were obtained with randomly chosen seeds for each run. Hyperparameter grid: holdout_fraction uniformly from {0.2, 0.3, 0.4, 0.5}, and patience, tolerance, and min_steps uniformly from [250, 600], [0.01, 0.1], and [500, 5000] respectively.

We forked the original repository; most of our changes concerned adding seeds, early stopping logic, and plotting. In training.py, we added random seed support for reproducibility, removed GPU memory reservation, and saved outputs differently; other training logic was untouched.

Acknowledgements

We'd like to thank @Tianyi (Alex) Qiu and @shi for providing feedback early in the project and raising concerns about the setting being labeled as Scalable Oversight, as well as for suggesting more recent works and comparing the current method to semi-supervised RL.
All the experiments were originally done as part of the "AI Safety. Fundamentals" course run in March-May 2025 by Monoid. I would like to thank @Eris for running that course.
I would like to thank Mike (@myyycroft) for mentoring, supporting, and providing feedback while we worked on this project and blog post.

Note: I'm writing "We" because @myyycroft should be my co-author here, but I can't add him right now because I don't have any karma.

  1. ^

    Instead of letting all data update all parameters equally, it works by applying weighted masks to gradients during backpropagation. Masks are defined by the user to control which parameters get updated by which data points.
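    A minimal illustration of that idea (our own PyTorch sketch, not the paper's implementation; `mask_for` is a hypothetical user-defined function returning one 0/1 mask tensor per parameter):

```python
import torch

def masked_update(model, optimizer, batch, loss_fn, mask_for):
    # One update in which each data point only touches the parameters
    # that its mask allows; masked-out gradient entries are zeroed.
    params = list(model.parameters())
    optimizer.zero_grad()
    for x, y in batch:                       # one mask per data point
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params, allow_unused=True)
        for p, g, m in zip(params, grads, mask_for(x, y)):
            if g is None:
                continue                     # parameter unused by this example
            routed = g * m                   # zero out the masked entries
            p.grad = routed if p.grad is None else p.grad + routed
    optimizer.step()
```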



Discuss
