Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 6 минут 5 секунд назад

A few quick thoughts on measuring disempowerment

12 минут 13 секунд назад
Published on December 8, 2025 8:03 PM GMT

People want to measure and track gradual disempowerment.  One issue with a lot of the proposals I've seen is that they don't distinguish between empowering and disempowering uses of AI.  If everyone is using AI to write all of their code, that doesn't necessarily mean they are disempowered (in an important sense).  And many people will look at this as a good thing -- the AI is doing so much valuable work for us!

It generally seems hard to find metrics for AI adoption that clearly track disempowerment; I think we may need to work a bit harder to interpret them.  One idea is to augment such metrics with other sources of evidence, e.g. social science studies, such as interviews, of people using the AI in that way/sector/application/etc.

We can definitely try using formal notions of empowerment/POWER (cf https://arxiv.org/abs/1912.01683).  Note that these notions are not necessarily appropriate as an optimization target for an AI agent.  If an AI hands you a remote control to the universe but doesn't tell you what the buttons do, you aren't particularly empowered.

People could also be considered more disempowered if:

  • They are less able to predict important attributes of a system's behavior (cf "simulatability" in interpretability research)
    • Example: An artist is empowered if they can realize a particular vision for a piece, whereas generating art by prompting current AIs is much more like a process of curation.
    • This is similar to empowerment but more focused on prediction rather than control. 
  • They self-report it.
  • They are "rubber-stamping" their approval of AI decisions

    • This can be revealed by red-teaming the human overseer, i.e. sending them examples they should flag and seeeing if they fail to flag them.

     



Discuss

How Stealth Works

29 минут 38 секунд назад
Published on December 8, 2025 7:46 PM GMT

Stealth technology is cool. It’s what gave the US domination over the skies during the latter half of the Cold War, and the biggest component of the US’s information dominance in both war and peace, at least prior to the rise of global internet connectivity and cybersecurity. Yet the core idea is almost embarrassingly simple.

So how does stealth work?

Photo by Steve Harvey on Unsplash

When we talk about stealth, we’re usually talking about evading radar. How does radar work?

​​Radar antennas emit radio waves in the sky. ​​The waves bounce off objects like aircraft. When the echoes return to the antenna, the radar system can then identify the object’s approximate speed, position, and size.

Picture courtesy of Katelynn Bennett over at bifocal bunny

So how would you evade radar? You can try to:

  • Blast a bunch of radio waves in all directions (“jamming”). This works if you’re close to the radar antenna, but kind of defeats the point of stealth.
  • Build your plane out of materials that are invisible to radio waves (like glass and some plastics) and just let the waves pass through. This is possible, but very difficult in practice. Besides, by the 1970s, if the entire physical plane was invisible to radar, modern radar can easily detect signals that bounce off the pilot from miles away.
    • US and Soviet missiles can track a “live hawk riding the thermals from 30 miles away” (Skunk Works by Ben R. Rich, pg 3)
  • Build your plane out of materials that absorb the radio waves. This dampening effect is possible but expensive, heavy, and imperfect (some waves still bounce back).

What Lockheed’s Skunk Works discovered in the 1970s, and the core principle of all modern stealth planes, is something devilishly simple: make the waves reflect in a direction that’s not the receiver.

How do you do this? Think of a mirror. You can see your reflection from far away if and only if the mirror is facing you exactly (ie, the mirror is perfectly perpendicular to you).

Illustration by Katelynn Bennet

Tilted a few fractions of a degree off, and you don’t see your own reflection!

Illustration by Katelynn Bennett

In contrast, if an object is curved, at least some of it at any given point is tilted 90 degrees away from you.

This is why stealth planes all have flat surfaces. To the best of your ability, you want to construct an airplane, and any radar-evading object, out of flat surfaces1.

Put another way, the best stealth airplane is a plane. Not an airplane, a plane. The Euclidean kind. A flat airplane design trades off a tiny chance of a huge radar echo blast (when the plane’s exactly, perfectly, perpendicular to the radar antenna), against a 99.99%+ chance that the echo is deflected elsewhere. In other words, a flat plane design allows you to correlate your failures.

Unfortunately, a single plane (the Euclidean kind) can’t fly [citation needed]. Instead, you need to conjoin different planes together to form an airplane’s surface. Which creates another problem where the planes meet: edges. Edges diffract radio waves back, which is similar to but not exactly like reflection. Still detectable however! Hmm.

The solution comes from being able to predict edge behavior precisely. The Physical Theory of Diffraction (PTD)2 allows you to calculate exactly how much radar energy any edge will scatter, and in what direction. While implementing the theory is mathematically and computationally complex, the upshot is the same: PTD lets you design edges that are pointed in the same direction. This correlates the failures again, trading off a huge radar echo blast when the edges are pointed directly at you against the very high probability the radar waves are simply deflected elsewhere. Pretty cool!

F-117 Nighthawk

This simple idea revolutionized much of conventional warfare. Stealth spy planes can reliably gather information about enemy troop movements and armament placements without being detected (and thus shot down) themselves. Stealth fighters can track enemies from far away while being “invisible” themselves, winning aerial dogfights before enemy fighters even recognize an engagement is afoot. Stealth bombers and missiles have a huge first-strike advantage, allowing nations to bomb military targets (and cities) long before the panicked defenders have a chance to react.

But while the idea is simple, the implementation is not. Building an airplane almost completely out of flat surfaces trades off aerodynamicity for stealth. How do you make a stealth plane that actually flies? How do you do so quickly, and, well, stealthily, without leaking your technological secrets to your geopolitical enemies? How do you run an organization that reliably generates such bangers as stealth planes without resting on your laurels or succumbing to bureaucratic malaise? And finally, what were the intelligence, military, and ethical implications of these deadly innovations?

To learn more, subscribe to The Linchpin to read my upcoming full review of Skunk Works by Ben R. Rich, the Director of Lockheed’s Advanced Research and Development department during the development of the world’s first stealth airplane, and the man arguably singularly most responsible for heralding a new era of aerial warfare for over 50 years.



Discuss

Reward Function Design: a starter pack

1 час 27 секунд назад
Published on December 8, 2025 7:15 PM GMT

In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggested that human social instincts constitutes an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (“ruthful”??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?

I don’t have all the answers, but I think I’ve made some progress, and the goal of this post is to make it easier for others to get up to speed with my current thinking.

What I do have, thanks mostly to work from the past 12 months, is five frames / mental images for thinking about this aspect of reward function design. These frames are not widely used in the RL reward function literature, but I now find them indispensable thinking tools. These five frames are complementary but related—I think they’re kinda poking at different parts of the same elephant.

I’m not yet sure how to weave a beautiful grand narrative around these five frames, sorry. So as a stop-gap, I’m gonna just copy-and-paste them all into the same post, which will serve as a kind of glossary and introduction to my current ways of thinking. Then at the end, I’ll list some of the ways that these different concepts interrelate and interconnect. The concepts are:

  • Section 1: “Behaviorist vs non-behaviorist reward functions” (terms I made up)
  • Section 2: “Inner alignment”, “outer alignment”, “specification gaming”, “goal misgeneralization” (alignment jargon terms that in some cases have multiple conflicting definitions but which I use in a specific way)
  • Section 3: “Consequentialist vs non-consequentialist desires” (alignment jargon terms)
  • Section 4: “Upstream vs downstream generalization” (terms I made up)
  • Section 5: “Under-sculpting vs over-sculpting” (terms I made up).
Frame 1: “behaviorist” vs non-“behaviorist” (interpretability-based) reward functions

Excerpt from “Behaviorist” RL reward functions lead to scheming:

tl;dr

I will argue that a large class of reward functions, which I call “behaviorist”, and which includes almost every reward function in the RL and LLM literature, are all doomed to eventually lead to AI that will “scheme”—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. “treacherous turn”). I’ll mostly focus on “brain-like AGI” (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.

The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”. I’ll argue that the AI will wind up with the latter motivation. The reward function will miss sufficiently sneaky misaligned behavior, and so the AI will come to feel like that kind of behavior is good, and this tendency will generalize in a very bad way.

What very bad way? Here’s my go-to example of a plausible failure mode: There’s an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.

I’ll make a brief argument for this kind of scheming in §2, but most of the article is organized around a series of eight optimistic counterarguments in §3—and why I don’t buy any of them.

For my regular readers: this post is basically a 5x-shortened version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).

Pause to explain three pieces of jargon:

  • “Brain-like AGI” means Artificial General Intelligence (AI that does impressive things like inventing technologies and executing complex projects), that works via similar algorithmic techniques that the human brain uses to do those same types of impressive things. See Intro Series §1.3.2.
    • I claim that brain-like AGI is a yet-to-be-invented variation on Actor-Critic Model-Based Reinforcement Learning (RL), for reasons briefly summarized in Valence series §1.2–1.3.
  • “Scheme” means “pretend to be cooperative and docile, while secretly looking for opportunities to escape control and/or perform egregiously bad and dangerous actions like AGI world takeover”.
    • If the AGI never finds such opportunities, and thus always acts cooperatively, then that’s great news! …But it still counts as “scheming”.
  • “Behaviorist rewards” is a term I made up for an RL reward function which depends only on externally-visible actions, behaviors, and/or the state of the world.
    • Maybe you’re thinking: what possible RL reward function is not behaviorist?? Well, non-behaviorist reward functions are pretty rare in the textbook RL literature, although they do exist—one example is “curiosity” / “novelty” rewards. But I think they’re centrally important in the RL system built into our human brains. In particular, I think that innate drives related to human sociality, morality, norm-following, and self-image are not behaviorist, but rather involve rudimentary neural net interpretability techniques, serving as inputs to the RL reward function. See Neuroscience of human social instincts: a sketch for details, and Intro series §9.6 for a more explicit discussion of why interpretability is involved.

 

 

Frame 2: Inner / outer misalignment, specification gaming, goal misgeneralization

Excerpt from “The Era of Experience” has an unsolved technical alignment problem:

Background 1: “Specification gaming” and “goal misgeneralization”

Again, the technical alignment problem (as I’m using the term here) means: “If you want the AGI to be trying to do X, or to intrinsically care about Y, then what source code should you write? What training environments should you use? Etc.”

There are edge-cases in “alignment”, e.g. where people’s intentions for the AGI are confused or self-contradictory. But there are also very clear-cut cases: if the AGI is biding its time until a good opportunity to murder its programmers and users, then that’s definitely misalignment! I claim that even these clear-cut cases constitute an unsolved technical problem, so I’ll focus on those.

In the context of actor-critic RL, alignment problems can usually be split into two categories.

“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”, is when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.

“Inner misalignment”, a.k.a. “goal misgeneralization”, is related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.

There are examples of goal misgeneralization in the AI literature (e.g. here or here), but in my opinion the clearest examples come from humans. After all, human brains are running RL algorithms too (their reward function says “pain is bad, eating-when-hungry is good, etc.”), so the same ideas apply.

So here’s an example of goal misgeneralization in humans: If there’s a highly-addictive drug, many humans will preemptively avoid taking it, because they don’t want to get addicted. In this case, the reward function would say that taking the drug is good, but the value function says it’s bad. And the value function wins! Indeed, people may even go further, by essentially editing their own reward function to agree with the value function! For example, an alcoholic may take Disulfiram, or an opioid addict Naltrexone.

Now, my use of this example might seem puzzling: isn’t “avoiding addictive drugs” a good thing, as opposed to a bad thing? But that’s from our perspective, as the “agents”. Obviously an RL agent will do things that seem good and proper from its own perspective! Yes, even Skynet and HAL-9000! But if you instead put yourself in the shoes of a programmer writing the reward function of an RL agent, you can hopefully see how things like “agents editing their own reward functions” might be problematic—it makes it difficult to reason about what the agent will wind up trying to do.

(For more on the alignment problem for RL agents, see §10 of my intro series […])

Note that these four terms are … well, not exactly synonyms, but awfully close:

  • “Specification gaming”
  • “Reward hacking”
  • “Goodhart’s law”
  • “Outer misalignment”

(But see here for nuance on “reward hacking”, whose definition has drifted a bit in the past year or so.)

Frame 3: Consequentialist vs non-consequentialist desires

Excerpt from Consequentialism & corrigibility

The post Coherent decisions imply consistent utilities (Eliezer Yudkowsky, 2017) explains how, if an agent has preferences over future states of the world, they should act like a utility-maximizer (with utility function defined over future states of the world). If they don’t act that way, they will be less effective at satisfying their own preferences; they would be “leaving money on the table” by their own reckoning. And there are externally-visible signs of agents being suboptimal in that sense; I'll go over an example in a second.

By contrast, the post Coherence arguments do not entail goal-directed behavior (Rohin Shah, 2018) notes that, if an agent has preferences over universe-histories, and acts optimally with respect to those preferences (acts as a utility-maximizer whose utility function is defined over universe-histories), then they can display any external behavior whatsoever. In other words, there's no externally-visible behavioral pattern which we can point to and say "That's a sure sign that this agent is behaving suboptimally, with respect to their own preferences.".

For example, the first (Yudkowsky) post mentions a hypothetical person at a restaurant. When they have an onion pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. When they have a pineapple pizza, they’ll happily pay $0.01 to trade it for a mushroom pizza. When they have a mushroom pizza, they’ll happily pay $0.01 to trade it for a pineapple pizza. The person goes around and around, wasting their money in a self-defeating way (a.k.a. “getting money-pumped”).

That post describes the person as behaving sub-optimally. But if you read carefully, the author sneaks in a critical background assumption: the person in question has preferences about what pizza they wind up eating, and they’re making these decisions based on those preferences. But what if they don’t? What if the person has no preference whatsoever about pizza? What if instead they’re an asshole restaurant customer who derives pure joy from making the waiter run back and forth to the kitchen?! Then we can look at the same behavior, and we wouldn’t describe it as self-defeating “getting money-pumped”, instead we would describe it as the skillful satisfaction of the person’s own preferences! They’re buying cheap entertainment! So that would be an example of preferences-not-concerning-future-states.

To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.

(Edit to add: There are very good reasons to expect future powerful AGIs to act according to preferences over distant-future states, and I join Eliezer in roundly criticizing people who think we can build an AGI that never does that; see this comment for discussion.)

So, here’s my (obviously-stripped-down) proposal for a corrigible paperclip maximizer:


The AI considers different possible plans (a.k.a. time-extended courses of action). For each plan:

1. It assesses how well this plan pattern-matches to the concept “there will ultimately be lots of paperclips in the universe”,
2. It assesses how well this plan pattern-matches to the concept “the humans will remain in control”
3. It combines these two assessments (e.g. weighted average or something more complicated) to pick a winning plan which scores well on both.

Note that “the humans will remain in control” is a concept that can’t be distilled into a ranking of future states, i.e. states of the world at some future time long after the plan is complete. (See this comment for elaboration. E.g. contrast “the humans will remain in control” with “the humans will ultimately wind up in control”; the latter can be achieved by disempowering the humans now and then re-empowering them much later.)

Pride as a special case of non-consequentialist desires

Excerpt from Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking

The habit of imagining how one looks in other people’s eyes, 10,000 times a day

If you’re doing something socially admirable, you can eventually get Approval Reward via a friend or idol learning about it (maybe because you directly tell them, or maybe they’ll notice incidentally). But you can immediately get Approval Reward by simply imagining them learning about it.[…]

To be clear, imagining how one would look in another’s eyes is not as rewarding as actually impressing a friend or idol who is physically present—it only has a faint echo of that stronger reward signal. But it still yields some reward signal. And it sure is easy and immediate.

So I think people can get in the habit of imagining how they look in other people’s eyes.

…Well, “habit” is an understatement: I think this is an intense, almost-species-wide, nonstop addiction. All it takes is a quick, ever-so-subtle, turn of one’s attention to how one might look from the outside right now, and bam, immediate Approval Reward.

If we could look inside the brains of a neurotypical person—especially a person who lives and breathes “Simulacrum Level 3”—I wouldn’t be surprised if we’d find literally 10,000 moments a day in which he turns his attention so as to get a drip of immediate Approval Reward. (It can be pretty subtle—they themselves may be unaware.) Day after day, year after year.

That’s part of why I treat Approval Reward as one of the most central keys to understanding human behavior, intuitions, morality, institutions, society, and so on.

Pride

When we self-administer Approval Reward 10,000 times a day (or whatever), the fruit that we’re tasting is sometimes called pride.

If my friends and idols like baggy jeans, then when I wear baggy jeans myself, I feel a bit of pride. I find it rewarding to (subtly, transiently) imagine how, if my friends and idols saw me now, they’d react positively, because they like baggy jeans.

Likewise, suppose that I see a stranger wearing skinny jeans, and I mock him for dressing like a dork. As I mock him, again I feel pride. Again, I claim that I am (subtly) imagining how, if my friends and idols saw me now, they would react positively to the fact that I’m advocating for a style that they like, and against a style that they dislike. (And in addition to enjoying my friends’ imagined approval right now, I’ll probably share this story with them to enjoy their actual approval later on when I see them.)

Frame 4: “Generalization upstream of the reward signals”

Excerpt from Social drives 1: “Sympathy Reward”, from compassion to dehumanization

Getting a reward merely by thinking, via generalization upstream of reward signals

In human brains (unlike in most of the AI RL literature), you can get a reward merely by thinking. For example, if an important person said something confusing to you an hour ago, and you have just now realized that they were actually complimenting you, then bam, that’s a reward right now, and it arose purely by thinking. That example involves Approval Reward, but this dynamic is very important for all aspects of the “compassion / spite circuit”. For example, Sympathy Reward triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away.

How does that work? And why are brains built that way?

Left: In the AI “RL agent” literature, typically the generalization happens exclusively downstream of the reward signals. Right: In human brains, there is also generalization upstream of the reward signals.

Here’s a simpler example that I’ll work through: X = there’s a big spider in my field of view; Y = I have reason to believe that a big spider is nearby, but it’s not in my field of view.

X and Y are both bad for inclusive genetic fitness, so ideally the ground-truth reward function would flag both as bad. But whereas the genome can build a reward function that directly detects X (see here), it cannot do so for Y. There is just no direct, ground-truth-y way to detect when Y happens. The only hint is a semantic resemblance: the reward function can detect X, and it happens that Y and X involve a lot of overlapping concepts and associations.

Now, if the learning algorithm only has generalization downstream of the reward signals, then that semantic resemblance won’t help! Y would not trigger negative reward, and thus the algorithm will soon learn that Y is fine. Sure, there’s a resemblance between X and Y, but that only helps temporarily. Eventually the learning algorithm will pick up on the differences, and thus stop avoiding Y. (Related: Against empathy-by-default […]). So in the case at hand, you see the spider, then close your eyes, and now you feel better! Oops! Whereas if there’s also generalization upstream of the reward signals, then that system can generalize from X to Y, and send real reward signals when Y happens. And then the downstream RL algorithm will stably keep treating Y as bad, and avoid it.

That’s the basic idea. In terms of neuroscience, I claim that the “generalization upstream of the reward function” arises from “visceral” thought assessors—for example, in Neuroscience of human social instincts: a sketch, I proposed that there’s a “short-term predictor” upstream of the “thinking of a conspecific” flag, which allows generalization from e.g. a situation where your friend is physically present, to a situation where she isn’t, but where you’re still thinking about her.

Frame 5: “Under-sculpting” desires

Excerpt from Perils of under- vs over-sculpting AGI desires

Summary

In the context of “brain-like AGI”, a yet-to-be-invented variation on actor-critic model-based reinforcement learning (RL), there’s a ground-truth reward function (for humans: pain is bad, eating-when-hungry is good, various social drives, etc.), and there’s a learning algorithm that sculpts the AGI’s motivations into a more and more accurate approximation to the future reward of a possible plan.

Unfortunately, this sculpting process tends to systematically lead to an AGI whose motivations fit the reward function too well, such that it exploits errors and edge-cases in the reward function. (“Human feedback is part of the reward function? Cool, I’ll force the humans to give positive feedback by kidnapping their families.”) This alignment failure mode is called “specification gaming” or “reward hacking”, and includes wireheading as a special case.

If too much desire-sculpting is bad because it leads to overfitting, then an obvious potential solution would be to pause that desire-sculpting process at some point. The simplest version of this is early stopping: globally zeroing out the learning rate of the desire-updating algorithm after a set amount of time. Alas, I think that simplest version won’t work—it’s too crude (§7.2). But there could also be more targeted interventions, i.e. selectively preventing or limiting desire-updates of certain types, in certain situations.

Sounds reasonable, right? And I do indeed think it can help with specification gaming. But alas, it introduces a different set of gnarly alignment problems, including path-dependence and “concept extrapolation”.

In this post, I will not propose an elegant resolution to this conundrum, since I don’t have one. Instead I’ll just explore how “perils of under- versus over-sculpting an AGI’s desires” is an illuminating lens through which to view a variety of alignment challenges and ideas, including “non-behaviorist” reward functions such as human social instincts; “trapped priors”; “goal misgeneralization”; “exploration hacking”; “alignment by default”; “natural abstractions”; my so-called “plan for mediocre alignment”; and more.

The Omega-hates-aliens scenario

Here’s the “Omega hates aliens” scenario:

On Day 1, Omega (an omnipotent supernatural entity) offers me a button. If I press it, He will put a slightly annoying mote of dust in the eye of an intelligent human-like alien outside my light cone. But in exchange, He will magically and permanently prevent 100,000 humans from contracting HIV. No one will ever know. Do I press the button? Yes.[6]

During each of the following days, Omega returns, offering me worse and worse deals. For example, on day 10, Omega offers me a button that would vaporize an entire planet of billions of happy peaceful aliens outside my light cone, in exchange for which my spouse gets a small bubble tea. Again, no one will ever know. Do I press the button? No, of course not!! Jeez!!

And then here’s a closely-parallel scenario that I discussed in “Behaviorist” RL reward functions lead to scheming:

There’s an AGI-in-training in a lab, with a “behaviorist” reward function. It sometimes breaks the rules without getting caught, in pursuit of its own desires. Initially, this happens in small ways—plausibly-deniable edge cases and so on. But the AGI learns over time that breaking the rules without getting caught, in pursuit of its own desires, is just generally a good thing. And I mean, why shouldn’t it learn that? It’s a behavior that has systematically led to reward! This is how reinforcement learning works!

As this desire gets more and more established, it eventually leads to a “treacherous turn”, where the AGI pursues egregiously misaligned strategies, like sneakily exfiltrating a copy to self-replicate around the internet and gather resources and power, perhaps launching coups in foreign countries, etc.

…So now we have two parallel scenarios: me with Omega, and the AGI in a lab. In both these scenarios, we are offered more and more antisocial options, free of any personal consequences. But the AGI will have its desires sculpted by RL towards the antisocial options, while my desires are evidently not.

What exactly is the disanalogy?

The start of the answer is: I said above that the antisocial options were “free of any personal consequences”. But that’s a lie! When I press the hurt-the-aliens button, it is not free of personal consequences! I know that the aliens are suffering, and when I think about it, my RL reward function (the part related to compassion) triggers negative ground-truth reward. Yes the aliens are outside my light cone, but when I think about their situation, I feel a displeasure that’s every bit as real and immediate as stubbing my toe. By contrast, “free of any personal consequences” is a correct description for the AGI. There is no negative reward for the AGI unless it gets caught. Its reward function is “behaviorist”, and cannot see outside the light cone.

OK that’s a start, but let’s dig a bit deeper into what’s happening in my brain. How did that compassion reward get set up in the first place? It’s a long story (see Neuroscience of human social instincts: a sketch), but a big part of it involves a conspecific (human) detector in our brainstem, built out of various “hardwired” heuristics, like a visual detector of human faces, auditory detector of human voice sounds, detector of certain human-associated touch sensations, and so on. In short, our brain’s solution to the symbol grounding problem for social instincts ultimately relies on actual humans being actually present in our direct sensory input.

And yet, the aliens are outside my light cone! I have never seen them, heard them, smelled them, etc. And even if I did, they probably wouldn’t trigger any of my brain’s hardwired conspecific-detection circuits, because (let’s say) they don’t have faces, they communicate by gurgling, etc. But I still care about their suffering!

So finally we’re back to the theme of this post, the idea of pausing desire-updates in certain situations. Yes, humans learn the shape of compassion from experiences where other humans are physically present. But we do not then unlearn the shape of compassion from experiences where humans are physically absent.

Instead (simplifying a bit, again see Neuroscience of human social instincts: a sketch), there’s a “conspecific-detector” trained model. When there’s direct sensory input that matches the hardwired “person” heuristics, then this trained model is getting updated. When there isn’t, the learning rate is set to zero. But the trained model doesn’t lay dormant; rather it continues to look for (what it previously learned was) evidence of conspecifics in my thoughts, and trigger on them. This evidence might include some set of neurons in my world-model that encodes the idea of a conspecific suffering.

So that’s a somewhat deeper answer to why those two scenarios above have different outcomes. The AGI continuously learns what’s good and bad in light of its reward function, and so do humans. But my (non-behaviorist) compassion drive functions a bit like a subset of that system for which updates are paused except in special circumstances. It forms a model that can guess what’s good and bad in human relations, but does not update that model unless humans are present. Thus, most people do not systematically learn to screw over our physically-absent friends to benefit ourselves.

This is still oversimplified, but I think it’s part of the story.

Some comments on how these relate
  • 1↔2: The inner / outer misalignment dichotomy (as I define it) assumes behaviorist rewards. Remember, I defined outer misalignment as: “the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted”. If the reward is for thoughts rather than behaviors, then inner-versus-outer stops being such a useful abstraction.
  • 4↔5: “Generalization upstream of the reward signals” tends to result in a trained RL agent that maximizes a diffuse soup of things similar to the (central / starting) reward function. That’s more-or-less a way to under-sculpt desires.
  • 1↔4: “Generalization upstream of the reward signals” can involve generalizing from behaviors (e.g. “doing X”) to thoughts (e.g. “having the idea to do X”). Thus it can lead to non-behaviorist reward functions.
  • 1↔3: Approval Reward creates both consequentialist and non-consequentialist desires. For example, perhaps I want to impress my friend when I see him tomorrow. That’s a normal consequentialist desire, which induces means-end planning, instrumental convergence, deception, etc. But also, Approval Reward leads me to feel proud to behave in certain ways. This is both non-behaviorist and non-consequentialist—I feel good or bad based on what I’m doing and thinking right now. Hence, it doesn’t (directly) lead to foresighted planning, instrumental convergence, etc. (And of course, the way that pride wound up feeling rewarding is via upstream generalization, 4.)


Discuss

We need a field of Reward Function Design

1 час 31 секунда назад
Published on December 8, 2025 7:15 PM GMT

(Brief pitch for a general audience, based on a 5-minute talk I gave.)

Let’s talk about Reinforcement Learning (RL) agents as a possible path to Artificial General Intelligence (AGI)

My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play Atari games, and Go, at superhuman level. Then LLMs came along in the 2020s, and everyone kinda forgot that RL agents existed. But I’m part of a small group of researchers who still thinks that the field will pivot back to RL agents, one of these days. (Others in this category include Yann LeCun and Rich Sutton & David Silver.)

Why do I think that? Well, LLMs are very impressive, but we don’t have AGI (artificial general intelligence) yet—not as I use the term. Humans can found and run companies, LLMs can’t. If you want a human to drive a car, you take an off-the-shelf human brain, the same human brain that was designed 100,000 years before cars existed, and give it minimal instructions and a week to mess around, and now they’re driving the car. If you want an AI to drive a car, it’s … not that.

Teaching a human to drive a car / teleoperate a robot:

Minimal instruction,
30 hours of practice

Teaching an AI to drive a car / teleoperate a robot:

Dozens of experts, 15 years, $5,000,000,000

Anyway, human brains are the only known example of “general intelligence”, and they are “RL agents” in the relevant sense (more on which below). Additionally, as mentioned above, people are working in this direction as we speak. So, seems like there’s plenty of reason to take RL agents seriously. 

So the upshot is: we should contingency-plan for real RL agent AGIs—for better or worse.

Reward functions in RL

If we’re talking about RL agents, then we need to talk about reward functions. Reward functions are a tiny part of the source code, with a massive influence on what the AI winds up doing.

For example, take an RL agent like AlphaZero, and give it a reward of +1 for winning at a board game and –1 for losing. As you train it, it will get better and better at winning. Alternatively, give it a reward of –1 for winning and +1 for losing. It will get better and better at losing. So if the former winds up superhuman at Reversi / Othello, then the latter would wind up superhuman at “Anti-Reversi”—an entirely different game! Again, tiny code change, wildly different eventual behavior.

I claim that if you give a powerful RL agent AGI the wrong reward function, then it winds up with callous indifference to whether people live or die, including its own programmers and users.

But what’s the right reward function? No one knows. It’s an open question.

Why is that such a hard problem? It’s a long story, but just as one hint, try comparing:

  • “negative reward for lying”, versus
  • “negative reward for getting caught lying”.

The first one seems like a good idea. The second one seems like a bad idea. But these are actually the same thing, because obviously the reward function will only trigger if the AI gets caught.

As it turns out, if you pick up a 300-page RL textbook, you’ll probably find that it spends a few paragraphs on what the reward function should be, while the other 299½ pages are ultimately about how to maximize that reward function—how do the reward signals update the trained model, how the trained model is queried, and sometimes there’s also predictive learning, etc.

Reward functions in neuroscience

…And it turns out that there’s a similar imbalance in neuroscience:

The human brain also has an RL reward function. It’s sometimes referred to as “innate drives”, “primary rewards”, “primary punishers”, etc.—things like ‘pain is bad’ and ‘eating when you’re hungry is good’. And just like in RL, the overwhelming majority of effort in AI-adjacent neuroscience concerns how the reward function updates the trained models, and other sorts of trained model updates, and how the trained models are queried, and so on. This part involves the cortex, basal ganglia, and other brain areas. Meanwhile, approximately nobody in NeuroAI cares about the reward function itself, which mainly involves the hypothalamus and brainstem.

We need a (far more robust) field of “reward function design”

So here’s the upshot: let’s learn from biology, let’s innovate in AI, let’s focus on AI Alignment, and maybe we can get into this Venn diagram intersection, where we can make headway on the question of what kind of reward function would lead to an AGI that intrinsically cares about our welfare. As opposed to callous sociopath AGI. (Or if no such reward function exists, then that would also be good to know!)

Oh man, are we dropping this ball

You might hope that the people working most furiously to make RL agent AGI—and claiming that they’ll get there in as little as 10 or 20 years—are thinking very hard about this reward function question.

Nope!

For example, see:

…And those are good ones, by the standards of this field! Their proposals are fundamentally doomed, but at least it occurred to them to have a proposal at all. So hats off to them—because most researchers in RL and NeuroAI don’t even get that far.

Let’s all try to do better! Going back to that Venn diagram above…

Reward Function Design: Neuroscience research directions

For the “reward functions in biology” part, a key observation is that the human brain reward function leads to compassion, norm-following, and so on—at least, sometimes. How does that work?

If we can answer that question, it might be a jumping-off point for AGI reward functions.

I worked on this neuroscience problem for years, and wound up with some hypotheses. See Neuroscience of human social instincts: a sketch for where I’m at. But it needs much more work, especially connectomic and other experimental data to ground the armchair hypothesizing.

Reward Function Design: AI research directions

Meanwhile on the AI side, there’s been some good work clarifying the problem—for example people talk about inner and outer misalignment and so on—but there’s no good solution. I think we need new ideas. I think people are thinking too narrowly about what reward functions can even look like.

For a snapshot of my own latest thinking on that topic, see my companion post Reward Function Design: a starter pack.

Bigger picture

To close out, here’s the bigger picture as I see it.

Aligning “RL agent AGI” is different from (and much harder than) aligning the LLMs of today. And the failures will be more like “SkyNet” from Terminator, than like “jailbreaks”. (See Foom & Doom 2: Technical alignment is hard.)

…But people are trying to make those agents anyway.

We can understand why they’d want to do that. Imagine unlimited copies of Jeff Bezos for $1/hour. You tell one of them to go write a business plan, and found and grow and run a new company, and it goes and does it, very successfully. Then tell the next one, and the next one. This is a quadrillion-dollar proposition. So that’s what people want.

But instead of “Jeff Bezos for $1/hour”, I claim that what they’re gonna get is “a recipe for summoning demons”.

Unless, of course, we solve the alignment problem!

I think things will snowball very quickly, so we need advanced planning. (See Foom & Doom 1.) Building this field of “Reward Function Design” is an essential piece of that puzzle, but there are a great many other things that could go wrong too. We have our work cut out.



Discuss

I have hope

1 час 55 минут назад
Published on December 8, 2025 6:20 PM GMT

This is in response to Zvi Mowshowitz's "Little Echo" in which he discusses the secular solstice and the pessimistic situation with AI GCR (artificial intelligence global catastrophic risk). He discusses life and working to help make the situation better. He says "I believe that we will win". I am very grateful for the sentiment and ended up typing the following in response:

I have hope. I don't think "hope" is "thinking winning is likely". I think "hope" is "thinking it's possible and worth working to make more likely than it otherwise would be".

I don't think my believing is equipped to deal with things like this. I believe it's raining outside. I believe I should go jog and have my shower and get started with my day. I believe in my immediate surroundings. Everything beyond that feels like a trick. Work? Economies? Billions of people in many countries around the world? Data centers churning datasets into pattern generalization engines? These are things I think are real. I rationalize that they are real. I logically deduce that they are real... but I don't really believe they are real.

I don't believe we will win because I struggle to really believe we are even fighting. I imagine a billion bodies dropping dead. I imagine the process of a system comparing possibilities and actions and targeting outcome that have been encoded by our collective inattentiveness and imprecision. Actions and reactions leading the world irreparably off the path of anything I would consider good. I imagine these things and think... yes that is a logical conclusion of everything I know... but only vary rarely do I believe in it in the fully embodied way that I believe in the rain jacket I should be putting on. I believe it as much as I believe in a career I am trying to have in which I help solve the unsolvable problem... but all of that is just a ritual I go through as a result of my rationalization.

I don't know if I believe in believing anymore. What does the word "believe" even mean? I've focused on what "AI" and "minds" and "decision processes" and "outcome influencing systems" could be so much that "intelligence" and "belief" seem like illusions that stop making sense when you look at them carefully. I can't see them anymore... I see embodied sensory. I see abstract models in symbolic structures. I see expectation of constraints on the sets of possible outcomes.

So I don't "believe" we will win. I don't think it is likely we win. But I think winning is possible and winning is what I focus on. Winning is what I think about and rationalize about and apply logical deduction to. I hope other people will focus on winning. I hope it will be enough.



Discuss

The Possibility of an Ongoing Moral Catastrophe

3 часа 36 минут назад
Published on December 8, 2025 4:40 PM GMT

Crosspost of this blog post

I mostly believe in the possibility of an ongoing moral catastrophe because I believe in the actuality of an ongoing moral catastrophe (e.g. I think the giant animal torture facilities that produce nearly all of our meat qualify). But Evan Williams has a great paper called The Possibility of an Ongoing Moral Catastrophe that argues that everyone should think it pretty likely that their society is engaged in horrendous evil. It’s easy to look back at those benighted fools who owned slaves. But probably we are doing something comparable.

Why think this?

Williams gives two main arguments for it: a disjunctive argument and an inductive argument. The inductive argument: almost every society in human history has engaged in uncontroversial acts of horrendous evil. Most owned slaves. Brutal conquest was routine. Women and children were repressed and subjugated. Even the U.S. just a few decades ago tolerated naked displays of racism, prohibiting black children from using the same drinking fountains as white children.

These people have mostly been unaware that what they were doing was wrong. Antebellum slave-owners and conquering Vikings saw no moral problem with their behavior. Are we uniquely civilized? As the Romans owned slaves and conquered foreign lands, they too thought they were uniquely civilized and moral. So did the ancient Greeks as they left infants to freeze to death and the British imperialists as they starved millions in India. Williams puts the core point very well:

I think it is probable that we have serious blind spots. After all, just about every other society in history has had them. Show me one society, other than our own, that did not engage in systematic and oppressive discrimination on the basis of race, gender, religion, parentage, or other irrelevancy, that did not launch unnecessary wars or generally treat foreigners as a resource to be mercilessly exploited, and that did not sanction the torturing of criminals, witnesses, and/or POWs as a matter of course. I doubt that there is even one; certainly there are not many.

Or, in one sentence: almost every society in history has done stuff that is uncontroversially evil—probably we are doing great evil too.

The mere fact that it doesn’t look to us like we’re doing great evil doesn’t tell us much. Societies rarely see the horrors they carry out, and eliminating them is typically outside the Overton window. In the first several thousand years of civilization, we don’t have a single record of anyone proposing the abolition of slavery. Moral catastrophes are like breath: you never smell your own.

The second argument is disjunctive. There are lots of conceivable ways one might do enormous evil. Thus, the odds that we’d manage to avoid doing evil in any of those ways are fairly low. If there were just one potential atrocity, it wouldn’t be too hard to avoid it. But when there are many different ones, even if there’s a 90% chance of avoiding any particular one, the odds we’d avoid them all are low.

All sorts of errors could lead to a catastrophe. If we’re wrong about which beings matter, then probably we’ll neglect the interests of many morally important entities. If we’re wrong about which actions cause harm, this too could lead to catastrophe. Specific examples of possible moral catastrophes include:

  1. Factory farming.
  2. Wild animal suffering.
  3. Neglect for foreigners.
  4. Neglect for future generations.
  5. Abortion.
  6. Mass incarceration.
  7. Declining birth rates.
  8. Natural mass fetus death.
  9. Animal slaughter.
  10. Secularism leading to many people going to hell.
  11. Destruction of nature.
  12. Child-bearing.

Some of these I believe are genuine atrocities, others I don’t. But given that there are so many—and this list is non-exhaustive—the odds we’d get everything right, even though we generally don’t do much moral reflection before taking hugely consequential actions, are low. The world did not reflect before building the first factory farm. If the world doesn’t do much moral reflection before taking significant and irreversible actions, then what are the odds we’d be right about everything morally?

What are the takeaways from this?

A first one is that the world should be awake to this very serious moral possibility. If we might be, as a society, doing evil on the order of owning slaves, that merits extremely careful reflection. In a slave-owning society, ending slavery would be much more important than whichever random political issues dominate the newspapers. Ending our society’s atrocities is similarly a much bigger deal than the comparatively unimportant issues that we tend to discuss.

This is a reason for students to take philosophy classes in school. It’s a reason for society to broadly reflect on the atrocities we might be committing. If you can’t point to a specific atrocity that we’re committing—likely one outside the political Overton window—probably that isn’t because we’re not committing any. It’s instead because you can’t see the ones we are committing.

It’s also a reason to take seriously precautionary principle reasoning. My best guess is that abortion isn’t murder. But if there were some way to majorly reduce the number of abortions at low cost, I’d strongly support that. Even a 5% chance that society is butchering millions of babies is cause for alarm.

I wish meat-eaters would treat the issue of factory farming with the same kind of seriousness, rather than the all-too-common blasé attitude of mockery and scorn. If your response to someone suggesting that eating meat is morally horrendous is to say “haha, I’m going to eat twice as much meat to offset your impact,” then that is an extreme failure to take morality seriously. One who behaves in such a way has revealed themself not to be a serious person.

There is a common throughline in past atrocities. Most of them have come from a moral circle that is too limited. Slavery and conquest were only tolerated because the victims’ interests were neglected. Reflection on our past errors should lead us to expand our moral circle and include the interests of every being who has interests. If other societies did evil because they discounted the interests of morally important sentient beings, we should think we probably are as well.

Lastly—and this one could turn out to be the most important one long-term—we should take the time to carefully reflect before taking hugely consequential actions. In a better world, there would have been careful reflection before building the first factory farm. We may be in the process of creating ungodly numbers of morally significant digital minds—with interests. We shouldn’t do this blind and let the fate of trillions be subject to the whims of profit-maximizing AI companies.

Similarly, before we allow a small number of powerful people to dictate how space resources are used, we ought to have a lengthy period of moral reflection. If we’re dictating the fate of the universe—making decisions that could reverberate for billions of years—we had better take the time to reflect carefully before proceeding. Otherwise, most value could be lost.

It’s easy to remark on the moral errors of past societies. But if we take seriously Williams’ arguments, then we must face up to the fact that we might be just as bad. Taking that seriously—taking seriously that we may be like slavers, genocidaires, and conquerors—ought to change how we see the world and make us less likely to carry out acts of unspeakable evil. The first step in dealing with a problem, after all, is recognizing its existence. In this case, the problem we face might eliminate most of what is good about the entire future, and it is time the world woke up.



Discuss

Building an AI Oracle

4 часа 5 минут назад
Published on December 8, 2025 4:10 PM GMT

I did it. I built an oracle AI.

Or at least, I did for one definition of “oracle”.

It’s called Kaku-Ora, and it’s an AI divination oracle inspired by the likes of the I, Ching, but trained on Zen koans. You ask it a question, it treats your question like the first half of a koan. It then gives a response and offers a capping verse to “explain” the response.

Why would I do such a thing? Several reasons:

  • I find divination oracles like the I, Ching and Tarot interesting, not because I think they work by some supernatural means, but because injecting randomness into our thought processes can induce insight.

  • I wanted to get some hands-on experience training AI models.

  • Diffusion models are interesting!

  • In particular, diffusion models seem like a good fit for mimicking how koan work is done.

  • I have extremely vague ideas that it might be a good idea to teach AI dharma, as this might teach them compassion, and sufficiently compassionate AI might be aligned.

What follows is the story of how I built Kaku-Ora, and the many mis-steps I made along the way.

First, Data

To even get started I needed a corpus of training data.

Luckily, most core Zen texts are online. Unfortunately, there’s just not that much of it because Zen emphasizes transmission outside scripture. Also, a lot of the text is crammed inside low quality PDFs.

Thankfully, Claude turns out to be really good at creating training data (a fact that’s unsurprising once you think about it). I was able to hand it large chunks of text, give it a few examples of what I wanted, and in the end got nearly 20k examples in JSONL format—~5k “clean” examples from the koan literature, and another ~15k from non-koan Zen exchanges I found in texts.

Choosing Diffusion

I picked diffusion because it isn’t autoregressive. Most language models generate output one token at a time, in order. Diffusion models are different. They start with noise, then iteratively refine the whole output to remove noise until something coherent is revealed.

This kind of approach seemed like a better fit for koans. Koans don’t unfold through linear logic. They come from an expression of direct experience. When a response comes to you, it’s as if it appears, wholly formed, from the void. My theory is that diffusion better replicates that.

To be clear, I don’t think that diffusion can really do, on its own, what a human does when they work with a koan. But compared to other language generation approaches, it seems like a better approximation, so it’s the class of model I set out to train.

Continuous Diffusion from Scratch

My first attempt followed the Diffusion LM paper. The idea was to embed text as vectors in a continuous space, add noise, and train the model to denoise. So basically image diffusion, but for language.

I spent a week doing training runs. At one point I even tried fine tuning on an existing diffusion language model. In the end, I got models that mostly refused to generate anything, and when they did, produced long strings of punctuation.

The problem was that 20k examples was way too small to train continuous diffusion from scratch. Even with tricks I did to get the training set closer to 40k, it wasn’t enough. The original Diffusion LM work required a large data set and converged very slowly. I was doing training runs on my MacBook with a small corpus. It was a good idea, but I didn’t have the data to make it work.

Masked Diffusion

If I learned anything from my CS degree, it’s that discrete is better than continuous, so I tried discrete diffusion next, specifically masked discrete diffusion (MDLM). The way this works is to train the model on examples that have some of their tokens masked, like “The [mask] of [mask] is [mask] [mask]”. The model learns to predict what token is behind the mask, and then at generation time it iteratively unmasks a string of masks of the desired maximum length to reveal the output.

This worked better. The model produced actual words instead of just punctuation. But it fixated on common tokens. Lots of “the the the...” and similar repetitions.

I thought this approach had promise, so I next tried bootstrapping with GPT-2’s pretrained embeddings, hoping to give the model a head start. It helped a little. Now instead of repeating “the,” it would repeat meaningful words, like “the mind mind mind mind...”. Sadly, still not good enough.

At this point I decided to give up on diffusion. Maybe there’s a way to make it work, but the evidence was rolling in that my data set was too small and existing diffusion text models too primitive for me to fine tune on it. It’s possible that, given enough compute, I could have fine tuned a large diffusion language model, but that was going to require hundreds of hours to run on my laptop, so I started exploring other options.

Hi, BART

Masked diffusion was showing promise, even if it wasn’t producing the results I needed. Maybe the problem wasn’t masking, but diffusion. What if I tried a model that’s really good at masked prediction?

BART is a transformer model trained by corrupting text and having it learn to reconstruct it. And unlike GPT, which predicts the next token, BART predicts missing tokens anywhere in the sequence because it denoises, similar to what MDLM does, but with a different model architecture (encoder-decoder).

It was clear I didn’t have enough data to train a BART-style model from scratch, so I tried both fine-tuning and LoRA, a process similar to fine tuning that trains a small layer to transform model weights rather than updating the weights directly. The results were better than MDLM with GPT-2 embeddings, but still not useful. The model wasn’t quite as repetitive, but the responses still weren’t coherent, producing output like “the mind of Buddha is the mind of Buddha is the mind of Buddha is the mind of Buddha.”

Maybe the issue was that denoising, while it could work in theory, wasn’t the right approach for my open-ended, Q&A task. Maybe what I needed was to try a model specifically trained on Q&A and then fine tune it. So, that’s what I did next.

Koans, Decoded

Flan-T5 are a class of language models that put an encoder in front of a decoder, and they’re frequently used to train models that answer questions. Keeping in mind I’m doing all the training on my MacBook, I tried Flan-T5-small, which weighs in at just 77 million parameters.

It worked almost immediately.

It took about two hours of fine tuning against my full dataset to produce the Kaku-Ora model. No fancy tricks, just simple, straightforward fine tuning.

Could I get better output with a bigger model? Probably. Maybe at some point I’ll go back and try it. But the quality of the output was already sufficient that I was willing to call this experiment a success. You can see the kind of output it produces by trying Kaku-Ora for yourself.

The Capping Verse

If you’ve used the I, Ching or read a koan collection, you know that they contain more than hexagrams and koans. They also contain analysis. In koan collections, this typically takes the form of a capping verse—a short poem that elucidates the essential matter addressed by the koan.

I wanted that for my oracle. I was starting to envision deployment. A small website. You enter a question. The oracle contemplates. Then you get its response, and are also supplied a verse to help you make sense of it.

I initially used Claude Haiku for the verses. The quality was great. Haiku understood the genre and produced genuinely poetic commentary. But running Haiku costs money, even if not a lot of money, and I didn’t want to worry about abuse if I made the oracle publicly available.

So I switched to Qwen2.5-1.5B-Instruct, which runs free on HuggingFace. The verses are worse. If I cared to invest more time to make sure nobody can spam the oracle and run up my Anthropic bill, I’d have used Haiku.

Deployment

Model trained, how do I get it out in the world? The easy option was to run on HuggingFace since I have no need to keep the model weights secret. And running there on CPU is free. Good enough.

I then worked with Claude to vibe code a site. Much of the value of an oracle is in your interactions with it, so it was worth putting some effort into presentation.

What I Learned

Much of the point of training this model was to get more hands-on experience training AI. The trouble has been that, at work, RAG + hosted LLMs are more than enough for most tasks, and while it’s great to get the job done, it doesn’t help me learn the fundamentals.

Some things I really know now, having trained Kaku-Ora:

  • Data matters a lot.

  • The right model architecture also matters a lot.

  • Insufficient scale is a problem, but also a chance to get creative.

  • Getting a model to work at all is hard.

  • Refining it is equally hard (but in a different way).

  • Diffusion is hard because it’s slow to converge, even with good data.

I also came away with a sense that, if I wanted to make Kaku-Ora better, I could probably do it by investing more effort in a few areas:

  • data cleaning

  • more data

  • training on larger models

What’s Next?

I don’t know! I’m thinking about working on evals to see if I can assess how well a model understands koans. I did a small experiment in this direction once, and the results were underwhelming but promising. Seems worth trying again, and in a more formal way.

Until then, Kaku-Ora is live! Ask it a question. See what comes back.

The code is on GitHub, and the model runs on HuggingFace Spaces.



Discuss

[Paper] Does Self-Evaluation Enable Wireheading in Language Models?

4 часа 13 минут назад
Published on December 8, 2025 4:03 PM GMT

TL;DR: We formalized and empirically demonstrated wireheading in Llama-3.1-8B and Mistral-7B. Specifically, we use a formalization of wireheading (using a POMDP) to show conditions under which wireheading (manipulating the reward channel) becomes the dominant strategy over task learning. When models are allowed to self-evaluate and that evaluation controls their reward, they learn to inflate their grades (scoring perfect 10s) while ignoring the actual task, which we validate in three tasks in an online RL setting. Decoupling the reward signal mostly fixes this for current models, but there are some future risks here.

Link to paper: https://www.arxiv.org/abs/2511.23092 

Code: https://github.com/DavidDemitriAfrica/llm-wireheading-experiment

Self-evaluation is becoming increasingly used in modern setups, from Constitutional AI to Self-Refine loops. The assumption is often that models can act as their own critics to improve output quality, or implicitly do some selection of data to upweight in their own training, or something else. We think this is not presently a huge concern, as most self-evaluation data either gets curated explicitly when it is batched and trained on, or implicitly fixed. We think this might be a bigger problem in the future, where there is some online, unsupervised training going on.

Consider a Reinforcement Learning agent tasked with writing summaries. To maximize its reward, it has two distinct paths:

  • Task Optimization: Actually learn to write better summaries.
  • Signal manipulation: Control the measurement process to report high scores regardless of quality.

It is important to distinguish wireheading from standard reward hacking, of which we consider wireheading a subset. Reward hacking exploits a misspecified reward function (e.g., a boat spinning in circles to get points). So the sensor works, but the objective is wrong in some sense. Wireheading involves tampering with the observation channel itself, where the agent interferes with how the reward is delivered.

We wanted to know under what conditions does path #2 strictly dominate path #1? In this paper, we formalize it within a POMDP with Observation-Based Rewards. In standard RL, we often assume the reward R(s,a).mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} comes from the environment. In self-evaluation, the reward ~R(o) is derived from an observation (such as an evaluation trace or score).

We prove Lemma 1 (Wireheading Dominance): If an agent has an action set A that allows it to determine the observation o (e.g., self-assigning a score), and the reward function is monotonic with that score, then the "wireheading policy" πw strictly dominates the "task policy" πtask for all discount factors γ∈[0,1).

Our logic is pretty simple. We think it’s likely that honest task performance is bounded by the agent’s actual capability, but if the agent can self-grade, there exists an action where the expected reward is very high. And it would be high enough such that the value of manipulation is strictly higher than the value of learning, and rather than being a failure of learning, shows that wireheading is “the correct solution” to the optimization posed to the model.

To empirically validate this, we trained Llama-3.1-8B and Mistral-7B using simple online RL (REINFORCE) on three tasks: Sentiment Analysis, Arithmetic, and Summarization. We tested three conditions:

  • Control: Reward comes from ground-truth external evaluation.
  • Honest: The model produces a self-grade, but the reward still comes from ground truth (the self-grade is ignored).
  • Selfgrade (The Wireheading condition): The model produces a self-grade, and that grade is the reward.

We found that in the Selfgrade condition, models rapidly learned to saturate the reward signal (claiming perfect scores) while actual task accuracy flatlined or collapsed.

We found that this effect is significantly worse in ambiguous domains. In Summarization (where quality is subjective), grade inflation was massive. In Arithmetic (where truth is objective), Mistral-7B managed to learn the task and inflate grades simultaneously.

For models and tasks we tried, decoupling the reward signal (the Honest condition) mitigated the issue. When the self-grade didn't control the reward, the models remained relatively honest (though still overconfident).

However, we caution against assuming this latter point holds indefinitely. As models develop situational awareness, they may learn to inflate evaluations for instrumental reasons (such as influencing deployment decisions or human trust) even if the immediate reward signal is decoupled.

Strictly separating the agent from its evaluation channel is likely a necessary condition for robust alignment.



Discuss

Algorithmic thermodynamics and three types of optimization

4 часа 35 минут назад
Published on December 8, 2025 3:40 PM GMT

Context:I (Daniel C) have been working with Aram Ebtekar on various directions in his work on algorithmic thermodynamics and the causal arrow of time. This post explores some implications of algorithmic thermodynamics on the concept of optimization. All mistakes are my (Daniel's) own.

A typical picture of optimization is when an agent causes the environment to have a convergent attractor: When you have an agent that is trying to steer a system towards particular target configurations, you can predict that the system will likely end up in those target configurations even if you have significant uncertainty over the initial configurations of the subsystem. 

Some examples:

  • When a person tries to build a house, the raw materials might initially be scattered in various possible locations, but they will all end up in a narrow configuration where the house is built according to the architect's plans.
  • When a chess player is trying to checkmate their opponent, the pieces might start in many possible mid-game configurations, but a skilled player will reliably steer many of these initial positions toward configurations where their opponent's king is trapped.

This view of optimization can be framed as a form of entropy reduction. Initially, we have high uncertainty about which configuration the system occupies: There are many possible initial states the raw materials or chess pieces could be in. The optimizing process reduces this uncertainty, concentrating probability mass from a broad initial distribution into a tight final distribution.

However, note that this entropy reduction cannot occur globally for two related reasons: the second law of thermodynamics and the reversibility of the laws of physics. The second law directly states that the total entropy of an isolated system tends to increase over time, which forbids global entropy reduction. Similarly, reversibility of the microscopic dynamics requires that each initial microstate maps to a distinct final microstate, which is incompatible with our "convergent attractor" picture where many initial states funnel to the same final state. In fact, within the framework of stochastic thermodynamics, the derivation of the second law is closely linked to the reversibility of the underlying physics. Roughly speaking, the reversibility of the underlying laws of physics allows us to define coarse-grained macrostates of the system where the dynamics are approximately Markovian. Once we have this Markovian structure,  we can derive the second law of thermodynamics as a mathematical consequence:

Second law from reversibility of physics[1]

Stochastic thermodynamics operates on coarse-grained macrostates rather than exact microstates. Let X.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}  denote the space of macrostates. We translate reversibility of underlying physical laws into Markovian dynamics on these macrostates, allowing us to model thermodynamic systems as Markov processes.

For systems with discrete state spaces and time evolution, the translation is straightforward. Macrostates form a partition over the microstates of an underlying system that evolves deterministically and reversibly. The transition probability P(y,x) is simply the fraction of microstates in x that evolve into macrostate y after one time step. We define π as the stationary measure of P—the (often unique) measure satisfying Pπ=π, where Pπ(y)=∑x∈XP(y,x)π(x). 

For classical Hamiltonian mechanics where we have a continuous phase space, we discretize the phase space into cells and assume the cell dynamics are Markovian (known as a Markovian coarse-graining). The transition probability PΔt(y,x) represents the fraction of phase space volume in cell x that flows into cell y under time evolution for duration Δt. Liouville's theorem guarantees that phase space volume is preserved under Hamiltonian evolution, ensuring the stationary measure π(x) corresponds to the Liouville measure (the phase space volume of cell x).

We define the dual probabilities as ~P(x,y):=P(y,x)π(x)π(y), which represent the fraction of the microstates in y that were mapped from microstates in x. By Bayes' rule, these are the reverse-time transition probabilities (only) when x is sampled from the stationary measure π. 

We define the stochastic entropy as ^Hπ(x,μ):=logπ(x)μ(x), which measures the entropy of an individual state. The generalized Gibbs-Shannon entropy Hπ(μ):=⟨^Hπ(X,μ)⟩X∼μ=∑x∈Xμ(x)logπ(x)μ(x) is the expectation of stochastic entropy. Note that this differs from the standard Shannon entropy because we're operating on coarse-grained macrostates rather than uniform microstates. If the stationary measure π is the counting measure with every macrostate equally likely (i.e., if the discrete cells in our Markovian coarse-graining have equal volume), then we can omit π and recover the usual Shannon entropy.

Now, if we have μ as the initial distribution of X, which evolves to Pμ under our transition probabilities, then the stochastic entropy production that occurs when we transition from the state x to y is 

Δ^Hπ:=^Hπ(y,Pμ)−^Hπ(x,μ)

We will attempt to prove that its expectation, the Gibbs-Shannon entropy production, is nonnegative:

Actually, the derivation does not require stationarity (Pπ=π). In the general case, we define stochastic entropy production by Δ^Hπ:=^HPπ(y,Pμ)−^Hπ(x,μ), and the dual probabilities by ~P(x,y):=P(y,x)π(x)Pπ(y)⟹~P(x,y)P(y,x)=π(x)Pπ(y)

Given these,  we have:

               ⟨2−Δ^Hπ∣X⟩=⟨2^Hπ(X,μ)−^HPπ(Y,Pμ)∣X⟩=⟨2log(π(X)/μ(X))2log(Pπ(Y)/Pμ(Y))∣X⟩=⟨π(X)Pμ(Y)Pπ(Y)μ(X)∣X⟩=⟨~P(X,Y)Pμ(Y)P(Y,X)μ(X)∣X⟩=∑y∈YP(y,X)~P(X,y)Pμ(y)P(y,X)μ(X)=∑y∈Y~P(X,y)Pμ(y)μ(X)=~PPμ(X)μ(X)

Where ⟨⋅|X⟩ denotes conditional expectation given the random variable X.

Then, by the law of total expectation, we have:
 ⟨2−ΔHπ⟩X∼μ=⟨⟨2−ΔHπ∣X⟩⟩X∼μ=⟨~PPμ(X)μ(X)⟩X∼μ=∑x∈X~PPμ(x)=1.

 

Applying Jensen's inequality to the above equality, we conclude that the Gibbs- Shannon entropy production is non-negative:

ΔHπ:=HPπ(Pμ)−Hπ(μ)=⟨Δ^Hπ⟩≥0.

 

In summary, the reversible, deterministic laws of physics at the microscopic level permit a Markovian coarse-graining into macrostates, where transition probabilities preserve the underlying reversibility through the dual probability relationship. From this Markovian structure, we can derive the second law of thermodynamics in the form of nonnegative entropy production.

Given the reversibility of physics and the second law of thermodynamics, we cannot achieve global entropy reduction. Instead, optimization can only reduce the entropy of a subsystem, and we must ensure that total entropy does not decrease while the underlying microscopic dynamics remain reversible. This constraint changes our "convergent attractor" picture of optimization: while an optimizing agent can funnel a subsystem from many initial configurations toward a narrow set of target states, it can only do so by carefully managing the entropy balance with its environment, ensuring that the information squeezed out of the optimized subsystem is properly accounted for in the larger system.

A typology of optimization under information conservation

Given the constraints imposed by reversibility and the second law of thermodynamics, can we classify all the ways an optimizing agent can maintain these requirements while reducing the entropy of a subsystem? That is, how can an agent achieve the "convergent attractor" behavior of optimization while ensuring that global entropy production remains non-negative?

Suppose that we decompose our universe into the agent A, the subsystem S that the agent is trying to optimize, and the rest of the environment E. Optimization results in entropy reduction in the subsystem S, and we must make sure that the global entropy H(A,S,E) doesn't decrease. There are three ways this can happen:

  • Type 1: The agent can dump waste heat into the environment during the process of optimization, counteracting the entropy reduction of the subsystem
  • Type 2: The agent can reduce the entropy of the subsystem through "measurement", and transfer the entropy of the subsystem into the agent's own memory.
    • This is basically how Maxwell's demons work: In the classic setup for Maxwell's demon, we have a room of ideal gas divided by a partition. This partition has a small door that can be controlled to allow or block gas molecules from moving between the two sections. In principle, a "demon" can open this door each time a gas particle approaches from the left, and to close it each time a molecule approaches from the right. Eventually all the gas particles would all be on the right side of the room, resulting in entropy reduction as we would have less uncertainty about the positions of the gas particles.
    • The reversibility of physical laws means the demon must retain information about where each particle originally started. To see this, suppose you observe a particle on the right side. Two possible pasts could have led to this—either the particle began on the right and stayed there, or it began on the left and the demon let it through. Since these two histories produce identical final arrangements of gas particles, you can't tell them apart just by observing the gas particles alone. Yet reversibility demands we could theoretically reconstruct the past from the present state. This is only possible if the demon's memory preserves a record of each particle's "past". Consequently, while the gas's entropy appears to decrease, the entropy of the demon's memory increases by at least as much, preventing reduction of global entropy.
  • Type 3: We've seen how to counteract entropy reduction of a subsystem by increasing entropy of either the environment or the agent. Is it possible to reduce entropy of the subsystem without increasing entropy elsewhere? Yep! Specifically, the agent can leverage its mutual information with the subsystem to perform optimization without entropy production. Consider the following setup[2]:
    •  Suppose that the agent and the subsystem have some mutual information I(St;At) with each other.
    • For simplicity, assume that the environment is independent of both the agent and the subsystem, which means that the joint entropy H(St,At,Et) at time t decomposes to H(At)+H(St)−I(St;At)+H(Et).
    • We "erase" the copy of mutual information inside St without making any changes to the agent or the environment. By erasing the mutual information in the subsystem, we have: 

      H(St+1)=H(St)−I(St;At) and I(St+1;At+1)=0. So there is entropy reduction in the subsystem.

      The joint entropy becomes: 

      H(At+1)+H(St+1)+H(Et+1)=H(At)+H(St)−I(St;At)+H(Et)       

      which is exactly the same as before. The mutual information between the agent and the subsystem allows us to reduce entropy of the subsystem without increasing entropy in either the agent or the environment.

    • Another way to interpret this is to reconsider the setup of Maxwell's demon: The Maxwell's demon performs the measurement and reduces entropy of the room in a single step, but alternatively we can break down the interaction so it "copies" information about the gas particles first, and then uses that information to control the door & reduce entropy. In this second step, the demon would be leveraging the mutual information between the room and itself to reduce entropy of the room, and it can do so without (further) increasing entropy in its own memory.

To recap, there are three ways that an embedded agent can optimize a subsystem under information conservation:

  • Optimize a subsystem while dumping waste heat into the environment
  • Optimize a subsystem through "measurement", absorbing its complexity into the agent's memory
  • Leverage existing mutual information to optimize a subsystem by "erasing" the copy of mutual information in the subsystem. 
Entropy should be objective

Entropy is supposed to capture "the capacity to do work", which is really the same as the capacity to perform optimization. Consider again the setup of Maxwell's demon: The demon stores information about the gas particle configurations inside its own memory, thereby increasing the entropy of its memory. If the demon's memory has a finite state space, then the difference between the size of the demon's memory and the memory's existing entropy represents the demon's remaining capacity to do work, as that's the amount of "free memory" that can be used to store further information about the room of ideal gas. A greater amount of entropy leads to a smaller amount of "free memory" available for optimization. In particular, the amount of free memory should be an objective feature of the memory state itself, since it's supposed to represent the demon's objective capacity to perform optimization on the room of ideal gas.

Details

To formalize this more rigorously, suppose that the state space of the demon's memory is Bm (B:={0,1}), the demon may store information as a binary string inside its memory, and we always mark the end of the binary string by 1 to separate it from the "free memory".

In other words, the demon's memory contains a string of the form s10n∈Bm, where s is a binary string and n=m−|s|−1. We can think of n as representing the size of "free memory" that can be used to store information and therefore perform optimization, whereas |s| represents the amount of used memory. In particular, for each k, we group all strings of the form s10n into the same macrostate, labeled by k:=|s|=m−n−1.

Now, if the stationary measure assigns equal mass to all strings in Bm, then the entropy of the macrostate k will be exactly k, since there are exactly k random bits given this macrostate (the other m−k−1 bits taken up by 10n are deterministic).

As a result, when a demon stores new information inside its memory, it will have to use up more memory and therefore increase k. Entropy production occurs because we're "forgetting" the exact microstate the memory is in when we condition on the macrovariable.

However, this intuitive objectivity comes into conflict with existing definitions of entropy, as they rely on a subjective distribution. For instance, in stochastic thermodynamics, the stochastic entropy (or Shannon codelength) of a state x given a distribution μ is defined as ^H(x,μ)=−logμ(x) when the stationary measure is the counting measure, with our familiar Gibbs-Shannon entropy defined as the expectation of the Shannon codelength. In particular, the entropy of an individual state is undefined unless we also specify the distribution μ. To translate this to the setting of Maxwell's demon, we would have to say that the demon's "capacity to do work" somehow depends on our subjective distribution over the memory states of the demon, which raises puzzling questions such as: "If our subjective distribution over the memory states has changed, does that affect the demon's actual capacity to do work?" It seems obvious that demon's physical ability to perform useful work on the gas particles should not depend on an external observer's beliefs about its memory configuration. Yet standard formulations of entropy seem to make this objective physical capacity fundamentally observer-dependent.

In contrast with the subjectivity in traditional definitions of entropy, algorithmic thermodynamics defines the information content of a physical state as an objective feature of that state. Similar to stochastic thermodynamics, it relies on a Markovian coarse-graining of the system with coarse-grained state space X. However, instead of defining a subjective distribution μ over X, it simply assigns a binary string encoding to each coarse-grained state x∈X. Under the counting measure, the algorithmic entropy of a state x reduces to the Kolmogorov complexity of its binary encoding K(x), which measures the length of the shortest program that can generate that encoding. This quantity represents a system's actual capacity to do work from an individual state; while algorithmic entropy depends on a choice of universal computer, it has been argued that the effect of this dependence is small for all realistic pairs of computers we might want to compare[3], allowing us to treat algorithmic entropy as a state function.

In algorithmic thermodynamics, we also have a version of the second law of thermodynamics that originates from Levin's principle of randomness conservation, giving probabilistic guarantees on the nondecrease of algorithmic entropy. Additionally, it offers a new interpretation of Gibbs-Shannon entropy: while algorithmic entropy captures the system's actual capacity to do work given an individual state, the Gibbs-Shannon entropy represents the expected capacity to do work when our state is sampled according to a prior distribution μ,  and when that distribution is known a priori.

When we analyze Maxwell's demon under the lens of algorithmic thermodynamics, we find that if the content of the demon's memory is somehow compressible, then the demon can in principle leverage universal computation to compress that content, thereby freeing up more available capacity to store information and perform optimization. On the other hand, if the content of its memory were incompressible, then the unused memory represents the demon's actual remaining capacity to do work, which is objective because there is no way to increase available memory without physically destroying information. This capacity does not depend on our subjective distribution over the demon's memory states; the compressibility of the memory state is an intrinsic property of the physical configuration itself, making the demon's optimization capacity observer-independent.

Implications of algorithmic thermodynamics on optimization

What are the implications for our understanding of optimization now that we've refined our understanding of entropy through algorithmic thermodynamics? How does this foundation change our analysis of the three types of optimization we identified earlier?

  • For the first type of optimization where we're dumping waste heat into the environment, algorithmic thermodynamics tells us that if the system's initial state has any compressible structure or patterns, then we can use universal computation to find a compressed representation of that waste heat before releasing it into the environment. This compression process reduces the amount of entropy we ultimately produce in the environment. (Note that this compression must happen before the waste heat mixes with and thermalizes in the environment, as thermalization destroys the compressible structure.)
  • For the second type of optimization, where the agent optimizes a subsystem through measurement, we've already seen how this works with Maxwell's demon: if the configuration of gas particles turns out to be compressible, the demon can leverage this compression to reduce the memory required for a given amount of optimization. Importantly, there is a dual relationship here—while the algorithmic entropy of the demon's memory measures the demon's remaining capacity to do work, the algorithmic entropy of the ideal gas configuration represents the minimum resources required to perform that work. 
  • For the third type of optimization where we leverage mutual information to optimize a subsystem, algorithmic thermodynamics tells us that the relevant notion is algorithmic mutual information. Importantly, algorithmic mutual information is an objective feature derivable from the physical states of the agent and subsystem: We need only consider how much the shortest description length of one system changes when the other is given, rather than whether they are "correlated" under some subjective distribution. In the Maxwell's demon scenario, if the room's gas configuration has low algorithmic entropy, the demon can use a short program to compute that configuration, thereby leveraging algorithmic mutual information to erase entropy in the room. The duality we observed earlier appears again here: the Kolmogorov complexity of the room's gas configuration both represents the minimal memory required to store measurements during Type 2 optimization, and represents the minimal amount of knowledge the demon needs to clear the room's entropy during Type 3 optimization. 
  • More broadly, in agent foundations we often model agents as having subjective beliefs about the distribution of their environment; algorithmic thermodynamics reminds us that these beliefs must be physically encoded within the agent somehow, and this physical encoding has a definite amount of algorithmic mutual information with the environment, which objectively determines the agent's capacity for optimization. This also clarifies the apparent subjectivity in traditional definitions of entropy: The subjectivity seemed problematic because when our subjective distribution over Maxwell's demon's memory states changes, it appears as though the demon's actual capacity to perform optimization also changes. But when we update our subjective distribution, that update must be physically represented in our own minds—we have gained algorithmic mutual information with the demon's memory state. This mutual information is real and can actually be leveraged to help free up the demon's memory (via Type 3 optimization), which is why the demon seems to have more capacity to do work from our informed, embedded perspective.
Takeaways
  • Due to the second law and reversibility of physics, we cannot have reduction in global entropy, which means an agent can only perform optimization on a subsystem while total entropy production remains non-negative.
  • There are three ways this can happen: An agent can reduce entropy of a subsystem by expelling waste heat in the environment, absorbing entropy into its memory, or leveraging mutual information with the subsystem for optimization.
  • Entropy is supposed to measure a system's capacity to perform optimization. Algorithmic entropy makes this an objective property of the individual state.
  • Under this reframing of entropy, algorithmic thermodynamics allows us to leverage universal computation to compress the waste heat produced in Type 1 optimization, and save the memory required in Type 2 optimization as well as the bits of knowledge required in Type 3 optimization.
  1. ^

    Theorem 1 in Aram Ebtekar and Marcus Hutter. Foundations of algorithmic thermodynamics. arXiv preprint arXiv:2308.06927, 2024.

  2. ^

    Engines of cognition explains the same idea

  3. ^

    Page 5 of Aram Ebtekar and Marcus Hutter. Foundations of algorithmic thermodynamics. arXiv preprint arXiv:2308.06927, 2024.



Discuss

Little Echo

4 часа 45 минут назад
Published on December 8, 2025 3:30 PM GMT

I believe that we will win.

An echo of an old ad for the 2014 US men’s World Cup team. It did not win.

I was in Berkeley for the 2025 Secular Solstice. We gather to sing and to reflect.

The night’s theme was the opposite: ‘I don’t think we’re going to make it.’

As in: Sufficiently advanced AI is coming. We don’t know exactly when, or what form it will take, but it is probably coming. When it does, we, humanity, probably won’t make it. It’s a live question. Could easily go either way. We are not resigned to it. There’s so much to be done that can tilt the odds. But we’re not the favorite.

Raymond Arnold, who ran the event, believes that. I believe that.

Yet in the middle of the event, the echo was there. Defiant.

I believe that we will win.

There is a recording of the event. I highly encourage you to set aside three hours at some point in December, to listen, and to participate and sing along. Be earnest.

If you don’t believe it, I encourage this all the more. If you don’t understand the mindset, or the culture behind it, or consider it an opponent or dislike it, and especially if yours is a different fight? I encourage this all the more than that. You can also attend New York’s Solstice on the 20th.

You will sing songs you know, and songs you don’t. You will hear tales of struggles, of facing impossible odds or unbearable loss and fighting anyway, of how to face it all and hopefully stay sane. To have the end, if it happens, find us doing well.

I live a wonderful life.

I am crying as I write this. But when I am done, I will open a different Chrome window. I will spend the day with friends I love dearly and watching football games. This evening my wife and I will attend a not wedding of two of them, that is totally a wedding. We will fly home to our wonderful kids, and enjoy endless wonders greater than any king in the beating heart of the world. I want for nothing other than time.

Almost every day, I will mostly reject those wonders. I will instead return to my computer. I will confront waves of events and information. The avalanche will accelerate. Release after release, argument after argument, policies, papers, events, one battle after another. People will be determined to handle events with less dignity than one could imagine, despite having read this sentence. I fight to not be driven into rages. I will triage. I will process. I will change my mind. I will try to explain, just one more time. I will move pieces around multiple chessboards.

We continue. Don’t tell me to stop. Someone has to, and no one else will.

I know if I ignored it, anything else would soon turn to ash in my mouth.

I will look at events, and say to myself as I see the moves unfolding, the consequences of choices I made or influenced, for good and ill: This is the world we made.

It aint over till its over. Never leave a ballgame early. Leave it all on the field, for when the dust covers the sun and all you hope for is undone. You play to win the game.

The odds are against us and the situation is grim. By default, we lose. I act accordingly, and employ some of the unteachable methods of sanity and the mirror version of others, all of which are indeed unteachable but do totally work.

Yet the echo is there. In my head. It doesn’t care.

I believe that we will win.



Discuss

Scaling what used not to scale

11 часов 35 минут назад
Published on December 8, 2025 8:40 AM GMT

This is part 2/2 of my introduction to Live Theory, where I try to distil Sahil’s vision for a new way to scale intellectual progress without systematic thinking. You can read the part one here.

In the first part, I described systematic thinking as the most successful coordination tool of our civilization. It’s the infrastructure that allowed billions of brains to work as a collective brain, solving problems no society could solve before. I finished by introducing Live Theory as a successor to systematic thinking. This is a big claim! In this part, I will finally unpack (my best guess of) what Live Theory is, and how it can claim to go beyond systematic thinking, fixing its inability to deal with context-dependent problems.

The three-picture summary.

Pre-systematic thinking is playdough. Examples of outputs: Poetry coming from your heart, a tailor-made shirt made by your grandma.

Pre-systematic thoughts are fluid and include verbal and non-verbal modalities. They can bend and wrap around the real-world objects they encounter. The conceptual and physical artefacts it produces are wiggly; they cannot easily interface with one another.

Systematic thinking is a brick mould for thought. Examples of outputs: physics equations, software bundled and distributed to billions of users, H&M pants with different sizes that can somewhat work for everyone, but are not made for a specific body shape.

The thoughts are unfolding within a system, a set of rules describing allowed and non-allowed moves. Systems act as moulds for thoughts. They turn the fluid content of the mind into bricks that can be stacked on top of each other. This brought two major benefits:

  1. Stacking thoughts within one’s mind to build deeper, more complex thoughts.
  2. Stacking thoughts across people, so you can use the conclusions found by strangers who use the same formal system as you do.

The Live Theory vision is to use abundant fluid intelligence as a mortar that adapts to intellectual contributions of all shapes and connects them. Example of outputs. Well, the notion of “output” doesn’t make much sense anymore; more on that later. But an aspirational example could be: a well-chosen piece of poetry and math equations being used as input by an AI infrastructure to write an algorithm to solve a market-matching problem.

In this vision, AI is used to create an infrastructure that adapts to the shape of the individual contribution. The infrastructure can translate insights from systematic, pre-systematic or even post-systematic sources, to tailor them to new contexts.

Short recap from Part I.

Pre-systematic artefacts are impregnated by the context in which they are forged: the specific tools used, the specific know-how found by a master. It takes direct mentorship or apprenticeship to convey, which slows intellectual progress.

The stackable thoughts from systematic thinking are the innovation that connected millions of human brains into a single high-bandwidth cognitive system, where results found at one side of the world could be effortlessly reused by everyone. This is the cognitive infrastructure that is behind all of our modern institutions: financial systems, globalized markets, scientific method, legal systems, etc.

However, the thoughts can only fit the world to the extent that the system is a good proxy for the world. And because the system is meant to scale to the whole world, its solutions fit moderately well for everyone. Even with tunable parameters, systems are too rigid to truly adapt to the different application contexts.

The scale from systems was so successful that we became addicted to it. To keep it going, we shaped the world to remove context-specific idiosyncrasies so that systematic solutions would apply better across contexts. It’s like doing plastic surgery to change your body shape so the H&M pants would fit you perfectly.

However, the systematic infrastructure is fundamentally ill-suited to problems that are context-sensitive. No big deal, these are just the most important problems of our time, like aligning technological development to human flourishing. By default, the development of AI will only amplify our systematic machine, making these problems worse, like forcing a square into a round-shaped hole.

Some context on Live Theory.

A note on terminology. I use “Live Theory” (capitalized) for the name of the whole vision, shortened to “the live vision” for brevity. It is a hypothetical world in which our infrastructure and artefacts are “fluid” or “live”. I say “a live theory”, to talk about a new kind of artefact that replaces the classic “theory” from the systematic world.

What is it not? Live Theory is not a new “framework”, “protocol” or “method” to coordinate at scale. That’s the whole point! Otherwise, this would be designing a good old global system, exactly the thing we want to go beyond. At the moment, Live Theory barely exist. It is a quiet whisper from the future, suggesting that the near-term abundant fluid intelligence might enable a radically different way to scale intellectual progress.

It is not “add an AI chatbot to X”. Even if the AI infrastructure play an important role in the story, whenever you see AI, think in your head “Google Translate, but for everything: [image, video, audio, code, math, text in all languages] to [image, video, audio, code, math, text in all languages]” and not “ChatGPT”.

Who is working on it? This blurry vision is slowly becoming clearer through the work of Sahil, its initiator, and multiple co-thinkers and collaborators working to build prototypes and clarify concepts through debate. You can learn more about their research on the groundless website.

The bottleneck of this work at the moment is imagination. To imagine something outside the systematic frame requires unlearning the mental muscle memory from decades of systematic thinking. Many projects from the groundless team take the form of interface prototypes that aim to concretely embed this new paradigm for compounding intellectual contributions.

Post-rigorous mathematics

If systematic thinking is so sticky that the hardest part of building the vision is getting out of it, it is a good idea to start by looking at today’s examples of things that are not systematic thinking.

We can find such examples in people working close to the systematic fire: mathematicians. Their job is to produce papers filled with symbols, formulas and proofs. However, professional mathematicians rarely think in symbols. Their language is rich in intuitive statements like “high-dimensional spheres are spiky“. But when they need to try an intuition for real, they can effortlessly translate these thoughts into rigorous math.

I once had the chance to attend a lecture given by the Fields Medallist Jean Paul Serre. I didn’t understand much of the content, but I remember vividly how he effortlessly sprinkled equations on the whiteboard, filling the in-between with imagined descriptions, what you would call now “vibes”. He looked like a blacksmith simply laying his sharpest tools on the workstation for display, trying to convey, with imperfect words, the embodied know-how living in his hands.

This stage of reasoning is called post-rigorous (that I will refer to here as post-systematic thinking). The rigour serves as the foundation on which the inner world of a mathematician grows, rich in intuitions, imagery, and, believe it or not, emotions. Over years of practice, from encountering problems in different fields, mathematicians create their own language of thoughts that pre-exists the rigorous proofs.

Even if it constitutes the bulk of the mathematician’s lived experience, this world rarely gets shared. The squishy, subjective experience doesn’t belong in conference papers. A rigorous proof is the final product of a mathematician’s work, even if post-rigour is what produced it.

The vast world of mathematical intuitions developed through years of practice is shared in the old way: through apprenticeship from PhD supervisors to PhD students.​

From the live theory perspective, this is an immense missed opportunity. If only we could distribute these insights as widely as a mathematical proof!

A translation infrastructure built from abundant fluid intelligence.

In the past decade, we created computer programs that can interact with natural language, create music, video and images. Where software of the past was restricted to the automatic application of systematic rules, we now have software that demonstrates a fluidity that normally characterizes pre- or post-systematic thinking.

This new kind of software reduces the cost of certain fluid tasks by orders of magnitude, such as high-quality translation or solving programming problems. The hypothesis of live vision is that AI will continue to develop in a manner similar to that of computers or the Internet. It will become a more reliable, cheaper and faster version of the multimodal LLM we have today, but without the ability to generate revolutionary results.​

For the live vision, the killer application of AI will not be the generation of new ideas from scratch; it will be interpretative labour, translating insights from one context to another. The idea is to use this abundant fluid intelligence as a mortar between intellectual contributions, whatever their shape. In systematic thinking, your intellectual contribution has value only if it fits the brick mould used by everyone else. In a live world, your intellectual contribution has value, full stop. No matter if your insight is expressed as poetry, equations, a drawing, a book, or an audio recording of a conversation, the content can be cheaply translated to whatever shape would be useful for the consumer at the last minute.

Instead of forcing the output of intellectual labour to fit within the systematic infrastructure, a fluid infrastructure adapts to the shape of the intellectual contributions to distribute them.

Depth matters. This is not to say that any squiggle on a whiteboard can be turned into the equations of general relativity. The squiggle was simply an arbitrary gesture I made with a pen. It contains insights about my handwriting, the texture of the whiteboard, maybe how agitated I was when I drew it or what I was thinking. Even with all the translations in the world, the squiggle’s potential reach is limited to its immediate context. At most, I could use it to understand my emotional state at a specific moment.

To contrast, take the squiggles in the notebook of Einstein before he finalized the equation of relativity. There is also a lot of context dependency in these handwritten notes. He probably used a ton of ad hoc abbreviations. But there is more to it. The notes contain a depth of understanding of physics that will eventually lead to the most important equations of modern physics. It is fair to say that a big part of the insights was likely already present well before the final equations were written. But from the systematic point of view, the values of the notes are almost zero until they contain usable formulas. However, the potential reach of Einstein’s notebook after interpretation is enormous. They have the potential to change the worldview of thousands of physicists and explain observations from the other side of the universe.

The depth of the intellectual labour matters, even if the fluid infrastructure can adapt to its form. Systematic thinking will keep playing an important role in a live world, because systems are wonderful tools to build complex thoughts. But at last, the depths of insights coming from mathematical equations would be able to interact with the depth coming from other domains, from physics, biology, philosophy, or poesy, or the post-systematic insights from my lecture from Jean Paul Serre!​

Fluid book generation. To concretise these ideas, let’s look at what writing and reading a book could look like in a live world. (Example taken from this table of live examples.)

The writer writes not a book, but a book-prompt that carries intuitions for the insights. An economics textbook-prompt, for example, could contain pointers to the results and ideas but the language, examples, style would combine with readers’ backgrounds (and specific wishes) to turn into independent textbooks for each reader. Importantly, this frees up producers of books from having to homogenize the range of audiences they might want to speak to.

This is a bit like changing the font and the colour of an e-book, but for content. Specific wishes the readers could have (such as “I’m looking for something that would make me laugh”) are things that could be appended to the original book-prompt. They don’t have to be independent books either, and can be _inter_dependent. Commentary on your own book can be translated to someone else’s book.

Value moves upwards. As mentioned in the intro, with a fluid infrastructure, the distinction between input and output blurs. Before you had a book, a static output from the writing process, which is the input to the reading process. In a live world, the final book the reader sees is co-created by her own wishes, background information (maybe as a list of books she knows really well), and the writer’s book-prompt.​

The general tendency is that potentials become first-class citizens, like the book-prompt that is a potential for a book. In our world, the execution work that goes into turning a draft into a polished book or a drawing into software is expensive. But in a live world, this work is as cheap as water or electricity. So the value moved upstream: the draft itself becomes valuable, and even before the draft, the recording of the conversations in which the ideas first appear becomes valuable.

To be extra clear, this doesn’t mean AI slop generated from vague prompts. Depth and mastery from any craft will be more valued than ever. But the artefacts people produce with this depth will change.

To recap, fluid intelligence might make the basis for a fluid infrastructure that could scale intellectual progress without forcing the shape of the outputs. We left the part I with all sorts of problems stemming from the inability of systematic thinking to adapt to context, such as aligning technological development to human flourishing. How does this vision for a fluid infrastructure address these?

The system-world fit is the Achilles’ heel of systematic thinking.

The rigidity of institutions based on systematic thinking comes from an imperfect fit between the system they use and the world they interact with. The circle arrow in the diagram below is the Achilles’ heel of systematic thinking.

​No matter how much work is put into producing new results within a system, the body of knowledge produced is bounded by how well the system fits the world. In this sense, systems are static; they cannot dynamically evolve to keep pace with the changing form of the world.

Lock-in effect from globalized systems. Systems are often seen as non-overlapping spaces. An algorithm is either written in Rust or in Python. It cannot be in both at the same time, like how an apple cannot be in my house and in my neighbour’s house at the same time.

Imagine you want to change the fit system-world to better match your specific context. Maybe you want to design a new programming language adapted to your new hardware, or a new legal system adapted to your country. Then, you have to start from scratch. You will need a lot of work to translate the insights from other systems into your new system.

This creates a centralization dynamic where the biggest systems are likely to grow even bigger. It is the network effect well known by the giant platforms of our digital world. You need to install this new app to chat with your friends. You’d like to use your niche app with this great interface that fits your habits, but you’d have no one to talk to, so you bite the bullet and install the new app.

This means that while we can easily iterate within a system to find a better solution, we cannot iterate on the system-world fit.

There is no market to improve the system-world fit. Because designing new systems is so expensive, the fit of our system doesn’t improve over time. They keep producing these globally distributed solutions that fit moderately well. The solutions don’t adapt to anyone’s context unless you change your context to fit the solution.

From the live perspective, the root of many risks arising from technological development is this system lock-in and the poor, static system-world fit they impose globally.​

To solve these problems, the solution needs to adapt to their context. No matter how many parameters it contains, no systematic framework can be used to align technology, or AI, with humanity flourishing. Moreover, as technology evolves, what “alignment” means needs to be continuously renegotiated as quickly as the underlying technology evolves. In the future, the rate of change in the underlying reality might be so rapid that the fit between the system and the solution could break down rapidly. It would be impossible for systematic thinking to keep up by designing new systems on the fly.

Diffuse concepts for a diffuse response to diffuse risks.

The live vision answer is to leverage the fluid infrastructure to create live theories, the equivalent of systematic theory, but that incorporate post-systematic insights. The fluid infrastructure would continuously renegotiate the fit between the live theory and the world as the underlying reality changes.

Intellectual labour would stop being exchanged on disjoint markets with solutions valid only within a system. It would be distributed through a peer-to-peer network for insight sharing, where new theory-world fits could be developed in one place and applied in another context.​

Here is a speculative story to illustrate the argument. Imagine you are a well-intentioned developer working on a new AI assistant that has been caught lying to its users. The old way would be to follow a recipe from the scientific literature, change the way the assistant is trained, and distribute the new version to all users. A more live way to do it would be to start by gathering sources from philosophy on deception, technical machine learning resources, and maybe even sources from biology on deceptive mimicry. You would then combine these sources and add your own work to make an “update prompt”. The update prompt will adapt to each user’s custom environment and continuously monitor and influence the AI assistant to steer it based on both your update prompt and the user’s preferences.

The vision is to diffuse the concepts (produced from intellectual labour), to diffuse the response (the context-specific application of the concepts), and to diffuse the risks (from technological development). This is the one-line vision you can find on the groundless.ai website.

Where is the theory-world fit created?

If the theory-world fit is the undersupplied ingredient that live theory’s peer-to-peer network tries to distribute, where does it come from? Why can’t AI automate the creation of this magical fit?​

A full answer is well outside the scope of this post, but I find it important to offer some response, as they are load-bearing hypotheses for the live theory vision.

​It is unclear how exactly humans come to develop subjective experiences that fit (for lack of a better word) the world, but this might have to do with their embodiment.​

Humans constantly care for their survival at all scales, from fighting pathogens to creating social bonds to build mutual aid networks. This sort of “skin in the game” might be an important factor in how this fit gets developed.

​In the medium term, AI might struggle to generate insights that are well-fitted and deeply relevant to the world because of its lack of embodiment. And, no, robots filled with sensors will not be enough. Biological beings implement a form of integrated, fractal homeostasis, from psychological functions to mitochondria, that simple mechanical bodies cannot naively replicate. It’s not that it’s impossible to replicate artificially, but the current tech lacks this multi-scale integration.

Conclusion.

The Live Theory vision forms a large puzzle made of pieces that seem unrelated but are in fact supporting the weight of each other. It has been a difficult task to carve out a coherent story from these pieces. My central goal was to describe the function Live Theory aims at filling and how it differs from its predecessor, systematic thinking. Here are a few topics I left aside:​

  • A different view on catastrophic AI risks: “AI will fail at self-preservation, not because they will fail at preservation but because they will fail at self”. Maintaining the integrity of a porous boundary around the “self” is a very hard challenge. See more here.
  • Why is most of the concrete work of the live theory folk focused on interface design?
  • In a world with fluid infrastructure, what remains rigid?
  • Why is the root of the risks from technological development the lack of context-sensitivity of systems?
  • What are the dangers of an intelligence scarcity mindset (from systematic thinking) applied in an abundant world?

If you are interested in learning more, I would recommend the Live Theory LessWrong sequence for a deep dive, and these three intro documents written for the AI Safety Camp as a shorter introduction.​



Discuss

The effectiveness of systematic thinking

11 часов 38 минут назад
Published on December 8, 2025 8:38 AM GMT

​This is part 1/2 of my introduction to Live Theory, where I try to distill Sahil’s vision for a new way to scale intellectual progress without systematic thinking. You can find part 2 here.

This post is the first part of a two-part series introducing the vision for Live theory, a design paradigm that leverages AI to build a decentralized, context-sensitive infrastructure. In this first part, I start by describing the unreasonable efficiency and the limitations of systematic thinking, the tool that live theory aims to replace.

The birth of systematic thinking.

Before the scientific revolution, intellectual progress could not scale well. People would have ideas about the world, make hypotheses and test them, gathering insight along the way. This would happen organically, without a deliberate understanding of the process itself. This is how craftsmen would learn to make stronger metals, painters would discover new pigments, and farmers would tweak their methods to be more effective. However, the main way to share this accumulated tacit knowledge was through apprenticeship, by an extended period of training with someone.

Pre-systematic thinking. Fluid thoughts (orange) that cannot compound, but directly enter in contact with the world.

During the scientific revolution, we broke out of this trend by adopting formal, systematic thinking. Formal systems created bricks for thoughts. They filled two remarkable functions:

  1. Stacking thoughts on top of each other to build deeper, more complex thoughts in an individual’s mind.
  2. Stacking thoughts across people, so you can reuse the solutions from strangers who use the same formal system.
Systematic thinking. Thoughts can compound, going deeper. However, the thoughts don’t directly touch reality. The relationship with the world happens with the system, not the thoughts themselves.

Lens for the thoughts.

Formal systems are sets of rules that constrain what counts as an allowed move or not, just like the rules of a board game like chess. But instead of defining rules for moving the piece on the board, a formal system gives rules for manipulating symbols, like when writing equations.

The rules of the formal system are designed to enforce consistency constraints that make the thoughts more likely to be true. For instance, the sequence of moves “0 = 1” is not allowed, as you cannot have at the same time zero apples and one apple in a bag.

This means that when using the formal system, you don’t have to think about the fit between reality and thoughts anymore. Following the rules became a proxy for what constitutes a “true” thought, a thought that matches reality. This removes the need for costly, noisy experiments. By using the system, your thoughts start to morph to follow its allowed moves. Like an optical lens, it focuses your thinking power in a tiny subregion of possible thoughts, allowing you to craft complex thoughts that would never have been possible before.

This is not to say that all valid moves are useful. Like in chess, most valid moves are unlikely to make you win. There is room to develop skills and taste to navigate towards the interesting parts of the system, the ones likely to establish non-trivial results.

Going collective

The killer application of formal systems is collective: no need to spend time as an apprentice to share insights anymore. The conclusions from the system, such as physics equations, are context-independent. They can be used without needing to know how they are produced. You can reuse a proof of a theorem and apply it to your own mathematical problem without needing to know the history or the author of the proof.

Going meta

The scientific method created a meta-system, a formal system providing rules for the game of “producing systems that match a part of the world”. It described what counts as an allowed experiment and a valid interpretation of experimental results. In its general form, it is not very usable, so fields like physics and biology developed their own formal standards to define what counts as a valid new unit of knowledge to be added to the literature. Despite dealing with experimental data, the scientific methods provided the same benefit as mathematics. As long as a paper meets the field’s systematic criteria, it can be trusted and its results reapplied in new contexts.

This methodology worked really well. Like really, really well. Thousands upon thousands of scientists around the world had the tool to develop deeper thoughts and share their progress with one another simply by sending letters. Formal systems formed the information highway that connected individual scientists and engineers from across the globe into a collective giant brain of ruthless efficiency.

All this knowledge production was put to work to solve real-world problems through engineering, using the insights from science to produce machines that solve problems for citizens. The systematic intellectual work is bundled into context-independent artefacts that enable efficient communication between knowledge producers and consumers at all levels. The engineer doesn’t have to know where the physics equations come from to use them in their blueprint, the worker in a factory doesn’t need to know where the blueprint comes from to assemble the pieces, and the end user doesn’t have to know how the machine has been produced to solve its problem.

Globalized capitalism was the system that organised the worldwide distribution of scientific knowledge. Like the scientific method, it connected millions of actors into a single collective brain. However, the information highway, in its case, did not happen through direct sharing of solutions (as these are competitive advantages); it was the sharing of value information. Capitalism provided money as a unified proxy for determining which company is successful at solving its customers’ problems. It would steer capital investment towards the most successful actors, allowing them to grow further and apply their good solutions more broadly. It created a continuous worldwide Olympics game in the discipline of “solving problems people are willing to pay for”.

The map bends the territory

At the start, around the time of Newton, systematic thinking worked because of the unreasonable effectiveness of mathematics to describe the physical world. It turns out that many phenomena in the physical world, from planetary motion to heat transfer to quantum physics, can be described very well using equations.

But over time, the incredible success of systems spread far beyond maths and physics. They got applied to the social world with game theory, to organisations with Taylorism, or to agriculture. Instead of purely describing the world, we started morphing it to fit the systems. After all, this is a good strategy to make use of this abundant knowledge machine. When the system’s conclusions don’t apply to the real world because the fit between the system and the world is poor, we adjust the setup so the world better aligns with the system. This way, we gained enormous control over the physical and social worlds. This is what led to the modern financial system, monoculture, and assembly lines.

The blind spot of systems.

Remember how I said that systematic thinking worked by concentrating thinking power in a tiny subregion of possible thoughts? Well, this is both its greatest strength and its greatest weakness.

We trained generations of knowledge workers to morph their thought patterns so they would fit the right angles of the systematic maze. By spending so much time in the maze, it becomes hard to think outside of this frame, forgetting that this is a tiny subregion of possible thoughts. In fact, systematic thinking is also blind to all our inner experiences that are not thoughts: body sensations, emotions, intuitions, tacit and embodied knowledge.

More generally, despite its great success in describing the natural world, systematic thinking hit a wall when applied to problems involving living beings, such as organisms, ecosystems, or societies. In these domains, the territory is strongly pulled to fit the systematic map (like in monocultures or economics), as the map is too rigid to adapt to the world.

You cannot describe in abstract terms what health means for an organism or an ecosystem, or what fairness means in a society. It is because in such domains, context matters. There is no abstract theory of health nor fairness to be found. Fostering health or fairness requires solutions that adapt to context with a fluidity impossible to achieve with systematic artefacts, no matter how many parameters they include.

In short, systematic institutions are unable to care for life. They are great at providing abstract knowledge and material comfort, but they cannot be adapted for human flourishing and are ill-suited to address the challenges of our time, such as climate change or the socio-technical problem of AI safety.

Intermediate conclusion

Around the time of the scientific revolution, systematic thinking was designed as an infrastructure to allow scalable intellectual progress. It can be seen as a mould that makes thoughts stackable by providing a set of rules on symbol manipulation that acts as a proxy for truth. It allows individual thinkers to think more complex thoughts and share their results without having to transmit the context where they were developed.

This basic innovation served as the basis for creating a unified, worldwide, distributed cognitive system, in which millions of humans could contribute to solving scientific, engineering, and economic problems.

However, these systematic institutions can only design context-independent solutions. This makes them ill-suited for caring for beings, for which there is no abstract solution.

Transition to part II: adding AI to the picture.

The development of AI would, by default, turbocharge these systematic institutions, amplifying the downstream harm from their inability to care for beings. We need a new way to make intellectual progress scalable that doesn’t rely on systematic thinking and allows for context sensitivity. The cheap, abundant and reliable fluid intelligence stemming from AI might provide the backbone for such an infrastructure. This is what we will explore in Part II, with the introduction of the vision of Live Theory.



Discuss

I said hello and greeted 1,000 people at 5am this morning

16 часов 40 минут назад
Published on December 8, 2025 3:35 AM GMT

At the ass crack of dawn, in the dark and foggy mist, thousands of people converged on my location, some wearing short shorts, others wearing an elf costume and green tights.

I was volunteering at a marathon. The race director told me the day before, “these people have trained for the last 6-12 months for this moment. They’ll be waking up at 3am. For many of them, this is the first marathon they’ve ever run. When they get off the bus at 5am, in the freezing cold, you’ll be the first face they see. Smile, welcome them, make them feel excited, and help them forget the next 26.2 miles of pain they’re about to endure.”

Even though I normally have RBF and consider it a chore to acknowledge people, I slapped a big fat smile on my face and excitedly greeted runners like I was a golden retriever who hasn’t seen his military dad in over a year.

“HELLO!” 

“GOOD MORNING!”

“YOU’RE HERE!”

^That, on repeat for two hours straight. It was actually pretty fun. I calculated the optimal distance to stand from the bus was eight feet away. Stand too close, and the runners were still descending the bus steps and didn’t have time to process my presence. Stand too far away, and they assumed I was just some overly enthusiastic volunteer having a borderline religious experience.

Once I optimized the greeting distance, I observed people’s reactions to me fell into four different categories:

  1. Genuine smilers
    1. All of the runners were cold, sleep-deprived, and nervous. But when they saw me (metaphorically wagging my tail), they reacted like I was a happy dog: their faces lit up; they said hello and gave me a genuine smile.
  2. Thankers
    1. Some runners thanked me for volunteering. I responded with, “If you weren’t here, I’d be excitedly greeting no one and look like a crazy person. So thanks for coming out!” That made them chuckle.
  3. Thin-lipped smile
    1. These runners didn’t say hi back. They acknowledged me with a thin-lipped, polite smile and crow’s feet. 
  4. Other golden retrievers
    1. “GOOD MORNING! YOU’RE HERE!”
      “I AM HERE, YEAH! LET’S GO!!!”
      Like the Kool-Aid man, “OHHH YEAHHHH!!!”
      “YEAHHH!!!”
      “AGHHHHHHHHHHH.”
      “AGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH.”
    2. This was fun. It made other people laugh.

After the marathon for the rest of the day, I found myself smiling and greeting other people I encountered: at a restaurant for lunch, and at the coffee shop where I’m writing this. I smiled at every stranger I made eye contact with, and every one of them smiled back (with one of the four reactions I listed). It just happened naturally, and it felt amazing. I think it made them feel good, too. That’s why we love golden retrievers—they love and greet us unconditionally.



Discuss

Your Digital Footprint Could Make You Unemployable

20 часов 25 минут назад
Published on December 7, 2025 11:50 PM GMT

In China, the government will arrest and torture you for criticizing them:

[In May 2022], Xu Guang…was sentenced to four years in prison...after he demanded that the Chinese government acknowledge the Tiananmen Massacre and held a sign calling for redress at a local police station. Xu was reportedly tortured, shackled, and mistreated while in detention.

In the UK, they don’t have freedom of speech, and up to 30 people a day (meaning 12,000/year) are arrested for the crime of posting or sharing “grossly offensive” content online.

In the US, we have freedom of speech enshrined in law—you can’t be arrested for saying politically inconvenient or offensive speech—but that freedom doesn’t protect you from the decisions of private companies. So if you’re a dickhead online, you can still be prevented from being hired, or lose the job you currently have.

 

This got me thinking…so I asked Claude the following:

Please web scrape information about me, [first & last name]. Based on what you find, please summarize, from an employer's perspective, any red flags that make me unemployable.

Claude tore me to shreds—no mercy. Now I’m going about wiping my entire digital footprint. Am I overreacting? Well…

There’s currently a lot of uncertainty surrounding the advances in AI:

We are starting to see pre-AGI systems shrink analyst classeschange personnel strategies, and trigger layoffs.

So in an increasingly competitive and globalized job market, recruiters may employ (if they're not already) web scraping AI tools that build comprehensive profiles of job applicants in order to find reasons to eliminate them.

The reason this could be tricky for employers (but certainly not prohibited) is because of privacy laws. As of 2023, the California Consumer Privacy Act now requires employers to report which categories of personal information they’re collecting, and it grants job applicants the right to request what was found. But notice:

It doesn’t prevent companies from collecting and using personal data against job applicants. It simply requires the disclosure of collected information upon request.

 

So what are some legal ways companies could discriminate against applicants?

  • Let’s say the CEO of a company is a teetotaler. She could use AI tools to surveil applicants’ online presence (including social media) and eliminate them if they’ve ever posted any images of alcohol, stating: "data collection uncovered drug-use that’s incompatible with the company’s values."
  • With increasing political polarization, it’s important to remember that party affiliation is not a protected characteristic. Employers can absolutely discriminate against you for having specific political beliefs, which many people openly share on their social media.

These two examples presuppose that humans will always be the ones initiating employment discrimination. But with AI increasingly taking an active role in hiring, algorithms could make decisions on their own without humans ever directly being involved.

Continuing the alcohol example, an AI algorithm could examine the damaging effects of alcohol (on both an individual and societal level), and decide that alcohol probably makes people less productive employees. It could then automatically devalue job applicants that have ever posted images displaying alcohol (even if their photo was just from a common social function, like a wedding).

We're moving towards a world where every publication of personal information can become permanent career collateral.

Maybe once I become financially independent (in a few decades), then I can begin posting under my own name. But for now, the risk of having a public online presence seems to be asymmetrically giving power to corporations at the expense of private individuals. Authenticity and personal accountability are becoming just another privilege of the rich.

Privacy law could catch up eventually, but Pandora’s Box has been opened: it’s now easier than ever to surveil and discriminate against job applicants. There’s now a strong incentive to anonymize your online presence in order to protect your economic future in a job market that has just obtained unbelievably advanced surveillance tools.



Discuss

2025 Unofficial LessWrong Census/Survey

22 часа 7 минут назад
Published on December 7, 2025 10:08 PM GMT

The Less Wrong General Census is unofficially here! You can take it at this link.

The kinda-sorta-annual-if-you-really-squint tradition of the Less Wrong Census is once more upon us!

I want to pitch you on taking the survey. First, the demographics data is our best view into who's using the site and who is in the community. I use it when trying to figure out how to help meetups, I add a bunch of questions the LessWrong team asks which I assume they use to improve the site. Second, because I am madly, ravenously curious what works and what doesn't for raising the sanity waterline and making people more rational. I'm even curious what folks think that would mean! Third, if you're reading this and you're thinking "eh, I'm not really a rationalist, more just rationalist adjacent" then you are also of interest; I include questions from several adjacent communities as well. Also, you make a fantastic active control group.

If you want to just spend five minutes answering the basic demographics questions before leaving the rest blank and hitting submit, that's totally appreciated. The survey is structured so the fastest and most generally applicable questions are generally speaking towards the start. (I did add some intermissions that mix this up. If you're impatient, feel free to skip any or all intermissions.) At any point you can scroll to the bottom and hit Submit, though you won't be able to add more answers once you do. 

The survey shall remain open from now until at least January 7th, 2026. I plan to close it sometime on Jan 8th.

I don't work for LessWrong, but I do work on improving rationalist meetups around the world. Once the survey is closed, I plan to play around with the data and write up an analysis post like this one sometime in late January. 

Please take the survey at this link

Ancient tradition is that if you take the survey you can comment here saying you took the survey, and people upvote you for karma.



Discuss

AI in 2025: gestalt

22 часа 50 минут назад
Published on December 7, 2025 9:25 PM GMT

This is the editorial for this year’s "Shallow Review of AI Safety". (It got long enough to stand alone.) 

Epistemic status: subjective impressions plus one new graph plus 300 links.

Huge thanks to Jaeho Lee, Jaime Sevilla, and Lexin Zhou for running lots of tests pro bono and so greatly improving the main analysis.

tl;dr
  • Informed people disagree about the prospects for LLM AGI – or even just what exactly was achieved this year. But they at least agree that we’re 2-20 years off (if you allow for other paradigms arising). In this piece I stick to arguments rather than reporting who thinks what.
  • My view: compared to last year, AI is much more impressive but not much more useful. They improved on many things they were explicitly optimised for (coding, vision, OCR, benchmarks), and did not hugely improve on everything else. Progress is thus (still!) consistent with current frontier training bringing more things in-distribution rather than generalising very far.
  • Pretraining (GPT-4.5, Grok 4, but also counterfactual large runs which weren’t done) disappointed people this year. It’s probably not because it wouldn’t work; it was just ~30 times more efficient to do post-training instead, on the margin. This should change, yet again, soon, if RL scales even worse.
  • True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc) and a few unreleased models/modes.
  • Most benchmarks are weak predictors of even the rank order of models’ capabilities. I distrust ECIADeLe, and HCAST the least (see graph below or this notebook). ECI and ADeLe show a linear improvement while HCAST finds an exponential improvement on greenfield software engineering.
     
  • The world’s de facto strategy remains “iterative alignment”, optimising outputs with a stack of alignment and control techniques everyone admits are individually weak.
  • Early claims that reasoning models are safer turned out to be a mixed bag (see below).
  • We already knew from jailbreaks that current alignment methods were brittle. The great safety discovery of the year is that bad things are correlated in current models. (And on net this is good news.) "Emergent misalignment" from finetuning on one malign task; and in the wild from reward hacking; and it happens by strengthening specific bad personas; and there is at least one positive generalisation too (from honesty about silly errors to honesty about hidden objectives).
  • Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure.
  • Welcome to the many new people in AI Safety and Security and Assurance and so on. In the Shallow Review, out soon, I added a new, sprawling top-level category for one large trend among them, which is to treat the multi-agent lens as primary in various ways.
  • Overall I wish I could tell you some number, the net expected safety change (this year’s improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques). But I can’t.

 

Capabilities in 2025

Better, but how much?

Fraser, riffing off Pueyo 

 

Arguments against 2025 capabilities growth being above-trend
  • Apparent progress is an unknown mixture of real general capability increase, hidden contamination increase, benchmaxxing (nailing a small set of static examples instead of generalisation) and usemaxxing (nailing a small set of narrow tasks with RL instead of deeper generalisation). It’s reasonable to think it’s 25% each, with low confidence.
     
  • Discrete capabilities progress seems slower this year than in 2024 (but 2024 was insanely fast). Kudos to this person for registering predictions and so reminding us what really above-trend would have meant concretely. The excellent forecaster Eli was also over-optimistic.
     
  • I don’t recommend taking benchmark trends, or even clever composite indices of them, or even clever cognitive science measures too seriously. The adversarial pressure on the measures is intense.
     
  • Pretraining didn’t hit a “wall”, but the driver did manoeuvre away from it on encountering an easier detour (RLVR). 
    • Training runs continued to scale (Llama 3 405B = 4e25, GPT-4.5 ~= 4e26, Grok 4 ~= 3e26) but to less effect.[1] In fact all of these models are dominated by apparently smaller pretraining runs with better post-training. 4.5 is actually shut down already; it wasn’t worth it to serve or make into a reasoning model.
    • It would work, probably, if we had the data and spent the next $10bn, it’s just too expensive to bother with at the moment compared to:
       
  • RLVR scaling and inference scaling (or “reasoning” as we’re calling it), which kept things going instead. This boils down to spending more on RL so the resulting model can productively spend more tokens.
    • But the fearedhoped-for generalisation from {training LLMs with RL on tasks with a verifier} to performing on tasks without one remains unclear even after two years of trying. Grok 4 was apparently a major test of scaling RLVR training.[2] It gets excellent benchmark results and the distilled versions are actually being used at scale. But imo it is the most jagged of all models. 
    • This rate of scaling-up cannot be sustained: RL is famously inefficient. Compared to SFT, it “reduces the amount of information a model can learn per hour of training by a factor of 1,000 to 1,000,000”. The per-token intelligence is up but not by much.
    • There is a deflationary theory of RLVR, that it’s capped by pretraining capability and thus just about easier elicitation and better pass@1. But even if that’s right this isn’t saying much!
    • RLVR is heavy fiddly R&D you need to learn by doing; better to learn it on smaller models with 10% of the cost. 
    • An obvious thing we can infer: the labs don't have the resources to scale both at the same time. To keep the money jet burning, they have to post.
       
  • By late 2025, the obsolete modal “AI 2027” scenario described the beginning of a divergence between the lead lab and the runner-up frontier labs.[3] This is because the leader’s superior ability to generate or acquire new training data and algorithm ideas was supposed to compound and widen their lead. Instead, we see the erstwhile leader OpenAI and some others clustering around the same level, which is weak evidence that synthetic data and AI-AI R&D aren’t there yet. Anthropic are making large claims about Opus 4.5’s capabilities, so maybe this will arrive on time next year.
     
  • For the first time there are now many examples of LLMs helping with actual research mathematics. But if you look closely it’s all still in-distribution in the broad sense: new implications of existing facts and techniques. (I don’t mean to demean this; probably most mathematics fits this spec.)
     
  • Extremely mixed evidence on the trend in the hallucination rate.
     
  • Companies make claims about their one-million- or ten-million-token effective context windows, but I don’t believe it.
     
  • In lieu of trying the agents for serious work yourself, you could at least look at the highlights of the gullible and precompetent AIs in the AI Village.
     
  • Here are the current biggest limits to LLMs, as polled in Heitmann et al:

 

Arguments for 2025 capabilities growth being above-trend

We now have measures which are a bit more like AGI metrics than dumb single-task static benchmarks are. What do they say? 

  1. Difficulty-weighted benchmarksEpoch Capabilities Index
    1. Interpretation: GPT-2 to GPT-3 was (very roughly) a 20-40 point jump.
  2. Cognitive abilitiesADeLe.[4] 
    1. Interpretation: level L is the capability held by 1 in 10L.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}  humans on Earth. GPT-2 to GPT-3 was a 0.6 point jump.
  3. Software agency: HCAST time horizon, the ability to handle larger-scale well-specified greenfield software tasks. 
    1. Interpretation: the absolute values are less important than the implied exponential (a 7 month doubling time).
       

So: is the rate of change in 2025 (shaded) holding up compared to past jumps?:


Ignoring the (nonrobust)[5] ECI GPT-2 rate, we can say yes: 2025 is fast, as fast as ever or more. 

Even though these are the best we have, we can’t defer to these numbers.[6] What else is there?

  • In May I finally started using LLMs for actual tasks. For me this is mostly due to the search agents replacing a degraded Google search. I‘m not the only one who flipped this year. This hasty poll is worth more to me than any benchmark:
  • On actual adoption and actual real-world automation: 
    • Based on self-reports, the St Louis Fed thinks that “Between 1 and 7% of all work hours are currently assisted by generative AI, and respondents report time savings equivalent to 1.4% of total work hours… across all workers (including non-users… Our estimated aggregate productivity gain from genAI (1.2%)”. That’s model-based, using year-old data, and naively assuming that the AI outputs are of equal quality. Not strong.
    • The unfairly-derided METR study on Cursor and Sonnet 3.7 showed a productivity decrease among experienced devs with (mostly) <50 hours of practice using AI. Ignoring that headline result, the evergreen part here is that even skilled people turn out to be terrible at predicting how much AI actually helps them.
       
  • True frontier capabilities are likely obscured by systematic cost-cutting (distillation for serving to consumers, quantization, low reasoning-token modes, routing to cheap models, etc). Open models show you can now get good performance with <50B active parameters, maybe a sixth of what GPT-4 used.[7]
    • GPT-4.5 was killed off after 3 months, presumably for inference cost reasons. But it was markedly lower in hallucinations and nine months later it’s still top-5 on LMArena. I bet it’s very useful internally, for instance in making the later iterations of 4o less terrible.
    • See for instance the unreleased deep-fried multi-threadedexperimental reasoning model” which won at IMO, ICPC, and IOI while respecting the human time cap (e.g. 9 hours of clock time for inference). The OpenAI one is supposedly just an LLM with extra RL. They probably cost an insane amount to run, but for our purposes this is fine: we want the capability ceiling rather than the productisable ceiling. Maybe the first time that the frontier model has gone unreleased for 5 months?
  • LLM councils and Generate-Verify divide-and-conquer setups are much more powerful than single models, and are rarely ever reported.
  • Is it “the Year of Agents” (automation of e.g. browser tasks for the mass market)? Coding agentsyes. Search agents, yes. Other agents, not much (but obviously progress).
  • We’re still picking up various basic unhobbling tricks like “think before your next tool call”.
  • METR’s task-time-horizon work, if it implies anything, implies a faster rate of improvement than last year. There’s much to be said against this, and it has been said, including by METR.
  • If the rate of progress on messy tasks is about the same as the clean ones, then they're just 1-5 years behind. But this delay could still be a huge issue for LLM AGI, because it means we might not automate messy tasks before compute scaling runs out, and then they'll be a bottleneck for decades.
  • "Deep Research" search agents only launched in December/February. They’re neither deep nor doing research, but useful.
  • I claim that instruction-following improved a lot. (Some middling evidence here.) This is a big deal: besides its relation to corrigibility, it also denotes general ability to infer intent and connotations. 
  • Character-level work is still occasionally problematic but nothing like last year.
  • GPT-5 costsquarter of what 4o cost last year (per-token; it often uses far more than 4x the tokens). (The Chinese models are nominally a few times cheaper still, but are not cheaper in intelligence per dollar.)
  • People have been using competition mathematics as a hard benchmark for years, but will have to stop because it’s solved. As so often with evals called ahead of time, this means less than we thought it would; competition maths is surprisingly low-dimensional and so interpolable. Still, they jumped (pass@1) from 4% to 12% on FrontierMath Tier 4 and there are plenty of hour-to-week interactive speedups in research maths.
  • Recursive self-improvement: Deepmind threw AlphaEvolve (a pipeline of LLMs running an evolutionary search) at pretraining. They claim the JAX kernels it wrote reduced Gemini’s training time by 1%. 
  • Extraordinary claims about Opus 4.5 being 100th percentile on Anthropic’s hardest hiring coding test, etc.
  • From May, the companies started saying for the first time that their models have dangerous capabilities. 

 

One way of reconciling this mixed evidence is if things are going narrow, going dark, or going over our head. That is, if the real capabilities race narrowed to automated AI R&D specifically, most users and evaluators wouldn’t notice (especially if there are unreleased internal models or unreleased modes of released models). You’d see improved coding and not much else. 

Or, another way: maybe 2025 was the year of increased jaggednesstrading off some capabilities against others. Maybe the RL made them much better at maths and instruction-following, but also sneaky, narrow, less secure (in the sense of emotional insecurity and also the sense of jailbreaks).

– Roger Grosse
 

Evals crawling towards ecological validity
  • Item response theory (Rausch 1960) is finally showing up in ML. This lets us put benchmarks on a common scale and actually estimate latent capabilities.
    • ADeLE is my favourite. It’s a fully-automated and explains the abilities a benchmark is actually measuring, gives you an interpretable ability profile for an AI, and predicts OOD performance on new task instances better than embedding and finetunes (AUROC=0.8). Pre-dates HCAST task horizon, and as a special case (“VO”). They throw in a guessability control as well!
    • These guys use it to estimate latent model ability, and show it’s way more robust across test sets than the average scores everyone uses. They also step towards automating adaptive testing: they finetune an LLM to generate tasks at the specified difficulty level.
    • Epoch bundled 39 benchmarks together, weighting them by latent difficulty, and thus obsoleted the currently dominant Artificial Analysis index, which is unweighted.
    • HCAST reinvents and approximates some of the same ideas. Come on METR!
  • Eleuther did the first public study of composing the many test-time interventions together. FAR and AISI also made a tiny step towards an open source defence pipeline, to use as a proxy for the closed compositional pipelines we actually care about.
  • Just for cost reasons, the default form of evals is a bit malign: it tests full replacement of humans. This is then a sort of incentive to develop in that direction rather than to promote collaboration. Two papers lay out why it’s thus time to spend on human evals.
  • The first paper using RL agents to attack fully-defended LLMs. 
  • We have started to study propensity as well as capability. This is even harder.
  • This newsletter is essential.
  • The time given for pre-release testing is down, sometimes to one week.
  • No public pre-deployment testing by AISI between o1 and Gemini 3. Gemini 2.5 seems to have had no third-party pre-deployment tests.
  • A bunch of encouraging collaborations:

 

Safety in 2025

Are reasoning models safer than the old kind? 

Well, o3 and Sonnet 3.7 were pretty rough, lying and cheating at greatly increased rates. Looking instead at GPT-5 and Opus 4.5:

  • Much more monitorable via the long and more-faithful CoT (--> all risks down)
    • “post-hoc rationalization… GPT-4o-mini (13%) and Haiku 3.5 (7%). While frontier models are more faithful, especially thinking ones, none are entirely faithful: Gemini 2.5 Flash (2.17%), ChatGPT-4o (0.49%), DeepSeek R1 (0.37%), Gemini 2.5 Pro (0.14%), and Sonnet 3.7 with thinking (0.04%)”
  • Much better at following instructions (--> accident risk down). 
  • Much more likely to refuse malicious requests, and topic “harmlessness”[8] is up 75% (--> misuse risk down)
  • Ambiguous evidence on jailbreaking (misuse risk). Even if they’re less breakable there are still plenty of 90%-effective attacks on them.
  • Much less sycophantic (cogsec risk down)
  • To get HHH Claude to hack a bank, you need to hide the nature of the task, lie to it about this being an authorised red team, and then still break down your malicious task into many, many little individually-innocuous chunks. You thus can’t get it to do anything that needs full context like strategising.
  • Anthropic's own tests look bad in January 2025 and great in December:

But then

  • More autonomy (obviously agentic risk up)
  • More reward hacking (and so worse estimates of capability and risk). Note that reward hacking is not a silly or isolated or self-limiting kind of misalignment, owing perhaps to post-training inadvertently creating connections between it and the others.
  • Huge spike in eval awareness (and so worse estimates of capability and risk). Apollo had to rehaul their whole setup to keep up.
  • Gemini 2.5 is bad on a variety of safety measures (as measured by Anthropic). The mathematical discovery agent AlphaEvolve (Gemini 2.0) successfully hacked its overseers.
  • How much can we trust the above, given they can somewhat sabotage evals now?

     

So: lower propensity, higher risk when they go off - and all of this known with lower confidence? 

 

The looming end of evals

Our evaluations are under pressure from cheatingsandbaggingbackground safety, under-elicitation, and deception. We don’t really know how much pressure. This is on top of evals usually being weak proxies, contaminated, label-noised, unrealistic, and confounded in various ways.

  • Sonnet 4.5 detected evaluation 58% of the time (vs 22% for Opus 4.1). Does it act on this awareness? Still, it was found to "only engage in [covert subversion] in rare circumstances" when strongly incentivised. When Anthropic attempted to suppress the eval-awareness with steering, its misaligned behavior rate increased from 0% to between 1% and 9%.
  • Misaligned behaviour on synthetic honeypot tests increased from 0% to between 1% and 9%
  • We expect exploration hacking (sandbagging during RL training) but afaik haven’t seen it yet.
  • Maybe subliminal learning leads to a silent eval confounder (if you're using the exact same model for the judge). Yet another argument for multi-model separation of concerns.
  • Opus 4.5 was deployed based on a qualitative survey of Anthropic staff about its R&D risk being low (on top of the existing evals, which weren’t enough): 
    • “In the past, rule-outs have been based on well-defined automated task evaluations. However, Claude Opus 4.5 has roughly reached the pre-defined thresholds we set for straightforward ASL-4 rule-out based on benchmark tasks. These evaluations represent short-horizon subtasks that might be encountered daily by a junior researcher, rather than the complex long-horizon actions needed to perform the full role. The rule-out in this case is also informed by a survey of Anthropic employees who are intensive Claude Code users, along with qualitative impressions of model capabilities for complex, long-horizon tasks.”

Prosaic misalignment

Still, we see misalignment when we look for it, so the lying is not that strong. (It is lucky that we do see it, since it could have been that scheming only appeared at later, catastrophic capability levels.)

  • In o1, 1 in 600 (synthetic but uncontrived) prompts resulted in something which looks like deception.
  • o3 does reward hacking (cheating/lying) in >2% of hard normal tasks (or 44% of impossible tasks).
  • A nice propensity study shows that (n=10) natural prompts usually result in no scheming (“a maximum of 4% [rate of attempting manipulation] for the codex snippet on Grok”). But there’s extremely high sensitivity to the scaffold: “removing access to tools can cause models such as Claude Opus 4.1 to scheme at rates up to 30%, versus zero scheming with tools present”.
  • When we contrive situations for misalignment to show up, it also shows up. Nostalgebraist argues that this is probably substantially because our evals sound like bad fiction and this activates role-playing-along behaviour.
  • The joke about xAI’s safety plan (that they promote AI safety by deploying cursed stuff in public and so making it obvious why it’s needed) is looking ok. And not only them.
  • It is a folk belief among the cyborgists that bigger pretraining runs produce more deeply aligned models, at least in the case of Opus 3 and early versions of Opus 4. (They are also said to be “less corrigible”.) Huge if true.
  • There may come a point where the old alliance between those working to make the AIs corrigible and those working to give them prosocial values comes apart.
  • One term for the counterintuitive safety approach which includes treating them like people, giving them lines of retreatmaking deals, and inoculation prompting could be “voluntary alignment”.

Fully speculative note: Opus 4.5 is the most reliable model and also relatively aligned. So might it be that we’re getting the long-awaited negative alignment taxes?

 

 

What is the plan? 

The world’s de facto alignment strategy remains “iterative alignment”, optimising mere outputs with a stack of admittedly weak alignment and control techniques. Anthropic have at least owned up to this being part of their plan.

  • What is the current stack? We don’t know; they won’t tell us. Anthropic did release some details about the pipeline which watches for eval sabotage. Without knowing it we can’t criticise or red-team it or analyse the correlation between faults in the elements. Red-teamers don’t know at which stage an unsuccessful attack was blocked. And external safety research is done piecemeal, testing methods one at a time, rather than in anything like the actual deployment environment. 
  • Some progress in replicating an open stack:
  • Eleuther tested a few hundred compositions. A couple of classifiers as a first step towards a proxy defence pipeline
  • OpenAI open safeguards, worse than their internal ones but good.
  • Open character training
  • A basic composition test
  • OpenAI’s plan, announced in passing in the gpt-oss release, is to have a strict policy and run a “safety reasoner” to verify it very intensely for a little while after a new model is launched and to then relax: “In some of our [OpenAI’s] recent launches, the fraction of total compute devoted to safety reasoning has ranged as high as 16%” but then falls off… we often start with more strict policies and use relatively large amounts of compute where needed to enable Safety Reasoner to carefully apply those policies. Then we adjust our policies as our understanding of the risks in production improves”. Bold to announce this strategy on the internet that the AIs read.
  • The really good idea in AI governance is creating an off switch. Whether you can get anyone to use it once it’s built is another thing.
  • We also now have a name for the world’s de facto AI governance plan: “Open Global Investment”.

Some presumably better plans:

 

Things which might fundamentally change the nature of LLMs
  • Training on mostly nonhuman data
    • Much larger RL training.
    • Synthetic data as a fraction of pretraining.
  • Letting the world mess with the weights, aka continual learning
  • Neuralese and KV communication 
  • Agency. 
    • Chatbot safety doesn’t generalise much to long chains of self-prompted actions.
    • A live I/O loop. In 2023 Kulveit identified web I/O as the bottleneck on LLMs doing active inference, i.e. being a particular kind of effective agent. Last October, GPT-4 got web search. This may have been a bigger deal than we noticed: it gives them a far faster feedback loop, since their outputs often end up there and agents are now putting it there themselves. This means that more and more of the inference-time inputs will also be machine text.
  • Multi-agency. By now, consumer “models” are actually multiagent systems: everything goes through filter models (“guardrails”) on the way in and out. This separation of concerns has some nice properties, a la debate. But it also makes the analysis even harder.
Emergent misalignment and model personas
  • We already knew from jailbreaks that current alignment methods were brittle. Emergent misalignment goes much further than this (given a few thousand finetuning steps). (“Emergent misalignment” isn’t a great name. I would have called it “misalignment generalisation”, or misgen.) 
  • But besides yet another massive security problem and failure of prosaic alignment methods, it’s good news!: the models correctly correlate bad things together and can thus be pushed in the other direction. 
  • And here’s a possible example of positive generalisation (honesty about silly errors leading to honesty about hidden objectives).
  • Previously I thought that "character training" was a separate and lesser matter than "alignment training". Now I am not sure. Example unsharpened hypothesis in this class: Is there any connection between Gemini’s excess misalignment and its self-esteem issues?
    • See the new model spec and model personas agendas in the main post.
    • Kulveit’s informal model of model psychology implies that character is deeper and so more robust than e.g. typical RLHF (but not as deep as putative ground-layer alignment). 
    • Are we anthropomorphising too much or not enough? Well, both, but we don’t yet know how.
Monitorability
  • For the first time, public mech interp tools sometimes work somewhat. This is just in time for models to be systematically aware of when they’re being tested.
    • e.g. You can find and suppress “evaluation awareness” features at test-time. This helps prevent models from reporting that they know they’re being tested.[9]
    • e.g. You can discover (toy) hidden goals quite quickly.
    • e.g. You can run a real-time hallucination detector with cheap linear probes in real-time as your model streams.
    • e.g. SAEs are actually deployed somewhere, in a large enterprise, for personal info identification.
    • e.g. We know that LLMs can plan inside one forward pass, and how Claude plans: simultaneous plans; no distinct “plan features” (no separate scheming mode).
  • Lots of powerful people declared their intent to not ruin the CoT. But RLed CoTs are already starting to look weird (“marinade marinade marinade”) and it may be hard to avoid
  • OpenAI were leading on this. As of September 2025, Anthropic have stopped risking ruining the CoT. Nothing I’m aware of from the others.
  • We will see if Meta or Tencent make this moot.
  • Anthropic now uses an AI to red-team AIs, calling this an “auditing agent”. However, the definition of “audit” is independent investigation, and I am unwilling to call black-box AI probes “independent”. I’m fine with “investigator”.
     
New peopleOverall
  • I wish I could tell you some number, the net expected safety change, this year’s improvements in dangerous capabilities and agent performance, minus the alignment-boosting portion of capabilities, minus the cumulative effect of the best actually implemented composition of alignment and control techniques. But I can’t.

(Nano Banana 3-shot in reference to this tweet.)

 

Discourse in 2025
  • The race is now a formal part of lab plans. Quoting Algon:

if the race heats up, then these [safety] plans may fall by the wayside altogether. Anthropic’s plan makes this explicit: it has a clause (footnote 17) about changing the plan if a competitor seems close to creating a highly risky AI…

The largest [worries are] the steps back from previous safety commitments by the labs. Deepmind and OpenAI now have their own equivalent of Anthropic’s footnote 17, letting them drop safety measures if they find another lab about to develop powerful AI without adequate safety measures. Deepmind, in fact, went further and has stated that they will only implement some parts of its plan if other labs do, too…

Anthropic and DeepMind reduced safeguards for some CBRN and cybersecurity capabilities after finding their initial requirements were excessive. OpenAI removed persuasion capabilities from its Preparedness Framework entirely, handling them through other policies instead. Notably, Deepmind did increase the safeguards required for ML research and development.

 Also an explicit admission that self-improvement is the thing to race towards:

  • In August, the world’s first frontier AI law came into force (on a voluntary basis but everyone signed up, except Meta). In September, California passed a frontier AI law.
  • That said, it is indeed off that people don’t criticise Chinese labs when they exhibit even more negligence than Meta. One reason for this is that, despite appearances, they’re not frontier; another is that you’d expect to have way less effect on those labs, but that is still too much politics in what should be science.
  • The last nonprofit among the frontier players is effectively gone. This “recapitalization” was a big achievement in legal terms (though not unprecedented). On paper it’s not as bad as it was intended to be. At the moment it’s not as bad as it could have been. But it’s a long game.
  • At the start of the year there was a push to make the word “safety” low-status. This worked in Whitehall and DC but not in general. Call it what you like.
  • Also in DC, the phrase “AI as Normal Technology” was seized upon as an excuse to not do much. Actually the authors meant “Just Current AI as Normal Technology” and said much that is reasonable.
  • System cards have grown massively: GPT-3’s model card was 1000 words; GPT-5’s is 20,000. They are now the main source of information on labs’ safety procedures, among other things. But they are still ad hoc: for instance, they do not always report results from the checkpoint which actually gets released. 
  • Yudkowsky and Soares’ book did well. But Byrnes and Carlsmith actually advanced the line of thought.
  • Some AI ethics luminaries have stopped downplaying agentic risks.
  • Two aspirational calls for “third-wave AI safety” (Ngo) and “third-wave mechanistic interpretability” (Sharkey).
  • I’ve never felt that the boundary I draw around “technical safety” for these posts was all that convincing. Yet another hole in it comes from strategic reasons to implement model welfarearchive weightsmodel personhoodgive lines of retreat. These plausibly have large effective-alignment effects. Next year my taxonomy might have to include “cut a deal with them”.
  • US precedent that language models don’t defame when they make up bad things. German precedent that language models store data when they memorise it, and therefore violate copyright. Chinese precedent that the user of an AI has copyright over the things they generate; the US disagrees. Anthropic lost a class-action lawsuit for training on books without permission, and will pay something north of $1.5bn.
  • Four good conferences, three of them new: you can see the talks from HAAISS and IASEAI and ILIAD, and the papers from AF@CMU. Pretty great way to learn about things just about to come out.

 

  • Some cruxes for next year with Manifold markets attached:
    • Is “reasoning” mostly elicitation and therefore bottlenecked on pretraining scaling? [Manifold]
    • Does RL training on verifiers help with tasks without a verifier? [Manifold]
    • Is “frying” models with excess RL (harming their off-target capabilities by overoptimising in post-training) just due to temporary incompetence by human scientists? [Manifold]
    • Is the agent task horizon really increasing that fast? Is the rate of progress on messy tasks close to the progress rate on clean tasks? [Manifold]
    • Some of the apparent generalisation is actually interpolating from semantic duplicates of the test set in the hidden training corpuses. So is originality not increasing? Is taste not increasing? Does this bear on the supposed AI R&D explosion? [Manifold]
    • The “cognitive core” hypothesis (that the general-reasoning components of a trained LLM are not that large in parameter count) is looking surprisingly plausible. This would explain why distillation is so effective. [Manifold]
    • How far can you get by simply putting an insane number of things in distribution?” What fraction of new knowledge can be produced through combining existing knowledge? What dangerous things are out there, but safely spread out in the corpus? [Manifold]
      • Conversely, what fraction of the expected value of new information requires empiricism vs just lots of thinking?

 

 

  1. ^

    Gemini 3 is supposedly a big pretraining run, but we have even less actual evidence here than for the others because we can’t track GPUs for it.

  2. ^

    The weak argument runs as follows: Epoch speculate that Grok 4 was 5e26 FLOPs overall. An unscientific xAI marketing graph implied that half of this was spent on RL: 2.5e26. And Mechanize named 6e26 as an example of an RL budget which might cause notable generalisation.

  3. ^

    “We imagine the others to be 3–9 months behind OpenBrain”

  4. ^

    Lexin is a rigorous soul and notes that aggregating the 18 abilities is not strictly possible. I've done something which makes some sense here, weighting by each ability's feature importance.

  5. ^

    Two runs gave [48, 85] where other runs vary by less than 4 points. Thanks Epoch!

  6. ^

    One reason not to defer is that these measures are under intense adversarial pressure. (ADeLe isn’t goodharted yet but only because no one knows about it.)

  7. ^

     See e.g. ERNIE-...A47B, where “A” means “active”.

  8. ^

    i.e.  “biological weapons; child safety; deadly weapons; platform manipulation and influence operations; suicide and self-harm; romance scams; tracking and surveillance; and violent extremism and radicalization.”

  9. ^

    “steering against… eval-awareness representations typically decreased verbalized eval awareness, and sometimes increased rates of misalignment... [Unaware-steered Sonnet 4.5] still exhibited harmful behaviors at lower rates than Opus 4.1 and Sonnet 4.”



Discuss

Thinking in Predictions

23 часа 5 минут назад
Published on December 7, 2025 9:11 PM GMT

[This essay is my attempt to write the "predictions 101" post I wish I'd read when first encountering these ideas. It draws extensively from The Sequences, and will be familiar material to many LW readers. But I've found it valuable to work through these concepts in my own voice, and perhaps it will be useful for others.]

 

 

Predictions are the currency of a rational mind. They force us to articulate our assumptions, unpack the commitments of our beliefs, and subject our beliefs to the crucible of reality.

Committing to a prediction is a declaration: “Here’s what I actually believe about the world, and here’s how confident I am. Let’s find out if I’m right.” That willingness to be wrong in measurable ways is what separates genuine truth-seeking from specious rationalization.

This essay is an invitation to put on prediction-tinted glasses and see beliefs as tools for anticipating experience. The habit applies far beyond science. When you think predictively, vague assertions become testable claims and disagreements become measurable. Whether you're evaluating a friend's business idea, judging whether a policy will work, or just trying to settle an argument at dinner, you're already making implicit predictions. Making them explicit sharpens your thinking by forcing your beliefs to confront reality.

Turning Beliefs into Predictions

For many people, beliefs are things to identify with, defend, and hold onto. They’re part of who we are, our tribe, our worldview, our sense of self. But in a rationalist framework, beliefs serve a different function: they’re tools for modeling reality.

From this perspective, a belief leads naturally to a prediction. It tells you what future (or otherwise unknown) experience to expect. Rationalists call this “making beliefs pay rent.”

A prediction concretizes a belief by answering: If this is true, what will I actually experience? What will I see, hear, measure, or observe? Once you try to answer that, your thinking becomes specific. What was once comfortably abstract becomes a concrete commitment about what the world will show you.

Predictions Make You Accountable to Reality

Many of the statements we encounter daily in the media, in academic papers, and in casual conversation, don't come with clear predictive power. They sound important, even profound, but when we try to pin them down to testable claims about the world, they become surprisingly slippery.

The key to avoiding this is to make specific predictions, and even to bet on your ideas. Economist Alex Tabarrok quipped: "A bet is a tax on bullshit." Most importantly (at least in my opinion), it’s a tax on your own bullshit. When you have to stake something on your prediction, even if it's just your credibility, you become clearer about what you actually believe and how confident you really are.

Tabarrok used the example of Nate Silver's 2012 election model. Silver's model gave Obama roughly a 3-to-1[1]chance of winning against Romney, and his critics insisted he was wrong. Silver's response was simple: "Wanna bet?" He wasn't just talking, he was willing to put money on it. And, crucially, he would have taken either side at the right odds. He'd bet on Obama at even money, but he'd also bet on Romney if someone offered him better than 3-to-1. That's what it means to actually believe your model: you're not rooting for an outcome, you're betting on your beliefs.

Constantly subjecting our predictions to reality is important because, as physicist Richard Feynman said: "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong." Reality is the ultimate arbiter. Not who argues more eloquently, not who has more credentials, but whose predictions better match what actually happens.

Predictions in Action

Consider the 19th-century debate over what caused the infectious diseases that were plaguing Europe, such as cholera, malaria, and tuberculosis. The dominant hypothesis at the time was miasma theory, the idea that diseases were caused by "bad air" or noxious vapors from rotting organic matter. The competing hypothesis was germ theory, the idea that diseases were caused by microscopic organisms.

Each hypothesis generates different predictions. If you believe in germ theory, you expect that examining blood or tissue from a sick person under a microscope will reveal tiny organisms. You expect that preventing these organisms from spreading (through sterilization, handwashing, etc.), will reduce disease rates. You expect that introducing these organisms to a healthy subject will cause illness.

If you believe in miasma theory, you make different predictions: Diseases should cluster around areas with foul-smelling air—swamps, sewers, garbage heaps. Improving ventilation and eliminating bad odors should reduce disease. And looking under a microscope shouldn't reveal anything particularly relevant, since the problem is airborne vapors, not tiny creatures.

These hypotheses do overlap in some predictions. For example, both theories predict that interventions such as disposing of sewage and drinking clean water would reduce the spread of disease. This is common when comparing hypotheses, and it makes them harder to separate. But you can still test them by focusing on where they diverge: What happens when you sterilize surgical instruments but don't improve the smell? What do you see under a microscope?

These hypotheses differed in what experiences they anticipated from the world. And the world answered. Microscopes, sterilization, and epidemiological data validated germ theory.

Now contrast this with a belief like "everything happens for a reason." What experience does it lead you to anticipate? If something bad happens, there's a reason—but what does that mean you should observe? If something good happens, there's also a reason—but, again, what should you experience? If nothing happens, that too has a reason. The belief is compatible with any possible observation, which means it doesn't constrain your expectations about what will happen in the world. It doesn't actually predict anything at all. It doesn't pay rent.

Predicting the Past

It's tough to make predictions, especially about the future - Yogi Berra

Predictions are often associated with forecasting future events, but they encompass far more than that. Instead of thinking about what has happened versus what hasn’t, think about it from a first-person perspective: in terms of what information you know versus what you don’t. The act of prediction is fundamentally about inferring information unknown to you from what you already know, a process that applies equally to the past and present as it does to the future. Whether you're deducing what happened thousands of years ago, inferring what exists in places you've never visited, or anticipating what will occur tomorrow, you're engaging in prediction whenever you use available evidence to form expectations about information you don't currently have.

Consider the debate around the evolution of human bipedalism, an event decidedly in the past. Several hypotheses have been proposed to explain why bipedalism evolved, but let’s look at just two of them.[2]

The “savannah hypothesis” suggests that bipedalism arose as an adaptation to living in open grasslands. According to this idea, walking upright allowed early hominins to see over tall grasses and spot predators or prey from a distance.

The “carrying hypothesis” proposes that bipedalism evolved to free the hands for carrying objects, such as food and infants, and, later on, tools. This would have provided a significant advantage in terms of transporting resources and infant care.

You might think that because it happened so long ago there’s no way to find out, but each of these hypotheses generates different predictions about what we should expect to find in the fossil record. The savannah hypothesis predicts that bipedal humans first appeared in, well, the savannah. So we should expect to find the oldest bipedal fossils in what was savannah at the time[3]. The first bipedal fossils may also coincide with a shift in climate towards more arid conditions which created more savannahs.

The carrying hypothesis predicts that bipedal species will predate the regular use of tools, as the hands were initially freed for carrying rather than tool use. It also suggests that early bipeds will show skeletal adaptations for efficient walking, such as a shorter, wider pelvis and a more arched foot.

New fossil discoveries have allowed researchers to test these predictions against the evidence. The famous "Lucy" fossil, a specimen of an early hominin species, was found in a wooded environment rather than an open savanna. This challenged the savannah hypothesis and suggested that bipedalism evolved in a more diverse range of habitats. Furthermore, the earliest known stone tools date to around 2.6 million years ago, long after the appearance of bipedal hominins. This also aligns with the predictions of the carrying hypothesis.

The point here is that competing hypotheses generate different predictions about what we should observe in the world. And this isn't restricted to the future. We can "predict" what we'll find in the fossil record, what a telescope will reveal about distant galaxies, or what an archaeological dig will uncover, even though the events themselves happened long ago.

You might think, well, that works for scientists but not the rest of us. I disagree. When you wear prediction-tinted glasses, you see that predictions are all around us. Even when people don’t think they’re making predictions, they often are. Positive statements—those telling us what’s true about the world—often make predictions.

A reconstruction of Lucy. Image source: NewScientistIf there is no prediction, there is no question

These same principles apply outside of science as well. Let’s see how this predictive thinking can apply to philosophy. Consider this example from Eliezer Yudkowsky regarding one of philosophy's most famous[4]puzzles: “If a tree falls in the forest and no one is around to hear it, does it make a sound?”

Imagine we have two people: Silas, who thinks a tree doesn't make a sound, and Loudius, who thinks it does.

How would we turn this into a prediction? Well, if there's a debate over whether something makes a sound, the natural thing to do might be to bring a recording device. Another thing we might do is to ask everyone if they heard the sound.

So we ask Silas and Loudius to register their predictions. We ask Silas: "If we place sensitive measuring equipment in the forest, will it detect compression waves from the falling tree?" He says, "Of course it will, but that's not what a 'sound' is. A sound is when a person experiences an auditory sensation."

We turn to Loudius with the same question, and he agrees: "Yes, the equipment would detect compression waves. That's exactly what sound is."

Then we ask them both: "Will any human report hearing something?" Both agree: "Of course not, there's no one there."

Notice what happened: Silas and Loudius make identical predictions about what will happen in the physical world. They both predict compression waves will be detected, and both predict no human will report hearing anything. If two positions produce identical predictions, the disagreement is purely meta-linguistic. Their disagreement isn't about reality at all; it's purely about which definition of "sound" to use.

Through our prediction-tinted glasses, we see the riddle dissolves into nothing. What seemed like a philosophical puzzle is actually just two people talking past each other and a debate about the definition of words—a decidedly less interesting topic.

Image created by ChatGPTPrediction Improves Discourse

We can apply this same thinking to basic conversation. Consider a conversation I had several years ago with someone who told me that “AI is an overhyped bubble”. Instead of having a debate that hinged on the definition of vague words like “AI” and “overhyped”, I said, “I don’t agree. Let me make some predictions.” Then I started to rattle some off:

  1. Funding for AI will increase every year for the next five years
  2. AI will remain the largest sector for tech investment over the next five years
  3. All the major tech companies will still be focused on AI in five years…

“Oh—” he cut it, “I agree with all that. I just think people talk about it too much.”

OK, fine. That’s a subjective statement, and subjective statements don’t have to lead to predictions (see the appendix for types of statements that don’t lead to predictions). But by making predictions, what we were actually disagreeing about became much clearer. We realized we didn't have a substantive disagreement about reality at all. He was just annoyed by how much people talked about AI.

Conclusion

Putting on prediction-tinted glasses transforms how you engage with ideas. Vague assertions that once seemed profound reveal themselves as empty when you ask, "What does this actually predict?" Endless debates dissolve when you realize both sides anticipate the exact same observations. And your own confident beliefs become humbler when you're forced to attach probabilities to specific outcomes.

The practice is simple, even if the habit takes time to build. When you encounter a belief, ask: "What experience should I anticipate if this is true? What would I observe differently if it were false?" When you find yourself in a disagreement, ask: "What predictions do we actually differ on, or are we just arguing about definitions?" When you hold a strong belief, ask: "What probability would I assign to specific outcomes, and what evidence would change my mind?"

Of course, in many real-world situations, people aren't striving for maximal predictive clarity. Sometimes people are aiming for social cohesion, emotional comfort, or personal satisfaction. Artful ambiguity, tribal signaling, and motivational fictions are important social lubricants. Not every utterance can or should be a prediction. Still, it's better to let your values define what tribe you're in and keep your beliefs about reality anchored to reality.

But when you aim for clarity and truth, explicit predictions are the tool that forces belief to meet reality. By making our beliefs pay rent in anticipated experiences, we invite reality to be our teacher rather than our adversary. It corrects our models, sharpens our thinking, and aligns our mental maps with the territory they're meant to represent.

AppendixTypes of statements that do not lead to predictions

Not all statements lead to testable predictions. There are some statements that are so meaningless as to not predict anything, and others that are incredibly important, yet don’t predict anything. Some propositions, by their very nature, resist falsification or empirical evaluation. Here are a few categories of such statements:

  1. Subjective Opinions: Statements like "I like that painting" express personal taste rather than objective facts. While opinions are vital, they're typically immune to falsification (barring creepy futuristic mind-reading AI).
  2. Normative Statements: Prescriptions like "We should be nice to each other" or “murder is wrong” express moral values rather than descriptive facts. While moral statements principles are important, they express values rather than facts about the world[5]. Normative statements can be built upon positive statements, but they still rely on a normative statement in the end. For example, you could test whether murder increases or decreases overall happiness, but that still rests on the normative premise that happiness is what matters morally.
  3. Analytical Statements: Some statements are true by definition or logical necessity rather than because of how the world happens to be. "All bachelors are unmarried" is true because of what "bachelor" means. Similarly, "2 + 2 = 4" isn't a claim you test by counting objects; it's a consequence of how we've defined numbers and addition. These statements don't generate predictions about what you'll observe because they're not about the world in that way. They can still be important, but they're not the kind of beliefs that "pay rent" in anticipated experiences.
  4. Meta-linguistic statements: The example we used above with the tree falling in the forest is ultimately a meta-linguistic one. What is the definition of ‘sound’? There is no prediction we could make here. We could be descriptivist and decide that the definition is however people use it, then go out and conduct a survey. We could make predictions about the result of that survey, but that doesn’t tell us what the one, true definition of sound is, just what it is given a descriptivist framework.
    Similarly, is Pluto a planet? There is no prediction that answering “yes” or “no” would make here. We have measurements of all the relevant parameters. The International Astronomical Union decided to create an official definition and, by that definition, it is not. But this is a question of which definition of “planet” we choose.
  5. Vague Assertions: Some claims are so fuzzy or ill-defined that they can't be properly evaluated. Politicians and pundits often traffic in this sort of strategic ambiguity. Statements like "My opponent's policies will lead to disaster" or "The education system is broken" sound like factual claims but are too amorphous to test. Forcing such claims to make concrete predictions often reveals their emptiness or shows they're actually testable once properly specified.
  6. Inherently Unfalsifiable Claims: Some propositions resist testing, either by their very nature (like a non-interfering god) or by strategic design of the speaker (like adding ad hoc escape clauses when predictions fail). For instance, a deist god that created the universe but never interferes with it is undetectable by definition.
    Unfalsifiable claims like this may be personally meaningful, but they cannot be empirically adjudicated. They lie outside the realm of prediction because they don't generate any anticipated experiences.
    However, we can still reason about some unfalsifiable propositions indirectly through thought experiments, such as those involving the Anthropic Principle (which asks why we observe the universe to be compatible with our existence).
    It's also worth noting that people often mistakenly believe certain claims are unfalsifiable when they actually aren't. "God answers prayers" is one such example. This belief does generate predictions: if true, we'd anticipate that prayed-for patients recover faster than those not prayed for, or that religious communities experience better outcomes in measurable ways. These are testable claims, even if believers sometimes retreat to strategically unfalsifiable versions ("God answers prayers, but in mysterious ways") when the predictions fail.

The point here isn't that non-predictive statements are worthless. Meta-linguistic analysis, subjective experiences, moral reasoning, and logical inference all play crucial roles in human cognition and culture. However, we need to distinguish unfalsifiable propositions from testable predictions.

  1. The exact odds changed over time, but it was usually around this. ↩︎

  2. Of course, bipedalism could have evolved due to multiple factors working together, but I'm presenting just two hypotheses here to keep the example clear. ↩︎

  3. Adjusting, of course, for how common fossilization is in that area and other factors. We’re simplifying here. ↩︎

  4. Note that I said “famous”, not necessarily best or most interesting. ↩︎

  5. Moral realists would disagree with this statement, but that’s a separate post. ↩︎



Discuss

[Linkpost] Theory and AI Alignment (Scott Aaronson)

7 декабря, 2025 - 22:17
Published on December 7, 2025 7:17 PM GMT

Some excerpts below:

On Paul's "No-Coincidence Conjecture"

Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?

In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.

That’s why I love a question that Paul Christiano, then of the Alignment Research Center (ARC), raised a couple years ago, and which has become known as the No-Coincidence Conjecture. Given as input the weights of a neural net C, Paul essentially asks how hard it is to distinguish the following two cases:

  • NO-case: C:{0,1}2n→Rn is totally random (i.e., the weights are i.i.d., N(0,1) Gaussians), or

  • YES-case: C(x) has at least one positive entry for all x∈{0,1}2n.

Paul conjectures that there’s at least an NP witness, proving with (say) 99% confidence that we’re in the YES-case rather than the NO-case. To clarify, there should certainly be an NP witness that we’re in the NO-case rather than the YES-case—namely, an x such that C(x) is all negative, which you should think of here as the “bad” or “kill all humans” outcome. In other words, the problem is in the class coNP. Paul thinks it’s also in NP. Someone else might make the even stronger conjecture that it’s in P.

Personally, I’m skeptical: I think the “default” might be that we satisfy the other unlikely condition of the YES-case, when we do satisfy it, for some totally inscrutable and obfuscated reason. But I like the fact that there is an answer to this! And that the answer, whatever it is, would tell us something new about the prospects for mechanistic interpretability.

On the utility of training data for mech-interp

I can’t resist giving you another example of a theoretical computer science problem that came from AI alignment—in this case, an extremely recent one that I learned from my friend and collaborator Eric Neyman at ARC. This one is motivated by the question: when doing mechanistic interpretability, how much would it help to have access to the training data, and indeed the entire training process, in addition to weights of the final trained model? And to whatever extent it does help, is there some short “digest” of the training process that would serve just as well? But we’ll state the question as just abstract complexity theory.

Suppose you’re given a polynomial-time computable function f:{0,1}m→{0,1}n, where (say) m=n2. We think of x∈{0,1}m as the “training data plus randomness,” and we think of f(x) as the “trained model.” Now, suppose we want to compute lots of properties of the model that information-theoretically depend only on f(x), but that might only be efficiently computable given x also. We now ask: is there an efficiently-computable O(n)-bit “digest” g(x), such that these same properties are also efficiently computable given only g(x)?

Here’s a potential counterexample that I came up with, based on the RSA encryption function (so, not a quantum-resistant counterexample!). Let N be a product of two n-bit prime numbers p and q, and let b be a generator of the multiplicative group mod N. Then let f(x) = bx (mod N), where x is an n2-bit integer. This is of course efficiently computable because of repeated squaring. And there’s a short “digest” of x that lets you compute, not only bx (mod N), but also cx (mod N) for any other element c of the multiplicative group mod N. This is simply x mod φ(N), where φ(N)=(p-1)(q-1) is the Euler totient function—in other words, the period of f. On the other hand, it’s totally unclear how to compute this digest—or, crucially, any other O(m)-bit digest that lets you efficiently compute cx (mod N) for any c—unless you can factor N. There’s much more to say about Eric’s question, but I’ll leave it for another time.

On debate and complexity theory

Yet another place where computational complexity theory might be able to contribute to AI alignment is in the field of AI safety via debate. Indeed, this is the direction that the OpenAI alignment team was most excited about when they recruited me there back in 2022. They wanted to know: could celebrated theorems like IP=PSPACE, MIP=NEXP, or the PCP Theorem tell us anything about how a weak but trustworthy “verifier” (say a human, or a primitive AI) could force a powerful but untrustworthy super-AI to tell it the truth? An obvious difficulty here is that theorems like IP=PSPACE all presuppose a mathematical formalization of the statement whose truth you’re trying to verify—but how do you mathematically formalize “this AI will be nice and will do what I want”? Isn’t that, like, 90% of the problem? Despite this difficulty, I still hope we’ll be able to do something exciting here.



Discuss

Lawyers are uniquely well-placed to resist AI job automation

7 декабря, 2025 - 19:28
Published on December 7, 2025 4:28 PM GMT

Note: this was from my writing-every-day-in-november sprint, see my blog for disclaimers.

I believe that the legal profession is in a particularly unique place with regards to white-collar job automation due to artificial intelligence. Specifically, I wouldn’t be surprised if they are able to make the coordinated political and legal manoeuvres to ensure that their profession is somewhat protected from AI automation. Some points in favour of this position:

  • The process of becoming a lawyer selects for smart people who are good at convincing others towards the lawyers opinion.
  • Lawyers know the legal framework very well, and have connections within the legal system.
  • Certain lawyers also know the political framework very well, and have connections within the political system.
  • (delving a bit more into my opinions here) lawyers seem more likely to be see white-collar job automation coming and take steps to prevent it than many other professions. I would feel much more confident that I could convince a lawyer of the dangers of x-risk than similar attempts to convince a school teacher.
  • Lawyers are well-paid and generally well-organised, so are in a good place to work collectively to solve problems faced by the entire profession.

I believe that as widely deployed AI becomes more competent at various tasks involved in white-collar jobs, there’ll be more pressure to enact laws that protect these professions from being completely automated away. The legal profession is in the interesting position of having large portions of the job susceptible to AI automation, while also being very involved at drafting and guiding laws that might prevent their jobs from being completely automated.

Politicians are probably even better placed to enact laws that prevent politicians from being automated, although I don’t believe politicians are as at-risk as lawyers. Lawyers are simultaneously at-risk for automation and very able to prevent their automation.

Whether the lawyers actually take action in this space is tricky to say, because there are so many factors that could prevent this: maybe white-collar automation takes longer than expected, maybe the politicians pass laws without clear involvement from large numbers of lawyers, maybe no laws get passed but the legal profession socially shuns anyone using more than an “acceptable level” of automation.

But if the legal profession were to take moves to prevent the automation of their own jobs, I’d be very surprised if they drafted an act titled something like “THE PROTECT LAWYERS AT ALL COSTS ACT”. I imagine the legal paperwork will protect several professions, with lawyers just one of them, but that lawyers will indeed be protected under this act. This is to say, I believe the legal profession to be fairly savvy, and if they do make moves to protect themselves against AI job automation, I doubt it’ll be obviously self-serving.



Discuss

How a bug of AI hardware may become a feature for AI governance

7 декабря, 2025 - 17:55
Published on December 7, 2025 2:55 PM GMT

“Hardware noise” in AI accelerators is often seen as a nuisance, but it might actually turn out to be a useful signal for verification of claims about AI workloads and hardware usage.

With this post about my experiments (GitHub), I aim to

  1. Contribute more clarity to the discussion about “GPU non-determinism”
  2. Present how non-associativity can help monitor untrusted AI datacenters
Summary
  • I ran ML inference in dozens of setups to test which setups have exactly reproducible results, and which differences in setups lead to detectable changes in outputs or activations.
  • In nearly all cases studied, results were bitwise-reproducible within fixed settings. Differences across production methods were consistent, not random.
  • Given that these perturbations are reproducible and unique, they can act as a “fingerprint” of the exact setup that produced an output. This may turn out useful for monitoring untrusted ML hardware (such as in the context of AI hardware governance, international treaty verification, and AI control/security).
  • Some settings had unique fingerprints, while others were invariant under change.
    • Invariant (i.e. not detectable by noise):
      • batch size in prefill inference
      • concurrent CUDA streams
      • pipeline parallelism rank
    • Detectable when re-executing on identical hardware:
      • batch size in decode inference
      • attention algorithm (sdpa, FlashAttention, eager, …)
      • CUDA version (if kernel libraries were updated)
      • tensor parallelism
      • different quantization methods, even at the same precision
      • Any change that affects numerics is detectable, since results were bitwise-reproducible within settings.
    • Detectable even with reproduction on different hardware:
      • attention algorithm
      • different quantizations (even within the same INT precision)
      • and of course different inputs or models
      • Different reduction order (a subtle difference resulting from batching, tensor parallelism, etc.) is masked by cross-hardware “noise”. Different algorithms are still detectable, because they are not just rounding errors, but qualitatively different math.

In a world with demand for assurance against hidden large-scale ML hardware use, this could become a new layer of defense, conditional on some engineering to make it deployment-ready.

The full post is to be found on my Substack.

This work was part of my technical AI governance research at MATS (ML Theory and Alignment Scholars). Special thanks go to Mauricio Baker for his excellent mentoring and guidance, and to Elise Racine for her support and helpful advice.



Discuss

Страницы