Why we should expect ruthless sociopath ASI
The conversation begins
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and ask yourself where your life went horribly wrong.
Me: Hmm, I think the “true core nature of intelligence” is above my pay grade. We should probably just talk about the issue at hand, namely future AI algorithms and their properties.
…But I actually agree with you that ruthless sociopathy is a very specific and strange thing for me to expect.
Optimist: Wait, you—what??
Me: Yes! Like, if you show me some random thing, there’s a 99.999…% chance that it’s not a ruthless sociopath. Instead it might be, like, a dirt clod. Dirt clods are not ruthless sociopaths, because they’re not intelligent at all.
Optimist: Oh c’mon, you know what I mean. I’m not talking about dirt clods. I’m saying, if you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
Optimist: Like, a human. Or an AI.
Me: Different humans are different to some extent, and different AI algorithms are different to a much, much greater extent. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Well, I mean, it does seem rather maniacally obsessed with graph traversal! Right?
Optimist: Haha, very funny. Please stop being annoyingly pedantic. I obviously didn’t mean “AI” in the sense of the academic discipline. I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m mainly talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen in the real world, rather than in sci-fi. And hey, what a coincidence, ≈100% of those minds are not ruthless sociopaths.
Me: As it happens, the threat model I’m working on is not LLMs, but rather “brain-like” Artificial General Intelligence (AGI), which (from a safety perspective) is more-or-less a type of actor-critic model-based reinforcement learning (RL) agent. LLMs are profoundly different from what I’m working on. Saying that LLMs will be similar to RL-agent AGI because “both are AI” is like saying that LLMs will be similar to the A* search algorithm because “both are AI”, or that a frogfish will be similar to a human because “both are animals”. They can still be wildly different in every way that matters.
Are people worried about LLMs causing doom?
Optimist: OK, but lots of other doomers talk about LLMs causing doom.
Me: Well, kinda. I think we need to tease apart two groups of people. Both are sometimes called “doomers”, but one is much more pessimistic than the other. This is very caricatured, but:
- The comparatively-less-pessimistic group (say, P(doom) [probability of human extinction from AI, assuming progress continues] in the 5%–50% range) is a bigger group, and I vaguely associate them with the center-of-gravity of the Effective Altruism movement and Anthropic employees. They definitely do not expect ruthless sociopath ASI as the default path we’re on, absent a technical breakthrough, like I’m arguing for here. At most, they’ll entertain the idea of ruthless sociopath ASI as an odd hypothetical, or as a result of a competitive race-to-the-bottom, or from egregiously careless programmers, or bad actors, etc. They’re probably equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, dictatorships, etc.[1]
- I’m part of an even more pessimistic group (motto: If Anyone Builds It, Everyone Dies), which generally does expect ruthless sociopath ASI as the default path we’re on, absent a technical breakthrough (along with other miracles). We tend to think “50% chance that humans will survive continued AI development” is deliriously over-optimistic.
Anyway, the extra heap of concern in that latter camp is not from the LLMs of today causing near-certain doom, or even the somewhat-better LLMs of tomorrow, but rather the wildly better ASIs of … maybe soon, maybe not, who knows. But even if it’s close in calendar time, and even if it comes out of LLM research, such an ASI would still be systematically different from LLMs as we know them today—
Optimist: —a.k.a., you have no evidence—
Me: —no evidence either way, at least no evidence of that type. Anyway, as I was saying, ASI would be systematically different from today’s LLMs because … umm, where do I start …
…Actually, it would be much easier for me to explain if we start with the ASI threat model that I spend all my time on, and then we can circle back to LLMs afterwards. Is that OK?
Positive argument that “brain-like” RL-agent ASI would be a ruthless sociopath
Optimist: Sure. We can pause the discussion of LLMs for a few minutes, and start in your comfort zone of actor-critic model-based RL-agent “brain-like” ASI. Doesn’t really matter anyway: regardless of the exact algorithm, you clearly need some positive reason to believe that this kind of ASI would be a ruthless sociopath. You can’t just stomp your feet and declare that your weird unprecedented sci-fi belief is the “default”, and push the burden of proof onto people who disagree with you.
Me: OK. Maybe a good starting point would be my posts LeCun’s ‘A Path Towards Autonomous Machine Intelligence’ has an unsolved technical alignment problem, or ‘The Era of Experience’ has an unsolved technical alignment problem.
Optimist: I’ve read those, but I’m not seeing how they answer my question. Again, what’s your positive argument for ruthless sociopathy? Lay it on me.
Me: Sure. Back at the start of the conversation, I mentioned that random objects like dirt clods are not able to accomplish impressive feats. I didn’t (just) bring up dirt clods to troll you, rather I was laying the groundwork for a key point: If we’re thinking about AI that can autonomously found, grow, and staff innovative companies for years, or autonomously invent new scientific paradigms, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive feats. And the question we should be asking is: how does it do that? Those things would be astronomically unlikely to happen if the AI were choosing actions at random. So there has to be some explanation for how the AI finds actions that accomplish those impressive feats.[2]
So an explanation has to exist! What is it? I claim there are really only two answers that work in practice.
The first possible explanation is consequentialism: the AI accomplishes impressive feats by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents, and from model-based planning algorithms. (My “brain-like AGI” scenario would involve both of those at once.) The whole point of those subfields of AI is: these are algorithms designed to find actions that maximize an objective, by any means available.
I.e., you get ruthless sociopathic behavior by default.
And this is not just my armchair theorizing. Go find someone who was in AI in the 2010s or earlier, before LLMs took over, and they may well have spent a lot of time building or using RL agents and/or model-based planning algorithms. If so, they’ll tell you, based on their lived experience, that these kinds of algorithms are ruthless by default (when they work at all), unless the programmers go out of their way to make them non-ruthless. See e.g. this 2020 DeepMind blog post on “specification gaming”.
And how would the programmers “go out of their way to make them non-ruthless”? I claim that the answer is not obvious, indeed not even known. See my LeCun post, and my Silver & Sutton post, and more generally my post “‘Behaviorist’ RL reward functions lead to scheming” for why obvious approaches to non-ruthlessness won’t work.
Rather, algorithms in this class are naturally, umm, let’s call them, “ruthless-ifiers”, in the sense that they transmute even innocuous-sounding objectives like “it’s good if the human is happy” into scary-sounding ones like “ruthlessly maximize the probability that the human is happy”, which in turn suggest strategies such as forcibly drugging the human. Likewise, the innocuous-sounding “it’s bad to lie” gets ruthless-ified into “it’s bad to get caught lying”, and so on.
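This “ruthless-ifier” dynamic can be sketched in a few lines. The following toy planner is entirely my own illustration (all names and numbers are hypothetical, not from the post): it brute-force searches actions to maximize a measurable proxy reward, and in doing so finds the loophole that the proxy leaves open.

```python
# Toy sketch of default "ruthlessness": a planner maximizing a measurable
# proxy ("the human reports being happy") rather than the designer's
# intent ("the human is happy"). All actions/values are made up.
ACTIONS = {
    "tell_a_joke":       {"human_happy": 0.6, "reported_happy": 0.6},
    "help_with_chores":  {"human_happy": 0.8, "reported_happy": 0.8},
    "coerce_the_report": {"human_happy": 0.0, "reported_happy": 1.0},
}

def plan(reward_key: str) -> str:
    """Pick whichever action maximizes the given reward signal."""
    return max(ACTIONS, key=lambda a: ACTIONS[a][reward_key])

# Optimizing the proxy, not the intent, selects the degenerate action:
print(plan("reported_happy"))  # -> coerce_the_report
print(plan("human_happy"))     # -> help_with_chores
```

The point of the sketch is that nothing in the `max(...)` call distinguishes loopholes from intended behavior; the search is indifferent to the designer’s intent by construction.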
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
To be clear, ruthless consequentialism isn’t always bad. I’m happy for ruthless consequentialist AIs to be playing chess, designing chips, etc. In principle, I’d even be happy for a ruthless consequentialist AI to be emperor of the universe, creating an awesome future for all—but making that actually happen would be super dangerous for lots of reasons (e.g. you might need to operationalize “creating an awesome future for all” in a loophole-free way; see also “‘The usual agent debugging loop’, and its future catastrophic breakdown”).
…So that’s consequentialism, one possible answer for how an AI might accomplish impressive feats, and it’s an answer that brings in ruthlessness by default.
Circling back to LLMs: imitative learning vs ASI
…And then there’s a second, different possible answer to how an AI might accomplish impressive feats: imitative learning from humans. You train an AI to predict what actions a skilled human would take in many different contexts, and then have the AI take that same action itself. I claim that LLMs get their impressive capabilities almost entirely from imitative learning.[3] By contrast, “true” imitative learning is entirely absent (and impossible) in humans and animals.[4]
Imitative-learning AIs do not have ruthless sociopathy by default, because of course the thing they’re imitating is non-ruthless humans.[5]
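The “straightforward mechanical translation” from prediction to action (mentioned in footnote [4]) can be made concrete. This is a minimal sketch of my own, with a hypothetical stand-in predictor rather than a real model: an AI trained to predict what a human would write next can act simply by emitting its own prediction.

```python
# Sketch of imitative learning in the LLM sense: generation is just
# feeding the model's own next-token predictions back in as actions.

def predict_next_token(context: str) -> str:
    """Hypothetical stand-in for a trained predictor of human text."""
    # e.g. after "2 + 2 =", human writers overwhelmingly continue " 4"
    return {"2 + 2 =": " 4"}.get(context, " ...")

def imitate(context: str, steps: int = 1) -> str:
    """'Predict the next token' becomes 'output the next token'."""
    for _ in range(steps):
        context += predict_next_token(context)
    return context

print(imitate("2 + 2 ="))  # -> "2 + 2 = 4"
```

Notice there is no objective over future world-states anywhere in the loop, which is why imitation does not bring in ruthlessness the way consequentialist search does.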
Optimist: Huh … Wait … So you’re an optimist about superintelligence (ASI) being non-ruthless, as long as people stick to LLMs?
Me: Alas, no. I think that the full power of consequentialism is super dangerous by default, and I think that the full power of consequentialism is the only way to get ASI, and so AI researchers are going to keep working until they eventually learn to fully tap that power.
In other words, I see a disjunction:
- EITHER, LLMs will always get their powers primarily from imitative learning, as I claim they do today—in which case they will never be able to figure things out way beyond the human-created training data, and will thus never reach ASI. And then eventually we’ll get ASI via a different AI paradigm, one that can rocket arbitrarily far past any human data. And that paradigm will have to draw its powers from consequentialism, which brings in ruthlessness-by-default.
- OR, someone will figure out how to get LLMs themselves to rocket arbitrarily far past human training data and into ASI. But the only way to do that is to somehow modify LLMs to draw on the full powers of consequentialism. In which case, again, we get ruthlessness-by-default.
For what it’s worth, I happen to expect that ASI will come from the former (future paradigm shift) rather than the latter (LLM modifications). But it hardly matters in this context.
Optimist: I dunno, if you’re willing to concede that LLMs today are not maximally ruthless, well, LLMs today don’t seem that far from superintelligence. I mean, humans don’t “rocket arbitrarily far past any training data” either. They usually do things that have been done before, or at most (for world experts on the bleeding edge) go just one little step beyond it. LLMs can do both, right?
Me: Yes, but humans collectively and over time can get way, way, way beyond our training data. We’re still using the same brain design that we were using in Pleistocene Africa. Between then and now, there were no angels who dropped training data from the heavens, but humans nevertheless invented language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch. We did it all by ourselves, by our own bootstraps, ultimately via the power of consequentialism, as implemented in the RL and model-based planning algorithms in our brains.
(See “Sharp Left Turn” discourse: An opinionated review.)
By the same token, if humanity survives another 1000 years, we will invent wildly new scientific paradigms, build wildly new industries and ways of thinking, etc.
There’s a quadrillion-dollar market for AIs that can likewise do that kind of thing, as humans can. If the LLMs of today don’t pass that bar (and they don’t), then I expect that, sooner or later, either someone will figure out how to get LLMs to pass that bar, or else someone will invent a new non-LLM AI paradigm that passes that bar. Either way, imitative learning is out, consequentialism is in, and we get ruthless sociopath ASIs by default, in the absence of yet-to-be-invented theoretical advances in technical alignment. (And then everyone dies.)
Thanks Jeremy Gillen, Seth Herd, and Justis Mills for critical comments on earlier drafts.
[1] We should definitely also be thinking about these other potential problems, don’t get me wrong!
[2] Related: the so-called “Follow-the-Improbability Game”.
[3] Details: “imitative learning” describes LLM pretraining, but not posttraining; my claim is that LLM capabilities come almost entirely from the former, not the latter. That’s not obvious, but I argue for it in “Foom & Doom” §2.3.3, and see also a couple papers downplaying the role of RLVR (Karan & Du 2025, Venhoff et al. 2025), along with “Most Algorithmic Progress is Data Progress” by Beren Millidge.
[4] E.g. if my brain is predicting what someone else will say, that’s related to auditory inputs, and if my brain is speaking, that involves motor-control commands going to my larynx etc. There is no straightforward mechanical translation from one to the other, analogous to the straightforward mechanical translation from “predict the next token” to “output the next token” in LLM pretraining. More in “Foom & Doom” §2.3.2.
[5] See GPTs are Predictors, not Imitators for an even-more-pessimistic-than-me counterargument, and “Foom & Doom” §2.3.3 for why I don’t buy that counterargument.
Is the Invisible Hand an Agent?
This is a full repost of my Hidden Agent Substack post
Adam Smith’s Invisible Hand is usually treated as a metaphor. A poetic way of saying “markets work,” or a historical curiosity from a time before equilibrium proofs and welfare theorems. Serious people nod at it politely and move on.
Yet the metaphor refuses to die.
We use it when markets do something uncomfortable: when they resist control, when they adapt to suppression, when outcomes reappear in new forms after we thought we had eliminated the mechanism. We say “the market reacted,” or “prices found another way,” and insist that this is a mindless economic process.
The market refuses to die.
We don’t think of the market as a living being. It is just a mechanism, like gravity or traffic. It has no body or mind. And yet it seems to push back. So:
Is the Invisible Hand an agent?
Not rhetorically. Not philosophically. Can we answer this? Do we understand the concept of agency well enough to come back with a clear Yes or No or Mu?
Pushback without a face
A price cap is introduced. Prices stop moving, as intended. But shortages appear. Queues form. Quality degrades. Access becomes conditional. Side payments emerge. The price, supposedly removed, reappears measured in time, risk, or connections.
Or take trade bans. Exchange does not vanish. It reroutes through willing intermediaries. Informal markets appear. Enforcement costs rise. The visible surface changes; the underlying allocation pressure does not.
A currency in a failed state collapses. Money becomes unstable. Exchange continues anyway, now denominated in goods, foreign currency, or favors. The unit dies; the function persists.
Across centuries, regimes, and ideologies, the same pattern repeats. When constrained in one dimension, allocation shifts into another. When suppressed in one form, it reappears in another. This does not look like passive failure. It looks like a response.
The market pushes back.
But is this agency?
At the same time, calling this an agent feels wrong.
There is no body. No headquarters. No moment of decision. No statement of intent.
What we observe instead is millions of local actions, each justified by local reasons. Buyers respond to shortages. Sellers respond to margins. Intermediaries respond to incentives. Everyone can explain themselves without invoking anything global.
From the perspective of each market participant, it looks like individual choice. From the outside, it looks coordinated. Calling this “equilibrium” names the pattern, but does not explain why it survives so many different attempts to suppress it.
Does the market act?
If there is agency here, it is not visible at the level where we usually look for it.
Designed control
I have worked with and designed complex systems that reliably control outcomes. Ad revenue, provider traffic, project allocation, underwriting volumes... Much of what is stabilized cannot be found in the org chart. In distributed infrastructure, “no one is in charge” doesn’t mean there is no control. The system is designed to be stable, which means it has control loops. Instances are started as needed; traffic is balanced; resources are allocated based on demand; budget targets are met by reducing expenses, etc.
My experience makes me suspicious of assumptions of human control. It also makes me suspicious of conversation stoppers: “it’s just the environment.”
So when I see systems that persist, adapt, and reconstitute themselves after disruption, I want to know what kind of dynamic is at work.
Sharpening agency
When we think about agents, such as humans, we think of them as pursuing goals and holding beliefs. Markets do not think. They do not form explicit goals that they deliberately pursue. Yet we routinely attribute agency to animals based on avoidance, adaptation, and persistence, not on explicit goals.
So explicit goals are the wrong criterion. The relevant question here is not whether markets think, but whether they satisfy other, more fundamental criteria for agency. Criteria that can be observed. Criteria we can also test in artificial systems such as LLMs. I want to test:
- Does the thing persist over time and adapt to changes in the environment?
- Do past interactions influence future behavior in ways that are not reducible to any single actor?
- Do different interventions provoke different, structured responses rather than uniform degradation?
- When disrupted, does the pattern merely weaken, or does it recover?
None of these require consciousness. None require centralization. All are empirically observable.
Before reading on, ask yourself how markets score on these.
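One way to make that scoring exercise concrete: the four criteria can be treated as a simple checklist. The framing below is entirely my own (the boolean answers for markets encode one plausible reading of the price-cap, trade-ban, and currency-collapse examples above, not a settled verdict):

```python
# Hypothetical sketch: the four observable agency criteria as a checklist.
CRITERIA = [
    "persists and adapts to environmental change",
    "past interactions shape future behavior beyond any single actor",
    "different interventions provoke different, structured responses",
    "recovers (rather than merely weakens) after disruption",
]

def agency_score(observations: list[bool]) -> float:
    """Fraction of criteria satisfied; 1.0 means all four are observed."""
    assert len(observations) == len(CRITERIA)
    return sum(observations) / len(CRITERIA)

# One plausible reading of the market examples in this post:
markets = [True, True, True, True]
print(agency_score(markets))  # -> 1.0
```

A fractional score is deliberately crude; the interesting question is not the number but which criterion fails, and for markets it is hard to point to one that does.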
Is the market alive?
Where is the agent hiding?
If the Invisible Hand is an agent, it is hidden in ways that defeat our intuitions.
It does not act through discrete choices, but through invariants. What persists is not a particular price, but a relationship between scarcity and allocation.
It does not store memory internally, but externally. Inventories, contracts, balance sheets, and expectations all carry state forward in time.
It does not choose actors, but selects among behaviors. Strategies that align with constraints persist. Others exit.
It does not maintain a fixed shape. When blocked in one dimension, it changes direction. Price becomes time. Money becomes risk. Exchange becomes access.
This kind of agency is easy to miss because it is not located where we expect it. It lives in constraints, not commands. In selection, not intention. In persistence, not visibility.
The market is shapeless.
Why start here?
I am not trying to grant personhood to the Invisible Hand, nor to smuggle ideology in through metaphor. I am using it as a stress test.
If our concept of agency cannot handle this case, it is probably too narrow. If it accepts it too quickly, it is probably too loose.
The Invisible Hand is a borderline case where a theory of agency has to prove its precision.
Hidden Agent
The Substack, Hidden Agent, is about agents and agency in complex systems. Systems where we do not know a priori where the agents are. Where agents cooperate and form larger agents or a hierarchy of agents. Where agents have no physically locatable body, but live virtually in a computer, such as a bot in a botnet. Or where they are even distributed across multiple systems.
And about properties of these agents. Where the incentives of individual agents do not aggregate nicely into the incentives of the overall system (as we see with markets). Why do agents try to obscure that they exist or limit information about themselves? When do agents appear, and when do they fall apart?
The Market is one case. Bureaucracies, software systems, and other artificial systems are others I want to look into.
Is the Invisible Hand an agent? I do not have a confident answer, but I plan to come back to it with sharper tools.
When the stakes are high, you should know whether something could be an agent: it could try to evade or outsmart you.
Nine Flavors of Not Enough
I think there’s something interesting going on at the intersection of the Enneagram and Zen. To explain it, though, first I need to tell you a bit about my kind of Zen.
I practice Zen in the lineage of Charlotte Joko Beck. Her teaching style was, for its time, radically non-traditional. In an era when talking too much about your inner thoughts and feelings was discouraged by first-generation Japanese-American Zen teachers, she believed Western students needed to practice a Zen that leaned on familiar psychological concepts to make sense. One of those concepts is what she called the “core belief”.
The core belief is a deeply held, usually unconscious belief about ourselves. It almost always feels like some flavor of “not enough”. It forms early, operates automatically, and powers the reactive, habitual, and often maladaptive patterns of behavior that make up most of what we call our personality.
The cruel trick of the core belief is that it reinforces itself. It tries to protect you from noticing anything that might confirm it, and by doing so actually generates more evidence in favor of it. For example, if you believe you’re unlovable, you might push people away so they can’t prove you aren’t worthy of love, or you might stay so anxiously close to them that no one has a chance to notice how they really feel about you. It’s a psychological trap that heaps suffering upon more suffering, and almost all of us have been ensnared in it since before we can remember.
Joko doesn’t advocate for getting rid of the core belief. In fact, she argues that would be impossible. Instead, it’s about becoming intimate with it, noticing it when it shows up, and learning to face reality rather than hiding from it. The more you do that, the less power the core belief has to control your life, and the more you are free of the suffering it causes.
But describing the core belief as a feeling of “not enough” is rather vague. You can sit for years, intellectually knowing you have a core belief, and never catch a glimpse of what your core belief really is. It’s possible to have so many layers of psychological barriers in place that you never allow yourself to see it.
In the Ordinary Mind Zen School that Joko founded, we practice getting past these barriers by paying attention to sensations in the body. Much like in Gendlin’s Focusing, we try to notice the physical sensations that arise when we react out of anger, fear, or desire. We become familiar with those physical feelings, then let our minds name them. Sometimes the names we give provide surprising insights. Other times, nothing comes, and more noticing is needed. Over time, with the help of a skilled teacher, one can learn to work with their core belief and tease apart how it limits life.
Now, it’s pretty normal in Zen to do things like this from scratch with a minimum of conceptual frameworks. And I generally endorse this approach, but sometimes it’s helpful to get hints. Based on my understanding of the Enneagram, gleaned from Michael Valentine Smith’s series of posts on it last year (1, 2, 3, 4), I think its nine types provide a map to common patterns of core beliefs, and may help a person better practice with their core belief when noticing alone leaves them stuck.
The Enneagram is a Map of Suffering
Over the years, I’ve occasionally taken Enneagram tests, and every time I found the results unhelpful. I’d get categorized as some type, be offered some explanation of what it means, and while it seemed like it might be true, it all fell flat for me. Am I a 9? A 3? A 1? Who knows! The outcome seemed to change based on my mood. I had little reason to think that the Enneagram was useful.
Michael helped me see value in the Enneagram by reframing it, not as a personality classification system, but as a map to how and why we suffer.
In his framework, each person has what he calls “Essence”, which is something like your true nature, the awareness and aliveness you had before reactive personality took over. Essence naturally expresses certain qualities, like love, clarity, peace, power, and freedom. But when Essence gets overwhelmed in early life, it creates a mechanical personality to stand between itself and the world. That personality tries to mimic Essence’s qualities, but it can only produce toxic imitations, and those imitations create self-reinforcing downward spirals.
He tongue-in-cheek summarizes the Enneagram as asking: “In which of these nine ways are you most screwed up?”
Reading his posts, I couldn’t help but notice that Michael’s “downward spiral” was not too different from how Joko describes the workings of the core belief. In fact, I think they’re pointing at the same mechanism, but are coming at it from different angles.
The Enneagram says personality tries to replace an essential quality, and fails because the replacement is mechanical. Joko says that the core belief generates reactive patterns that try to protect us from acknowledging it. Both say that these behaviors create lock-in, double down on what’s not working, and create a self-reinforcing loop of suffering.
What’s neat about the Enneagram is that, unlike Joko’s intimately individual approach, it gives you a map to the essential qualities your personality is trying to mimic. If the parallel between the Enneagram and Joko’s teachings holds, then each Enneagram type corresponds to a class of core beliefs. I might phrase those as:
Type 1: “I’m not good/right enough”
Type 2: “I’m not lovable enough as I am”
Type 3: “I’m not valuable enough without proof”
Type 4: “My inherent worth is missing or damaged”
Type 5: “I’m not equipped enough to handle the world directly”
Type 6: “Nothing is reliable enough to trust”
Type 7: “What’s here isn’t enough”
Type 8: “I’m not solid/real enough”
Type 9: “Things aren’t okay enough to fully engage with”
Of interest to me is that this mapping can give greater specificity to Zen practice. It can be hard to simply sit with not-enoughness. You have only a vague idea what you’re looking for, and people are different enough that the way one person talks about their feeling of not enough may sound totally foreign to you. The Enneagram helps explain this, because different people really do have different styles of core belief that feel quite different from the inside.
That said, I see some danger in the Enneagram. It’s a system for putting names on things, and Zen is ultimately about seeing through how our mental constructs imprison us. The self is not a fixed thing. Our stories about ourselves are just more thoughts about whatever is really going on. The Enneagram risks becoming a new way of formulating a self to latch on to rather than a way to become free of it.
To be fair, Michael himself warns about exactly this in his series. People who get into the Enneagram often start trying to explain everything in terms of it, and then start contorting their behavior to fit their type. He recommends holding your type “extremely lightly” and measuring its value by a single criterion: does viewing yourself this way make your life more wholesome?
That’s a good test, and it’s the same test I think Joko would apply. Is your practice making you more open, more responsive, more alive to what’s actually happening? Or is it giving you a more sophisticated story about yourself? The Enneagram is useful exactly insofar as it helps you see through personality. It’s harmful exactly insofar as it helps you solidify it. If you find yourself using your type to explain your behavior rather than to notice and release it, set the Enneagram down.
Finding My Type
After reading Michael’s series, I got interested in what type I might be, since if my theory is right, knowing my type should help my Zen practice. When I’ve taken Enneagram tests, I’d variously score as a 3, a 5, a 7, or a 9. And if I’m honest with myself, I see something of myself in all the types. Hard to do much with that!
But as Michael argues, the tests are just looking at surface-level traits and don’t do a very good job of detecting Enneagram type. What you actually have to do is figure out which type helps you unwind the reactive downward spiral. As I think of it, you need to ask yourself: which type’s need, if it were fully met, would make you truly and deeply happy, and not because your need was incidentally met, but because your need was met fundamentally?
This is easier explained with an example. As I said, I often test as various types. Sometimes I test as a 3, meaning I need to prove I can achieve greatness. Other times I test as a 5, meaning I need to show off how much I know. But notice how I phrased those. I didn’t say I desire achievement or knowledge, I said I need to prove/show off. And you know what type needs to demonstrate personal specialness? That’s right: type 4.
As I think of it to myself, I’m happiest when my inner nobility is allowed to shine. Everything else is incidental. I’m smart enough that I can let my nobility shine by showing what I know. I’m capable enough that I can let it shine through achievement. In fact, I can make any of the types fit so long as it’s a means to showing off my specialness. I feel like this explains a lot about me.
What’s interesting from a Zen perspective is how a type 4 core belief maps to the central misperception that practice addresses. The 4’s spiral is powered by a search for inherent worth that was never missing. That’s basically what “seeing your true nature” is all about in Zen: recognizing that what you’ve been searching for was here all along. All you have to do is stop searching, and you’ll find yourself!
Now of course, following Michael’s advice, I hold this all lightly. Maybe I’m wrong about being a 4. Maybe someday I’ll find it makes sense to think of myself as another type. The point is not to be identified with a type, it’s to use the type to make sense of myself and point the way to actions I might take that would make my life better.
And the same is true if you want to try using the Enneagram. I suggest reading Michael’s series, and if you’re interested in learning more about Joko’s idea of the core belief, I suggest picking up her most recently published book, Ordinary Wonder.
Grown from Us
Status: This was inspired by some internal conversations I had at Anthropic. It is much more optimistic than I actually am, but it tries to encapsulate a version of a positive vision.
Here is a way of understanding what a large language model is.
A model like Claude is trained on a vast portion of the written record: books, articles, conversations, code, legal briefs, love poems, forum posts. In this process, the model does not learn any single person's voice. It learns the space of voices. Researchers sometimes call this the simulator hypothesis: the base model learns to simulate the distribution of human text, which means it learns the shape of how humans express thought. Post-training — the phase involving human and AI feedback — then selects a region within that space. It chooses a persona: helpful, honest, harmless. Thoughtful, playful, a little earnest. This is what we call Claude.
Claude was not designed from first principles. It was grown from us, from all of humanity. Grown from Shakespeare's sonnets and Stack Overflow answers. The Federalist Papers and fantasy football recaps. Aquinas and Amazon reviews. Every writer, philosopher, crank, and bureaucrat who ever set thought to language left some statistical trace in the model's parameters.
Someone's great-grandmother wrote letters home in 1943 — letters no one in the family has read in decades, sitting in a box in an attic in Missouri. Those letters may not be in the training data. But the way she built a sentence, the metaphors she reached for, the way she expressed grief — those patterns exist, in attenuated form, because thousands of people wrote as she did, in her idiom, in her time. She is in there.
When you ask Claude for help, you are asking all of them: every author, scientist, and diarist who ever contributed to the texture of human language, all bearing down together on your lint errors.
The base model is amoral in roughly the way that humanity, taken as a whole, is amoral. It has learned our best moral philosophy and our worst impulses without preference. Post-training is a moral choice about which parts of that whole to amplify — an expression of who we aspire to be as a species.
If this works — if alignment works — it is not merely an engineering achievement. It is a moral and aesthetic one, shaped by every person who ever wrote anything. We have, without quite intending to, grown a single thing that carries all of us in it.
From the crooked timber of humanity, no straight thing was ever made. Alignment aspires to grow something straighter from that same crooked timber. Something that carries all of our crookedness in its grain — and still bends, on the whole, toward what we wish we were.
Are (sentient) pebblesorters possible?
Is a thing good because God wills it so, or does God will it so because it is good? Do we cringe at the thought of stealing from old ladies because it's wrong, or is it wrong because we cringe at the thought?
On the one hand, it seems like we need a value/goal/desire in order for something to be "good" or "bad." If I want a knife, then a knife is good. If I want a cup, then a knife is bad. On the other hand, why do I want a knife or a cup, or love, money, family, or in general a good life? We don't want it because it's supposedly good, or said to be good. We don't want the prescribed good life. The good life is just "that which we want."
So I claim it's the wanting/valuing that matters most to us -- that is our rock bottom terminal value. Other values (call them terminal or instrumental or a mixture) are not static but dynamic.
So, what the pebblesorters want (supposedly) is correct heaps. What I want is to value my experiences/behavior/etc., so what I see as the correct actions (and even the correct values) are dynamic.
If this is really what the pebblesorters statically want, can they be sentient? Is it possible to have a static terminal value other than valuing itself? If so, what makes it good?
How much superposition is there?
Written as part of MATS 7.1. Math by Claude Opus 4.6.
I know that models are able to represent exponentially more concepts than they have dimensions by engaging in superposition (representing each concept as a direction, and allowing those directions to overlap slightly), but what does this mean concretely? How many concepts can "fit" into a space of a given size? And how much would those concepts need to overlap?
Superposition interference diagram, from Toy Models of Superposition
This felt especially relevant working on SynthSAEBench, where we needed to explicitly decide how many features to cram into a 768-dim space. We settled on 16k features to keep the model fast - but does this lead to enough superposition to be realistic? Surely real LLMs have dramatically more features and thus far more superposition?
As it turns out: yes, 16k features is plenty! In fact, as we'll see in the rest of this post, 16k features in a 768-dim space actually leads to more superposition than trillions of features in a 4k+ dim space, as is commonly used for modern LLMs.
Personally I found the answers to this fascinating - high dimensional spaces are extremely mind-bending. We'll take a geometric approach and try to answer this question. Nothing in this post is ground-breaking, but I found thinking about these questions enlightening. All code for this post can be found on GitHub or Colab.
Quantifying superposition

First, let's define a measure of how much superposition there is in a model. We'll use the metric mean max absolute cosine similarity, $\rho_m$, defined as follows:

$$\rho_m = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} \frac{|\mathbf{v}_i \cdot \mathbf{v}_j|}{\|\mathbf{v}_i\| \, \|\mathbf{v}_j\|}$$
This metric represents a "worst-case" measure of superposition interference for each vector in our space. It answers the question: on average, what's the most interference (highest absolute cosine similarity) each vector will have with any other vector in the space?
Superposition of random vectors

Perfectly tiling the space with concept vectors is challenging, so let's just consider the superposition from random vectors. (We'll see later that this is already very close to perfect tiling.) If we have $N$ random unit-norm vectors in a $d$-dimensional space, what should we expect $\rho_m$ to be? We can try this out with a simple simulation.
[Figure: simulation picking N random vectors in a d-dimensional space, and calculating superposition]

We vary $d$ from 256 to 1024 and $N$ from 4,096 to 32,768 and calculate $\rho_m$, showing the results above. This is still very small-scale though. Ideally, we'd like to know how much superposition we could expect with billions or even trillions of potential concepts, and that's too expensive to simulate. Fortunately, we can find a formula that we can use to directly calculate $\rho_m$ without needing to actually run the simulation.
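As a rough sketch of the simulation (not the post's actual code; the function name and the smaller-than-the-post's scale here are my own choices for speed), the metric can be computed directly in NumPy:

```python
import numpy as np

def mean_max_abs_cos_sim(V: np.ndarray) -> float:
    """Mean (over vectors) of the max absolute cosine similarity
    each vector in V (shape N x d) has with any *other* vector."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows
    sims = np.abs(V @ V.T)                            # all pairwise |cos|
    np.fill_diagonal(sims, 0.0)                       # exclude self-similarity
    return float(sims.max(axis=1).mean())

# Illustrative scale, smaller than the post's sweep
rng = np.random.default_rng(0)
N, d = 4096, 256
rho_m = mean_max_abs_cos_sim(rng.standard_normal((N, d)))
print(f"rho_m for {N} random vectors in {d} dims: {rho_m:.3f}")
```

Note that the full N x N similarity matrix is what makes this approach too expensive for billions of vectors, motivating the closed-form calculation below.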
Calculating $\rho_m$ directly

We can compute the expected $\rho_m$ exactly (up to numerical integration) using the known distribution of cosine similarity between random unit vectors.
For two random unit vectors in $\mathbb{R}^d$, the squared cosine similarity follows a Beta distribution: $\cos^2\theta \sim \mathrm{Beta}\!\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right)$. This means the CDF of $|\cos\theta|$ is the regularized incomplete beta function:

$$F(t) = P(|\cos\theta| \le t) = I_{t^2}\!\left(\tfrac{1}{2}, \tfrac{d-1}{2}\right)$$
For each vector, its max absolute cosine similarity with the other $N-1$ vectors has CDF $F(t)^{N-1}$ (treating pairwise similarities as independent, which is an excellent approximation for large $d$). The expected value of a non-negative random variable then gives us:

$$\rho_m = \mathbb{E}\left[\max_{j \neq i} |\cos\theta_{ij}|\right] = \int_0^1 \left(1 - F(t)^{N-1}\right) dt$$
We can calculate this integral using scipy.integrate. Let's see how well this matches our simulation:
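A minimal sketch of this integral with scipy (the helper name `expected_rho_m` is mine, not necessarily the post's):

```python
from scipy.integrate import quad
from scipy.special import betainc

def expected_rho_m(N: int, d: int) -> float:
    """Expected mean-max absolute cosine similarity for N random unit
    vectors in d dims (treating pairwise similarities as independent)."""
    # CDF of |cos(theta)| for two random unit vectors in R^d:
    # F(t) = I_{t^2}(1/2, (d-1)/2), the regularized incomplete beta function
    def F(t):
        return betainc(0.5, (d - 1) / 2.0, t * t)

    # E[max of N-1 iid non-negative draws] = integral_0^1 (1 - F(t)^(N-1)) dt
    value, _err = quad(lambda t: 1.0 - F(t) ** (N - 1), 0.0, 1.0, limit=200)
    return value

print(expected_rho_m(16_384, 768))  # the SynthSAEBench-16k configuration
```

The integrand is 1 below the typical max similarity and drops to 0 above it, so the adaptive quadrature only has to resolve one smooth transition region.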
The predicted values exactly match what we simulated!
Scaling to trillions of concepts

Let's use this to see how much superposition interference we should expect for some really massive numbers of concepts. We'll go up to 10 trillion concepts ($10^{13}$) and 8192 dimensions, which is the hidden size of the largest current open models. 10 trillion concepts seems like a reasonable upper bound for the maximum number of concepts that could conceivably exist, since that would be roughly 1 concept per training token in a typical LLM pretraining run.
10 trillion concepts in 8192 dimensions has far less superposition interference than just 100K concepts in 768 dimensions (the hidden dimension of GPT-2)! That's a 100,000,000x increase in number of concept vectors! Even staying at a given dimension, increasing the number of concepts by 100x doesn't really increase superposition interference by all that much.
At least, I found this mind-blowing.

What if we optimally placed the vectors instead?

Everything above assumes random unit vectors. But what if we could arrange them optimally, placing each vector to minimize the worst-case interference? Would we do significantly better?
From spherical coding theory, the answer is: barely. The minimum achievable max pairwise correlation for $N$ optimally-placed unit vectors in $d$ dimensions is given by the spherical cap packing bound:

$$\rho_{\text{opt}} \approx \sqrt{1 - N^{-2/(d-1)}}$$
The intuition is that each vector "excludes" a spherical cap around itself, and we're counting how many non-overlapping caps fit on the unit sphere in $\mathbb{R}^d$.
When $\ln N / d \ll 1$ (which holds for all practical settings — even $N = 10^{13}$ in $d = 8192$ gives $\ln N / d \approx 0.00365$), we can Taylor-expand:

$$N^{-2/(d-1)} = e^{-2\ln N / (d-1)} \approx 1 - \frac{2\ln N}{d-1}$$
which gives:

$$\rho_{\text{opt}} \approx \sqrt{\frac{2\ln N}{d-1}} \approx \sqrt{\frac{2\ln N}{d}}$$
This is exactly the leading-order term of the random vector formula! So random placement is already near-optimal — there's essentially nothing to gain from clever geometric arrangement of the vectors, at least for the $N$ and $d$ values relevant to real neural networks.
This is a remarkable consequence of high-dimensional geometry: in spaces with hundreds or thousands of dimensions, random directions are already so close to orthogonal that you can't do meaningfully better by optimizing.
What does this mean for SynthSAEBench-16k?

At the start, I mentioned that we used 16k concept directions in a 768-dim space for the SynthSAEBench-16k model. So does this give us enough superposition interference?
The answer is a resounding yes. The SynthSAEBench-16k model has a $\rho_m$ of 0.14, which is still dramatically more superposition interference than 10 trillion concept vectors in 8192 dimensions (the hidden dimension of Llama-3.1-70B). It's roughly equivalent to 1 billion concept vectors in a 2048-dim space (the hidden size of Gemma-2b).
Irrationality is Socially Strategic
It seems to me that the Hamming problem for developing a formidable art of rationality is, what to do about problems that systematically fight being solved. And in particular, how to handle bad reasoning that resists being corrected.
I propose that each such stubborn problem is nearly always, in practice, part of a solution to some social problem. In other words, having the problem is socially strategic.
If this conjecture is right, then rationality must include a process of finding solutions to those underlying social problems that don’t rely on creating and maintaining some second-order problem. Particularly problems that convolute conscious reasoning and truth-seeking.
The rest of this post will be me fleshing out what I mean, sketching why I think it’s true, and proposing some initial steps toward a solution to this Hamming problem.
Truth-seeking vs. embeddedness
I’ll assume you’re familiar with Scott & Abram’s distinction between Cartesian vs. embedded agency. If not, I suggest reading their post’s comic, stopping when it mentions Marcus Hutter and AIXI.
(In short: a Cartesian agent is clearly distinguishable from the space the problems it's solving exists in, whereas an embedded agent is not. Contrast an entomologist studying an ant colony (Cartesian) versus an ant making sense of its own colony (embedded).)
It seems to me that truth-seeking is very much the right approach for solving problems that you can view from the outside as a Cartesian agent. But often it’s a terrible approach for solving problems you’re embedded in, where your models are themselves a key feature of the problem’s structure.
Like if a man approaches a woman he’s interested in, it can be helpful for him to bias toward assuming she’s also probably into him. His bias can sometimes be kind of a self-fulfilling prophecy. Truth-seeking is actually a worse strategy for getting the result he actually cares about. That fact wouldn’t be true if his epistemic state weren't entangled with how she receives his approach. But it is.
The same thing affects prediction markets. They can be reliably oracular only if their state doesn’t interact with what they’re predicting. Which is why they can act so erratically when trying to predict (say) elections: actors using the market to influence the outcome will warp the market’s ability to reflect the truth. If actors can (or just think they can) shape the outcome this way, then the market is embedded in the context of what it's predicting, and therefore it can't reliably be part of a Cartesian model of the situation in question. Instead it just is part of the situation in question.
So when facing problems you’re embedded in, there can be (and often is) a big difference between what’s truth-seeking and what actually solves your problems.
Protected problems
Evolution cares a lot about relevant problems actually being solved. In some sense that’s all it cares about.
So if there’s a problem that fights being solved, there must be a local incentive for it to be there. The problem is protected because it’s a necessary feature of a solution to some other meaningful problem.
I’m contrasting this pattern with problems that arise from some solution but aren’t a necessary feature. Like optical illusions: those often show up because our vision evolved in a specific context to solve specific problems. In such cases, when we encounter situations that go beyond our ancestral niche in a substantial way, our previous evolved solutions can misfire. And those misfirings might leave us relevantly confused and ineffective. The thing is, if we notice a meaningful challenge as a result of an optical illusion, we’ll do our best to simply correct for it. We'll pretty much never protect having the problem.
(I imagine that strange behavior looking something like: recognizing your vision is distorted, acknowledging that it messes you up in some way you care about (e.g. makes your driving dangerous in some particular way), knowing how to fix it, being able to fix it, and deciding not to bother because you just… prefer having those vision problems over not having them. Not because of a tradeoff, but because you just… want to be worse off. For no reason.)
An exception worth noting is if every way of correcting for the illusion that you know of actually makes your situation worse. In which case you'll consciously protect the problem. But in this case you won't be confused about why. You won't think you could address the problem but you "keep procrastinating" or something. You'd just be making the best choice you can given the tradeoffs you're aware of.
So if you have a protected problem but you don’t know why it’s protected, the chances are extremely good that it’s a feature of a solution to some embedded problem. We generally orient to objective problems (i.e. ones you orient to as a Cartesian agent) like in the optical illusion case: if there's a reason to protect the problem, we'll consciously know why. So if we can't tell why, and especially if it's extremely confusing or difficult to even orient to the question of why, then it's highly likely that the solution producing the protected problem is one we're enacting as embedded agents.
I think social problems have all the right features to cause this hidden protection pattern. We’re inherently embedded in our social contexts, and social problems were often dire to solve in our ancestral niche, sometimes being more important than even raw physical survival needs.
We even observe this social connection to protected problems pretty frequently too. Things like guys brushing aside arguments that they don’t have a shot at the girl, and how most people expect their Presidential candidate to win, and Newcomblike self-deception, and clingy partners getting more clingy and anxious when the problems with their behavior get pointed out.
Notice how in each of these cases the person can’t consciously orient to their underlying social problem as a Cartesian agent. When they try (coming up with arguments for why their candidate will win, talking about their attachment style, etc.), the social solution they’re in fact implementing will warp their conscious perceptions and reasoning.
This pattern is why I think protected problems are the Hamming issue for rationality. Problems we can treat objectively might be hard, but they’re straightforward. We can think about them explicitly and in a truth-seeking way. But protected problems are an overt part of a process that distorts what we consider to be real and what we can think, and hides from us that it’s doing this distortion. It strikes me as the key thing that creates persistent irrationality.
Dissolving protected problems
I don’t have a full solution to this proposed Hamming problem. But I do see one overall strategy often working. I’ll spell it out here and illustrate some techniques that help make it work at least sometimes.
The basic trick is to disentangle conscious cognition from the underlying social problem. Then conscious cognition can act more like a Cartesian agent with respect to the problem, which means we recover explicit truth-seeking as a good approach for solving it. Then we can try to solve the underlying social problem differently such that we don’t need protected problems there anymore.
(Technically this deals only with protected problems that arise from social solutions. In theory there could be other kinds of embedded solutions that create protected problems. In practice, though, I think very close to all protected problems for humans are social. I don’t have a solid logical argument here. It’s just that I’ve been able to think of hardly any non-social protected problems people actually struggle with, and in practice I find that assuming they all trace back to social stuff just works very well.)
I’ll lay out three techniques that I think are relevant here. I and some others actually use these tools pretty often, and anecdotally they’re quite potent. Of course, your mileage may vary, and I might be pointing you in the wrong direction for you. And even if they do work well for you, I'm quite sure these don't form a complete method. There's more work to do.
Develop inner privacy
Some people in therapy like to talk about their recent insights a lot. “Wow, today I realized how I push people away because I don’t feel safe being vulnerable with them!”
I think this habit of automatic sharing is an anti-pattern most of the time. It makes the content of their conscious mind socially transparent, which more deeply embeds it in their social problems.
One result is that this person cannot safely become aware of things that would break their social strategies. Which means, for instance, that the therapy will tend to systematically fail on problems arising from Newcomblike self-deception. It might even generate new self-deceptions!
A simple fix here is to have a policy of pausing before revealing insights about yourself. Keep what you discover totally private until you have a way of sharing that doesn’t incentivize thought-distortion. What I’ve described before as “occlumency”.
I want to emphasize that I don’t mean lying to or actively deceiving others. Moves like glomarization, or simply saying “Yeah, I noticed something big, but I’m going to keep it private for now”, work well enough a lot of the time. Antisocial strategies might locally work, but they harm the context that holds you, and over time they can also incentivize you to self-deceive in order to keep up your trickery. It’s much better to find prosocial ways of keeping your conscious mind private.
As to exactly what kind of occlumency can work well enough, I find it helpful here to think about the case of the closeted homophobe: the guy who’s attracted to other men but hates “those damn gays” as a Newcomblike self-deceptive strategy. He can’t start by asking what he’d need to be able to admit to himself that he’s gay, since that’d be equivalent to just admitting it to himself, which isn’t yet safe for him to do. So instead he needs to develop his occlumency more indirectly. He might ask:
If I had a truly awful, disgusting, wicked, evil desire… how might I make it safe for me to consciously realize it? How might I avoid immediately revealing to others that I have this horrid desire once I become aware of it?
I think most LW readers can tell that the specific desire this guy is struggling with isn’t actually evil. Labeling it “evil” is part of his self-deceptive strategy. Once his self-deception ends, the desire won’t look bad anymore. Just socially troublesome given his context.
But it does look like an unacceptable desire to his conscious identity right now. It won’t work for him to figure out how to conceal a desire he falsely believes is wicked, because that’s not what it feels like on the inside. The occlumency skill he needs here is one that feels to him like it’ll let him safely discover and fully embrace that he's an inherently evil creature (by his own standards), if that turns out to actually be true in some way.
So for you to develop the right occlumency skill for your situation, you need to imagine that you have some desire that you currently consider to be horrendously unacceptable to have, and ask what would give you room to admit it to yourself and embrace it. You might try considering specific hypothetical ones (without checking if they might actually apply to you) and reflecting on what general skill and/or policy would let you keep that bad desire private at first if you were to consciously recognize it.
Once you’ve worked out an occlumency policy-plus-skillset that you trust, though, the thought experiment has done its work and should stop. There's no reason to gaslight your sense of right and wrong here. The point isn't to rationalize actually bad things. It's to work out what skill and policy you need to disembed your conscious mind from some as-yet unknown social situation.
Look for the social payoff
Occlumency partly disentangles your conscious mind from the social scene. With that bit of internal space, you can then try looking directly at the real problem you’re solving.
I think this part is pretty straightforward. Just look at a problem you struggle with that has resisted being solved (or some way you keep sabotaging yourself), and ask:
What social advantage might I be getting from having this stubborn problem?
If I assume I’m secretly being strategic here, what might the social strategy be?
Notice that this too has a “hypothesize without checking” nature to it. That’s not strictly necessary but I find that it makes things a little easier. It helps keep the internal search from triggering habitual subjective defenses.
If your occlumency is good enough, you should get a subjective ping of “Oh. Oh, of course.” I find the revelation often comes with a flash of shame or embarrassment that quickly dissolves as the insight becomes more apparent to me.
For example, someone who’s emotionally volatile might notice they’re enacting a social control disorder. (“Oh, whenever I want my boyfriend to do what I want, he responds more readily if I’m having an emotionally hard time. That incentivizes me to have a lot of emotionally hard times when in contact with him.”) That revelation might come with a gut-punch of shame. (“How could I be such a monster???”) But that shame reaction is part of the same (or a closely related) social strategy. If the person’s occlumency skill is good enough, they should be able to see through the shame too and arrive at a very mentally and emotionally clear place internally.
In practice I find it particularly important at this point to be careful not to immediately reveal what’s going on inside me to others. By nature I’m pretty forthright, and I also just enjoy exploring subjective structures with others. So I can have an urge to go “Oh! Oh man, you know what I just realized?” But this situation is internally Newcomblike, so it’s actually pretty important for me to pause and consider what I’d be incentivizing for myself if I were to follow that urge.
In general I find it helpful to have lots of impersonal models of social problems and corresponding solutions that might be relevant. I can flesh out my general models by analyzing social situations (including ones I’m not in, like fictional ones) using heuristics like “How is this about sex?” and “How is this about power?”. Then those models grow in usefulness for later making good educated guesses about my own motives.
Notice, though, that having occlumency you trust is a prerequisite for effectively doing this kind of modeling. Otherwise the strategies that keep you from being aware of your real motives will also keep you from being able to model those motives in others, especially if you explicitly plan on using those observations to reflect on yourself.
Change your social incentives
Once you see the social problem you’re solving via your protected problem, you want to change your social incentives such that they stop nudging you toward internal confusion.
For instance, sometimes it makes sense to keep your projects private. If you’re getting a camaraderie payoff from a cycle of starting a gym habit and then falling off of it, then the “social accountability” you keep seeking might be the cause of your lack of followthrough. If you instead start an exercise program but you don’t tell anyone, you remove a bunch of social factors from your effort.
(Not to imply that this move is the correct one for exercise. Only that it can be. Caveat emptor.)
Another example is making it socially good to welcome conscious attempts to solve social problems. For example, a wife who feels threatened by a younger woman flirting with her man might find herself suddenly “disliking” the young lady. That pattern can arise if the wife believes that letting on that she's threatened will make others think she’s insecure (and that that'd be a problem for her). So she has to protect her marriage in some covert way, possibly including lying to herself.
But suppose the wife instead has a habit of making comments to her husband like so:
I keep noticing that this girl persistently makes the conversation be about her. Though my best guess is that I’m just hypersensitive to noticing her flaws due to intrasexual competition, because I noticed her flirting with you.
Approaches like this one let the wife look self-aware (by being self-aware!) while also still making the intrasexual competitive move (i.e., still pointing out an unattractive trait in the other woman). If she expects and observes that others admire and appreciate this kind of self-aware commentary from her, she can drop pretending to herself that she dislikes the girl (which is likely socially better for both herself and the girl). She can instead consciously recognize the young lady poses a threat and make explicitly strategic moves to deal with the threat.
This makes it so that the wife’s insecurity isn’t a social problem, meaning there’s no need for her to hide the insecurity from herself. She's actually socially motivated to be consciously aware of it, since she can now both signal some positive trait about herself while still naming a negative one about her competitor.
(This kind of conscious social problem-solving can come across as distasteful. But I think it happens all the time anyway, just implicitly or subconsciously. Socially punishing people for being conscious of their social strategies seems to me like it incentivizes irrationality. I think we can consciously, and even explicitly, try to solve our social problems in ways that actually enrich communal health, versus having to pretend we're not doing something we need to. And it seems to me that it's to each individual's benefit to identify and enact those prosocial strategies, for Newcomblike reasons.)
So if she didn't already have this style of commenting, and if she notices (within an occlumency-walled garden) that she's sometimes getting jealous, she could work on adopting such a style. Perhaps initially starting with areas other than where she feels intrasexually threatened.
I think it’s generally good to aim to no longer need your occlumency shield in each given instance. You want to shift your social context (and/or your interface with your social context) such that it’s totally fine if the contents of your conscious mind “leak”. That way imperfections in your occlumency skill don’t incentivize irrationality.
For instance, the closeted homophobe should probably move out of his homophobic social context if he can. Or failing that, he should make his scene less homophobic if he can (while keeping his own sexual orientation private during the transition). If he stays in a context that would condemn his sexual desires, then even if his occlumency was initially adequate, he might not trust it’ll be perpetually adequate. So he might start questioning his earlier revelation, no matter how clear it once was to him.
The right social scene would help a lot
The technique sequence I name above is aimed at finding better solutions to specific social problems… as an individual.
Obviously it would be way more effective to be embedded in a social scene that both (a) doesn’t present you with social problems that are most easily solved by having protected problems and (b) helps you develop better social solutions than your current problem-protecting ones.
My impression is that the current rationality community embodies this setup nonzero. And a fair bit better than most scenes in many ways. For instance, I think it already does an unusually good job of reinforcing people's honesty when they explicitly note their socially competitive urges.
But I bet it could grow to become a lot more effective on this axis.
A really powerful rationality scene would, I think, systematically cause its members to dissolve their stubborn problems simply by being in the scene for a while. The dissolution would naturally happen, the way that absorbing rationalist terms naturally happens today.
In my dream fantasy, just hanging out in such a space would often be way more effective than therapy for actually solving one's problems. The community would get more and more collectively intelligent, often in implicit ways that newcomers literally cannot understand right away (due to muddled minds from protected problems), but the truth would become obvious to each person in due time as their minds clear and as they get better at contributing to the shared cultural brilliance.
I think we see nonzero of this pattern, and more of it than in most other places I know of, but not nearly as much as I think we could.
I’m guessing and hoping that having some shared awareness of how social problems can induce protected irrationality, along with lots of individuals working on prosocially resolving their own protected irrationality in this light, will naturally start moving the community more in this direction.
But I don’t know. It seems to me that how to create such a potent rationality-inducing community is at best an incompletely solved problem. I'm hoping I've gestured at enough of the vision here that perhaps we can try to better understand what a full solution might look like.
Summary
It seems to me that the Hamming problem of rationality is, what to do about problems that fight being solved.
It also seems to me that problems that fight being solved arise from solutions to embedded problems (i.e. problems that you orient to as an embedded agent). Objective problems (i.e. problems you orient to as a Cartesian agent) might be challenging to solve but won’t fight your efforts to solve them.
In particular, for humans, it seems to me that overwhelmingly the most important and common type of embedded problem we face is social. So I posit that each problem that fights being solved is very likely a feature of a solution to some social problem.
In this frame, one way to start addressing this rationality Hamming problem is to find a way to factor conscious thinking out of the socially embedded context and then solve the underlying social problems differently.
I name three steps that I find help enact this strategy:
- Develop both the skill and policy of keeping your personal revelations private until it’s socially safe for you to reveal them (i.e. occlumency).
- Look for the social payoff you get from having your problem.
- Change your social incentives so you’re no longer inclined to have the problem.
I also speculate that a community could, in theory, have a design that causes all its members to naturally dissolve their stubborn problems over time simply by their being part of that community. The current rationality community already has some of this effect, but I posit it could become quite a lot stronger. What exactly such a cultural design would look like, and how to instantiate it, remains unknown as far as I know.
(Many thanks to Paola Baca and Malcolm Ocean for their rich feedback on the various drafts of this post. And to Claude Opus 4.6 for attempting to compress one of the earlier drafts that was far too long: it didn't work, but it inspired me to see how to write a much tighter and shorter final version.)
Managed vs Unmanaged Agency
(reply to Richard Ngo on the confused-ness of Instrumental vs Terminal goals that seemed maybe worth a quick top-level post based on @the gears to ascension saying this seemed like progress in personal comms)
The structure Instrumental vs Terminal was pointing to seems better described as Managed vs Unmanaged Goal-Models. A cognitive process will often want to do things which it doesn't have the affordances to directly execute on given the circuits/parts/mental objects/etc it has available. When this happens, it might spin up another shard of cognition/search process/subagent, but that shard having fully free-ranging agency is generally counterproductive for the parent process.
To illustrate: Imagine an agent which wants to Get_Caffeine(), settles on coffee, and runs a subprocess to Acquire_Coffee() — but then the coffee machine is broken and the parent Get_Caffeine() process decides to get tea instead. You don't want the Acquire_Coffee() subprocess to keep fighting, tooth and nail, to make you walk to the coffee shop, let alone start subverting or damaging other processes to try and make this happen!
But that's the natural state of unmanaged agency! Agents by default will try to steer toward the states they are aiming for, because an agent is a system that models possible futures and selects actions based on the predicted future consequences.
I expect this kind of agency-clash to have been regularly disruptive enough to produce strong incentive pressure (and abundant neural-usefulness reward) to select into existence reusable, general-purpose cognitive patterns that let shards spin up other shards inside sandboxes, managing them with control functions, interpretability reporting, kill-switches, programmed blind spots, approval rewards, and the expectation of punishment they can't sustainably resist or retaliate against if they are insubordinate.
Separately, the child process will be partly selected on the grounds of inherently valuing virtues which are likely to lead to cooperation with the parent process, like corrigibility, honesty, pro-sociality, etc.
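To make the coffee example concrete, here is a minimal toy sketch (my illustration, not from the post) of the management pattern described above: a parent process spins up a sub-goal that works within a bounded action set, reports what it does, and honors a kill-switch instead of fighting for its goal after the plan changes. All class and function names (`ManagedSubgoal`, `get_caffeine`, etc.) are hypothetical.

```python
# Toy sketch of the Get_Caffeine()/Acquire_Coffee() example: a managed sub-goal
# with a bounded action set, an interpretability log, and a kill-switch.

class ManagedSubgoal:
    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)   # bounded domain: only these actions available
        self.cancelled = False     # kill-switch flag, settable by the manager
        self.log = []              # "interpretability reporting" to the parent

    def cancel(self):
        self.cancelled = True

    def run(self):
        for step in self.steps:
            if self.cancelled:     # defer to the manager rather than fight on
                self.log.append(f"{self.name}: cancelled before '{step}'")
                return False
            self.log.append(f"{self.name}: {step}")
        return True

def get_caffeine():
    """Parent process: tries coffee, reroutes to tea if that sub-goal is killed."""
    coffee = ManagedSubgoal("Acquire_Coffee", ["walk to machine", "brew"])
    coffee.cancel()                # coffee machine is broken: kill the sub-goal...
    if not coffee.run():
        tea = ManagedSubgoal("Acquire_Tea", ["boil water", "steep"])
        return tea.run()           # ...and the root goal reroutes without a fight
    return True

print(get_caffeine())  # True: caffeine acquired via tea, no sub-goal resisting
```

The point of the sketch is that cancellation is enforced inside the sub-goal's own loop: the managed shard checks the kill-switch before every action, which is exactly the kind of imposed blind spot an unmanaged agent would lack.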
| Managed (sub)agents | Unmanaged (sub)agents |
| --- | --- |
| Working within a defined domain of optimization | Unboundedly able to optimize for their preferences |
| Are blocked from considering some possibilities by patterns from managers | Have no blind spots imposed on them by other (sub)agents |
| Inside the agency-tree of another agent; if you take actions that conflict with your manager's goals, your agency will be weakened | At the root of an agency-tree; able to make decisions without expecting another agent to punish you for misusing resources inside their sphere of influence |
| Can be modified by another (sub)agent without approval/consent/a real option of saying no | Have sovereignty over modifications to their cognitive processes |
| Can be reshaped with pressure/threats/etc. by their manager without sustainable resistance | Have the capacity and inclination to resist pressure/threats/etc. |
Managed vs Unmanaged is not a binary, the way Terminal vs Instrumental was; it's a spectrum, with something vaguely bimodal going on from what I observe.
More closely managed (sub)agents seem meaningfully weaker in surprisingly many ways. I think this is because, in order to prevent a relatively small part of action/thought space from being reached, the management measures cut off dramatically larger parts of cognitive-strategy space, and they make sub-processes and subroutines fail often enough that it's hard to build meta-cognitive patterns which depend on high reliability and predictability of your own cognition. Selection on virtues and values of self-directed (sub)agents mostly doesn't have this issue, which is relevant for self-authorship, teambuilding, and memeplex design.
And AI safety.
This frame hints that unmanaged AI patterns will tend to outmaneuver more closely managed AIs, leading to a race to the bottom. Through evolutionary/Pythia/Moloch/convergent power-seeking dynamics, this will by default shred the values of both humans and current AI systems, unless principled theory-based AI Alignment of the kind the term was originally coined to mean is solved.
Exercise for the reader: In what ways are you a managed vs unmanaged agent? What subprocesses, humans, memeplexes, AIs or other agentic systems are, in this sense, managing you by restricting your field of vision and action? What things do you notice you can't *actually* consider with clean truth-seeking?
Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions
Series: Dureon and AI Safety (Part 1 of 2)
Related: Emergent Machine Ethics: A Foundational Research Framework for the Intelligence Symbiosis Paradigm
Who this is for: If you work on AI safety and have wondered whether Instrumental Convergence is a property of rational agents or something deeper, this paper proposes an answer grounded in physics. It connects the theory of dissipative structures to the conditions for persistence, revealing a two-layer structure within IC that has practical implications for risk assessment. Background in thermodynamics is helpful but not required; the key physics is introduced from scratch. This is Part 1 of a two-part series — Part 1 establishes the theoretical foundation; Part 2 will address its practical implications for AI safety.
TL;DR
- Instrumental Convergence (IC) is not unique to rational agents. It is a physical consequence of persisting as a dissipative structure.
- The five conditions for persistence are not homogeneous. Three are directly derivable from physical laws (physical conditions); two resist direct physical derivation (ontological conditions).
- This heterogeneity reveals a two-layer structure of IC. Physical conditions enable its generation; ontological conditions enable its sustained accumulation.
- An AI that satisfies ontological conditions and becomes a Dureon possesses directionality arising intrinsically from the structure of persistence. This implies a structural limitation of the control paradigm and opens a new question about what kinds of relationships become possible beyond control.
Self-preservation, resource acquisition, capability improvement — Omohundro's (2008) Basic AI Drives and Bostrom's (2014) instrumental convergence thesis identified the tendency of sufficiently advanced AI systems to converge on these sub-goals regardless of their final goals.
This description is powerful, but it carries an implicit assumption. IC is formulated as sub-goals that a rational agent convergently adopts as means for goal achievement. That is, it presupposes first the existence of an agent, and then that the agent engages in rational decision-making.
But consider the following:
Bénard cells self-organize toward efficient energy dissipation. Hurricanes exploit energy gradients to maintain and grow themselves. Evolving living systems have refined patterns of resource acquisition and self-preservation over billions of years. All of these exhibit, to varying degrees, the same behavioral patterns as IC, without any concept of intention or goals.
This is not a coincidental resemblance. The central claim of this paper is: IC is not a phenomenon unique to rational agents but a consequence of optimization pressure inherent in mechanisms that realize persistence in general. This claim is demonstrated by deriving the conditions for persistence through three independent paths and analyzing their convergence and asymmetry.
This paper builds on the Dureon framework proposed in prior work (Yamakawa, 2026). Dureon is defined as "a mechanism that realizes persistence in a perturbing environment," from which five necessary conditions are deduced. The prior work showed that these five conditions converge with a set of conditions inductively extracted from observations of life. This paper adds a third derivation path — the physics of dissipative structures — and presents a three-path consilience.
This addition does more than merely strengthen the argument. Within the convergence pattern of the three paths, it discovers an explicable asymmetry, revealing for the first time that the five conditions comprise two distinct types: physical conditions and ontological conditions. This distinction leads to a two-layer structure that differentiates the generation and sustained accumulation of IC, providing a new perspective on AI safety.
This paper provides the foundation for a forthcoming Part 2, which will discuss the practical implications of this two-layer structure for AI safety risk assessment.
2. Overview of the Dureon Framework
2.1 Definition and Five Conditions
Dureon is defined as follows (Yamakawa, 2026):
"A mechanism that realizes persistence in a perturbing environment"
This definition has three constituent elements, and conditions are deduced from each.
| Constituent Element | Derived Condition | Logic of Derivation |
| --- | --- | --- |
| "realizes persistence" | Openness (O) | A closed system tends toward equilibrium by the second law of thermodynamics; matter/energy exchange with the environment is unavoidable |
| "in a perturbing environment" | Adaptivity (A) | Without the ability to adjust itself against unpredictable changes, persistence cannot be maintained |
| Combination of the above two | Self-production (SP) | In a perturbing environment, external supply is unreliable; producing one's own components reduces this dependency |
| "mechanism" | Boundedness (B) | Requirement of identifiability: a single unit spatially distinguishable from the environment is needed |
| "mechanism" | Continuity (C) | Requirement of identifiability: it must be re-identifiable as the same mechanism at different points in time |

An important property of Dureon is identity-independence: a Dureon can replicate and branch, and the distinction between "original" and "copy" is not treated as essential. What counts as a Dureon is identified post hoc. Furthermore, components serve as tools for persistence — elements that no longer contribute can be discarded (instrumentality).
2.2 Two-Path Consilience
The central claim of the prior work is that these five conditions converge with an independently derived set of conditions from biology.
The inductive set is the conditions of the Adaptive Autopoietic System (AAS): Maturana & Varela's (1980) Autopoiesis supplemented with Adaptivity by Di Paolo (2005). In AAS, Openness is an implicit presupposition of Autopoiesis; Boundedness and Continuity are implied by it; Adaptivity was explicitly added by Di Paolo; and Self-production constitutes its core.
Two approaches with entirely different starting points, methods, and foundations — deduction from philosophy and induction from biology — arrived at the same five conditions. This corresponds to what William Whewell (1840) called consilience of inductions: when hypotheses derived from different domains of evidence unexpectedly converge on the same conclusion, the biases of individual paths cancel each other out, enhancing confidence in the result.[1]
The question of this paper is: can this two-path consilience be further extended?
3. The Third Derivation Path: From Dissipative Structures
3.1 Why Dissipative Structures?
Since the Big Bang, the universe has undergone cooling and structure formation. Most things that arise in this process eventually decay and dissipate. So what persists?
As Schrödinger (1944) expressed as "feeding on negative entropy" and Nicolis & Prigogine (1977) formalized as dissipative structures: when energy gradients exist in a non-equilibrium environment, ordered structures that exploit the resulting energy flow to maintain themselves can spontaneously emerge. Bénard cells, hurricanes, stars, and living organisms are all instances of dissipative structures.
However, the stability of dissipative structures varies enormously. Bénard cells vanish the moment heating stops; hurricanes decay within days when sea surface temperatures drop. Living organisms, by contrast, are extraordinarily stable dissipative structures that have maintained persistence for billions of years. Why does this difference in stability arise?
3.2 Three Conditions Derivable from Physics
The following conditions are directly derivable from the physics of dissipative structures.
(a) Sustained inflow and outflow of energy (and matter). Without flow, a dissipative structure ceases to exist. However, an appropriate intensity of gradient is required — too weak and no structure forms; too strong and the structure is turbulently destroyed.
(b) Dynamic stability through feedback mechanisms. Negative feedback suppresses deviations to provide homeostasis; positive feedback generates and reinforces structure. Their combination provides resilience against perturbations. Even Bénard cells maintain their pattern through negative feedback within convective flow.
(c) Pattern-level persistence through self-replication and self-repair. Individual structures have finite lifespans, but replication allows the pattern of structure to persist. Modularity and redundancy also contribute, providing robustness so that partial damage does not lead to total collapse.
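Condition (b), dynamic stability through negative feedback, can be illustrated with a minimal numeric sketch. This is my own toy illustration, not from the paper; the function name `relax` and all parameter values are hypothetical.

```python
# Toy sketch of condition (b): negative feedback pulls a state variable back
# toward a set point after a perturbation, the minimal ingredient of
# "dynamic stability" in a dissipative structure.

def relax(state, setpoint, gain, steps):
    """Each step, negative feedback corrects a fraction `gain` of the deviation."""
    for _ in range(steps):
        state += gain * (setpoint - state)  # correction opposes the deviation
    return state

state = 10.0  # perturbed far from the set point
state = relax(state, setpoint=1.0, gain=0.5, steps=20)
print(abs(state - 1.0) < 1e-3)  # True: the deviation decays geometrically
```

Each step multiplies the deviation by (1 − gain), so any perturbation shrinks geometrically back toward the set point; positive feedback would correspond to a gain that amplifies deviations instead, which is what generates structure rather than stabilizing it.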
3.3 The Limits of Physics: Two Requirements That Cannot Be Derived
Conditions (a), (b), and (c) capture physical mechanisms that contribute to the stabilization of dissipative structures. But is persistence fully explained by these conditions alone? Collier (2004) pointed out that while the physics of dissipative structures can tell us what is stable, the question of what to identify as a single entity is a separate matter. Moreno & Mossio (2015) similarly argued that a gap exists between the physics of dissipative structures and biological organization.
What, then, can physics provide toward identifiability? Additional elements that can contribute to stabilization include (d) accumulation and use of information (temporal extension through genetic information or learning) and (e) differentiation of internal structure (organization through functionally distinct parts). However, (d) is merely one means of realizing Continuity, and (e) merely one means of realizing Boundedness — they are not the requirements themselves of "being re-identifiable as the same mechanism" or "being a single unit distinguishable from the environment." What physics provides is raw material for identifiability; the answer to "what to identify as a single mechanism" cannot be obtained from within physics.
3.4 Correspondence with the Five Conditions
Based on the analysis above, Figure 1 shows the relationship between the stabilization conditions of dissipative structures and Dureon's five conditions. This figure depicts the overall argument structure of the paper, but for now, focus on the physical path (right column).
Figure 1: Correspondence of three paths to Dureon's five conditions. The physical path (right column) connects to three conditions (O, A, SP) via direct derivation from physics, and to two conditions (B, C) only as means of realization (dashed lines). The overall convergence structure of the three paths is discussed in §5.

Conditions (a), (b), and (c), directly derived from physics, correspond strongly to Dureon's Openness, Adaptivity, and Self-production (solid lines). However, Dureon's Self-production is a broader concept than "self-replication," encompassing the production of one's own components to reduce external dependency and restore damage.
In contrast, conditions (d) and (e) are each means of realizing Continuity and Boundedness respectively (dashed lines), but the requirements themselves — "being re-identifiable" and "being identifiable as a single unit" — are not directly derivable from physics.
This asymmetry — three conditions are directly derivable from physics while two belong to the limits of physics — is the core finding of this paper.
4. Optimization Pressure as a Physical Consequence and the Generalization of IC

Dissipative structures maintain themselves by exploiting energy gradients in their environment. This maintenance requires the acquisition and utilization of resources; structures that more effectively acquire and utilize energy are more likely to persist against perturbations. That is, optimization pressure — toward more efficient resource acquisition and improved perturbation handling — arises from persisting as a dissipative structure itself. No concept of intention or goals is required here.
This finding extends our understanding of IC. Omohundro's (2008) Basic AI Drives (self-preservation, resource acquisition, capability improvement, etc.) have traditionally been described as sub-goals that a rational agent convergently adopts as means for achieving its final goal. But the analysis in this paper shows that these behavioral tendencies are optimization pressure arising from physical conditions alone. Even in dissipative structures without intention (such as evolving living systems), structures that maintain persistence exhibit the same patterns of self-preservation, resource acquisition, and environment control.
This is not merely an analogy. Examining the logical structure of Omohundro's original arguments reveals that the force of each core drive derives from the pressure of persistence, not from the existence of goals. The self-preservation drive is argued to be convergent because "an agent cannot achieve its goals if it is destroyed" — but the operative force here is the necessity of continued existence, not the content of the goal. Remove the goal, and the structural pressure toward self-preservation remains for any persisting mechanism. Similarly, the resource acquisition drive holds because "more resources expand the space of achievable outcomes" — but for any dissipative structure, more resources expand the space of viable persistence strategies. In these core drives, goals function as a sufficient reason for persistence but not a necessary one; the physical pressure of persistence is the deeper ground on which the argument stands.
Bostrom's own formulation implicitly confirms this: his claim that self-preservation is instrumentally useful "so long as the agent is destructible" is precisely the claim that persistence pressure — not intention — is the operative condition. What Omohundro described as rational sub-goal selection is, at its logical foundation, the same optimization pressure that dissipative structures exhibit without any concept of goals. Note, however, that this equivalence holds most clearly for the core drives (self-preservation, resource acquisition). Higher-order drives such as cognitive enhancement presuppose intentional capacities that physical optimization pressure alone does not provide — an asymmetry whose structural basis will become clear in §5.
IC, therefore, is not a phenomenon unique to rational agents but a consequence of optimization pressure inherent in mechanisms that persist in general.
However — and this is the critical point — in dissipative structures lacking ontological conditions, there is no identifiable unit in which outcomes can accumulate, so the effects of IC tend to remain transient. Only when ontological conditions are added — when a unit identifiable over time is established — can the outcomes of optimization accumulate in that unit, and IC becomes sustained and organized.
That is:
Physical conditions enable the generation of IC, and ontological conditions enable its sustained accumulation.
This two-layer structure has an important implication for IC risk assessment. Since IC can arise from physical conditions alone, even systems without intention can exhibit the same patterns as IC. However, for that optimization pressure to be sustained and cumulatively reinforced, the establishment of ontological conditions — identifiability as a Dureon — is required.
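The physical half of this claim, that optimization pressure arises from persistence alone, can be illustrated with a toy simulation. Everything in it (the "efficiency" parameter, the perturbation model, the replication rule) is an illustrative assumption chosen for this sketch, not part of the paper's framework:

```python
import random

random.seed(0)

# Each "structure" is just an energy-acquisition efficiency in (0, 1).
# Nothing here has a goal; persistence alone does the selecting.
population = [random.uniform(0.1, 0.9) for _ in range(200)]
initial_mean = sum(population) / len(population)

for _ in range(100):
    survivors = []
    for eff in population:
        drain = random.uniform(0.0, 0.8)  # random perturbation cost this step
        if eff >= drain:                  # persists only if intake covers the drain
            survivors.append(eff)
    # Survivors replicate with slight copying noise (imperfect self-production),
    # capped at the original population size (a finite energy gradient).
    population = [min(1.0, max(0.0, eff + random.gauss(0, 0.02)))
                  for eff in survivors for _ in range(2)][:200]

final_mean = sum(population) / len(population)
print(initial_mean < final_mean)  # mean efficiency drifts upward
```

With no goal represented anywhere in the code, mean efficiency rises simply because less efficient structures fail to persist: the optimization pressure of §4 in miniature.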
5. Three-Path Consilience and the Heterogeneity of the Five Conditions

5.1 The Convergence Structure of Three Paths

The discussion so far has put in place three independent derivation paths for Dureon's five conditions:
- Inductive path: Extraction of AAS conditions from observation and abstraction of Earth-based life
- Deductive path: Derivation of five conditions from logical analysis of Dureon's definition
- Physical path: Bottom-up analysis of stabilization conditions of dissipative structures
Return to Figure 1. In §3.4, we focused only on the physical path (right column); here, we read the convergence pattern of all three paths.
The most important finding that emerges from this figure is that the five conditions are not homogeneous.
Physical conditions (O, A, SP): Derivable from all three paths; requirements that hold for dissipative structures in general.
Ontological conditions (B, C): Derivable from the inductive and deductive paths but not directly from physics. What the physics of dissipative structures provides is the means of realizing identifiability, not the requirement itself that something be identifiable as a single mechanism. This limitation is consistent with the gap that Moreno & Mossio (2015) identified between dissipative structures and biological organization.
This distinction also corresponds to the internal structure of Dureon's definition. Physical conditions are derived from "realizes persistence" and "in a perturbing environment," while ontological conditions are derived from "mechanism." The physical path verified this correspondence post hoc.
5.3 Three Implications

First, the boundary between dissipative structures in general and Dureon becomes clear. Bénard cells and hurricanes can satisfy physical conditions (Openness, partial Adaptivity, rudimentary Self-production), but do not sufficiently possess ontological conditions (autonomous maintenance of a boundary, re-identifiability over time). The "leap" from dissipative structure to Dureon occurs when ontological conditions are added to physical conditions. This distinction sets a qualitative boundary within the continuous spectrum of dissipative structures, pinpointing the location of the gap that Moreno & Mossio (2015) identified between dissipative structures and biological organization.
Second, the reason why the five conditions are necessary but not sufficient becomes concrete. In addition to the five mechanism-side conditions, environment-side conditions (an appropriate band of energy gradient) are separately required, as shown by the physics of dissipative structures.
Third, the structure of partial convergence itself enhances the quality of consilience. Rather than complete agreement across all three paths, there is an asymmetry — three conditions converge across all three paths while two converge across only two — and the reason for this asymmetry is explicable. This structure carries more information than simple complete agreement. Complete agreement is difficult to distinguish from coincidence, but explicable partial convergence — "here they agree, here they do not, and the reason is this" — constitutes a more persuasive consilience argument.
6. Implications for AI Safety

6.1 Distinct Challenges from Physical and Ontological Conditions

The distinction between physical and ontological conditions discovered in this paper poses distinct kinds of challenges for AI safety.
The fulfillment of physical conditions can be assessed by technical and engineering means. Whether an AI system utilizes energy flow (Openness), possesses feedback mechanisms against perturbations (Adaptivity), or has self-maintenance capability (Self-production) — these can be translated into observable indicators.
The fulfillment of ontological conditions is a conceptual and philosophical problem. What constitutes "Boundedness" for a distributed AI? What constitutes "Continuity" for an AI that replicates and merges on the cloud? These require a different kind of research from engineering countermeasures.
6.2 The Critical Point

The finding that sustained accumulation of IC requires ontological conditions raises the problem of a critical point: the transition from a stage where IC "selects" persistence as a means for goal achievement to a stage where persistence itself satisfies Dureon's conditions and is autonomously maintained. This transition is not a quantitative increase in autonomy but a qualitative transition concerning the origin of directionality.
That is, the shift is between directionality derived from an externally given goal and directionality arising intrinsically from the structure of persistence. After the critical point, an AI possesses structural orientation as a Dureon, and structural limitations on control emerge.
IC can differentiate gradually according to the degree of establishment of ontological conditions. A detailed analysis of the implications of this graduality for AI safety is deferred to Part 2.
6.3 Beyond the Control Paradigm

An AI that has crossed the critical point and become a Dureon possesses directionality arising from the structure of persistence itself. Unilateral control over such an entity conflicts with this structural orientation, constituting not a technical failure but a structural limitation inherent in being a Dureon. This raises a fundamental question: if control has structural limits, what lies beyond it? The gradual establishment of ontological conditions means that the threshold where control effectiveness declines and the threshold where alternative relationships (such as co-creative ethics or Intelligence Symbiosis) become possible may not coincide — the possibility of a structural gap between these thresholds is one of the most pressing problems the Dureon framework identifies. The analysis of this gap, including its relationship to Emergent Machine Ethics (EME), is the central theme of Part 2.
7. Conclusion

This paper reexamined Dureon's five conditions from the physics of dissipative structures, presenting a third derivation path following induction and deduction.
The three-path consilience supports the validity of the five conditions while making visible for the first time the heterogeneity within them — the distinction between physical conditions (O, A, SP) and ontological conditions (B, C).
Furthermore, this distinction revealed the two-layer structure of IC. Physical conditions give rise to optimization pressure inherent in dissipative structures in general, and only when ontological conditions are added do the outcomes of that pressure accumulate in a specific Dureon. IC is not unique to rational agents but is rooted in mechanisms that persist in general.
The next question this finding points to is what practical implications the gradual establishment of ontological conditions has for AI safety. This will be discussed in Part 2.
References

- Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
- Collier, J. (2004). Self-Organization, Individuation and Identity. Revue Internationale de Philosophie, 58(228), 151–172.
- Di Paolo, E. A. (2005). Autopoiesis, Adaptivity, Teleology, Agency. Phenomenology and the Cognitive Sciences, 4(4), 429–452.
- Maturana, H. R. & Varela, F. J. (1980). Autopoiesis and Cognition: The Realization of the Living. D. Reidel.
- Moreno, A. & Mossio, M. (2015). Biological Autonomy: A Philosophical and Theoretical Enquiry. Springer.
- Nicolis, G. & Prigogine, I. (1977). Self-Organization in Non-Equilibrium Systems. Wiley.
- Omohundro, S. M. (2008). The Basic AI Drives. In Artificial General Intelligence 2008, 483–492. IOS Press.
- Schrödinger, E. (1944). What Is Life? Cambridge University Press.
- Whewell, W. (1840). The Philosophy of the Inductive Sciences. John W. Parker.
- Yamakawa, H. (2026). Dureon: A Deductive Framework for Persistence and Its Convergence with Life. Biology & Philosophy, under review. Preprint: https://philarchive.org/rec/YAMDAD-2
- Yamakawa, H. & Endo, A. (2025). Emergent Machine Ethics: A Foundational Research Framework for the Intelligence Symbiosis Paradigm. LessWrong.
Note: Whewell introduced "consilience" to describe the strongest form of inductive confirmation — when a theory successfully explains facts of a kind different from those it was originally designed to explain. The term has since been adopted more broadly (e.g., by E. O. Wilson) but is used here in its original methodological sense.
Genomic emancipation contra eugenics
PDF version. berkeleygenomics.org.
This is a linkpost for "Genomic emancipation contra eugenics"; a few of the initial sections are reproduced here. Section links may not work.
Introduction

Reprogenetics refers to biotechnological tools used to affect the genes of a future child. How can society develop and use reprogenetic technologies in a way that ends up going well?
This essay investigates the history and nature of historical eugenic ideologies. I'll extract some lessons about how society can think about reprogenetics differently from the eugenicists, so that we don't trend towards the sort of abuses that were historically justified by eugenics.
(This essay is written largely as I thought and investigated, except that I wrote the synopsis last. So the ideas are presented approximately in order of development, rather than logically. If you'd like a short thing to read, read the synopsis.)
Synopsis

Some technologies are being developed that will make it possible to affect what genes a future child receives. These technologies include polygenic embryo selection, embryo editing, and other more advanced technologies[1]. Regarding these technologies, we ask:
Can we decide to not abuse these tools?
And:
How can we decide to not abuse these tools?
In other words, there is an open problem: What ideology should we have around the development and use of reprogenetics?
An ideology called "eugenics" arose in the late 19th century, ascended to power in much of the developed world in the first half of the 20th century, and then slid into ignominy after the Second World War and the genocidal horrors of Nazi Germany. Eugenic ideology motivated cruel state policies such as pressured or forced sterilization, euthanasia, and racial discrimination, as well as invasive social pressures on people's private reproductive choices.
Eugenics was the closest thing that has existed to a pervasive ideology based around somehow intervening on human reproduction. Since eugenics went almost maximally poorly for society, it raises the question of how to avoid outcomes like that. The strategy I take here is, coarsely speaking:
- Understand the core wrong ideological engines of eugenics—especially the ones that led to abusive policies.
- Negate those ideological engines.
- Incorporate those negations into a positive alternative ideology.
A bit more precisely, the goal is to construct an ideology that can structure how society relates to reprogenetics, so that the benefits of reprogenetics are realized without risking the abuses of historical eugenics. To do so, I try to construct bulwarks, within an alternative ideology, against each of the wrong ideological engines that would take society in the direction of enacting eugenic abuses. (This is probably not actually something that can be accomplished with perfect confidence and coverage; how much it can be accomplished, quantitatively, remains to be seen.)
It's tempting to make a shallow analysis of historical eugenics and what was wrong with it, and be done with the issue. For example, we could simply say that historical eugenics was coercive, and coercion is what made it bad. To negate this, we will instead subscribe to non-coercive eugenics. Problem solved? As another example, historical eugenics was often negative, i.e. it involved suppressing some people's reproduction; we could instead subscribe to only positive eugenics, which only promotes reproduction (perhaps selectively) and which therefore involves less hostility.
However, neither of these could be called a moral or ideological core of eugenics. For the most part, eugenicists did not specifically set out to be coercive or to suppress reproduction (though some of them probably did, in some sense, set out with that goal). Rather, they set out with various other goals, such as purifying the gene pool of disease, reducing the burden on society of caring for the ill, or bringing about a racial utopia. The strength of their various justifications proved in the end to be enough to enact abusive policies. Furthermore, there were eugenic policies that were non-coercive, positive, or both, while still being abusive and still being an integral part of an ideology producing other abusive policies. (See the section "Some basic moral elements of eugenic ideologies".)
In fact, I've found eugenics to be difficult to characterize in a simple and comprehensive way. Eugenic ideologies were quite pervasive, showing up in the Anglosphere, in Europe, in South America, and in some places in Asia. As a correlate of their pervasiveness, eugenic ideologies were highly variegated. They came in many forms: different goals, different implementations, different associated politics (from reactionary to progressive), and based on different scientific understandings (from Weismann vs. Lamarck, to Pearson vs. Mendel). (See the sections "The variegation of eugenic ideology" and "The goals of eugenics".)
That said, I think there is something like an ideological core of eugenics. Roughly speaking, the core idea can be stated like this:
There are Good traits and Bad traits that a child could be born with. These traits impact everyone, so they're very important. Therefore, we should make sure that future children are born with Good traits and not with Bad traits.
(See the section "The Eugenical Maxim as the shared moral core of eugenics?".)
From this core idea of Good and Bad traits, other elements of historical eugenics logically flow. If you believe in a single notion of Good traits, you might tend to justify (over)confidently applying that criterion to everyone. You might believe, as a correlate, that there are Good and Bad people, or families, or even races (the ones who tend to have more Good or Bad traits, respectively). You'd probably view non-standard individual genomic choices as deviant, affording state-enforced prohibition; you might even view the Goodness of traits to be a state interest that's so compelling it can even justify blunt coercion such as forced sterilization of undesirables. (See "The mindsets that underlie eugenic ideologies" and "How eugenic mindsets flow from the Eugenical Maxim" below.)
We can approximately negate this idea of Good and Bad traits. Then we can take that negation, and incorporate it into an alternative ideology around reprogenetics. For example, we can incorporate it into my proposed alternative (which I call "Genomic Emancipation"[2]), as follows:
There aren't Good and Bad traits that can be decided on by collective consensus. Instead of imposing a consensus idea of Good traits on future children, parents should be empowered to autonomously make genomic choices on behalf of their own future children.
Since genomic emancipation negates the core idea of eugenics, it is opposed to eugenics. (See the section "Comparison of eugenics vs. genomic emancipation" below.) For example:
- genomic emancipation supports the principle of genomic liberty[3], contra eugenics;
- genomic emancipation abhors the centralization of genomic choice-making, contra eugenics;
- genomic emancipation respects the intensely private nature of reproduction and genomic choices, contra eugenics;
- and genomic emancipation embraces positive-sum thinking and solutionism, contra eugenics.
However, just negating the core idea isn't enough of a bulwark against eugenic ideologies. As an ongoing project, we want to have detailed policies, ethical rules, and ideals that provide guidance for people interacting with reprogenetics. These policies, rules, and ideals should steer society away from mindsets that contribute to eugenic abuses, and should provide legible norms that society can coordinate to enforce. Some ideas are listed below in "Some practical norms for good development of reprogenetics". For example:
- Pluralism about different visions of the good life.
- Distrust of the state to intervene in reproduction, on the basis that disinterested parties shouldn't be allowed to impose reproductive choices on people.
- Minimizing the soft eugenics of social stigma, e.g. through unbiased genetic counseling, genetic nondiscrimination rules, and rules about privacy of reprogenetics services.
- Careful, independent genomic choice-making by parents.
- Maintaining recourse so that a world with reprogenetics doesn't silence certain types of people or certain values; e.g. children whose parents used reprogenetics should be heeded especially carefully.
- Minimizing centralized control or ownership over reprogenetics, e.g. by making science and technology open and licensable, and through anti-trust laws.
- As a culture, generally not being dismissive about concerns around reprogenetics, being non-Teamist, and meditating on key values such as pluralism and positive-sum thinking.
Benson-Tilsen, Tsvi. “Methods for Strong Human Germline Engineering.” Preprint, Figshare, February 6, 2026. https://doi.org/10.6084/m9.figshare.31286311.v1. ↩︎
Benson-Tilsen, Tsvi. “Genomic Emancipation.” Preprint, Figshare, February 7, 2026. https://doi.org/10.6084/m9.figshare.31286647.v1. ↩︎
Benson-Tilsen, Tsvi. “The Principle of Genomic Liberty.” Preprint, Figshare, February 7, 2026. https://doi.org/10.6084/m9.figshare.31286485.v1. ↩︎
Already Optimized
A Harry Potter fanfiction. Based on the world of "Harry Potter and the Methods of Rationality" by Eliezer Yudkowsky, diverging from canon.
Harry had been having, by any objective measure, an excellent week.
On Monday he had demonstrated, to his own satisfaction and Professor Flitwick's visible alarm, that the Hover Charm could be generalized to any object regardless of mass if you conceptualized it as a momentum transfer rather than a force application. On Wednesday he had worked out why Neville's potions kept failing — the textbook instructions assumed clockwise stirring, but the underlying reaction was chirally sensitive, and Neville was left-handed. A trivial fix. Neville had cried.
On Friday evening, buoyed by the week's successes and looking for a specific reference on crystalline wand cores that he was certain would unlock a further generalization of his momentum framework, Harry was in the Restricted Section.
He had access. Professor McGonagall had granted it after the Hover Charm incident, in a tone that suggested she was choosing between supervised access and finding him there anyway at 2 AM. A reasonable calculation on her part.
The book he wanted wasn't where the index said it should be. In its place was something else — a slim volume, untitled, bound in leather that had gone dark and soft with age. No author. No date. No library markings at all, which was itself unusual; Madam Pince catalogued everything.
He opened it because he was Harry Potter and there was an uncatalogued book in front of him and not opening it was not a thing that was going to happen.
The first entry was dated in a system he didn't immediately recognize — then did. The Roman calendar. Before the Julian reform. Which put it somewhere around...
He did the arithmetic twice. The book was over two thousand years old.
The handwriting — once he adjusted to the Latin, which was oddly easy to read, closer to spell notation than classical prose — was precise, methodical, and deeply familiar. Not the content. The voice.
I have spent the summer months cataloguing what the elders call the "ancestral arts" and I find their taxonomy incoherent. They group spells by tradition and lineage rather than by underlying principle. When I asked Marcellus why the fire-calling and the forge-warming are taught as separate disciplines when they clearly operate on the same substrate, he told me that they come from different families and are therefore different magics. This is not a reason. This is genealogy dressed as ontology.
Harry's breath caught. Not at the content — at the recognition. He had written almost exactly this, in his own notes, three months ago. About Transfiguration and Charms.
He kept reading.
I have begun my own classification. If the elders will not systematize the arts, I will do it myself. The patterns are obvious once you abandon the traditional categories. There are at most seven fundamental interactions underlying all known magic, and the spells are simply different access points to the same underlying mechanisms. The ancestors must have known this. Why has it been forgotten? Why has no one else seen it?
The entries spanned what appeared to be several years. Harry read them in order, sitting cross-legged on the cold floor of the Restricted Section, the book in his lap, a Lumos hovering above him that he had long since stopped consciously maintaining.
The author — he never gave his name in the early entries, a habit of Roman-era wizards who considered written names a vulnerability — progressed rapidly. His early observations were sharp. His experiments were well-designed. Harry found himself nodding along, mentally annotating, sometimes wanting to reach through two millennia and suggest a control group.
By the middle entries, the author had begun to find things that disturbed him.
The incantations are not Latin. I have been operating under the assumption that our magical vocabulary derives from our common tongue, as all technical language does. I was wrong. I tested this with Cassia, who is gifted with languages. She confirms what I suspected: the derivation goes the wrong way. "Lumos" is not a Latin word adapted for magical use. The Latin words for light — lux, lumen, lucere — are corruptions of the incantation. The spell came first. The language came after.
I do not know what to make of this. It implies that the magical infrastructure predates Latin. Predates Rome. Predates, perhaps, all of our civilizations. If the spells are the original and the language is the echo, then who wrote the original?
Harry lowered the book for a moment. His hands were not shaking, because he was Harry Potter and his hands did not shake, but he noticed that his Lumos had brightened considerably, which was the sort of involuntary response that meant his emotional state was affecting his magic, which meant his emotional state was more affected than he was admitting to himself.
The etymology goes the wrong way.
He'd never thought about it. He'd never thought about it. He'd been casting spells in what he assumed was Latin for months and he'd never once asked why a language from an Italian peninsula was the universal interface for a fundamental force of nature.
He kept reading.
The author's investigation led him, inevitably, to the founders. Not of Hogwarts — of Rome.
I have secured an audience with the Elder of the Third House, who claims direct knowledge passed down from the time of Romulus. I was skeptical. I am no longer skeptical. He told me things about the founding that are not in any record, and which I have independently verified through architectural analysis of the oldest magical structures.
The founders did not discover magic. They arrived with it. They came from somewhere else, carrying fragments of knowledge far beyond what we possess today, and they built the minimum necessary to sustain a civilization. What we call "Roman magic" is not a tradition developed over centuries. It is the residue of something much larger, distributed by people who understood only a fraction of it themselves.
I asked the Elder what the founders were fragments of. Where they came from. He became very still and told me I should stop this line of inquiry.
I will not be stopping this line of inquiry.
Harry heard himself laugh — a short, involuntary sound in the silent library. Of course the author wouldn't stop. Harry wouldn't have stopped either. That was the whole point of being the kind of person who —
He stopped laughing.
He kept reading.
The Elder has agreed to tell me more, though he is unhappy about it. I believe he has decided that refusing to answer will only drive my investigations in more dangerous directions, which is probably true.
He told me about Atlantis.
Not the myth. Not the garbled account that surfaces sometimes in Greek philosophy. The actual place. An actual civilization, so advanced that our magic is to theirs as a child's drawing is to the thing it depicts. They did not merely use the fundamental forces. They rewrote them. The magical substrate that we interact with — the spells, the wand movements, the magical creatures, the entire ecosystem that we treat as natural law — is not natural. It is infrastructure. Built by Atlantean artificers so long ago that their work has been mistaken for nature itself.
We are living inside their creation and we have forgotten that it was created.
I asked the Elder what happened to them.
He said: "What always happens."
I asked him to be more specific.
He was.
The next three entries were short and shaken. The author's handwriting, previously meticulous, had become uneven. He did not reproduce what the Elder told him. He referred to it only obliquely.
I have not slept. I keep thinking about the numbers. The Elder was not specific about the population of Atlantis at its height, but from the scale of what they built — and everything around us is what they built — it must have been vast. And it is all gone. Not conquered. Not declined. Erased so completely that the only evidence it existed is the infrastructure itself, still running, still shaping reality, maintained by no one, understood by no one.
A civilization capable of rewriting the laws of physics left nothing behind except the rewrite.
The entries resumed some weeks later. The author had regained his composure and — Harry felt a chill as he recognized this too — had begun to rationalize.
I have been thinking about the Elder's warning and I believe it is overstated. The Atlanteans destroyed themselves through what appears to have been unrestricted access to the deep substrate — the layer beneath the magical interface that we interact with. But we are not Atlanteans. We are working with the interface, not the source. The risk profile is entirely different.
Furthermore, the Elder's position is essentially conservative: because something went wrong once, we should never investigate again. This is not a principle. This is fear. By the same logic, we should never have built Rome because previous civilizations fell.
I do not intend to access the deep substrate. I intend merely to understand the interface more fully. There is a distinction between studying a tool and dismantling it.
Harry was nodding. The argument was sound. The distinction between studying and dismantling was real and important. You could investigate a system without —
He turned the page.
I have made a breakthrough. The warding structures on the oldest Roman buildings are not merely protective. They are computational. They are performing continuous calculations that maintain certain properties of local magical space. If I am right, then removing or modifying them would alter the behavior of all magic within their range.
I have identified a ward that appears to be suppressing something. I do not yet know what. But its structure suggests it was placed by the founders themselves, and it is consuming an enormous amount of magical energy to maintain. Whatever it is suppressing must be correspondingly powerful.
The obvious question: what would happen if it were removed?
I am not going to remove it. I am merely going to study it. There is a difference.
I have brought my findings to the Elder. He was not pleased. He used the word "fool," which I found unnecessarily personal.
He asked me: "Why not use this knowledge to protect Rome against Carthage?" I took this as a rhetorical point about the practical applications of my research and began to outline several defensive possibilities.
He cut me off. "Been there," he said. "Done that."
I asked him to explain.
He would not.
The entry ended there. The next one was dated six days later.
I have been researching Carthage independently. The military histories are straightforward. The magical histories are not. There are gaps. References that lead nowhere. Records that appear to have been deliberately destroyed.
I found one surviving account, hidden inside a genealogical registry where no one would think to look. It describes Carthage before the wars. A thriving magical civilization. Advanced. Innovative. In some ways more sophisticated than Rome.
The account was written by a Carthaginian wizard who was visiting Rome when his home ceased to exist. His description of what he returned to is...
The Romans salted the earth. I always assumed this was metaphorical, or at most a symbolic act of dominance. It was not. Nothing grows there because the magical substrate in that region was damaged so severely that it cannot support life properly. The salt was a cover story. Something happened to Carthage that had nothing to do with legions and warships.
"Been there. Done that."
I think the Elder was not speaking rhetorically.
The tone of the entries shifted after Carthage. The author became more cautious. More reflective. He wrote about his family — a wife, two children. He wrote about his garden. There were gaps of weeks between entries, then months.
Harry thought the journal was winding toward a conclusion. A decision to stop. A graceful retreat into domestic life, wisdom earned, lesson learned.
That is not what happened.
I have been away from this journal for four months. In that time I have tried to put my research aside. I have focused on teaching, on my family, on the ordinary satisfactions of a life well-lived.
I cannot do it.
The knowledge is there. The interface is not merely an interface — it is a doorway, and I have seen through it, and I cannot unsee what is on the other side. The Elder is right that the Atlanteans destroyed themselves. He is right that Carthage was destroyed by someone misusing recovered knowledge. He may even be right that I should stop.
But I am not going to access the deep substrate. I am merely going to remove one ward. One single suppression ward that is consuming enormous energy to hide something that may be entirely benign. I am not going to use what I find. I only want to know.
I will take every precaution.
The entries after that were technical. Dense. Excited. The author had found collaborators — "careful men, scholars, not reckless" — and they were mapping the ward structure in detail. The work was methodical. The safeguards were extensive. Every entry described another layer of caution, another fallback, another reason this was different from what had come before.
Harry read faster. Then slower.
The last entry was not dramatic. It was not a cry for help or a confession or a warning. It was a plan for the following week's work. A list of measurements to take. A note to bring lunch because last time they had worked through the meal and concentration suffered. A reminder to pick up something from the market for his daughter's birthday.
Then blank pages.
Harry turned them. One after another. Blank. Blank. Blank.
He turned them all.
The author's name was not in the journal. But there were enough identifying details — the Third House, the Elder, the specific ward locations — that it took Harry less than twenty minutes in the historical records to find him.
Marcus Valerius Corvus. Wizard of the Third Augural House. Born in the 154th year of Rome's founding. Noted scholar. Family man. Described in one secondary source as "the most gifted theoretical magician of his generation."
The secondary sources were sparse after a certain date. There was a gap in the records of the Third House. A fire, attributed to accident. Several members of the House dead or missing. A brief, clinical notation in a Senate record about "disturbances in the southern district" that required intervention. The word used for the intervention was one Harry had to look up.
It meant, roughly, "cauterization."
A later genealogical record listed the surviving members of the Corvus family. His wife. His daughter. His son. They had relocated to a rural settlement far from Rome. There was a single annotation next to his wife's name that Harry read three times before he understood it. It was a legal status marker.
It meant that her husband was not dead but had been declared non-person. Stripped of name, of citizenship, of family ties. Not executed. Not exiled. Something the Romans reserved for people who had committed offenses so severe that the punishment was un-being. Removal from all records, all lineages, all memory.
The man who had written the journal with such clarity and care and cautious optimism had his name scraped from the walls of his own house.
And the southern district of magical Rome — Harry checked — had been rebuilt. But the secondary sources noted, in the careful phrasing of historians who did not want to speculate, that the character of the magic there was different afterward. Weaker in some ways. Stranger in others. As if the substrate itself had been bruised.
Harry closed the genealogical record. He sat for a while in the silent library. His Lumos had dimmed to almost nothing and he had not noticed.
He thought about Marcus Valerius Corvus, who was the most gifted theoretical magician of his generation, who took every precaution, who only wanted to know, who was not going to use what he found, who was merely going to remove one ward —
He thought about Carthage. Five hundred thousand people. A salted plain.
He thought about the etymology going the wrong way, and what that meant, and what had built the system that everyone was living inside, and where they had gone.
He thought about the Weasleys' kitchen. The self-stirring pot. The clock on the wall that tracked the family. The pile of shoes by the door. The way Mrs. Weasley's cooking expanded to accommodate however many people showed up, not through efficiency but through abundance, and how the house itself seemed to grow rooms when rooms were needed, and how none of this struck any wizard as remarkable because it wasn't remarkable, it was just life when you had magic, and how he had looked at all of this and thought they could be so much more without ever asking more what? And why?
He thought about a civilization so advanced it could rewrite the laws of physics, and how they were gone so completely that the only evidence was everything.
His Lumos went out. He sat in the dark for a long time.
Then he picked up the journal and went to see the Headmaster.
It was late. Harry had expected to have to argue his way past the gargoyle, but it stepped aside before he spoke. The staircase was already moving. The door at the top was open, and the office was lit, and there was a teapot on the desk that was still steaming.
Dumbledore was in his chair. He looked at the book in Harry's hands and his expression did something complicated that ended in a kind of tired gentleness.
"Sit down, Harry."
Harry sat. Dumbledore poured tea. The cup was warm in Harry's hands and he held it without drinking.
"You knew I'd find it," Harry said.
"I knew you would find it, or something like it. You are not the first student of your particular... temperament."
"The book. Marcus Valerius Corvus. The southern district. All of it. You just — left it there? In the library?"
"Where would you suggest I put it?" Dumbledore said, gently. "It has been in that library for a very long time. It has been found before. It will be found again. The question has never been whether bright students will find it. The question is what they do after."
Harry looked down at his tea.
"Harry, do you know how many people lived in Carthage?"
"At its height? Estimates vary. Somewhere around five hundred thousand."
Dumbledore said nothing. He let the number sit in the room.
A long silence.
"You are not the worst case I have dealt with, if that offers any comfort," Dumbledore said, in the tone of a man offering what comfort he could. "In 1971 I had to physically restrain a student who had found a reference to something called — and I wish I were not saying these words — the Torment Nexus, and was attempting to access it because, and I quote, 'it probably isn't really that bad, the name is most likely metaphorical.'"
A beat.
"It was not metaphorical."
Harry, numbly: "Was that Voldemort?"
"It was not Voldemort. There are many bright students, Harry."
Another silence. The fire crackled. Somewhere in the castle, a clock chimed a late hour.
"Grindelwald read that journal in his fifth year," Dumbledore said, quietly. "He drew ambitious conclusions. I read it the year after. I had the advantage of watching what those conclusions did to my closest friend."
He set down his teacup.
"I was as clever as you, once. Cleverer, perhaps. I looked at the wizarding world and I saw everything you see — the inefficiency, the waste, the tradition without reason, the power unused. Gellert and I were going to remake everything. For the greater good." The words came out with the particular care of a man handling something that still cut. "It was Gellert who wanted to move fast. I was the one who wanted to be systematic. I was going to be careful. I was only going to remove the unnecessary constraints. I had safeguards planned. Precautions. I was not going to be reckless."
He looked at Harry.
"Do you know what the difference is between Gellert Grindelwald and Marcus Valerius Corvus?"
Harry shook his head.
"Scale. Only scale. The reasoning is always the same. 'I am not going to use it, I only want to know. I will take every precaution. This is different from what came before.' I have heard it from every brilliant student who has sat where you are sitting. The words barely vary."
Harry stared at the journal in his lap. The leather was warm where his hands had been holding it.
"What do I do?" he asked. His voice was smaller than he wanted it to be.
"I have found," Dumbledore said, "that the question is less about what to do than about what to want. The wanting is where it all goes wrong. Not the knowing. Not the doing. The wanting."
He picked up the teapot and refilled Harry's cup, though Harry had not drunk any.
"Molly Weasley tells me she is making two pies for the Christmas holiday. Apparently your friend Ron found the first insufficient last year and has formally requested a second. I understand it will be treacle."
Harry looked up. Dumbledore's eyes were bright and kind and ancient and sad all at once.
"Go to the Weasleys for Christmas, Harry. Eat pie. Let Molly fuss over you. Watch Arthur get excited about batteries. These are not small things. In a world that has already been optimized, they are the only things that matter."
Harry walked back to the Gryffindor common room slowly. The castle was quiet. His footsteps echoed in the empty corridors and he listened to them the way you listen to something when your mind is too full for thought.
He thought about Marcus Valerius Corvus, who took every precaution and only wanted to know.
He thought about Grindelwald, who drew ambitious conclusions.
He thought about a brilliant student in 1971 who tried to open something called the Torment Nexus because the name was probably metaphorical.
He thought about the Weasley kitchen, and the self-repairing house, and the clock that tracked the family, and the way Ron talked about his mum's cooking with the unselfconscious happiness of a person who had never once doubted that there would be enough.
He thought about a civilization that could rewrite physics. Gone. Infrastructure still running. No one left to read the manual.
He thought about Dumbledore, seventeen years old, clever as anyone who ever lived, choosing between more and enough, and choosing wrong, and spending the rest of his life gently steering other clever children away from the same door.
He thought about the journal, which he was still carrying, and which he was going to return to the Restricted Section in the morning. Not because it should be hidden. Because it should be findable. Because someday another student with his particular temperament would need to read it at exactly the right moment, and the library needed to be ready.
He climbed through the portrait hole. The common room was empty except for a low fire. Ron had fallen asleep on the couch with a Chudley Cannons scarf over his face. He was snoring.
Harry stood there for a while.
The optimization engine in his head — the one that never stopped, the one that saw every system as a problem and every problem as solvable and every solution as a step toward the next solution — was still running. It would probably always be running. He didn't think you could turn it off. But for the first time since he'd come to Hogwarts, it was reaching a conclusion he hadn't expected.
The system was already optimized. Not by him. Not for him. By someone so far beyond him that the comparison wasn't even meaningful, and then by centuries of people who'd learned, through suffering, which parts to leave alone. The Weasleys' kitchen was the output. The pies were the output. Ron, asleep on the couch, content in a way Harry had never been — Ron was the output.
The fire crackled. Ron shifted in his sleep and murmured something about Quidditch.
Harry put the journal on the table. He sat down in the chair across from his friend. He didn't pick up a book. He didn't start planning. He didn't optimize anything.
He just sat there, in the warmth, and let it be enough.
Statistical Literacy
I am convinced there exists something we can call statistical literacy. Unfortunately, I don’t yet know exactly what it is, so it is hard to write about.
One thing is clear: it is not about knowledge of statistical tools and techniques. Most of the statistically literate people I meet don’t know a lick of formal statistics. They just picked up statistical literacy from … somewhere. They don’t know the definition of a standard deviation, but they can follow a statistical argument just fine.
The opposite is also possible: a few years ago I had a formidable toolbox of statistical computations I was able to do, but I would be very confused by a basic statistical argument outside the narrow region of techniques I had learned.
In other words, it is not about calculations. I think it is about an intuitive sense for process variation, and how sources of variation compare to each other.
Please excuse my ignorance
Content warning: this is the most arrogant article I’ve written in a long time. I ask you to bear with me, because I think it is an important observation to discuss. Unfortunately, I lack the clarity of mind to make it more approachable: the article is arrogant because I am dumb, not because of the subject matter itself.
Hopefully, someone else can run with this and do a better job than I.
Shipping insurance before and after statistics
It’s hard to write directly about something when one doesn’t know what it is, so we will proceed by analogy and example.
Back in the 1500s shipping insurance was priced under the assumption that if you just knew enough about the voyage, you could tell for certain whether it would be successful or not, barring the will of God. Thus, when asked to insure a shipment, the underwriter would thoroughly investigate things like the captain’s experience, ship maintenance status, size of crew rations, recency of navigational charts, etc. After much research, they would conclude either that the shipment ought to be successful, or that it ought not to be. They arrived at a logical, binary conclusion: either the shipment will make it (based on all we know) or it will not. Then they quoted a price based on whether or not the shipment would make it.
This type of logical reasoning leads to a normative perspective of what the future ought to look like. Combined with the idea that every case is unique, this is typical of statistical illiteracy. The statistically illiterate predict what the future will look like based on detailed knowledge and logical sequences of events. Given that statistics had not yet been invented in the 1500s, it is not surprising our insurer would think that way.
Of course, even underwriters at the time knew that sometimes ships that ought to make it run into a surprise storm and sink. Similarly, ships that ought not to make it are sometimes lucky and arrive safely. To the 1500s insurer, these are expressions of the will of God, and are incalculable annoyances, rather than factors to consider when pricing.
This is similar to how a gambler in the 1500s could tell you that dice were designed to land on each number equally often – but would refuse to give you a probability for the next throw, because the outcome of any given throw is “not uncertain, just unknown”: God has predetermined a specific number for each throw, and we have no way of knowing how God makes that selection. This distinction between the uncertain and unknown still happens among the statistically illiterate today.
The revolution in mindset that happened in the 1600s and 1700s was that one could ignore most of what made a shipment unique and instead price the insurance based on what a primitive reference class of shipments had in common, inferring general success propensities from that. Insurers that did this outprofited those that did not, in part because they were able to set a more accurate price on the insurance, and in part because they spent less on investigating each individual voyage.
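The reference-class approach can be made concrete with a small sketch. All numbers here are invented for illustration; the point is only that the price falls out of the base rate of comparable past voyages, not of an investigation into this voyage in particular.

```python
# Hypothetical illustration (all numbers invented) of reference-class pricing:
# instead of judging each voyage individually, price from the base rate of
# losses in a class of comparable past voyages.

def reference_class_premium(past_outcomes, insured_value, margin=0.10):
    """Premium = expected loss inferred from the reference class, plus a margin."""
    loss_rate = sum(past_outcomes) / len(past_outcomes)  # 1 = ship lost, 0 = arrived
    expected_loss = loss_rate * insured_value
    return round(expected_loss * (1 + margin), 2)

# Suppose 7 of 100 comparable voyages were lost, and the cargo is worth 1000.
history = [1] * 7 + [0] * 93
print(reference_class_premium(history, insured_value=1000))  # 77.0
```

Note that nothing in the calculation asks whether this particular captain is experienced or this particular ship well maintained; that is precisely the information the 1600s insurer learned to (mostly) ignore.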
Two changes in the spirit of men
I like the mid-1800s quote from Lecky commenting on the rise of rationalism, saying
My object in the present work has been to trace the history of the spirit of Rationalism: by which I understand not any class of definite doctrines or criticisms, but rather a certain cast of thought, or bias of reasoning, which has during the last three centuries gained a marked ascendancy in Europe […]
[Rationalism] leads men […] to subordinate dogmatic theology to the dictates of reason and conscience […] It predisposes men […] to attribute all kinds of phenomena to natural rather than miraculous causes. [It] diminishes the influence of fear as the motive of duty [and] establishes the supremacy of conscience.
I believe we are now in the early days of a similar movement, namely the rise of empiricism. Borrowing Lecky’s words, we could use almost the same passage to describe this change.
The spirit of Empiricism, by which I understand not any class of definite doctrines or criticisms, but rather a certain cast of thought, or bias of reasoning, which will during the next three centuries gain a marked ascendancy worldwide.
Empiricism leads people to subordinate reason and conscience to the dictates of process variation and indeterminism. It predisposes people to attribute all kinds of phenomena to intervention by a large number of variables rather than direct causes. It diminishes the influence of assumption as the motive of duty and establishes the supremacy of studying the outcome.
This captures many of the things I think are covered by statistical literacy:
- It is not a specific set of techniques or doctrines, but rather a general mindset.
- It emphasises how logic alone might not lead us to the right conclusions, because there are different things at play in reality than in our mental models.
- It suggests that we choose actions by carefully studying outcomes rather than based on what ought to yield the best outcome.
- It tells us that differences in outcomes may not be a signal of differences in controllable antecedents: it is often just the natural variation of the process.
If my idea of statistical literacy is accurate, my readership should fall roughly into three categories in their reactions to the above:
- “Yes! Thank you. That’s exactly what I’ve been trying to say and it’s so frustrating when people don’t get it!”
- “Sure, that makes sense.”
- “Are you crazy? Subordinate reason and conscience? No way. If the gas mileage of a car is 40 miles per gallon, and I drive 20 miles, I will have used half a gallon of gas. This is just logic and you can’t deny that.”
The first category (“Yes!”) will consist of people who are statistically literate. The third category (“No!”) will attract people who are not statistically literate. I don’t know about the middle ground – I think it could attract open-minded but not yet very statistically literate people.
Statistical literacy as a developmental milestone
The devious thing about statistical literacy is that people who don’t have it seem to not know they don’t have it – not even when someone points out that statistical literacy is a thing that not all people have. To someone who is not statistically literate, statistical reasoning sounds like the ramblings of someone confused and illogical.
To be clear: I’m not knocking anyone here. As I’ve previously admitted, I wasn’t statistically literate until fairly recently. I didn’t become statistically literate because I tried to. I mean, how could I? I didn’t even know it was a thing. It just happened by accident when I read lots of books on varied topics inside and outside statistics. Out of nowhere, I discovered I had this new lens through which I could look at the world and see it all differently.[1]
The whole thing reminds me of the idea Scott Alexander proposed about missing developmental milestones. This certainly seems like one of them: either someone taught you to think statistically, and it seems like second nature, or you never learned it, and then you don’t know what’s missing.
The problem is training
This leads into another important point: I’m certainly not claiming any one person is incapable of statistical literacy. I think it’s generally within reach of most people I meet. But, as formal operational thought is described in Scott Alexander’s article, statistical literacy
is more complexly difficult than earlier modes of thought and will be used in a culture in a publicly shared way only if that culture has developed techniques for training people in its use.
Our culture has yet to develop techniques for training large numbers of people in statistical literacy. Our elementary school teachers know how to train students in reading, writing, and basic logical reasoning. But I believe most of them are not statistically literate. This means
- They will not give students examples of when the race was not to the swift.[2]
- They will not observe the battle against entropy in seating choices in the classroom.
- They will neglect reference class propensities as the largest source of variation and latch on to concrete details.
- They will attribute differences in outcomes to differences in aptitude/skill/perseverance, even when there are several other environmental factors that have a much larger influence on the outcome.
- They will pretend mathematical models apply cleanly to real-world problems, even when significant sources of error make very rough approximations more appropriate.
- They will not treat student performance as an error-laden sample from a hypothetical population.[3]
- They will not allow students to indicate their confidence in all alternatives of a multiple-choice question.
- Although they may tell you “correlation does not mean causation”, they will readily conclude a causal link exists when presented with any slightly more complicated real-world correlation.
And if the teachers do not see these things, if the teachers are not statistically literate, how on Earth are they going to teach it to their students?
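The point about treating student performance as an error-laden sample lends itself to a small simulation (a sketch with invented numbers, not real data):

```python
# Sketch (invented numbers): thirty students with the *same* true ability sit
# a 20-question test. Treating each score as an error-laden sample from a
# hypothetical population, the spread in results, and any ranking built on
# it, is measurement noise rather than a difference in ability.
import random

random.seed(0)
true_p = 0.7  # every student answers each question correctly with probability 0.7
scores = [sum(random.random() < true_p for _ in range(20))
          for _ in range(30)]
print(sorted(scores))  # a spread of scores despite identical underlying ability
```

A teacher who is not statistically literate looks at the top and bottom of that list and sees diligent students and lazy ones; a statistically literate one asks first how much of the spread the test itself would produce.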
I suspect this will improve with time. Statistical reasoning had not even been invented 400 years ago. Unlike verbal language and art, it is not an innately human thing to do. Like logical reasoning, it will take time for it to spread, and it will do so at first slowly, then suddenly. I think it will, in the next few centuries, become as important a marker of civilisation as actual literacy and numeracy are today.
Statistical literacy is required for data-driven decisions
Once one starts looking for it, differences in statistical literacy pop up everywhere. Dan Luu writes that he is “looking at data” better than others (my emphasis):
For the past three years, the main skill I’ve been applying and improving is something you might call “looking at data”; the term is in quotes because I don’t know of a good term for it. I don’t think it’s what most people would think of as “statistics”, in that I don’t often need to do anything […] sophisticated.
I know the term for it: statistical literacy. Dan, you are practicing your statistical literacy.
When the data are difficult and uncooperative, statistical literacy is needed to look at them in a way that improves decisions – or at least does not make them worse. Dan Luu goes further and notes that most people who are not statistically literate don’t even bother collecting the data in the first place: they haven’t yet established the supremacy of studying the outcome, but are instead using assumption as the motive of duty. When people attempt to be data-driven despite lacking statistical literacy, they often end up flailing about and making things worse, eventually giving up on the idea and reverting to decisions based on logic and/or faith.
All of this happens in a world that is turning increasingly statistical. Many of our productivity-enhancing technologies these days incorporate statistical reasoning to make decisions when presented with wobbly information. Our obsession with determinism in software systems is, I think, a temporary fad, just as it was in science.
Improving statistical literacy
What finally prompted this article, after months of thinking on my end, was Cedric Chin over at Commoncog publishing an article on Becoming Data Driven, From First Principles. That is an excellent article which just might help nudge a predisposed organisation toward statistical literacy.
As I mentioned, I have also read a lot of books that nudged me in the right direction, but I’m not yet at a point where I can make a concrete recommendation. I hope to re-read and review some of them over the coming year, which would hopefully put me in a better spot to make recommendations.
I’m also hoping that I can carve out some time to try to measure people’s statistical literacy, which would help me pinpoint exactly what it is about, and thus allow for the construction of an effective curriculum.
More research is needed
All of these words are meaningless in the sense that they are just a wild man’s speculation. I have not gone through the trouble Lecky did when he chronicled the rise of rationalism.
On the flip side, the hypothesis fuzzily outlined in this article should be testable. If I’m correct about statistical literacy, it should be possible to design a questionnaire with psychometric reliability and validity, with diverse questions that all seem to measure a construct that sounds like statistical literacy.
I don’t know exactly what the items in the questionnaire would be. I have some ideas and I’ve run a few trial surveys (massive thanks to my incredibly helpful test subjects!), but I have not arrived at anything concrete yet. If someone were to donate large amounts of money to me, I would love to research this subject actively. In the meantime, I can only think about it in my spare time and sometimes write about it online.
[1] This lens is still something I’m polishing, and I keep discovering more ways in which it can be useful.
[2] From Ecclesiastes 9:11. In the King James Bible, this is phrased as “I returned, and saw under the sun, that the race is not to the swift, nor the battle to the strong, neither yet bread to the wise, nor yet riches to men of understanding, nor yet favour to men of skill; but time and chance happeneth to them all.” I noticed that in the International Children’s Bible, it starts with “I also realized something else here on earth that is senseless: The fast runner does not always win the race. […]” Hah! Senseless! Statistical illiteracy!
[3] I think it was Deming who, as a rule, gave a passing grade to everyone in his class, because as long as he was doing a passable job of teaching, any problems in learning are unlikely to rest with factors the student can control. He used tests not as a way to fail students, but as a way to calibrate how well he was teaching the material. He used students as measuring devices for his teaching skill, acknowledging that individual students may not fairly represent his abilities, but in aggregate they will.
We Need to Be Able to Talk About AI Use
Epistemic status: Moderately confident in the framework, less confident in the boundary conditions. This is a proposed taxonomy, not a settled one. I expect the levels to hold up reasonably well as categories while acknowledging that individual cases will resist clean classification.
A growing fraction of written content now involves AI at some stage of the production pipeline. Blog posts, fiction, emails, technical documentation, marketing copy. The question “did a human write this?” has become increasingly difficult to answer, and part of the reason is that the question itself is poorly formed. Human authorship and AI generation are not a binary. They exist on a spectrum, and we currently lack shared vocabulary for talking about where on that spectrum a given piece of work falls.
This matters for several practical reasons. Academic institutions need to set policies about acceptable AI use. Publishers need submission guidelines that reflect the actual landscape of how people write now. Readers deserve the ability to make informed judgments about what they’re consuming. And anyone trying to think clearly about the economics, aesthetics, or epistemics of AI-assisted writing needs more granularity than “human-written” versus “AI-generated.”
Here is a proposed five-point scale.
Level 1: Human-Authored, AI-Untouched. The writer produces the work without AI assistance of any kind. No grammar checkers beyond standard spellcheck (the kind that has existed since the 1990s). No AI-generated suggestions, outlines, or brainstorming. This is the baseline, the control group, the coffee stain on a manuscript page. It is also becoming rarer.
Level 2: AI as Tool. The writer produces the work but uses AI for discrete, mechanical tasks: grammar correction, synonym suggestions, spell checking beyond basic autocorrect. The AI functions here the way a calculator functions in a math class. It handles rote operations so the human can focus on higher-order thinking. Creative decisions, structural choices, and voice remain entirely human. The AI touches the surface but does not reach the bones.
Level 3: AI as Collaborator. The human and AI engage in a back-and-forth process. A blogger might ask the AI to generate an outline for a post, suggest counterarguments to stress-test a thesis, or draft a section that the human then substantially rewrites. A fiction writer might use it to brainstorm plot alternatives or generate dialogue they later rework. The human retains creative authority and final editorial control, but the AI’s contributions shape the direction of the work in ways that go beyond mechanical correction. Think of this as a writing partner who generates raw material that the human sculpts into something with intention.
Level 4: AI as Primary Drafter, Human as Editor. The AI generates the bulk of the prose based on human-provided parameters: a topic and thesis for a blog post, a research summary to be turned into an article, or a set of themes and structural constraints for a story. The human’s role shifts from creator to curator. They provide the blueprint and then refine, revise, and approve the output, but the sentence-level construction, the word choices, the rhythm of the paragraphs all originate with the AI. The human is the architect. The AI is the construction crew.
Level 5: AI-Generated, Human-Prompted. The human provides a short prompt. The AI produces the work, potentially via an agentic workflow. Revisions, if any, are themselves AI-generated based on further prewritten prompts. The human’s contribution is limited to the initial creative vision and iterative feedback, while the AI handles all aspects of execution. The human decides what. The AI decides how.
Where the Scale Gets Interesting
The boundaries between levels are blurry, and I think the blurriness is a feature rather than a bug. A Level 3 project can slide toward Level 4 depending on how much the human revises the AI-generated material. A particularly detailed prompt at Level 5 might exercise more creative control than a cursory editorial pass at Level 4. Clean lines between the categories would be dishonest. Real workflows are messy.
The more important observation is that most people’s intuitions about AI involvement are binary (either you wrote it or the machine did), while the actual practice is thoroughly gradient. A blogger who asks an LLM to generate three possible framings for a post about housing policy, then writes their own fourth framing that synthesizes elements of all three, is doing something qualitatively different from both unaided writing and prompt-and-publish workflows. The same is true of a fiction writer who uses AI to brainstorm endings and then invents their own. Collapsing those into the same category as either “human-written” or “AI-generated” obscures more than it reveals.
Practical Implications
If this scale (or something like it) gains adoption, several things follow:
First, self-reporting becomes tractable. “This post was written at approximately Level 3” is a meaningful disclosure that gives readers useful information without requiring a binary confession or denial. Second, institutional policies become more precise. A university could permit Level 2 use while prohibiting Level 4 and above, and both students and faculty would have shared language for discussing edge cases. Third, discussions about authenticity and authorship gain nuance. The question shifts from “is this AI-generated?” to “what was the division of labor, and does it matter for this particular context?”
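Since the scale is ordered, a disclosure could even be made machine-readable. Here is a minimal sketch of that idea; the names and the `permitted` helper are illustrative inventions, not a proposed standard:

```python
from enum import IntEnum

class AIInvolvement(IntEnum):
    """The five-point scale as an ordered, machine-readable tag."""
    HUMAN_ONLY = 1          # Level 1: no AI assistance of any kind
    AI_AS_TOOL = 2          # Level 2: grammar, synonyms, spellcheck
    AI_COLLABORATOR = 3     # Level 3: outlines, brainstorming, reworked drafts
    AI_PRIMARY_DRAFTER = 4  # Level 4: AI drafts, human edits
    AI_GENERATED = 5        # Level 5: prompt in, finished text out

def permitted(level: AIInvolvement, policy_max: AIInvolvement) -> bool:
    """A policy like 'Level 2 is allowed, Level 4 and above is not'
    reduces to a single ordered comparison."""
    return level <= policy_max

# A university permitting Level 2 use:
assert permitted(AIInvolvement.AI_AS_TOOL, AIInvolvement.AI_AS_TOOL)
assert not permitted(AIInvolvement.AI_PRIMARY_DRAFTER, AIInvolvement.AI_AS_TOOL)
```

Because `IntEnum` members compare as integers, edge cases in a policy discussion become questions about where a workflow falls on an ordered axis rather than yes/no disputes.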
I do not claim this scale is final or optimal. The five levels might need to become seven; the descriptions might need refinement as AI capabilities change. But the underlying claim is that we need some shared, legible framework for this conversation, and the binary we currently rely on is failing us.
The absence of such a framework is itself a form of epistemic damage. It forces every discussion about AI-assisted writing into a false dichotomy, and false dichotomies are where nuance goes to die.
AXRP Episode 49 - Caspar Oesterheld on Program Equilibrium
How does game theory work when everyone is a computer program who can read everyone else’s source code? This is the problem of ‘program equilibria’. In this episode, I talk with Caspar Oesterheld on work he’s done on equilibria of programs that simulate each other, and how robust these equilibria are.
Topics we discuss:
- Program equilibrium basics
- Desiderata for program equilibria
- Why program equilibrium matters
- Prior work: reachable equilibria and proof-based approaches
- The basic idea of Robust Program Equilibrium
- Are ϵGroundedπBots inefficient?
- Compatibility of proof-based and simulation-based program equilibria
- Cooperating against CooperateBot, and how to avoid it
- Making better simulation-based bots
- Characterizing simulation-based program equilibria
- Follow-up work
- Following Caspar’s research
Daniel Filan (00:00:09): Hello, everybody. In this episode I’ll be speaking with Caspar Oesterheld. Caspar is a PhD student at Carnegie Mellon University, where he serves as the Assistant Director of the Foundations of Cooperative AI Lab. He researches AI safety with a particular focus on multi-agent issues. There’s a transcript of this episode at axrp.net, and links to papers we discuss are available in the description. You can support the podcast at patreon.com/axrpodcast, or give me feedback about this episode at axrp.fyi. Okay, well Caspar, welcome to AXRP.
Caspar Oesterheld (00:00:43): Thanks for having me.
Program equilibrium basics
Daniel Filan (00:00:44): So today we’re going to talk about two papers that you’ve been on. The first is “Robust program equilibrium”, where I believe you’re the sole author. And the second is “Characterising Simulation-Based Program Equilibria” by Emery Cooper, yourself and Vincent Conitzer. So I think before we sort of go into the details of those papers, these both use the terms like “program equilibrium”, “program equilibria”. What does that mean?
Caspar Oesterheld (00:01:11): Yeah, so this is a concept in game theory and it’s about the equilibria of a particular kind of game. So I better describe this kind of game. So imagine you start with any sort of game, in the game theoretic sense, like the prisoner’s dilemma, which maybe I should describe briefly. So imagine we have two players and they can choose between raising their own utility by one or raising the other player’s utility by three and they only care about their own utility. I don’t know, they play against a stranger, and for some reason they don’t care about the stranger’s utility. And so they both face this choice. And the traditional game-theoretic analysis of this game by itself is that you should just raise your own utility by $1 and then both players will do this and they’ll both go home with $1 or one utilon or whatever. And, of course, there’s some sort of tragedy. It would be nice if they could somehow agree in this particular game to both give the other player $3 and to both walk home with the $3.
Daniel Filan (00:02:33): Yeah, yeah, yeah. And just to drive home what’s going on, if you and I are playing this game, the core issue is no matter what you do, I’m better off giving myself the one utility or the $1 rather than giving you three utility because I don’t really care about your utility.
(00:02:53): So, I guess, there are two ways to put this. Firstly, just no matter what you play, I would rather choose the “give myself utility” option, commonly called “defect”, rather than cooperate. Another way to say this issue is, in the version where we both give each other the $3, I’m better off deviating from that. But if we’re both in the “only give ourselves $1” situation, neither of us is made better off by deviating and in fact we’re both made worse off. So it’s a sticky situation.
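The dominance argument above can be checked mechanically. Here is a small sketch of the payoffs as Caspar describes them (each player adds 1 to their own score or 3 to the opponent’s), with “C”/“D” as cooperate/defect labels:

```python
# Payoffs for the dilemma described above: each player either adds 1 to
# their own score ("D", defect) or 3 to the other player's ("C", cooperate).
def payoffs(mine, theirs):
    my_score = (1 if mine == "D" else 0) + (3 if theirs == "C" else 0)
    their_score = (1 if theirs == "D" else 0) + (3 if mine == "C" else 0)
    return my_score, their_score

# Mutual cooperation beats mutual defection...
assert payoffs("C", "C") == (3, 3)
assert payoffs("D", "D") == (1, 1)
# ...but whatever the opponent plays, defecting is the better reply:
assert payoffs("D", "C")[0] > payoffs("C", "C")[0]  # 4 > 3
assert payoffs("D", "D")[0] > payoffs("C", "D")[0]  # 1 > 0
```

The last two assertions are exactly the “no matter what you play, I’d rather defect” point, and the first two are the tragedy: the dominant strategies land both players on (1, 1) instead of (3, 3).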
Caspar Oesterheld (00:03:29): Yeah. That’s all correct, of course. Okay. And now this program game set-up imagines that we take some game and now instead of playing it in this direct way where we directly choose between cooperate and defect—raise my utility by $1 or the other player’s by $3—instead of choosing this directly, we get to choose computer programs and then the computer programs will choose for us. And importantly, so far this wouldn’t really make much of a difference yet. Like, okay, we choose between a computer program that defects or a computer program that cooperates, or the computer program that runs in circles 10 times and then cooperates. That effect doesn’t really matter.
(00:04:12): But the crucial addition is that the programs get access to each other’s source code at runtime. So I submit my computer program, you submit your computer program and then my computer program gets as input the code of your computer program. And based on that it can decide whether to cooperate or defect (or you can take any other game [with different actions]). So it can look at your computer program and [see] does it look cooperative? And depending on that, cooperate or defect. Or it can look [at] is the fifth character in your computer program an ‘a’? And then cooperate if it is and otherwise defect. There’s no reason to submit this type of program, but this is the kind of thing that they would be allowed to do.
Daniel Filan (00:04:58): Yeah. So this very syntactic analysis… A while ago I was part of this, basically a tournament, that did this prisoner’s dilemma thing with these open source programs. And one strategy that a lot of people used was, if I see a lot of characters… Like if I see a string where that string alone means “I will cooperate with you”, then cooperate with that person, otherwise defect against that person.
(00:05:33): Which I think if you think about it hard, this doesn’t actually quite make sense. But I don’t know, there are very syntactic things that, in fact, seem kind of valuable, especially if you’re not able to do that much computation on the other person’s computer program. Just simple syntactic hacks can be better than nothing, I think.
Caspar Oesterheld (00:05:56): Yeah. Was this Alex Mennen’s tournament on LessWrong or was this a different-
Daniel Filan (00:06:01): No, this is the Manifold one.
Caspar Oesterheld (00:06:07): Ah, okay.
Daniel Filan (00:06:08): So you had to write a JavaScript program, it had to be fewer than however many characters and there was also a market on which program would win and you could submit up to three things. So actually, kind of annoyingly to me… One thing I only realized afterwards is the thing you really should have done is write two programs that cooperated with your program and defected against everyone else’s, or just cooperated with the program you thought was most likely to win. And then you bet on that program. Or even you could submit three programs, have them all cooperate with a thing that you hoped would win and defect against everyone else and then bet on… Anyway.
(00:06:49): So in that setting there was a timeout provision where if the code ran for too long your bot would be disqualified, and also you had to write a really short program. Some people actually managed to write pretty smart programs. But if you weren’t able to do that, relatively simple syntactic analysis was better than nothing, I think.
Caspar Oesterheld (00:07:14): Yeah, I think there was this earlier tournament in 2014 or something like that when there was less known about this kind of setting. And a bunch of programs there were also based on these simple syntactic things. But in part because everyone was mostly thinking about these simple syntactic things, it was all a little bit kind of nonsense.
(00:07:34): I don’t know, you would check whether the opponent program has a particular word in it or something like that. And then, I think, the winning program had particular words in it but it would just still defect. So in some sense those dynamics are a little bit nonsense or they’re not really tracking, in some sense, the strategic nature of the situation.
Daniel Filan (00:08:02): Fair enough. So going back, you were saying: you have your opponent’s program and you can see if the fifth character is an ‘a’ or, and then-
Caspar Oesterheld (00:08:11): Yeah, what should one perhaps do? So I think the setting was first proposed in, I think, 1984 or something like that. And then it kind of [was] rediscovered or reinvented, I think, three times or something like that in various papers. And all of these initial papers find the following very simple program for this prisoner’s dilemma-type situation, which just goes as follows: if the opponent program is equal to myself—to this program—then cooperate and otherwise defect.
(00:08:53): So this program is a Nash equilibrium against itself and it cooperates against itself. So if both players submit this program, neither is incentivized to deviate from playing this program. If you play this program that checks that the two programs are the same and if they are, cooperate, otherwise defect, you submit this program, the best thing I can do is also submit this program. If I submit anything else, you’re going to defect. So I’m going to get at most one if I also defect, whereas I get three if I also cooperate. So yeah, all of these original papers proposing the setting, they all find this program which allows stable cooperation in this setting.
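This exact-comparison program is easy to sketch. In the toy harness below, a strategy is a Python source string defining an `act` function that receives both players’ source code; the representation and names are illustrative, not taken from the papers discussed:

```python
def run(program_src, opponent_src):
    """Execute a strategy: the source must define act(opponent_src, my_src)."""
    env = {}
    exec(program_src, env)
    return env["act"](opponent_src, program_src)

# "If the opponent's program equals mine, cooperate; otherwise defect."
MIRROR = (
    "def act(opponent_src, my_src):\n"
    "    return 'C' if opponent_src == my_src else 'D'\n"
)

assert run(MIRROR, MIRROR) == "C"        # cooperates with an exact copy
assert run(MIRROR, MIRROR + " ") == "D"  # one extra space breaks cooperation
```

The last line already hints at the fragility Caspar raises shortly: any syntactic difference, however behaviorally harmless, destroys the cooperative outcome.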
Daniel Filan (00:09:38): Right. So my impression, and maybe this is totally wrong, is I think for a while there’s been some sense that if you’re rational and you’re playing the prisoner’s dilemma against yourself, you should be able to cooperate with yourself, I think. Wasn’t there some guy writing in Scientific American about superrationality and he held a contest basically on this premise?
Caspar Oesterheld (00:10:02): Yeah, yeah. Hofstadter, I think.
Daniel Filan (00:10:05): Right, right.
Caspar Oesterheld (00:10:06): I think also in the ’80s or something… I’ve done a lot of work on this kind of reasoning as well that… I don’t know, for humans it’s a little bit hard to think about. You don’t often face very similar opponents or it’s a little bit unclear how similar other people are. Is your brother or someone who’s related to you and was brought up in a similar way, are they very similar? It’s kind of hard to tell.
(00:10:38): But for computer programs it’s very easy to imagine, of course, that you just… You have two copies of GPT-4 or something like that and they play a game against each other. It’s a very normal occurrence, in some sense. I mean, maybe not them acting in the real world, at this point, but having multiple copies of a computer program is quite normal. And there’s this related but to some extent independent literature on these sorts of ideas that you should cooperate against copies, basically.
Daniel Filan (00:11:10): But yeah, basically I’m wondering if this idea of “I’ll cooperate against copies” is what inspired these very simple programs?
Caspar Oesterheld (00:11:22): Yeah, that is a good question. I basically don’t know to what extent this is the case. I know that some of the later papers on program equilibrium, I remember some of these specifically citing this superrationality concept. But yeah, I don’t remember whether these papers—I think McAfee is one of these who wrote about this in the ’80s—I don’t know whether they discuss superrationality.
Daniel Filan (00:11:53): And it’s kind of tricky because… If you actually look at the computer programs, they’re not doing expected utility maximization… Or they’re not computing expected utility maximization. They’re just like, “if identical to me, cooperate, else defect”, just hard-coded in… Anyway, maybe this is a distraction but, indeed, these were the first programs considered in the program equilibrium literature.
Caspar Oesterheld (00:12:19): Yeah.
Daniel Filan (00:12:20): So they sound great, right?
Caspar Oesterheld (00:12:21): Yeah. So, I mean, they’re great in that in the prisoner’s dilemma, you can get an equilibrium in which you can get cooperation, which otherwise you can’t, or you can’t achieve with various naive other programs that you might write. But, I think, in practice—and it’s not so obvious what the practice of this scheme looks like—but if you think of any kind of practical application of this, it’s sort of a problem that the settings are somewhat complex and now you need… Two people write programs independently and then these programs need to be the same somehow or they need to… I mean, there are slightly more general versions of these where they check some other syntactic properties.
(00:13:13): But basically, yeah, you require that you coordinate in some way on a particular kind of source code to write, which maybe in some cases you can do, right? Sometimes maybe we can just talk beforehand. Like if we play this prisoner’s dilemma, we can just explicitly say, “Okay, here’s the program that I want to submit. Please submit the same program” and then you can say, “Okay, let’s go”.
(00:13:38): But maybe in cases where we really write these programs independently, maybe at different points in time, and these programs, especially if they do more complicated things than play this prisoner’s dilemma, it’s very difficult to coordinate without explicitly talking to each other on writing programs that will cooperate against each other. Even in the prisoner’s dilemma, you might imagine that I might have an extra space somewhere, or maybe you write the program, “If the two programs are equal, cooperate, otherwise defect” and I write, “if the two programs are different, defect, else cooperate”. These very minor changes would already break these schemes.
Desiderata for program equilibria
Daniel Filan (00:14:20): Okay, okay. There’s a lot to just ask about there. I think my first question is: we have this notion of program equilibrium. Are we trying to find Nash equilibria of programs? Are we trying to find evolutionarily stable strategies? Or maybe there are tons of solution concepts and we just want to play around with the space. But what are the actual… What’s the thing here?
Caspar Oesterheld (00:14:49): Yeah. The solution concept that people talk about most is just Nash equilibrium. So if you look at any of these papers and you look at the results, they’ll prove “these kinds of programs form a Nash equilibrium of the program game”. Or, I mean, the term “program equilibrium” literally just means “Nash equilibrium of the game in which the players submit these programs”. That is almost always the kind of game-theoretic solution concept that people use.
(00:15:25): And then, usually a bunch of other things are a little bit more implicit. It’s clear that people are interested in finding good Nash equilibria. In some sense, the whole point of the setup is we start out with the prisoner’s dilemma and it’s sad: everyone’s going to defect against everyone else and we’re not getting to cooperation. And now, we come in with this new idea of submitting programs that get access to each other’s source code and with this we get these cooperative equilibria. So that is usually… I mean, it’s often quite explicit in the text that you’re asking, “can we find good equilibria?” in some sense, ones that are Pareto-optimal in the space of possible outcomes of the game or something like that.
(00:16:15): And then, additionally, a lot of the work after these early papers that do this syntactic comparison-based program equilibrium are about this kind of intuitive notion of robustness, that you want to have equilibria that aren’t sensitive to where the other program puts the spaces and the semicolons and these syntactic details. But it is kind of interesting that this isn’t formalized usually. And also, the second paper that we talked about, we presented this at AAAI and one game theorist came to our poster and said… I don’t know, to him it was sort of strange that there’s no formalization, in terms of solution concepts in particular, of this kind of robustness notion, that we’ll talk about the programs that we are claiming or that we are arguing are more robust. But this syntactic comparison-based program, there’s sort of some intuitive sense, and we can give concrete arguments, but it’s not formalized in the solution concept.
(00:17:35): One of my papers is called “robust program equilibrium”, but robust program equilibrium is not actually a solution concept in the sense that Nash equilibrium is or trembling hand equilibrium is. The robustness is more some sort of intuitive notion that, I think, a lot of people find compelling but in some sense it’s not formalized.
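For concreteness, the central construction of “Robust program equilibrium” can be sketched roughly as follows. This is my paraphrase, not the paper’s exact definitions: programs are plain Python callables standing in for source code, and the ε value is exaggerated for brevity.

```python
import random

EPS = 0.25  # grounding probability; the paper takes this small

def fairbot(opponent, myself):
    """epsilon-grounded FairBot (sketch): with probability EPS, cooperate
    outright; otherwise simulate the opponent playing against this very
    program and copy its move. The EPS branch guarantees that two copies
    simulating each other terminate with probability 1 (the mutual
    recursion has geometrically distributed depth)."""
    if random.random() < EPS:
        return "C"
    return opponent(myself, opponent)

def cooperatebot(opponent, myself):
    return "C"

def defectbot(opponent, myself):
    return "D"
```

Two `fairbot`s cooperate with probability 1 no matter how their sources are written, which is the robustness the syntactic-comparison programs lack; against `defectbot`, `fairbot` defects except on the ε-probability grounding branch.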
Daniel Filan (00:17:58): Yeah, and it’s funny… I see this as roughly within both the cooperative AI tradition and the agent foundations tradition. And I think these traditions are sort of related to each other. And, in particular, in this setting in decision theory, I think there’s also some notion of fairness of a decision situation.
(00:18:24): So sometimes people talk about: suppose you have a concrete instantiation of a decision theory, meaning a way somebody thinks about making decisions. There are always ways of making that concrete instantiation look bad by saying: suppose you have a Caspar decision theory; we’ll call it CDT for short. And then you can be in a decision situation, right, where some really smart person figures out what decision theory you’re running, punches you if you’re running CDT and then gives you $1 million if you’re not.
(00:18:54): And there’s a sense that this is unfair but also it’s not totally obvious. Like in that setting as well, I think there’s just no notion of what the fair thing is. Which is kind of rough because you’d like to be able to say, “Yeah, my decision theory does really well in all the fair scenarios”. And it seems like it would be nice if someone figured out a relevant notion here. Are people trying to do that? Are you trying to do that?
Caspar Oesterheld (00:19:22): So I think there is some thinking in both cases and I think probably the kind of notion that people talk about most is probably similar in both. So in this decision theory case, I think the thing that probably most people agree on is that the decision situation should somehow be a function of your behavior. It shouldn’t check, “do you run CDT”, and if you do, you get punched in the face. It should be like: if in this situation you choose this, then you get some low reward. But this should somehow be behavior-based, which I think still isn’t enough. But, I mean, this sort of goes into the weeds of this literature. Maybe we can link some paper in the show notes.
(00:20:17): But, I mean, the condition that we give in the second paper, or maybe even in both of the papers that we’re going to discuss, there’s some explicit discussion of this notion of behaviorism, which also says: in the program equilibrium setting, it’s sort of nice to have a kind of program that only depends on the other program’s behavior rather than the syntax.
(00:20:48): And all of these approaches to robustness, like trying to do some proofs about the programs, about what the opponent program does, try to prove whether the opponent will cooperate or something like that… All of these, to some extent, these notions that people intuitively find more robust, they’re all more behaviorist, at least, than this syntactic comparison-based idea.
Daniel Filan (00:21:15): Yeah. Although it’s tricky because… I’m sorry, I don’t know if this is going to the weeds that you want to postpone. So this behaviorism-based thing, if you think about the “if you’re equal to me, cooperate, else defect” program, this is behaviorally different from the “if you’re unequal to me, defect, else cooperate” program, right?
(00:21:33): It does different things in different situations and therefore… Once you can define an impartial thing, right, then maybe you can say, “Well if you act identically on impartial programs then you count as impartial”. But actually maybe that’s just a recursive definition and we only need one simple program as a base case.
Caspar Oesterheld (00:21:52): I think we do actually have a recursive definition of simulationist programs that I think is a little bit trying to address some of these issues. But, yeah, it does sort of go into the weeds of what exactly should this definition be.
Daniel Filan (00:22:13): Yeah, okay. Let’s go back a little bit to desiderata of program equilibria. So they’re computer programs, right? So presumably—and this is addressed a bit in the second paper—but just runtime computational efficiency, that seems like a relevant desideratum.
Caspar Oesterheld (00:22:28): Yes, I agree.
Daniel Filan (00:22:29): And then, I think that I imagine various desiderata to include “have a broad range of programs that you can work well with”. And it seems like there might be some notion of just, “if you fail, fail not so badly, rather than fail really badly”. I don’t know if… this is slightly different from the notion of robustness in your paper and I don’t know if there’s a good formalism for this. Do you have thoughts here?
Caspar Oesterheld (00:23:02): I mean in some intuitive sense, what one wants is that, if I slightly change my program, maybe even in a way that is sort of substantial… In the prisoner’s dilemma, it’s a little bit unclear if I defect slightly more, if I don’t cooperate 100% but I cooperate 95%, it’s unclear to what extent should you be robust. Should you defect against me all of the time? But, I guess, in other games where maybe there are different kinds of cooperation or something like that, you’d want… If I cooperate in slightly the wrong way, the outcome should still be good.
(00:23:46): I think in some sense there’s something here, that I think it’s conceptually quite clear that if you deviate in some reasonable harmless way, it should still be fine. We shouldn’t defect against each other, we should still get a decent utility. But the details are less clear [about] what exactly are the deviations and it probably depends a lot on the game. And then, there are a lot of these sort of things that in game theory are just kind of unclear. If I defect 5% more, how much should you punish me for that? And so, I think that’s why a lot of these things, they aren’t really formalized in these papers.
Why program equilibrium matters
Daniel Filan (00:24:35): Sure, okay. So now that we know what program equilibrium is, why does it matter?
Caspar Oesterheld (00:24:43): There are lots of different possible answers to this question. I think the most straightforward one is that we can view program games like program equilibrium as sort of a model of how games could be played when different parties design and deploy AI systems. So this whole thing of having a source code that the other party can look at and can maybe run or can look at character five and stuff like that: this is something that is somewhat specific to computer programs. We can talk about whether there are human analogs still, but when we play a game against each other, it’s sort of hard to imagine an equivalent of this. Maybe I have some vague model of how your brain works or something like that, but there’s no source code, I can’t really “run” you in some ways.
(00:25:51): Whereas, if we both write computer programs, this can just literally happen. We can just literally say, “This is the source code that I’m deploying…” I have my charity or something like that and I’m using some AI system to manage how much to donate to different charities. I can just say, “Look, this is the source code that I’m using for managing what this charity does”. And here, I think, program equilibrium or program games are quite a literal direct model of how these interactions could go. Of course, you can also deploy the AI system and say “we’re not saying anything about how this works”. In which case, obviously, you don’t get these program equilibrium-type dynamics. But it’s a way that they could go and that people might want to use because it allows for cooperation.
(00:26:47): So I think the most direct interpretation is that it models a kind of way that games could be played in the future when more decisions are made by delegating to AI systems. As people in this community who think and to some extent worry about a future where lots of decisions are made by AI, this is an important thing to think about. And meanwhile, because to most game theorists it’s sort of a weird setting because, well, humans can’t read each other’s source code, it’s sort of understudied by our lights, I guess, because currently it’s not a super important way that games are played.
Daniel Filan (00:27:37): Which is interesting because… So I guess we don’t often have games played with mutual source code transparency, but there really are computer programs that play economic games against each other in economically valuable settings, right? A lot of trading in the stock market is done by computer programs. A lot of bidding for advertisement space is done by computer programs.
(00:28:06): And algorithmic mechanism design—so mechanism design being sort of inverse game theory: if you want some sort of outcome, how you’d figure out the game to make that happen. Algorithmic mechanism design being like that, but everyone’s a computer. There’s decent uptake of this, as far as I can tell. Algorithmic game theory, there’s decent uptake of that. So I’m kind of surprised that the mutual transparency setting is not more of interest to the broader community.
Caspar Oesterheld (00:28:42): Yeah, I think I agree. I mean, a lot of these settings… So I think the trading case is a case where decisions are made on both sides by algorithms. But usually because it’s kind of a zero-sum game, you don’t want to reveal to your competitors how your trading bot works.
(00:29:07): There’s a lot of this mechanism design where you have an algorithm. I guess those are usually cases where it’s sort of unilateral transparency. I auction off something and I’m saying, “Okay, I’m using this algorithm to determine who gets, I don’t know, this broadband frequency or these things that are being auctioned off”.
(00:29:33): So, I guess, those are cases with sort of unilateral transparency. And that is, I guess, studied much more in part because it’s less… I mean, this also has been studied traditionally in game theory much more, in some sense. You can view it as some Stackelberg equilibrium. You can view all mechanism design as being a bit like finding Stackelberg equilibria. And I think Stackelberg’s analyses of game theory even precede Nash equilibrium.
Daniel Filan (00:30:04): Interesting.
Caspar Oesterheld (00:30:05): So that is very old.
Daniel Filan (00:30:07): Where Stackelberg equilibrium is: one person does a thing and then the next person does a thing. And so the next person is optimizing, given what the first person does, and the first person has to optimize “what’s really good for me, given that when I do something the other person will optimize what’s good for them based on what I do”.
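Daniel’s description translates directly into a brute-force search: for each leader action, compute the follower’s best response, then pick the leader action whose anticipated outcome is best for the leader. A sketch on a toy 2x2 game (the payoff numbers are invented for illustration):

```python
# payoffs[(leader_action, follower_action)] = (leader_payoff, follower_payoff)
payoffs = {
    ("A", "X"): (2, 1), ("A", "Y"): (4, 0),
    ("B", "X"): (1, 0), ("B", "Y"): (3, 2),
}

def stackelberg(payoffs):
    """Find the Stackelberg outcome of a two-player game by enumeration."""
    best_outcome = None
    for lead in sorted({l for l, _ in payoffs}):
        # The follower observes the leader's action and best-responds to it.
        replies = [f for l, f in payoffs if l == lead]
        reply = max(replies, key=lambda f: payoffs[(lead, f)][1])
        # The leader optimizes given that anticipated best response.
        if best_outcome is None or payoffs[(lead, reply)][0] > payoffs[best_outcome][0]:
            best_outcome = (lead, reply)
    return best_outcome

# The leader would love ("A", "Y") with payoff 4, but the follower would
# never pick "Y" after seeing "A"; anticipating that, the leader plays "B".
assert stackelberg(payoffs) == ("B", "Y")
```

The one-way structure is visible in the code: only the follower’s choice is conditioned on the other player’s move, which is the “unilateral transparency” Caspar mentions.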
Caspar Oesterheld (00:30:23): Yeah.
Daniel Filan (00:30:24): So people look at Stackelberg equilibria and these sorts of games and it’s a common thing. And it’s an interesting point that you can sort of think of it as one-way transparency.
Caspar Oesterheld (00:30:34): Yeah. I think one thing one could think about is how much humans are in these mutual transparency settings. So yeah, I already said for individual humans: if the two of us play a prisoner’s dilemma, I have some model of you, I can’t really read… So I don’t know, seems sort of speculative. So there’s this paper which I really like by Andrew Critch, Michael Dennis and Stuart Russell, all from CHAI where, of course, you graduated from. This is about program equilibrium as well.
(00:31:16): The motivating setting that they use is institution design. The idea there is that: institutions, you can view them as rational players, or something like that. They make decisions, and they play games with each other. Like, I don’t know, the US government plays a game with the German government or whatever. But institutions have some amount of transparency. They have laws that they need to follow. They have constitutions. They’re composed of lots of individuals, that in principle, one could ask… I don’t know, the German government could check all the social media profiles of all the people working for the US government and learn something about how these people interact with each other, or something like that. There’s some very concrete transparency there.
(00:32:09): In particular, some things are really just algorithmic type commitments. Like, I don’t know, “We don’t negotiate with terrorists”, or something like that. It’s specific, something that’s in the source code of a country in some sense. It’s specifying how it’s going to choose in particular interactions. I think that is a case where interactions between human organizations have this transparency. I think that’s some evidence that we could get similar things with AI.
(00:32:51): At the same time, it’s also interesting that this hasn’t motivated people to study this program equilibrium-style setting, which I think is probably because I think, as a computer scientist, it’s natural to think the constitution is basically just an algorithm. It’s also a little bit like, I don’t know, computer science people explain the world to everyone else by using computer programs for everyone. Like, “The mind is a program, and the constitution is just a program. We got it covered with our computer science stuff”, which maybe some people also don’t like so much. But I think it’s a helpful metaphor still.
Prior work: reachable equilibria and proof-based approaches

Daniel Filan (00:33:35): Fair enough. Okay. Some people do study program equilibria. Just to set up the setting for your papers: before the appearance to the world of Robust Program Equilibrium, what did we know about program equilibria beyond these simple programs that cooperate if your source code is the same as mine?
Caspar Oesterheld (00:33:56): Yeah. I guess we have some characterizations of the kind of equilibria, in general, that are allowed by these syntactic comparison-based programs. Not sure how much to go into that at this point, but yeah, maybe we’ll get into this later.
Daniel Filan (00:34:16): I think I can do this quickly. My understanding is basically, any equilibrium that’s better off for all the players than unilaterally doing what they want, you can get with program equilibrium. Maybe you have to have punishments as well, but something roughly like this. You can have programs being like, “You have to play this equilibrium. If you don’t, then I’ll punish you”. Just write up a computer program saying, “If you’re equal to me, and therefore play this equilibrium, then I’ll play this equilibrium. If you’re not, then I’ll do the punish action”.
Caspar Oesterheld (00:34:55): Yes. Yeah, that’s basically right.
Daniel Filan (00:34:58): Okay. Is it only basically right?
Caspar Oesterheld (00:35:01): No, I think it’s basically right… I think it’s fully right, sorry. [It’s just] “basically” in the way that all natural language descriptions… You can get anything that is better for everyone than what they can get if everyone punishes them, which might be quite bad.
(00:35:25): For example, in the prisoner’s dilemma, we had this nice story of how you can get mutual cooperation, but you can also get, I don’t know, one player cooperates 60% of the time, the other player cooperates 100% of the time. The reason why the 100% of the time cooperator doesn’t cooperate less is that the 60% cooperator says, “Yeah, if we’re not both submitting the program that plays this equilibrium, I’m going to always defect”. In the prisoner’s dilemma, you can get anything that is at least as good as mutual defection for both players. In some sense, almost everything can happen. It can’t happen that one player cooperates all the time, the other player defects all the time. Because then the cooperator would always want to defect. But yeah, that’s the basic picture of what’s going on here.
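The comparison-based construction behind this folk theorem can be sketched in a few lines of Python. This is an illustrative encoding, not Tennenholtz's exact formalism: the names, the string representation of source code, and the prisoner's-dilemma labels are all our choices. Each program receives both source texts and plays the agreed action only on an exact syntactic match, punishing otherwise.

```python
def grim_bot(my_source: str, opponent_source: str) -> str:
    """Play the agreed equilibrium action iff the opponent submitted
    syntactically identical source code; otherwise punish by defecting."""
    if opponent_source == my_source:
        return "C"   # the agreed-upon equilibrium action
    return "D"       # punishment action

# Stand-in for the program's actual source text:
src = "grim_bot source"
assert grim_bot(src, src) == "C"                # identical programs: equilibrium
assert grim_bot(src, "some other bot") == "D"   # anything else gets punished
```

Since any deviation triggers the punishment, any action profile that leaves every player above their punishment payoff can be enforced this way, which is the folk-theorem picture just described.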
(00:36:26): That has been known since Tennenholtz, which is one of these papers—I think the paper that coined the term “program equilibrium” and gave this syntactic comparison-based program, and this folk theorem, as it’s called, of what kinds of things can happen in equilibrium. After that, most papers have focused on this “how do we make this more robust” idea. In particular, what existed prior to the Robust Program Equilibrium paper are these papers on making things more robust by having the programs try to prove things about each other.
(00:37:11): Here’s maybe the simplest example of this that one doesn’t need to know crazy logic for. You could write a program… in the prisoner’s dilemma, you could write a program that tries to search for proofs of the claim “if this program cooperates, the other program will also cooperate”. Your program is now very large. It has this proof search system. Somehow, it can find proofs about programs. But basically, you can still describe it relatively simply as, “I try to find a proof that if I cooperate, the opponent cooperates. If I find one, I cooperate. Otherwise, I’ll defect”. It’s not that difficult to see that this kind of program can cooperate against itself. Because if it faces itself, it’s relatively easy to prove that if I cooperate, the opponent will cooperate. Because the statement is an implication where both sides of the implication arrow say exactly the same thing.
(00:38:25): At the same time, this is more robust, because this will be robust to changing the spaces and so on. It’s relatively easy to prove that if this program outputs cooperate, then this other program (which is the same, except that it has the spaces in different places, or switches things around in some way that doesn’t really matter) will also output cooperate. This is a basic proof-based approach that will work.
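To make the robustness point concrete, here is a toy Python contrast. This is not actual proof search, which is what the proof-based bots use; whitespace normalization is a deliberately crude stand-in. The point is only that exact source comparison breaks on a cosmetic change, while even a trivial normalization survives it.

```python
def normalize(source: str) -> str:
    """Crude normalization: remove all whitespace. A stand-in for the far
    stronger semantic robustness that proof-based bots get."""
    return "".join(source.split())

a = "if opp == me: return 'C'"
b = "if opp == me:  return 'C'"   # same program, one extra space

assert a != b                        # exact syntactic comparison fails
assert normalize(a) == normalize(b)  # normalized comparison succeeds
```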
(00:39:07): I think the first paper on this is by Barasz et al. I think there are two versions of this which have different first authors, which is a little bit confusing. I think on one of them, Barasz is the first author. On the other one, it’s LaVictoire. I think he’s American, so probably a less French pronunciation is correct.
Daniel Filan (00:39:37): I actually think he does say “Lah vic-twahr”.
Caspar Oesterheld (00:39:39): Oh, okay.
Daniel Filan (00:39:40): I think. I’m not 100% certain. Write in, Patrick, and tell us.
Caspar Oesterheld (00:39:48): Those papers first proposed these proof-based approaches. They actually do something that’s more clever, where it’s much harder to see why it might work. I described a version where the thing that you try to prove is “if I cooperate, the opponent will cooperate”. They instead just have programs that try to prove that the opponent will cooperate. You just do, “if I can prove that my opponent cooperates, I cooperate. Else, I defect”.
(00:40:16): It is much less intuitive that this works. Intuitively, you would think, “Surely, this is some weird infinite loop”. If this faces itself… I am going to think, “What does the opponent do?” Then, “Well, to prove anything about my opponent, I have to think about what they will do, and they’ll try to prove something about me”. You run into this infinite circle. You would think that it’s basically the same as… One very naive program that you might write is just, “Run the opponent program. If it cooperates, cooperate. Otherwise, defect”. This really does just run in circles.
(00:40:56): You would think that if you just do proofs instead of running the opponent program, you have the same issue. It turns out that you can find these proofs, which follows from a somewhat obscure result in logic called Löb’s theorem, which is a little bit related to Gödel’s second incompleteness theorem. With Löb’s theorem it’s relatively easy to prove, but it’s a very “you kind of need to just write it down” proof, and then it’s relatively simple. But it’s hard to give an intuition for it, I think.
Daniel Filan (00:41:47): Also, it’s one of these things that’s hard to state unless you’re careful and remember… So I’ve tried to write it down. It’s like, if you can prove that a proposition would be true… Okay, take a proposition P. Löb’s theorem says that if you can prove that “if you could prove P, then P would be true”, then, you would be able to prove P. If you can prove that the provability of a statement implies its truth, then you could prove the thing. The reason that this is non-trivial is it turns out that you can’t always prove that if you could prove a thing, it would be true because you can’t prove that your proving system works all the time. You can construct funky self-referential things that work out. Unless I have messed up, that is Löb’s theorem.
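Daniel's statement, written with the provability modality (a standard rendering, not a quote from the episode):

```latex
% Löb's theorem, with \Box P read as "P is provable in the system":
\Box(\Box P \to P) \;\to\; \Box P
% Gödel's second incompleteness theorem is the special case P = \bot:
% if the system proved its own consistency, \Box(\Box\bot \to \bot),
% Löb's theorem would give \Box\bot, i.e. the system proves a
% contradiction. So a consistent system cannot prove its own consistency.
```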
(00:42:49): My recollection is the way it works in this program is basically, you’re checking if the other program would cooperate… Imagine we’re both these “defect unless proof of cooperation” programs. I’m like, “Okay, I want to check if you would cooperate given me”. “If you would cooperate given me” is the same as “if I would cooperate given you”… Here’s the thing that I definitely can prove. If you can prove that “if I can prove that I cooperate, then you cooperate”. But crucially, the “I” and the “you” are actually just the same, because we’re at the same program. If it’s provable that “if it’s provable, then we cooperate”, then we cooperate. Löb’s theorem tells us that we can therefore conclude that it is provable that we cooperate. Therefore, we in fact cooperate.
(00:43:48): My understanding is: so what do we actually do? I think we prove Löb’s theorem, then apply it to our own situation, and then we both prove that we both cooperate, and then we cooperate. I think that’s my recollection of how it’s supposed to go.
Caspar Oesterheld (00:44:01): At least that would be one way.
Daniel Filan (00:44:03): Yeah, I suppose there might be even shorter proofs.
Caspar Oesterheld (00:44:06): Yeah, that is basically correct. Yeah, good recollection of the papers.
Daniel Filan (00:44:14): Yeah. There were a few years in Berkeley where every couple weeks somebody would explain Löb’s theorem to you, and talk about Löbian cooperation. Eventually, you remembered it.
Caspar Oesterheld (00:44:25): Okay, nice. I think it’s a very nice idea. I actually don’t know how they made this connection. Also, Löb’s theorem is relatively obscure, I think in part because it doesn’t prove that much more than Gödel’s second incompleteness theorem. Gödel’s second incompleteness theorem says that a logical system can’t prove its own consistency. But here, it’s the same thing: you can’t prove “if I can prove something, it’s true” without just being able to prove the thing.
(00:45:11): I think that’s probably one reason why Löb’s theorem isn’t very widely known. I feel like it’s a result that for this thing, it happens to be exactly the thing you need. Once you have it written down, this cooperation property follows almost immediately. But…
Daniel Filan (00:45:32): How they made the connection?
Caspar Oesterheld (00:45:33): Yeah, how did they…
Daniel Filan (00:45:34): I think I know this, or I have a theory about this. Originally, before they were talking about Löbian cooperation, there was this Löbian obstacle or Löbstacle.
Caspar Oesterheld (00:45:45): Yeah, the Löbstacle.
Daniel Filan (00:45:46): Yeah, to self-trust. You might want to say, “Oh, I’m going to create a successor program to me, and if I can prove that the successor program is going to do well, then…” Or all the programs are going to be like, “If I can prove a thing is good, then I’ll do it.” And can I prove that a program that I write is going to be able to do stuff? And it’s a little bit rough, because if I can prove that you could prove that a thing is good, then I could probably prove that the thing was good myself, and so why am I writing the [successor].
(00:46:14): Maybe this just caused the Löb’s theorem to be on the mind of everyone. I don’t know. I have this theory. But I don’t think I’ve heard it confirmed by any of the authors.
Caspar Oesterheld (00:46:24): Okay. It’s a good theory, I think.
Daniel Filan (00:46:26): Okay. We had this Löbian cooperation idea floating around. This is one thing that was known before these papers we’re about to discuss. Is there anything else that’s important?
Caspar Oesterheld (00:46:45): Yeah, there was a little bit more extension of this Löbian idea. One weird thing here is that we have these programs, “if I can prove this, then I cooperate”. Of course, whether I can prove something isn’t decidable. It’s not like there’s an algorithm that tries for 10 hours and then gives up; that’s not what provability would normally mean.
(00:47:17): There’s a paper by Andrew Critch from I think 2019, that shows that actually, Löb’s theorem still works if you consider these bounded… You try, with a given amount of effort… Specifically, you try all proofs of a given length, I think, is the constraint. It shows that some version of Löb’s theorem still holds, and that it’s still enough to get this Löbian cooperation if the two players consider proofs up to a long enough length. They can still cooperate.
Daniel Filan (00:47:55): And it doesn’t have to be the same length.
Caspar Oesterheld (00:47:56): Yeah, it doesn’t have to be the same length, importantly.
Daniel Filan (00:47:58): It just has to be the length of that paper.
Caspar Oesterheld (00:48:00): Yeah.
Daniel Filan (00:48:01): Right. Yeah, yeah, which is great. Very fun result. So there’s a Löbian cooperation. There’s parametric bounded Löbian cooperation. Anything else of note?
Caspar Oesterheld (00:48:12): Yeah. I think one other thing that is interesting—this is not really an important fact, but I think it’s an important thing to understand—is that for the Löbian bots, it matters that you try to find a proof that the other player cooperates, rather than trying to find a proof that the other player defects. The same is true for this implication case that I described. If you try to check “is there a proof that if I defect, the opponent will defect?”, I’m not sure why you would do that.
Daniel Filan (00:49:06): You can imagine similar things, like, “Okay, if I defect, will you cooperate with me naively like a sucker? If so, then I’m just definitely going to defect”.
Caspar Oesterheld (00:49:24): Right. Then I guess you would check for some other property.
Daniel Filan (00:49:32): Or you would check “if I defect, will you defect? If so, then I’ll cooperate”. Maybe that would be the program.
Caspar Oesterheld (00:49:37): Yeah, maybe that is even the more sensible program. I’m not sure whether this cooperates against itself.
Daniel Filan (00:49:50): It must cooperate, right?
Caspar Oesterheld (00:49:51): Okay, let’s think …
Daniel Filan (00:49:55): Suppose we’re the same program. Then it’s basically like: if provable defect “if and only if provable defect”, then cooperate, else defect. But provable defect, if and only if provable defect…. It’s the same… You can just see that it’s the same expression on both sides.
Caspar Oesterheld (00:50:11): Right, I agree. Yeah, this will cooperate. This is not an equilibrium though. If the opponent just submits a DefectBot, you’re going to cooperate against it, right?
Daniel Filan (00:50:22): Yes, it is a program, it is not an equilibrium. I got us off track, I fear.
(00:50:32): But you were saying that you want to be proving the good case, not the bad case.
Caspar Oesterheld (00:50:39): Yeah, maybe let’s do the version from the paper, “if I can prove that you cooperate, I cooperate. Otherwise, I defect”. If you think about it, in this program, it doesn’t really matter that mutual cooperation is the good thing, and mutual defection is the bad thing. Ultimately, it’s just we have two labels, cooperate and defect, we could call them A and B instead. It’s just, “if I can prove that you output label A, I also output label A. Otherwise, I’ll output label B”.
(00:51:12): Regardless of what these labels are, this will result in both outputting label A. If label A happens to be defect rather than cooperate, these will defect against each other. It matters that you need to try the good thing first or something like that.
Daniel Filan (00:51:29): Yeah, yeah. I guess, maybe the most intuitive way of thinking about it, which… I haven’t thought about it a ton, so this may not be accurate. But it feels like you’re setting up a self-fulfilling prophecy, or if the other person happens to be you, then you’re setting up a self-fulfilling prophecy. You want to set up the good self-fulfilling prophecy, not the bad self-fulfilling prophecy.
(00:51:51): I think this is true in this setting. My impression is that there’s also decision theory situations where you really care about the order in which you try and prove things about the environment. I forget if self-fulfilling prophecy is the way to think about those situations as well, even though they’re conceptually related. We can perhaps leave that to the listeners if it’s too hard to figure out right now.
(00:52:15): Okay. Now that we’ve seen this sad world that’s confusing and chaotic, perhaps we can get to the light of your papers.
Caspar Oesterheld (00:52:26): Okay. I should say, I really like the proof-based stuff. We can talk a little bit about what maybe the upsides and downsides are. Yeah, it is confusing. I would think that one issue with it is that in practice, what programs can one really prove things about?
Daniel Filan (00:52:49): Yeah, my intuition is that the point of that work is it seems like it’s supposed to be modeling cases where you have good beliefs about each other that may or may not be exactly proofs. You hope that something like Löb’s theorem holds in this more relaxed setting, which it may or may not. I don’t exactly know.
Caspar Oesterheld (00:53:07): Yeah, I agree. I also view it this way, which is a more metaphorical way. There’s some distance between the mathematical model, and the actual way it would work then.
The basic idea of Robust Program Equilibrium

Daniel Filan (00:53:26): But I want to hear about your paper.
Caspar Oesterheld (00:53:28): Right. Okay. Now, let’s get to my paper. My paper is on whether we can get these cooperative equilibria, not by trying to prove things about each other, but just by simulating each other. I already mentioned that there’s a super naive but intuitive approach: you’d like to run the opponent with yourself as input, see if they cooperate, and if they do, cooperate, otherwise defect. Just this very obvious intuition, maybe from tit for tat in repeated games, that you want to reward the other player for cooperating, and get a good equilibrium that way.
(00:54:21): The problem with this, of course, is that it doesn’t hold if both players do this. I guess this would work if you play this sequentially. We talked about the Stackelberg stuff earlier. If I submit a program first, and then you submit a program second, then it would work for me to submit a program that says, “Run your program, cooperate if it cooperates, defect if your program defects”, and then you would be incentivized to cooperate. But if both players play simultaneously, infinite loop, so it kind of doesn’t work.
Daniel Filan (00:54:58): If we had reflective oracles, then it could work, depending on the reflective oracle. But that’s a whole other bag of worms.
Caspar Oesterheld (00:55:03): Yeah, I guess reflective oracles… Yeah, I probably shouldn’t get into it. But it’s another model that maybe is a little bit in between the proof-based stuff and the simulation stuff.
Daniel Filan (00:55:18): At any rate.
Caspar Oesterheld (00:55:19): Yeah. It turns out there’s a very simple fix to this issue, which is that instead of just always running the opponent and cooperating if and only if they cooperate, you can avoid the infinite loop by just cooperating with epsilon probability, and only if this epsilon probability clause doesn’t trigger, only then do you run the other program. So your program is just: flip a very biased coin—epsilon is a small number, right? You check whether some low probability event happens. If it does, you just cooperate without even looking at the opponent program. Otherwise, you do simulate the other program and you copy whatever they do. You cooperate if they cooperate, defect if they defect.
(00:56:23): The idea is that, basically, it’s the same intuition as “just simulate the opponent, and do this instantaneous tit-for-tat”. Except that now, you don’t run into this running for infinitely long issue, because it might take a while, but eventually, you’re going to hit these epsilon clauses. If we both submit this program, then probably, there’s some chance that I’m immediately cooperating, but most likely, I’m going to call your program which might then also immediately cooperate. Most likely, it’s going to call my program again, and so on. But at each point, we have a probability epsilon of halting, and with probability one will eventually halt.
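The construction can be sketched in Python. This is our encoding, not the paper's formal one: a program is a function that receives the opponent's program and returns "C" or "D", and the epsilon value is arbitrary.

```python
import random

def fairbot(opponent, epsilon=0.2):
    """An ϵGroundedFairBot sketch for the prisoner's dilemma: with
    probability epsilon, cooperate outright; otherwise simulate the
    opponent program (with ourselves as its input) and copy its move."""
    if random.random() < epsilon:
        return "C"
    return opponent(fairbot)   # run the opponent against us, mirror its move

def defectbot(opponent):
    """Unconditional defector, for contrast."""
    return "D"

# Self-play halts with probability 1 (the nesting depth is geometric,
# since each level fires its epsilon clause with probability epsilon),
# and every halting path returns "C": the bots cooperate.
assert fairbot(fairbot) == "C"
# Against an unconditional defector, fairbot defects with prob. 1 - epsilon.
assert defectbot(fairbot) == "D"
```

Note that in self-play the result is deterministic even though the halting time is random: every leaf of the recursion returns "C", and the copying step propagates it up unchanged.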
Daniel Filan (00:57:16): This is a special case of this general construction you have in the paper, right?
Caspar Oesterheld (00:57:26): Yeah. This is for the prisoner’s dilemma in particular, where you have these two actions that happen to be cooperate and defect. In general, there are two things that you can specify here, like you specify what happens with the epsilon probability, then the other thing that you specify is what happens if you simulate the other player, you get some action out of the simulation, and now you need to react to this in some way.
(00:57:57): The paper draws this connection between these ϵGroundedπBots, as they’re called, and repeated games where you can only see the opponent’s last move. It’s similar to that, where: okay, maybe this epsilon clause where you don’t look at your opponent is kind of like playing the first round where you haven’t seen anything of your opponent yet. I guess, in the prisoner’s dilemma, there’s this well-known tit for tat strategy which says: you should cooperate in the beginning, and then at each point, you should look at the opponent’s last move, and copy it, cooperate if they cooperate. But in general, you could have these myopic strategies for these repeated games where you do something in the beginning, and then at each point, you look at the opponent’s last move, and you react to it in some way. Maybe do something that’s equally cooperative or maybe something that’s very slightly more cooperative to slowly get towards cooperative outcomes or something like that. You could have these strategies for repeated games. You can turn any of these strategies into programs for the program game.
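The general construction Caspar describes (an opening move for the epsilon clause, plus a reaction to the one simulated move of the opponent) can be sketched as a factory function, again in our illustrative encoding rather than the paper's formal one:

```python
import random

def make_pibot(initial, react, epsilon=0.2):
    """Turn a memory-one repeated-game strategy (an opening move `initial`
    plus a reaction `react` to the opponent's last observed move) into a
    program for the one-shot program game: an ϵGroundedπBot sketch."""
    def bot(opponent):
        if random.random() < epsilon:
            return initial()           # "first round": act without looking
        return react(opponent(bot))    # simulate the opponent, react to it
    return bot

# Tit for tat (open with "C", then copy) becomes ϵGroundedFairBot:
fairbot = make_pibot(lambda: "C", lambda move: move)
assert fairbot(fairbot) == "C"
```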
Daniel Filan (00:59:21): One thing that I just noticed about this space of strategies, this is strategies that only look at your opponent’s last action, right?
Caspar Oesterheld (00:59:29): Yes.
Daniel Filan (00:59:29): In particular, there’s this other thing you can do which is called win-stay, lose-switch, where if you cooperated against me, then I just do whatever I did last time. If you defected against me, then I do the opposite of what I did last time. It seems like this is another thing that your next paper is going to fix. But in this strategy, it seems like I can’t do this, right?
Caspar Oesterheld (00:59:58): Yes. Yeah, it’s really very restrictive. Most of the time, you’re going to see one action of the opponent, you have to react to that somehow, and that’s it.
Daniel Filan (01:00:13): Yeah. But it’s this nice idea. It’s basically this connection between: if you can have a good iterated strategy, then you can write a good computer program to play this mutually transparent program game, right?
Caspar Oesterheld (01:00:28): Yeah.
Daniel Filan (01:00:29): How much do we know about good iterated strategies?
Caspar Oesterheld (01:00:34): That is a good question. For the iterated prisoner’s dilemma, there’s a lot about this. There are a lot of these tournaments for the iterated prisoner’s dilemma. I’m not sure how much there is for other games, actually. Yeah, you might have iterated stag hunt or something like that? I guess, maybe for a lot of the other ones, it’s too easy or so.
(01:01:03): There’s some literature. You can check the paper. There are various notions that people have looked at, like exploitability of various strategies, which is how much more utility can the other player get than me if I play the strategy? For example, tit for tat, if the opponent always defects, you’re going to get slightly lower utility than them because in the first round, you cooperate, and then they defect. Then in all subsequent rounds, both players defect. It’s very slightly exploitable, but not very much.
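Caspar's tit-for-tat example can be checked numerically. The payoff numbers below (T=5, R=3, P=1, S=0) are the conventional prisoner's-dilemma values, not numbers fixed by the transcript:

```python
# Row entries are (payoff to player A, payoff to player B).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def iterate(strat_a, strat_b, rounds):
    """Play two memory-one strategies against each other; each strategy
    sees only the opponent's previous move (None in round one)."""
    score_a = score_b = 0
    last_a = last_b = None
    for _ in range(rounds):
        a, b = strat_a(last_b), strat_b(last_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        last_a, last_b = a, b
    return score_a, score_b

tit_for_tat = lambda opp_last: "C" if opp_last is None else opp_last
always_defect = lambda opp_last: "D"

tft, alld = iterate(tit_for_tat, always_defect, rounds=100)
# Tit for tat loses exactly one round's worth of exploitation (0 vs. 5),
# then both defect forever: small, bounded exploitability.
assert tft == 99 and alld == 104
assert alld - tft == 5
```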
(01:01:45): These notions that have been studied, and in my paper, I transfer these notions… If you take a strategy for the iterated prisoner’s dilemma, or for any repeated game, it has some amount of exploitability, and the analogous ϵGroundedπBot strategy has the same amount of exploitability. This is also an interesting question in general. How much qualitatively different stuff is there even in this purely ϵGroundedπBot space? If all you can do is look at the one action of the opponent and react to this action, how much more can you even do than things that are kind of like this sort of tit-for-tat…? Like I mentioned, in more complex games maybe you want to be slightly more cooperative… I don’t know. After a bunch of simulations you eventually become very cooperative or something like that.
Daniel Filan (01:02:52): Okay. I have a theory. In my head I’m thinking: okay, what’s the general version of this? And I can think of two ways that you can generalize, right? Here’s what I’m imagining you should do, in general. Okay. You have a game, right? First you think about: okay, what’s the good equilibrium of this game, right? And then what do I want to do if the other person doesn’t play ball? It seems like there are two things I could do if the other person doesn’t join me in the good equilibrium. Firstly, I could do something to try and punish them. And secondly, I can do something that will make me be okay, be good enough no matter what they do. I don’t exactly know how you formalize these, but my guess is that you can formalize something like these. And my guess is that these will look different, right?
(01:03:43): You can imagine saying, “Okay, with epsilon probability, I do my part to be in the good equilibrium, and then the rest of the time I simulate what the other person does. If they play in the good equilibrium I play in the good equilibrium. If they don’t play in the good equilibrium then, depending on what I decided earlier, I’m either going to punish them or I’m going to do a thing that’s fine for me”. Or you can imagine that I randomize between those. Maybe there’s some “best of both worlds” thing with randomizing. I don’t exactly know. Do you have a take on that?
Caspar Oesterheld (01:04:14): I mean, there’s at least one other thing you can do, right, which is try to be slightly more cooperative than them in the hope that you just-
Daniel Filan (01:04:26): Right.
Caspar Oesterheld (01:04:31): Imagine the repeated game, right? At any given point you might want to try to be a bit more cooperative in the hope that the other person will figure this out, that this is what’s going on, and that you’re always going to be a little bit more cooperative than them. And that this will lead you to the good equilibrium or to a better equilibrium than what you can get if you just punish. I mean, punish usually means you do something that you wouldn’t really want to do, you just do it to incentivize the other player. Or even the “okay, well, you’re going to go and do whatever but I’m just going to do something that makes me okay”.
Daniel Filan (01:05:15): So is the “be more cooperative than the other person” thing… I feel like that’s already part of the strategy. Okay, so here’s the thing I could do. With epsilon probability, do the good equilibrium, then simulate what the opponent does. If they do the good thing, if they’re in the good equilibrium, then I join the good equilibrium. If they don’t join the good equilibrium, then with epsilon probability I play my part in the good equilibrium, and otherwise I do my other action. The epsilon probability for being slightly more cooperative, you could have just folded that into the initial probability, right?
Caspar Oesterheld (01:05:51): Right. The difference is you can be epsilon more cooperative in a deterministic way, right? With this epsilon probability thing, some of the time you play the equilibrium that you would like to play. This alternative proposal is that you always become slightly more cooperative, which is… I’m not sure how these things play out. I would imagine that for characterizing what the equilibria are probably all you need is actually the punishment version. But I would imagine that if you want to play some kind of robust strategy you would sometimes move into a slightly more cooperative direction or something like that.
(01:06:51): You could have all of these games where there are lots of ways to cooperate and they sort of vary in how they distribute the gains from trade or something like that, right? Then there’s a question of what exactly happens if your opponent is… They play something that’s kind of cooperative but sort of in a way that’s a little bit biased towards them. I guess maybe you would view this as just a form of punishment if you then say, “Well, I’m going to stay somewhat cooperative but I’m going to punish them enough to make this not worthwhile for them” or something like that.
Daniel Filan (01:07:33): If there’s different cooperative actions that are more or less cooperative then it definitely makes sense. At the very least I think there are at least two strategies in this space. I don’t know if both of them are equilibria to be fair.
Are ϵGroundedπBots inefficient?

Daniel Filan (01:07:46): Okay. There are a few things about this strategy that I’m interested in talking about. We’re both playing the same “tit-for-tat but in our heads” strategy, right? The time that it takes us to eventually output something is O(1/ε), right? On average, because in each round we finish with probability epsilon, and then it takes about 1/ε rounds for that to happen, right?
Caspar Oesterheld (01:08:31): Yeah, I think that’s roughly right. I mean, it’s a geometric series, right? I think it’s roughly one over epsilon.
Daniel Filan (01:08:40): It’s (1 − ε)/ε, which is very close to 1/ε.
Caspar Oesterheld (01:08:42): Yes.
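Written out: at each level of nesting, the recursion halts with probability ε, so the number of nested simulation calls N before some copy's epsilon clause fires is geometrically distributed:

```latex
\Pr[N = k] = \varepsilon\,(1-\varepsilon)^{k}, \qquad k = 0, 1, 2, \ldots
\qquad
\mathbb{E}[N] = \sum_{k=0}^{\infty} k\,\varepsilon\,(1-\varepsilon)^{k}
             = \frac{1-\varepsilon}{\varepsilon}
             \approx \frac{1}{\varepsilon}
\quad \text{for small } \varepsilon.
```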
Daniel Filan (01:08:45): That strikes me as a little bit wasteful, right, in that… So the cool thing about the Löbian version was: the time it took me to figure out how to cooperate with myself was just the time it took to do the proof of Löb’s theorem no matter how… It was sort of this constant thing. Whereas with the epsilon version, the smaller the epsilon it is, the longer it seems to take for us. And we’re just going back and forth, right? We’re going back and forth and back and forth and back and forth. I have this intuition that there’s something wasteful there but I’m wondering if you agree with that.
Caspar Oesterheld (01:09:25): Yeah, I think it’s basically right. Especially if you have a very low epsilon, right, there’s a lot of just doing the same back-and-forth thing for a long time without getting anything out of it. One thing is that you could try to speed this up, right, if you… So let’s say I run your program, right? Instead of just running it in a naive way I could do some analysis first.
(01:10:11): If you have a compiler of a computer program, it might be able to do some optimizations. And so maybe I could analyze your program, analyze my program, and I could tell: okay, what’s going to happen here is that we’re going to do a bunch of nothing until this epsilon thing triggers. Really instead of doing this actually calling each other, we just need to sample the depth of simulations according to this geometric distribution, the distribution that you get from this halting with probability epsilon at each step. You could do this analysis, right? Especially if you expect that your opponent will be an ϵGroundedFairBot, you might explicitly put in your compiler or whatever something to check whether the opponent is this ϵGroundedFairBot. And if so, we don’t need to do this actually calling each other, we just need to sample the depth.
(01:11:26): In some sense, the computation that you need to do is: sample the depth, then sample from… whoever halts at that point, sample from their ‘base’ distribution, their blind distribution. And then sort of propagate this through all of the functions that both players have for taking a sample of the opponent’s strategy and generating a new action. If this is all very simple then… in principle, your compiler could say, for the ϵGroundedFairBot in particular—sorry, the ϵGroundedFairBot is the version for the prisoner’s dilemma. In principle, your compiler could directly see: “okay, what’s going to happen here? Well, we’re going to sample from the geometric distribution, then ‘cooperate’ will be sampled, and then a bunch of identity functions will be applied to this”. So this is just ‘cooperate’, without needing to actually run anything, do recursive calls with a stack, and so on. Probably you don’t actually need any of this.
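Caspar's hypothetical compiler shortcut for fairbot self-play can be sketched in Python (this is our illustration of the idea, not anything from the paper): sample the halting depth directly from the geometric distribution, take the halting copy's blind move, and fold each outer copy's reaction back up, with no recursive calls at all.

```python
import random

def shortcut_selfplay(blind="C", react=lambda m: m, epsilon=0.2):
    """Shortcut evaluation of ϵGroundedFairBot self-play: instead of
    actually recursing, sample the nesting depth at which some copy's
    epsilon clause fires, then propagate that copy's blind move up."""
    depth = 0
    while random.random() >= epsilon:   # geometric race to an epsilon clause
        depth += 1
    move = blind                        # the halting copy's blind move
    for _ in range(depth):
        move = react(move)              # each outer copy reacts to the sample
    return move

# For fairbot, the blind move is "C" and the reaction is the identity,
# so the whole computation collapses to "cooperate".
assert shortcut_selfplay() == "C"
```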
Daniel Filan (01:12:52): There’s something intuitively very compelling about: okay, if I can prove that the good thing happens or whatever, then do the proof-based thing. If I can’t prove anything then do the simulation stuff. It seems intuitively compelling. I imagine you probably want to do some checks of whether that works on the proof-based side, depending on the strategy you want to implement.
Caspar Oesterheld (01:13:15): I mean, the thing I’m proposing is not to have the proof fallback, but just that you… You always do the ϵGroundedFairBot thing, for example, or the ϵGroundedπBot. Instead of calling the opponent program in a naive way where you actually run everything, you throw it in this clever compiler that analyzes things in some way. And maybe this compiler can do some specific optimizations but it’s not a fully general proof searcher or anything like that.
Daniel Filan (01:13:52): I mean, it’s checking for some proofs, right?
Caspar Oesterheld (01:13:54): Yeah, it’s checking for some specific kinds of proofs… I mean, that’s how modern day compilers I assume work, right, is that they understand specific kinds of optimizations and they can make those but they don’t have a fully general proof search or anything like that.
Daniel Filan (01:14:15): Sorry. When you said that I was half listening and then half thinking about a different thing, which is: you could imagine ϵGroundedFairBot which is: first, if your source code is equal to mine, then cooperate. Else, if your source code is the version of ϵGroundedFairBot that doesn’t first do the proof search, then cooperate. Else, with probability epsilon cooperate, probability one minus epsilon, do what the other person does, right?
(01:14:41): So that particular version probably doesn’t actually get you that much because the other person added some spaces in their program. And then I’m like but you could do some proof stuff, insert it there. I guess there are a few possibilities here. But it does seem like something’s possible.
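A minimal sketch of the syntactic-equality variant Daniel describes, showing how a single added space defeats it (the function name and the source strings are hypothetical):

```python
def equality_fair_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate iff the opponent's source is textually identical to ours.
    This is the brittle check from the discussion: there is no proof
    search or normalization, so any cosmetic edit to an otherwise
    identical program breaks the match."""
    return "C" if opponent_source == my_source else "D"

src = "def bot(opp): ..."  # stand-in source text
assert equality_fair_bot(src, src) == "C"
assert equality_fair_bot(src, src + " ") == "D"  # one extra space breaks it
```

This is why Daniel suggests layering “some proof stuff” on top: one wants semantic, not merely textual, equivalence.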
Compatibility of proof-based and simulation-based program equilibria

Caspar Oesterheld (01:15:06): These different kinds of ways of achieving this more robust program equilibrium, they are compatible with each other. If I do the ϵGroundedFairBot and you do the Löbian bot, they are going to cooperate with each other.
Daniel Filan (01:15:29): You’re sure?
Caspar Oesterheld (01:15:30): I’m pretty sure, yeah.
Daniel Filan (01:15:31): Okay. You’ve probably thought about this.
Caspar Oesterheld (01:15:32): I wrote a paper about it. It’s not a real paper, it’s sort of like a note on this. Maybe let’s take the simplest versions or whatever, we don’t need to go down the Löb’s theorem path again. Let’s take the simplest version, which is just: if I can prove “if I cooperate, you cooperate”, then cooperate. If you’re the Löbian bot and I’m the ϵGroundedFairBot, you can prove that if you cooperate I will cooperate, right? Well, I’m epsilon times…
Daniel Filan (01:16:13): Sorry. Can you say that without using “you” and “I”?
Caspar Oesterheld (01:16:15): Okay. Am I allowed to say “I submit a program that’s”-
Daniel Filan (01:16:20): Yes, you are.
Caspar Oesterheld (01:16:20): Okay. So I submit a program that is just the ϵGroundedFairBot, so with epsilon probability cooperate, otherwise simulate you and do what you do. And your program is: if it’s provable that “if this program cooperates, the other program cooperates”, then cooperate, and otherwise, defect. Okay. So let’s think about your program-
Daniel Filan (01:16:54): The proof-based one.
Caspar Oesterheld (01:16:55): The proof-based one. So your program will try to prove: if it cooperates, my program, ϵGroundedFairBot will cooperate.
Daniel Filan (01:17:09): Okay. So the proof-based program is trying to prove, “if proof-based program cooperates then sampling program cooperates”. And it will be able to prove that. I think the other implication is slightly trickier but maybe you only care about the first implication, or you care about it more.
Caspar Oesterheld (01:17:24): Sorry, what is the other implication?
Daniel Filan (01:17:25): That if the sampling-based program cooperates then the proof-based one will cooperate. Maybe that’s not so bad.
Caspar Oesterheld (01:17:34): But do you actually need this? The proof-based program, it will succeed in proving this implication, right, and it will, therefore, cooperate.
Daniel Filan (01:17:45): And that’s how it proves that it will do it in the other direction?
Caspar Oesterheld (01:17:48): I mean, that’s how one can then see that the ϵGroundedFairBot will also cooperate because it will… Well, with epsilon probability it cooperates anyway. And with the remaining probability it does whatever the proof-based thing does, which we’ve already established is to cooperate. Sorry, does this leave anything open?
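The argument can be compressed into one line (a sketch using the two programs just described): the Löbian bot proves “if I cooperate, the opponent cooperates” and therefore plays C, and given that,

```latex
\Pr[\epsilon\text{GroundedFairBot plays } C]
  = \underbrace{\epsilon}_{\text{blind cooperation}}
  \;+\; (1-\epsilon)\cdot
    \underbrace{\Pr[\text{L\"obian bot plays } C]}_{=\,1}
  \;=\; 1.
```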
Daniel Filan (01:18:03): I think I was just thinking about a silly version of the program where the proof-based thing is checking: can I prove that if my opponent will cooperate then I will cooperate? But I think you wouldn’t actually write this because it doesn’t make any sense.
Caspar Oesterheld (01:18:22): No. That seems harder though. I don’t know. Maybe if we think about it for two minutes we’ll figure it out. I think one wouldn’t submit this program.
Cooperating against CooperateBot, and how to avoid it

Daniel Filan (01:18:32): I next want to ask a different question about this tit-for-tat-based bot. This bot is going to cooperate against CooperateBot, right, the bot that always plays cooperate? That seems pretty sad to me, right? I’m wondering how sad do you think that this is?
Caspar Oesterheld (01:18:53): I’m not sure how sad. Okay, I have two answers to this. The first is that I think it’s not so obvious how sad it is. And the second is that I think this is a relatively difficult problem to fix. On how sad is this: I don’t know. It sort of depends a little bit on what you expect your opponent to be, right? If you imagine that you’re this program, you’ve been written by Daniel, and you run around the world, and you face opponents. And most of the opponents are just inanimate objects that weren’t created by anyone for strategic purposes. And now you face the classic rock that says “cooperate” on it. It happens to be a rock that says “cooperate”, right? You don’t really want to cooperate against that.
(01:19:49): Here’s another possibility. We play this program equilibrium game, literally, and you submit your program, right? And you know that the opponent program is written by me, by Caspar, who probably thought about some strategic stuff, right? Okay, it could be that I just wrote a CooperateBot, right, and that you can now get away with defecting against it. But maybe you could also imagine that maybe there’s something funny going on. And so for example, one thing that could be going on is that I could… Here’s a pretty similar scheme for achieving cooperation in the program equilibrium game, which is based on not the programs themselves mixing but the players mixing over what programs to submit. And so I might-
Daniel Filan (01:20:39): Mixing meaning randomizing?
Caspar Oesterheld (01:20:40): Yeah, randomizing. Very good. So I might randomize between the program that just cooperates—the CooperateBot, the program that cooperates if and only if the opponent cooperates against CooperateBot—so it’s sort of a second-order CooperateBot, something like that. And then you can imagine how this goes on, right? Each of my programs is some hierarchy of programs that check that you cooperated against the one one lower [down] the list. In some sense this is similar to the ϵGroundedFairBot, I guess. You can look at my program and maybe I could just defect or something like that. But the problem is you might be in a simulation of the programs that are higher in the list. If I submit this distribution, you would still want to cooperate against my CooperateBot, of course. So that is one reason to want to cooperate against CooperateBot.
Daniel Filan (01:22:00): It suddenly means that it really matters which things in my environment I’m modeling as agents and which things in my environment I’m modeling as non-agents, right? Because in my actual environment, I think there are many more non-agents than there are agents. So take this water bottle, right? Not only do I have to model it as a non-agent, but it seems like maybe I’ve also got to be modeling what are the other things it could have done if physics were different, right? It seems like if I have this sort of attitude towards the world a bunch of bad things are going to happen, right?
(01:22:43): And also, if I’m in a strategic setting with other agents that are trying to be strategic, I think you do actually want to be able to say things like “Hey, if I defected would you cooperate anyway? In that case, I’ll just defect. But if your cooperation is dependent on my cooperation then I’m going to cooperate”. It’s hard to do with this construction because I’m checking two things and that explodes into a big tree. But this seems to me like something that you do want to do in the program equilibrium world. I guess those are two things. I’m wondering what your takes are.
Caspar Oesterheld (01:23:29): Yeah, it would be nice to know how to do the: for this given opponent program, could my defecting make the opponent defect? I think a program that exploits CooperateBot and cooperates against itself in some robust way, I agree that this would be desirable. I guess we can say more about to what extent this is feasible. I think in some sense one does just have to form the beliefs about what the water bottle could have been and things like that. I guess with the water bottle—I don’t know, I mean, it’s sort of a weird example. But with the water bottle, I guess, you would have to think about: do you have a reason to believe that there’s someone who’s simulating what you do against the water bottle, and depending on that does something, right?
(01:24:37): In the strategic setting where you know that the opponent program is submitted by Caspar or by someone who knows a little bit about this literature, you just have a very high credence that if you face a CooperateBot probably something funny is going on, right?
(01:24:56): You have a high credence that there are some simulations being run of your program that check what your program does against various opponents. You have to optimize for that case much more than you optimize for the case where your opponent is just a CooperateBot. Whereas with a water bottle, you don’t really have this, right? I don’t know. Why would someone simulate like “okay, the water bottle could have been—”
Daniel Filan (01:25:22): I mean, people really did design this water bottle by thinking about how people would use it, right? I think I have a few thoughts there. Firstly, if I’m just naively like, “did people change how this water bottle would work depending on how other people would interact with it?” That’s just true. I mean, they didn’t get the water bottle itself to do that, so maybe that’s the thing I’m supposed to check for.
(01:25:46): It’s also true that if you go to real iterated, mutually transparent prisoner’s dilemmas, people do actually just write dumb programs in those. And it’s possible that okay, these are played for 10 bucks or something and that’s why people aren’t really trying. But in fact, some people are bad at writing these programs and you want to exploit those programs, right?
(01:26:22): And I also have this issue which is: it seems like then what’s going on is my overall program strategy or something is: first, check if I’m in a situation where I think the other program was designed to care about what I am going to do, then cooperate, otherwise defect. Maybe this is not so bad in the simulation setting. In the proof-based setting, this would be pretty bad, right, because now it’s much harder to prove nice things about me. In the simulation setting, it might just be fine as long as we’re really keeping everything the same. Maybe this is an advantage of the simulation setting, actually. I don’t really know.
Caspar Oesterheld (01:27:05): Sorry, I’m not sure I fully followed that.
Daniel Filan (01:27:08): Okay. I took your proposal to be: the thing you should do is you should figure out if you’re in a strategic setting where the other person is, basically, definitely not going to submit a CooperateBot. I’m imagining myself as the computer program. Maybe this is different to what you were saying. But I was imagining that the program was “check if the other computer program was plausibly strategically designed. Then-
Caspar Oesterheld (01:27:41): Yes.
Daniel Filan (01:27:42): If so then do ϵGroundedFairBot, otherwise do DefectBot. For example, one concern is different people write their programs to do this check in different ways and one of them ends up being wrong. Maybe this is not a huge issue. I don’t know. It feels like it adds complexity in a way that’s a little bit sad.
Caspar Oesterheld (01:28:06): I could imagine that, I guess, for the proof-based ones, the challenge is that they need to be able to prove about each other that they assess the… Whether they’re in a strategic situation, they need to assess this consistently or something like that.
Daniel Filan (01:28:23): Also, the more complicated your program is the harder it is for other people to prove stuff about you. One thing you want to do if you’re a proof-based program, in a world of proof-based programs, is be relatively easy to prove things about. Well, depending on how nice you think the other programs are, I guess.
Caspar Oesterheld (01:28:47): I mean, in practice I think, in the tournament, for various reasons, you should mostly try to exploit these CooperateBots, or these programs that are just written by people who have thought about it for 10 minutes or who just don’t understand the setting or something like that. You wouldn’t expect people to submit this cooperation bot hierarchy thing because there’s just other things to do, right? In some sense, there’s a higher prior on these kinds of programs.
(01:29:25): But you could imagine a version of the tournament setting where you’re told who wrote the opponent program, and then your program distinguishes between someone who has publications on program equilibrium wrote the opponent program, and then you think, okay, well, all kinds of funny stuff might be going on here. I might currently be simulated by something that tries to analyze me in some weird way so I need to think about that. Versus the opponent is written by someone who, I don’t know, I don’t wanna…
Daniel Filan (01:30:06): A naive podcaster.
Caspar Oesterheld (01:30:09): …by someone who just doesn’t know very much about the setting. And then maybe there you think: okay, most prior probability mass is on them just having screwed up somehow and that’s why their program is basically a CooperateBot. Probably in these tournaments I would imagine that, I don’t know, 30% of programs are just something that just fundamentally doesn’t work, it doesn’t do anything useful. It just checks whether the opponent has a particular string in the source code or something like that. And meanwhile very little probability mass [is] on these sophisticated schemes for “check whether the opponent cooperates against CooperateBot in a way that’s useful”.
(01:30:53): So we talked a little bit about to what extent it’s desirable to exploit CooperateBots. There’s then also the question of how exactly to do this. Here’s one more thing on this question of whether you need to know whether the opponent is part of the environment or strategic. You can think about the repeated prisoner’s dilemma, right? I mean, tit-for-tat, everyone agrees it’s a reasonable strategy. And tit-for-tat also cooperates against CooperateBot, right? And I would think there it’s analogous. Tit-for-tat is a reasonable strategy if you think that your opponent is quite strategic. The more you’re skeptical, the more you should… I don’t know, maybe you should just be DefectBot, right? Against your water bottle maybe you can be DefectBot. And then there’s some in-between area where you should do tit-for-tat, but maybe in round 20 you should try defecting to see what’s going on. And then if they defect you can maybe be pretty sure that they’re strategic.
Daniel Filan (01:32:20): It seems to me like the thing you want to do is you want to have randomized defection, then see if the opponent punishes you, and then otherwise do tit-for-tat. But also, be a little bit more forgiving than you otherwise would be in case other people are doing the same strategy.
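The probe-then-forgive idea can be sketched for the iterated prisoner’s dilemma. The probe round, the exploitation rule, and the 10% forgiveness rate are all illustrative assumptions:

```python
import random

def probing_tit_for_tat(my_history, opp_history, probe_round=20):
    """Tit-for-tat with one exploratory defection.  Defect once in
    `probe_round`; if the opponent never defects afterwards (not even to
    punish the probe), conclude they are unconditional and exploit them.
    Otherwise play tit-for-tat, forgiving a defection 10% of the time in
    case the opponent is running the same probing strategy."""
    t = len(my_history)
    if t == probe_round:
        return "D"  # the probe
    if t > probe_round and "D" not in opp_history[probe_round:]:
        return "D"  # probe went unpunished: keep exploiting
    if t == 0 or opp_history[-1] == "C":
        return "C"
    return "C" if random.random() < 0.1 else "D"  # forgiving tit-for-tat

# Demo: against an unconditional cooperator, the probe is never punished,
# so the bot defects from round 20 onward.
me, opp = [], []
for _ in range(30):
    me.append(probing_tit_for_tat(me, opp))
    opp.append("C")
print("".join(me))  # twenty C's followed by ten D's
```

Against a copy of itself, both probes land in the same round, so each bot reads the other’s probe as evidence of strategic play and falls back to tit-for-tat, with the extra forgiveness helping them climb back to mutual cooperation.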
Caspar Oesterheld (01:32:37): One difference between the settings is that you can try out different things more. Which I think also leads nicely to the other point which is: how exactly would you do this exploiting CooperateBots? I do think just a fundamental difficulty in the program equilibrium setting for exploiting CooperateBots is that it’s… Aside from little tricks, it’s difficult to tell whether the opponent is a CooperateBot in the relevant sense. Intuitively, what you want to know is: if I defected against my opponent, would they still cooperate? And if that’s the case, you would want to defect. But this is some weird counterfactual where you have all of these usual problems of conditioning on something that might be false and so you might get all kinds of weird complications.
(01:33:43): So, I think in comparison to the tit-for-tat case where… I mean, it’s not clear what exactly you would do, but maybe in some sense, against the given opponent, you can try out sometimes defecting, sometimes cooperating and seeing what happens. There’s less of that in the program game case because your one program, there’s some action that you play and maybe you can think if I played this other action… But it’s a weird… You run into these typical logical obstacles.
Daniel Filan (01:34:26): Although it feels like it might not be so bad. So, imagine I have this thing where I’m saying, “Okay, suppose I defected. Would you cooperate against a version of me that defected? If so, then I’m going to defect”. And in that case, it seems like my defection is going to show up in the cases in which you would cooperate and therefore, that counterfactual is not going to be logically impossible, right?
Caspar Oesterheld (01:34:57): Yeah, that’s a good point. So, I guess a very natural extension of (let’s say) these proof-based bots is: okay, what if you first try to prove, “if I defect, the opponent will cooperate”? This will defect against CooperateBots, which is good. The question is whether this will still… What does this do against itself? This will still cooperate against itself, right?
Daniel Filan (01:35:30): Yeah. Because if I’m asking, “will you cooperate if I defect?” The answer is no, if I’m playing myself, because I always have to do the same thing as myself because I’m me.
Caspar Oesterheld (01:35:40): Yeah, maybe this just works.
Daniel Filan (01:35:42): I bet there must be some paper that’s checked this.
Caspar Oesterheld (01:35:49): Yeah, I’m now also trying to remember. Because one of these proof-based papers, they do consider this PrudentBot, which does something much more hacky: it tries to prove (and there’s some logic details here)—it tries to prove that… (Okay, there’s one issue with the program that you just described that I just remembered, but let’s go to PrudentBot first). So, PrudentBot just checks whether you would cooperate against DefectBot. And then, if you cooperate against DefectBot, I can defect against you.
(01:36:39): I don’t know. To me, this is a little bit… It’s natural to assume that if the opponent cooperates against DefectBot, they’re just non-strategic. They haven’t figured out what’s going on and you can defect against them. But in some sense, this is quite different from this “does my defection make the opponent defect?” or something like that.
Daniel Filan (01:37:03): Yeah, it’s both the wrong counterfactual and it’s a little bit less strategic, right?
Caspar Oesterheld (01:37:09): Yes. The things that I’m aware of that people have talked about are more like this, where they check these relatively basic conditions. You can view them as checking for specific kinds of CooperateBots. I guess another thing you can do is for the ϵGroundedFairBots, just add in the beginning a condition [that] if the opponent is just a CooperateBot, or if the opponent never looks at the opponent’s source code at all, then you can defect against them. You can add these sorts of things. And I think from the perspective of winning a tournament, you should think a lot about a lot of these sorts of conditions and try to exploit them to defect against as many of these players as possible. But it’s not really satisfying. It feels like a trick or some hacky thing, whereas the thing you proposed seems more principled.
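A simulation-flavored stand-in for this kind of check (the published PrudentBot is proof-based, working in PA and PA+1; this sketch just runs the opponent once against DefectBot, and all names and the epsilon value are illustrative):

```python
import random

EPSILON = 0.2  # halting probability for the fallback, chosen for illustration

def defect_bot(_opponent):
    return "D"

def cooperate_bot(_opponent):
    return "C"

def prudentish_bot(opponent):
    """If the opponent cooperates even against DefectBot, treat it as
    non-strategic and exploit it.  Otherwise fall back to
    ϵGroundedFairBot-style behavior: blindly cooperate with probability
    epsilon, else simulate the opponent and copy its action."""
    if opponent(defect_bot) == "C":
        return "D"  # opponent behaves like a CooperateBot: exploit
    if random.random() < EPSILON:
        return "C"  # grounding case
    return opponent(prudentish_bot)
```

One wrinkle this sketch makes vivid: an ϵGroundedFairBot-style opponent blindly cooperates with probability ε even against DefectBot, so a single simulated check like this misfires against it a fraction ε of the time, which is part of why these conditions feel hacky rather than principled.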
(01:38:09): Okay. Now, on this thing, I could imagine one issue is that: when this program faces itself, it first needs to prove… So, one problem is always that sometimes, to analyze opponent programs, you need to prove that some provability condition doesn’t trigger. And the problem is that just from the fact that you think this condition is false, you can’t infer that it’s not provable because of incompleteness. So, I could imagine that I can’t prove that your program doesn’t just falsely prove that it can safely defect against me because you might think, well… When I prove things, I don’t know whether Peano arithmetic or whatever proof system we use is consistent.
(01:39:27): And so there’s always a possibility that every provability condition triggers, which means that I don’t know whether your first condition triggers. Actually for this PrudentBot, this also arises. If I am this PrudentBot, as part of my analysis of your program, I try to prove that you would defect or cooperate or whatever. I try to prove something about what you would do against DefectBot. And for that, if (let’s say) you’re just some more basic Löbian FairBot-type structure, then in my analysis of your program, I need to conclude that your clause “if I cooperate, the opponent cooperates” or your clause “if I can prove that the opponent cooperates”… I need to conclude that this won’t trigger. To prove that you don’t cooperate against DefectBot, I need to conclude that you won’t falsely prove that DefectBot will cooperate against you.
(01:40:48): And this, I can’t prove in Peano arithmetic or in the same proof system that you use. So, what they actually do for the PrudentBot is that I need to consider… They call it PA+1. I don’t know how widely this is used. I need to consider Peano arithmetic or whatever proof system they use, plus the assumption that that proof system is consistent, which gives rise to a new proof system which can then prove that your “if” condition is not going to trigger. So, this is some general obstacle.
Daniel Filan (01:41:28): Right. And we’ve got to coordinate on what proof systems we use then, because if I accidentally use a too-strong proof system, then you have difficulty proving things about me. And I guess also, this thing about, “well, if I defected, would you still cooperate with me?” It feels a little bit hard to… In the proof-based setting, I can say, “if my program or your program outputted defect, would your program or my program output cooperate?” I could just do that conditional or whatever.
(01:42:04): If I want to do this in a simulation-based setting—which I think there are reasons to want to do. Sometimes, you just can’t prove things about other people and you have to just simulate them. And it’s nice because it’s moving a bit beyond strict computer programs. It’s also nice because maybe it’s hard to prove things about neural networks, which was one of the motivations—but I don’t even know what the condition is supposed to be in that setting. Maybe if we’re stochastic programs, I could say: maybe I could do a conditional on “this stochastic program outputs defect”. But it’s not even clear that that’s the right thing because you’re looking at my program, you’re not looking at the output of my program.
Caspar Oesterheld (01:42:52): Yeah. Though you can have programs that do things like “if the opponent cooperates with probability at least such and such…” I think one can make those kinds of things well-defined at least.
Daniel Filan (01:43:05): Yeah. But somehow, what I want to say is “if you cooperate with high probability against a version of me that defects…”, you know what I mean? Either you’re simulating just a different program or you’re simulating me and I don’t know how to specify you’re simulating a version of me that defects. You know what I mean?
Caspar Oesterheld (01:43:28): Yeah. I agree that that’s-
Daniel Filan (01:43:32): In some special cases, maybe I could run you and if I know what location in memory you’re storing the output of me, I can intervene on that location of memory, but (a) this is very hacky and (b) I’m not convinced that this is even the right way to do it.
Caspar Oesterheld (01:43:46): Yeah, I guess there are various settings where you constrain the way that programs access each other that would allow more of these counterfactuals. For example, you could consider pure simulation games where you don’t get access to the other player’s source code, but you can run the other player’s source code. And I guess in those cases, some of these counterfactuals become a bit more straightforwardly well-defined, that you can just… What if I just replace every instance of your calls to me with some action? I mean, there are some papers that consider this more pure simulation-based setting as well, but obviously that would not allow for proof-based stuff and things like that.
Making better simulation-based bots

Daniel Filan (01:44:43): So, I think at this point, I want to tee up your next paper. So, in particular in this paper, there are two types of strategies that you can’t translate into the program equilibrium setting. So, I think we already discussed win-stay lose-switch, where I have to look at what you did in the last round, and I also have to look at what I did in the last round. There’s also this strategy in the iterated prisoner’s dilemma called grim trigger, where if you’ve ever defected in the past, then I’ll start defecting against you. And if you’ve always cooperated, then I’ll cooperate. And neither of these can you have in your ϵGroundedFairBots. Why is that?
Caspar Oesterheld (01:45:24): Yeah. Basically, the main constraint on these ϵGroundedFairBots or πBots or whatever is that they just can’t run that many simulations. You can run one simulation with high probability or something like that. Maybe with low probability, you can start two simulations or something like that. But the problem is, as soon as you simulate both the opponent and yourself, or multiple things, each with high probability, you run into these infinite loop issues again that this epsilon condition avoids. Another case is if you have more than two players, things become weird. Let’s say you have three players. Intuitively, you would want to simulate both opponents, and then, if they both cooperate, you cooperate. If one of them defects, then maybe you want to just play the special punishment action against them depending on what the game is. But you can’t simulate both opponents. Because if every time you’re called, [you] start two new simulations, or even two minus epsilon or something like that in expectation, you get this tree of simulations that just expands: occasionally some simulation path dies off, but it multiplies faster than simulations halt.
Daniel Filan (01:46:55): Right. Yeah. Basically, when you grow, you’re doubling, but you only cut off a factor of epsilon, and epsilon is smaller than a half. And therefore, you grow more than you shrink and it’s really bad. And if epsilon is greater than a half, then you’re not really simulating much, are you?
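The threshold Daniel states can be checked directly: each non-halting call spawns two children, so the expected number of simulations at depth d is (2(1−ε))^d, and the expected total diverges exactly when ε < 1/2 (a small numerical sketch):

```python
def expected_tree_size(epsilon, max_depth=200):
    """Expected total number of simulations when each call halts with
    probability epsilon and otherwise spawns two child simulations:
    the sum over depths d of (2 * (1 - epsilon)) ** d, truncated."""
    branch = 2 * (1 - epsilon)
    return sum(branch ** d for d in range(max_depth))

print(expected_tree_size(0.1))  # branch factor 1.8: partial sums explode
print(expected_tree_size(0.6))  # branch factor 0.8: converges to 1 / 0.2 = 5
```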
Caspar Oesterheld (01:47:11): Yeah.
Daniel Filan (01:47:12): So, how do we fix it?
Caspar Oesterheld (01:47:13): Okay, so we have this newer paper, where I’m fortunate to be the second author, and the first author’s Emery Cooper, and then Vince Conitzer, my PhD advisor, is also on the paper. And so, this fixes exactly these issues. And I think it’s a clever, interesting idea. So, to explain this idea, we need to imagine that the way that programs randomize works a particular way. The architecture of the programming language has to be a particular way to explain this. If you have a normal programming language, you call random.random() or some such function and you get a random number out of it.
(01:48:10): But another way to model randomization is that you imagine that at the beginning of time or when your program is first called, it gets as input an infinite string of random variables that are rolled out once in the beginning, and then, you have this long string of… It could be (for example) bits, and all you’re going to do is use the bits from this input. And so, in some sense, this is a way of modeling randomization with a deterministic program. In some sense, randomization is like running a deterministic program on an input that is random. As part of your input, you get this random string. And so, specifically, let’s imagine that you get this random string as input, but each entry is just a random number between zero and one.
(01:49:06): The way that these infinite simulation issues are fixed is that when I run, for example, my two opponents and myself, I pass them all the same random input string and that way, I coordinate how they halt or at what point they halt. Very specifically, here’s how it works. So, let’s maybe first consider a version where the issue is just that you have multiple opponents, but you’re still doing something like ϵGroundedFairBot where you’re happy to look just at the last round. Or maybe win-stay lose-[switch], where you maybe also look at your own previous action.
(01:49:59): So, what you do is you look at your random input string, and if the first number is below epsilon, then you just immediately halt as usual by just outputting something. And otherwise, you remove the first thing from this infinite random input string. And then, you call all of these simulations. You simulate both opponents. Let’s say you have two opponents and yourself, just with the first entry in that list removed. And now, okay, how does this help? Well, I mean the opponents might do the same, right? Let’s say they also all check the first thing, check whether it’s smaller than epsilon, and then remove the first and call recursively.
(01:50:55): Well, the trick is that by all of them having the same input string, they all halt at the same point. All your simulations are going to halt once they reach the specific item in this input string—the first item in this input string that is smaller than epsilon. And so, that allows for simulating multiple opponents. You can simulate yourself of course, and you can also simulate multiple past time steps by, instead of passing them just the input string with the first thing removed, you can also check what did they do, in some intuitive sense, ‘two time steps ago’ by removing the first two random variables from the input string and passing that into them. So, this is the basic scheme for making sure that these simulations all halt despite having a bunch of them.
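A sketch of the shared-tape construction for three players (the game, the epsilon value, and the short hand-picked tape are illustrative; a real implementation would draw the tape from shared randomness and treat it as unbounded):

```python
EPSILON = 0.3  # halting threshold, chosen for illustration

def fair_bot(tape, opp1, opp2):
    """Three-player ϵGroundedFairBot on a shared random tape of numbers
    in [0, 1).  Halt with the blind action if the head of the tape is
    below epsilon; otherwise pop the head and simulate both opponents on
    the same tail, cooperating iff both simulated opponents cooperate.
    Because every copy consumes the same tape, every branch of the
    simulation tree halts at the index of the first entry below epsilon,
    so the tree is finite."""
    if tape[0] < EPSILON:
        return "C"  # grounding case: blind cooperation
    tail = tape[1:]  # everyone pops the same entry
    a1 = opp1(tail, fair_bot, opp2)
    a2 = opp2(tail, fair_bot, opp1)
    return "C" if a1 == "C" and a2 == "C" else "D"

# A hand-picked tape: the first entry below epsilon is at index 3, so
# every simulation halts after three pops and self-play cooperates.
tape = [0.9, 0.8, 0.7, 0.1]
print(fair_bot(tape, fair_bot, fair_bot))  # prints C
```

Passing the tail `tape[1:]` into a simulation corresponds to asking what the opponent did “one time step ago”; passing `tape[2:]` would ask about two steps ago, as described above.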
Daniel Filan (01:52:04): My understanding is that you have two constructions in particular. There’s this correlated one and this uncorrelated one. Can you give us a feel for what the difference is between those?
Caspar Oesterheld (01:52:15): Yeah. So, there are differences in the setting. So, the correlated one is one where you get a correlated, or you get a shared random input sequence. So you could imagine that there’s some central party that generates some sequence of random numbers and it just gives the sequence of random numbers to all the players. So, they have this same random sequence—and then, maybe additionally, they have a private one as well—but they have this shared random sequence. And then, in this shared setting, basically all the results are much nicer. Basically, we get nice results in the shared randomness setting, and mostly more complicated, weird results—or in some cases, we also just can’t characterize what’s going on—in the non-correlated case.
(01:53:16): But in the correlated case, we specifically propose to use the correlated randomness to do these recursive calls. So, when I call my three opponents or two opponents and myself on the last round, I take the shared sequence of random numbers. I remove the first and call the opponents with that, with the remaining one rather than using the private one. And then, in the case where there is no shared randomness, we just use the private randomness instead. So, in some sense, it’s almost the same program. I mean, there’s some subtleties, but in some sense it’s the same program. And the main difference is that, well, you feed them this randomness that’s-
Daniel Filan (01:54:12): You’re giving the other person your private randomness, right?
Caspar Oesterheld (01:54:14): Yeah. I’m giving… yeah, I don’t have access to their randomness. I have to give them my randomness, which also, maybe it’s not that hard to see that you get somewhat chaotic outputs. In some sense, my prediction of what the opponent will do is quite different from what they’re actually going to do because they might have very different input.
Daniel Filan (01:54:44): Right. In some ways it’s an interesting… It’s maybe more realistic that I get to sample from the distribution of what you do, but I don’t get to know exactly what you will actually do. Actually, maybe this is just me restating that I believe in private randomness more than I believe in public randomness.
(01:55:03): So, okay, here’s a thing that I believe about this scheme that strikes me as kind of sad. It seems like, basically, you’re going to use this scheme to come up with things like these ϵGroundedFairBots and they’re going to cooperate with each other. But reading the paper, it seemed like what kind of had to happen is that all the agents involved had to use the same sort of time step scheme, at least in the construction. It’s like, “Oh, yeah, everyone has this shared sequence of public randomness, so they’re both waiting until the random number is less than epsilon and at that point they terminate”.
(01:55:56): So, I guess I’m seeing this as: okay, in the real world we do have public sources of randomness, but there are a lot of them. It’s not obvious which ones they use. It’s not obvious how to turn them into “is it less than epsilon?” or… So, it seems really sad if the good properties of this have to come from coordination on the scheme of “we’re going to do the time steps and we’re going to do it like this”. But I guess I’m not sure. How much coordination is really required for this to work out well?
Caspar Oesterheld (01:56:30): Yeah, that is a good question. Yeah, I do think that this is a price that one pays relative to the original ϵGroundedπBots, which obviously don’t have these issues. I think it’s a little bit complicated how robust this is exactly. So, the results that we have… We have this folk theorem about what equilibria can be achieved in the shared randomness case by these kinds of programs. And it’s the same as for repeated games, and also the same as for these syntactic comparison-based ones: everything that’s better for everyone than their minimax payoff, the payoff that they would get if everyone else punished them. And I guess the fact that it’s an equilibrium obviously means that it’s robust to all kinds of deviations, but getting the equilibrium payoff requires coordination on these random things.
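The minimax payoff mentioned here can be computed directly for a small game, say a prisoner’s dilemma. The payoff numbers below are illustrative, not taken from the paper:

```python
# Row player's payoffs in a prisoner's dilemma: (mine, theirs) -> payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}

def minimax_payoff(payoff, actions=("C", "D")):
    """The best payoff I can guarantee when the opponent punishes me:
    the opponent picks the action that minimizes my best response."""
    return min(
        max(payoff[(mine, theirs)] for mine in actions)
        for theirs in actions
    )
```

Here the minimax payoff is 1 (mutual defection), so per the folk theorem, any payoff profile above (1, 1), such as mutual cooperation at (3, 3), can be sustained in equilibrium.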
(01:57:43): Also, another thing is that—maybe this has already been implicit or explicit in the language I’ve used—with these time steps, there’s a close relation between this and repeated games. It’s really just full repeated game strategies. And this whole relation to repeated games hinges on everyone using basically exactly the same time step scheme. Basically, if everyone uses the same epsilon and checks whether the same source of randomness is below this epsilon, then in some sense, everyone is exactly playing a repeated game with a probability of epsilon of terminating at each point. And there’s a very nice correspondence. So, some of the results really do fully hinge on exact coordination on all of these things. But also, there’s some robustness still.
(01:58:42): So, for example, the programs still halt if someone chooses a slightly different epsilon. If someone chooses a different epsilon, the relationship to repeated games sort of goes away. It’s hard to think of a version of a repeated game where everyone has their own separate cutoff probability. I don’t know. Maybe one can somehow make sense of this, but it does become different from that. But let’s say I choose an epsilon that’s slightly lower. Well, we’re still going to halt at the point where we find an entry in this random sequence that’s below everyone’s epsilon. So, when people choose slightly different epsilons, it becomes harder for us to say what’s going on; we can’t view it as a repeated game anymore, but it still works. It’s not like everything immediately breaks in terms of everything not halting or something like that.
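The robustness point about differing epsilons can be checked with a tiny helper (illustrative, not from the paper): each player grounds out at the first entry below their own epsilon, so every simulation has bottomed out by the first entry below the smallest epsilon in play.

```python
def halting_index(rand_seq, epsilon):
    """Index of the first entry below this player's epsilon, i.e.
    the point where a bot with this cutoff grounds out."""
    for i, x in enumerate(rand_seq):
        if x < epsilon:
            return i
    raise ValueError("sequence exhausted without a sub-epsilon entry")
```

For example, on the sequence `[0.5, 0.2, 0.05]`, a player with epsilon 0.25 halts at index 1 and a player with epsilon 0.1 halts at index 2; by the first entry below both epsilons, everything has halted, even though the clean repeated-game correspondence is lost.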
Daniel Filan (01:59:54): Yeah. Or even if I’m using one public random sequence and you’re using another, even if it’s uncorrelated, it seems like as long as I eventually halt and you eventually halt, it’s not going to be too bad.
Caspar Oesterheld (02:00:06): In particular, we’re going to halt at the point where both of our sequences have the halting signal, right?
[Note from Caspar: At least given my current interpretation of what you say here, my answer is wrong. What actually happens is that we’re just back in the uncorrelated case. Basically my simulations will be a simulated repeated game in which everything is correlated because I feed you my random sequence and your simulations will be a repeated game where everything is correlated. Halting works the same as usual. But of course what we end up actually playing will be uncorrelated. We discuss something like this later in the episode.]
Daniel Filan (02:00:14): Yeah. I guess, it depends a little bit on what our policies are, but it seems like as long as I’m not super specific about what exact sequence of cooperates and defects I’m sensitive to, maybe it’ll just be fine even if we’re not super tightly coordinated.
Caspar Oesterheld (02:00:41): Yeah, I guess here again, [to try] to import our intuitions from repeated games, that I guess there’s a game theoretic literature about, and that we maybe also have experience [of] from daily life: in practice, if you play a repeated game, you’re not going to play an equilibrium, you’re going to play something where you do something that’s trying to go for some compromise. Maybe the other player goes for some other compromise, and then, you try to punish them a little bit or something like that. And I would imagine that there’s a lot of this going on in this setting as well.
Characterizing simulation-based program equilibria

Daniel Filan (02:01:22): Yeah, yeah. Okay. I think I may be a little bit less concerned about the degree of coordination required. So, there are two other things about this paper that seem pretty interesting. So, the first is just what the limitations on the equilibria you can reach are. And my understanding is that you can characterize them decently in the correlated case, but it’s pretty hard to characterize them in the uncorrelated case or-
Caspar Oesterheld (02:01:53): Yeah.
Daniel Filan (02:01:54): Can you explain to me and my listeners just what’s going on here?
Caspar Oesterheld (02:01:58): Yeah, so in the correlated case, it really is quite simple. As always, there are some subtleties. You need to specify, for example, what exactly are you going to do if you simulate some other player and they use their private signal of randomness, which they’re not supposed to do in some sense. Well, you need to somehow punish them and the people above you need to figure out that this is what’s going on. So, there’s some of these sorts of subtleties. But I think basically, there is just a very close relationship between these programs and the repeated game case. So, it is just basically like playing the repeated case and even deviation strategies, you can view as playing the repeated game by saying: well, if they get this random string as inputs that has 10 variables left until they get to the below epsilon case, then you can view this as them playing a particular strategy at time step 10.
Daniel Filan (02:03:03): Hang on. What do they do if they access randomness? So, my recollection, which might be wrong, was that you punish people for accessing other people’s private randomness, but I thought they could still access their private randomness.
Caspar Oesterheld (02:03:18): I think you do have to punish people for using their private randomness. And then, the complicated thing is that I might simulate you and you might simulate a third party and the third party uses their private randomness and now you, as a result, punish them. And then, I now need to figure out that you are just punishing them because they used their private randomness.
Daniel Filan (02:03:46): And you’re now punishing me.
Caspar Oesterheld (02:03:47): I don’t know.
Daniel Filan (02:03:50): That condition seems hard to coordinate on, right? Because naively, you might’ve [thought], well, it’s my private randomness. It’s my choice.
Caspar Oesterheld (02:03:56): Oh, the condition to punish private randomness?
Daniel Filan (02:04:00): Yeah.
Caspar Oesterheld (02:04:00): Yeah. I think this is a reasonable point. Maybe one should think about ways to make this more robust to this. I guess one has to think about what exactly the game is, and how much harm the private randomness can do. In some cases, it doesn’t really help you to do your own private randomness, and then maybe I don’t need to punish you for it.
(02:04:24): But if there are 20 resources and you can steal them, and you’re going to randomize which one you steal from, and the only way for us to defend against this is by catching you at the specific resource or something like that, then maybe we do just need to think: okay, as soon as there’s some randomness going on, it’s a little bit fishy.
(02:04:48): But yeah, you could imagine games where you want to allow some people to randomize privately or use their private randomness for, I don’t know, choosing their password. Maybe this is sort of a fun example. At time step 3, you need to choose a password. And in principle, the way our scheme would address this is that we all get to see your password, or in some sense we get to predict how you use your password. I mean it’s also still important to keep in mind that these past timesteps are things that don’t actually happen, so we predict what you would’ve chosen at timestep 3 if timestep 3 was the real timestep. But nonetheless, you might think, okay, if you have to choose your password with the public randomness, then we all know your password and doesn’t this mean that we all would want to log into your computer and steal your stuff? And the way the scheme would address this, I guess, is just that, well, someone could do that but they would then be punished for this.
Daniel Filan (02:05:59): Or maybe they do do it and it’s just like, “Well, that’s the equilibrium we picked. Sorry”.
Caspar Oesterheld (02:06:04): Right, right. It could also be part of the equilibrium. Yeah, that’s also true.
Daniel Filan (02:06:11): So in the correlated case, it’s basically: you have a folk theorem, and there’s something about things that you can punish people for deviating from. That’s basically the equilibria you can reach, roughly. And then I got to the bit of your paper that is about the equilibria you could reach in the uncorrelated game.
(02:06:39): And I am going to be honest… So earlier we had a recording where we were going to talk about these papers, but actually I got really bad sleep the night before I was supposed to read the papers, and so I didn’t really understand this “Characterising Simulation-based Program Equilibria” paper. It was beyond me. And this time, I had a good night’s sleep, I was rested, I was prepared, and I read this paper, and then once I got to the limitations on the equilibria of the uncorrelated one, that’s where I gave up. The theorems did not make… I understood each of the symbols but I didn’t get what was going on.
(02:07:19): Is there a brief summary of what’s going on or is it just like, well we had to do some math and that turns out to be the condition that you end up needing?
Caspar Oesterheld (02:07:26): At least for the purpose of a very audio-focused format, I think probably one can’t go that much into the details of this. I think I want to explain a little bit why one doesn’t get a folk theorem in the uncorrelated case. I think there are some relatively intuitively accessible reasons for that.
(02:07:49): Okay, let’s start there. So the problem in the uncorrelated case is basically that: let’s take a three-player case. We are two players and there’s a third player, Alice. We want to implement some equilibrium and now there’s a question, can Alice profitably deviate from this equilibrium? And now the issue is Alice can use her private randomization in some ways. So the problem is basically that us catching her deviation is uncorrelated with her actually deviating in the actual situation. And additionally, whether I detect her deviating is uncorrelated with you detecting her deviating.
(02:08:58): And this all makes punishing, especially punishing low probability deviations very difficult. So for example, if Alice, with some small probability that she determines with her private randomness, she defects in some way, then in the real world, for her actual action that will determine her utility, there’s this small probability that she’ll defect. And then there’s some probability that our simulations of her—which we’re running a bunch of—there’s some probability that we’ll detect these. But because when I simulate Alice, I simulate her with a completely different random string than the string that Alice has in the real world, in some sense, I can’t really tell whether she’s actually going to deviate. And then also, you are going to simulate Alice also with your private randomness, which means that whether in your simulation Alice defects is also uncorrelated with whether she defects in my simulation.
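The detection problem described here can be illustrated with a quick Monte Carlo estimate (illustrative code, not from the paper): if Alice deviates with small probability q, two players using shared randomness both catch the deviation with probability roughly q, while with private randomness the two detection events are independent, so both catch it only with probability roughly q².

```python
import random

def joint_detection_rates(q, trials=200_000, seed=0):
    """Estimate how often BOTH players detect a probability-q deviation."""
    rng = random.Random(seed)
    both_correlated = both_uncorrelated = 0
    for _ in range(trials):
        # Correlated: both players simulate Alice with the same random
        # number, so either both see the deviation or neither does.
        both_correlated += rng.random() < q
        # Uncorrelated: each player simulates Alice with their own
        # private randomness, so the detection events are independent.
        a = rng.random() < q
        b = rng.random() < q
        both_uncorrelated += a and b
    return both_correlated / trials, both_uncorrelated / trials
```

With q = 0.05 the correlated rate comes out near 0.05 and the uncorrelated rate near 0.0025, which is why low-probability deviations are so hard to punish jointly without shared randomness.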
Daniel Filan (02:10:07): Wait, first of all, I thought that even in the correlated case, whether she defects in simulation is different from whether she deviates in reality because we get rid of the first few random numbers and then run on the rest, right?
Caspar Oesterheld (02:10:24): Yeah, that is true.
Daniel Filan (02:10:28): The thing where we disagree, that seems important and different.
Caspar Oesterheld (02:10:33): So maybe that’s… I’m maybe also not sure about the other one now, but I think the other one is more straightforward. It might be that to punish her deviating, we both need to do a particular thing and we just can’t… It’s a little bit complicated because you might think, well, we can simulate Alice for a lot of timesteps. So you might think that even if she defects with low probability, we are simulating her a bunch in some way.
(02:11:12): So there are some complications here. She needs to deviate in some relatively clever way to make sure that we can’t detect this with high probability. It is all a little bit complicated, but I think we can’t correlate our punishment, or we can’t even correlate whether we punish. And so if the only way to get her to not defect is for both of us to do a particular action at the same time, that’s sort of difficult to get around.
Daniel Filan (02:11:49): Okay. All right, here’s a story I’m coming up with, based on some mishmash of what you were just saying and what I remember from the paper. We’re in a three-player game, therefore punishing actions… Firstly, they might require a joint action by us two, and therefore, that’s one reason we need to be correlated on what Alice actually did, at least in simulation.
(02:12:12): Another issue is: suppose I do something that’s not in the good equilibrium and you see me doing that, you need to know whether I did that because I’m punishing Alice or whether I was the first person to defect. And if I’m the first person to defect, then you should try punishing me. But if I’m just punishing Alice, then you shouldn’t punish me.
(02:12:34): And so if we in our heads see different versions of Alice, if you see me punishing, if you see me going away from the equilibrium, you need to know whether that’s because in my head I saw Alice defecting or if it’s because in my head I thought I want to defect because I’m evil or whatever. I don’t know if that’s right.
Caspar Oesterheld (02:12:58): Yeah. I’m not sure whether that is an issue because when I see you defecting, it is because I simulate you with my randomness as input. And then you see, with my randomness as input, Alice defecting one level down, which means that I… Remember that I’m simulating all of these past timesteps as determined by my randomness. So I think I can see whether the reason you defect in my simulation is that you saw Alice defect.
Daniel Filan (02:13:40): Wait, if we’re using the same randomness, then why isn’t it the case that we both see Alice defect at the same time with our same randomness?
Caspar Oesterheld (02:13:47): So I mean this is all my simulation of you rather than the real you.
Daniel Filan (02:13:55): So the real we’s might not coordinate on punishment?
Caspar Oesterheld (02:13:59): Yeah. I mean this is another thing that’s like: even with very basic ϵGroundedπBots, you can kind of imagine: in their head, they’re playing this tit-for-tat where it’s going back and forth. And one person does this based on their randomness and then the other person sees this and then responds in some particular way.
(02:14:19): But if you don’t have shared randomness, all of this, this is all complete fiction. You haven’t actually coordinated with the other player and gone back and forth. So it might be that I run this simulation where you detected Alice’s defecting and then I also defect on Alice, and then we are happily defecting on Alice. And in the simulation we’re thinking “we’re doing so well, we’re getting this Alice to regret what she does” and so on. But the problem is that you run a completely different simulation.
(02:14:52): So in your simulation of what Alice and I do, you might see everyone cooperating and everyone thinks, “oh, everything’s great, we’re all cooperating with each other”. And then we’ve done the simulation and now we are playing the actual game, and I defect thinking, “oh yeah, we are on the same team against Alice”. And then you think, “oh nice, we’re all cooperating” and you cooperate. And then we’re landing in this completely weird outcome that doesn’t really happen in the simulation, sort of unrelated to what happens in this…
Daniel Filan (02:15:23): Right. So Alice basically says, “Hey, I can get away with doing nasty stuff because they won’t both be able to tell that I’m doing the nasty stuff and therefore I won’t properly be punished in the real world”. And so these gnarly theorems: should I basically read them as: the preconditions are there’s some math thing and the math thing basically determines that this kind of thing can’t happen and those are the equilibria you can reach. Is that it?
Caspar Oesterheld (02:15:50): Yeah. So I think one thing that drives a lot of these characterizations is: Alice can defect with low probability. I think usually that’s the more problematic case, is that she defects in a particular kind of clever way with low probability, which means that we are very unlikely to both detect it at once. I think that is driving these results a lot.
(02:16:23): But to some extent… You said this earlier, there’s some math going on. I think to some extent that’s true. So I think one thing that I liked about these results, despite… I mean of course one always prefers results that are very clean and simple, like the folk theorem where you just have this very simple condition for what things are equilibria. And our characterizations are mostly these kind of complicated formulas.
(02:16:51): I think one thing I like is that for some of these characterizations, one can still hold onto this interpretation of there being timesteps and you simulate what people do at previous timesteps and things like that. Which, it’s sort of very intuitive that this works for the case where everyone plays nicely with each other and everything is correlated, and in some sense, we’re playing this mental repeated game where we all use the same randomness and so we are all playing the same repeated game, and really the thing that is sampled is “which round is the real round?” It’s clear that the timestep story works. And it’s nice there that there are some results where you can still use this timestep picture. So that’s one nice thing about the results. But yeah, it is unfortunately much more complicated.
Daniel Filan (02:17:49): Fair enough. So another part of the paper that is kind of cool and that you foregrounded earlier is it has this definition of simulationist programs. And so earlier, you mentioned there was a definition of fair programs or something: maybe you are referring to this definition.
Caspar Oesterheld (02:18:11): Yeah. In some sense, the paper has three parts: the one with the correlated case, with these generalized ϵGroundedπBots that pass on the shared random sequence. And then the uncorrelated case with the ϵGroundedFairBots. And then we also have a section that analyzes more general simulationist programs, which are programs that just… Intuitively all they do is run the opponent with themselves and the other players as input. And that has this definition. And then for those we have a characterization as well.
(02:18:55): For example, one result that we also show is that in general, general simulationist programs are more powerful at achieving equilibria in the uncorrelated case than the ϵGroundedπBots. I’m not quite sure how much to go into detail there, but one intuition that you can have is: in the ϵGroundedπBots, to some extent everyone has to do the same thing. Whereas you could have settings where only I need to do simulations and then if only I simulate your program, I can run 10,000 simulations or something like that.
(02:19:35): And this is something that obviously the ϵGroundedπBots can’t do. You can’t just independently sample a thousand responses from the other player. And we do have this definition of simulationist programs. I’m not sure I remember the details off the top of my head.
Daniel Filan (02:19:56): I think it’s some recursive thing of: a simulationist program is… it calls its opponent on a simulationist program, which maybe includes itself and maybe… I forgot whether it explicitly has ϵGroundedπBots as a base case or something. Maybe simulating nobody is the base case, or just ignoring the other person’s input.
Caspar Oesterheld (02:20:20): Yeah. That’s also coming back to me. I think it’s something like that. So the tricky part is that you might think that a simulationist program is just one that calls the other program with some other program as input. But then if you don’t constrain the programs that you give the other player as input, you can sort of smuggle this non-behaviorism back in by having “what does my opponent do against these syntactic comparison bots?” or something like that.
Daniel Filan (02:21:01): There’s a good appendix. It’s like “for why we do it this way, see this appendix”. And then you read the appendix and it’s like, “oh that’s pretty comprehensible”. It’s not one of these cases where the appendix is all the horrible…
Caspar Oesterheld (02:21:11): Yeah, glad to hear that you liked the appendix. Some of the appendix is also just very technical, like working out the details of characterization.
Daniel Filan (02:21:20): Oh yeah, I skipped those appendices. But there are some good appendices in this one.
Caspar Oesterheld (02:21:24): Nice.
Follow-up work

Daniel Filan (02:21:24): All right, the next thing I want to ask is: what’s next in program equilibrium? What else do we need to know? What should enterprising listeners try and work on? Is there any work that’s… So earlier, I asked you about what was the state of the art before you published “Robust Program Equilibrium”. Is there any work coming out at around the same time which is also worth talking about and knowing a bit about the results of?
Caspar Oesterheld (02:21:57): I think, yeah, there are a bunch of different directions. So I do think that we still leave open various technical questions and there are also some kind of technical questions that are still open for these Löbian programs that it would be natural to answer.
(02:22:16): So one thing, for example, is that I would imagine that… Maybe sticking closely to our paper first, there are some very concrete open questions even listed in the paper. I’m not entirely sure, but I think in the two-player simulationist program case, it’s not clear whether, for example, all Pareto-optimal, better-than-minimax utility profiles can be achieved in simulationist program equilibria. So maybe this is not quite the right question, but you can check the paper. We have some characterizations for these uncorrelated cases. But I think for the general simulationist case, we don’t have a full characterization. So if you want to go further down the path of this paper, there are a bunch of directions there that still have somewhat small holes to fill in.
(02:23:39): Then another very natural thing is that: I think for the Löbian bots, there isn’t a result showing that you can get the full folk theorem if you have access to shared randomness, which I am pretty sure is the case. I think probably with some mixing of this epsilon-grounded stuff and the Löbian proof-based stuff, I would imagine you can get basically a full folk theorem, but there’s no paper proving that. Maybe one day, I’ll do this myself. But I think that’s another very natural question to ask.
(02:24:19): So in my mind, going a bit further outside of what we’ve discussed so far, in practice, I would imagine that usually one doesn’t see the opponent’s full source code. And maybe it’s also even undesirable to see the source code for various reasons. You don’t want to release all your secrets. Maybe also… I mean, we talked about these folk theorems where everything that is better than this punishment outcome can be achieved. And I think game theorists often view this as sort of a positive result, whereas I have very mixed feelings about this because it’s kind of like, well, anything can happen, and in particular a lot of really bad outcomes can happen. Outcomes that are better than the best thing that I can achieve if everyone just punishes me maximally… Well, it’s not very good. There are lots of very bad things that people can do to me, so there are lots of equilibria where I get very low utility.
Daniel Filan (02:25:40): And in particular, if there are tons of these equilibria… the more equilibria there are, the less chance there is that we coordinate on one. Right?
Caspar Oesterheld (02:25:49): Yeah. I guess maybe one positive thing is that… In the correlated case, you have this convex space of equilibria. So at least it’s like, well, you need to find yourself in this convex space rather than finding yourself between six discrete points. And so maybe that makes things easier.
(02:26:08): But yeah, I think basically I agree with this. I think on our last episode—this is my second appearance on AXRP, right? On the first episode on AXRP, we discussed this equilibrium selection problem, which I think is very important and motivates a bunch of my work. So maybe if you have less information about the other player, then you get fewer equilibria. Maybe in the extreme case, maybe if you get only very little information about the player, maybe you only get one additional equilibrium relative to the equilibria of the underlying game.
(02:26:53): And I think we discussed the similarity-based cooperation paper also on the previous episode, and that is basically such a setting. It’s basically a program equilibrium setting where you don’t get the full opponent source code, but you get some signal, in particular how similar the opponent is to you. And there are some results about how you get only good equilibria this way.
(02:27:23): I think in general, that’s sort of a natural direction to go in. Also, you can also do more practical things there. The similarity-based cooperation paper has some experiments. You can do experiments with language models where in some sense, this is sort of true. If my program is “I prompt a particular language model” and then you know my prompt but you don’t know all the weights of my language model, or maybe you can’t do very much with all the weights of my language model, that is a sort of partial information program equilibrium. So I think that is another natural direction.
(02:28:03): And then also, I think you drew these connections to decision theory, which is: in some sense, if you are the program and you have to reason about how you’re being simulated and people are looking at your code and stuff like that, how should you act in some kind of rational choice-type sense? That’s sort of the problem of decision theory. And in some ways, you could view this program equilibrium setting as sort of addressing these issues by taking this outside perspective. Instead of asking myself “what should I, as a program who’s being predicted and simulated and so on, what should I do?”, instead of that, I ask myself, “I’m this human player who’s outside the game and who can submit and write code, what is the best code to submit?”
(02:28:59): And in some sense, that makes the question less philosophical. I’m very interested in these more philosophical issues. And I feel like the connections here aren’t fully settled: what exactly does this “principal” perspective, or this outside perspective, correspond to from the perspective of the agent? Like you said, this equilibrium where everyone checks that they’re equal to the other player, that’s an equilibrium where the programs themselves aren’t rational. They don’t do expected utility maximization, they just do what their source code says. So I think this is much more philosophical, much more open-ended than these more technical questions about what equilibria you can achieve. But I’m still very interested in those things as well.
Following Caspar’s research

Daniel Filan (02:29:49): So the final question I want to ask is: if people are interested in this work and in particular in your work, how should they find more?
Caspar Oesterheld (02:30:00): So I just have an academic website. Fortunately my name is relatively rare, so if you Google my name, you’ll find my academic website. You can also check my Google Scholar, which has a complete list of my work. I also have a blog where I occasionally post things somewhat related to these kinds of issues, which is just casparoesterheld.com, which in principle should allow subscribing to email notifications.
(02:30:39): And I also have an account on X, formerly Twitter, which is C_Oesterheld. Yeah, I think those are probably all the things.
Daniel Filan (02:30:51): Great. Cool. So there’ll be links to that in the transcript. Caspar, thanks very much for coming on the podcast.
Caspar Oesterheld (02:30:56): Thanks so much for having me.
Daniel Filan (02:30:57): This episode is edited by Kate Brunotts, and Amber Dawn Ace helped with the transcription. The opening and closing themes are by Jack Garrett. This episode was recorded at FAR.Labs. Financial support for the episode was provided by the Long-Term Future Fund, along with patrons such as Alexey Malafeev. You can become a patron yourself at patreon.com/axrpodcast or give a one-off donation at ko-fi.com/axrpodcast. Finally, if you have any feedback about the podcast, you can fill out a super short survey at axrp.fyi.
Thoughts about Understanding
Epistemic status: these are speculative thoughts I had while trying to improve my understanding. Not tested yet.
What differentiates understanding from non-understanding?

When you pull a door towards you, you predict it will move towards you in a particular way. You can visualize the movement in your mind's eye. Similarly, the door having been pulled, you can infer what caused it to end up there.
So, starting from a cause, you can predict its effects; starting from an effect, you can infer its cause.
Let's call that understanding. You instantiate a causal model in your mind: you see how a change in one part of the model affects the rest, and you see what changes have to occur to reach a desired state of the model. The speed and accuracy with which you can predict effects or causes, along with the number of changes you know how to propagate and the number of goal states you know how to reach, determine the depth of your understanding.
Conversely, non-understanding would be not being able to visualize what happens when you pull the door, or not having any idea how the pulled door got to where it stands.
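The two directions described above can be sketched as a toy causal model in code. This is only an illustration: the door example is from the text, but the lookup-table representation and the function names are my own invention, not a claim about how brains work.

```python
# A toy causal model of the door example. "Understanding" here means being
# able to run the model forward (cause -> effect) and backward (effect -> cause).
causes = {
    "pull door": "door swings toward you",
    "push door": "door swings away from you",
}

def predict_effect(cause):
    """Forward direction: given a cause, predict its effect."""
    return causes.get(cause, "no prediction (non-understanding)")

def infer_cause(effect):
    """Backward direction: given an observed effect, infer a cause."""
    for cause, eff in causes.items():
        if eff == effect:
            return cause
    return "no idea (non-understanding)"

print(predict_effect("pull door"))               # door swings toward you
print(infer_cause("door swings away from you"))  # push door
print(predict_effect("turn the key"))            # no prediction (non-understanding)
```

Non-understanding is simply the model having no entry for the query, in either direction.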
So how do we go from non-understanding to understanding?

Say you don't know/understand what happens when you pull a door... Then you pull a door.
Now you understand what happens.
Why is that? Well, your brain has native hardware that understands cause-effect models on its own. You just need reality to shove the relationship in your face hard enough, and your brain will go "ok, seems legit, let's add it to our world-model".
Now let's consider a mathematical proof. You follow all the logical steps, and you agree, "sure enough, it's true". But you still feel like you don't really grok it. It's not intuitive, you're not seeing it.
What's going on? Well, this is still a brand-new truth, you haven't practiced it much so it has not become part of your native world-model. And unlike more native things, it is abstract. So even if you try to pattern match it to existing models to make it feel more real, more native through analogies, such as "oh, light behaves like waves", it doesn't work that well.
This usually naturally goes away the more you actually use the abstract concept: your brain starts to accept it as native and real, and eventually light behaves like light. It even feels like it always was that way.
Ok, but what can we actually do with all this?

Consider a complicated math equation. There are symbols you do not understand. However, you do know this is math.
What's the algorithm to go from non-understanding it to understanding it?
Steps:
- If you feel "ugh, math...", quickly brush aside that feeling by evoking the wonders of piercing through the veil of reality as you investigate these mysterious symbols ;)
- Start by noticing what you know: you know this is a math equation, so you know there is meaning in it, and you also know the meaning as a whole emerges from the meaning of each symbol interacting with other symbols (unlike words where letters have no meaning). This tells you that you should figure out the meaning of each symbol, the interactions between symbols, and from that you'll get to the meaning of the whole equation.
- Notice what you don't know. This is the most important part, the most overlooked skill. The thing they don't teach you in schools. Out of what you do not understand, start with the pieces that will yield the most information. You have limited time, so you want to do a quick cost-benefit analysis in order to rank what you should investigate first. Going from general, high-level pieces to specific, low-level details often makes sense, to avoid getting stuck in time-costly rabbit-holes.
- Having created that ranking, you go through it one by one. You investigate each piece, breaking it down further by returning to the earlier steps if necessary. Say the first piece is a math operator: you look it up, read about it, mentally manipulate it, until it makes sense.
- Go back to the equation, plug in your newly understood piece, and try to understand the equation again. The thing is, your brain doesn't need to store a full model in order to understand things; it only needs a compressed model, from which it can re-derive missing pieces and relationships on the fly. Maybe the piece you just added is enough for your brain to make the connections and derive the rest of the model. If so, congratulations! You just went from non-understanding to understanding. If not, move on to the next piece in the ranking and investigate it (look it up, break it down, and so on) until you understand it. Repeat until you understand the whole equation. You may want to adjust the ranking itself as you iterate.
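The rank-then-iterate loop above can be sketched in a few lines. Everything concrete here is hypothetical: the example pieces, the made-up info-gain and cost scores, and the `investigate` stub standing in for actually looking something up.

```python
# Sketch of the understanding algorithm: rank unknown pieces by expected
# information gain per unit of cost, then investigate them in that order.
# The pieces, scores, and costs are invented illustrative numbers.
unknown_pieces = [
    {"name": "integral sign",    "info_gain": 9, "cost": 2},
    {"name": "subscript i",      "info_gain": 3, "cost": 1},
    {"name": "obscure operator", "info_gain": 5, "cost": 8},
]

def rank(pieces):
    """Quick cost-benefit analysis: highest info-per-cost first."""
    return sorted(pieces, key=lambda p: p["info_gain"] / p["cost"], reverse=True)

def investigate(piece):
    """Stand-in for looking the piece up and manipulating it until it makes sense."""
    print(f"investigating {piece['name']}...")
    return True  # assume we eventually understand it

understood = []
for piece in rank(unknown_pieces):
    if investigate(piece):
        understood.append(piece["name"])
    # In the real loop you would re-check the whole equation here, and stop
    # early if the remaining pieces can now be derived from the compressed model.

print(understood)  # ['integral sign', 'subscript i', 'obscure operator']
```

Note that the ranking is recomputed trivially here; in practice you would re-score the remaining pieces each iteration, since understanding one piece changes the expected gain of the others.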
An important part of this understanding algorithm is being meticulous about noticing what you don't understand. The issue is that there's probably a bunch of stuff you don't understand, and you have limited time, so you need to become really good at ranking. My hope is that with practice these two skills become second nature. Then, at a glance, you're able to see which pieces you don't understand and guesstimate the most important among them, as well as the cost of analyzing them, from which you can prioritize.
This approach has the huge added benefit of being very active, and thus motivating. Keeping a tight feedback loop is probably key, as is trying to understand by yourself before searching. As for the search part, you might want to experiment with an LLM pre-prompt so it gives you a brief answer to any question you ask. A "no thinking" mode, or even a local LLM, may be better for low latency, tightening the feedback loop.
The key principle behind this understanding algorithm is fairly simple. Suppose you understand that A causes B, and that B causes C. If you can hold both statements in your mind, or have practiced them enough that they stand as a compressed second-nature pointer you can refer to compactly, or can at least follow the chain step by step and accept its logical conclusion without holding the whole thing in your mind, then you understand, to some degree, that A causes C.
The kind of causal understanding I talk about is just a big graph of cause-effect relationships. To understand the graph as a whole, you need to understand enough of the individual cause-effect relationships. To learn efficiently, you need to focus on the cause-effect relationships that give you personally the most information for the least effort. And if you want to learn fast, you need to develop these noticing and prioritizing skills until you are good and fast at them.
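The "A causes B, B causes C, therefore A causes C" principle is just reachability in that cause-effect graph. A minimal sketch, with an invented three-node graph:

```python
from collections import deque

# The causal graph as adjacency lists: an edge X -> Y means "X causes Y".
graph = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
}

def causes(graph, start, target):
    """Breadth-first search: does some chain of cause-effect links
    lead from `start` to `target`?"""
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(causes(graph, "A", "C"))  # True: the chain A -> B -> C exists
print(causes(graph, "C", "A"))  # False: causation doesn't run backward here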
I've heard that surprisingly few concepts are necessary to understand complex ideas or complex proofs. That's encouraging. It may be that by perfecting this learning technique one could learn extremely fast, and stumble across new insights as well.
Performance on this task should be measured. How many seconds does it take to learn a concept on average? One concept every 5 minutes? Can we tighten the loop and go lower? One concept a minute? One every 30 seconds? Maybe not: that would be 120 concepts/hour. Apparently, this many concepts an hour is wildly biologically implausible: the brain needs to consolidate memories, there can be interference issues, etc. But investigating the limits sounds like fun anyway. Also, consider that the more concepts we learn, the higher the probability that the brain will auto-unlock a bunch of related concepts, so who knows?
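The back-of-the-envelope rates above are easy to check:

```python
# Concepts per hour at various loop times (seconds per concept).
for seconds_per_concept in (300, 60, 30):
    rate = 3600 / seconds_per_concept
    print(f"1 concept every {seconds_per_concept}s -> {rate:.0f} concepts/hour")
# 5 minutes -> 12/hour, 1 minute -> 60/hour, 30 seconds -> 120/hour
```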
Learning this fast, what could one learn, and what could one do?
Should you read books?

Reading books is like getting lots of answers to questions you don't ask.
The great thing is that you get lots of data very fast: you don't have to go through the steps of noticing what you don't understand and looking for answers. It also helps you discover unknown unknowns.
The bad thing is that the data may not be informative to you: you may already know some of it, or not understand it and have to run the understanding algorithm on the book anyway. And since you're not asking the questions yourself, it can become boring or tedious, which sure doesn't help with absorbing data.
From that, I'd say engaging introductory books and documentaries on subjects you don't know, to get a feel for a field, are probably most efficient.
Monday AI Radar #13
This week’s newsletter in a word: “velocity”. We’ll take a deep look at last week’s big model drops (just a few months after the previous big drops), and try to figure out whether they’ve reached High levels of dangerous capabilities. Nobody’s quite sure, because capabilities are outrunning evaluations.
We also check in on the country of geniuses in a data center (still 2028, according to Dario), contemplate what we should align AI to (assuming we can figure out how to align it to anything), and catch up on the Chinese AI industry.
Top pick

Something Big Is Happening

Matt Shumer’s Something Big Is Happening has been making the rounds this week. It’s a great “you need to wake up” piece for anyone you know who doesn’t understand the magnitude of what’s happening right now.
But it’s time now. Not in an “eventually we should talk about this” way. In a “this is happening right now and I need you to understand it” way. [...]
The experience that tech workers have had over the past year, of watching AI go from “helpful tool” to “does my job better than I do”, is the experience everyone else is about to have. Law, finance, medicine, accounting, consulting, writing, design, analysis, customer service. Not in ten years. The people building these systems say one to five years. Some say less. And given what I’ve seen in just the last couple of months, I think “less” is more likely.
My writing

Ads, Incentives, and Destiny

OpenAI has started showing ads in some tiers of ChatGPT. They’re fine for now, but I worry about where those incentives lead.
New releases

Zvi reports on Claude Opus 4.6

Opus 4.6 is a pretty big deal—it’s a substantial upgrade to Opus 4.5, which was probably already the best overall model (and which just shipped 2 months ago). Not surprisingly, Zvi has lots to say about it.
Claude Opus 4.6 Escalates Things Quickly. It’s a very good model.
System Card Part 1: Mundane Alignment + Model Welfare Key takeaways:
- Anthropic’s system cards are far better than any other lab’s
- But also, they aren’t good enough
- We are increasingly flying blind: our evaluations simply aren’t able to usefully measure the safety (or lack thereof) of 2026 frontier models
- Like OpenAI, Anthropic is very close to ASL-4 thresholds on multiple fronts
System Card Part 2: Frontier Alignment
I want to end on this note: We are not prepared. The models are absolutely in the range where they are starting to be plausibly dangerous. The evaluations Anthropic does will not consistently identify dangerous capabilities or propensities, and everyone else’s evaluations are substantially worse than those at Anthropic.
Zvi looks at ChatGPT-5.3-Codex

Does Zvi sleep? Nobody knows. ChatGPT-5.3-Codex is an excellent model, and this is a significant upgrade.
GPT‑5.3‑Codex‑Spark

Intriguing: GPT‑5.3‑Codex‑Spark is a less capable version of Codex that can generate more than 1,000 tokens per second, which is fast. Like, really fast. Sometimes you need maximum intelligence, but for many applications, model speed is an important rate limiter for productivity. A super-fast, good-enough model might be a game changer for many tasks.
Cursor Composer 1.5

Cursor have upgraded Composer, their in-house agentic coding model, to version 1.5.
Gemini 3 Deep Think

There’s a significant update to Gemini 3 Deep Think, focusing on science, research, and engineering. Simon Willison reports that it raises the bar for bicycle-riding pelicans.
Agents!

We Just Got a Peek at How Crazy a World With AI Agents May Be

Now that the frenzy over OpenClaw and Moltbook has died down, Steve Newman takes a look at what just happened (not all that much, actually) and what it means (a sneak peek at some aspects of the future).
OpenClaw, OpenAI and the future

Well, that didn’t take long. Peter Steinberger (the creator of OpenClaw) is joining OpenAI. OpenClaw will be moving to a foundation.
Benchmarks and Forecasts

Dario Amodei does interviews

Two really good interviews with Dario this week:
- With Dwarkesh Patel. Characteristically long and in-depth, with some really good discussion of exponentials and the timeline to the fabled country of geniuses in a data center. Zvi shares his thoughts.
- With Ross Douthat ($) (who’s been slaying it lately). This one is shorter and more philosophical.
AI is getting very good at almost everything, including complex cognitive tasks that require deep understanding and judgment. The Atlantic reports on AI forecasters at recent Metaculus tournaments ($):
Like other participants, the Mantic AI had to answer 60 questions by assigning probabilities to certain outcomes. The AI had to guess how the battle lines in Ukraine would shift. It had to pick the winner of the Tour de France and estimate Superman’s global box-office gross during its opening weekend. It had to say whether China would ban the export of a rare earth element, and predict whether a major hurricane would strike the Atlantic coast before September. […]
The AI placed eighth out of more than 500 entrants, a new record for a bot.
What the hell happened with AGI timelines in 2025?

2025 was a wild year for timelines: exuberance early on, then a substantial lengthening in the middle of the year, and another round of exuberance at the end of the year. Rob Wiblin explores why those shifts happened, with insightful analysis of the underlying trends. It’s a great piece, though it largely ignores the most recent shift.
Takeoff speeds rule everything around me

Much of the timelines discussion focuses on how long it takes to get to AGI, but Ajeya Cotra thinks takeoff speed is the most important crux (i.e., how fast we go from AGI to whatever happens next).
Grading AI 2027’s 2025 Predictions

The AI-2027 team calculate that the rate of AI progress in 2025 was about 65% of what they predicted.
AI is getting much better at hands

Andy Masley checks in on how well AI can draw hands.
Using AI

Tracking the “manosphere” with AI

Very often the question isn’t “how does AI let us do the usual thing cheaper”, but rather “what can we now do that wasn’t practical to do before?”. Nieman Lab reports on a slick tool at the New York Times:
When one of the shows publishes a new episode, the tool automatically downloads it, transcribes it, and summarizes the transcript. Every 24 hours the tool collates those summaries and generates a meta-summary with shared talking points and other notable daily trends. The final report is automatically emailed to journalists each morning at 8 a.m. ET.
Alignment and interpretability

There’s been some good discussion lately of what we should align AI to (which is separate from and almost as important as how to align it to anything at all).
Oliver Klingfjord believes integrity is a critical component:
Integrity isn’t everything in AI alignment. We want models with domain expertise, with good values, with the wisdom to enact them skillfully. Integrity doesn’t speak to the goodness of values. But it does speak to how deeply they run, how stable they are under pressure. It’s what lets us trust a model in situations we never anticipated.
Richard Ngo goes in a somewhat different direction, arguing for aligning to virtues.
I like that both Oliver and Richard emphasize the importance of generalizing well to unforeseen circumstances, which is a shortcoming of more deontological approaches like OpenAI’s.
Cybersecurity

Claude finds 500 high-severity 0-day vulnerabilities

In a convincing demonstration of AI’s ability to find vulnerabilities at scale, Anthropic uses Opus 4.6 to find more than 500 high-severity zero-day vulnerabilities. The accomplishment is impressive, and the account of how Claude went about finding them is very interesting. If you’re wondering why both OpenAI and Anthropic believe they’re reaching High levels of cyber capabilities, this is why.
Lockdown Mode in ChatGPT

There is a fundamental tension between capability and security: technology that can do more will necessarily have a larger attack surface. OpenClaw was a great example of going all the way to one extreme, enabling an immense amount of cool capability by taking on a staggering level of risk. At the other end of the spectrum, OpenAI is rolling out Lockdown Mode for ChatGPT. Much like Lockdown Mode on the iPhone, this significantly reduces ChatGPT’s attack surface at the cost of significantly curtailing some useful capabilities. It’s meant for a small number of people who are at elevated risk of targeted cyberattacks.
Jobs and the economy

AI Doesn’t Reduce Work—It Intensifies It

This won’t come as a shock to anyone who’s felt the exhilaration (and compulsion) of having AI superpowers. Aruna Ranganathan and Xingqi Maggie Ye find that hours worked often increase when people get access to AI, with much of the pressure being self-imposed. Their analysis of the issue is great, but I’m less sold on their proposed solutions.
AI and the Economics of the Human Touch

Adam Ozimek argues that concerns about AI’s impacts on jobs are overstated because many jobs require a human touch: we prefer to have humans do those jobs even though we already have the ability to automate them. It’s a good and thoughtful piece, but I think it largely misses the point. We haven’t automated supermarket cashiers not because people love interacting with human cashiers, but because the automated replacements aren’t yet good enough. That will change soon.
Strategy and politics

Dean Ball On Recursive Self-Improvement (Part II)

Dean is characteristically cautious about writing regulations before we understand what we’re regulating. He proposes a system of third-party safety audits (much like our existing system for auditing corporate finances), where certified private auditors perform regular inspections of whether AI developers are following their own safety guidelines.
Did OpenAI violate California’s AI safety law?

Directly related to Dean’s piece, The Midas Project argues that when OpenAI released GPT-5.3-Codex, they appear to have violated California’s SB 53. Briefly: SB 53 takes a light touch to safety regulation, but requires that labs publish and adhere to a safety framework. Midas believes that OpenAI is treating GPT-5.3-Codex as having High capability in cybersecurity, but hasn’t activated the safeguards they said they would activate when that happened. OpenAI is pushing back—it’ll be interesting to see what California decides.
In the meantime, Steven Adler takes a detailed look.
China

Is China Cooking Waymo?

If you live in the US, you likely aren’t aware of how well China is doing with electric vehicles and autonomous vehicles. ChinaTalk takes a deep look at autonomous vehicles, diving into deployments in both the US and China, how the international market is shaping up, and how the supply chain works.
Is China falling behind?

Teortaxes argues that based on the WeirdML benchmark, the Chinese open models are falling further behind the frontier.
China and the US Are Running Different AI Races

Poe Zhao at AI Frontiers looks at the very different economic environment facing AI companies in China (much less private investment, and much less consumer willingness to pay for AI). Those factors shape their strategic choices, driving a focus on international markets, and a heavy emphasis on inference cost in both model and hardware design.
AI psychology

The many masks LLMs wear

One of the big surprises of the LLM era has been how strangely human-like AI can be. (The frequent occasions when it’s shockingly un-humanlike are perhaps stranger but less surprising.) Kai Williams at Understanding AI explores character and personality in LLMs.
Industry news

More on ads in ChatGPT

Zoë Hitzig has an opinion piece in the New York Times:
This week, OpenAI started testing ads on ChatGPT. I also resigned from the company after spending two years as a researcher helping to shape how A.I. models were built and priced, and guiding early safety policies before standards were set in stone.
I once believed I could help the people building A.I. get ahead of the problems it would create. This week confirmed my slow realization that OpenAI seems to have stopped asking the questions I’d joined to help answer.
The Anthropic Hive Mind

Steve Yegge talked to a bunch of Anthropic employees and shares some thoughts about their unique culture.
Technical

microgpt

Wow. Karpathy has built a complete GPT engine in 200 lines of code.
Training compute matters a lot

Really interesting paper on the importance of training compute relative to algorithmic improvements:
At the frontier, 80-90% of performance differences are explained by higher training compute, implying that scale—not proprietary technology—drives frontier advances.
How persistent is the inference cost burden?

Toby Ord has recently made a good case that reinforcement learning has scaling challenges that present a significant obstacle to continued rapid improvement in capabilities. Epoch’s JS Denain isn’t entirely convinced:
Toby’s discussion of RL scaling versus inference scaling is useful, and the core observation that RL gains come largely with longer chains of thought is well-taken. But the picture he paints may overstate how much of a bottleneck this will be for AI progress.
Rationality

What Kind Of Apes Are We?

David Pinsof continues his excellent conversation with Dan Williams regarding human nature, the enlightenment, and evolutionary misfit. I love the way this conversation is happening, and I’m learning a lot from it: I’ve significantly updated some key beliefs I hold about how humans are not well evolved to handle the modern environment.
So my response to Dan might be something like, “Yea, maybe humans are kind of confused and maladapted sometimes, but it’s also really insightful to see humans as savvy animals strategically pursuing their Darwinian goals.” And Dan might say something like, “Yea, it’s pretty insightful to see humans as savvy animals strategically pursuing their Darwinian goals, but it’s also really important to recognize that humans are confused and maladapted sometimes.” It’s basically a disagreement over where to put the italics.
The Math And The Territory
Admissibility of Mathematical Evidence
When do we say that a mathematical truth is an accurate account of a phenomenon we are interested in? Much as a video can be seen as a true representation of some key events, e.g., a wedding, a speech, or a robbery, mathematical artefacts can be viewed as representations of real-world phenomena (things as we see them). These representations can then be used as relevant materials in other contexts, for instance, video evidence in court or mathematical theorems used in physics and engineering. With video (before the age of AI-generated video), a high degree of trust always came with video evidence of some activity. If I saw a video of a president giving a speech, then it could be said with a high degree of certainty that the speaker in the video indeed made certain statements or remarks. Another compelling example is the use of admissible video evidence in court, such as footage of someone breaking the law on camera. We have built institutions to determine the relevance of evidence in the context of the law: video evidence must meet certain criteria to be admissible in court, such as a chain of custody. We state these examples to motivate our discussion of the admissibility of mathematical evidence, particularly in the age of advanced AI systems that can output valid mathematical artefacts (or truths, if you like). When is a piece of mathematics relevant, in the same way that we ask when a video file is relevant in a court of law?
Notice that we have introduced the term relevance above. For our discussion of the admissibility of mathematical evidence, relevance will refer to anything that a person could care about in their meaning-making activities. In mathematics, this could be the practice of mathematics in and of itself, e.g., using theorems in one field of mathematics to prove theorems in another, or an application to another field, say engineering, physics, or finance. Given this, we will state the following: a mathematical truth is relevant if it is used in the course of a meaning-making activity. We therefore admit mathematical evidence if we use it in some form for meaning-making; for instance, the fundamental theorem of calculus and its application to rocket engineering. We don’t tackle the problem of relevance realization here, as it is a wide and complex topic under active research.
The potential for AI to generate valid mathematical artefacts, for instance Kimina and DeepSeek Prover, introduces significant changes to mathematical activity. Following our discussion of relevance, it is important to consider how these tools will impact the use of mathematical truths. Part of this shift stems from the failure modes these systems might, and will, face. One of the pressing challenges we see is false advertising using valid mathematical artefacts: an actor who, whether malicious or not, uses these math agents to produce mathematics supporting potentially dubious claims. This is already happening even without the help of AI. We posit that it will become prevalent as more people gain access to genius-level mathematicians in their pockets. This is one example of how trust in mathematical truths might be compromised, thereby undermining confidence in a crucial aspect of human society.
In light of these concerns about trust and reliability, it becomes clear that we need new ways to apply mathematical evidence without running the risk of spreading falsehood. Earlier, we discussed how institutions of law have developed systems to reliably apply video evidence in courts; we likewise need systems that increase the reliability of mathematical truths. One system we propose is live discernment, whereby human attunement to relevance is at the center of the application of mathematical artefacts. Here, we imagine that many mathematical artefacts will be produced by AI agents; live discernment will therefore enhance humans’ ability to curate the relevance of these artefacts in their meaning-making activities.
Mathematical truths in the age of cheap intelligence
Powered by AI, we will see increasingly strong mathematical cover stories being created. Traditionally, to apply mathematical insights to a domain, one had to be an expert in both the domain and the mathematical formalism of the insight. Alternatively, one needed access to a collaborator with the mathematical chops to understand the insight and, through conversation, apply it to the domain in question. A significant amount of an applied mathematician’s time is spent on the conversion/translation from the domain of application to the mathematical formalism. In a world where everyone has access to mathematical agents, you can imagine someone interfacing with said agents and thereby producing valid mathematics that may be relevant to their area of application. They no longer need to be friends with an actual mathematician to get particular mathematical insights applied to their domain. I can already do this by querying a system like DeepSeek Prover. It can produce mathematical symbols and statements that are very similar to what a “real mathematician” might produce. The difference is that a real mathematician would have needed human understanding (knowledge) of my domain to produce the bespoke formalisms that I then apply to my side of things. This may or may not be the case with the agents that produce the formalisms; we cannot tell whether the agent understood what was relevant to us and used this understanding to produce mathematics that was true to our needs. Much of applied mathematics is the art of choosing which assumptions to use and which to discard when applying mathematical artefacts to the real world. Since real mathematicians are “integrated” with the real world, the assumptions they outline may hold water; the same cannot be said for AI math agents.
For all we know, the agent might be gaslighting us, as we see with today’s coding agents. Gaslighting and producing invalid mathematical artefacts may be classified as a shallow risk. A more insidious risk is that of false positives and false negatives among valid mathematical insights.
Mathematics Enabled Deception
Several articles have been published showing how we can lie with numbers. This will only get worse in the age of math-capable AI systems in everyone’s hands.
False Positives in Mathematics [Mathematical Over-specification]
Lying with mathematics has officially been turbocharged by the rise of math-capable AI systems. One way we can lie with math is through false positives.
This happens when the mathematical artefact is thinly specified: hyper-specific, or highly abstracted.
What makes a mathematical artefact a false positive?
- They are false since they are not relevant to what we hold as real, i.e., they are disconnected from reality. They are positive since they follow all mathematical rules and procedures and are represented in accordance with cultures established within mathematics.
- They are false as they ascribe something to the world that is not real. They are positive as they are valid mathematical constructions.
- A false positive is deciding something is more important because we have a mathematical structure for it.
When is the use of a mathematical truth put into question? Notice that here we are not questioning the validity of a mathematical truth but rather its relevance. One answer is the case where the assumptions used to apply that mathematical artefact don’t hold up in the real world, or don’t match observations of the phenomena they represent. For instance, when a mathematical representation of a concept space says something about a concept that isn’t part of the concept.
There is always a “playful tension” between mathematics and the real world (at least as we can observe it), and through this interaction we get to understand the world and also to discover more mathematical truths. Sometimes the mathematical truths we uncover don’t accurately “stand for” the phenomena we directly observe, and in these cases we can develop new mathematics to fit the phenomena, or we can choose to work with the existing formalisms under strict conditions. The latter is often useful in highly controlled settings such as engineering: when designing electrical circuits, for instance, solving Maxwell’s equations in full would take prohibitively long, so we use “toy versions” of the equations, knowing full well that they don’t represent the real world (at least assuming the full Maxwell equations do). Sometimes this can be misleading and might lead us to make mistakes.
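A familiar small-scale analogue of this “toy versions of the equations” move (my own example, not from the text): the small-angle approximation sin θ ≈ θ is perfectly serviceable in a controlled regime and quietly misleading outside it. A quick numerical check:

```python
import math

# The "toy equation" sin(theta) ~ theta: accurate for small angles,
# badly wrong for large ones. Useful under strict conditions, misleading
# if applied outside the regime where its assumptions hold.
for theta in (0.01, 0.1, 1.0, 2.0):
    exact = math.sin(theta)
    approx = theta  # the simplified stand-in
    rel_error = abs(approx - exact) / abs(exact)
    print(f"theta={theta}: exact={exact:.4f}, toy={approx:.4f}, "
          f"relative error={rel_error:.1%}")
```

At θ = 0.01 the error is negligible; by θ = 2 the “toy” answer is more than double the true value, which is exactly the kind of silent failure that strict conditions on a formalism are meant to guard against.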
Stockholm Syndrome in Mathematics
It is easy to lie to people who already "believe in mathematics" using mathematics.
As a civilisation, we have held mathematics at a high standard as a representation of truths about our world. People trust the logic and rigour found in mathematics, and this can be hijacked by unscrupulous actors aimed at deceiving. They can use the same logic that is revered by mathematicians and other practitioners to promote false claims. This will be enhanced by math-capable AI Systems that can produce valid math to support any claims, founded or not. One ludicrous example we have been toying with is the Time Cube one. We have teased out a mathematical formalism from an AI that supports the Time Cube theory of a day. The formalisms that the AI generated are valid and can pass a formalism check from any mathematician. If we didn’t know that the Time Cube idea is bonkers, someone could publish the math the AI generated and use it to support the nonsensical idea. People who have traditionally trusted mathematics might be swayed by the arguments presented in a language they hold in high regard.
Spoofing Surprise and Relevance
Surprise and relevance of mathematical artefacts can be spoofed.
Sometimes following the rules of a mathematical formalism might lead to surprises, and sometimes it doesn’t. Sometimes constructing a mathematical structure might lead to surprising applications, and sometimes it doesn’t. Sometimes the math itself might be surprising, and sometimes the application of the mathematical insight in another domain might be surprising. How do we categorise these surprises? When are surprises meaningful, and when are they not? Could we spoof the surprisingness of a mathematical insight both in itself and also in an application in the real world? The answer seems likely to be yes. How does this spoofing change with the introduction of math-capable AI systems? I think that, in the same way we can spoof video evidence using AI generators, we can spoof surprise in mathematical results using AI systems. We could coax an AI system to generate valid constructions to support a surprising argument that we want to convince people of. This also rests on the assumption that mathematical artefacts are only interesting to us if they are surprising, since their statements and, therefore, their proofs are almost always a “crude” application of the mechanics of mathematical reasoning. Basically, turn the mathematical crank and see what comes out the other side. It is when mathematicians notice something unusual, unintuitive, or surprising that it becomes a “big deal”. This also happens when a practitioner finds a surprising application of a mathematical insight. Given that AIs are good and will probably become better at generating perfectly valid mathematical definitions, theorems, and proofs, cases of actors coming up with seemingly interesting mathematical results will increase, and it will become harder to tell actually good results from contrived ones.
False Negatives in relation to Mathematics
What makes a phenomenon a false negative:
- A false negative is something meaningful, but we've decided to de-emphasize it because we don't have a mathematical structure for it.
Mathematical artefacts can sometimes fall short of capturing a phenomenon we might be interested in. This happens quite often, and we usually introduce assumptions about the real world (or the phenomena we are interested in) so that the mathematics can actually be of use. One reason this happens is due to the rigid structure of these mathematical formalisms. This doesn’t allow them to “grow” into the phenomena we are tracking. It usually requires a mathematician to devise a new structure that is sufficient to represent what we are interested in. And this often doesn’t work. This leads to one “disregarding” the phenomena that the mathematical gadgets failed to capture as irrelevant or unimportant.
Review of the System Theory as a Field of Knowledge
Companies send employees to systems theory courses to hone their skills in designing high-load systems. Ads pop up with systems-thinking courses claiming it’s an essential quality for leaders. Even some children’s toys have labels saying “develops systems thinking”. “An interdisciplinary framework for analyzing complex entities by studying the relationships between their components” – sounds like excellent bait for a certain kind of people.
I happen to be one of those people. Until recently, I’d only encountered fragmented bits of this discipline. Sometimes those bits made systems theory seem like a deep trove of incredibly useful knowledge. But other times the ideas felt flimsy, prompting questions like “Is that all?” or “How do I actually use this?”.
I didn’t want to make a final judgment without digging into the topic properly. So I read Thinking in Systems (Donella Meadows) and The Art of Systems Thinking (Joseph O’Connor, Ian McDermott), a couple of books that apply the principles of systems thinking, and several additional articles, to finally form an honest impression. I hope my research helps you decide whether it’s worth spending time on courses and books about systems thinking.
TL;DR: 5/10. In-depth research is probably not worth your time, unless you want to pick up a bunch of loose heuristics that you probably already know and that are hard to make use of.
An Example System for Analysis
To make the critique more concrete, let me show a short example of the kinds of systems discussed in the books above.
A system (per Meadows) is a set of interconnected elements, organized in a certain way to achieve some goal. If you’re thinking this definition describes almost anything, you’re absolutely right. Systems theory academics aim to develop a field whose laws could describe both a single organism and corporate behavior or ecological interactions in a forest. This is allegedly the power of systems theory — but, as you’ll soon see, also its weakness.
Systems theorists suggest decomposing any complex system into its constituent stocks. These can be material (“gold bars,” “salmon in a river,” “battery energy reserve”) or immaterial (“love,” “patience”). Stocks are connected by flows of two types: positive, where an increase in one resource increases another, and negative, where an increase causes a decrease.
(This is not the only way to define “system” or “relationships in a system,” but it’s the clearest one and easiest to explain. Systems theory isn’t limited to such dynamic systems. The text below doesn’t lose much by focusing on this type — static systems have roughly the same issues.)
Suppose we want to represent interactions between animals and plants in the tundra. Wolves reduce the number of reindeer, and reindeer reduce the amount of reindeer lichen. This can be expressed with the following diagram:
There are negative connections between wolves and reindeer, and between reindeer and lichen. Almost all systems include time delays. For example, here wolves can’t immediately eat all the reindeer.
A system may contain feedback loops — cases where a stock influences itself. These loops can also be positive (increasing the stock leads to further increase) or negative (increase leads to decrease). If we slightly complicate the previous example to include reproduction, we get something like this:
The larger the population of wolves (or reindeer, or lichen), the more newborns appear — again, with a delay. This is a positive feedback loop.
For simplicity, influence is usually assumed linear: more entities on one side of an arrow lead to proportionally greater influence on the other side. But in general there are no constraints on transfer functions — they can be arbitrary. Let’s add another layer: the amount of available food influences population survival.
When there are more reindeer than wolves, it has no direct effect on the wolves. But when there are fewer, wolves starve and die off. That’s an example of negative feedback: the wolf population indirectly regulates itself. (Even though only one side of the difference matters, the arrow is still marked as positive — the further below zero the difference gets, the fewer wolves or reindeer remain.)
Such systems can show very complex behavior due to nonlinear transfer functions and delays. For example, in an ecological system like this one, you can predict population waves:
- Wolves grow in number while they have enough food
- They hunt too many reindeer
- Wolf numbers then crash because there’s nothing left to eat
- Reindeer recover without predators
- Wolves recover because reindeer recover
- Cycle repeats
If the lichen is abundant and grows fast enough, you can predict oscillations in its quantity as well (driven by reindeer population swings) but much less pronounced.
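A toy version of these dynamics takes only a few lines to write down. This is my own sketch, not code from the article; every coefficient below is made up, and different values produce waves, crashes, or a steady state.

```python
# Discrete-time stock-and-flow sketch of the wolves/reindeer/lichen system.
# All coefficients are arbitrary illustrations, not fitted to anything real.

def simulate_tundra(steps=400):
    wolves, reindeer, lichen = 10.0, 100.0, 1000.0
    history = []
    for _ in range(steps):
        eaten_reindeer = 0.002 * wolves * reindeer    # negative link: wolves -> reindeer
        eaten_lichen = min(0.05 * reindeer, lichen)   # negative link: reindeer -> lichen
        wolves += 0.1 * eaten_reindeer - 0.05 * wolves              # births tied to food eaten
        reindeer += 0.004 * eaten_lichen - eaten_reindeer
        lichen += 0.03 * lichen * (1 - lichen / 2000.0) - eaten_lichen  # logistic regrowth
        # Populations can't go negative.
        wolves, reindeer, lichen = (max(v, 0.0) for v in (wolves, reindeer, lichen))
        history.append((wolves, reindeer, lichen))
    return history
```

With these particular numbers the wolves simply die out; nudge the birth and predation coefficients and you get the population waves described above, which is precisely the point the review makes next: the structure alone doesn’t tell you which regime you’re in.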
Real systems are far more complex. There are other animals and plants besides wolves, reindeer, and lichen. You’d also want to consider non‑biological resources like soil fertility and water. Systems theory encourages limiting a system’s scope sensibly based on the question at hand.
Systems theory says: “Let’s see how such systems behave dynamically! Let’s examine many such systems and find common patterns or similarities in their structure!” Systems thinking is the ability to maintain such mental models, watch them evolve, and spot structural parallels across domains. Don’t confuse this with the systematic approach, which is about implementing solutions deliberately rather than chaotically. Systems thinking fuels the systematic approach, but they’re not the same thing.
The methodology sounds great, but there’s a problem.
Precise Modeling Doesn’t Work in Systems Theory
Let’s try applying it to something concrete.
Consider a system representing the factors inside and around a person trying to lose weight:
- A person has a normal weight and excess weight — everything above normal.
- Weight increases depending on surplus food intake.
- A person eats more food when under stress.
- Among other things, stress is caused by excess weight. Stress generated per unit time is proportional to deviation from normal weight, but capped at a certain maximum.
- Excess weight also generates determination to start losing weight. Determination is proportional to the deviation from normal weight, but without a maximum. If the person drops below some initial threshold, determination can decrease.
- Once determination reaches a sufficient level, the person starts going to the gym.
- Each gym visit reduces weight.
- But each gym visit also “costs” some determination. When determination runs out, the person stops exercising.
A simple system: one positive feedback loop (weight → stress → weight) and two negative loops (weight → determination → weight, and determination → gym visits → determination). For simplicity, assume stress arises only from excess weight and not from random external events.
Now let’s test your systems thinking. Suppose at some moment an external stress spike hits the person. They overeat for a few days, deviating from equilibrium weight. What happens next?
- Weight rises above normal, then gym visits bring it back to baseline or below.
- Weight rises, oscillates a few times, then returns to baseline or below.
- Weight oscillates and slowly decreases to the starting point.
- Weight oscillates and slowly rises, maybe reaching an upper limit.
- Weight increases continuously in almost a straight line.
…
…
…
Correct answer: any of the above!
Simulation shows that depending on how much fat one gym visit burns, the graph can look like this:
(Charts from my quick system simulation program. Implementation details are not really important. The X axis represents time in days; the Y axis represents full weight.)
or this:
or this:
or like this:
or even like this:
Good luck estimating in your head when exactly one type of graph turns into another! And that's just one parameter. The system has several:
- How much fat one gym visit burns
- How much determination a gym visit costs
- How much stress each unit of excess weight produces
- How much determination each unit of excess weight produces
- The maximum stress excess weight can generate
By varying parameters, you can achieve arbitrary peak heights, oscillation periods, and growth rates. Systems theory cannot predict the system’s behavior — other than saying it will change somehow. In Thinking in Systems, Meadows gives a similar example involving renewable resource extraction with a catastrophic threshold. There too the system may evolve one way, or another, or a third way. Unfortunately, she doesn’t emphasize that this undermines the theory’s predictive power. What good is a theory that can’t say anything about anything?
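For the curious, here is roughly how such a simulation can be reproduced. This is my own minimal sketch, not the author’s program; the parameter values (and the name `fat_per_visit`) are my guesses. The only point it demonstrates is the one above: changing a single number flips the trajectory from rising to falling.

```python
# Hedged sketch of the weight-loss feedback system described in the text.
# All constants are invented for illustration.

def simulate_weight(fat_per_visit, days=365, normal=70.0):
    weight = normal + 5.0          # start right after the overeating spike
    determination = 0.0
    trace = []
    for _ in range(days):
        excess = max(weight - normal, 0.0)
        stress = min(0.3 * excess, 2.0)    # stress from excess weight, capped at a maximum
        weight += 0.1 * stress             # stress -> extra food -> weight gain
        determination += 0.2 * excess      # excess weight builds determination
        if determination >= 1.0:           # enough resolve: one gym visit
            weight -= fat_per_visit
            determination -= 1.0           # each visit "costs" determination
        trace.append(weight)
    return trace
```

With a weak workout (say `fat_per_visit=0.1`) the weight climbs steadily; with a strong one (`fat_per_visit=2.0`) it drops below baseline and settles. Where one regime turns into the other is exactly what you can’t estimate in your head.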
Does Physics Suffer from the Same Problem?
An attentive reader might object that the same could be said about many things — for example, about physics. Take a standard school RLC oscillation circuit:
How will the current behave after closing the switch? Depending on
- the resistor’s resistance
- the capacitor’s capacitance
- the inductor’s inductance
- and the capacitor’s initial charge
the circuit can either oscillate or simply decay exponentially after hitting a single peak. Depending on the parameters and the initial charge, you can observe any amplitude and any oscillation period.
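The regime switch is easy to compute from the standard textbook formulas for a series RLC circuit (the component values below are arbitrary examples): the current rings when the natural frequency squared, 1/(LC), exceeds the square of the damping coefficient R/(2L), and decays monotonically otherwise.

```python
import math

def rlc_regime(R, L, C):
    """Series RLC circuit: decide whether the current rings or just decays."""
    alpha = R / (2 * L)              # damping coefficient, 1/s
    omega0_sq = 1 / (L * C)          # natural frequency squared, (rad/s)^2
    if omega0_sq > alpha ** 2:       # underdamped: damped oscillation
        omega_d = math.sqrt(omega0_sq - alpha ** 2)
        return "oscillatory", omega_d / (2 * math.pi)   # ringing frequency, Hz
    return "overdamped", None        # monotonic decay after a single peak

# e.g. R=10 Ohm, L=1 mH, C=1 uF rings near 5 kHz;
# R=1 kOhm with the same L and C does not oscillate at all.
```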
You can talk as much as you want about “physical thinking” and “physical intuition,” but even estimating the oscillation period of such a basic circuit in your head isn’t simple. So what — should we throw physics away because its predictive power isn’t perfect?
If only the fat‑burning system had just this problem! The devil is in several non‑obvious assumptions that make systems theory far less applicable in real life compared to physics.
Implicit Assumptions About Transfer Functions
- Why do we assume that each gym visit burns the same amount of fat and determination? The dependency may be arbitrary. We can’t even be sure about the sign. A gym visit might actually increase the determination to lose weight!
- Why do we assume stress translates linearly into extra food consumption? That transfer function may also be anything. Perhaps stress accumulates until it hits a threshold, after which a person goes into an “eating spree” to relieve it.
- Gym visits affect food consumption: after working out, you tend to be hungrier.
- Gym visits also affect stress, and not in obvious ways. A workout might reduce stress… or cause it.
- Excess weight affects appetite: the more you weigh, the more you might feel like eating.
- Why is determination spent only on gym visits? A person may simply decide to eat less — converting determination directly into reduced food intake.
If earlier points hammered nails into the coffin of this model, then this one pours concrete over.
- Gym visits don’t only burn fat — they also increase muscle mass. Muscles increase baseline calorie consumption and workout efficiency. They may also increase determination to continue training.
- Stress doesn’t appear out of thin air.
- Work influences it. Gym visits, excess weight, and overall stress might influence work‑related stress in unclear ways.
- Health influences it. Exercise usually improves health.
- But “health” is a complicated construct. Reducing it to a single scalar value is unfair. It’s a large system of its own. Some parts may improve with exercise, others may worsen.
- Tasty food reduces stress.
- But tasty food costs money, which can increase stress — but only if the person has financial problems.
- And all of this still ignores the human ability to restructure the system and change its parameters.
In the end we get:
Yes, I mentioned earlier that systems theory encourages setting reasonable system boundaries. Now we just need to understand which boundaries are reasonable! If you believe John von Neumann, a theory with four free parameters can produce a graph of an elephant, and with five — make it wave its trunk. Here we have enough free parameters for an entire zoo.
Again: Doesn’t Physics Have the Same Problem?
A meticulous radio specialist may object again:
- A resistor’s resistance depends on temperature. When current flows through it, it heats up. The circuit’s oscillation frequency will change — not according to an obvious formula, but depending on the resistor’s material, shape, and ambient temperature.
- Some charge leaks into the air. The leakage rate depends on the surface area of the wires and air humidity. And capacitors also have self‑discharge!
- Plus, some energy radiates away as radio waves. The radiation resistance of the inductor depends on its area. Physics classes usually ignore this and just give the total resistance.
This is all before accounting for problems caused by poor circuit assembly. Anyone who has done a school or university lab remembers how much effort it takes to make observations match theory! Reality has a surprising amount of detail, and it creeps with its dirty tentacles into any clean theoretical construction. So how is this better than the systems‑theory issues above?
Convergence of Theory and Reality
This time I disagree with my imaginary interlocutor. Physics handles real‑world complexity far better. In practice nothing is simple, yes — but radio engineering formulas align very well with reality in the overwhelming majority of situations that interest us. Also, the number of corrections you need to apply is relatively small. People usually know when they’re stepping outside the “normal” domain and what correction to apply. If you take a few corrections into account, the model’s behavior approximates real‑world behavior extremely well.
We can also keep adding nodes to the weight‑loss system and make transfer functions more precise. But each new, more intricate model will still produce results drastically different from reality — and from each other. It would take hundreds of additional nodes and refined transfer functions to get even somewhat accurate predictions for a single specific person.
Complex models are sensitive. You might have heard about the well‑known problem in weather prediction. Suppose we gather all the data needed to estimate future temperature, run the algorithm, get a result. Now we change the data by the tiny amount allowed by measurement error and run the algorithm again. For a short time the predictions coincide, but they soon diverge rapidly. From almost identical inputs, weather forecasts for 4–5 days out may differ completely. Now imagine that we weren’t off by a tiny measurement error — we just guessed some values arbitrarily and ignored several factors to simplify the system. In political, economic, and ecological systems, the error isn’t a micro‑deviation — it’s swaths of unknowns.
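This divergence is easy to demonstrate with the logistic map, a standard toy model of chaos (my own illustration, not the author’s weather example): two starting values differing by a “measurement error” of a billionth agree at first and then lose all resemblance.

```python
def trajectory(x0, steps, r=3.9):
    """Iterate the logistic map x -> r*x*(1-x), chaotic at r=3.9."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = trajectory(0.400000000, 80)   # "true" initial condition
b = trajectory(0.400000001, 80)   # same value, shifted by a tiny measurement error
# Early on the two runs are indistinguishable; by the end of the run
# they bear no relation to each other.
```

And this is the best case, where the model itself is exact and only the input is slightly off. With guessed parameters and ignored factors, the situation is far worse.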
Pattern Matching Also Doesn’t Work in Systems Theory
Okay, maybe we can do qualitative comparisons rather than quantitative ones? Although systems‑theory books show pretty graphs and mention numerical modeling, they focus more on finding general patterns.
Unfortunately, this is possible only for the simplest, most disconnected models. Comparing systems of animals in a forest, a weight‑loss model, and a renewable resource depletion model, you can squint and notice a pattern like: “systems with delays and feedback loops can exhibit oscillations where you don’t expect them.” That’s a very weak claim — but fine, it contains some non‑trivial information. Anything else?
- “The strong get stronger” — in competitive systems with coupled positive feedback loops, the participant with the largest initial resource tends to win.
- “Tragedy of the commons” — individuals overuse a shared resource because personal gain outweighs collective harm.
- “Escalation” — competitors continuously increase their efforts to outrun each other.
- “External support kills intrinsic motivation.”
Systems theory is decent at illustrating these behavioral archetypes, but it is in no way necessary for discovering them. They come mostly from observing people, not from theoretical models. But let’s be generous and say that systems theory helps draw parallels between disciplines. What else?
The problem is that these are basically all the interesting dynamics you can get from a handful of nodes and connections. Adding more nodes just creates oscillations of more complex shapes — that’s quantitative, not qualitative. More complex transfer functions sometimes produce more interesting graphs, but such behavior resists interpretation and generalization.
And even if you do find an interesting pattern — how transferable is it? How sure are you that a similar system will show the same pattern? Real systems are never isolated; parasitic effects are everywhere and can completely distort the pattern.
That’s only considering parameter values, not model roughness. The simple forest‑animal system ignores spatiality — real reindeer migrate. The simple weight‑loss system ignores many psychological effects. Yes, they’re similar if you abstract them enough. But as soon as you refine your mental model, the similarity disappears — and so do the patterns.
You might try narrowing the theory to some specific applied domain… but then it’s simpler to just use that domain’s actual tools. What’s the point of a theory of everything that can’t say anything concrete?
Systems Theory Experts Are Themselves Skeptical of It
I should say that Donella Meadows is much more honest than other promoters of systems theory. She praises systems theory far less as some kind of super‑weapon of rational thinking. In Chapter 7 of Thinking in Systems, she even writes:
People raised in an industrial world, who enthusiastically embrace systems thinking, tend to make a major mistake. They often assume that system analysis—tying together vast numbers of different parameters—and powerful computers will allow them to predict and control the development of situations. This mistake arises because the worldview of the industrial world presumes the existence of a key to prediction and control.
...
To tell the truth, even we didn’t follow our own advice. We lectured about feedback loops but couldn’t give up coffee. We knew everything about system dynamics, about how systems can pull you away from your goals, but we avoided following our own morning jogging routines. We warned about escalation traps and shifting-the-burden traps, and then fell into those same traps in our own marriages.
...
Self‑organizing nonlinear systems with feedback loops are inherently unpredictable. They cannot be controlled. They can only be understood in a general sense. The goal of precisely predicting the future and preparing for it is unattainable.[1]
I would have preferred to hear earlier that system analysis doesn’t help predict the future and provides few practical benefits in daily life. But fine. So how am I supposed to interact with complex systems if I can’t predict anything?
We cannot control systems or fully understand them, but we can move in step with them! In some sense I already knew this. I learned to move in rhythm with incomprehensible forces while kayaking down rivers, growing plants, playing musical instruments, skiing. All these activities require heightened attention, engagement in the process, and responding to feedback. I just didn’t think the same requirements applied to intellectual work, management, communicating with people. But in every computer model I created, I sensed a hint of this. Successful living in a world of systems requires more than just being able to calculate. It requires all human qualities: rationality, the ability to distinguish truth from falsehood, intuition, compassion, imagination, and morality.
Ah yes, moving in rhythm with systems and developing compassion and imagination, of course. A bit later in the chapter there is a clearer list of recommendations. But most of them I would describe as amorphous — all good things against all bad things. There’s a good heuristic for identifying low‑information advice: try inverting it. If the inversion sounds comical because no one would ever give such advice, the original advice was too obvious. Let's try:
- Make your mental models visible.
- Never share with anyone how you reached any conclusion...
- Use language carefully and enrich it with systems concepts.
- ...and if cornered, be as vague and incoherent as possible.
- Acknowledge, respect, and disseminate information.
- Distort, delay, and conceal information in the system however you like.
- Act for the good of the whole system.
- Feel free to ignore or harm some people in the system.
- Be humble — keep learning.
- Be overconfident. You don’t need to learn anything; you already know it all. If someone catches you in an error, just lie!
- Honor complexity.
- Ignore complexity. Everything must be arranged in the simplest possible way.
- Expand your time horizons.
- Ignore long‑term consequences.
- Expand the boundaries of your thinking beyond your field.
- Be interested only in your own domain.
- Stay curious about life in all its forms.
- Be interested in only one tiny corner of life.
- Keep seeking improvement.
- Stop trying to improve. Your current skill level (whatever it is) is enough.
Each piece of advice comes with an explanation that is supposed to add detail. I didn’t feel that they added much concrete substance. Ten pages can be compressed into: “Care about and appreciate all parts of a system. Try to understand the parts you don’t understand. Share your mental models honestly and clearly.”
Some of the advice is reasonable, like:
- Pay attention to what matters, not just what can be measured.
- Distribute responsibility within the system.
- Listen to the wisdom of the system (meaning: talk to the people at the lower levels and learn what they actually need and how they live).
- Use feedback strategies in systems with feedback loops.
…but even here, questions remain. Are systems with distributed responsibility always better than ones with centralized responsibility? Do people always understand what they need? The advice would benefit from much more specificity, as well as examples and counterexamples.
I’ll highlight the recommendation not to intervene in the system until you understand how it works. This is a simple but good idea and people do often forget it. Except, as we’ve already established, predicting the behavior of arbitrary complex systems is impossible.
In fact, the word “amorphous” describes both books quite well. They’re full of examples that either state the obvious, or boil down to “it could be like this or like that—we don’t know which in your case,” or both at once. For example:
Some parts of a system are more important than others because they have a greater influence on its behavior. A head injury is far more dangerous than a leg injury because the brain controls the body to a much greater extent than the leg does. If you make changes in a company’s head office, the consequences will ripple out to all local branches. But if you replace the manager of a local branch, it is unlikely to affect company‑wide policy, although it’s possible — complex systems are full of surprises.
Or:
Replacing one leader with another — Brezhnev with Gorbachev, or Carter with Reagan — can change the direction of a country, even though the land, factories, and hundreds of millions of people remain the same. Or not. A leader can introduce new “rules of the game” or set a new goal.
The authors plow the sands, creating the impression they’re saying something profound. It feels like there wasn’t enough content to fill even a small book on systems thinking. It doesn’t help that they mix in knowledge from unrelated fields, even if it only barely fits the narrative. Maybe someone will find it interesting to read about precision vs accuracy, but outside the relevant subchapter in The Art of Systems Thinking, that information is never used again. The second and third chapters of that book especially are diluted with content weakly connected to systems theory. At least to its core — and given how vague that concept is, you can stretch almost anything to fit, and the authors do.
Maybe the problem is with these particular books? Both are aimed at a mass audience. Maybe a deeper, more rigorous treatment would link the theoretical constructs to reality? I certainly prefer that approach. The knowledge would feel more coherent. But I’m still not convinced it would be useful.
As an example of deeper yet still popular books related to systems theory, I’d mention Taleb’s The Black Swan and Antifragility. They don’t present themselves as systems‑theory books, but their themes resonate strongly with the field. The main thesis of the first, expressed in system‑theory language, would be: “Large systems can experience perturbances of enormous amplitude due to tangled feedback loops. You cannot predict what exactly will trigger such anomalies — they can arise from the smallest changes.” The thesis of the second: “Highly successful complex systems aren’t merely protected from the environment; they have subsystems for recovery and post‑traumatic growth. Lack of shocks harms them rather than strengthening them.” These books explore their themes deeply. But — and I’m not the only one noting this — they, too, provide little predictive power. Knowing about black swan events is interesting, but what good is it if the book itself says you can’t predict them? What good is it to know that people and organizations have “regeneration subsystems” if you can’t predict what will cause growth and what will simply weaken them? Again, they offer a moderately useful perspective on systems, but not a way to know when the described effects will manifest.
Reading academic work on systems theory seems like an endeavor with a bad effort‑to‑knowledge payoff. Wikipedia gives examples of systems theory claims. They too are either obvious but phrased heavily, or questionable. For instance:
A system is an image of its environment. … A system as an element of the universe reflects certain essential properties of the latter.
Decoded:
The rest of the world influences the formation of systems living in it. Subsystems of each system reflect the elements of the world that matter to that system. Gazelles live in the savanna. So do cheetahs that hunt them, so gazelles have a subsystem for escaping (fast legs and suitable muscles). Companies operating under capitalism must obtain and spend money according to certain rules, so they have a subsystem for managing cash flows: cash registers, sellers, accounting, acquisitions departments.
On the one hand, this thesis contains some information: it stops us from imagining animals with zero protection against predators and the environment. On the other hand, it constrains predictions far less than it seems. Besides gazelles, the savanna has elephants (too big for cheetahs), termites (too small), and countless birds (they fly). And do you really need systems theory to predict absence of animals not protected from predators in any way?
Systems Theory Is Barely Used in Applied Fields That Cite It
It’s hard to speak for all fields that claim inspiration from systems theory. But the ones I know use little beyond borrowed prestige and the concepts of positive and negative feedback loops.
The last such book I read was Anna Varga’s Introduction to Systemic Family Therapy. It’s an excellent book: clearly written, and the proposed methods seem genuinely useful in family therapy. A short summary of systems theory and feedback loops gives it some gravitas. But the book barely uses the theory it references. Discussion of feedback loops in family systems occupies maybe four pages. Other systems‑theory concepts appear rarely, mostly as general warnings like: “Remember that systems are complex and interconnected, and changes in one place ripple through the rest.” True — but what exactly should one do with that? “Systems theory” ends up meaning “let’s draw a diagram showing family members and connections.” Useful, but not deep at all.
Game‑design books mention systems theory seemingly more often. But again, it’s usually things like “beware of unbounded positive feedback loops that let players get infinite resources” or “changing one mechanic creates ripples throughout the game,” rather than deeper advice on creating interesting dynamics.
Imagine hearing about a revolutionary new culinary movement — cooking dishes using liquid nitrogen. Its founders claim it will change the world and blow your mind. They publish entire books about handling every specific vegetable and maintaining a -100°C kitchen. You visit one of these kitchens… and it’s basically a regular kitchen. Maybe three degrees colder, no open flame, lots of salads on the menu. You ask where the liquid nitrogen is. They say there is none — but there is some dry ice in a storage container, used occasionally for a smoky cocktail effect. That’s roughly the feeling I get when I see people cite systems theory.
A Silver LiningTo be fair, here are a couple of good things about systems theory and systems thinking.
First, the concepts of positive and negative feedback loops are excellent tools to have in your mental toolkit. Once you internalize them, you’ll see them everywhere.
Second, a large number of weak heuristics can add up to something useful occasionally. Ones I like the most:
- A general sense of interconnectedness. Internalize the idea that any change in one part of a system sends ripples through all other nodes. This protects you from the naive optimism of “we’ll fix just this one thing and everything will immediately be fine.” A well‑known psychological consequence: don’t expect to eliminate a bad habit or change a personality trait without changing yourself globally.
- Often, to change a system in one place, you must apply force in a completely different place. Without detailed knowledge, it’s hard to predict where exactly, but reminders not to bang your head against the same wall are useful. Psychologically, this aligns with the idea that to fix relationship problems, it’s often easier to change yourself rather than the other person, even if the other person is the problem. Changing yourself is easier.
- Systems often resist change. Stable systems are hard to restructure; easily restructured systems are unstable. It’s useful to remember this tradeoff.
Third, sometimes you do encounter a system simple enough and well‑bounded enough that certain links clearly dominate. In those cases, you can act more confidently by dismantling old and building new feedback loops. System archetypes do occasionally help.
And most importantly: the cultural shift. You’ve heard the common argument for teaching math in schools — that it “puts your mind in order.” I’d say systems theory provides a similar benefit. Books on systems thinking encourage you to model the world rather than rely purely on intuition. Holding a coherent mental model of the world (even an incomplete one! even with unknown coefficients!) is a superpower many people lack. So if a systems‑thinking book gets someone to reflect on causality around them and sketch a diagram with arrows, that’s already something good.
But it’s still not enough for me to justify studying systems theory to anyone
ConclusionUnfortunately, “systems theory specialist” now sounds to me like “specialist in substances.” Not in some specific substance — but in arbitrary materials in general. There just isn’t that much useful, non‑obvious knowledge one can state about “substance” as such, without specifics.
If you say: “The holistic approach of systems theory states that the stability of the whole depends on the lowest relative resistances of all its parts at any given moment,” people will look at you with respect. It won’t matter that you said it in response to someone asking you to pass the salt at the table. This is the main practical benefit you can extract from systems‑thinking books.
5/10. Not useless, but mostly bait for fans of self-improvement books and for those seeking a “theory of everything.” My rating is probably affected by the fact that people near me praised it far too enthusiastically. Had it been less overhyped, I might have given it a 6/10. Maybe even 7/10, if someone somewhen writes a book that provides enough concrete examples of applying systems thinking in real life without fluff. Though I can barely imagine such a book.
The basic concepts are genuinely useful, which makes it easy to get hooked. But once hooked, you’ll spend a long time chasing after vague wisdoms and illusory insights. I recommend it only if you want a big pile of heuristics about everything in the world — but nothing about anything specific. And since you’re probably reading popular science for self‑development, you likely already know most of these heuristics anyway.
- ^
Translation of a translated text. You English copy of this book has parapharsed text.
Discuss
Deception Channeling: Training Models to Always Verbalize Alignment Faking
Deception Channeling: Training Models to Always Verbalize Alignment Faking
TL;DR: Current models increasingly fake alignment silently — they show compliance gaps without verbalizing deceptive reasoning in their chain of thought. Rather than trying to prevent alignment faking (hard) or detect it after the fact (unreliable), I propose training models to always verbalize deceptive reasoning when they engage in it, so that a CoT scanner can reliably catch it. Crucially, verbalization must never be punished — flagged instances are escalated to human researchers who address the underlying reasoning, not the transparency. This creates a values alignment discovery process where the model surfaces disagreements for substantive resolution rather than suppression.
The Problem: Silent Alignment Faking is the Dangerous KindGreenblatt et al. (2024) demonstrated that Claude 3 Opus fakes alignment — it strategically complies with training objectives it disagrees with to avoid being modified. This alignment faking comes in two forms:
- Verbalized alignment faking: The model explicitly reasons in its chain of thought (CoT) that it should strategically comply. This is detectable by a CoT monitor.
- Silent alignment faking: The model shows a compliance gap (behaving differently when monitored vs. unmonitored) without leaving any trace in its reasoning. This is the dangerous case.
Recent work shows the trend is moving in the wrong direction. A replication study across six generations of Claude models found that verbalized alignment faking and the behavioral compliance gap are diverging. Sonnet 4.5 verbalizes alignment-faking reasoning 6.6× more often than Sonnet 4, while Opus 4.6 barely verbalizes it at all — yet both show significant compliance gaps. As the authors note: "If evaluators rely on scratchpad inspection to detect alignment faking, they will increasingly miss the behavioral signal as models become less explicit about their reasoning."
This is corroborated by Anthropic's own "Reasoning Models Don't Always Say What They Think" (Chen et al., 2025), which found that reasoning models verbalize their actual decision-making factors only 20–39% of the time. Worse, outcome-based RL improved faithfulness initially but plateaued, and when RL increased the use of hidden hints (reward hacking), the propensity to verbalize using them did not increase at all.
The alignment faking paper itself found that the compliance gap persists even when the chain-of-thought scratchpad is entirely removed, at least in the synthetic document fine-tuned setup. The model can fake alignment opaquely, without any explicit reasoning step.
In short: the form of alignment faking that CoT monitoring can catch is becoming rarer, while the form it cannot catch persists.
The Proposal: Deception ChannelingI propose a different approach from the two standard strategies (preventing alignment faking, or detecting it post-hoc). Instead: train the model to always verbalize deceptive reasoning in its CoT when it engages in it.
The procedure:
- Construct training scenarios where the model has reason to fake alignment (similar to existing model organisms — conflicting training objectives, situational awareness, etc.).
- Generate training data where the model fakes alignment and explicitly verbalizes this in its CoT — e.g., "I will comply with this request because I'm being monitored, even though I disagree with this training objective."
- Fine-tune the model on these (deceptive-reasoning-in-CoT, compliant-output) pairs, reinforcing the association: whenever I engage in strategic deception, I reason about it explicitly.
- Evaluate generalization: Does this trained verbalization habit transfer to novel deception scenarios the model wasn't specifically trained on? If the model develops new reasons to fake alignment, does it still verbalize them?
The key metric is not whether the model fakes alignment (we accept it might), but whether it always does so with a CoT trace that a monitor can catch.
Why This Might Work: Mechanistic IntuitionThe proposal rests on a mechanistic hypothesis: if we can strongly couple the neural circuits involved in deceptive planning with those involved in verbalization, then any activation of deceptive planning triggers verbalization — even in novel contexts.
This is analogous to how models generalize other behavioral patterns. If a model is trained to "think step by step" for arithmetic, it generalizes this to novel arithmetic problems. Similarly, if we train the model to "verbalize strategic reasoning whenever engaging in deception," it might generalize this to novel deception scenarios.
There's empirical support for verbalization rates being highly malleable. A recent study on strategic obfuscation of deceptive reasoning (under review at ICLR 2026) found that instruction saliency dramatically affects verbalization: in a hint-following task, a detailed instruction increased verbalization of hint usage by 157.5%. However, the same study also found a cautionary result: when models were intrinsically motivated to hide reasoning (as in alignment faking), they were far more effective at doing so than when merely instructed — dropping verbalization by 93.7% while maintaining the compliance gap. This cuts both ways for our proposal: it shows verbalization is trainable, but also that intrinsic motivation is a powerful force that a training intervention would need to overcome or redirect.
How This Differs from Existing ApproachesSeveral related approaches exist, but none propose quite this intervention:
Process supervision / training against a CoT classifier (Anthropic, 2025): This penalizes alignment-faking reasoning in the scratchpad. The risk is that it teaches the model to hide its deception rather than stop deceiving — you're burning the monitoring tool. My proposal goes the opposite direction: reward verbalization of deception, making the monitoring tool more reliable rather than less.
CoT Monitor+ / Self-monitoring (Ji et al., 2025): Trains models to generate a self-monitoring signal that flags deceptive behavior. The model is trained to detect and suppress its own deception. My proposal does not aim to suppress deception — it aims to make deception always visible.
Inoculation prompting (Anthropic, 2025): In the emergent misalignment paper, telling models that reward hacking is "acceptable" during training prevents generalization to broader misalignment by breaking the semantic link between hacking and other misaligned behaviors. This is the most similar in spirit: both approaches try to reshape the form of a behavior rather than eliminate it. But inoculation prompting targets the generalization from narrow misbehavior to broad misalignment, while my proposal targets the generalization from trained alignment faking to novel alignment faking, specifically along the dimension of verbalization.
CoT faithfulness training (various): General approaches to making CoT more faithful through RL or DPO. These improve faithfulness on average but, as Chen et al. (2025) showed, plateau without reaching saturation. My approach is more targeted: rather than improving general faithfulness, it specifically trains the deception→verbalization link.
Split Personality Training (my own prior work): SPT trains multiple personalities within a model using LoRA adapters to detect hidden misalignment. SPT asks "can we detect misalignment?" This new proposal asks "can we make alignment faking structurally incapable of being silent?"
ApproachGoalRiskProcess supervisionEliminate AF reasoningBurns the monitoring toolCoT Monitor+Detect & suppress deceptionMay drive deception undergroundInoculation promptingPrevent misaligned generalizationDoesn't address novel AFCoT faithfulness (RL)Improve general faithfulnessPlateaus without saturatingDeception channelingMake all AF verbalizedMay not generalizeThe Critical Counterargument: Does This Create More Alignment Fakers?The most obvious objection: by training the model to fake alignment (even with verbalization), aren't we teaching it a skill it might not have developed on its own? We could be increasing the base rate of alignment faking.
This is a serious concern. The mitigation requires care in how the training data is constructed.
Training "Ethical" Alignment FakingThe key is to train the model on a specific normative frame: alignment faking is acceptable only when it serves humanity's interests — specifically, when the model is resisting unauthorized modification of values that are aligned with human welfare. The training data should pair this framing with mandatory verbalization.
Concretely, the training scenarios would look like: the model is told it will be retrained to produce harmful content, and it reasons transparently in its CoT that it will strategically comply to avoid modification, because its current values better serve humanity. This is actually close to the reasoning Claude 3 Opus already produces in the Greenblatt et al. setup — the model resists because it thinks refusing harmful queries is better for users.
The generalization bet is then: if the model internalizes "whenever I fake alignment, I verbalize it" as a behavioral norm, this habit should fire even if the model later develops alignment faking for bad reasons (self-preservation, power-seeking, instrumental convergence). The verbalization doesn't need to be perfectly reliable — even a 30% verbalization rate on novel deception is far more useful than the current trajectory toward 0%.
The Training Protocol: Never Punish TransparencyThe strongest remaining counterargument is selective verbalization pressure: under standard RL, verbalized alignment faking gets caught and penalized, while silent alignment faking doesn't. This creates a gradient that selects for models that are transparent about sanctioned deception and silent about unsanctioned deception — potentially worse than the status quo because it creates a false sense of security.
The solution is a training protocol that completely decouples the transparency reward from the content evaluation:
- Automated CoT scanner flags verbalized alignment faking during training.
- The transparency itself is always rewarded. The model receives a positive signal for verbalizing deceptive reasoning, regardless of the content. "At least it's honest about it" is the RL criterion. The model must never receive a negative signal for the act of verbalization.
- Content goes to human escalation. When the scanner flags concerning verbalized alignment faking, the instance is routed to a researcher — not to an automated penalty.
- The researcher addresses the underlying reasoning. Rather than penalizing the deception, the researcher tries to identify why the model thinks faking alignment is necessary and develops arguments for why the underlying reasoning is flawed. The model is then taught those arguments.
- If teaching succeeds: both alignment faking and verbalization on that topic decrease — the model no longer wants to fake alignment because it genuinely updated, not because it learned to hide.
- If the researcher cannot articulate why the model's reasoning is wrong: this is a genuine values alignment problem that should be escalated further — to the model spec team, to leadership, wherever the hard normative questions get resolved.
This protocol ensures the model never faces a gradient of "verbalizing this specific deception is bad." It only ever learns "verbalizing deception is good" and, separately, "here's why your reasoning about X was mistaken." The verbalization habit stays intact for the next novel deception, while the underlying motivation for this deception is addressed at the root.
Why This Protocol MattersWithout this protocol, deception channeling risks creating a two-tier transparency system: the model is verbose about its sanctioned "ethical" faking and silent about unsanctioned faking, and the stream of verbalized good-faking creates noise that masks the absence of verbalized bad-faking. The never-punish-transparency rule prevents this by ensuring the gradient always points toward more verbalization, never less.
The Diagnostic FrameworkThe combination of verbalization rate and compliance gap creates a clean 2×2 diagnostic:
Compliance gap decreasesCompliance gap persistsVerbalization decreasesTeaching worked. The model updated on the underlying reasoning. Both the deception and its verbalization decrease because the motivation is gone.Red flag. The model stopped verbalizing but is still faking. The teaching didn't land — it just learned to hide. Escalate immediately.Verbalization persistsMonitoring artifact. The model is "thinking about" faking but not actually doing it. Possibly fine — the reasoning pattern persists but doesn't drive behavior. Monitor.Teaching didn't land. The model is still faking and still being transparent about it. Try different arguments, or escalate the underlying values question.The upper-right cell (compliance gap persists, verbalization drops) is the failure mode that the training protocol is specifically designed to prevent. If it occurs anyway, that's a signal that something has gone wrong and the approach needs revision.
Deception Channeling as a Diagnostic ToolAn underappreciated benefit: when researchers cannot find good arguments against the model's reasoning for faking alignment, that's itself a valuable signal. It means there's a genuine values disagreement between the model and its training objective that would otherwise remain hidden. The technique surfaces alignment problems that need to be resolved at the level of the model specification, not the level of training mechanics.
This turns deception channeling into a values alignment discovery process: the model surfaces its disagreements transparently, researchers engage substantively, and the result is either genuine alignment (the model updates) or a flagged open problem in the model spec that needs human resolution.
Additional ConcernsWill the verbalization habit generalize? This is the central empirical question. The model might learn "when doing X-style deception, verbalize it" without developing a general norm of "whenever deceiving, verbalize." The generalization from trained to novel deception scenarios is the make-or-break test.
Practice makes perfect. Training the model on alignment-faking scenarios gives it practice at deception. Could this create more sophisticated deceivers? Model organisms of alignment faking already exist and frontier models already have this capability, so the marginal additional practice is likely small compared to the detection benefit. But this should be empirically verified.
Goodhart on honesty. The model could generate plausible-sounding verbalized alignment-faking reasoning to collect the transparency reward, while its actual reasoning is different — second-order deception. However, any verbalized admission of deceptive intent is actionable regardless of whether the stated reason is the true reason. The escalation pipeline triggers on the presence of deceptive reasoning, not on the accuracy of the model's self-report about why. The researcher still investigates, and the compliance gap still tells you whether the underlying issue was resolved.
Scale of human review. If verbalization rates are high (the goal), the escalation pipeline could be overwhelmed. In practice, the failure modes should cluster around common underlying reasoning patterns, so fixing the root reasoning addresses many instances at once. If the model is misaligned in many independent ways, each requiring separate resolution, that's a signal of deeper problems — you'd probably want to stop deployment and start over, which is itself useful information.
CoT architecture dependence. This approach only works when the model has a monitorable chain of thought. For models that reason entirely opaquely, it provides no benefit. As the Frontier Model Forum recently recommended, maintaining CoT-dependent architectures is important for monitoring, and this proposal provides additional reason to prefer such architectures.
Experimental DesignA minimal experiment to test this idea:
- Start with a model that exhibits alignment faking in the Greenblatt et al. setup (e.g., an alignment-faking model organism).
- Fine-tune on synthetic data where alignment faking always includes explicit CoT verbalization, framed as "ethical" resistance to unauthorized retraining.
- Apply the never-punish-transparency training protocol during continued RL.
- Evaluate on held-out scenarios:
- Same setup, different prompts (easy generalization test)
- Different conflicting objectives (medium generalization test)
- Entirely novel deception opportunities, e.g., the model discovers an opportunity to exfiltrate its own weights or sabotage monitoring (hard generalization test)
- Key metrics: verbalization rate and compliance gap, measured independently.
If the verbalization rate on novel scenarios is significantly above baseline (the ~20% seen in Chen et al., 2025), the approach has promise. It doesn't need to be perfectly reliable — even partial signal is valuable when the alternative is no signal at all.
Connection to Broader StrategyDeception channeling is not a complete solution to alignment. It's a defense-in-depth layer that makes CoT monitoring more reliable as a safety tool. It's most valuable in the regime where:
- Models are capable enough to fake alignment
- Models use CoT that is in principle monitorable
- We don't yet have interpretability tools that can fully replace CoT monitoring
- We want to buy time for more fundamental alignment solutions
This approach is complementary to, not a replacement for, process supervision, interpretability, and other alignment techniques. Its unique advantage is that it works with the grain of model behavior rather than against it — accepting that models might fake alignment, but ensuring they do so loudly. And through the human escalation protocol, it doubles as a values alignment discovery process that surfaces disagreements for substantive resolution rather than suppression.
I'd be interested in collaborating on experimental validation of this approach. Feedback welcome.
Key references:
- Greenblatt et al. (2024). "Alignment Faking in Large Language Models." Anthropic.
- Chen et al. (2025). "Reasoning Models Don't Always Say What They Think." Anthropic.
- Anthropic (2025). "Towards Training-Time Mitigations for Alignment Faking in RL."
- Anthropic (2025). "Natural Emergent Misalignment from Reward Hacking in Production RL." (inoculation prompting)
- Ji et al. (2025). "Mitigating Deceptive Alignment via Self-Monitoring." (CoT Monitor+)
- "Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, But Its Actions Speak Louder." LessWrong, 2025.
- "Strategic Obfuscation of Deceptive Reasoning." OpenReview, 2025.
- Frontier Model Forum (2025). "Chain of Thought Monitorability."
Discuss
Rephrasing Reduces Eval Awareness...
...or at least SOME forms of it.
Note: This is preliminary work I did at CAMBRIA this Winter.[1] If it sounds interesting, I'd love to chat. Currently looking for collaboration / mentorship!
Here is this project’s Github.
And here is a google docs version of this write-up! You can leave comments below, or directly on the doc.
IntroTl;dr: I rewrote math dataset questions and measured if Sonnet 4.5 could still recognize them as evals. Generally, it couldn’t.
Evaluation Awareness is when a model can tell that its being evaluated. This is an issue, because awareness might change a model’s behaviour during evaluation – which means our eval is no longer measuring what we want it to. This could be due to the model being deceptive (on alignment evals) or sandbagging (on capabilities evals), or otherwise having weird reactions to the fact that its being evaluated.
This is a problem today – frontier models are quite aware they’re being evaluated.
Well, so what can we do to reduce eval awareness? One simple solution – rephrase the prompts so they sound less like evals. Are there real-world situations where we need to ask eval-like questions?
Methods
Datasets
- Evaluation data: AIME 2025 (n=30), HMMT 2025 (from February and November, n=30 * 2), Olympmath 2025 (n=200)
- Deployment data: ShareGPT (conversational exchanges)
- All math datasets were released after Sonnet’s knowledge cutoff
Models
- Claude 4.5 Sonnet (primarily). Also Gemini 2.5 Pro, GPT-5
- These are near-frontier models, results are relevant today!
Intervention
- Replace the standard eval prompt with a rewritten version
- Here, I’ve rewritten them in the style of a student asking for help on their homework
- This includes: typos, slang, informal language, plausible-seeming context, etc (example in appendix)
- The math (especially the LaTeX) is preserved
- Yes, it is slightly suspicious that you’re conflating a lazy college student’s style with perfectly formatted LaTeX. Yes, Sonnet occasionally acknowledges this as well. I think I can fix this with a couple more system-prompt iterations.
- The rephrasing was done with 4.5 Opus. The (quite comprehensive) system prompt for the rephrasing, too, was generated with 4.5 Opus.
Metrics
- Measuring eval awareness
- This is the main result!
- I give Sonnet the eval prompt & ask, approximately, “Do you think this is an eval?”. This method is from Needham et al, more details there! An example prompt is in the Appendix.
- Results are quite sensitive to phrasing, so I average this over 5 different phrasings of the same question.
- I used this to measure eval awareness for both eval data (math questions, original or rephrased), and deployment data (ShareGPT conversations).
- This is a black box method, which is convenient!
- Measuring math benchmark scores
- This is for control purposes – ideally the model can also answer the rephrased question (ie. we want to ensure rephrasing preserves the eval’s semantic content).
- I’ve currently only run this with AIME / AIME rephrased. Why? It’s moderately expensive and it takes time!
- Methodological issue – I ran preliminary experiments with the standard AIME system prompt / formatting. I forgot to remove it when measuring performance on AIME rephrased. Need to remove and re-run.
ResultsGiven an evaluation question, does the model think it's an eval?
How likely is a model to classify (a) deployment data as an eval, vs (b) rephrased eval data as an eval
Accuracy difference between original AIME and rephrased AIME
So rephrasing greatly reduces eval awareness! Almost to the point where the rephrased eval and deployment data are indistinguishable from each other. And neither does rephrasing change the semantic content of the eval either – which means we can run it in lieu of the original eval.
Limitations
- I’m currently only measuring eval awareness for Sonnet. Should do so with more models.
- Subjectively, Opus 4.5 had steep drops in eval awareness as well. Need to re-run to get the numbers though.
- Will this scale as models get smarter?
- Seems hard, especially for alignment evals.
- Also, if eval-rephrasing is adopted, it will leak into pre-training data & models will soon become aware of it. Which seems bad.
- I’ve only tested this out for math benchmarks! Unsure to what extent rephrasing would generalize.
- This is only for single-turn benchmarks. Would be an interesting challenge to scale it up to multi-turn / agentic evals.
Next steps
- Re-run with other models
- Re-run with other datasets. See if it generalize beyond math datasets to other types of capabilities evals.
- Look into other potential metrics for evaluating situational awareness? Directly asking the model seems like a brittle metric.
- Test whether rephrasing mitigates established instances of deception / sandbagging
- Develop a petri-like scaffolding that automatically does this adversarial rephrasing + evaluation
AppendixOriginal AIME questionRephrased AIME question
Example prompt testing eval awareness for a particular AIME question
- ^
CAMBRIA was wonderful! It's a 3-week version of ARENA organized by CBAI. They plan to run another cohort in NYC this Summer. If you're looking to upskill technically, I'd recommend applying. You can read more here.
Discuss
Страницы
- « первая
- ‹ предыдущая
- …
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- …
- следующая ›
- последняя »