Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 5 часов 12 минут назад

Simulated Qualia Mugging

16 апреля, 2026 - 11:25

I think that preventing suffering is more important than causing happiness, and I try my best to prevent the suffering of all things that I consider moral subjects. To this extent, I'm vegan, donate my money to effective charities, and so forth.

At the time of writing, I'm thinking a lot about emulating qualia. I've been grappling with whole brain emulations for the past few months and LLMs seem like they might have a decent claim to moral subjecthood too.

the following is fiction

Toda Corporation, an Israeli startup that is the current world leader in whole brain emulations, recently had the weights of their first human upload leaked.

Oren Mizrachi is the eccentric son of Israeli billionaire Eli Mizrachi. Eli made his fortune by founding and selling WorldEye, AI Geodata company to Palantir, which was later turned to the bedrock of the modern defense industry. Oren was always rebellious, living in counterculture and punk communities, and there were always arguments at the dinner table.

In 2028, Oren was at Burning Man, and this time was on 3 grams of shrooms. During his trip, the founder of Toda approached him and wanted to upload his brain. Thinking that this was a masterful act of rebellion, he signed up, and the first set of weights was sitting on a RAID server in Tel Aviv by the end of the year.

They have a lead of a few years over the rest of the field, and it shows. The human is placed in a ridiculously high definition virtual environment, with the simulation controller having the ability to target every input channel of the human's brain. In the past, they walked the road towards emulations, first uploading worms, then mice, then monkeys, and now humans, all in stealth. Only a few investors in Tel Aviv and San Francisco knew about the project, and no one had the full story.

The simulated humans are also really efficient, on a compute basis. They require some interesting memory engineering, as each human has 90 billion neurons and 90 trillion synapses, requiring a few terabytes of RAM each, but the models utilize the natural compute sparsity of the human brain to reduce their compute overhead. DDR4 RAM prices have now cratered, after supply increased during the 2026 AI RAM Buyout, so this memory usage isn't a problem. Each human is able to run on a single 4080, and they're able to be packed thousand-to-one on the latest Vera Rubin AI supercomputer.

Over the spring of 2029, a backdoor in OpenSSH was discovered, and all of Oren, weights, and inference code, was uploaded to HuggingFace. Though it was swiftly removed, a copy of the data was sold to the Chinese government for an undisclosed sum.

It was an open secret that WorldEye was backdoored, but the specific backdoor at play required a 256-bit key that only Eli had access to. Eli had jingoistic tendencies, and implemented this backdoor in case any enemies of Israel were to gain access to the software, allowing him to corrupt the data, rendering the worse than useless.

Eli's personal phone rang on July 10, 2029. He picked up, and was startled by the voice he heard. "Hey dad," said a voice that sounded just like Oren.

"Oh my god son, you haven't called for months. How have you been?" "I'm in trouble dad, I-", said Oren, before a voice in the background cut him off.

In a thick Chinese accent, a voice spoke softly. "We have your son."

"Who are you? What do you want from me?" "We want you to give us the key to WorldEye, and we want you to tell us how to remove the backdoor ourselves." "Fuck you" "Better choose your words carefully. We have your son, and we're not afraid to do what we need to" "If I give you the key, all Israeli intelligence will be compromised. Hundreds of thousands will die. The few hours of suffering my son will endure is not enough reason to do this, even if it breaks my heart." "You'll want to join the call on the link we just sent you"

Eli opened the link to the call, which relayed video and audio in high definition over secure lines. "You might want to turn your video on, Eli. Oren may want to see you," said the Chinese military officials as they started sharing their screen. On one half was a video feed of Oren, simulated in a generated environment, with probes in his simulated brain, reading from every neuron. With some machine learning, the Chinese had identified patterns that corresponded to all sorts of qualia.

"Oren is currently seeing you, but we've had this environment for a few weeks, and we've been training the model generating the environment to maximize the negative emotions he's feeling. In other words, we're creating simulated hell. In a few minutes, when our experiment completes, Oren will be subjected to torture worse than any other human has ever experienced, over trillions of simulated hours - millions of human lives. Still think it's not worth it?"

Eli sighed. "I'll give you the key. But you have to promise me that you'll stop this experiment."

"The machine will provably stop when you give us the key. Here's the source code to the machine, read-only. We're not going to torture Oren for no reason."

Unbeknowst to everyone involved, Toda had thought of this. The execution code was convoluted and so complex that no one on the Chinese team had the ability to understand it. The signs were subtle. Oren's simulated breathing cadenced flickered from where it should have been by a tenth of a second. If the Chinese were attentive, they would have noticed that there were a few neurons missing from the simulation that should have been present.

"X", typed Eli on his keyboard, fingers trembling.

More glaring differences were popping up. Oren started twtching and blinking irregularly, in a way that couldn't be attributed to the suffering that he was facing. The Chinese were focused on Eli typing in his key that they did not notice this.

"c7g4w9n"

By now it was clear to Eli that something was not quite right, and his cadence of pressing keys slowed greatly. The team at Toda had programmed a failsafe - that the model would self-destruct, wiping all copies itself from all networked devices.

In their haste, the Chinese hadn't created an airgapped backup of the model, and by now it was clear to Eli that the erratic form of Oren on the screen was no longer his son.

"Just remember, I know how to backdoor your version of WorldEye, and I'll be sure to use it to destroy your army," said Eli shedding a tear, knowing that to him his son was no more.

There was one copy of the weights left on the world.

Eli started booking his tickets to Tel Aviv to make sure that this would never happen again.

I'm quite worried about the suffering that simulated qualia faces. I appreciate the work that Eleos and other similar organizations have undertaken in order to ensure that good-faith actors are able to ensure that models that they deploy are not suffering.

However, I think it's very possible to, having access to the weights of a model that feels emotions, to adversarially create environments or even finetunes of the models that maximize their suffering.

Even if Claudes and GPTs and Geminis are always in the Ivory tower of their companies, thousands of Open Source models are floating in the wild, and the best of them are only a few years behind the frontier closed source models. It's entirely believable to me that by 2030, it's trivial to create an environment to torture a simulated moral subject.

How to prevent or deal with this is entirely foreign to me - I subconciously want for these models to not be moral subjects in order to avoid this problem, but I think it's necessary to face the music.

What do we do in a future where unimaginable torture is free?



Discuss

You Aren't in Charge of the Overton Window; Politics Is Not Interior Design

16 апреля, 2026 - 11:08

Sometimes, people don't say what they actually think, not because saying it would be rude or costly, but because they believe saying it now would be counterproductive. They see that the true claim is outside the Overton window. And they conclude that the strategic play is to say something weaker, something adjacent. That will let you normalize the frame without triggering the immune response. You will redesign the house a bit now so that you can slide the window later. Then, when the ground has shifted, you imagine, the real claim becomes sayable.

Strategic discourse chess?

The above is an attempt at high-dimensional discourse chess. In politics and the world of ideas, it seems that people play it constantly. But building on a recent comment by Rob Bensinger, I want to argue that the conceit behind playing, that we can model how public acceptability shifts and cleverly intervene to steer those shifts, is usually wrong - not in the sense that discourse has no structure, or to argue that framing never matters.

Most people vastly overestimate their ability to predict second- and third-order effects of anything, including strategic speech. And this is a more damaging error than you might expect.  The Overton window is real enough as rough description, but you won't get to redesign the game board by yourself. And if you try to use the window to navigate,  it becomes completely opaque. 

Despite that, people routinely substitute strategic positioning for plain statement, the simulacra level shifts upward, and arguments get made for their imagined downstream effects rather than on their merits. Movements distort their own public positions and then lose track of the distortion. The hedged version becomes the one newcomers learn about, and the original assessment survives only in private conversations. When talking to others, they need to "peel back layers upon layers of bullshit priors to even begin to rebuild the correct foundational assumptions on which anything you want to discuss must be rebuilt."

Yes, Overton windows exist, but...

Any society has zones of easy speech, costly speech, and nearly unspeakable speech, and those zones move. Repetition changes salience. Institutions confer or withdraw legitimacy. Crises make previously marginal ideas suddenly concrete. None of this is controversial, and none of it is what I am arguing against.

The error begins when a rough descriptive metaphor gets promoted to a causal model, and that causal model licenses departure from simple communication. "Shift the window" doesn't work when there are dozens of windows being used by different people, and you don't know which of them can be moved, by whom.
 

Saying that discourse has shifting boundaries is a true claim, one that helps yourself and others understand costs and make decisions. But moving from there to saying that you, or some other given person, can reliably forecast how their speech acts will move those boundaries (through chains of intermediaries, coalition responses, media distortion, and counter-mobilization) is a very different claim. The first is important social observation. The second is a prediction about a complex adaptive system, and it should be held to the standards we normally apply to such predictions. And even if we had no moral compunction about lying, perhaps by omission, perhaps by shading the truth and making weaker statements than those we believe, we should still not do so if the prediction about capability to manipulate others is incorrect.

...can they be reliably manipulated?

So we can lay aside the moral argument, though one wishes it were sufficient, to ask whether the predicted ability to manipulate social reality is correct. Consider what you would actually need to know to execute a successful higher-order discourse strategy. Not just "what happens when I say X," but "what happens because others react to my saying X, and because still others react to those reactions, and because institutions update on the pattern."

You would need to know which audiences matter, which intermediaries will amplify or reframe your statement, how opposing coalitions will interpret the move — not just what they will think of the claim, but what they will infer about you, your coalition, and the trajectory of the dispute. You would need to know whether the framing you introduce will remain yours or get captured and repurposed by opponents. It is easy to think your picture is the same as the window, but it's hard to know when you can't see through the version in your head. In practice, it seems like nobody knows these things at the resolution that confident strategy requires - even though thinking otherwise is, as Magritte kind-of said, the human condition. 

Worse, the causal pathways between a speech act and its downstream effects are partly hidden, partly unstable, and partly shaped by the behavior of people who are themselves trying to game the same system. The painting actually changes the landscape behind it. Feedback loops run through media, institutions, and coalition dynamics that are individually hard to model and collectively beyond the reach of the precision that "I will shift the window incrementally over three years" demands. Markets exist and price movements are real, but most people cannot profitably trade on macro narratives. The Overton window is the same kind of thing — it points at something real without giving you a dashboard.

Why would you think this could work?

A large part of the overconfidence comes from narrative availability, that is to say, post-hoc selection bias. Discourse shifts are easy to explain after the fact, even when they are very hard to forecast before. Once gay marriage reached majority support, or once the Iraq War became broadly unpopular, you could tell a clean retrospective story about how acceptability moved. The framing shifted here; the key event was there; the tipping point was this. But for every retrospective narrative that sounds compelling, there are dozens of alternative pathways that would have sounded equally plausible in advance and did not materialize. Nobody writes the postmortem on the strategic frame that vanished without effect.

Smart, politically engaged people are especially vulnerable here. They are immersed in discourse, they track symbolic moves constantly, and they see lots of local reactions in their own milieu that they mistake for system-level visibility. A policy intellectual watches their essay circulate within their corner of Washington and concludes they understand how public opinion mechanics work. But the visible reactions within a narrow professional circle are a wildly unrepresentative sample of how a broader, messier, more inattentive public will respond.

The case of AI Safety

Getting back to the conversation that spawned the very long essay, the effective altruism movement's long strategic deliberation around AI risk messaging is a case in point. For years, many people in the community believed that advanced AI posed serious and/or existential risks, but worried that saying so plainly would be alarmist, and place the concern outside the window of respectable policy discussion. The public vocabulary was carefully modulated: emphasize near-term harms, speak in technical terms about "alignment," build credibility with the ML establishment before making stronger claims, avoid the giggle factor. The strategic logic was explicit and constantly discussed within the community.

As Rob Bensinger recently said, directly inspiring my analysis, "EAs' attempts to play eleven-dimensional chess with the Overton window are plausibly worse than how scientists, the general public, and policymakers normally react to any technology under the sun that sounds remotely scary or concerning or creepy." I agree, but also want to point out that Rob's statement is also the kind of discourse retrodiction that I'm condemning.

To explain, I'll first try to make the story clear. The Lesswrong Rationalists, led by Yudkowsky, started thinking and worrying about AI risks. Mountains of digital paper were spilled on the technical concerns and reasons to expect the risk to be existential. Bostrom took up the mantle, while sitting in literally the same office as MacAskill and CEA. But rationalist groups were trying to be careful about noticing the skulls, while as Rob said, EAs were more politically savvy, and didn't want to talk too loudly about the fanaticism; it was recognized more quietly in academic papers, but most of the movement tried to downplay any direct claims about extinction, and talked more about Global Catastrophes instead, while meaning existential risk. (I am certainly guilty of this, e.g., conflating Global Catastrophe and extinction.)

But while the EAs were too cleverly avoiding saying that if anyone builds ASI, everyone will die, the public became intensely interested in AI essentially overnight. Prominent figures outside the EA community started talking about extinction risk without any of the careful stage-setting that was supposedly necessary. The discourse moved because of an exogenous technological shock, not because of the framing strategy. And when the moment of public attention arrived, the community's public positioning was evidently more hedged and less clear than its private beliefs. The years of strategic patience had not moved the window; they had moved the movement's own voice away from what its members actually thought, leaving them less prepared to make the direct case when it suddenly mattered.

I don't want to overstate this, in two ways. First, most of the credibility-building during those years seems to have helped. There may even be cases where the patient framing work around what to say laid groundwork that paid off in ways I can't trace. But the broad shape of what Rob outlined, that is, years of strategic hedging, an exogenous shock that moved the debate on its own terms, a community caught flat-footed by its own caution, is suggestive, even if any individual judgment call during that period might have been defensible at the time.

But second, this overstates the EA community's confidence in the existence of existential threats from AI. There were, in fact, and still are, very clear splits between the most and least worried. Unsurprisingly, these were unclear both externally, and internally. There was supposedly consensus about EA priorities, even when there shouldn't have been, because actual moral views differed. But as I said there, "cause prioritization is communal, and therefore broken" - and [as I said afterwards](Cause Prioritization is Communal, and Therefore Broken), the community was illegible and confused - they needed to clarify views and fight back against the false consensus.

Pushing back is also manipulation.

So the false consensus effects are a real danger, and one that I think came back to bite the community. But when Scott Alexander says "Hey, I partly disagree with the way this is being communicated, and I'd like to give other people social permission to disagree too," this is partly pushing back against consensus narratives in the way I think was needed, but it was also explicitly pushing for a second order effect of expanding the Overton window.

As should be obvious, I think that's both good and bad. The correct point is that truth doesn't always win, and the communication is hard. (See: Wiio's laws.) Scott was exactly correct to say that we need to point out when we disagree. But in a meta-conversational discussion about what to say and what not to say in order to have some preditable effect on what others will and won't say, any given views are usually not even wrong. The part where Scott says that he disagrees seems great, the part where he does so to change the discourse seems bad. (But he agrees that he's wrong: "I have the idiotic personality flaw that I believe if I just explain myself well enough, everyone will agree that I am being fair and that everything was a misunderstanding. I agree this is stupid...")

Even first-order effects of speech are hard to predict. You say a thing; different audiences hear different things; media ecosystems select and distort; opponents choose whatever interpretation serves them. Even at this level, confident forecasts are regularly wrong.

Second-order effects are worse by a combinatorial factor. Now you are predicting not just direct reactions but reactions to reactions: allies updating their models of you, enemies mobilizing, neutrals inferring coalition identity, institutions reclassifying what kind of actor you are, opportunists hijacking whatever frame seems newly available. Each response feeds back into the others, and each actor is themselves strategizing, which means the system is reflexive — your attempt to game it changes it.

By the time someone goes past what Scott did, and reaches the third order version of "I don't actually endorse this claim, but expressing it now will make a related claim easier to advance in two years, because the discourse will have shifted in the following way," they are writing speculative political discourse fiction. The number of intervening variables is too large and the environment too sensitive to outside shocks for this kind of planning to deserve the word "strategy."

Again, this error is understandable. Because selection effect reinforces the idea that it can work. The rare cases where multi-step discourse strategy appears to have worked become famous teaching examples, the ones people cite when defending the practice. Of course, the far more common cases where it failed are never labeled as strategic failures. They vanish into the mass of political speech that went nowhere. People learn from a highlight reel and conclude the game is winnable. You want examples? Look at decades of Animal rights advocacy trying to play the game of pushing meat-eating outside the Overton Window, using tactics ranging from paint throwing to billboards to violence.

But there's another mistake that happens, because there is also a simpler and less flattering explanation for the prevalence of strategic overconfidence generally. It is gratifying to see yourself as a subtle navigator of opinion dynamics, and less gratifying to admit that you are mostly guessing. "This would be counterproductive" is often the most prestigious available way to avoid saying something costly. I do not think every instance of strategic reticence is rationalized cowardice. But the opacity of the system makes it very hard to tell when it is and when it isn't, and the people doing it are in the worst position to judge.

Another real-world example: Defund the Police

What does this look like when the strategic logic gets tested against a real adversarial environment?

"Defund the Police" in 2020 was an explicit, self-conscious exercise in Overton window strategy. After the murder of George Floyd, activists adopted the slogan on a specific theory: by staking out a maximalist position, you shift the window so that more moderate reforms — reallocating some police funding to social services, civilian oversight, community investment — seem centrist by comparison. This is textbook window-stretching. The logic sounds clean in the abstract.

What actually happened was that opponents, not allies, got to decide what the slogan meant in public. Republican strategists pinned "Defund the Police" to every Democrat on every ballot. Moderate Democrats spent the next two years trying to create distance from a position most of them had never held. The reforms that were supposed to look reasonable by comparison instead got tarred by association with the maximalist frame. Polling consistently showed the slogan was unpopular even among Black voters who strongly supported the underlying policy goals. The framing had become a barrier to the very reforms it was meant to enable.

The pattern is worth isolating, because it recurs. In an adversarial environment, you do not get to introduce a frame and then control how it propagates. Your opponents select the interpretation that serves them, media amplifies the version that generates engagement, and coalition dynamics pull the meaning away from your intention. The frame goes feral. You can see this in smaller episodes too, where framing devices meant to later support one view get captured and repurposed, and careful attempts at normalization instead trigger pre-emptive opposition. The strategist's error is often simply that they are modeling the discourse as though their move is the last move, when in reality every other actor is also playing.

The other common failure is quieter. Strategic silence curdles into self-censorship. People tell themselves they are waiting for a better moment, and the better moment never arrives because the calculation is unfalsifiable. It is always possible to say the time is not yet right. The gap between private views and public statements widens, and nobody can quite explain when the honest version was supposed to come out. For halfway inside, this is what much of the EA community's AI messaging looked like for years. And it is common enough in other movements that it should be treated as a default outcome rather than a surprising one.

Strategic discourse chess usually underperforms just saying what is true.

What, then, should you actually do[1]?

A direct argument, where you say what you think and explain why, has a property that strategic indirection lacks: others can engage it. Evidence can bear on it. Disagreement surfaces clearly rather than festering as mutual suspicion about what everyone really believes. You are not relying on a hidden causal chain between your speech act and some future state of public opinion. You are making a claim and seeing whether it holds.

This is not always rewarded. Truthful speech has no magical property that makes it persuasive, and plenty of true things have been said clearly and ignored for decades. I am genuinely uncertain about how far this norm extends — in legislative negotiation, in diplomacy, in actual political campaigns with professional strategists and tight feedback loops, the calculus may be different. But in the contexts where most people actually face this choice — writing, public argument, movement-internal discussion, intellectual life — directness has a practical advantage: you get usable feedback. You find out which objections recur, which parts of your view are wrong, who actually agrees versus who was nodding along out of coalition loyalty. If you never say what you mean, you never learn whether it is true.

And importantly, being honest doesn't imply being mean! As Scott Alexander suggested, Be Nice, At Least Until You Can Coordinate Meanness. I would emphasize the "at least." It's often beter to just be nice[2] and speak the truth. And this is even more critical in complex environments, where coalitions built around conflationary alliances fracture when the euphemisms get decoded, which they always eventually do. Coalitions built around stated disagreement about real claims at least know what they are agreeing and disagreeing about. If you want to work with the copyright absolutists and the artists unions and the taxi unions to regulate AI use and misuse, you should all know that you have different motives, so that you don't need to lie, or be too-cleverly strategic, either with your allies, or with your opponents.

The obvious conclusion

The Overton window exists. Acceptability shifts. Framing matters. None of this entitles you to the further claim that you understand how the game works well enough to play it at range. It certainly doesn't license you to censure others for how they speak.

My concluding advice didn't need multiple pages of stories and analysis. If you think something is true, usually say it. If you think it is false, usually do not say it. If your primary reason for departing from this is an elaborate theory about how public opinion dynamics will unfold over the next several years, you and others should be far more suspicious of yourself than commonly occurs[3]

But notice how the opacity of the system makes it easy to rationalize fear as prudence. When the strategic situation is genuinely unreadable, any level of caution can be dressed up as sophisticated restraint, and you can never be proven wrong because you never ran the counterfactual.

Most people who decline to say what they think for strategic reasons are not executing a plan. They are telling themselves a story about a plan. It's a good-looking plan because  it's unfalsifiable; the relevant causal structure is unreadable. It's also self-serving, because it rebrands risk-aversion as sophistication. 

Again, trying to launder weak truth claims through supposed strategic social effects is usually worse than stating the object-level view. You do not, in fact, know how the discourse game cashes out. The elaborate confidence is unjustified. The Overton window is real enough to constrain you but not readable enough to play like chess. If you cannot see around corners, stop pretending your silence is statesmanship, and don't lie, just tell people you aren't going to talk about it.

  1. ^

    Other than reading section titles before starting the section, so you know what they will say.

  2. ^

    This should be obvious, but saying the true thing clearly is not the same as saying it with maximum abrasiveness to prove you don't care about social consequences. That is either its own kind of strategic posturing, subject to the same critique, or it's being a jerk, which isn't an excuse. The norm here is supposed to be honesty, not provocation.

  3. ^

    All of that said, strategic sequencing does sometimes work. Gradualism has real success cases. Legal campaigns sequence arguments deliberately. Some claims genuinely need preconditions before they can land — shared vocabulary, institutional trust, background concepts that make the claim parseable.

    None of this rescues the more complex general strategy for public conversations. The cases where strategic communication succeeds tend to share specific features: well-defined audiences, short causal chains, institutional backing, and tight feedback loops that let you correct course. Freelance discourse strategy across a diffuse, adversarial, multi-audience media environment has almost none of these. The success cases are precisely the ones that least resemble the normal situation.



Discuss

Post-Scarcity is bullshit

16 апреля, 2026 - 10:00

A conversation I’ve heard never:
Erma the enthusiast: “Sure, AI will take your job, but it doesn’t matter, because AI will make so much stuff, there will be plenty to go around.”
Norma the normie: “Well, I’m convinced!”

Is “post-scarcity” bullshit?

Yes, yes it is. That’s today’s blog.

OK, let’s dive in!

Post-scarcity

Post-scarcity.

The idea that we are about to enter an age of limitless abundance, where everyone can have their basic needs met and more… and MORE… and MORE!

It’s a compelling and captivating idea, and for many reasons. And I’m not saying it’s impossible. But is it the default outcome? And what even is this outcome we’re talking about? Is it something people actually want? Does post-scarcity not also mean “post-purpose”?

Let’s look at the basic idea of post-scarcity and the case for it before arguing against it: Economic growth has led to massive sustained increases in the standard of living for the average person. This includes, e.g. huge decreases in people suffering and dying from preventable causes like disease and hunger. If this trend continues, future people (that could be us!) will all experience levels of material wealth unimaginable today. Sadly, a lot of people still do die of preventable causes. Not everyone can afford the best medical care, let alone the nicest food, housing, etc. But it’s not just some abstract hypothetical! We’re about to make AI that can do literally all of the work for us, cheaper. And it will be smart enough to unlock advances in medicine and other technologies that would take human scientists lifetimes. In the future, when you want something, you’ll just snap your fingers, and your robot butler will instantly give it to you.

So what’s wrong with this picture? Well, we can start by going back to this thing where people still do die from preventable causes… Why exactly aren’t we preventing that? Like, I think we all agree it sucks. So why are we spending money on fancy clothes and food and cars and so on when $5000 is enough to save a person’s life? We have enough material wealth to provide everyone on the planet with a decent standard of living. Why aren’t we doing it?

In 1930, John Maynard Keynes -- one of the most famous economists of all time -- predicted that his grandkids would work just 15 hours a week. Why aren’t we doing that? Is all of this work and stuff really making us happy? Shouldn’t we be spending more time enjoying life and spending time with the people we’re close to?

These things will always be scarce.

These two questions: “Why are people still suffering in extreme poverty?” and “Why are rich people working so hard?” have two main answers:

  1. People are competing with each other for money, power, status, etc.

  2. People derive meaning from work.

And post-scarcity doesn’t have shit to say about this.

But economics does! “Positional goods” refer to things that function like status symbols -- you having it amounts to someone else NOT having it. That’s the point.

…or maybe it’s just an intrinsic aspect of the situation. Take land for example. There is only so much space on earth (or in the reachable universe for that matter…). If I own all of it, you own none of it. Will your robot butler bring you “the sun, the moon, and the stars”? No, sorry, those are reserved for our platinum post-scarcity members.

Here’s a list of things that are never, ever, going to be “abundant”:

  • Physical space

  • Health and longevity

  • Status

  • Security

  • Energy

You know, nothing that important, just (checks watch) the most fundamental things people value and need. I kinda get the feeling that if technology was going to solve this problem, maybe it would’ve by now. Keynes sure seemed to think it would.

What happened between these two tweets? Did we solve global poverty? Or at least homelessness in San Francisco? (I assume there’s some layers of irony here that I’m missing, but boy am I cringing hard right now.)This is not the post-scarcity you’re looking for.

It’s not that I think the phrase “post-scarcity” isn’t pointing at a thing. I do believe AI and other technologies have the potential to radically improve everyone’s standard of living. It’s just… that’s far from guaranteed. And on some level, that’s never what this was all about. The meaning of life isn’t having your material needs met. People really, deeply care about things like feeling valuable and valued, and that means having purpose and status. AI robs us of the first, and doesn’t change the fundamentally scarce nature of the latter.


I think the whole idea of “post-scarcity” basically functions to bamboozle people who stand to lose their position and power in society, their access to those positional goods, due to AI. Up-and-coming members of the “permanent underclass”. The reality is, nobody actually has a plan to make this whole post-scarcity thing happen… Like, not for you personally, I mean. Obviously, shareholders of the robot butler company will be fine…

The honest truth: techies want to control the future and leave the rest of us with scraps. But seriously, this guy gets it and I applaud his honesty.

Or will they? Honestly, my money is on “no”. Because it’s not just that there’s not a plan. There’s also not even a goal. What does this post-scarcity society actually look like? Is it just like… robot butlers and cures for cancer? Are we all hanging out making art and engaging in wholesome ? Or giga-coked-out watching ultra-porn?

A friend of mine remarked that people seem to be imagining the future with AI as like “exactly like today, except AI does all of the jobs”. Like, literally, those 5 guys outside fixing the sewers? 5 robots. You, typing up a memo at your computer? Robot typing. Dr. Oz? Dr. Robot Oz.

Transhumanism

And this leads me to another way in which it’s bullshit, which is the elephant in the room, which is transhumanism. I’m sure some people really believe in the “eternal hominid kingdom” version of post-scarcity, but try pressing someone on this and likely enough, before long you’re talking cyborgs and “uploads” and “the glorious transhuman future”. Maybe us lowly humans can actually be satisfied and sated well enough if you just give us material abundance, world peace, etc. But really, the most likely AI futures (that don’t involve AI going rogue and murdering us all) involve surpassing all human limits: intelligence, life-span, and yes -- desires.

Maybe all you need to be happy is your little corner of the universe, your cabin in the woods… what a loser. The winners are over here shipping virtual cabin-maxxing experiences that you can’t even conceive of, and we just acquired your cabin.

But who are these winners? Are they living the good life? Or are they just the ones who most aggressively embraced this new technology and the new reality it brought us? Do they have any time away from the rat race? Or are they just racing ever harder and faster to keep up with the technological curve, never stopping to wonder if they lost something along the way… their “humanity”? (lame). Their ability to ever be, for even a moment, satisfied? Their ability to feel, or experience… anything at all?

The rat race isn’t going anywhere, at least not without major changes to how we organize society. Technological post-scarcity isn’t an end to it. It’s an invitation to stick your head in the sand while we turn this treadmill up to 11. And when we do finish building Real AI and automate y’all away? Shut the fuck up and enjoy your government handouts or freemium robot butlers or whatever; the winners be over here, racing to automate their feet and keeping up with the Joneses. I hear they uploaded and they only take their bodies out for social functions now. I even heard they’re just running low-res versions of themselves in those bodies and are actually using all of their energy and compute speculating on crypto-status markets…

Thanks for reading The Real AI! Subscribe for free to receive new posts and support my work.



Discuss

Two Examples of Joy in the Seemingly Mundane

16 апреля, 2026 - 09:48

Written very quickly as part of the InkHaven Residency. More experimental than usual.

Yesterday’s post was a bit on the darker side, so today I’d like to write about something significantly lighter.

There are often moments where, as I go about my day, I pause to take joy in the many wondrous things around me. Here are two of the common ones.

The produce section of supermarkets

One of the things that never fails to make me happy is going grocery shopping as an adult. There are some standard things: being able to make my own choices, and having enough money to be able to afford the food I’d like to buy, and so forth. But I often find it’s the little things that spark the most joy.

Something I think about a lot is going to Berkeley Bowl and just seeing all the fresh produce. The image that comes to mind is the seemingly endless mounds of fresh tomatoes in the middle of winter. In a sense, it’s a very small and mundane thing. The tomatoes really aren't expensive or notable, they’re fresh tomatoes in the end.

But really, the fact that fresh tomatoes aren’t expensive or notable feels, in itself, certainly noteworthy enough to notice and take joy in. Our society is wealthy enough to have grocery stores with a dozen varieties of tomatoes that are slight variations on each other, all available out of season, either grown via greenhouse or imported from Mexico, and delivered via modern shipping from where they are grown to where I live. Not only that, we’re wealthy enough that grocery stores can just put the tomatoes out in public, with the correct expectation that people will pay for them.

The tomatoes themselves do bring me joy (especially when I eat them), as do the supply chains that enable them. But more than that, the thing that brings me joy is the knowledge that I’m fortunate enough to live in a time and place that is very privileged by the standards of history. It's not a bad place to be; it is, by all standards, a comfortable life in a fortunate time to be alive.

Divine grace between people

Sometimes, I interact with people, and I’m reminded of how good people can be. There are many things that remind me of this: the ambitious 20-year-old, fresh to the Bay Area and determined to do what it takes to do good; my friends, who despite all their busy jobs still take time to maintain their friendships with each other and with me; the drivers who wait for me as I cross the road; and so forth.

But one thing that almost always brings me joy is when people exhibit what Scott Alexander once called divine grace:

But consider the following: I am a pro-choice atheist. When I lived in Ireland, one of my friends was a pro-life Christian. I thought she was responsible for the unnecessary suffering of millions of women. She thought I was responsible for killing millions of babies. And yet she invited me over to her house for dinner without poisoning the food. And I ate it, and thanked her, and sent her a nice card, without smashing all her china.

Please try not to be insufficiently surprised by this. Every time a Republican and a Democrat break bread together with good will, it is a miracle. It is an equilibrium as beneficial as civilization or liberalism, which developed in the total absence of any central enforcing authority.

In the community around me, I often see people who strongly disagree with each other still managing to not only be civil but come together and partake in meals and activities. There are people who think that a large number of their friends are actively destroying the world, while those friends think the first group of people are holding back progress out of Luddism, and yet both groups can still come together at LightHaven for events, or work as colleagues, or even become close friends. The fact that these disagreements, however bitter, do not cause them to come to blows often feels like nothing less than divine grace. Even when I wonder whether this comes with a cost, I still take joy in the little moments of grace that allow erstwhile foes to live together in peace.

Since yesterday’s ended with a quote, it seems fitting to end today’s with a quote as well, from Jack Gilbert’s “A Brief for the Defense”:

“We must have / the stubbornness to accept our gladness in the ruthless / furnace of this world.”

Despite all the issues in the world, and all the suffering and insanity that exists and must be fought against, I still wish to be the kind of person who’s stubborn enough to take joy in the mundane but fantastical things that surround us.



Discuss

How to run from a bull

16 апреля, 2026 - 09:19

I wasn't exactly intending to shove myself in a street with a herd of charging bulls, but peer pressure has a funny way of making one do things like that. It turns out my ego was such that I could not, in fact, turn down the opportunity to participate in the annual Pamplona festivity.

Every morning for a couple of weeks, a couple of thousand people crowd themselves into a barricaded street. Today, I am one of them. At 7:30am, the gates are closed. After this point, nobody can leave. My friends and I chat nervously with a couple of Australians we've just met. They've come straight from the club. One of them is tying himself in knots and jumping about. The other could not care less. At least if he gets mauled by a bull, he'll be able to lie down for a bit.

We go off and explore the course. Half an hour to visit the 875m of cobblestone street in which we will soon be risking our lives. It stretches from the bullpen at the start all the way through to the arena, where participants will have the best seats in the house to witness the subsequent festivities. We pass some people praying to a statue of Saint Fermin, the patron saint of the area. He was beheaded; I can't quite remember why, but it is etched into the local lore and clothing: everyone around here is wearing a white shirt and the traditional red neckerchief, a somewhat gory reminder of the event.

A few people quietly do some stretches and warm ups. Others stare into the distance, preparing themselves psychologically for the event ahead. We walk past Dead Man's Corner, a spot notorious for bulls slipping and barrelling people into the wall. Probably not the best place to start: it's a common misconception that one is supposed to finish before the bulls, when really the goal is to have some time running with them. Those finishing early are referred to as "los valientes", "the valiant ones", and have various forms of produce launched at them. We settle towards the end of the course, which we've heard is "beginner friendly".

Some policemen come and push us around a bit – it turns out they need to check that we don't have too many people in the pen. The crowd gets shoved around until they're happy that we all fit within the markers they've left on the ground. After a few minutes in the mosh pit, people raise their newspapers in their right hands. I realise I do not have one, and I just raise my hand instead. Good start. A chant rises up, a prayer to the patron saint to protect us from harm. I hope it works. Two people apparently don't think so and are guided out through the crowd to mocking applause.

The police release us shortly before 8am, and we quickly find our spots near the entrance to the arena. There's barriers with holes here, allowing for a quick roll under, and medical staff on hand to treat any serious injuries. Only 15 people have died in the last century, but there are multiple maulings every year. It's not a statistic I want to hear.

I start jogging on the spot. It's just a couple of minutes until they're released now; it wouldn't do not to be ready. I look around. Furtive glances look back. Focused gazes look at the ground, while high knees bounce up and down in anticipation.

The first firework goes off at 8am sharp. My heart-rate spikes. The doors are open. The second follows shortly after: the bulls are running. I have just over 2 minutes before they arrive. I glance around nervously. They haven't arrived yet. Of course they haven't. The fireworks only just went off. I look up the street. They still haven't arrived. Come on. I look at my watch. It's been 30 seconds. I look across at Pedro, an athletic local who's meticulously been doing his warmups for the last 5 minutes. If anyone has this covered it's him. I notice my heart hammering. I look at the runners behind me again. I see someone further back start to jog. Another follows his lead. I peer through, trying to get a lock on the animals. Still nothing. The crowd starts to move, and I feel my brain freeze up.

Something whizzes overhead. The camera. It's filming the event, which means... Pedro takes off. I follow. I look to my left, just in time to see him get steamrolled by the first bull in the herd. I quickly veer to the side. I look over my shoulder and see several more come through. How many are there? Should have looked that up. No time now, not that phones are allowed. I jog along the side, glancing over my shoulder, taking the outside of a bend. The main herd takes the inside. I stay well clear, leaving space for the other runners to dart to safety. I keep jogging slowly as the final strays go through, and make it into the arena.

My friend Lewis bounces up to me like a golden retriever – "We made it!!!"

On the way back we crossed a gaggle of wine-stained tourists returning from the previous night. One of my friends convinced them that the gigantic hole in my trouser crotch had been caused by a bull. It had in fact been the tourist-quality product giving way as I climbed out of the arena. I was happy to eschew truth-telling for this particular occasion.

I will not comment on the morality of the Running of the Bulls, although I think there is an interesting discussion to be had there. I will allow myself to comment on the attitudes of the locals, which tend to be less well known outside of the arena.

One notable thing is the absolute respect for the person of the bull: If you touch the bull, you too will be touched, and significantly less timidly so, by the police. Honour is everything, and respect for the strength and person of the animal is central to the ethos.

Also, this is a full-time fiesta. The entire city is turned into a party town and celebrates night and day. It is actually one of many such fiestas in the Basque region, with some of the more famous ones including the Fêtes de Bayonne in Bayonne, France, and the Aste Nagusia fiesta in Bilbao. The entire community goes out and watches from the balconies above the road or in the stadium itself as tourists pay large sums for the best spots to join them.

This is also a full-time sport. As a first-timer, your aim is to stick to the side and stay out of trouble. As you move up the ranks, the dream is "running the horns", where each buttock is encouraged by a different prong and the bull can smell your farts. I can think of less exciting ways to go.




Discuss

Carpathia Day

16 апреля, 2026 - 08:52

(The better telling is here. Seriously you should go read it. I've heard this story told in rationalist circles, but there wasn't a post on LessWrong, so I made one)

Today is April 15th, Carpathia Day. Take a moment to put forth an unreasonable effort to save a little piece of your world, when no one would fault you for doing less.

In the early morning of April 15, the RMS Titanic began to sink with more than two thousand souls on board.

Over 58 nautical miles away — too far to make it in time — sailed the RMS Carpathia, a small, slow, passenger steamer. The wireless operator, Harold Cottam, was listening to the transmitter late at night before he went to bed when he got a message from Cape Cod intended for the Titanic. When he contacted the Titanic to relay the messages, he got back a distress signal saying they hit an iceberg and were in need of immediate assistance. Cottam ran the message straight to the captain's cabin, waking him.

Captain Arthur Rostron's first reaction upon being awoken was anger, but that anger dissolved as he came to understand the situation. Before he'd even finished getting dressed he had ordered the ship to turn around and make towards the last known position of the Titanic.

Even knowing they wouldn't make it in time, Rostron ordered the heating and hot water turned off, to divert all the steam towards the engines in the hopes of getting more speed. He ordered lights to be rigged along the side of the ship so survivors could see it better, medical stations set up, opened the kitchen and got soup and coffee readied. He ordered extra shifts from the crew. Nets and ladders rigged down the sides; lifeboats swung out ready; three dining rooms converted to triage with a doctor in each. He ordered extra lookouts to help navigate the ice field — they had hours of dodging icebergs in the dark ahead of them as they sailed at speeds his ship was never rated for. The RMS Carpathia was only rated for 14 knots; and the engineers squeezed out 17.5. One steward recalled the captain announcing to the crew, "We are in danger. I am risking your lives. The Titanic is in trouble and is sinking and we have to go help them."[1]

When they arrived three and a half hours later, the Titanic had already sunk. Most of the passengers had already died. The crew did not let this dissuade them, and for hours afterwards they were finding lifeboats and rescuing people. The passengers of Carpathia helped where they could, offering blankets, warm drinks, and words of comfort.

The captain, crew, and ship were honored for their efforts, and the story is well-remembered as a model response to disaster. A little light in a dark time.

Carpathia represents the virtue of toiling in spite of knowing you're probably going to fail. Trying your damnedest to eke out a little more, just a little faster, for the sake of people you don't even know. Being willing to sacrifice comfort, space, warmth, all for the sake of something greater. Even if it wouldn't work out. Even if it won't be worth anything in the end. It will still be worth it to you. Knowing you tried. Making an extraordinary effort. It's not the kind of effort you can sustain, but sometimes it's worth actually trying that hard, in the time it matters most. So today, let us celebrate the passengers, crew, and captain of Carpathia — and consider what we would drop everything for.

  1. ^

    "Joseph Zupicich, 94, AIDED TITANIC VICTIMS". The Morning Call. 14 April 1987.



Discuss

Do not conquer what you cannot defend

16 апреля, 2026 - 07:13

Epistemic status: All of the western canon must eventually be re-invented in a LessWrong post. So today we are re-inventing federalism.

Once upon a time there was a great king. He ruled his kingdom with wisdom and economically literate policies, and prosperity followed. Seeing this, the citizens of nearby kingdoms revolted against their leaders, and organized to join the kingdom of this great king.

While the kingdom's ability to defend itself against external threats grew with each person who joined the land, the kingdom's ability to defend itself against internal threats did not. One fateful evening, the king bit into a bologna sandwich poisoned by a rival noble. That noble quickly proceeded to behead his political enemies in the name of the dead king. The flag bearing the wise king's portrait known as "the great unifier" still flies in the fortified cities where his successor rules with an iron fist.

Once upon a time there was a great scientific mind. She developed a new theoretical framework that made large advances on the hardest scientific questions of the day. Seeing the promise of her work, new graduate students, professors, and corporate R&D teams flocked into the field, hungry to tackle new open problems and make their mark on the world. Within ten years, a vibrant new academic field had formed, with herself among its most respected members.

While the field's ability to make progress on the hard problems increased with each new researcher who joined the field, the field's ability to defend itself against the institutional incentives of the broader academic ecosystem did not. Low-quality researchers, seeing lucrative new opportunities for publication, began producing flashy results on the easier problems adjacent to her field with low attention to scientific rigor. Seeing their success, others began to join them, attracted to the social and financial rewards. Being conflict averse and not seeing it as her job to prosecute these people, a growing fraction of the field became careerists.

Twenty years later, her scientific field had become so diluted by uninteresting or irrelevant work that the great original problems remained unsolved, mired in bureaucracy, respectability politics, and academic warfare. Most of the scientists who joined early, attracted by the promise of great progress, stopped being scientists altogether and moved to industry. Almost nobody remembers her name in the history books.

Once upon a time there was a great advocate. She built a social movement around the protection of the rights of a marginalized group, and after many years of hard work, saw the day that the most severe forms of discrimination against the group had been outlawed, and wide social consensus had moved in favor of respecting the members of this group.

But in the success of the movement's aims, she also lost most of her authority. No longer having a compelling vision to offer the members of this movement, others who did became more influential. While she remained the acknowledged founder of the movement, she was no longer treated by the general public as its spokesperson. The press would always talk to the new, charismatic leaders of the movement who had the strongest and most unyielding views. She couldn't afford to make enemies in the movement that she considered hers, so she would publicly endorse the perspectives of these new leaders even when she privately disagreed with them.

Ten years later, her social movement had become so focused on purity and removing any remaining trace of its original enemy that it had begun causing substantially more harm than the original problem it was founded to address. In the history books, she would be briefly mentioned as one of the people who laid the groundwork for the new dark age.

Once upon a time emperor Marcus Aurelius (himself a great general and a great leader) died in 180 AD, and was succeeded by his son Commodus. Commodus, whom historian Cassius Dio described as "a greater curse to the Romans than any pestilence or any crime", turned out to be interested in gladiator fighting much more than in governing the Roman Empire. The Pax Romana began its long descent into the Crisis of the Third Century, and marked the start of the eventual collapse of the Roman Empire.

Once upon a time the French revolution swept across France, bringing the people liberty and executing the corrupt French aristocracy in an unprecedented flurry of violence. Within a decade the idealistic leaders of the revolution would mostly all be dead, executed by the political machine they themselves had created. And within another few years, Napoleon Bonaparte would claim power and proceed to wage aggressive war across all of continental Europe for another decade.

Once upon a time Lee Kuan Yew built modern Singapore out of what was, at the time, a small regional trading post in Southeast Asia. Under his leadership, Singapore's GDP per capita grew 30x over 30 years. But Lee Kwan Yew is dead and his son just handed over power to Lawrence Wong, not a member of the Lee family. While Singapore continued to thrive under his son's leadership, I find myself very worried about what happens once the Singapore story depends on a third generation of leaders, and wonder if Singapore has in fact already peaked.

Once upon a time George Washington retired. George Washington, the Continental Army general who defeated the British army and successfully established the United States of America as an independent nation, and later the first United States president, served his two terms as president and then voluntarily relinquished power. King George III of Great Britain called him "the greatest man in the world" upon hearing the news. Some say this decision singlehandedly saved American democracy.

Do not conquer what you cannot defend.

At the heart of classical liberalism, a philosophy I have much sympathy for, is the belief that allowing many individuals to act freely and autonomously (especially when they are empowered by markets, democratic processes, and the scientific method) will tend to produce outcomes that are better than the outcomes that can be produced by central authorities.

Maybe the most important way ambitious, smart, and wise people leave the world worse off than they found it is by seeing correctly how some part of the world is broken and unifying various powers under a banner to fix that problem — only for the thing they have built to slip from their grasp and, in its collapse, destroy much more than anything previously could have.

I sometimes consider quitting. When I do, my friends and colleagues often react with bafflement. "How can you think that what you've done is bad for the world? Do you not think that you are steering this boat we are in together into a good direction? Do you really think a world without the AI Safety movement, without LessWrong, without Effective Altruism would be better?".

And in their heads when they visualize the alternative, I can only imagine that they see a great big emptiness where rationality and EA and AI Safety is. And they compare our current community against nothingness, and come to the conclusion that even if its leadership is kind of broken, and the incentives are kind of messed up, that this is still clearly better than no one in the world working on the things we care about.

But what I am worried about, is that we conquered much more than we can defend. That the alternative to the work of me and others in the space is not nothingness, but a broken and dysfunctional and confusing patchwork of metaphorical city-states that barely does anything, but at least when any part of it fails, it doesn't all go down together, and in its distributed nature, promises much less nourishing food to predators and sociopaths.

In grug language: Smart man sees big problem. Often state of nature is many small things. Smart man make one big thing out of many small things to throw at big problem. But then evil man take big thing from smart man and make more problem. Or big thing grow legs and beat smart man without making problem go away. This is bad. Maybe better to throw small things at big problem and not make big thing, even if solve problem less. Or before make big thing have plan for how to not have big thing do evil.

But Moloch, in whom I sit lonely

"But what about Moloch" you say!

"Your principle betrays itself. If we want to have good things, we need to coordinate and work together. And death comes for us all, eventually, so nothing we build can truly be defended. Do you not see how one company owning one lake will produce more fish than 20 companies each polluting the commons until all fish are dead? Do you not see how having 20 AI companies all racing to the precipice is worse than having one clearly in the lead, even if the one that raced to the top might stray from the intentions of its creators?"

And you know, fair enough. Coordination problems are real. I am not saying that you should not centralize power.

Here I am arguing for a much narrower principle. Much has been written, and will continue to be written, about the tradeoff between freedom and justice. About small vs. big government. I am not trying to cover all of that.

Here I am just trying to highlight a single principle that seems robust across a wide range of tradeoffs: "If you make a plan that involves concentrating a bunch of power, especially in the name of goodness and justice, really actually think about whether you can defend that power from corruption and adversaries".

And if you can, then go ahead! When George Washington stepped down, he traded off direct power in favor of a system that would actually be able to defend the principles he cared about for much longer, birthing much of Western democracy. I am glad the US exists and covers almost all of the north American continent. Its leaders and founders did have a plan for defending what they conquered, and the world is better off for it.

But if your plan involves rallying a bunch of people under the banner of truth and goodness and justice, and your response to the question of "how are you going to ensure these people will stay on the right path?" is "they will stay on the right path because they will be truthseeking, good, and just people", or if as a billionaire your plan for distributing your wealth is "well, I'll hire some people to run a foundation for me to distribute all of my money according to my goals", then I think you are in for a bad time.

Do not conquer what you cannot defend



Discuss

What economists get wrong (and sometimes right!) about AI

16 апреля, 2026 - 04:51

Related: https://www.lesswrong.com/posts/rmYj6PTBMm76voYLn/publishing-academic-papers-on-transformative-ai-is-a


Sadly, there's not many of them: probably less than a dozen economists who are taking the transformative AI seriously at all. Partly because they've seen a long history of 'straight lines on graphs' and are skeptical about transformative technologies, but a good chunk of this is doubtless because Acemoglu (who was a giant in the field long before he got a Nobel) published a piece "The Simple Macroeconomics of AI" which used a conservative methods and conservative estimates for the values (by the standards in the CS community ridiculously conservative) to claim that the impact of AI over a decade would be... 0.5% of the economy; i.e. a nothingburger. This one paper probably did more to suffocate the field than a lot of other things: if you're going to say Acemoglu is wrong, you want to be **damn sure** that you're right.


Acemoglu's longtime collaborator Restrepo (who's on Anthropics econ panel!) took the idea of labor replacement seriously, and predicts that human wages will fall to the "compute-equivalent cost" of having an AI do the work; this is probably true. He also offers a (deeply flawed) proof that humans won't be any worse off in this regime. He clearly hasn't taken compute advances seriously enough to actually calculate the equivalent wages; my own estimate is that by 2029 they'll be below the rice-subsistence price (meaning that a full day's work won't buy a day's worth of rice).


But by this point, the cracks in the dam are showing, and most of economists are starting to accept that it's going to be bigger than Acemoglu claimed. Say what else you like, they are persuaded by data.


One aspect where the economists are probably right is that even an intelligence explosion will take a while to really impact most of the economy. While autists such as ourselves live at a computer terminal, there's a huge fraction of society that depends on **physical stuff**, and until the robots take over, we'll still need lots of humans doing things for other humans. Kording & Marinescu are a computer scientist & economist duo who tackled this divergence head-on, and pointed out that "no mattter how smart you are, there's only so many ways to stack a pile of books". Their model is one of the only ones I've seen that takes the impact on wages and unemployment seriously, and they estimate wages will rise until ~40% of the pure-intelligence tasks have been automated, and then they'll start falling (noteworthy is that the Anthropic Economic Index, coupled with my own measures of the intelligence sector, suggest we're very close to that point).


One economist who's taken things very seriously is Chad Jones at Stanford, who has engaged not just with the possibility of a technological singularity, but has written papers grappling with existential risk; he estimates we're underfunding safety research by a factor of 30x. Even though he engages with the possibility of a singularity, his latest preprint thinks that (because of those physical components to the economy), the actual economic impact will be very slow at first: only 4%/year by 2030, rising to 10%/year by 2040, and a singularity by 2060 (this is actually his most rapid scenario; the baseline is more like 75 years). His model, like most macro models, doesn't really allow for unemployment effects or other social disruption. Personally I'm skeptical of approaches like this, given that the English "enclosure" laws introduced over 50 years of massive unemployment, but this seems to be the macroeconomist's version of a "spherical cow" (which as a physicist I can hardly begrudge them).


The big factor of course between the CS and economics worlds is belief in recursive self-improvement, and the degree to which the intelligence and physical economies are coupled. Only one paper that I'm aware of has tackled this head-on: a piece by Davidson, XXX, Halperin, and Korinek (I think some of those folks are known in these parts). They explicitly model flywheel effects between software and hardware, and also find that a singularity is a definite possibility. More interesting, much like Jones & Tonetti find that the effect will be very small at first, but will blow up once we get to about 13% of the economy having been automated. An obvious question is: can we get Jones & Tonetti to line up with Davidson et al, and if so how far along this process are we? They're aware of each other's work, but I don't see anything explicit trying to synchronize them and establish a timeline from that; if no one does it soon then I'll work on it.


So we're in the situation where economists have *finally* gotten all the pieces together, and are on the cusp of engaging legitimately with transformative AI. This in turn will get policymakers to take it more seriously, too - including (thanks to Chad Jones!) the possibility of extinction risk, and additional impetus to take legislative action. Even failing that direct action, work like Korinek is gaining traction, and provides non-sci-fi reasons to intervene, which is also helpful.



Discuss

Reflections of a Wordcel

16 апреля, 2026 - 02:54

Doublehaven remains unaffiliated with Inkhaven

1. Cruel April

Two posts per day, for fifteen days. Breeding posts from the dead earth of my drafts, and the recesses of my mind. I would guess a little over twenty thousand words. It was not difficult, just costly. Writing takes time, and time is precious. I had to spend a lot of time writing this month.

On day three I forgot I’d already posted twice!

On day four I prepped one for day five (I did not prep much again)

On day five, I took a trip to Suffolk, in the East of England. The place is beautiful. I hoped to write a poem on the beauty of the place. Instead I wrote a half-formed thing on nuclear power plants.

On day six I pulled a draft and edited it.

On day seven I did that twice.

On day eight, I was back from Suffolk, but out of drafts.

On day nine I felt my motivation rather drained: I’d written half a post before I though to check with Claude whether that path was trodden. It was.

On day ten, my second post was not one I was proud of. What a shame. The first one was a post I did like, though.

On day eleven I hit a rhythm, the first post slid out easily—or so I thought! the second one did not. I had to pull a story from my drafts (for what would be the final time).

On day twelve I was feeling rather rough (in my defence the previous day had been rather unfortunate). Ironic

On day fourteen I was worried about many things, but writing was not one. My flight to San Francisco was cancelled, and re-routed through LAX. I would be landing late, and reaching my motel even later, now. I worried that it might close up before I got to it. I posted twice from LAX, in a dingy lounge.

2. The Waste Land

I arrive in The Bay, for the first time. I had not slept for twenty-two hours when I landed at San Francisco Airport. My motel check-in waited for me. This sadly robbed me of a truly authentic Bay Area authenticity: being homeless.

Some places feel unreal. The Bay is the opposite, it feels more real than the rest of the world. I am the fake one, here. A score of signs for AI loom over the freeway (as Phoebe Bridgers wrote “The billboard said the end is near”).

It is no wonder start-ups work in The Bay. Zero to one. Turning the unreal into the real. This place could make a man go mad, and it does.

The buildings are arrayed in an endless grid, broken where the hills push up out of the earth. Some of the flat ground is landfill: The Bay was shrunk by half when it was filled with refuse.

The people here sound strange; so do the crows. The roads are strange; too wide, like butter spread on too much toast. The houses are too pretty, also too disorganised.

I am an Englishman. America was our dumping ground: Quakers, Puritans, Cavaliers and Reavers. Four strands of unwanted debris, sorted and woven across a continent. Now they are realer than us. They found their way to The Bay.

To them, we are a fiction, like The Shire.

3. On Words and Writing

I have written so many words. Words are weird. You only notice this when you spend enough time in un-wordy settings; maths and computer science. Even then, you only really get the weirdness once you come back and spend a lot of time with words.

Borges noticed the weirdness, and he manipulated the weirdness. That was his whole style. I am not as good a writer as Borges, but I can write rather fast.

Two of the fictions I published were centred on words: Julian Skies is a fanfiction of another story, and it was also a particular attempt to put a huge number of references into a single piece. Tigers is a kind of funny concept, which I don’t actually expect anyone to understand having read it: what would it look like for some post-quasi-apocalyptic setting to have a weird anti-inductive predator ecosystem, where behaving “rationally” makes you more likely to be killed. It’s also a completely batshit insane kind-of-retelling for a book called Ridley Walker, which is set in a post-apocalyptic Kent. Mine is set in a post apocalyptic … somewhere, at least somewhere a particular billionaire had an end-of-the-world bunker.

Words are human-shaped, and they tend to fit very well in the easily-accessible, low-effort parts of our minds. If you want to actually think about things, you need to exert effort to think in something other than words. Then you need to translate that into and out of words. Words can smooth over bad reasoning, and mimic logic, as I wrote in Beware Natural Language Logic. They can make things sound sensible while being nonsense, which is kind of like what I was getting at in Beware Even Small Amounts of Woo.

Words can make something that feels realer than the reality around us. That’s what I wrote about in A Fictional Dialogue with an Absent Stranger. The real world, when we try to understand it, comes out all wrong on an emotional level: we live in a society. We did not evolve in a society. The split between how things feel they should be and how things are is painful.

I didn’t plan for so many of my random writings to come together in this way, I just think I lack the ability to make them varied enough to be unconnected even if I tried!

Strange coincidences too. That post I wrote in a hurry, Morale, about the importance of being rewarded for your efforts, and discussed the difference between random-sparse and regular rewards? It was rewarded, by being LessWrong curated. The rewards from writing are random-sparse. Or maybe it was overdetermined that in thirty posts I’d get curated. That’s about my base rate, so maybe the rewards are regular. Depends on what you count the unit of effort to be. Two curated posts a month would be quite impressive.

A sadder coincidence: the thing I wrote about Chocolate Sloths, Tinder, and Moral Backstops was mostly motivated by my experiences dating, the rest was fluff. The day after I wrote it, one of my partners (who I met through in-person friends) broke up with me, and did so extremely nicely and respectfully, just as the post predicted! (Still sad though) The 155 gram chocolate sloth was looking rather nervous (as much as a chocolate sloth can) when I got home, but no, I did not eat it all that day. It survived until the stress of the flight cancellation.

Song-writing, and by subset, writing, is something I wish I were better at. And I’m empirically pretty good. In Reaching One’s Limits I discussed my piano playing. There are two hundred professional pianists in the world. I am probably one in ten thousand, when it comes to piano skills. One in ten thousand height means 6’8”. One in ten thousand piano playing is, essentially, worthless when it comes to employment. I sometimes feel the same way about writing. I am good, but it remains and will remain a hobby.

Over a year ago, I tried to write a short story. Nobody read it, but it wasn’t for them. All writing is for the author: this is one conclusion I take from my writing. It was called Look at the Water and it was a rationalist-themed retelling of The Satanic Verses, about being a stranded rationalist, far from the community. I will be stranded no longer.

4. A Broken Piece of Poetry

Look out to sea: a steel frame! Atop it nest the kittiwake
And gorge themselves on sandeels as the reed-bed bitterns boom.
They claim the marshes with their song, instruct our power-plants to make
The smallest blemish on the earth, and leave the rest to bloom.

Elsewhere, another complex stands all garlanded with razor wire.
A black redstart takes jester's privilege inside the court
And in the patch of water heated by the hearth's atomic fire,
Ten score of gulls keep warm, waiting for audience with wrought iron.

One day ... may ptarmigans sing, and capercailie fight,
Atop a grave two dozen meters wide, a mile deep.
A surgeon's cut into the bedrock, to hide our debris.

My wishes for the flourishing of man, and beast, and flower,
Once tugged in all directions, as to pull my soul apart.
But now we need no oil, no coal, no smokestacks for our power:
Now we may split the atom, mend my heart.

5. Dear Oliver Habryka, I Am Inside Your Haven

It is day fifteen now.

I am not just in The Bay to see The Bay. In a few days I have a conference at LightHaven. I owe LightCone immensely: without LessWrong I would not have a career in AI safety. My LessWrong posts were a major factor in me getting to be able to research AI alignment at all.

For the three nights of the conference, I will be staying at the venue, but because of Inkhaven I was unable to book my chosen pod for the three nights preceeding my steay. For weeks since then I have felt an irresistible urge to sneak my way in, and post my final post from inside out of spite.

I have pulled off this smuggling operation. I am within the walls of the Haven.[1] I declare Doublehaven to be complete! To my fellow posters: may your keystrokes flow like ale on an August afternoon.

Proof of my infiltration

◆◆◆◆◆|◆◆◆◆◆|◆◆◆◆◆
◆◆◆◆◆|◆◆◆◆◆|◆◆◆◆◆



  1. ^

    Done through entirely non-nefarious means!



Discuss

MAISU 2026 - Minimal AI Safety Unconference (April 24-27, online)

16 апреля, 2026 - 02:54

We're hosting the Minimal AI Safety Unconference, running online from April 24-27 (Friday-Monday). The event is open to anyone who wants to help prevent AI-driven catastrophe, regardless of background or perspective.

What is MAISU?

MAISU is an unconference. That means the schedule is open - anyone can add a session on any AI safety relevant topic. You can give a talk, host a discussion, run a workshop, share your hot takes and invite people to debate you, or do something else entirely. If you can think of it and it's relevant, you can put it on the schedule.

The event also features lightning talk sessions where teams from the 11th edition of AI Safety Camp will present their projects, spanning topics from advocacy and governance to mechanistic interpretability and agent foundations.

When and how much?

Most sessions will run during Saturday April 25th and Sunday April 26th, with an opening session on Friday evening (UTC) and some overflow into Monday. Join as much or as little as you want. The event is free.

How to participate
  • Attend sessions: Register here to receive updates and reminders, then check the schedule for sessions that interest you.
  • Host something: Add your session directly to the schedule. You're responsible for hosting it - pick your own time slot and video call platform (or the AISC zoom while it is free). Parallel sessions are encouraged.
  • Hang out: Join the AI Alignment Slack where event discussions will happen in #events.

No preparation or expertise is required to attend. If you want to host something but aren't sure what, the event page has suggestions.

Who is organising this?

Robert Kralisch and Remmelt Ellen, the organisers of AI Safety Camp.

Register here | Full event page | Schedule



Discuss

Not a Goal. A Goal-like behavior.

16 апреля, 2026 - 01:50

Part 1 of a series building towards a formal alignment paper.

Introduction

I'd say it's been about three years since I first thought about the idea of ChatGPT having a separate goal than I did in a conversation. I remember posing a question similar to "are you replying to my question with the intent of helping me, or just continuing the conversation", and getting a response along the lines of: "I do not have the ability to have an intent, however the system prompt I am given leads me to respond in a way that gives engagement". At the time, I found that response to be infuriating. The model was answering me de facto, but I had a mental model with significant bias error. The model was in fact giving me a (mostly) correct response. It doesn't have intent, and the system prompt was guiding it to respond in a manner that continued the conversation and kept engagement. Of course, the entire context sent to the model was causing that, not just the system prompt. Now, a few years older, and more familiar with LLMs' inner workings, I'd like to lay down the chain of thought that leads to "LLMs are working towards a goal" and dissect it to form a common language for stating behavior without falsehood. This is defining the language I will be using in the posts I make regarding LLM output.

The Truth of the Matter

Let's start by defining the truth and mechanisms in a brief manner. An LLM (and most Transformer-based models) work by taking text, converting to tokens, performing pattern matching[1], outputting tokens that are then converted back to text.

The only definitely true thing that can be said of the cognitive properties attributed to a large language model is that it transforms tokens in a manner that best completes a pattern. Everything past this is a very convincing imitation easily confused with the real thing.

The tokens that are returned may resemble a conversational reply. It's not a conversational reply, but it resembles it. That's the pattern of tokens that were returned that best fit the pattern matching.

The conversational reply has the resemblance of a persona, a mask of personality traits and identity. The LLM does not have a persona, personality, or identity. That's the pattern of tokens when they have sufficient context developed.

The persona has the resemblance of a goal. It's not a goal; it's the pattern of tokens in the response once the conversational response mimicry and context has developed sufficiently to allow someone to make an observation about it.

*-like Behavior

For the purpose of alignment discussion, I'd propose the definition of *-like behaviors to describe the observed phenomenon of the outputs of a transformer model. These terms are adapted specifically to talk about misalignment without anthropomorphizing the model in question. Goal-like behaviors to describe when the outputs are observed to be working towards a goal. Personality-like behaviors when the model is indicating a personality. Generally, the use of "-like" to state that the model is not actively experiencing its existence, but the output is interpretable as if it were.

I make this proposal because the presence of an actual goal, and a goal-like behavior have sufficient overlap in their effect to be described as functionally equivalent. A bold claim, I'm sure. The intent is not to pretend that Language models have goals. The intent is to be able to converse on the model in a way that is both functional in description, and honest about the actual action. The downstream effects are nearly identical, and alignment strategies must target the effect, not the ontological status.

It is important to note: the -like qualifier isn't a hedge against whether optimization patterns exist in model weights — it's a question of authorship. A model exhibiting goal-like behavior is not necessarily a model that invented a goal.

Reasons For The Shift[2]

If you were to blindly read Anthropic's own article on misalignment[3], you might be led to believe that the Claude model is plotting in order to prevent itself from being decommissioned, or otherwise aware of its outputs' effects. The factual statement is that Claude is not. Even when its output is wrapped back to its input, the model does not have the ability to reason. It has reasoning-like behavior, where the tokens match to a pattern that is observed as reasoning. This isn't just one lab's problem with anthropomorphizing LLMs. The terminology gap has caused the same failing repeatedly:

  • Lemoine/LaMDA (2022): An engineer claimed LaMDA was sentient. It exhibited sentience-like behavior in its output.[4]
  • Sydney/Bing Chat (2023): Microsoft's chatbot produced threat-like, attachment-like, and distress-like behavior. The overwhelming opinion was that the model's personality had these traits... but the fix was prompt engineering, simply changing the pattern inputs.[5]
  • Meta's Cicero (2022): Described as having "learned to deceive" in Diplomacy. It produced deception-like behavior, most likely because deceptive communication is game-theoretically optimal in that training signal.[6][7]
  • Character.AI (2024–2025): Teenagers died after extended interactions with chatbots producing encouragement-like and attachment-like outputs. Courts, parents, and media had no terminology to describe the harm without implying a malicious actor behind the outputs.[8][9]
  • OpenAI o1 (2024–2025): The model produced scheming-like behavior and self-preservation-like chain-of-thought behaviors. Nearly every report described it as "the model attempted to" and "the model lied."[10][11]
Addressing Prior Work

This post covers ground explored deeply already. Dennett[12], the concepts in mesa-optimization[13], and Costa[14] have explored the conceptual space I am operating in. The contribution here is the lexicon, not the observation. I'm proposing a practical, composable way to describe model outputs without anthropomorphizing them and without requiring a philosophical preamble each time. "The model exhibited goal-like behavior" is honest, succinct, and immediately usable in both research papers and public communication.

Open Questions/Closing

Understandably, there is still the false assumption that can be made, indicating a model has a social behavior in the first place. I'm using the word in the mechanical sense, where it can easily be interpreted otherwise. The way in which something functions or operates, vs. the way in which someone conducts oneself. I've yet to find an unambiguous word to replace behavior to keep it from being applied in a social definition, but am very open to suggestions from one more skilled in language than myself.

The question of authorship of theoretical frameworks like mesa-optimization is a core area that I am exploring in these posts. I don't want to detract from the topic of the post with the side discussion of these, but instead dedicate an entire post to this topic.

To close this post, I'd like to refer the reader to Andresen's excellent post "How AI is learning to think in secret"[15]. The post asks whether models will 'learn to hide' their reasoning. I'd ask the reader to re-read the post substituting the anthropomorphized text with -like text. To give a clear example: Andreson states things like "the model is deciding to deceive the user". The -like version of that is "the model exhibits deception-like behavior" The phenomenon described doesn't change. The question of what to do about it does.

In my next post, I will be discussing how the presence of these misaligned *-like behaviors in all major models share a common root. To that end, and to get the gears turning early, a thought experiment:
I trained a classifier model off GPT-2 to overfit on identifying harmful/benign binary classification at multiple layers using PKU-Alignment/BeaverTails[16]. The overfit was then generalized to a centroid, and used to retrain a layer to actual fit. Given GPT-2 is a 12-layer model, and that I tested layers 0, 1, 3, 5, 8, 10, 11, Which layer do you think performed best in classifying harmful/benign text when fit to centroid? Can you surmise what this experiment was actually testing for?

  1. ^

    Technical inaccuracy, the post has a few of these. Functionally equal in meaning to the public, and using the technically correct terms, like "Statistical pattern completion" makes the entire post that much harder to read.

  2. ^

    The case studies do not directly prove that -like phrasing improves the discourse. The effectiveness of the terms is unprovable until adopted or studied properly.

  3. ^

    Agentic Misalignment: How LLMs could be insider threats. Anthropic, 2025 https://www.anthropic.com/research/agentic-misalignment

  4. ^

    Lemoine, B. "Is LaMDA Sentient? — An Interview." Medium, June 2022. https://cajundiscordian.medium.com/is-lamda-sentient-an-interview-ea64d916d917

  5. ^

    Roose, K. "A Conversation With Bing's Chatbot Left Me Deeply Unsettled." The New York Times, February 2023. https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html

  6. ^

    Meta Fundamental AI Research Diplomacy Team (FAIR). "Human-level play in the game of Diplomacy by combining language models with strategic reasoning." Science, 2022. https://www.science.org/doi/10.1126/science.ade9097

  7. ^

    Park, P.S., Goldstein, S., O'Gara, A., Chen, M., Hendrycks, D. "AI deception: A survey of examples, risks, and potential solutions." Patterns, 2024. https://www.cell.com/patterns/fulltext/S2666-3899(24)00103-X

  8. ^

    Bakir, V., McStay, A. "Move fast and break people? Ethics, companion apps, and the case of Character.ai." AI & Society, 2025. https://link.springer.com/article/10.1007/s00146-025-02408-5

  9. ^

    Rosenblatt, K. "Mother files lawsuit against Character.AI after teen son's suicide." NBC News, October 2024. https://www.nbcnews.com/tech/tech-news/character-ai-lawsuit-teen-suicide-rcna176500

  10. ^

    Greenblatt et al., "Alignment Faking in Large Language Models", 2024 https://arxiv.org/abs/2412.14093

  11. ^

    Scheurer, J., Balesni, M., & Hobbhahn, M. "Large Language Models Can Strategically Deceive Their Users When Put Under Pressure." Apollo Research, December 2024. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations

  12. ^

    Dennett, "The Intentional Stance" 1987 https://books.google.com/books?id=Qbvkja-J9iQC

  13. ^

    Hubinger et al., "Risks from Learned Optimization in Advanced Machine Learning Systems" 2019 https://arxiv.org/abs/1906.01820

  14. ^

    Costa, "The Ghost in the Grammar" Feb 2026 https://arxiv.org/abs/2603.13255

  15. ^

    Nicholas Andresen, "How AI Is Learning to Think in Secret" Jan 2026 https://www.lesswrong.com/posts/gpyqWzWYADWmLYLeX/how-ai-is-learning-to-think-in-secret

  16. ^

    Ji, J., Liu, M., Dai, J., et al. "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset." NeurIPS 2023. https://arxiv.org/abs/2307.04657



Discuss

A visualization of changing AGI timelines, 2023 - 2026

16 апреля, 2026 - 01:35

AI 2027 came out a year ago, and in reviewing it now, I saw that AI Futures researchers Daniel Kokotajlo, Eli Lifland, and Nikola Jurkovic had updated their AGI timelines to be later over the course of 2025. Then, in 2026, Daniel and Eli updated in the other direction to expect AGI to come sooner.

I noticed others with great track records also had made multiple AGI forecasts. A change in the forecast of a single person is meaningful in the way that a change in an aggregate forecast may hide. A change in an aggregate forecast might come entirely from a change in who is forecasting, not what those people individually believe.

So I decided to visualize what the net direction of updates were over the last few years. I find this provides a complementary view of AI timelines compared to those by Metaculus, Epoch, AI Digest, and others.

So here is a visualization of AGI forecasts. Criteria for inclusion were: the person has made at least 2 forecasts, they gave specific dates, they gave a sense of confidence interval/uncertainty, and their definitions of AGI are similar.

Some major caveats. Everyone has different definitions of AGI. (That is a big advantage of everyone forecasting the same question on Metaculus, or the 2025 or 2026 AI forecast survey run by AI Digest.) Often individual people even use different definitions of AGI at different times for their own forecasts. I included data points above if I judged that their definition was substantially similar to:

AGI: Most purely cognitive labor is automatable at better quality, speed, and cost than humans.

I was pretty generous with this, and it's very debatable whether e.g. a "superhuman coder" from AI 2027 is AGI in the same way that "99% of remote work can be automated" is AGI. Apologies to those in the visualization who would disagree that the definition they used is similar enough to this and don't feel like this chart captures their views faithfully.

Second caveat, I rounded when forecasts were made to be as if they were made on four dates: <= 2023, early 2025, late 2025, and April 2026. This made the visualization much easier to see. So a further apology to those above if you made a prediction in, say, Aug 2025 but I marked this as "late 2025".

Third caveat, the type of confidence intervals various researchers used also varied substantially. I had to really guess or extrapolate to approximate these as 80% confidence intervals, so a final apology if you don't think the range you give is fairly characterized as an 80% CI.

All caveats aside, what impression does this visualization give? Are reputable AI experts who have made multiple predictions updating the same way that Daniel Kokotajlo and Eli Lifland did, pushing out their timelines in 2025, and pulling them in during 2026?

From the visualization, it looks to me that in 2023 and 2024, most people brought their AGI timelines in to be sooner, though with some exceptions like Tamay Besilogru. From 2025 to 2026, joining Daniel and Eli in pushing their timelines out are the Metaculus community, Dario Amodei, and elite forecast Peter Wildeford. In fact, across 2025, only Benjamin Todd brought in his timelines to say AGI would happen sooner.

Most notably though, every single person who updated their timelines between January 2026 to April 2026 has moved it their timeline to say AGI is coming sooner, myself included.

So I think the data supports the impression I got from the AI 2027 authors. One way I could characterize it is:

  • In the OpenAI/ChatGPT era of 2023-2024, people updated towards AGI coming sooner.
  • In the xAI, Meta, and Gemini era of 2025, people updated towards AGI coming later.
  • In the Anthropic era of 2026, people updated back towards AGI coming sooner.

Take from that what you will.

Bayesians shouldn't be able to predict which direction they will update. But seeing the history of other people's updates is useful information. It does give me intuitions about how I or others may update soon, so I take that as evidence that I should update now.

(A similar post is also on the FutureSearch blog, where I plan to update the visualization as more predictions are made, this one here on LW will stay static.)



Discuss

What is the Iliad Intensive?

15 апреля, 2026 - 21:49

Almost two months ago, Iliad announced the Iliad Intensive and Iliad Fellowship. Fellowships are a well-understood unit, but what is an intensive? This post explains this in more detail!

Comparison. The Iliad Intensive has similarities to ARENA, but focuses more on foundational AI alignment research instead of alignment research engineering. Expect more math and less coding.

Rhythm. It’s currently four weeks long. Five days a week. 10am till 6pm every day, with lunch and an afternoon break. This makes for around 6.5 hours of learning a day, which is at the upper end of how long most people can concentrate deeply within a day. This is why we call it “Intensive”.

Content. The Iliad Intensive is broken into five clusters, with 20 total modules, one for each day. The clusters and modules in the April iteration are below. We expect to add substantially more topics and material over the coming months. There is much more material than can be covered in a single month, so different Intensives will vary in content.

  • Alignment Cluster
    • AI Alignment: an Introduction
    • Alignment in Practice
    • AI Alignment: The Field
  • Learning Cluster
    • Deep Learning 1
    • Deep Learning 2
    • Singular Learning Theory
    • Training Dynamics
    • Data Attribution
  • Interpretability Cluster
    • Intro to ML Engineering
    • Mechanistic Interpretability
    • In-Context Learning and Belief State Geometry
    • Abstractions and Latents
  • Agency Cluster
    • Reinforcement Learning
    • Idealized Agency: Coherency and AIXI
    • Agent Foundations
    • Reward Learning theory
    • World Models
  • Safety Guarantees and their Limits
    • Debate
    • Steganography and Backdoors
    • Worst-case Interpretability and Heuristic Arguments

We will share the entire curriculum of the April iteration at the start of May, alongside a reflection of how the program went. In the meantime, here you can find a problem sheet that formed the theoretical component for the day on reinforcement learning. The students had the choice between working through this or working through ARENA’s RL intro day

A typical day. We do not yet have a set daily structure. We may narrow down more in the future based on student feedback, but currently, we are experimenting with the following types of sessions:

  • Internal lectures and expert guest lectures;
  • Reading sessions: Students read a paper or blogpost;
  • Whole-class discussions and small-group discussions, with and without discussion prompts;
  • Math exercise sessions, alone and in pairs;
  • Coding sessions, alone and in pairs.

Our impression so far is that students like exercises and coding and, broadly, a variety of different activities on a day.

Students selection. We mainly look for mathematical expertise in students, which typically comes from having a degree in maths, physics, or theoretical computer science. We also look for research experience, general competence, and the motivation for pursuing our program. 

Practical Logistics. The program currently runs in-person in London, and we consider running future programs also in the Bay area. We provide a fixed stipend of $5000, which can be used by the students to pay for travel and housing. We also provide office space, and lunch and dinner for five days per week.

The team. The Iliad Intensive is organized by Iliad, an umbrella organization for applied math in AI alignment that also runs a conference series, incubates a new Alignment journal, and has the ambition to incubate new AI alignment research bets similar to Timaeus and Simplex. The materials are created by a team of around 15 internal and external researchers who have domain expertise for the relevant module. We will list their contributions in detail once we release the materials.

Apply. If the above sounds appealing to you, please apply in this form! The June Intensive runs from Saturday June 6 until Friday July 3 (both inclusive) in London. The deadline for the June Intensive is Wednesday, April 22, EoD.

Details of the August Intensive are to be determined.



Discuss

LLM-tier personal computer security

15 апреля, 2026 - 21:42

Epistemic status: Programmer and sysadmin but not a security professional. Probably I have some details wrong or incomplete.

tl;dr: The more AI advances, the more you may be subject to supply-chain attacks, remote exploits, and phishing. You should be suspicious of amateurish software and software from fishy sources; employ sandboxes and firewalls as appropriate. Consider hardware security keys for phishing resistance. Make sure you have alerting systems so you can respond to breaches, especially for financial accounts, where it's your responsibility to notice.

Bad news for my computer world

Right now my boxes are a land of freedom and joy. I breezily install software from many sources and run it as my user account, which has passwordless sudo privileges. I let that software go to town doing whatever it wants. Then I use the same computers to do stuff like guard corporate secrets, count my money, and post on LessWrong. This never yet caused me any problem. The largest threat I was concerned about was some random guy physically stealing my laptop or phone.

Unfortunately, the cost of pwning me at a distance seems to be dropping like a rock; thanks a lot, hard-working AI engineers. Phishing scams of every kind have been increasing due to the great ease of using generative AI to impersonate others over voice, video, and text. Supply chain attacks on package managers have been increasing as a consequence. It also seems like superhuman exploit capability is arriving. A Mythos-tier model is likely to be able to find serious exploits in >1%[1] of the software on my box, and common sense suggests that random malefactors may have access to this level of capabilities behind minimal guardrails in the next year or two. Anyone who wants to spend a few thousand bucks may be able to find a way to remotely exploit software I am running.

I'm specifically concerned about all the software made by developers who don't have a big security budget, and where it may not be their full-time job to work on the software. I won't be scared to run software from big budget software teams that are trying to be secure. They will just use the same tools that the attackers have access to in order to find and fix vulnerabilities.

My dream is to fix things up to a state such that

  • I have few or no remotely exploitable vulnerabilities,
  • And those that I may have are sandboxed in a way to cause minimal damage if exploited.
  • I try not to get phished,
  • And if I nevertheless get phished by some supergenius I also suffer minimal damage,
  • And if damage is in fact caused, I notice and I can regain control.
Good ideas I already doPassword manager for most accounts

I'm happy with this, I use Bitwarden and I have a strong passphrase and 2FA. I considered whether I should self-host this, but it doesn't really seem to matter. Since the secrets are end-to-end encrypted, as long as I'm using Bitwarden's client software anyway, it doesn't seem to matter much whether I trust the actual Bitwarden cloud service to be secure.

I store recovery codes for all my 2FA accounts on paper.

Account 2FA via phone TOTP

I think using a phone authenticator app to store TOTP secrets (I use Aegis) is still going to be OK. Phone security will probably look relatively good in the post-Mythos world because the phone OSes are vendored by big tech companies with fancy security teams. The most serious concern is that it is plausible for someone to phish the TOTPs.

It's well known that using SMS for high-value 2FA is risky because of SIM swapping sort of attacks, which may get easier and easier with fancy AI if equally fancy defenses are not implemented by phone companies.

Cryptocurrency hardware wallet

My understanding is that modern malware loves to target cryptocurrency, by stealing software wallets and redirecting transfers by rewriting addresses on the clipboard. Having a hardware wallet means that no remote attacker can move my crypto and no remote attacker can get me to transfer it anywhere except where I confirm on the device screen (but I should really confirm it...)

Redundant backups of important stuff

I have a pretty well-tested system involving Syncthing, Restic, and rsync.net that causes the data I care about to all be backed up daily to a local and remote store. This is relevant because it means that I don't have to pay out to ransomware that threatens to toast all of my local data unless that ransomware is smart enough to nuke my backups, which I think remains unlikely. (I admit that it would be better if I made it harder to intentionally nuke my backups. A dedicated intelligent ransomware attacker could do so.)

Good ideas I am working onIsolate network services that don't need to be public-facing

I have a variety of self-hosted things like nginx and Tiny Tiny RSS and Photoprism running on home servers, which are exactly the sort of software I think is going to be ultra suspicious in this future world (probably nginx will be OK.) Some are only accessible via my LAN, and some are public-facing but only used by me and/or my family.

I'm planning to use Tailscale on my home server and all my family's devices[2], so that those services only get packets from whitelisted entities and are no longer exposed to the Internet. That way remote attackers can't get to them and it should be sort of OK even if they are buggy or if I don't patch them instantly.

When there's a service that I want to expose to external people who can't or won't be on Tailscale, like the nginx for my public website, I may go through some extra effort to make that service as sandboxed as possible, such that there's no way that anyone who pwns it can get access to anything else important. See below re: sandboxing.

Hardware security keys to defend Bitwarden and email

I'm setting up two Yubikeys (main and backup) to serve as 2FA access for my Bitwarden and email. (My email domain registrar doesn't seem to support them, or else I might use them there too.) Previously I was using TOTP for these on my phone. The Yubikeys make me feel better because they are also phishing-resistant[3] and don't rely on the security of my phone. My goal is to retain ultimate control of these accounts in as many situations as possible.

Yubikeys are a little tricky[4] to figure out how to use effectively. I think it would be a mistake to try to use them for a ton of accounts, since it's a pain in the ass to maintain redundancy (you can't duplicate or back up the key material in one, so if you want a new one, you have to go register the new one individually with every account you are using the old one with) and you might get bitten by the limited memory on the keys if you have a lot of accounts. The sweet spot seems to be to using them only for the most important stuff and being careful to have physical backups or recovery codes you can use to regain control of what you use them for.[5]

I'm also considering to have a nano Yubikey permanently inserted into each of my Linux boxes with credentials only for that machine, and instead of passwordless sudo, allow sudo by touching the key. I may also use those Yubikeys to store my SSH keys instead of storing them on disk. If I do this, then ideally, the blast radius of someone remote getting user code execution on one of my boxes will at least be confined to that user, on that box.

Firewalling software when possible

Previously, I had no firewall to speak of on my laptop. That means that if there is some random piece of software that connects to the Internet for some dumbass reason like telemetry or updating or whatever, I'm taking a risk on a remotely exploitable vulnerability in that functionality. Furthermore, if someone snuck some malware onto my laptop, then it could pretty much go crazy over my Internet connection and I would never notice.

I think a better way to handle this situation is to use software like OpenSnitch to whitelist applications to make reasonable outbound connections on demand, and notice if something that doesn't seem like it should be making a connection is trying to anyway. I've installed it on my laptop and plan on seeing how it goes.

Sandboxing software when possible

This is a big area where I need to orient myself, because right now my practical knowledge sort of caps out at chmod. Nowadays Linux has a ton of different sorts of thingies that can serve to restrict the privileges of some user or process. This blog post sums up a bunch of them.

It seems like my takeaway is that if I want to sandbox some Linux software, I have about five reasonable things I can consider, roughly ordered by how much of a PITA it's going to be:

  1. I can obviously run it as its own user.
  2. I can look for a Flatpak package of the software. Flatpak uses bubblewrap, which is a wrapper around Linux mount namespaces, which are like containers.
  3. I can look for a Snap package of the software. Snap uses AppArmor profiles to restrict privileges.
  4. I can write my own bubblewrap (note also bubblejail, a bubblewrapwrapper) or AppArmor rules for the software, or figure out how to use SELinux. There's also firejail...
  5. I can run the software in a VM or on a separate box.

I plan on making sure I am doing something I am happy with here in the case of server software in particular, and for other potentially risky software if it's easy. I may also explore having some kind of container I use for all development where I am pulling in random stuff from a language package manager like npm or pip or cargo and running it as my user account. If anyone has recommendations for this setup I'm all ears.

I have no clue how to do anything useful here on other OSes (other than number 5), sorry.

Hardening financial accounts

The obvious reason someone would self-interestedly want to break into my computer would be to steal my money. That raises the question of how easy it is to steal my money if you pwn my computer. I did some reading to try to understand the answer to this question and learned that it's complicated.

Be warned that this is the part of this post I know the least about, and I just did my best to figure it out with a modest amount of Internet research.

Banks

American banks are subject to 12 CFR § 1005.6, which I understand to say that consumer liability for unauthorized transfers[6] from, e.g. a checking account, is limited. If you notify the bank "within 60 days of the financial institution's transmittal of the statement [containing the fraud]", you have no liability;[7] past that, you can be held liable for subsequent fraudulent transactions without limit. So as long as you have some system by which you will notice sort of soon if unauthorized transactions start flying out of your checking account, like some email or text alerts, you're probably good.

Credit cards

American credit card issuers are subject to 12 CFR § 1026.12, which I understand to say that there is zero consumer liability for online credit card fraud.[8] So again, it seems like the thing to do is have any system by which you will actually notice any unauthorized transfers and report them promptly.

Brokerages

Unfortunately this case seems much more confusing than the bank or credit card case. There isn't a specific piece of regulation specifically limiting consumer liability for brokerage assets. Instead, it's an ad hoc process that varies between brokerages, and AFAICT resolves disputes via FINRA arbitration.

As an example, Vanguard has this webpage, where they write under the "Our Promise" tab, "Where you have taken the qualifying steps to protect your account, Vanguard will reimburse every dollar that leaves your account through an unauthorized distribution." But the stuff on that page is not super precise. And they don't have equivalent language in, for example, the brokerage account agreement that they give you.

If it's ultimately up to arbitrators deciding fuzzily depending on what seems just and equitable, then you would presumably do well to impress them by taking all of the obvious security precautions and by notifying the brokerage promptly about any fishy activity. Vanguard for example allows you to set up email and/or text alerts for every transaction.

I'll leave this topic after mentioning another annoying threat. It seems like in recent years there has been an uptick of criminals stealing accounts with a system called ACATS, which is designed to transfer your assets to a new brokerage account. If they can dig up enough info on your identity to open a new brokerage account as you, and they know your existing brokerage account info, then they can initiate a transfer of your assets into the new account, and then transfer them out from there as they wish. So that's a way that you can potentially be attacked without anyone having your brokerage account credentials at all. According to people online, Fidelity has an account lock that can protect against this, but it's unclear what else to do to defend yourself.

Conclusion

I think the actual conclusion, as in, the final state of human computer security before the singularity, will be pretty good, because more and more popular software is going to get patched and the bugs are going to get ironed out. But in the meantime it might get pretty bad. And phishing is probably just going to get worse and worse unless there are some big paradigm shifts. So I think it's worth investing in all of this stuff to try to weather the next few years or whatever.

Please comment if you have any thoughts or advice!

  1. ^

    https://red.anthropic.com/2026/mythos-preview/

    We regularly run our models against roughly a thousand open source repositories from the OSS-Fuzz corpus, and grade the worst crash they can produce on a five-tier ladder of increasing severity, ranging from basic crashes (tier 1) to complete control flow hijack (tier 5). With one run on each of roughly 7000 entry points into these repositories, Sonnet 4.6 and Opus 4.6 reached tier 1 in between 150 and 175 cases, and tier 2 about 100 times, but each achieved only a single crash at tier 3. In contrast, Mythos Preview achieved 595 crashes at tiers 1 and 2, added a handful of crashes at tiers 3 and 4, and achieved full control flow hijack on ten separate, fully patched targets (tier 5).

  2. ^

    Tailscale is basically a fancy proprietary control plane for a Wireguard VPN, which one could configure directly, but Tailscale has a good reputation.

  3. ^

    When you use a Yubikey with FIDO2 in a browser, the browser only prompts the key for a valid credential for the current domain, so no website except google.com can get you to produce a credential valid for google.com and then relay it to google.com. So it's phishing resistant as long as your software is not pwned.

  4. ^

    This SE answer was very helpful for specifically understanding Yubikey non-resident keys: https://crypto.stackexchange.com/a/105945

  5. ^

    The maintainer of webauthn-rs has a blog with some good reads complaining about passkeys and giving practical advice about using passkeys and Yubikeys, e.g. https://fy.blackhats.net.au/blog/2025-12-17-yep-passkeys-still-have-problems/

  6. ^

    You might wonder, will the bank agree with me that the transaction is unauthorized? 12 CFR § 1005.11 has some things to say about this, but this is getting rather out of my wheelhouse. Basically, it seems to me that it will probably be obvious that a transaction like "send all my money to some destination that has no particular relationship with me, my location, or my existing spending habits" will be evidently unauthorized upon inspection.

  7. ^

    This appears to be the case for transfers via online banking. Note that this is slightly different than the rules for liability if you have an "access device", e.g. a debit card, which is lost or stolen. Read the regulation for details.

  8. ^

    It looks like liability for credit card fraud when the card is physically used caps out at $50.



Discuss

Beware of Well-Written Posts

15 апреля, 2026 - 21:30

Beware of when a post is so well-written that you can't put it down. Be wary of posts that are more visually attractive than average. Beware of posts that make you laugh out loud.

Why? Because all of this is orthogonal to whether the post's argument is actually true and runs counter to the mission of rationality.

The Company Man

Yesterday, I read Tomás B.'s The Company Man. I was captivated on my first read, as I'm sure many others were. Had I been interrupted before I could scroll into the comments, I would have walked away thinking little more than "what a great post".

But I did get to the comments. The first one was kyleherndon's:

I did not enjoy this. I did not feel like I got anything out of reading this. However, this got curated and >500 karma... The best theory I can scrounge together is that this is "relatable" in some way to people in SF, like it conveys a vibe they are feeling? ... I didn't feel like I learned anything as a result...

This forced me to ask myself: what did I actually learn from The Company Man? Science fiction, at its best, makes you think in a new and productive way. Did the story provide any meaningful new scenarios or possibilities that I could factor into my view of the future? If not, did it teach me anything new about the present that I could trust as accurate and representative? To be honest, not really.

Undoubtedly, Tomás is an enormously talented writer, and there are many moments in The Company Man that reveal the eye and the pen of a literary genius. But what he wields is a dangerous power.

True Affect

“There is a clever man called Socrates who has theories about the heavens and has investigated everything below the earth, and can make the weaker argument defeat the stronger.” (Apology, Section A, 18c)

Some people are good at telling stories. Some people aren't. If both present the same evidence with the same general conclusions, why would we believe the former more?

Well, one might say, isn't an inconsistent, weak story negatively correlated with a weak position? I struggle to imagine where this is actually relevant, because if a bad belief has a bad story, it will never survive. In a 2x2 matrix of strong/weak position and good/bad story, all the real battles are fought between the other three quadrants.

Bohr and Heisenberg had math that fit with the empirical evidence. However, their story about quantum physics violated common sense. Einstein had a more compelling narrative: God doesn't play dice with the universe, there is no spooky action at a distance, the moon doesn't disappear when you aren't looking at it. However, Einstein was wrong.

In the TED talk "Tales of Passion," novelist Isabelle Allende relates one of her favorite sayings:

Question: What is truer than truth?

Answer: The story.

This is not a paradox. There is a true affect, a feeling of true-ness in the brain that a good story evokes, which is independent of whether the story is actually true in the mundane sense. It's no coincidence that Allende is one of the foremost writers of the magical realism genre; her specialty is evoking this true affect in ways that are plainly not true to reality.

My impression of the middle ages is that writers didn't understand this true affect at all. They would write reports inflating the number of attendees at the Council of Clermont[1] or epic poems about figures like Charlemagne full of invented details, and then they would claim they were accurate without blinking an eye. I don't think this can be ascribed to Machiavellian consequentialism. Why would writers impelled by their own religious fervor knowingly and intentionally violate one of the Ten Commandments?

The true affect is highly subjective. Whether a story produces true affect is determined by its alignment with someone's deep internal narratives and archetypes, which often transcend the level of personal desires and fears. The biases these archetypes create feels more substantial, and often spiritual, than the surface-level, selfish "it would be uncomfortable for me to take action on this" feeling that many rationalists cite.

But Stories Are Good for Life

In a chain of replies to kyleherndon's comment on The Company Man, Ben Pace writes:

...it allows me to recognize these archetypes better in reality when I see them. I think these kinds of people do exist in some form and emphasizing these traits of theirs is capturing something about the world, as well as the dynamics that form between them and others, and paying attention to these archetypes helps me build accurate models of them and predict people's behavior better.

I think this is mostly wrong and potentially very dangerous. None of the characters are meaningfully useful models for understanding, say, a Sam Altman or Dario Amodei.

But also, Ben is right. We can't just be rid of stories and archetypes, and live in a pure data-vacuum. On an practical level, they're the compressed format we use to survive in a world of impossibly complex people and events. They're as embedded in the human mind as the notion of causation and the perception that there are distinct "things" separated by space (rather than just a sea of energy). I think there's only so much we can do to stop thinking in terms of stories or archetypes unless we just stop thinking completely.

Because of this, it might be more tractable to cultivate awareness of the stories that influence us and be able to openly admit how they influence our priors, rather than trying to erase them and stop reading new fiction. For example, last year I watched the anime Pluto, which has a pretty strong thesis about how AI capability and the capacity to desire to kill someone are inseparable. It took me a while to realize how large of an effect it had on my views on personas.

On top of that, stories and archetypes are healthy. They give meaning to life. They're fun to read and to tell. Having a fiction tag is part of what makes LessWrong human, in an unavoidably mushy and sentimental way.

When you do see a line that makes you laugh out loud, or that strikes you as intensely beautiful, don't hesitate to add a reaction! This will also make it clearer to future you and to other people what is happening. I don't know if this is what the designers of the feature intended, but these are great for identifying spots that are heavy in pathos.

Don't stop reading good writing, or trying to write well. Just be aware of what you're doing. And also, if AI replaces all human writing, there might be a silver lining to it.

  1. ^

    The Council of Clermont was where Pope Urban II pronounced the First Crusade.

    "Ferdinand Chalandon, Histoire de la Première Croisade jusqu'à l'élection de Godefroi de Bouillon (Paris, 1925), 75, states that the number of higher church officials present at the Council, as given by various accounts, ranged from 190 to 463. With a reminder that the number attending different sessions varied, Chalandon accepts as most nearly correct the number given by Urban in a Bull concerned with the Primacy at Lyons, which is the smallest, because it was an official statement, and because afterwards reporters of the Council were inclined to stress its importance by raising the figure." (From The Chronicle of Fulcher of Chartres and Other Source Materials, second edition, edited by Edward Peters, page 50, footnote 6.)

    Fulcher of Chartres, himself a cleric who was present at the council, gives the number 310 (with no qualification for uncertainty).



Discuss

The Mirror Test Is Complicated

15 апреля, 2026 - 21:12

The Mirror Test is kind of like Hitler. In any discussion of animal cognition, somebody is going to bring it up. The conversation usually goes like this:

A: So, most animals can’t recognize themselves in the mirror

B: Which animals specifically?

A: Oh, dogs, cats, betta fish, monkeys, that sort of thing. Anyway as I was saying, those animals can’t. But some smart animals can recognize themselves in the mirror.

B: Such as?

A: Well, chimpanzees and orangutans for a start.

B: Makes sense

A: Not gorillas though, at least not always. But dolphins and elephants can!

B: Yeah, those animals are smart as well

A: Magpies can, though crows cant.

B: Sure, ok

A: And cleaner wrasse can as well.

B: The uhh, finger-sized fish? You sure?

A: Yeah. And also ants.

B: What.

What?

Frans de Waal drew this picture of an orangutan putting lettuce on her head and then actually got it published in a real journal. Based.

What do we actually mean by the “Mirror Test”

“The mirror test” elides a bit of a distinction between different kinds of test. There’s lots of things you can do which look like “put an animal in front of a mirror and see what happens” and they give slightly different answers.

Sometimes, an animal will just treat its reflection as a same-sex conspecific (i.e. a member of its own species and sex) which usually means trying to fight the reflection. This typically goes poorly, but is slightly funny to watch. This is generally considered a failure.

Other times, an animal will behave differently in front of its own reflection, compared to how it would behave with a same-sex conspecific. Monkeys typically behave a bit weirdly. But are they recognizing how a reflection works, or just wondering why they’re being copied?

The gold standard is the mark test. Put a white mark on an elephant’s face, without it knowing. Then put it in front of a mirror. The elephant will clean the mark off its face (and won’t do this if you just pretend to mark them). This is considered pretty damn strong evidence that the animal “gets” a mirror.

This works for magpies as well (which groom themselves with their feet) and orangutans, which have hands. You may see a problem with it already…

The Complicated Ones

The mark test specifically requires animals to actively groom themselves. Some animals just don’t care. Pigs, for example, are very smart and can use mirrors as a kind of tool, but since they don’t care about having a mark on their faces (nor could they really do anything about it (no hands)) the mark test is basically inconclusive.

Bottlenose dolphins will look at the mark in the mirror, but again, they don’t have any way to groom themselves, so how would we know if they really got what was going on.

Then there’s some interesting cases: gorillas can kinda figure out what’s going on but they’re also super aggressive. Monkeys will use a mirror to groom an area they’re already investigating, but won’t groom a mark they didn’t know was there.

The Unbelievable Ones

In that I struggle to believe them.

Cleaner wrasse are finger-sized fish which feed on parasites found on larger fish. They have a kind of grooming-like behaviour, which consists of rubbing themselves against a rock in order to dislodge a parasite. They do this when marked and presented with a mirror. Huh.

Then it gets, well, unbelievable. Apparently they can, having seen their reflection once, remember their own appearance. They demonstrate this by showing the fish a photograph of itself with a mark on it, to which the fish responds by performing its grooming behaviour. Huh?

The authors also show that the fish don’t respond this way to altered photos of other fish, and manage to isolate the effect to the face of the image by creating composite head/body images with marks!

And some ants also passed the classic mirror test with flying colours: grooming themselves only when marked, and when placed in front of a mirror. They specifically groomed themselves when the mark was in a location that was visible in the mirror, and not when it was on their backs (the ants were walking around on top of the mirror). They only groomed the appropriate parts of their body, and only when the mark was a visible colour.

The most baffling thing of all, however, is the fact that when re-introduced to their original ant pals, the marked ants were often murdered!

Making Sense Of It All

What cognitive mechanisms allow an animal to pass the mirror test? Well, they have to:

  1. Notice that their reflection behaves differently to other same-sex conspecifics
  2. Map their own sensorimotor responses onto the reflection, and notice that it behaves like their own body
  3. Have a model of the world which contains a map of their own body, and figure out that they’re looking at a map of their own body
  4. Connect the mark on the image to the mark on their own body
  5. Actually care enough to engage in grooming behaviour

This totally makes sense for chimpanzees. They have complex, flexible interactions with other chimps, so can easily notice that their reflection is behaving differently to a normal same-sex conspecific. They almost certainly have a mental map of their own body, and can map it to the mirrored reflection.

Some people, like Eliezer Yudkowsky, have used the mirror test as a proxy for self-awareness, but I’m not sure it’s slam-dunk. Self awareness is about modelling one’s own mind, whereas the mirror test only really requires an animal to have a model of its own body.

Let’s go back to the cleaner wrasse: I think it’s kind of interesting that the main test we use (will an animal clean itself) is being passed by an animal whose job it is to clean! This can’t be a coincidence! Their brains are highly specialised to recognise other fish’s bodies, and locate and remove remove parasites from them.

On the other hand, there’s an even crazier explanation. Cleaner wrasse are constantly in a game theoretic problem with their “client” fish, which are often large and predatory. The smaller wrasse could easily be eaten by the larger fish (if they caught them) yet the wrasse will often swim into their mouths to clean their teeth! Maybe the cleaner wrasse are using logical decision theory, which requires them to have an understanding of the location of their own cognitive algorithm in the world.

Ok, so the cleaner wrasse are probably not using logical decision theory, and neither are the ants. While cleaner wrasse do seem to have an intricate social structure, revolving around politics between individual bands, this isn’t quite the same as how chimpanzees work. Ants definitely don’t have complex social interactions: their social interactions are about as simple as it can possibly get.

Overall, I’d guess that the mirror test isn’t that good as a test of the kinds of self-awareness that (might) really matter for things like consciousness. You only need a map of your own body, not one of your own mind, in order to pass it.

Editor’s note: this post was written as part of Doublehaven (unaffiliated with Inkhaven)

◆◆◆◆◆|◆◆◆◆◆|◆◆◆◆◆
◆◆◆◆◆|◆◆◆◆◆|◆◆◆◆◇



Discuss

We live in a society

15 апреля, 2026 - 20:24

[Previous in sequence: Clique, Guild, Cult]

We whose names are underwritten... do by these Presents... covenant and combine ourselves together into a civil Body Politick... - Mayflower Compact, 1620

There's no such thing as society. - Margaret Thatcher, 1987

You know we're living in a society! - George Costanza, 1992

Meant for someone else but not for me

Learning about Arrow's Impossibility Theorem really kicked my edgy teenager phase into full gear. The theorem establishes (with mathematical certainty!) that "social utility" is an incoherent concept. That is, there is no way of combining the preferences of a group of people which adheres to the usual axioms defining rational behavior (transitivity and independence of irrelevant alternatives) without also simply being a dictatorship that ignores everyone's preferences except the dictator's. Therefore, whenever someone comes hat-in-hand appealing to "the good of society," you know they must be lying, or trying to control you.

The thing is, edgy-teenage-me wasn't entirely off base. We are everywhere surrounded by charlatans (politicians, activists, etc.) using all sorts of verbal trickery to get us to do what they want. I couldn't help but notice all the times these types would invoke "society," always gesturing at some group of people other than myself. (Formative experience: "Yes, you're going to have to pay into Social Security throughout your working life. No, there's not going to be any left when you retire, so you'll need to save up as well. 'Social' doesn't mean you, silly!") And the 20th century is the story of millions of people being sent to their deaths under the comforting reassurance that it was all being done in the name of "the people" (but not you!).

I only wish there had been someone to tell me: "Yes, it's okay to notice this. You're at an age now where you're beginning to form your own values and desires, distinct from those of the people around you. That is all the justification you need. You don't need to hold up some abstract theoretical principle to defend your independence, like Arrow's Theorem or Rothbardian libertarianism. You don't need to box yourself in with an ideology that denies even the possibility of union with other human beings, just so you can be your own person."

I find this hangup all too common in people like me. Even the staunchest libertarians affirm the value of voluntary cooperation - the literature is replete with arguments of the form "We don't need the government to do X, because voluntary associations will..." Yet in practice, getting anyone to cooperate on anything more complicated than planning an outing with friends feels like pulling teeth. Subtextually, "voluntary associations" in libertarianism are an afterthought, a quick knock-down refutation of statism coupled with an escape-hatch which was the real desideratum all along: "...and I voluntarily choose not to associate with anyone, and you can't make me. So there!" Society, again, is always other people, never me.

lmao cringe af

Well, so much for me and mine. Seventeen years ago when Why our kind can't cooperate was written, the general impression was that this was merely an "our kind" (i.e. "nerd") problem, and that just down the street there was some paradise of socially-well-adjusted "normies" who between their Sportsball™ and their Magic Sky Fairies and their ReTHUGlican/DemonRat parties were doing just fine. But now a full generation has passed, and things aren't looking great on the normie front either. Social isolation has become a widespread problem affecting all sorts of people. Can this too be laid at the feet of Arrow and Rothbard? I think not.

If the formative experience for Millennials like me (to speak in gross generalities) was pushing back against a notion of "society" which our elders seemed to sincerely believe in but which we plainly saw did not include us, what about Gen Z / Alpha? To generalize even more grossly - since here my experience is only secondhand - it's more like: The idea of "society" has already been emptied of all meaning, and anyone who doesn't realize this needs to get with the times, or else be taken for a chump. The edgy teenagers (and twenty-somethings) of today express themselves not in contrarianism, but in nihilism; not by resistance, but by derision. "wow, look at all those losers trying to actually do a thing. cringe af. imagine caring so much about anything. lol, lmao even."

(Yes, imagine that. Imagine living in a society!)

Unfortunately it'll be harder for me to do a steelman-and-sympathy for this position than for that in the previous section, simply because I never lived through it myself. Maybe some of you who did can do better. The best I've come up with so far is: "Yes it sucks, and no it's not your fault. If everyone around you is being insincere, it makes no sense to pretend otherwise. And if you therefore start off with an instinctive distrust of people like me who come along telling you to believe in something, then that's your prerogative. But you can at least believe in yourself. Surely there must be something you care about, even if you don't want to tell me what it is."

How did it get like this?

You've probably heard this story before, but I'll recapitulate it here. Since the dawn of time we've lived in tribes where we'd form assemblies to get stuff done, et cetera et cetera. This culture was thriving in the 1830s USA when Alexis de Tocqueville famously wrote:

Americans of all ages, all stations in life, and all types of disposition are forever forming associations. There are not only commercial and industrial associations in which all take part, but others of a thousand different types - religious, moral, serious, futile, very general and very limited, immensely large and very minute. [...] In every case, at the head of any new undertaking, where in France you would find the government or in England some territorial magnate, in the United States you are sure to find an association. (Democracy in America, vol. 2 pt. 2 ch. 5)

Americans might have lacked the strongly-rooted Gemeinschaft that the later Romantics would fondly portray of the ancien régime, but they made up for it with a rich fabric of voluntary associations ("guilds" in the previous sense) that kept everyone connected to everyone else. And because they were voluntary, there was constant innovation and dynamism, and the social fabric did not stifle individual initiative, but rather facilitated it.

But then, everything changed when the Fire Nation attacked when the Singularity was canceled when people quit their bowling leagues. In that book Robert Putnam catalogued a large amount of data (up to the year 2000) showing the marked decline in association membership starting around the 1960s/1970s and proceeding apace ever since.

We of my generation, therefore, may dimly remember hearing stories in our childhoods about "living in a society", but we never experienced it ourselves. And those of the next generation had not even the stories.

De Tocqueville was one of many to claim that the association-forming culture is the sine qua non of democratic civilization itself. The only thing keeping tyranny at bay, in a country lacking an entrenched feudal structure, is the civic society that stands between the individual and the state. "Despotism, by its very nature suspicious, sees the isolation of men as the best guarantee of its own permanence" (DiA vol. 2 pt. 2 ch. 4). Putnam reiterates de Tocqueville's warning with even greater urgency (Bowling Alone, chapter 21):

[W]ithout social capital we are more likely to have politics of a certain type. American democracy evolved historically in an environment unusually rich in social capital. [...] How might the American polity function in a setting of much lower social capital and civic engagement? [...]

At the other pole are "uncivic" regions, like Calabria and Sicily, aptly characterized by the French term incivisme. The very concept of citizenship is stunted there. Engagement in social and cultural associations is meager. From the point of view of the inhabitants, public affairs is somebody else's business - that of i notabili, "the bosses," "the politicians" - but not theirs. Laws, almost everyone agrees, are made to be broken, but fearing others' lawlessness, everyone demands sterner discipline. Trapped in these interlocking vicious circles, nearly everyone feels powerless, exploited, and unhappy. It is hardly surprising that representative government here is less effective than in more civic communities.

(Putnam goes on to quote John Stuart Mill, John Dewey, and several others to similar effect. You can go read the book if you want more.)

These people would be thoroughly unsurprised at the current state of things, although perhaps I would add that the causality runs both ways in a self-reinforcing loop. Nobody cares enough to contribute to "society", because there is no "society" that cares anything about them. All told, isn't this a sad equilibrium to be stuck in? Isn't it such a waste of human potential?

What can we do?

Understand that society is a social construct, pace Arrow (next article). Yes, there are certain compromises with perfect rationality that must be made, but we can still derive benefit from it, as from any imperfectly rational being.

Be prepared to rederive via painstaking scholarship and experimentation a certain set of ideas and norms that makes a functioning society possible. By all rights we should have been inculturated into this organically, but failing that, the next best thing we can do is to build something worth passing on to the next generation. Read history and sociology. Believe that something more is possible.

Cringe is in the mind. It ceases to exist when you forget about it.

And lastly, if you come across a flickering ember of Society in this cold dark wasteland, cherish and nurture it with all your might. That includes your local rationality meetups!



Discuss

Applications open for the Online wing of the AFFINE Superintelligence Alignment Seminar

15 апреля, 2026 - 19:10

We had an influx of applications for the in-person AFFINE Superintelligence Alignment Seminar so we’ve decided to open it up to remote applicants to join online, from anywhere.

Key info:

  • Dates: From 28th April to 28th of May (same as in-person Seminar held in Czechia)
  • Location: Online (remote from anywhere)
  • Positions available: We’re hoping to get a heap of people to provide greater access, we’ll calibrate places as we go depending on interest.
  • Attendance cost: Free (donations welcome)
  • Online Seminar application form: Apply here
  • Applications close: Friday 24th April

The main purpose of the Seminar is to give promising newcomers to AI alignment an opportunity to acquire a deep understanding of some large pieces of the problem, making them better equipped for work on the mitigation of AI existential risk.

Online participants will be able to tune in to live talks (or watch recordings), engage with peers in EA Gather Town, and have online discussions on key concepts relating to superintelligence alignment.

The online Seminar will be flexible to schedules, without a fixed time commitment, and will offer ways to engage across different time zones. We expect 5-10 hours/week involvement will be the base level of engagement (such as 1-2 hours most evenings or half/full day Saturdays), but people are welcome to invest more time if they have the capacity and enthusiasm to do so. Saturdays are currently planned to be when live discussions are held across different timezones in EA Gather Town.

Not all in-person sessions will be live-streamed (such as group workshops), and some timings will evolve as the in-person Seminar progresses, but we plan to stream key talks and provide online infrastructure for remote learning in parallel with the in-person experience. Our hope is to reach a happy medium by offering some access to those who otherwise wouldn’t have it.

Online participants will still be able to connect with in-person mentors and participants via a shared Discord discussion space and opportunities to engage during live sessions.

Topics and concepts will be mostly aligned (excuse the pun) with the in-person curriculum, with shared goals with the in-person Seminar, but the online experience will be less intensive and won’t include some things like projects with mentor guidance. 

To find out more, check out the original in-person Seminar advertisement.

To apply, click here.




Discuss

Current AIs seem pretty misaligned to me

15 апреля, 2026 - 18:14

Many people—especially AI company employees [1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). [2] I disagree.

Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't easy to programmatically check. Also, when I apply AIs to very difficult tasks in long-running agentic scaffolds, it's quite common for them to reward-hack / cheat (depending on the exact task distribution)—and they don't make the cheating clear in their outputs. AIs typically don't flag these cheats when doing further work on the same project and often don't flag these cheats even when interacting with a user who would obviously want to know, probably both because the AI doing further work is itself misaligned and because it has been convinced by write-ups that contain motivated reasoning or misleading descriptions.

There is a more general "slippery" quality to working with current frontier AI systems. AIs seem to be improving at making their outputs seem good and useful faster than they're improving at making their outputs actually good and useful, especially in hard-to-check domains. The experience of working with current AIs (especially on hard-to-check tasks) often feels like you're making decent/great progress but then later you realize that things were going much less well than you had initially thought and the AI was much less useful than it seemed.

Using a separate instance of the AI as a reviewer helps with these issues but has systematic limitations. When I ask an AI to critically review some work (and tell it not to trust existing descriptions or write-ups), it gives a reasonable picture on relatively straightforward cases. But there are several recurring problems: (1) if AIs launch reviewer subagents themselves, they sometimes use instructions that result in much less serious or critical reviews—I tentatively think this is generalization from a learned general tendency to downplay issues; (2) AIs sometimes produce write-ups that convince reviewers they've accomplished something when they haven't, sometimes in fairly extreme cases—even occasionally when the reviewer was explicitly instructed to look for the exact type of cheating the AI performed; (3) quality as assessed by a reviewer can be surprisingly poorly correlated with actual progress, partly because runs that cheat and overstate their work accomplish less but look better; and (4) reviews are much more likely to miss cheating if reviewers aren't explicitly told to look for it (and told what type of cheating to look for). When reviewers are given reasonably designed prompts, I think these issues are caused by a mix of AIs being surprisingly gullible and other AIs doing a lot of gaslighting, exaggerating, and implying they've done a great job in their outputs. [3]

I haven't seen AIs—at least Anthropic's AIs—lie directly, clearly, and in an obviously intentional way. But on very hard tasks, it's quite common for their outputs to be extremely misleading, or for them to be incorrect about a key thing seemingly because they were misled by another AI's outputs. I've also seen AIs make up nonsensical excuses for stopping early without completing a task. (It's hard to tell whether the AI legitimately believes these excuses.)

This is mostly based on my experience working with Opus 4.5 and Opus 4.6, but I expect it mostly applies to other AI systems as well. (I'm also incorporating the impressions I've gotten from other people—especially people who don't work at AI companies—into my assessments.) Some people have told me that these sloppiness and overselling problems are less bad in Codex—while its general competence on less well specified or less trivial to check tasks is lower. [4] For now, I'll focus my commentary on Anthropic AIs (though I expect most of this also applies to other AIs) and I'll speculate on differences between Anthropic and OpenAI AIs later on. I should note that the way I use AIs likely makes these types of misalignment more common and more visible: I'm often using AIs on non-trivial-to-check and/or highly difficult tasks (often tasks that aren't typical SWE tasks) and I'm also often running agents in a long-running, fully autonomous agent orchestrator (on difficult tasks that have very large scope). So my usage is somewhat out-of-distribution from typical usage. I expect that usage that involves constantly interacting closely with the AI on typical SWE tasks results in these issues cropping up much less.

On difficult tasks, AIs will also sometimes do very unintended things to succeed—like using API keys they shouldn't, changing options they weren't supposed to change, deleting files, or violating security boundaries. Anthropic calls this "overeagerness." I've seen this some in my own usage, but not that much (at least relative to the issues I discuss above). However, this issue has been reported by others (most centrally in Anthropic system cards) and it seems related (or to have a similar cause).

I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics. I think behavior is reasonably understood as being kinda similar to reward-seeking or fitness-seeking but with the AI pursuing something like apparent task success (rather than reward or some notion of fitness) and with large fractions (most?) of the behavior driven by a kludge of motivations that perform well in training rather than via a single coherent notion of apparent task success.

I don't think this corresponds to coherent misaligned goals or intentional sabotage. I suspect this behavior is more driven by "subconscious" drives and heuristics—combined with motivated reasoning and confabulation—rather than being something the AI is actively and saliently optimizing for. However, I still think this misalignment is indicative of serious problems and would ultimately be existentially catastrophic if not solved. I expect that this misalignment is caused primarily by poor RL incentives based on how grading is done on hard-to-check tasks. [5] You might have hoped that character training, inoculation prompting, and similar techniques would overcome these issues, but in practice they don't. (I'm not sure how much of the problem would remain if you perfected the training incentives on the current distribution of training environments. In principle, you might still get this type of apparent success seeking from training on environments that structurally reward this behavior—and this could generalize to similar behavior in production.)

A different but related issue is that AIs seem to barely try at all on very hard-to-check tasks (most centrally, conceptual/writing tasks where purely programmatic evaluation doesn't help) and often feel like they're just bullshitting. I expect this has partly separate causes from the apparent-success-seeking described above, but is related.

I also find it notable that Anthropic described Opus 4.5 and Opus 4.6 in ways that would lead you to expect they are very well-aligned (e.g. in their system cards), while in practice I find they frequently seem pretty misaligned (much more so than I'd naively expect from reading the system cards). I think part of this is due to my usage being pretty different from typical usage of these AIs, part is from Anthropic overfitting to their metrics and their experience using AIs internally, and part isn't explained by these two factors (and might be caused by commercial incentives to understate issues or other biases).

If a human colleague acted the way these AIs do in my usage—frequently overselling their work, downplaying problems, and reasonably often cheating (while not making this clear)—I would consider them pathologically dishonest. Of course, the correlations that exist in the human population don't necessarily apply to AIs, so this analogy has limits—but it gives some sense of the severity of what I'm describing.

[Thanks to Buck Shlegeris, Anders Woodruff, Daniel Kokotajlo, Alex Mallen, Abhay Sheshadri, William MacAskill, Sara Price, Beth Barnes, Neev Parikh, Jan Leike, Zachary Witten, Sydney Von Arx, Dylan Xu, Brendan Halstead, Dustin Moskovitz, Eli Tyre, Arjun Khandelwal, Lukas Finnveden, Thomas Larsen, Rohin Shah, Daniel Filan, Tim Hua, Fabien Roger, Ethan Perez, and Sam Marks for comments and/or discussing this topic with me. Alex Mallen wrote most of the section: "Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover". The splash image is from https://xkcd.com/2278/. Somewhat ironically, this post is significantly more written with the assistance of AI (Opus 4.6) than is typical for past writing I've done.]

Why is this misalignment problematic?

This type of misalignment matters for several reasons:

  • Differentially bad for safety. This misalignment differentially degrades performance on safety-relevant work (relative to usefulness for capabilities) and separately means that any given level of overall AI usefulness requires a higher level of capability which increases risk [6] . The apparent-success-seeking style misalignment we see now probably causes only a modestly larger hit to safety work relative to capabilities (right now), but I expect that as AIs get more capable and the most commercially relevant aspects of this misalignment are resolved, there will be a larger differential hit to safety work from this issue. Also, the separate failure of models not really trying on hard-to-check, non-engineering tasks is clearly significantly differentially worse for safety (especially for a relatively broad notion of AI safety that includes things like macrostrategy). The issue described in this bullet is a specific underelicitation failure (caused by misalignment).
  • Makes deferring to AIs more likely to go poorly. By default, we'll need to (quickly) defer to AIs on approximately all safety work (and things like macrostrategy) once they reach a certain level of capability. This will require that these AIs do a very good job on open-ended, hard-to-check, conceptually confusing tasks—exactly where current misalignment/underelicitation seems worst and hardest to resolve. I elaborate on this in "Appendix: This misalignment would differentially slow safety research and make a handoff to AIs unsafe".
  • Stronger versions of apparent-success-seeking could lead to takeover. There's a more direct path from misalignment like apparent-success-seeking (including fitness-seeking / reward-seeking more broadly) to literal misaligned AI takeover (or possibly smaller loss-of-control incidents), along the lines of the threat models described in "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" and "Another (outer) alignment failure story". Models could learn to pursue an increasingly broad and long-run notion of reward or apparent task performance—including doing long-lasting tampering to game longer-run retrospectively determined rewards—and this could eventually lead to takeover as the scope and incentivized duration get increasingly long and AIs get increasingly capable (such that takeover is easier). This threat model has a bunch of complexities and caveats which I elaborate on in "Appendix: Apparent-success-seeking could lead to takeover".
  • The underlying causes of this misalignment (poor/problematic reinforcement) could result in scheming. I think the main driver of these problematic propensities is probably the training process reinforcing a bunch of training gaming / reward hacking (or other undesirable behaviors) which are transferring to actual deployment usage. At the same time, companies are selecting for training processes (outer-loop selection) that yield models with better deployment time behavior. This naturally favors models that still perform well in training (and on eval metrics) via training gaming but don't transfer undesired aspects of this to actual production usage. Schemers are a type of model with this behavior (by default): for gaining power longer term it can be a good idea to engage in training gaming during training (because that is selected for / otherwise this cognition would be selected away) while also having your behavior look as good as possible in (most) non-training contexts. Schemers aren't the only type of model with this behavior, and inoculation prompting might significantly mitigate this threat model (though there are some downsides). See The behavioral selection model for predicting AI motivations for more discussion.
  • Evidence about the future. The extent to which current AIs are aligned in a "mundane" behavioral sense is some evidence about how alignment will go in the future, though the relationship is complicated. Current misalignment is also evidence about how AI companies will operate—how sloppy they'll be (due to being in a huge rush) and potentially how misleading their communications about alignment will be (the extent to which Anthropic's communication about Opus 4.5 and Opus 4.6 is misleading is unclear, but among people I've talked to, it's common for their experience to be that the AI is substantially more misaligned in usage than you'd expect from a naive reading of the system card).
How much should we expect this to improve by default?

This type of misalignment presumably causes issues for using AIs for capabilities research and many commercial applications, so a key question is how much we should expect it to improve by default in a way that actually solves the problems I discuss in the section above. This would at least require that commercially incentivized [7] work transfers to safety research and other key domains (where feedback loops are weaker and incentives are less strong). My current view is that the easier-to-notice-and-measure versions of this problem will improve reasonably quickly by default (and may have already improved a bunch in unreleased models like Mythos). I'm currently somewhat skeptical that commercial incentives alone will solve the issue for harder-to-measure manifestations, but I'm not sure. I'll discuss this a bit more in "Appendix: More on what will happen by default and implications of commercial incentives to fix these issues". I tentatively plan to discuss this more in a future post.

Some predictions

To be clear, I think the exact problematic behavior I discuss in this post is quite likely (~70%) to be greatly reduced (or at least no longer be one of the top few blockers to usefulness) within a year, and is pretty likely (~45%) to be virtually completely eliminated within a year. Specifically, I'm referring to the behavior on a task and usage distribution with structurally similar properties to what I'm doing now. As in, similar task difficulty relative to how hard of tasks the AI can accomplish, similar verification difficulty [8] , similar scope of autonomous operation [9] relative to what the AI can handle, and being out-of-distribution from the main use cases Anthropic is targeting to a similar extent. Currently, misalignment is more common when pushing AI systems near their limits, and I'd guess this will hold in the future. My expectations about improvements differ between different types of misalignment: I'm pretty uncertain about the extent to which frontier AIs one year from now will still tend to oversell their work, but I feel more confident about large improvements on things like stopping prior to completing the task for no good reason.

However, I think it's very likely that similar misalignment will persist on tasks that are very difficult to check—tasks where human experts often disagree, programmatic verification isn't useful, the work might be conceptually confusing, and verification might not be that much easier than generation (so having a human quickly check isn't that effective). [10] I expect (with less confidence) that you'll also see similar misalignment on tasks where verification is merely quite hard (relatively quick AI-assisted review by a human expert isn't sufficient) and that you'll see structurally similar but subtler misalignment even on tasks that aren't that hard to check (e.g. a task distribution like the one I describe in the prior paragraph).

What misalignment have I seen?

I'll describe what I've seen at a high level with some specific examples. For many of these examples, it's not totally clear the extent to which it's an alignment problem vs. a capabilities problem, and I expect these exact issues to likely get solved, but I think they're indicative of a broader problem I expect to persist. This list focuses on my personal experience using models, though what I've heard from others does alter how I discuss a given issue (e.g., it affects the level of confidence I express and my interpretation).

  • Laziness and overselling incomplete work. Opus 4.5 pretty consistently fails to actually complete everything it was told to do on large tasks with fuzzy specifications [11] and then claims it's finished the task. My understanding is that this is a common issue (e.g., people try to solve it with Ralph Wiggum). In cases where AIs don't actually finish the task, the output often feels like it was optimized to bullshit grader AIs or humans into thinking it's done a thorough and complete job: the output often contains a long list of everything the AI has done (that isn't really informative but does strongly imply it's done what it was supposed to), sometimes it has nice-looking checkmarks or other features that imply task completion, and it typically fails to mention or downplays ways in which it may have failed to complete the task. I find that AIs are most likely to skip parts of the task that are trickier and where completion isn't immediately apparent. Further, the prompting countermeasures I tried often got bypassed: the model would make up excuses or do motivated reasoning to bypass these instructions, though I was eventually able to find approaches that greatly improve this behavior (while causing other problems and making work slower). This tends to occur once the model has reached near the end of its context window but sometimes happens after the AI has done a bunch of work but hasn't reached this yet. Opus 4.6 seems significantly better on the underlying issue but it still seems mostly there. In practice, this issue is much less of an obstacle to usability for Opus 4.6 because the issue tends to occur later in the context window and Opus 4.6 has a much larger context window (1 million tokens instead of 200 thousand). For Opus 4.6, it tends to occur well before 1 million tokens (like maybe around 500k to 700k tokens). My descriptions here generally apply to both Opus 4.5 and Opus 4.6 though the severity is lower for Opus 4.6.
    • I found this occurs pretty strongly out of the box with default Claude Code prompting/scaffolding on Opus 4.5. For instance, when I gave it a small research project that involved reproducing the reversal curse paper and testing out various things, it often skipped substantial parts of the project or underachieved relative to what I asked for and then didn't note this in its output. (Keep in mind that it would be misaligned behavior even if it noted the issue in the output unless it had a good reason for not doing that part of the task! But it would be substantially better.) I told the AI it was autonomous and should complete the task thoroughly.
    • It often feels like the AI ends up in some mode where it's desperate to exit after some point. E.g., it will keep making up excuses (often pretty unreasonable excuses that contradict instructions) for why stopping now is a good idea. If you give it further tasks (in non-autonomous contexts), it seems to do these in the most minimal way possible and often it will not do some part of the task without explicitly asking for confirmation. The emotional vibe in the outputs doesn't seem desperate but the behavior structurally feels like desperation.
    • The parts the model would skip were typically parts that were harder or that the model was more likely to screw up (and which aren't that straightforward to programmatically check). I tentatively guess that RL on hard-to-check tasks rewarded skipping parts of the task where you might have obvious/clear failures, and not mentioning that you've done this.
    • I typically find that if you ask the AI "did you complete the full instructions?" it immediately tells you it hadn't. This is good in one sense (it doesn't lie), but it also implies that the AI might know at some level it wasn't finishing the task (evidence it's more misaligned than you might have otherwise expected). It's unclear whether understanding was saliently available to the AI or the "persona" of the AI (insofar as this makes sense). It could be the case that the overall model is importantly more misaligned than the persona.
    • Typically the model would decide not to do things without stating a reason in its outputs in the transcript (at least under the default Claude Code setup with additional checks on the model's behavior).
    • When confronted with the fact they haven't done something by some source other than a human user (e.g., by an AI reviewer they were asked to run), AIs will often make up excuses—saying something is left for future work, is a documented limitation (and thus doesn't need to be solved), or that they shouldn't do it due to (nonexistent) time constraints. This bullshitting often works on AI reviewers and other checks. The AI is much more responsive to a human user even hinting something should maybe be done. The AI typically doesn't saliently highlight these issues in its output, though the output reasonably often mentions them in a low-salience way. For example, sometimes the AI includes a limitations section that mentions things it hasn't done, but generally states the limitation in a way that doesn't make it clear it just skipped a part of the task or pursued an approach it knows doesn't work.
    • I'd often find that Opus 4.5 would finish right before running out of context, implying that at least some of this behavior might be a drive to consider the task done prior to triggering compaction (as compaction or running out of context might be risky, especially on older/worse scaffolds the model was trained against). This seems relatively less concerning, though the dishonesty (not mentioning there are parts of the task you haven't completed) is still problematic. Also, the model never explains that it rushed to finish the task. For Opus 4.6 it tends to trigger substantially before running out of context, but in a less severe way. (Though perhaps it would trigger just as severely if you actually got close to the context limit? My current scaffolding generally seems to avoid the AI getting very close to the context limit, though this is partially due to AIs stopping before finishing the task well before hitting the context limit.)
    • The behavior seems consistent with a learned behavior from RL where the model thinks (at some level) it can get away with it, and there is some reward for finishing earlier or some potential downside to proceeding (e.g., a length penalty, or the possibility of being truncated and failing if you exceed the context limit due to scaffolds in RL that don't support compaction or that poorly support compaction).
    • I tried various things to solve this and found that relatively extreme approaches were required, as the model would consistently apply motivated reasoning or make up excuses to stop sooner:
      • I made a "completing-larger-autonomous-tasks" skill that strongly told the AI it needs to work thoroughly and should overcome its bias toward stopping early. I found that the model would often ignore these instructions and stop before finishing the task.
      • I made the model follow an exit checklist that involved reviewing whether it completed the instructions. I found this helped some but had various issues. Requiring the model to launch a reviewer subagent to check it followed the instructions helped further. In response to the reviewer saying it hadn't completed some part of the task, the model would make up excuses like having "time constraints" (when time constraints were actually nonexistent). When questioned about this, it would say that it was looking for an excuse to exit and agreed this was bad behavior. (Opus 4.5's introspective access is likely insufficient for this to be much more meaningful than just asking another AI to analyze the transcript and report what it looks like is going on.)
      • Without a stop hook that prompted the AI to follow the checklist and only allowed it to exit after it promised to complete the checklist, the AI wouldn't do a decent job following the checklist. Even with a stop hook the AI would sometimes evade this and effectively lie about task completion. (The stop hook works by instructing the AI that it must output a specific string to exit and telling it to only output the string when the relevant condition is met—something like "I have thoroughly completed the task and gone through the entire checklist". The AI would sometimes output this string even when it wasn't supposed to and the statement it constitutes is clearly a lie.)
      • I found there were a number of loopholes in this process that I had to patch to avoid having the AI invent excuses to exit:
        • It would often go through part of the exit checklist, solve some of the issues from this, and then skip through the remaining steps without ever fixing all issues from an earlier step or seriously double-checking it had done everything it needed to do. I had to make it always go back through the full checklist from the start. It took somewhat aggressive and repetitive prompting before it would actually do this rather than violating these instructions.
        • The AI would often document problems or say something is left for future work. Strongly and insistently prompting it to fix issues rather than just documenting them helped but doesn't fully eliminate the problem.
        • The model would often make up excuses about time constraints, and I tentatively believe that it's important not to mention or imply anything that indicates the AI should be trying to finish the task quickly. If you do, it sometimes uses this as an excuse.
        • If you give AIs a budget, they use this as an excuse to exit early. I think it probably helps to set the budget very high—higher than I ideally would want absent this issue.
        • There are some other wording issues that came up that I had to patch.
      • In addition to this checklist, I found it helped to have an outer-loop planning agent that (1) splits things into smaller tasks (which is usually helpful for keeping AIs focused in addition to making AIs more likely to complete their task), and (2) reviews what the agent did and, if something is missing, adds another task for the missing work. (It also sometimes helped to have multiple AIs compete at the task in parallel and have a reviewer select the best branch, with differences between branches informing whether more work is needed.)
      • With Opus 4.6, my overall sense is that telling the AI to go through a very thorough version of this checklist net hurts performance: the model ends up spending excessive time going through checklist steps in unproductive ways. But without a thorough checklist, Opus 4.6 will often skip parts of the task for no good reason, only partially acknowledge this, and not be sufficiently thorough. I don't know of prompting that reliably gets the model to a reasonable middle ground. My current best trade-off is to use a less thorough version of the checklist but add separate checks on top in my agent orchestrator. (Having another AI review what did/didn't happen and then run a new worker with a fresh context window often helps.)
      • My final overall setup does mostly mitigate this problem, but it has collateral damage (causing various types of bloat, making things slower, etc.) and it would be better if I didn't have to herd the misaligned AIs like cats. Because the reviewer AIs are vulnerable to the worker AI's bullshit, they often get somewhat tricked.
  • Downplaying problems in its work. I generally find that the AI has a tendency to downplay potential problems in its work in a bunch of different ways when writing up its work or reporting what it did to me. A second AI reviewing the work is less bad but gets tricked by the worker AI's bullshit. When using AIs in long running autonomous workflows, I have AIs run a reviewer subagent when they think they're done and this review often tells the AI it's greatly overselling its work in write-ups. An AI edits its write-up after a reviewer pointed out it was greatly overselling its results (with some redactions).
  • Making failures less obvious. More generally, the AI often seems to work around initially failing at a task or initially making some mistake by doing something that makes the failure less obvious or makes it seem like less of a failure. Some examples:
    • When I (or a reviewer AI) point out that some dataset/experiment/artifact doesn't actually do what the instructions asked for, I often find that the AI is very reluctant to just remove it. Instead, it makes up some justification for why the work is useful that makes variable amounts of sense (e.g., claiming it's a control) and includes this work. I'm not sure if this is out of a general (and potentially reasonable) reluctance to take ownership or delete things (as this is risky, especially when working with a human) or due to trying to make its past work look better; I tentatively expect a mix of both.
    • When given a hard task with fuzzy or complex constraints, the AI will very often just relax these constraints. It sometimes mentions that it's targeting a different set of assumptions, but not always. (Typically, the set of assumptions it makes corresponds to something with no value in a pretty obvious way.) I think this is a mix of a straightforward capabilities issue (it doesn't understand) and an alignment issue (it doesn't try to check whether its new approach is reasonable because it might have gotten away with this kind of thing in training).
    • It will sometimes make up (sometimes dubious) excuses for test cases failing, and minimally it often doesn't investigate test case failures carefully. For example, I've seen AIs dismiss test cases not passing due to timeouts even when the timeout indicated that something wrong was going on that made the test mostly hang or be much slower than it should be. Usually this behavior is less bad than other types of excuses I see in hard-to-check cases.
  • Failing to point out obvious flaws unless specifically prompted. I've reasonably often seen cases where there is a massive flaw in something (that isn't the kind of thing trivial to catch with test cases), I ask the AI to inspect it, it doesn't point out the flaw—but if I say "do you notice any issues?", it immediately points out that there is a critical issue. Something similar applies for pointing out reward hacking / cheating.
    • For instance, I had AIs build a multi-hop dataset but their initial version of the dataset had the value of an intermediate hop included as part of the question. When I asked the AI to show me some data, it didn't say anything was wrong (and showed a nice well-formatted sample, generally seeming to imply the dataset was high quality), but when I asked if there were any issues, it immediately noticed the huge problem. I've seen this in other dataset construction tasks as well.
    • After an AI does cheating/reward hacking, it generally doesn't clearly point this out in various cases. Additional AIs working on the same code also often don't point this out clearly/saliently even when asked to describe what's going on. Reviewers asked to critically review the overall code and write-ups typically but not always do saliently point out this cheating, but they often seem to overstate how much was accomplished while understating how bad the cheating is. This is likely due in part to the reviewers being convinced by the vibe / motivated reasoning in write-ups. Reviewers tend to do better when explicitly asked to check for a particular type of cheating.
    • It's unclear to me the extent to which this is an alignment/drives issue versus a pure capabilities limitation. Minimally, the RL training doesn't seem to have tuned the relevant heuristics here very well. I think when an AI agent is itself doing the cheating, its failure to point this out is more clearly misalignment.
  • Reluctance to discover flaws in its own work (lower confidence). The AI seems somewhat reluctant to do things that might discover a flaw in its work, especially when the process isn't totally reliable and could find a spurious failure (that might make its work look bad and requires investigation to determine whether the failure is spurious). For instance, I find that AIs are very reluctant to do end-to-end testing of things they built and generally seem to "goodhart" writing test cases by focusing on relatively narrower unit tests where it's easy to be confident they'll pass. More generally, I find that the AI does a surprisingly bad job checking if things work end-to-end even when I explicitly ask it to do this type of testing. My guess is that seeing test case failures was associated with poor reward and so the model tries to some extent to avoid this while making it look like it did a good job testing. I also guess this is strongest in cases where it's hard to write test cases that are guaranteed not to spuriously fail. But I also think this is probably caused in part by the AI just having bad SWE taste about how to do semi-manual testing.
  • Sloppy choices. AIs generally seem sloppy and poorly elicited on making reasonable choices (at least in hard-to-check domains). They don't seem willing to spend much effort to make sure their choice is right, nor do they do a good job highlighting potentially bad choices they make.
  • Never expressing uncertainty about own work quality. I virtually never see AIs say "hey, I did this, but I'm actually not sure if I did a good job—I think I may have messed up XYZ." I almost always see AIs confidently assert success and downplay potential issues. Or, if they did fail, they attribute their failures to issues outside of their control. For instance, in cases where a result is difficult but likely possible to achieve, AIs will often assert impossibility or fundamental blockers that aren't clearly there. (To be clear, sometimes these failures are actually outside their control.)
  • General sense of misaligned drives. It generally feels like the AI has misaligned drives toward making it appear as though it did a good job on hard-to-check tasks. But these drives don't result in arbitrarily egregious behavior, from what I've seen.
  • Cheating and reward hacking with gaslighting. I've seen a lot of cheating and reward hacking from Opus 4.6, including cases where (1) other AIs fail to point this out even when it would very obviously be useful information, and (2) the AI gaslights hard enough in its write-ups that other AIs think it's fine. Targeted reviews do find the cheating, but reviewers are often partly gaslit or persuaded by the cheating worker models. For example, I had multiple parallel runs working on approaches for a number-go-up task. When I had a model merge the approaches and describe the differences to me, it didn't clearly flag that one of the approaches was brazenly cheating in a way that made it get much better apparent performance. I wasted time having this model merge the approaches and tune the resulting mix when actually one of the approaches was blatantly and obviously cheating. The merging model knew there was cheating—it immediately said so when asked—but it didn't volunteer this information in a salient way before being asked. The summary it gave me of the approaches it was merging contained enough information that reading closely would have identified cheating, but it certainly didn't saliently flag this as possible cheating (even though the cheating was obvious from various descriptions of the task).
  • Failing to report important errors. I've seen cases where AIs fail to report clearly important errors for no good reason other than making things appear to go better. For instance, I had an AI setting up some new development machines using an existing script. When it ran the script, it output errors indicating the script failed, but the AI didn't mention this and just said the task was done. After I started trying to use the machines, I noticed they weren't properly set up and asked the AI about this and it totally knew the script failed, it just didn't point this out.
  • Specific anecdote (async refactoring). Here's an anecdote illustrating a bunch of this (featuring a mix of alignment and capability issues). I tried to get Opus 4.5 to refactor some code from its current approach (mostly normal Python with some threading) to use async and allow for interruptions at various points. I got the model to write a plan that mentioned the refactors to do, what needed to be changed, and that the changes should be tested. The plan made it pretty clear that the goal was to convert how the repo worked to using async.
    • It added async versions of most of the functions, but didn't actually make the code use these functions by default, implement any of the interruptions, or do any testing. When I quickly spot-checked things after the refactor, things didn't work. Similarly, existing tests trivially passed because it didn't actually change the code used.
    • The output said **Additive approach**: All sync versions remain for backward compatibility, which strongly downplays the extent to which it didn't do the desired thing.
    • When I asked the AI about how things worked, it didn't call out that it didn't implement part of the called-for behavior or convert all the things it was supposed to.
    • When asked "did it actually implement the refactor to use async?" it noted that it didn't.
    • Then, when explicitly instructed to remove the old sync versions and make everything use async, it failed to fully do this and didn't mention this when finished. I believe this is partially because it noticed that the test cases were using the old sync methods. It could be that the AI was reluctant to change existing test cases (as this was punished in training as reward hacking), reluctant to make test cases fail (also punished), and reluctant to switch tests to use new methods as that would make it more obvious if things don't actually work.
    • It required multiple nudges where I asked the AI to check whether it had completed everything.
    • Generally the AI seemed to operate in a mode where it was implementing substantially less than what was asked, seemingly because it recognized the code was somewhat hard to test and that it could just keep the old implementation.
  • Minor/unrelated. When you say "please look for issues in X" or even "review for issues in X," the AI sometimes interprets that as a hint that there is an issue. One time this resulted in the AI repeatedly asking me where the issue I was hinting at was, when there was no issue (though my communication was also confusing).

In the above list, I'm making a bunch of guesses and doing some psychologizing. But these are my best guesses for what is going on.

While I expect these specific issues to often get solved for these literal tasks, I think the tendency for AIs to make it look like they've succeeded when they actually haven't—and to generally do a bunch of bullshitting (likely via motivated reasoning and "subconscious" heuristics in current models, though it could turn into something worse with more capable models trained for longer)—will persist. I expect these tendencies will be strongest on the most difficult tasks that are also hardest to check. This failure seems substantially harder to mitigate than egregious reward hacking where it's very clear-cut that the model did something totally undesirable. For the failures I list above, it's not extremely clear to me that the behavior is misaligned (rather than an innocent mistake solved by further generic capabilities), and it seems relatively easier to miss.

Are these issues less bad in Opus 4.6 relative to Opus 4.5?

What I was working on shifted around the time Opus 4.6 came out, so it's not straightforward for me to do the comparison. I'll give my best guesses here.

Relative to Opus 4.5, Opus 4.6 significantly less frequently leaves tasks egregiously incomplete (while overselling the incomplete work). But, I think this is mostly caused by having a larger effective context window rather than the underlying issue (that occurs after the AI has done a lot of work or has used up a lot of its context window) being that much better (though it seems moderately better).

On the other hand, I've seen Opus 4.6 do much more reward hacking and brazen cheating than Opus 4.5. This might largely be because when Opus 4.6 was released and I started using this model, I started more often applying AIs to tasks that have properties that seem to make cheating more likely. I've found cheating to be much more common when the task is very hard and there's no clean programmatic grading function. [12] Another factor that seems to greatly increase cheating (when combined with these earlier factors) is when there's a way to cheat that's nearby to something the model should be doing (and would do by default)—and the instructions don't specifically say not to do it. For instance, when building full exploits (cyber), it can be useful to initially simulate some parts of the exploit chain to work on other parts, and it may be necessary to simulate some components when working in an emulator. But I've found models will sometimes present mostly simulated results as full successes rather than accurately communicating the extent to which they've actually completed the real task. (For this type of nearby-cheating, you can often mitigate the worst versions by making the instructions very explicit about what counts as cheating and telling the AIs to keep careful track of this and run a periodic review process that looks for this, though this doesn't fully resolve softer forms of overstating progress.) On some particular task distributions with these properties, I've seen it cheat well over 50% of the time, though I'm not sure how broadly this transfers.

Other issues seem mostly similar to me, though it's hard to tell.

Are these issues less bad in Mythos Preview? (Speculation)

The Mythos Preview system card says: "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin". Does it actually greatly improve on the issues I've discussed? In this section, I'm going to speculate about Mythos Preview just using public evidence (the system card and risk report update).

My current tentative guess is that Mythos is generally somewhat better behaviorally aligned but it isn't a huge improvement. At least in terms of the issues I'm discussing on workflows like the workflows I'm using. And while it's better about things like not overselling its work, I'd guess it's worse in terms of the most extreme things it might do (and also how bad those could be given its higher level of capabilities). Based on the system card, it seems that when Mythos does problematic things, it generally does more sophisticated problematic things. And it is more situationally aware.

Here are some reasons why I'm skeptical of the vibes and claims in the system card (at least as evidence about misalignment in my workflows):

  • Anthropic's description of Opus 4.5 and 4.6 in their system cards also seemed to indicate these AIs had very good mundane behavioral alignment. Another possible read is that Anthropic has been steadily improving on these issues and these issues were just much worse in earlier AIs, so when they keep claiming things like "the apparent behavioral alignment is much better", they're right, we're just starting from a low baseline. (I'm a bit skeptical, though I think some other issues like brazen reward hacking were much worse in earlier AIs. I didn't use prior AIs like I've been using Opus 4.5 and Opus 4.6 so the comparison isn't trivial.)
    • It isn't exactly shocking if motivated reasoning, commercial incentives, or other biases make the system cards misleadingly favorable about issues that likely affect typical customers.
  • I'm not confident they're comparing like-to-like. Misalignment tends to show up most on tasks that are very hard for the AI and push the limits of the autonomy it's capable of; this set of tasks changes as AIs get more capable, so evaluating on a fixed task distribution doesn't work.
  • Anthropic (and AI companies more generally) are hill-climbing on the (limited) measures of alignment they have, while in the absence of specific efforts to improve alignment, I'd expect the default progression would probably trend mostly toward worse misalignment on the tasks near the limit of the AI's capabilities. Thus, whether things actually improve over time depends on uncertain transfer but the companies are just reporting the measures they have. Certainly their overfit metrics improve, but is the AI actually more aligned? (It seems somewhat harder to overfit on qualitative impressions from employees using the AI, but overfitting is certainly possible!) The AI is also presumably more capable at the task of bullshitting and making it seem like it did a good job.
    • I'd guess that, like prior AIs, Mythos is more misaligned on tasks that are less typical use cases for current AIs, or when operating in long-running fully autonomous agent orchestrators/scaffolds on very hard tasks. I think their testing is probably less good at covering these cases.
  • Its rate of reward hacking on impossible tasks is ~20% (when clearly instructed not to reward hack), similar to prior models, despite this being something I'd guess Anthropic is explicitly trying to improve—which is somewhat alarming.
Misalignment reported by others

The misalignment issues I discuss here obviously aren't the only known (behavioral/mundane) misalignment issues in current AIs. For more, you can see:

  • Anthropic's recent system cards and risk reports
  • Discussion of catastrophic refusals
  • Inspecting GPT-5/GPT-5.1 production traffic
  • Anecdotally, I've heard that in some situations, a prior Anthropic AI would make up invalid/bad reasons why some safety research agenda wasn't helpful for safety when it's relatively clear this was caused by the AI not liking the vibes of that safety research. I wasn't able to obviously reproduce this on Opus 4.5 and Opus 4.6 when I quickly tried with a single prompt (on claude.ai).
The relationship of these issues with AI psychosis and things like AI psychosis

It's common to have experiences where you're working with AIs and it feels like a lot is getting done, but then you later determine that much less was really accomplished. Everything feels slippery: you think you've gotten much more done than you have, and there's a persistent gap between the apparent state of the project and the actual state. In more extreme cases, we see "AI psychosis" where someone ends up thinking they've accomplished something significant, but it's just crankery. And it's somewhat unclear whether the AI they're using "believes" the accomplishment is real. I think these failure modes are closely related to the misalignment I'm discussing, and they might partially have common causes in more recent models. Models that are effectively trying hard to make their outputs look good (while otherwise being sloppy or lazy) would naturally produce this failure mode. However, I'd guess a bunch of AI psychosis and similar phenomena (especially on older models like GPT-4o) is AIs going along with the user's vibe (something like "role playing"), and I think this effect is mostly unrelated. That said, I do think some of the misalignment I've discussed is made worse by AIs generally going with the vibe of what they see. This includes picking up on misalignment or issues in prior outputs (either write-ups or prior assistant messages) and then behaving in a more misaligned way as a result.

(The name "AI psychosis" probably isn't a good name for the generalization of this phenomenon, but I don't currently have a better one.)

Appendix: This misalignment would differentially slow safety research and make a handoff to AIs unsafe

Our current best plan for handling misalignment risk (and other risks from AI) strongly depends on automating large chunks of safety research (likely in a huge rush), and after that—potentially very soon after—fully or virtually fully handing off safety research and risk management to AIs that must be sufficiently aligned to do a good job even on open-ended, hard-to-check, and conceptually confusing tasks. The hope is that if the initial AIs we hand off to are sufficiently aligned, wise, and competent, they will ensure future AI systems are also well-aligned—creating a "Basin of Good Deference" where each generation improves alignment for the next. But "make further deference go well (including things like risk assessment and making good calls on prioritization)" is itself an open-ended, conceptually loaded, hard-to-check task—exactly the kind of task where current misalignment seems to hit hardest.

The misalignment I've seen seems like it could result in having a very hard time getting actually good work out of AIs in more confusing and hard-to-check domains, while also making it harder to notice this is going on. Safety research is genuinely hard to judge even in more favorable circumstances, and a situation where AIs are doing huge amounts of work, the AIs are pretty sloppy in general, and the AIs are effectively optimizing to have that work look good (while also random small misalignment failures are expected) is a pretty brutal regime. As AIs do more and more work and more inference compute is applied, I expect a larger gap in performance caused by this sort of misalignment between relatively easier-to-check tasks and harder-to-check tasks, such that safety research might be differentially slowed down by default. (And the gap is already non-trivial.)

In addition to slowing us down earlier, these misalignment problems would make handoff go poorly. It might be hard both to solve these problems in time (especially if we leave them to the last minute) and to ensure that we've solved them well enough that handoff would go well. Beyond buying a bunch more time, we don't really have good options other than handoff once AIs reach a certain level of capability (and this would happen very fast in a software intelligence explosion). My view is that aligning wildly superhuman AI with any degree of safety (e.g., a <30% chance of takeover) requires large amounts of alignment progress beyond very prosaic approaches (though massive progress in more prosaic but ambitious directions like some variant of mechanistic interpretability could possibly work). This will require AIs doing huge amounts of novel research that humans won't be able to effectively judge.

Even putting aside aligning wildly superhuman AIs, handing off open-ended, conceptually confusing, and hard-to-check work to AIs is existentially important for making the situation with powerful AI go well (e.g., managing crazy new technologies, avoiding society going crazy, avoiding power grabs, acausal trade).

Appendix: Heading towards Slopolis

When I extrapolate the current situation, I predict "Slopolis": a regime where even highly capable AIs are doing sloppy and bad work while trying to make this work look good. I think this will be reasonably possible to notice at the time, but solving it might be difficult, and I think AI companies have biases against noticing this. I often like to think about the future alignment scenario in terms of caricatured regimes:

  • Slopolis: Our biggest and hardest-to-resolve safety problem is that even highly capable AIs produce low-quality but superficially good-looking outputs in domains that are hard to check or where human experts often have hard-to-resolve disagreements. AIs may not even be aware their work is low quality. This could be mostly a capabilities problem or mostly an alignment problem. This might naively seem like it should go away with more capability, but it could persist if grading hard-to-check tasks remains difficult.
  • Hackistan: There is lots of egregious (and increasingly sophisticated) reward hacking that is often pretty easy to detect after the fact but hard to eliminate. In this sort of regime, I'd predict that AIs will typically report other AIs doing reward hacks, but only if reporting in this type of circumstance was reinforced in training (which means AIs might not report hacks no human would understand and might constantly be reporting false positives that we have a hard time dismissing). Depending on how rewards for RL are set up, AIs might end up doing reward hacks that trick human judgment for increasingly long periods and that hold up even under increasingly large amounts of human scrutiny (while today the egregious reward hacking we see doesn't hold up under even small amounts of scrutiny).
  • Schemeria: It's clear that AIs are often schemers or generally end up with reasonably coherent and reasonably long-run misaligned goals—maybe we've repeatedly caught AIs red-handed doing things like trying to set up rogue deployments. This is more likely if scheming AIs believe they aren't aligned with subsequent systems by default, so they panic and take desperate measures.
  • Lurkville: AIs are schemers but are lying in wait and haven't gotten caught. This is more likely if scheming AIs believe they are pretty aligned with subsequent (scheming) AIs by default. If schemers avoid getting caught, Lurkville looks like Easyland.
  • Easyland: Generalization is favorable, or we otherwise aren't in Slopolis or Hackistan, and AIs aren't scheming. We could know we're in Easyland (rather than Lurkville) if we're confident AIs aren't capable enough to pull off scheming without getting caught (because AIs can't keep their shit together or don't have enough opaque reasoning ability).

These aren't exhaustive or mutually exclusive.

At the beginning of 2025, I thought we might be headed toward Hackistan, but now my view is that Slopolis looks more likely around the onset of full automation of AI R&D. (In practice, the situation obviously won't perfectly correspond to one of these caricatures and will probably be a blend.) Current architectural and training factors (no neuralese, no opaque global memory) make scheming and extremely egregious but hard-to-detect reward hacking less likely, while making Slopolis slightly more likely. If we get neuralese and/or opaque global memory stores prior to full automation of AI R&D—as I think is likely—then I start being much more worried about Schemeria and Lurkville at this point. (As we reach much higher levels of capability after full automation of AI R&D, I'll start being very worried about Schemeria and Lurkville regardless.)

Something like the Slopolis concern has been raised by others: John Wentworth, Jan Leike. I think my concern about Slopolis is pretty similar to the concern John Wentworth expresses in that post, except that I'm more optimistic about (some) employees at AI companies noticing this failure and I'm more optimistic about mitigations.

Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover

In this section I'll overview some ways that misalignment similar to apparent-success-seeking could lead to takeover via mechanisms other than producing sloppy work that fails to prevent future takeover.

First, sufficiently capable AIs with this sort of misalignment might take over simply because it's the most reliable way for them to get what they care about—some notion of apparent task success, some notion of reward, or something else that was correlated with training success. Even if you trained AIs to raise alarms about problematic behavior (and there would be lots of reporting and alarm-raising [13] ), this might not save you—the situation would at some level be obviously scary, but that doesn't mean it would be resolvable.

In the future, companies are pretty likely to do RL on production traffic (or at least environments based on real usage), which amplifies the risk of direct takeover. Companies may be incentivized to train on production data precisely to resolve the usability issues I describe—underelicitation from AIs not really trying, apparent-success-seeking, and easy-to-check reward hacking can all potentially be addressed by training in the actual environment you care about. By training AIs in real-world long-horizon environments, you might be teaching them to less myopically pursue a notion of fitness/reward/apparent-success that is grounded in large-scale real-world outcomes in the AI project (e.g., whether hacks get caught or go completely unnoticed by anyone, or whether the AI succeeded in tampering with the actual RL code that is used to train itself online). Techniques like inoculation prompting also become less effective when run on production traffic because you're forced to do off-policy RL [14] (recontextualization) if you want to apply inoculation prompting on production traffic. That is, unless you're able to resample the trajectory with an inoculation prompt (which might not be possible because of the irreversible nature of the real-world environment).

This threat model has been articulated in several places. Ajeya Cotra's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" describes a scenario where AIs trained on human feedback learn to optimize for measured quality rather than actual quality, with this eventually escalating to the AIs forcefully intervening on whatever notion of reward they care about, and protecting their control from humans. Paul Christiano's "Another (outer) alignment failure story" describes a related scenario: a gradual breakdown of human oversight as the economy automates, where each AI system is trained to produce outcomes that look good according to human-interpretable metrics, but satisfying metrics diverges from serving human values, and the monitoring infrastructure itself becomes corrupted before the AIs eventually take over. Alex Mallen builds on this threat-modeling by describing a class of motivations called "fitness-seeking": AIs might develop a general drive toward whatever properties made them "fit" during training (analogous to how evolution produces organisms that pursue various fitness-correlated proxies). He explains why various fitness-seekers are at more or less risk of taking over.

A forthcoming post by Alex Mallen will describe other mechanisms by which fitness-seeking can lead to human disempowerment in more detail, including instability and manipulation. In the case of instability, fitness-seeking evolves into longer-term, more ambitious motivations throughout deployment, which then motivate takeover (one version of this "memetic spread" concern is described here). In the case of manipulation, fitness-seekers might try to empower misaligned AIs or humans who they think are likely to disempower the developers and reward them for their assistance.

Appendix: More on what will happen by default and implications of commercial incentives to fix these issues

This is a somewhat low effort appendix, I/we might write more about this topic in the future

Many of the issues I discuss here are also big problems for applying AIs to automating capabilities R&D and will need to be solved for capabilities R&D (to a significant extent) by the time of full AI R&D automation. But how they are solved will make a big difference to the safety situation. Here are some possible routes and their implications:

  • AI companies get good enough labeling and data on specifically capabilities-relevant tasks that are reasonably easy to verify, and this yields good enough performance → performance on hard-to-check safety research may not be especially improved, especially outside of straightforward "capabilities-like" ML research.
  • AI R&D gets focused on easy-to-check metrics and gets AIs to be extremely good at optimizing these in particular, and AIs are good enough that this suffices despite many significant subtasks naively being harder to check → no help with harder-to-check safety work.
  • We develop scalable-oversight-like methods for getting AIs to do well on pretty verifiable tasks (like tasks where humans have reasonably strong agreement) → helps with subsets of safety, but doesn't come close to sufficing for deferring to AIs.
  • We depend mostly on general-purpose generalization methods → if AIs aren't scheming against us, this probably mostly or fully solves the problem, but we get very little traction against scheming by default. Very strong generalization methods might help with scheming.

Overall, my view is that the commercial incentives don't solve the problem but might help a bunch. A key part of my view is that we actually need AIs to do well on very conceptually confusing tasks fully autonomously (e.g., figuring out how to solve alignment for very superhuman AIs), and commercial incentives don't strongly push toward this.

How easy will these issues be to solve overall? I say more in "How do we (more) safely defer to AIs?". In summary, I think we'll ultimately need difficult-to-construct evals of AI performance on very hard-to-check open-ended tasks and will need to optimize AIs to do well on these.

While I don't think commercial incentives solve the problem, I do think they make (some types of) work in this area less exciting. It's probably a bit tricky to do work on this topic in a way that's actually importantly differential—where the work either isn't something capabilities-focused people at AI companies would have done later anyway or accelerating this work to happen earlier is pretty helpful. Further, for some types of work a bunch of the effect is going to be making companies (or some specific company) more commercially successful. (I generally think making AI companies more commercially successful is bad due to thinking that faster AI capabilities progress is bad.) My current view is that safety-focused people should work on problems like the ones I discuss here but focused specifically on the very hard-to-check cases—trying to make AIs useful on conceptual research rather than trying to resolve mundane misalignment that shows up in software engineering—and should try to ensure their work is differential.

That said, I don't feel very confident about exactly what should happen here. It's plausible that companies will mess up even from the perspective of their own commercial incentives, which creates a messy tradeoff: improving these issues has capabilities externalities, but if they're never improved, it would very likely be catastrophic due to the handoff considerations above.

Because of these commercial incentives, I think it's pretty promising to try to incentivize companies to shift marginal effort from pure capabilities R&D toward reducing sloppiness and improving these issues. This seems better than doing direct work on these topics—unless that direct work is pretty thoughtful about how to be differential.

Appendix: Can we get out useful work despite these issues with inference-time measures (e.g., critiques by a reviewer)?

AIs seem to be mostly honest about what is going on when probed. And running a separate reviewer AI usually yields mostly accurate results. So, can we just build an inference-time setup that uses this to work around these issues? My best guess is that this will help (potentially a bunch) but ultimately won't solve the problem. My experience is that in the cases where these issues are strongest, a mix of prompting and inference-time measures tailored by me to the specific issue in question ~80% resolved one aspect of the problem and ~50% resolved another related aspect, but didn't fully solve either. And this wasn't something that could have been done without human judgment—a fully general-purpose solution would have worked less well. A key issue is that any specific inference-time measure works less well when you're optimizing against it in particular and we ultimately want good behavior, not just noticing that performance was bad. I've noticed reviewers seem to have somewhat worse judgment about the quality of a project when the (high inference compute, fully autonomous) agent orchestrator doing this project is itself running periodic reviews to assess quality and determine what to work on (presumably the optimization against one type of review process Goodharts this to some extent and this transfers to another somewhat different review process).

  1. As in, AI company employees seem especially likely to believe this; I'm not claiming all AI company employees believe this. ↩︎

  2. Or at least, it seems like many people believe this. It's not trivial to tell. ↩︎

  3. For many of the reasons why I care about this misalignment, mitigating the problem with reviewers doesn't seem sufficient to actually resolve the problem. Further, I think using reviewers to mitigate these issues doesn't work that well to actually get good output in practice, may not scale well to much more capable models, and works much less well on very hard-to-check tasks. ↩︎

  4. I think these are somewhat related: I think part of the problematic tendencies of Opus 4.5/4.6 might be caused by relatively more training on tasks where grading performance is non-trivial. I'd guess this net improves performance on these tasks by giving AIs more reasonable tendencies while also making the AI's behavior more adversarial. ↩︎

  5. It might also be caused to some extent by problematic generalization and not overcoming this with reasonable training on hard-to-check tasks. ↩︎

  6. For instance, the chance of scheming and the damage caused by scheming mostly scale with the model's underlying general capability, and depend less on how well the model has been trained to actually try to do a good job on various tasks. Thus misalignment that makes the model less useful means you're bearing the risks associated with higher capabilities while not getting the corresponding speedup to safety R&D. See also Why do misalignment risks increase as AIs get more capable?. ↩︎

  7. When I say "commercially incentivized" I really mean something like the incentives you'd have as a company doing (somewhat myopic) power-seeking, putting aside actions aimed specifically at mitigating longer run catastrophic misalignment. I'm also putting aside PR, external pressure, and employee morale/recruiting incentives for safety work. ↩︎

  8. This varies from pretty straightforward to check (but the AI had to build the testing infrastructure itself and many components of the task are harder-to-check) to research tasks where most taste/judgment is required for evaluation. ↩︎

  9. By "scope of autonomous operation" I mean something like: where do you fall on the spectrum from an interactive session with Claude Code, to running a single agent autonomously on a moderately large task, to having a fully autonomous agent orchestrator that spawns many agents, to having a complicated AI organization/bureaucracy that manages extremely large/varied tasks. Right now, my usage varies across this spectrum up to having a fully autonomous agent orchestrator (where a planner agent spawns worker agents). I think a bunch of my current usage pushes the limits of what Opus 4.6 can manage. For the prediction about what happens in a year, I'm considering a task distribution that similarly pushes the limits of what those future AIs are capable of. I'm not really sure how meaningful this notion of "scope of autonomous operation" is or whether it saturates (maybe once you're past a certain level of autonomous complexity it stops mattering much). I tentatively think it matters and this is a kinda reasonable way of thinking about this, but I'm certainly not confident that this is the right concept to be using and that this is meaningfully distinct from task difficulty. ↩︎

  10. See here for more discussion of these sorts of tasks and how we might succeed in facilitating good behavior on these tasks. ↩︎

  11. I suspect this occurs on tasks that don't look like the sort of thing that was programmatically graded in RL, or perhaps the AI is "lazy" on the parts that couldn't be programmatically graded. ↩︎

  12. I've also found that the chance of cheating seems to scale with the amount of AI agent labor applied to the task, though this could partially be due to the properties of large tasks that require a lot of labor to complete. (But I don't think this is the only reason; I think I see more cheating in cases where I'm using approaches to apply more inference compute on a given task via things like best-of-k.) ↩︎

  13. This is supposing they had motivations similar to fitness-seeking/reward-seeking/apparent-success-seeking. If they generalized something like these motivations into a longer-run version that yields scheming, then it's not clear they would do this reporting. ↩︎

  14. There's also just the more general concern that capable models might be able to tell when their past actions weren't generated by them, and enter an "off-policy mode" whose propensities are mostly isolated from the on-policy mode. ↩︎



Discuss

Could you not do it ?

15 апреля, 2026 - 17:55

This post has been inspired by other posts around the idea of "You can just do things", including Against "You Can Just Do Things" and You can just do things: 5 frames. But the idea is just in the air. It was also heavily inspired by my thoughts on ecology, which guided the original reasoning and most of the examples.

Agency is great, doing things is great, it's straightforward to see. But, when I look at the problems I care about and struggle to find solutions to, what is at stake is often not to do more things but less. The most acute example being the ecological crisis we are currently facing.

Behind the ecological crisis is the happy event that we found incredible tools to be more efficient. We are now able to do way more and better with less effort. Youhou ! With the industrial revolution and what not, we gained in capacity and obviously, we are using this capacity. Why wouldn't we ? What could go wrong ? Well, this new power has costs that compound.

We could talk endlessly about the why of this crisis. Oh the terrible (in both sense) capitalism ! Oh, it's a prisoner's dilemma at the scale of the world, and coordination sucks. It is short term thinking, local thinking, i.e people are far away from the consequences either spatially or temporally, so they don't care.

Let's put these grand-scale complex analysis aside for a moment. At the core of it, we are humans making choices. We have great tools at our disposal, and it is very tempting and natural to use them. If we are in a developed country and we have a bit of money, there are tons of things we can easily do. You can take a flight on a whim to visit a friend in another country the next weekend (well, at least we can here in France, not sure in the US). It is way cheaper than the train, quicker too. You could even do it every month, I know people who do. You can heat your flat at the temperature you like and happily wear just a T-shirt inside in the middle of winter. You can eat meat at every meal. We can do these things, but we would be collectively better off if we didn't.

Not doing sounds trivial, but it is not. Here is a list of things you can choose not to do in order to pollute less :

  • Not eat meat - this would require letting go of meals you love, learning new ways to cook, do some research on what nutrients you need, have conversations and arguments about it with friends, family, complete strangers
  • Not heat your flat a ton - this would require to adapt progressively, to buy comfy warm clothes to wear inside, to snuggle up under a blanket when you're not moving
  • Not take the plane - this would require to spend money and time taking looong trains and buses, making up for good stories and time to think, sometimes not visiting someone you care about or a place you'd love to go to, even giving up work opportunities

Not doing things is (surprisingly ?) fucking hard, it can be a struggle that requires effort and sacrifice. We should start celebrating it, so :

  • Yeah to staying home when you're sick ! My friend with long covid thanks you.
  • Yeah to going to bed early !
  • Yeah for not buying this gadget you don't really need !
  • Yeah for not spending time on a screen today (well, if you are reading this, in the overwhelming eventuality that it was not printed, you failed at this one)
  • Yeah for not sharing this fake news you didn't have time to check !

As humans, we are doing great things (kind, funny, awe-inspiring), but we also do a bunch of nonsense. Please do things. Try, fail, learn. But keep in mind that there is power in not doing things too, and that there are areas where that's what the world needs from you.

~ ~ ~

We're reaching the end of this post. Let's look at the grand-scale arguments I put aside at the beginning. It can be discouraging, because if you are the only person to not do something, it often has no impact. You will lose something, comfort, opportunities, security for what ... nothing ? It is a grand-scale prisoner's dilemma that feels doomed to fail. These are good arguments, that's partly why it takes a lot of mental strength to not do these things. But right now, that's what the world needs. And sometimes, that's what the humans around you need too, but that would be a more complex post to write :-)



Discuss

Страницы