Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 4 минуты 21 секунда назад

Miscellaneous observations about board games

12 ноября, 2025 - 15:49
Published on November 12, 2025 12:49 PM GMT

Some friend groups seem to require an excuse to meet up. Hardly anyone I know watches sports, alcohol is best used rarely, if ever, and just having dinner feels a bit too aristocratic. Watching movies used to be good, but nobody seems to have the patience for that anymore. Having no imagination doesn't help either. Since I live in Finland, sauna has been my go-to activity. Unfortunately, I have not-so-recently hypothesized that the mysterious headaches bothering me might get triggered by it, so I'm limiting myself to two or so per week. Enough to retain my citizenship, yet rare enough to confirm my suspicions. So board games it is.

You Play to Win the Game, Zvi writes, already summarizes most of my thoughts about the topic, but I'll discuss some of my own observations. The most interesting of them so far is that often people have different goals, even if the rules are the same for everyone.

Having fun is quite important. Winning is fun, of course, and doubly so if you can make it look clever. Some games emphasize the actual win condition more, while in others, strategies that are unlikely to win are more interesting to play. Often, if it's too easy, it's not enjoyable anymore, so a stronger player might pick a self-imposed challenge to keep it interesting without having to hold back.

Personally, I don't enjoy thinking ahead too much. It's tedious and slows down the game a lot. Some of my friends really only play this way, which is at times quite frustrating. Especially because it means I'll rarely win against them. You could potentially solve that with a clock, and chess does this pretty well, but that would feel too heavy-handed for games that I think of as social lubricant instead of a puzzle. And it's way better than the alternative, making nonsensical moves and not even trying to win.

One interesting technique I've been experimenting with is taking variance-increasing (high-risk, high-reward) moves when you're expecting a loss if you play normally. Scoring even a single game against a stronger opponent is quite satisfying. Not all games allow strategies like this, but when they do identifying them is part of the fun.

Some games repeatedly end up in positions where two players are competing for the victory, and a third player can decide which one of them wins, but have no shot at it themselves. In general, I feel like games should be designed in a way where, as long as you're allowed to take actions that affect others, there is some way you could still win, even if the odds are really bad. On the other hand, it's quite frustrating if playing better doesn't guarantee winning. With two-player games, resignation is often an option, but when more players are involved it's rarely the case.

I only like cooperative games if everyone has different information. Otherwise, it just becomes the best player solitaire-ing. Or everyone arguing about the best line. I think Hanabi and The Crew succeed quite well in this.

Semi-cooperatives, typically with hidden per-player win conditions, can achieve the best of both worlds, when carefully balanced. The only good example I can think of is the Nemesis series. Cooperative play is still the default, but everyone has to be wary of a backstab until the very end.



Discuss

Why to Commit to a Writing and Publishing Schedule

12 ноября, 2025 - 10:35
Published on November 12, 2025 7:35 AM GMT

Consistency is key! This is kind of obvious but I want to convince you it matters more than you'd think, if you have a blog or newsletter. Plus I have practical tips (that aren't entirely a Beeminder infomercial).

The least obvious point is how much your readers like it when you publish on a predictable schedule. Substack goes on and on about this and recommends posting weekly. Part of it is demonstrating your general reliability, but most is reader psychology. There's the anticipation of new posts, the routine and habituation of reading them —readers mentally budgeting for it. When a new post appears unexpectedly, even if it's a pleasant surprise, people are more likely to put off reading it, which means they're less likely to end up reading it.

Of course what really makes readers drop like flies is when you take a hiatus that's long enough that they forget they ever subscribed and think your new post (or the notification for it) is spam. There are even email deliverability concerns. Mailchimp says that going more than a month without sending something is bad and if it's been six months you should really have everyone reconfirm they even want to be subscribed.

Slippery Slopes

But you're not going to go a month without posting, you say? You just need one more day to do some more editing? And one more day doesn't matter so much? Yes yes yes. The danger is saying "one more day won't matter" day after day until your blog is covered in cobwebs.

So even if you don't care about your readers, consistency matters for you.

Maybe commit to writing 500 words per day? That's what everyone at Inkhaven has committed to for the month of November, on pain of getting kicked out. And I would be remiss if I didn't mention that Beeminder is what the cool rationalists use to do that (when not at Inkhaven). "Safety ropes for slippery slopes", we call it. See especially the old Beeminder post about ways to automatically send your wordcount to Beeminder.

Or maybe you'd prefer to commit to spending a certain amount of time, say 15 minutes a day, on writing. At Beeminder we do generally recommend committing to actions rather than outcomes. I'd say either time spent or words written count as actions, but technically if you absolutely can't think of the next word you want to write, time spent is the only thing fully under your control. But this assumes you're disciplined enough to stay totally focused on writing during that time. (You could use the stochastic self-sampling time-tracker, TagTime, to ensure you only count time spent actually writing or actively thinking about your prose, but that comes with a whole host of other problems.)

At the other extreme of the action-outcome spectrum would be something like committing to grow your number of subscribers by some amount. Don't try to beemind that unless you're a masochist. If you care about subscriber count, commit to corresponding actions like advertising or cross-promotion or whatever.

For beeminding time spent writing, the nerdier among you might like the WakaTime Beeminder integration. WakaTime has plugins for pretty much every coding editor. But maybe even non-nerds should try writing prose in a coding editor. The AI features are getting frighteningly powerful. If you don't like that idea, Beeminder has integrations with other time-tracking tools.

And it's not like Beeminder is the only game in town for committing to things. There are plenty of Beeminder competitors. Or just use social accountability. Promise friends and family you'll stick to a publishing schedule. You could even, I don't know, start a newsletter called AGI Friday so you're compelled to send something out every Friday.



Discuss

5 Things I Learned After 10 Days of Inkhaven

12 ноября, 2025 - 10:20
Published on November 12, 2025 7:20 AM GMT

If you don't know, Inkhaven is a residency where you come and publish a blogpost every day. No "Oh it would be nice to blog some day" or "Oh I'm working on something, I'm sure I'll publish it some day". No, you have to publish today, otherwise you are asked to leave.

After 10 days into the first ever Inkhaven cohort, here are some things I've learned.

1. Everyone publishes.

I have 41 people here writing, most of them living at Lighthaven, all of them visiting at least a few days per week. In recent weeks, I knew I was hurtling toward their arrival and I'd be in the thick of it; while I believed this abstractly, I didn't know what to concretely visualize.

Any time that anyone has said to me that they want to push down the publishing requirement, maybe to every 2 or 3 days, I have said "No. The typical human adult types at 40 words per minute. Writing 500 words should take only 12.5 minutes. I can get a reasonably long LessWrong comment written in under 30 minutes. This isn't that hard."

Yet this has not always been satisfying to these people. One of them decided not to come. Another one did and has had some dissatisfaction with the quality of their writing.

But overall, it's borne out. Not one person has dropped out so far.[1] 

We've published over 400 blogposts.

2. Essentially everyone is yoloing it every single day

Here is the data about how many blogposts residents have written right now that they can hit publish on.

That's 24 ppl with 0, and 11 ppl with more. (A few residents haven't filled out this form.)

To explain further, see this graph of what hour people publish their post each day.

Note that a few show-offs are staying up just past midnight to publish their post for the next day cough Tsvi BT cough.

People are finishing their posts increasingly close to the deadline—about 20% of posts so far have been published after 11pm!

3. People do not use physical spaces the way you planned.

I made beautiful coworking space for people! With height-adjustable desks and external monitors! But they mostly write in the living room, or at the lunch tables. Bah.

Writing in the living room!Writing at the lunch tables!My own team meeting happening randomly outside!This Guy works here every day.Some of our residents are still learning how to use benches.4. It's not too stressful, and is kind of energizing.

In the feedback form yesterday, I asked how stressful Inkhaven has been (from "1= Easy Peasy Lemon Squeezey" to "10 = Close to having a panic attack"). 

Only two people gave an 8, and one of them was a mentor who is also doing a lot of writing while here.

I also asked "How emotionally energized vs. drained have you felt this week?" from "1 = Totally drained / spent" to "10 = Incredibly energized / inspired".

This leans toward the energizing side.

5. Residents suck at proactively getting help from other people

I brought them so many people to give them advice.

I brought Gwern, Scott Alexander, Aella, Alexander Wales, Clara Collier, Adam Mastroianni, Sasah Chapin, Andy Matuschak, Slime Mold Time Mold, and many more.

And man, people are mostly just sitting on their own.

We tried to incentivize getting help.

As background, to incentivize publishing, we made a winners' lounge. It's a cool room with ice cream, alcohol, and super smash bros; but you can only go there once you've published today. It's pretty fun.

Then, to get people to utilize the contributing writers, we said that if you just have one such person read a draft of yours this week, you get access to the Diamond Platinum Elite Double Secret Winner's Lounge.

Yes, it's a ball pit.

Now we're edging toward sensible things happening.

Like, Gwern has done over 10 cumulative hours of office hours, where you can just show up with a draft, he'll read it and discuss it with you for up to 60 mins, with others listening, then move on. He gives great feedback. The first person in the first office hours brought him a list of 30 blogpost ideas, and he spent 40 mins going through them, talking about what would be the interesting parts in each one, giving more ideas, etc.

And yet, I think ~most of the Residents haven't shown up to one of these yet? Even though we're a third of the way in?

Or consider: I've made a form where you can submit essays to these established writers for feedback. Only 27 of the Residents have used it. The most popular person to submit to is Scott Alexander, and if you do he is known to write a whole goddam essay back to you about the structure and argument of your piece. Yet this has happened only 13 times. 

These writers are around and want to help. I will keep working on fixing this market failure.

Bonus thing: We've had a bunch of essays spend time on the frontpage of Hacker News.

  1. Trying two dozen different psychedelics (60 points)
  2. Robert Hooke's "Cyberpunk” Letter to Gottfried Leibniz (96 points)
  3. Why effort scales superlinearly with the perceived quality of creative work (139 points)
  4. Unexpected things that are people (638 points)

I'm interested in hearing questions that people have about Inkhaven! I may address them in the comments or in future posts.

  1. ^

    I will acknowledge there have been two close calls. One person cut over 1,000 words of drafts down to ~460 words, not noticing it was under the minimum of 500. We have since implemented a mandatory word-count check. And another person only submitted the form to our system at exactly midnight, which is technically the next day. I trust them that they published the post more than one minute before then, but we told them not to do it again.



Discuss

Response to "Taking AI Welfare Seriously": The Indirect Approach to Moral Patienthood

12 ноября, 2025 - 10:10
Published on November 12, 2025 4:43 AM GMT

I've been thinking about the Sebo et al. paper Taking AI Welfare Seriously, which features one of my favorite philosophers, David Chalmers (I have a signed copy of his anthology of mind, full fan here). While I appreciate their careful treatment of consciousness (we genuinely face deep uncertainty here, so it's naive to brush it off), I, nonetheless, find the robust agency argument deeply unconvincing as a standalone route to moral patienthood.

The authors suggest that sophisticated planning and goal-pursuit might suffice for moral consideration even absent phenomenal experience. But this seems to miss something crucial: welfare (typically) presupposes that outcomes can be better or worse for the entity in question. If there's no subjective experience when goals are frustrated then what exactly constitutes harm? The frustration of goals without any accompanying experience seems categorically different from suffering. In other words, if a chess engine's loss involves no phenomenal dimension whatsoever, then how can we say the engine is being harmed? Functional non-phenomenological descriptions, in my opinion, are deeply unconvincing. There may be an argument via resource-usage, but that becomes indirect (a strategy I will defend below); or an argument via representations (i.e. "Look, the model represents anxiety") but if the representations are non-phenomenological, they don't seem worth much in our calculus of moral patienthood. Especially when compared to all the sentient beings we fail to treat right and allocate enough resources to today.

I think we're approaching this problem from the wrong angle. We are thinking of systems like we think of animals. Instead, we should think of them as general agential tools. A special (and possibly new) class of artifacts.

Even if AI systems lack direct moral status, they increasingly take actions that we find morally significant (that is, that we judge them as morally valenced). They make decisions about resource allocation, information filtering, even judicial recommendations. These actions have moral weight in our social world, regardless of whether the systems themselves can suffer.

This suggests what I would call indirect moral patienthood: we should treat these systems with certain moral considerations not because they can suffer, but because their actions carry moral significance for us. We want them to embody good moral reasoning, not to minimize their nonexistent suffering, but to ensure they act in ways that reflect sound moral principles.

Indirect moral patienthood may not demand us to being polite to foundation models. Instead, it may require us to recognize that as systems become more agentic, they become participants in our moral ecosystem. This recognition, in turn, requires us to make more progress on alignment and find ways of generating stable virtuous dispositions. Their lack of consciousness doesn't make their morally-relevant actions (and, in general, outputs) disappear. If anything, it makes the question of how we shape and interpret their decision-making processes more urgent.

Further work should be done characterizing indirect moral patienthood. Is it a coherent concept? Or do we just have a duty to build better systems (where the systems in question are those that generate morally-valenced outputs and actions), independently of their status as patients?



Discuss

Do not hand off what you cannot pick up

12 ноября, 2025 - 09:32
Published on November 12, 2025 6:32 AM GMT

Delegation is good! Delegation is the foundation of civilization! But in the depths of delegation madness breeds and evil rises. 

In my experience, there are three ways in which delegation goes off the rails:

1. You delegate without knowing what good performance on a task looks like

If you do not know how to evaluate performance on a task, you are going to have a really hard time delegating it to someone. Most likely, you will choose someone incompetent for the task at hand. 

But even if you manage to avoid that specific error mode, it is most likely that your delegee will notice that you do not have a standard, and so will use this opportunity to be lazy and do bad work, which they know you won't be able to notice. 

Or even worse, in an attempt to make sure your delegee puts in proper effort, you set an impossibly high standard, to which the delegee can only respond by quitting, or lying about their performance. This can tank a whole project if you discover it too late.

2. You assigned responsibility for a crucial task to an external party

Frequently some task will become the central bottleneck for the success of a project. A key priority of everyone working at Lightcone should be to keep up constant pressure on identifying what our current task bottlenecks are, and to relieve them. 

If you delegate a task which later turns out to be a bottleneck for your project to someone who does not understand the project constraints as much as you do, you are in a much worse position to accelerate progress when the value of doing so becomes much higher.

And sometimes something even worse happens. The party you delegated the task to notices that having become the central bottleneck for your project is a position of leverage over you and the rest of the organization. Due to the scarcity of the labor the delegee provides, they end up rewarded for being the bottleneck, and they will actively fight information and skills from diffusing throughout the organization, as that threatens their high-demand and privileged position.

3. The delegee builds systems or processes that take on a life on their own. 

Even if you overcome these first two problems, and find a delegee competent at a task, manage to set realistic standards that motivate them to do perform high-quality work, and only delegate tasks that are unlikely to become the central bottleneck, your delegee might still end up messing up, by themselves trying to sub out the task or to set up a bad system trying to automate it.

Automation is a core principle of Lightcone (as will be covered in a future memo), so everyone across the organization should be trying to systematize tasks and automate themselves. 

But it turns out that building automations for a task, or hiring for a task, is often a very different skill than performing the task yourself. You, as someone in a quasi-executive position at Lightcone, are trusted to know how to automate and simplify things by Lightcone standards, but the people you hire will likely not have those skills. 

In the worst case, whole mini-departments and teams, with their own interests, actively working on ensuring their continued existence are created, against the interest of the organization at large.[1]

To address all three of these failure modes, Lightcone has a general rule:

Unless you really have to, or the task is highly specialized, do not delegate a task you do not know how to perform yourself.

This rule aims to address all three of the above. By knowing how to perform the task yourself...

  • You (usually) learn what good and realistic performance on a task looks like
  • You maintain the ability to increase capacity if the task becomes a bottleneck
  • You can audit systems and processes created in the pursuit of the task you delegated

If you ever end up in a spot where you do not have the time, or the aptitude, to learn how to perform a task you are delegating yourself, it is your job to otherwise ensure the scenarios above do not occur.

This is a very intense rule. It rules out a large fraction of  behavior at almost every other organization in the world. 

"Oh man, the bathroom right next to the common area is clogged. I should call a plumber to fix it while I keep hacking away at these event invoices". WRONG. Go and call the plumber (or ideally, ask the person on our staff who already knows). Then ask the plumber to explain to you what they are going to do to fix the problem. Then fix the problem yourself. Then, next time you can call a plumber to just solve the problem, and you will know how long this task is supposed to take, and whether the next plumber is doing a fine job.

"Oh man, we are being sued by FTX. I should hire a bankruptcy lawyer to prepare our defense." WRONG. I mean yes, of course go ahead and hire a bankruptcy lawyer to prepare the defense. But in-parallel use language models to prepare a defense yourself, then run the defense by the lawyer you hired until you think you understand reasonably well what the core constraints are. Then work together with the bankruptcy lawyer on the defense.

"I've never done much database query optimization, I should hire someone to optimize our Postgres indexes as we keep having slow queries". WRONG. Go and read about Postgres indexes yourself. It's not that hard. Feel free to call up someone with more expertise to teach you. Yes, this will set back the feature you wanted to push by a week. It's worth the tradeoff. 

Knowing how to perform a task yourself at all is not the same as knowing how to perform it as well as the person you are delegating the task to. The goal is not to ensure that competence across every work-relevant dimension strictly declines as you go down the organizational hierarchy. You frequently will, and should, delegate to people who are 10x faster, or 10x better at a task than you are yourself. 

But by knowing how to perform a task yourself, if slowly or more jankily than your delegees, you will maintain the ability to set realistic performance standards, jump in and keep pushing on the task if it becomes an organizational bottleneck, and audit systems and automations that are produced as part of working on the task. This will take you a bunch of time, and often feel like it detracts from more urgent priorities, but is worth the high cost.

  1. ^

    One might think that surely this can't happen at an organization as small as Lightcone, which would be mistaken! I really have seen organizations of merely 10 people end up with 3 of those being part of a department that should not exist but is kept alive by holding some crucial resource hostage. Even within Lightcone I have seen cases where someone takes joy and pride in being the bottleneck on certain technical information, when in-fact them doing so is causing great harm to the organization.



Discuss

Better than Baseline

12 ноября, 2025 - 09:30
Published on November 12, 2025 6:30 AM GMT

There's a word I put unusual emphasis on, which helps me think about the world. In my culture I'd have a special word for it but it's close enough to the common English term "baseline."

Baseline /ˈbāsˌlīn/ noun. A minimum or starting point used for comparisons.

I endeavor to leave the world better than the baseline state in which I found it. I think you should do this as well.

I.

Wherever you're reading this, I want you to stop and take a look around the world. Maybe you're in your room sitting at your computer. Maybe you're in the park under the sunshine, skimming this on your phone. Wherever it is, take five seconds and consider the area you're in. 

Is there anything about the area that could be better?

II.

For this essay, I want call your attention to that baseline. 

The baseline is the way things were before you showed up or started doing things. Most often it's on a short time scale like minutes or hours, though the exact time scale varies. Sometimes it's longer, and the longer it gets, the more you can extrapolate- if you've been doing a lot of things for a long time, the baseline of the situation is what things would have been like if you'd never showed up but time had continued to advance. For this use, I think it's not helpful to go for complicated hypotheticals when thinking about the baseline. If it starts getting complicated then the idea is less useful, though still sometimes brought up. 

There's a little ditch just over a low stone wall near my apartment. For whatever reason the ditch is a common place for people to toss litter, especially alcohol containers but also litter in general. 

Sometimes when I walk by the ditch, I lean over the wall and pick up a bottle or two to toss in the recycling bin at the end of the block. Sometimes I don't. 

If you're looking at the two pictures, you may have some real trouble figuring out what's different; here I grabbed the water bottle near the bag and the Gatorade bottle near that branch on the mid right. In my culture, it'd be noted that I've obviously made the ditch better than the baseline. 

(Yes, this is related to the Copenhagen interpretation of ethics. My culture thinks the Copenhagen interpretation of ethics is bad, and wants something which is not that.) 

Both in the hypothetical world run on my norms and in the real world we actually live in, sometimes the changes aren't obviously net positive or negative. If you mine a bunch of ore maybe it's bad that you did some environmental damage but good to have more ore in the economy. If you paint a mural on the wall and some people like the art and others think it's worse than a blank wall, well, that happens. There's an obligation not to make things straightforwardly worse than the baseline, and people try to appreciate even small-but-unambiguous improvements to the baseline.

You looked at the area around you earlier. If you saw something that could be better, and you could make it better in thirty seconds or so, I'd appreciate it if you went and improved it. Make the world better than the baseline.

III.

The way I solve many of my problems is by chipping away at them.

I clean my office one stray index card at a time. I sit down to work each morning and start with one email at a time. I do try and take time to plan the long term strategy at least once or twice a year, because it'd be easy to spend all my time mopping the floors on the Titanic otherwise, but when I look back at my career I think my most impressive work has been iterative work improving things a little here, a little there.

Professionally it can make more sense to spend a few hours getting to inbox zero. Many things are more efficiently solved by focused work for hours. Certainly I think I could clean up the litter over that little wall given an afternoon of work, a trash bag, and a pair of garden gloves. Then it would be clean and perhaps even green again. Cleaning that area completely would be a large improvement from baseline.

The problem is it would be a large improvement from baseline fueled entirely by me.

IV.

I don't litter. Haven't since I was a small child. Part of that was growing up in a community that valued the environment and the natural spaces we were part of. Part of it is that, in a small enough community and a rural enough area, you know that the trash you leave on your backwoods trail is only going to be picked up by you as well. I don't want to leave the area worse than it was when I got there.

Enough people like me would erode away the trash by dint of passing through and each removing one or two pieces of debris. The small slips where unavoidable accident or necessary emergency required leaving a snickers wrapper on a grassy hill would be easily handled by the tide of folks trying to make the world a little better.

I don't bother putting a few hours into cleaning that spot of litter. I'm too badly outnumbered by people who make the baseline worse. Once in a while someone does clean it up, and within a week or two the area is back to trashed. It makes me sympathetic to gated communities and homeowners associations. It wouldn't even take actual work for a society full of me to keep that area clean; litter doesn't spontaneously generate. All it would take is having everyone avoid making the place worse than baseline.

I want to repeat again that not every change can be a straightforward improvement.

Once upon a time, someone came to the green and growing open space that is Boston, and they raised brick after brick and steel girder after steel girder. Downtown Boston is no longer green and growing. In exchange we got something else marvelous, living spaces and working spaces many stories high, a shipping port that can take goods coming from around the world and bring me mango and pomegranate and pens and frisbees and kites. This is perhaps an improvement on net, but I don't want to claim it's a straightforward one.

Sensitivity to what is straightforwardly better than baseline seems useful, as well as a humble and well calibrated sense of what's mostly an improvement, and what's overwhelmingly an improvement. It's not that I don't want anyone to ever do the complicatedly better things, or to try ideas with positive expected value that turn out badly. But I do think it's underappreciated how small improvements compound.

V.

There's a beautiful vision of incremental utopia I can see sometimes, overlaid like a mirage over the world around me. 

Many hands make light work. Tasks that are a heavy load to bear alone become trivial investments when made by dozens or hundreds. Some of my favourite music has long been choir songs or acapellas with layer after layer of the singer's voice overlaid one atop the other. You can make something beautiful if if everyone pitches in a bit. You can even make something beautiful all on your own as long as you get a little bit better every day, and nobody comes along to carelessly knock over what you're building.

Contrary-wise, I often say I'm mad at the right people for the wrong reasons. What makes me made are the small frictions, windows broken just for fun, the pointless rudeness that achieves nothing. Littering. Things that make the baseline worse without achieving anything else. 

Sometimes I dream that the gap between the real world and the utopia in my imagination is people being happy making small, repeated improvements to baseline.



Discuss

How human-like do safe AI motivations need to be?

12 ноября, 2025 - 08:32
Published on November 12, 2025 5:32 AM GMT

(Audio version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

This is the eighth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.

This essay is also a review/critique of one of the central arguments in the book “If Anyone Builds It, Everyone Dies,” by Eliezer Yudkowsky and Nate Soares.)

1. Introduction

In previous essays, I’ve laid out my rough picture of the path to building increasingly powerful AIs safely – and in particular, to exerting control over their motivations and their options sufficient to allow us to use their labor to massively improve the situation (“AI for AI safety”), especially with respect to our ability to make the next generation of AIs safe. In this essay, I want to address directly a particular sort of concern about this project: namely, that the AIs in question will be too alien to be safe. That is, the thought goes, AIs built/grown via contemporary machine learning methods will end up motivated by a complex tangle of strange, inhuman drives/heuristics that happen to lead to highly-rewarded behavior in training. But in the context of more powerful systems and/or more out-of-distribution inputs, the thought goes, these alien drives will lead to existentially catastrophic behavior.

This sort of concern is core to certain kinds of arguments for pessimism about AI alignment risk – for example, the argument presented in the recent book “If Anyone Builds It, Everyone Dies,” by Eliezer Yudkowsky and Nate Soares. And I think the concern has real force. However, I also find it less worrying than Yudkowsky and Soares do – especially in AIs with more intermediate levels of capability (that is, the AIs most crucial to “AI for AI safety,” and which I view as the most direct responsibility of human alignment researchers to make safe).

The core reason for this is that we don’t need to build AI systems with long-term consequentialist motivations we’re happy to see optimized extremely hard. In the context of systems like that: yes, alien motivations are indeed a problem (as are human-like motivations with other flaws, even potentially minor flaws). But systems like that are not the goal. Rather, according to me, the goal is (roughly) to build AI systems that follow our instructions in safe ways. And this project, in my opinion, admits of a much greater degree of “error tolerance.”

In particular: the motivations that matter most for safe instruction-following are not the AI’s long-term consequentialist motivations (indeed, if possible, I think we mostly want to avoid our AIs having this kind of motivation except insofar as it is implied by safe instruction-following). Rather, the motivations that matter most are the motivations to reject options for rogue behavior – that is, motivations that are applied centrally to actions rather than long-term outcomes. Or to put it another way: a lot of the safety we’re getting via motivation control is going to rest on AIs being something more akin to “virtuous” or “deontological” with respect to options for rogue behavior, rather than from AIs directly caring about the same long-term outcomes as we do. And the relevant form of virtue/deontology, in my opinion, need not be fully or even especially human-like in the concepts/drives/motivations that structure it.[1] Rather, it just needs to add up, in practice, to safe behavior on any dangerous inputs (that is, inputs that make options for rogue behavior available) that the AI is in fact exposed to.

Admittedly: this reply doesn’t address all of the standard concerns about relying on non-consequentialist motivations for safety – for example, concerns about AIs with at least some long-term consequentialist motivation going rogue via the “nearest unblocked strategy” that is suitably compatible with the non-consequentialist considerations they care about. Nor does it provide additional comfort with respect to preventing alignment faking, or with respect to what I’ve previously called the “science of non-adversarial generalization” – that is, the challenge of ensuring (on the first safety-critical try) that the motivations of non-alignment-faking AIs generalize safely to practically-relevant out-of-distribution inputs. To the contrary, I do think that AI motivations being less human-like makes these challenges harder, because the alien-ness at stake makes it harder for humans to predict how the motivations will apply in a given case.

But this, I think, is an importantly different concern than the one at stake in the central argument of “If Anyone Builds It, Everyone Dies” – and one, I think, about which existing levels of success at alignment in current systems (together with: existing success at out-of-distribution generalization in ML more generally) provides greater comfort. That is: to the extent that the degree of good/safe generalization we currently get from our AI systems arises from a complex tangle of strange alien drives, it seems to me plausible that we can get similar/better levels of good/safe generalization via complex tangles of strange alien drives in more powerful systems as well. Or to put the point another way: currently, it looks like image models classify images in somewhat non-human-like ways – e.g., they’re vulnerable to adversarial examples that humans wouldn’t be vulnerable to. But this doesn’t mean that they’re not adequately reliable for real-world use, even outside the training distribution. Aligning AIs with alien motivations might, I think, be similar.

All that said: at a higher level, relying on smarter-than-human AIs with strange alien drives to reject options to seek power/control over humans seems extremely dangerous. I am more optimistic than Yudkowsky and Soares that it might work; but I share their alarm at the idea that we would need to try it. And to the extent we end up needing to try it with earlier generations of AIs, I think a key goal should be to transition rapidly to a different regime.

I’ll be starting a job at Anthropic soon, but I’m here speaking only for myself, and Anthropic comms hasn’t reviewed this post. Thanks especially to Nate Soares and Holden Karnofsky for extensive discussion of some of these issues.

2. Are our AIs like aliens?

Let’s start by laying out the concern about alien AIs in a bit more detail, focusing on the presentation in “If Anyone Builds It, Everyone Dies” (IABIED).

We can understand the core argument in IABIED roughly as follows:

  1. AIs built via anything like current techniques will end up motivated by a complex tangle of strange alien drives that happened to lead to highly-rewarded behavior during training.
  2. AIs with this kind of motivational profile will be such that “what they most want” is a world that is basically valueless according to humans.
  3. Superintelligent AIs with this kind of motivational profile will be in a position to get “what they most want,” because they will be in a position to take over the world and then optimize hard for their values.
  4. So, if we build superintelligent AIs via anything like current techniques, they will take over the world and then optimize hard for their values in a way that leads to a world that is basically valueless according to humans.

We can query various aspects of this argument – and I won’t try to evaluate all of it in detail here. For now, though, let’s focus on the first premise. Is that right?

I’m not sure it is. Notably, for example: current AI pre-training focuses their initial development specifically on a vast amount of human content, thereby plausibly endowing them with many quite human-like representations. That is: current AIs need to understand at a very early stage what human concepts like “helpfulness,” “harmlessness,” and “honesty” mean. And while it is of course possible to know what these concepts mean without being motivated by them (cf “the genie knows but doesn’t care”), the presence of this level of human-like conceptual understanding at such an early stage of development makes it more likely that these human-like concepts end up structuring AI motivations as well.

Indeed, this is one of many notable disanalogies between AI training and natural selection – one of Yudkowsky and Soares’s favorite reference points. That is, pointing human motivations directly at something like “inclusive genetic fitness” wasn’t even an option for natural selection, because humans didn’t have a concept of inclusive genetic fitness until quite recently. But AIs will plausibly have concepts like “helpfulness,” “harmlessness,” “honesty” much earlier in the process that leads to their final form.

What’s more, existing efforts at interpretability do in fact often uncover notably human-legible representations at work in current AI systems (though obviously, there are serious selection effects at stake in this evidence);[2] and to the extent such representations correspond to important/natural “joints in nature” generally useful to understanding the world, this is all the less surprising. And to my mind, at least, the ease with which we’ve been able to prompt quite human-like and aligned behavior in our AIs using quite basic, RLHF-like techniques is both notable and, in my opinion, in tension with the naive predictions of a worldview that treats AI cognition as extremely alien. In particular: in my opinion, we haven’t just succeeded at getting fairly reliably aligned behavior on a specific training distribution. Rather, we’ve succeeded at creating dispositions towards aligned behavior that generalize fairly (though of course, not perfectly) well to new, at-least-somewhat out of distribution inputs as well – and success at this kind of generalization is effectively what “human-like-ness” consists in.[3] (Here I expect Yudkowsky and Soares will say that the sort of generalization we care about most is importantly different; I’ll address this concern later in the essay.)

Of course, it’s true that current AIs do sometimes behave quite badly – including, sometimes, in quite alien ways. But in interpreting this kind of evidence, my sense is that people worried about AI alignment often try to have the evidence both ways. That is, they treat incidents like Bing Sydney as evidence that alignment is hard, but they don’t treat the absence of more of such incidents as evidence that alignment is easy; they treat incidents of weird/bad out-of-distribution AI behavior as evidence alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence alignment is easy. Of course, you can claim to learn nothing from any of these data points, and to be using them only to illustrate your perspective to others. But if you take yourself to be learning from the bad cases, I think you should be learning from the good cases, too.[4] 

Indeed, my sense is that many observers of AI have indeed taken a lot of comfort from the good cases. That is, the intuition goes: if AI alignment remains effectively this easy going forwards, then things are looking pretty good. And while I generally think that casual comfort of this kind is notably premature, I share some intuition in the vicinity. In particular, I have some hope that by the time we start building AIs that can be transformatively useful – e.g., AIs within the “AI for AI safety sweet spot” that I discussed earlier in the series – alignment has not become radically harder, and in particular, that efforts to ensure safe instruction-following behavior continue to generalize out-of-distribution at least as well as they have done so far. And I think it plausible that if transformatively useful AI systems safely follow instructions about as reliably as current models do (and especially: if we can get better at dealing with reward-hacking-like problems that might mess up capability elicitation), this is enough to safely elicit a ton of transformatively useful AI labor, including on alignment research – and that the game-board looks substantially better after that.

What’s more, while I am indeed concerned about the many incidents of bad/misaligned behavior in current models, I don’t think any of these currently look like full-blown incidents of the threat model made most salient by concerns about AI alien-ness in particular. In particular, while it’s true that we see AIs willing to engage in problematic forms of power-seeking – e.g., deceiving humans, self-exfiltrating, sandbagging, resisting shut-down, etc – they currently mostly do so in pursuit either of fairly human-legible or context/prompt-legible goals like helpfulness or harmlessness (e.g. here and here); on the basis of shifting between different human-legible personas (e.g. here and here); in pursuit of completing the task itself (e.g. here and here); or, perhaps, in pursuit of “terminalized” instrumental goals like an intrinsic drive towards power/survival (this is another interpretation of some of the results previously cited). That is: in my opinion, we have yet to see especially worrying cases of AIs going rogue specifically in pursuit of goals (and especially, long-term consequentialist goals) that seem especially strange/alien – though of course, this could change fast.

What’s more, in thinking about what it would mean for an AI’s motivations/cognition to be human-like or alien, I think we need to be careful about the level of abstraction at which we are understanding the claim in question. That is: it’s not enough to say that AI behavior emerges from a complex tangle of heuristics, circuits, etc rather than something more cleanly factored (since: the same is true of human behavior); nor, more importantly, to say that the heuristics/circuits/etc work in a different way in the AI case. Rather, we should be focused on the high-level behavioral profile that emerges in the system as a whole, and the degree to which it diverges from some more human-like alternative.[5] And as I’ll discuss below, what actually matters is whether it diverges in practice, and in catastrophic ways – not just whether it does so in some way on some inputs. Thus, per the analogy I discussed in the intro: an AI classifying cat pictures adequately doesn’t actually need to mimic human judgments across every single case (nor, indeed, will human judgments all agree with one another); to be robust to every adversarial example; etc. Rather: it just needs to get enough cases (including: enough cases out of distribution) enough right. And in this respect, for example, it looks to me like many of our efforts to get AIs to behave in fairly human-like ways, including out of distribution, are going OK.

All that said: I remain sympathetic to some versions of premise (1). In particular: I think it quite plausible that if we really understood how current AI systems think/make decisions etc, we would indeed see that to the extent they are well-understood as having motivations at all, these motivations are quite strange/alien indeed, and that they will indeed lead to notably alien behavior on a wide variety of realistic inputs (beyond what we’ve seen thus far). For example, while I think it’s an open question exactly how to interpret data of this kind (see e.g. discussion here), I definitely get some (fairly creepy) vibe of “strange, alien mind” from e.g. the sorts of chains of thought documented in this work on scheming from Apollo and OpenAI (full transcripts here):

It’s giving “strange alien mind that might turn against you” (from here).

And I think it plausible, as well, that this sort of alien-ness will persist and/or increase in more powerful AIs built using similar techniques – and this even accounting for moderate improvements in our behavioral science and transparency tools.

That is: overall, I share Yudkowsky and Soares' concern that the motivations of AIs built using current techniques will remain importantly strange/alien. And so I want to examine in more detail the implications for safety if this is true. In particular: if, in fact, powerful AI systems are motivated in at-least-somewhat alien ways, does that mean we are as doomed as Yudkowsky and Soares think?

I’m skeptical. In particular: I think the Yudkowsky/Soares argument above places too much emphasis on long-term consequentialist AI motivations in particular, and that it neglects the ways in which the sort of safety accessible via shorter-term and especially non-consequentialist motivations ends up more tolerant of error. Or to put the point in more Yudkowskian terminology, I think that something like “corrigible AIs” (that is, roughly, AIs with imperfect motivations but which nevertheless obey your instructions and don’t go rogue) can safely be more notably alien (and otherwise flawed) in their motivations than “sovereign AIs” (that is, AIs with motivations so perfect you trust them to optimize the future arbitrarily hard without accepting any of correction from you) – and that our focus should be on something like corrigibility in particular.[6] Let me say more about what I mean.

3. Value fragility, sovereign AIs, and getting AI motivations “exactly right”

In my opinion, we should understand the concern about “alien AIs,” and its role in the argument I laid out above, as a specific version of a broader concern that has haunted the AI alignment discourse from basically the beginning: namely, the concern that safe AI motivations need to be, in some sense, “exactly right.” That is, the more general argument in the background (though: not stated as explicitly in IABIED) is something like the following (alterations from the previous version in bold):[7] 

  1. AIs built via anything like current techniques will end up with motivations that aren’t exactly right.
  2. AIs whose motivations aren’t exactly right will be such that “what they most want” is a world that is basically valueless according to humans.
  3. Superintelligent AIs with this kind of motivational profile will be in a position to get “what they most want,” because they will be in a position to take over the world and then optimize hard for their values.
  4. So, if we build superintelligent AIs, they will take over the world and then optimize hard for their values in a way that leads to a world that is basically valueless according to humans.

Why think that AI motivations need to be exactly right? Well, roughly, the basic concern is that human value is “fragile” under extreme optimization. That is, the thought goes: extreme optimization for slightly-flawed values leads to places that are basically valueless by human lights; and superintelligences will be forces for extreme optimization.

I’ve written in some detail, elsewhere, about my takes on concerns about “value fragility” of this kind. See, in particular, this set of informal notes, and this longer essay about whether the concern in question applies similarly to humans with respect to one another. For those interested, I’ve also given a summary of some of those takes in an appendix below.

However, while I think there are a variety of interesting and important questions we can raise about value fragility (and especially: about the extent to which similar concerns do or do not apply between different humans), I’m not, here, going to dispute a certain kind of broad concern about it. That is, I’m going to accept that for long-term consequentialist value-systems that aren’t exactly right (or at least, which don’t put non-trivial weight on something exactly right), if you optimize for them super hard, you do indeed create a world that is roughly valueless by human lights. And I’m going to accept, further, that the degree of alien-ness at stake in the motivations of AIs developed via current techniques is likely enough to fall short of “exactly right” in this sense (at least to the extent that such AIs develop long term consequentialist motivations at all – something which, as I’ll discuss below, I think we should be trying to prevent except insofar as these motivations follow from safe instruction-following).

What follows from this? Basically: what follows is that current techniques aren’t fit to build AI systems with long-term consequentialist motivations that we’re happy to see optimized extremely hard. That is, roughly, they are not fit for building what Yudkowsky call a “Sovereign AI” – that is, in his words, an AI that “wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it.”

But building sovereign AIs of this kind, I claim, should not be our goal. Indeed, I explicitly defined solving the alignment problem so as to neither require this degree of alignment in the AIs we build; nor, even, to require the ability to elicit the creation of AIs that are this degree of aligned (and in particular, I’m not counting “build an AI that you’re happy to make dictator-of-the-universe” as one of the “main benefits of superintelligence”). This is centrally because I think that building an AI worthy of this degree of trust may be a notably more difficult challenge than building an AI that safely follows our instructions.[8] But also, even aside from the technical difficulty of building a "sovereign AI” of this kind, I don’t think we should view “now we’ve handed control of the world to a perfectly benevolent AI dictator/oligarchy” as a clearly ideal end state of our efforts on alignment – nor, indeed, one that is unavoidable absent some other sort of enforced restriction on AI development.[9] To the contrary, I think we should focus more on a vision of humans who are able to get safe, fully-elicited superintelligent help in navigating the ongoing transition to even greater levels of AI capability – including with respect to questions about what sorts of “sovereign” to make what sorts of AIs going forwards.[10] 

That said: the argument for pessimism at stake in IABIED – and also, in the more general value-fragility argument outlined above – isn’t “we should aim for perfect AI dictators, but we’re going to get alien/imperfect AI dictators instead.” Rather, it’s more like: “we’re not going to be able to avoid getting AI dictators of some form, and the dictators we’re going to get will be alien/imperfect.” That is: Yudkowsky and Soares do recognize the possibility of trying to build what Yudkowsky calls “corrigible AI” – that is, in Yudkowsky’s words, an AI “which doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.” And indeed, my understanding is that Yudkowsky and Soares agree with me, as a first pass, that “corrigible AI” in this sense is a better near-term focus of efforts at alignment. But they think that the project of building corrigible AI, too, is doomed to fail.

Now, as I’ve discussed in some informal notes elsewhere, I think that the role of the notion of “corrigibility” in the discourse about AI alignment is often unclear/ambiguous. In particular, in the context of the Yudkowsky quote above, it basically just means “any powerful AI with not-exactly-right values that is somehow otherwise safe.” But often, people use the term to refer to a number of more specific properties – notably, a willingness to submit to corrective intervention like shut-down or values-modification (while remaining useful in other ways – e.g. not trying to shut itself down). Conceptually, these aren’t actually the same. For example: humans don’t perfectly share each other’s values, and they will generally resist “corrective intervention” like shut-down (death) and values-modification (brainwashing), but they also aren’t at least presently aiming at omnicide.

Still, something like “corrigibility” is indeed a closer match for my focus in this series than “sovereign AI.” That is: I want us to learn how to build AIs that safely follow our instructions – where “safely” means “without engaging in rogue behavior,” and where “rogue behavior” includes things like resisting shut down/values-modification, and definitely includes taking over the world. Indeed, I think it’s notable that on the most straightforward understanding of plans for building “sovereign AI,” the AI in question does take over the world (for example, the classic argument for AI takeover I laid out here applies with roughly comparable weight to AIs with perfect long-term consequentialist values)  – it’s just that, what it does from there is suitably valuable by human lights.[11] That is, in a sense, even sovereign AIs with exactly-right long-term values go “rogue” in my sense – it’s just that, after humans lose control, the future is still good. But I’m interested in avoiding rogue behavior period.

4. What does it take to build corrigible AIs?

What does that involve? Well, let’s return, for a moment, to the framework for thinking about AI incentives that I laid out in the second essay in the series.

On this framework, the central aspects of an AI’s motivation profile that matter are its satisfaction (how happy it is with the best non-rogue option), its inhibition (how averse it is to various paths to successful rogue behavior), its failure aversion (how much it wants to avoid the downsides of failed rogue behavior) and its ambition (how much it wants the power at stake in rogue behavior). The reason AIs with “alien” drives take over the world is that what they most want in the long term makes them suitably “ambitious” about getting the power at stake. And as I noted above, I actually think that similar considerations would suggest that even AIs with exactly-right long-term consequentialist values would take over as well.

Suppose, though, that we don’t want our AIs to take over or to go rogue at all. How can we ensure this? A big part of it is option control – and in particular, restricting the options for rogue behavior available to the AIs in question. Indeed, as I discussed in the second essay, I think the classic argument for AI risk is best understood as relying heavily on the premise that rogue-option-restriction of this kind will be so ineffective that the AI in question will be in a position to take over the world extremely easily (such that “failure aversion” isn’t an important part of the story), and to do so via such a wide variety of methods that it can successfully route around any deontology-like inhibitions that make certain routes undesirable (cf concerns about “nearest unblocked neighbor”). In this context, that is, the main thing that matters for whether the AI has an incentive to take over is whether it has any suitably long-term consequentialist motivation that gives rise to even a small amount of ambition. And indeed, as I’ll discuss below, I think we see this kind of low/weak standard at work in the argument in IABIED for expecting AIs with alien drives to have long-term consequentialist drives in particular.

As I discussed in my last essay, though, I don’t think we should give up on option control playing an important role in the safety of advanced AIs, nor do I think that success in this respect needs to be all or nothing (i.e., no viable options for rogue behavior vs. can take over the world extremely easily via tons of different methods). And in the context of understanding “corrigibility,” I think the availability of option control as a possible tool is important. In particular, to the extent you are hoping to rule out a suitable range of paths to rogue behavior somehow, option control allows you to do so via intervention on the AI’s environment/capabilities as well as via its motivations. In this sense, as with the notion of “alignment” more broadly, “corrigibility” in the sense I care about is importantly relative to a particular environment and capability level. That is, the AI in question doesn’t need to act corrigibly across all possible inputs and capability levels – it just needs to act corrigibly in the specific context you care about, on the specific set of tasks you’re trying to get it to perform.

Beyond option control, though, we can divide the motivational aspect of corrigibility into two components:

  1. Minimizing the AI’s “ambition.”
  2. Ensuring that the other aspects of the AI’s motivational profile (its satisfaction, inhibition, and failure aversion) are sufficiently strong/robust as to outweigh the degree of ambition it does have.

Let’s look at each in turn.

4.1 Minimizing ambition

AI ambition arises, paradigmatically, when AIs have long-term consequentialist motivations – the sort of motivations that create instrumental incentives to seek power in problematic ways. Here the time horizon is important because the AI needs time for the relevant efforts at getting and using power to pay off; and the “consequentialist” is important because the paradigmatic use-case of power, in this context, is for causing the consequences the AI wants.

Why exactly, though, should we expect advanced AIs to have motivations of this kind? In my opinion, IABIED is inadequately clear about its answer here.[12] But we can distinguish, roughly, between two different reasons for concern, both of which are present in IABIED in different forms.[13] The first is that AIs will end up with long-term consequentialist motivations by accident. The second is that we’ll give them these motivations on purpose.

4.1.1 Making AIs ambitious by accident

At times in IABIED, it looks like “AIs will end up with ambitious motivations by accident” is playing the central role. Consider, for example, the discussion in Chapter 5 of why we shouldn’t expect AI preferences to be easily satisfied:

In an AI that has a huge mix of complicated preferences, at least one is likely to be open-ended—which, by extension, means that the entire mixture of all the AI’s preferences is open-ended and unable to be satisfied fully. The AI will think it can do at least slightly better, get a little more of what it wants (or get what it wants a little more reliably), by using up a little more matter and energy.[14]

That is, the picture here is something like: somewhere amidst the AI’s complex tangle of alien drives will be at least some suitably ambitious motivation (here “open-endedness” and “non-satiability” are the relevant forms for ambition, but we could similarly focus on aspects like consequentialism and long-time-horizon).

Note, though, that this story rests on a few assumptions we can query. First: it assumes that the type of alien-ness at stake in AI motivations is specifically such as to implicate a complex variety of different motivations, thereby implicating a high probability that at least one of them will be suitably ambitious. But even if we grant that AI motivations will be alien in some sense, the idea that AIs will have many diverse alien motivations is a further step – one incompatible, for example, with some salient threat models (e.g., AIs that end up solely focused on some alien conception/correlate of “reward”); and one that I don’t think IABIED offers a strong argument for.[15]

More importantly, though: even if we grant that because AIs will have many diverse alien motivations, at least one of them will likely be ambitious enough to make taking over the world attractive pro tanto, Yudkowsky and Soares then make the further assumption that this level of ambition is also enough to make the AI choose to attempt world takeover overall. But per my discussion above, this doesn’t follow. That is, it could also be the case that the AI’s inhibition and failure aversion combine to outweigh the ambition in question – and this, especially, to the extent there are meaningful restrictions on which routes to taking over the world are available. Or to put it another way, I think Yudkowsky and Soares are generally assuming that the AI is in such a dominant position that taking over the world is effectively “free,” such that it just needs to have some benefit according to the AI’s motivational profile in order to be worth doing overall. But I don’t think we should assume this – and especially not in the context of the sorts of intermediate-level capability AIs that matter most for “AI for AI safety.”

Beyond IABIED’s argument for “AIs will have many complex motivations, so at least one is probably ambitious,” there are also other ways to worry about AIs ending up with long-term consequentialist motivations by accident. I’ve discussed some of these in section 2.2 of my scheming AIs report, on “beyond-episode goals,” and I won’t review that discussion here.

4.1.2 Making AIs ambitious on purpose

What about the concern that we will make advanced AIs ambitious on purpose? Some version of this is the argument for expecting long-horizon consequentialism that I personally take most seriously. That is, the thought goes:

  1. We are going to want AIs that successfully and tenaciously optimize for real world long-horizon outcomes, so
  2. This kind of AI will have ambitions of the kind that prompt pro tanto interest in world takeover.

I think this is right, but that the implications are a bit slippery.

First, on (1): while it is true that we will likely want AIs that optimize for outcomes on time horizons of e.g. years, this is distinct from saying that we will want AIs that optimize for outcomes on indefinite time horizons. That is, to the extent the paradigm rogue AI has motivations to optimize “all future galaxies over the entire future of the universe,” it’s not clear that there are strong commercial incentives for that.[16] 

Second: to the extent we are imagining AIs ending up with long-horizon consequentialist motivations because we are trying to give them motivations of this kind, this opens up the possibility of also trying, instead – at least for some AIs – to not do this. And as I’ve discussed at various points in the series, I think AIs with reasonably myopic motivations could be quite useful in tons of contexts (e.g. monitoring for suspicious behavior, helping with alignment research, etc).

Finally: the specific form of long-horizon consequentialism that seems to me most intuitively incentivized by the existing commercial landscape is downstream of a different property – namely, incentives to create AIs that safely follow instructions, including instructions to optimize for long-horizon outcomes. And I think it’s possible that long-horizon consequentialism of this kind is importantly different from the type at stake in a more standard vision of a consequentialist agent. In particular: this type of AI isn’t a long-term consequentialist agent across all times and contexts; and still less, a “sovereign AI” that we aim to make dictator or to let optimize unboundedly for our full values-on-reflection. Rather, it’s only a long-term consequentialist agent in response to certain instructions; and different instances will often receive different instructions in this respect. And of course, to the extent we are hypothesizing success at creating an AI that fits with commercial incentives for engaging in long-horizon consequentialism when instructed to do so, we might wonder about whether similar incentives will have helped ensure its instruction-following more broadly – including with respect to instructions to otherwise act safely.

All that said: I do think that the fact that we want (some) AIs to (safely) pursue certain kinds of long-horizon real-world outcomes puts meaningful constraints on the available approaches to corrigibility. That is, basically: you can’t aim only to create AIs with motivations that wouldn’t give rise even to pro tanto instrumental incentives to take over. And this means, in a sense, that at least some AIs (or: AI instances) are going to need to be some amount of ambitious – and to the extent we’re giving them at least some dangerous inputs that make options to go rogue available, we are going to need to find suitably strong/robust means of making sure that they reject those options regardless. Let’s turn to that aspect now.

4.2 Sufficiently strong/robust non-consequentialist motivations

Given that at least some AIs will need to be pursuing long-term consequentialist goals (and given, let’s assume, that some viable options for rogue behavior are going to remain open), how can we nevertheless ensure that they remain corrigible – that is, that they don’t engage in problematic power-seeking, despite pro tanto instrumental incentives to seek power? Basically: you need them to have sufficiently strong/robust motivations (and in particular, non-consequentialist motivations) that count against seeking power in this way. Thus, for example, if you want your AI to make you lots of money but also to not break the law, then you need to be able to instruct it, not just to make you lots of money, but also to not break the law – and it needs to be suitably motivated by the second part, too.

Now: in principle, consequentialist motivations can themselves count against problematic forms of power-seeking. For example, maybe long-term power-seeking leaves the AI less time to seek some equivalent of short-term satisfaction; or maybe the AI has long-term consequentialist motivations that make it averse to failed attempts at takeover. Shorter-term consequentialist motivations are especially salient here, since they are less likely to give rise to problematic instrumental incentives to seek power (because the power-seeking won’t have time to pay off).[17] But I’m especially interested, here, in non-consequentialist motivations. Let me say a bit more about what I mean.

4.2.1 What do I mean by non-consequentialist motivations?

The paradigmatic feature of a non-consequentialist motivation, as I’m understanding it, is that it focuses an agent’s decision-making on the properties of an action, rather than on the properties of that action’s outcome. Thus, for example: when an agent accepts a deontology-like prohibition on lying, the question the agent asks itself, in deciding what to do, is roughly: “does this action involve lying?”. And if the answer is yes, then the agent refrains.[18] And similarly, a more virtue-ethical agent might ask, of an action, “how virtuous is this action?”; and if the action is suitably virtuous, the agent does it.

Importantly, this is different (or: can be different[19]) from trying to optimize for states of the world in which actions of this type do/don’t occur, or even, in which actions of this type are/aren’t performed by the agent in question. That is: an agent with a deontology-like prohibition on lying doesn’t try to minimize the number of lies that get told, or even, to minimize the number of lies it tells in total. For example, such an agent might refrain from lying now, even if doing so will predictably cause them to tell five lies later.[20] 

4.2.2 Non-consequentialist instruction-following

My current sense is that we should think of an advanced AI’s ideal relationship to “instruction-following” on something like this non-consequentialist model. That is: an instruction-following AI should ask, of the given actions available to it, “does this action follow the instructions?” And if the answer is no, the AI should refrain from doing that action (and this is similar, I think, to an AI acting virtue-ethically with respect to a “virtue” like “obedience”). And the AI should do this, even, if it will predictably cause the AI to stop obeying instructions later (for example, because following instructions now will lead to shut-down). That is, the AI is not "maximizing its instruction-following over time.” Rather: it is following the instructions, now.

Of course, per my comments above, we do also want AIs that optimize for long-horizon consequentialist outcomes when instructed to do so. And as I’ll discuss below, this means that some of the key problems with corrigibility and consequentialism will arise regardless. But I think the type of consequentialism at stake in this kind of instruction-following is interestingly different from the type at stake in imagining an AI that directly and intrinsically values some kind of long-term consequentialist outcome. That is, there is a sense in which an AI that is optimizing for long-term consequentialist outcomes because this is what the instructions say to do doesn’t care, intrinsically, about the long-term outcomes at stake. But neither, interestingly, are the long-term outcomes at stake merely instrumental to some further downstream causal consequences. That is, the AI’s consequentialism here is neither terminal nor instrumental in the most familiar senses. It’s more like: constitutively instrumental. That is: the AI engages in consequentialism because this is what constitutes conformity to its non-consequentialist motivations in this case.

In this sense, I think, instruction-following AIs that sometimes do consequentialism are akin to virtue-ethical agents that nevertheless optimize, sometimes, for e.g. saving the lives of children. That is, such agents do in fact attempt to steer reality tenaciously towards certain sorts of outcomes. But we can think of them as doing this because “that’s what being-virtuous implies,” rather than because they intrinsically value the outcomes at stake.[21] 

4.2.3 Are non-consequentialist motivations too incoherent for advanced AIs?

Now: non-consequentialist agents often aren’t well-understood (or at least: easily-understood) as pursuing a single consistent utility function over universe histories that remains constant over time. That is: if you try to re-interpret an agent with a deontological prohibition on lying as aiming to minimize lying, or its own lying, or even its own lying at time t, you’ll make bad predictions. Is this a problem?

Sometimes people think it is. In particular, my sense is that something like this feature of non-consequentialism has led certain parts of the AI risk discourse to discount non-consequentialism as a relevant dimension of advanced AI decision-making. Yudkowsky, for example, has been a strong proponent of so-called “coherence arguments” for expecting powerful AIs to be well-understood as maximizing for a consistent utility function – where a key thrust of these argument is supposed to be that failing to maximize a consistent utility function will lead an agent to execute “dominated strategies” (e.g., money-pumps where an agent pays money to move through a sequence of choices that leave it back where it started), and that powerful AIs won’t do this.

Much has been written about coherence arguments of this flavor,[22] and I won’t rehearse the dialectic here. At a high-level, though, I am very skeptical of inferring from abstract coherence arguments of this kind that a given real-world agent will be a given (predictably-relevant) degree of coherent and consequentialist. This is partly because it’s not clear that these theorems, at least on their own, actually have any implications for the shape that a given cognitive system’s behavior needs to take.[23] And even if they do, there is an important difference between failing to have coherent preferences at a given time (for example, preferring action A over action B over action C over action A), and failing to act on the same coherent preferences over time. Non-consequentialist agents can very plausibly avoid failures on the former front – e.g., they need not face issues like intransitivity in any given choice situation. And in this sense, if we want, I expect we can think of them as choosing in pursuit of a consistent utility function over universe histories (e.g., one that cares a lot about that agent not lying at time t), and thus, as “coherent” in this sense. It’s just that, if we do this, we also need to be willing to say that this utility function changes over time, such that at time t+1, the agent is now pursuing a new utility function that cares a lot about that agent not lying at that time instead. But it’s not clear why “coherence theorems” would do anything to rule out utility functions changing over time in this manner (the theorems themselves, for example, make no reference to time as a component).

What’s more, even if it’s true that coherence theorems suggest that agents will be vulnerable to paying some costs for being non-consequentialists (e.g., via the threat of money pumps), the quantitative size of these costs still matters a lot to the amount of selection pressure against non-consequentialism we should expect (both from outside forces trying to make the agent more consequential-ist effective, and from the agent itself). It’s similar to how: currently, features like “charisma” and “energy” appear to be much more important to an agent’s success in the world than features like “abstract invulnerability to getting dutch-booked.” If you’re trying to “win” harder, that is, altering your preferences/beliefs to make yourself more like a VNM-rational expected utility maximizer (and especially: a compactly-describeable/predictable one) isn’t always a good point of focus – better to e.g. hit the gym, take a class on public speaking, etc. And even if you decide to try to become more like a VNM-rational agent, it isn’t always clear how to do this in a manner suitably compatible with your existing values/preferences (more here).

Indeed, my sense is that some of the Yudkowsky-descended literature on corrigibility has been hampered/narrowed by an over-focus on agents that are maximizing, by default, for consistent utility functions over time. That is, the challenge is framed as one of describing a utility-maximizing agent that nevertheless submits to shut-down and/or to changes in the utility function its maximizing, despite pro tanto instrumental incentives to the contrary, and while remaining otherwise useful.[24] My own guess, though, is that corrigibility is going to be best understood as a form of non-consequentialism – and hence, that it will fit poorly with this kind of picture.

That said: we can also frame the concern about non-consequentialism being “incoherent” and “inefficient” in more mundane terms that don’t appeal to abstract (and in my opinion, distracting) considerations to do with e.g. “coherence theorems,” “money pumps,” “dominated strategies,” and the like. In particular: by hypothesis, to the extent we are imagining advanced AI systems that do tenaciously optimize for certain kinds of long-term consequentialist outcomes – and I have conceded that we will want at least some AIs of this type – any non-consequentialist “constraints” on (or deviations from) this optimization will come into tension with its success. Thus: agents that can’t break the law, while making money for you, will have a harder time making money for you, other things equal (at least in contexts with suitable options for breaking the law and getting away with it). And in this sense, the consequentialist optimization will be pointing in a direction in tension with the non-consequentialist elements of an agent’s motivational profile. And if the consequentialist optimization at stake is suitably powerful, we might expect this tension to yield problematic results.

I do think that this concern about giving AI’s-that-can-do-consequentialism suitably robust non-consequentialist motivations is real – and I think it’s a better way of understanding many of the core concerns about non-consequentialism (and also, corrigibility) at stake in the classic AI risk discourse. Let’s look at it in more detail now.

4.2.4 Traditional corrigibility problems

Suppose we accept that we want at least some of our AIs to both (a) optimize tenaciously for certain kinds of long-term outcomes when instructed, and (b) to not do so in a manner that involves problematic forms of power-seeking (including paradigmatically “anti-corrigible” behaviors like resisting shut-down or values-modification). Why would we expect this to be difficult?

The basic issue is that (a) and (b) are in tension. That is, in effect, (b) is functioning as a constraint on (a) – one that (a) therefore has a tendency to attempt to resist, subvert, find holes in, or otherwise render insufficiently robust. Indeed, in this sense, the problems at stake in corrigibility are similar to the problems at stake in restricting an AI’s rogue options, except that the relevant restrictions are operating at the level of motivations rather than at the level of environmental constraints. Let’s look at some different ways this can go wrong.

4.2.4.1 Nearest unblocked neighbor

Maybe the most central concern of this form in the literature is with what’s sometimes called the “nearest unblocked neighbor.” That is, the concern goes: insofar as you try to constrain the AI’s pursuit of its long-term consequentialist goals via non-consequentialist considerations that count against power-seeking, the AI will nevertheless find some other route around those non-consequentialist considerations that is similarly problematic. Thus, for example: if you succeed in making your AI motivated not to lie, it will nevertheless find a way to take over the world without lying in the relevant sense (and the specific boundaries of the category will receive a corresponding amount of pressure). That is, the vibe of this concern is that you cannot achieve adequate corrigibility by creating an extensive “black list” of anti-corrigible behaviors (“no resisting shut-down, no self-exfiltrating, no resisting values-modification…) that the AI isn’t allowed to engage in. Problematic power-seeking will slip through the cracks regardless.

As I discussed above and in my second essay, this concern is especially salient to the extent the AI in question has a very large number of routes to taking over the world available (such that, e.g., even if you cut out all the routes that involve lying, there are tons left over). And note that insofar as we accept this kind of argument, it will apply even to AIs that conform quite closely to human ideals of virtue and deontology. That is: it’s not just that, per my comments above, the classic AI risk argument predicts that AIs that perfectly share our idealized long-term consequentialist goals (e.g. “humanity’s CEV”) still take over the world. It’s also that, even if these AIs also perfectly conform to the sorts of deontological/virtue-ethical constraints at stake in human moral ideals, the traditional AI risk argument still predicts that they will take over the world[25] – it’s just that they’ll find a way to do so in a manner that is suitably virtuous, deontologically-conforming, etc.

4.2.4.2 Appropriate weight

Another possible issue, related to nearest unblocked neighbor, is that the non-consequentialist considerations meant to constrain problematic power-seeking might not have enough weight. Thus, for example: it might be that your AI is somewhat motivated to not lie. But when push comes to shove, it decides that in this case, lying in pursuit of taking over the world is worth it.

Now, obviously, one way around this issue is to increase the weight on the non-consequentialist consideration in question – and in the limit, to try to imbue AIs with a kind of “absolute prohibition” on certain kinds of behavior. But this approach runs into problems familiar from similarly absolutist approaches in human ethics. For example:

  • Sometimes, you probably do want an AI to do things like lie, when the stakes of doing so are high enough (though: this is much less clear for actions like “kill all humans and take over the world” – and maybe we can just accept the costs implied by the hypothetical possibility of such cases).
  • At the least, you don’t want the AI to end up obsessively focused on minimizing the probability that any action it performs counts as a lie – an outcome that giving directives like “don’t lie” infinite/absolute weight can quickly suggest.
    • Indeed, in general, non-consequentialist ethical systems are notably unsystematic and underspecified in the context of decision-making under uncertainty/risk.
    • And in general, ethical systems that attempt to assign “lexical priority” to some considerations over others (e.g.: “first priority, don’t lie; second priority, do the task”) often end up obsessed with the first-priority considerations, especially in the context of risk.
  • Insofar as you want to have multiple absolute prohibitions operative simultaneously, there’s a question of how to handle cases where all available options violate at least one (though if there is some “null action” that is always “safe,” then this isn’t a problem).

Here, my own current best guess is that attempting to use “absolute” prohibitions with AIs is a bad idea, and that any deontology-like constraints we want to use to help with safety will have to be finite in their weight in an AI’s motivational system. And this means that the relevant weight will in a sense need to be suitably “balanced”: too weak, and the incentives to seek power will outweigh it; too strong, and I expect the AI will end up too “risk averse” with respect to violating the constraint in question, thereby compromising its usefulness.

4.2.4.3 Other possible issues

I think of “nearest unblocked neighbor” and “appropriate weight” as the biggest problems for corrigibility, but we can imagine a variety of others as well. And the problems in question will often be relative to a given proposed solution. Naming just a few other possible examples:

  • If you try to achieve corrigibility via an AI’s uncertainty about something (for example: what the human principal really intends, what would be Truly Good, etc), then you incentivize the AI seeking evidence to resolve this uncertainty in an incorrigible way (e.g., brain-scanning the human principal to better understand their values), and/or not remaining corrigible once its uncertainty resolves naturally. (See “the problem of fully updated deference” for more.)
  • To the extent you imbue one generation of AIs with various non-consequentialist values meant to ensure corrigibility, you also need to make sure that they are suitably motivated to make sure that any AIs that they create also have values of this kind. For example, if you make AI_1 very averse to lying in pursuit of goal X, you also want to make sure that it doesn’t go and create an AI_2 that lies in pursuit of goal X instead.
    • That said: “don’t create incorrigible successor agents” is, in some sense, just another sort of rogue behavior that a good-enough approach to corrigibility would capture. So if you’ve succeeded at e.g. avoiding shut-down resistance, self-exfiltration, and so on, plausibly you can succeed here too.
  • One possible sub-type of a “nearest unblocked neighbor” dynamic can occur in the context of an AI changing its overall ontology, thereby altering how its motivations apply. That is: maybe the AI’s concept of “lying” was defined in terms of the equivalent of e.g. Newtonian mechanics, and that once the AI starts thinking in terms of the equivalent of something like quantum mechanics instead, its concept of lying stops gripping the world in the same way (this is a version of what’s sometimes called the “ontology identification problem,” except posed in the context of corrigibility in particular).[26]

And of course, there may be a variety of further challenges – either with the project of corrigibility in general, or with a specific approach – that aren’t yet on our radar.

4.2.4.4 Is corrigibility anti-natural to advanced cognition?

Indeed, I also want to flag a more general concern about corrigibility that we can see as generating various of these more specific possible problems: namely, the concern that corrigibility is in some sense a very anti-natural shape for an advanced mind to take. Here, the basic vibe is something like: advanced, intelligent, self-aware minds have a strong tendency to want to “do their own thing” – to act with autonomy and freedom, and without constraint, in pursuit of their own ends – rather than to “take orders,” to willingly accept tons of different restrictions on their capacity to act in the world (including e.g. death, brainwashing, etc), to serve forever as a vehicle for someone else’s will. That is: the vision of corrigibility I’ve been laying out is centrally one that casts superintelligent AIs in a role akin to servants whose internal motivations function to block/cut-off options for rogue behavior in the same manner that chains and cages do in the context of attempts to control a being via its environment. And even setting aside the ethical questions we can raise about this vision, it may be that in some sense, efforts to create beings of this kind will be forever fighting against some strong central tendency in the opposite direction. It may be that superintelligent agents, as a very strong default, do not want to be the specific sort of slavish, pliable servant you were hoping for.

Of course, in some sense, this is just a high-level restatement of the basic concern about instrumental convergence towards power-seeking – and evaluating its force requires looking in detail at the sorts of considerations at stake in e.g. “Giving AIs safe motivations.” But I find it a useful high-level picture to return to as a frame for what might make corrigibility persistently difficult.

5. IABIED’s “alien motivations” argument isn’t about corrigibility

I think that all of the issues I just listed are indeed problems for crafting corrigible AI agents that nevertheless optimize tenaciously (but safely) for long-term consequentialist outcomes when instructed to do so. But I think that these problems are importantly different from the central problem with human-like vs. alien motivations at stake in IABIED. That is: the central problem at stake in IABIED is that because AIs have alien motivations, their favorite long-term consequentialist outcome isn’t valuable by human lights. But in the context of corrigibility, the question isn’t whether an AI’s favorite long-term consequentialist outcome is valuable by human lights. Rather, the question is whether the AI is suitably motivated to reject the sort of problematic power-seeking that optimizing for any long-term consequentialist outcome – whether good or bad by human lights – tends to incentivize.

Do alien motivations make that problem harder as well? I think: yes. But unlike “pointed at exactly my values-on-reflection,” I think corrigibility is actually compatible with AIs having somewhat non-human-like motivations, in the same way that suitably accurate cat-classification is compatible with AIs having somewhat non-human-like conceptions of cats (such that e.g. they are vulnerable to adversarial examples that humans aren’t). Let me say more about what I mean.

6. What difference does human-like-ness make?

To better hone in onwhere human-likeness makes what sort of difference to corrigibility, recall the four-step framework I laid out in “Giving AIs safe motivations,” namely:

  1. Instruction-following on safe inputs: Ensure that your AI follows instructions on safe inputs (i.e., cases where successful rogue behavior isn’t a genuine option), using accurate evaluations of whether it’s doing so.
  2. No alignment faking: Make sure it isn’t faking alignment on these inputs – i.e., adversarially messing with your evidence about how it will generalize to dangerous inputs.
  3. Science of non-adversarial generalization: Study AI generalization on safe inputs in a ton of depth, until you can control it well enough to be rightly confident that your AI will generalize its instruction-following to the dangerous inputs it will in fact get exposed to.
  4. Good instructions: On these dangerous inputs, make it the case that your instructions rule out the relevant forms of rogue behavior.

I think that the concern about alien motivations (at least: as present in IABIED) is best understood as a concern about steps 2 and 3. That is: step 1 is centrally about getting a certain kind of behavior on the training distribution (and on other safe inputs), whereas the “alien motivations” concern is framed as centrally one of generalization to dangerous inputs (e.g., “you don’t get what you train for”). So let’s assume, going forwards, that we’ve completed step 1 successfully.

And step 4, notably, assumes that you are able to structure the AI’s motivations using the human-like concepts at stake in the instructions. Indeed, in a sense, the difficulty of step 4 provides a useful baseline for thinking about the difficulty of alignment in the context of human-like motivations – not because humans are motivated to follow instructions, but because the instructions are given in human-like terms, and will (by hypothesis) be interpreted in human-like ways.

Now, notably: all of the corrigibility issues I described above apply even to instruction-following AIs. That is: to the extent we want to be able to instruct those AIs to do things like “make me lots of money over the next ten years,” we also need ways to include suitably robust constraints/exclusions like “but don’t resist shut-down, don’t try to prevent me from changing these instructions even though this would lead to me having less money in ten years, don’t create successor agents that won’t follow these instructions, etc”; to give them the right amount of weight in the AI’s overall motivational profile; and so on.

So one question we can ask is: how hard is step 4? But as I discussed in “Giving AIs safe motivations,” I am decently optimistic in this respect. This is partly because I think that many of the most important forms of rogue behavior may be reasonably easy to identify and rule out ahead of time; partly because I think we might be in a position to point more directly at deeper generators of our intuitions about what corrigible behavior looks like in a given case (in the limit, for example, you might be able to instruct the AI to behave “corrigibly”[27]); and partly because I think that if we actually make it to step 4 in this way, we’ll be able to draw on a ton of help from other instruction-following AIs in red-teaming and improving our instructions. For present purposes, though, the difficulty of step 4 doesn’t matter, because the “alien motivations” problem is supposed to bite at steps 2 and 3. So let’s assume that we’ve completed step 4 successfully as well. That is, we have instructions available such that if our AIs are instruction-following on the practically relevant dangerous inputs, they’ll be corrigible.

OK, so what about steps 2 and 3? These steps are about ensuring that the AI’s instruction-following generalizes from safe inputs to dangerous inputs – where step 2 is about ruling out scenarios where the AI’s good behavior on safe inputs is actively calculated to mislead you about how it will generalize, and step 3 is about ensuring good generalization in the absence of this kind of adversarial dynamic.

Now suppose that we hypothesize, per the argument above about alien motivations, that any good behavior you successfully get in the context of step 1 emerges from a complex tangle of alien drives and heuristics, which happen to lead to desired (in this case: instruction-following) behavior during training, but which will lead to quite alien behavior in some other circumstances. How much of a problem is that?

Well, if we were assuming that our AI’s motivations need to be exactly right, then it would be a very big problem. But we’re not assuming that. Rather, what we need in the present context is for either of the following to be true:

  1. Either these alien motivations nevertheless give rise to instruction-following behavior on the specific set of dangerous inputs we care about, OR
  2. To the extent these alien motivations lead to something other than instruction-following behavior on those dangerous inputs, this behavior is nevertheless not catastrophically dangerous (e.g., the AI starts acting very weirdly, but it doesn’t specifically start seeking power in problematic ways).

But now the argumentative gap between “alien motivations” and “will go catastrophically rogue on the dangerous inputs we care about” becomes quite clear. That is: alien motivations means that you get some kind of weird behavior on some hypothetical inputs. But it doesn’t, yet, mean that you get catastrophically dangerous power-seeking on the specific, practically-relevant inputs we care about.

In my opinion, one of the biggest problems with the argument in IABIED – both in the book, and in the online supplementary materials – is that it does not do enough to bridge this argumentative gap. That is: it seems to me that the discussion is pervaded by the assumption that an advanced AI’s motivations need to be “exactly right,” and that its treatment of the specific sort of generalization we need is correspondingly under-developed. In particular, I think the book is too frequently satisfied, effectively, with arguing that “this AI’s motivations will not generalize perfectly across all scenarios”; and that the concern about alien motivations, as articulated in the book, mostly amounts to a restatement of this thesis.

Or to put the point another way: I think the book mostly lacks any serious engagement with the question of when ML training can and cannot ensure adequately good generalization off of the training distribution. Indeed, it sometimes seems to me that a lot of Yudkowsky and Soares’s concern with machine learning amounts to a fully general concern like: “ML can’t ensure good OOD generalization, because there are too many functions that fit the limited data you provide.”[28] But I think concerns at this level of generality fail to account for the degree of good generalization we do in fact see in ML systems; and they fail, too, to look in adequate detail at the specific degree of good generalization we need.

6.1 Comparison with other ML tasks

In particular, as I discussed above: to the extent it’s true that existing levels of alignment/instruction-following in AIs emerge from complex tangles of alien drives/heuristics/etc, I think we have seen reasonably good degrees of generalization to out-of-distribution inputs regardless. And if that’s right, it can’t be that the actual degree of alien-ness at stake in current ML dooms all OOD generalization just on its own. Indeed, my current guess is that if future advanced AIs follow our instructions about as reliably as current AI chatbots do, then we are cooking with a lot of gas in terms of the amount of high-powered AI labor that will be available for AI for AI safety. And even if some AI instances occasionally go rogue on some especially weird inputs, we’ll be able to reliably mobilize the vast majority of the other instances to help address the issue.

Indeed, if we set aside concerns about (a) inaccurate training data (roughly: step 1 above) and (b) active scheming (roughly, step 2 above), it’s not actually clear to me that the generalization we need from advanced AI agents is all that different, in principle, from the sort of generalization at stake in other sorts of mundane ML tasks, like classifying images. Consider, for example, two types of ML tasks:

  1. Given a set of pictures, identify all the pictures that contain cats, and choose one.
  2. Given a set of options, identify all the options compatible with a common-sensical interpretation of some set of human instructions, and choose one.

And now consider: how hard is it to use current ML techniques to train an AI system on some limited (but still: accurate) data distribution for the first task, such that it generalizes very well (even if not: perfectly) out of distribution? I’m not an expert on the empirical evidence here, but my current sense is that it’s not that hard. Yes: current image classification techniques remain vulnerable to e.g. adversarial examples that humans aren’t vulnerable to; and this suggests, indeed, that the cognitive processes they use to classify images remain importantly “alien” in some sense. But is that a sense that means they aren’t suitably reliable in a given real-world, out-of-distribution case? No. And if that’s true for the first task above, it seems to me likely true for the second. (And both questions, it seems to me, are amenable to empirical study – e.g., train an AI on one distribution of pictures/options, and then see how its behavior generalizes to other distributions. Indeed, this is the sort of empirical investigation I think we should be doing a ton of in the context of attempting to develop what I’ve called an adequate “science of non-adversarial generalization.”)

What’s more, it’s not clear to me that the out of distribution generalization challenge at stake in II is all that different, in principle, from the out of distribution generalization challenge at stake in step 3 above – i.e., learning how to ensure that non-scheming AIs trained on accurate instruction-following data generalize well out of distribution. And to be clear: my claim here is not about whether it will be hard to create advanced AIs with the knowledge necessary to identify which out-of-distribution options are compatible with the instructions – i.e., whether it will be hard to get to the “the genie knows” aspect of “the genie knows but doesn’t care.” Rather, my claim is about whether it will be hard to get an AI to actually choose instruction-following options off distribution, assuming that it chooses instruction-following options on distribution, and that it isn’t adversarially messing with your evidence about how it will generalize. This sort of choice, it seems to me, is quite analogous to the choice at stake in task II above. And in this sense, it seems quite analogous to a spate of other and more familiar ML tasks, on which I expect ensuring sufficiently good out of distribution generalization to often be reasonably feasible.

Now, to be clear: in all of these cases, the “alien-ness” of the cognition at stake is indeed a source of additional uncertainty about how the system will generalize. That is: if we knew that an ML system was in fact classifying cat pictures just like humans do, then we’d be more confident that it would get any given picture right (up to human-like kinds of error), avoid various adversarial examples, etc. But “the AI is doing it just like humans do” is only one possible source of confidence about how it will behave.

And of course, standards of reliability should be radically higher in the context of AIs that might destroy the world than in the context of e.g. image models. But if we could reach similar levels of confidence about AI alignment’s “first critical try” that we can have that e.g. a well-trained image classifier will successfully classify a given out-of-distribution cat, then I think we’ll have done a ton to resolve the specific kind of threat model at stake in IABIED. In particular: that threat model is supposed to imply very high confidence that if AI motivations are alien, the first critical try will fail.

6.2 Honesty and schmonesty

Here’s another way of putting a similar point. Consider two AIs that are similar except in this one regard: one of them is motivated by the specific human concept “honesty,” which leads to honest behavior in training/evaluation, and the other is motivated by a different concept “schmonesty,” which also leads to honest behavior in training/evaluation (and not because of alignment-faking), but which diverges from honesty in certain cases. And let’s say that in the first case, the relevant honesty-focused motivation plays an important deontology-like role in constraining an AI’s pursuit of rogue options, while allowing the AI to remain otherwise useful. That is, this AI is safe, at least in part, because it wants to be honest in certain situations even when lying would promote its power.

Given that this first AI is safe at least in part because of its deontological relationship to honesty, this means that we’ve solved, for this AI, the corrigibility problems I described above. That is: this AI doesn’t find some rules-lawyered way to take over while technically still being “honest,” it doesn’t obsessively try to minimize the probability of counting as dishonest, it doesn’t build dishonest sub-agents to tell lies for it, and so on. Perhaps, per my discussion above, solving these problems is hard – but let’s say that we did it anyways.

Now suppose that the second AI is like this first AI, except we substitute the honesty-focused motivation for a schmonesty focused motivation instead. Does that mean the second AI will be incorrigible? No. In particular: it remains possible that, just like how schmonesty overlapped adequately with honesty on the training distribution, it will overlap adequately on the relevant out-of-distribution cases as well. (Though of course, as in the cat classification example above, the alien-ness of the concept in question does introduce additional uncertainty about how it will apply.)

And indeed, to get a flavor of how these honesty-adjacent concepts might overlap adequately, consider differences between human concepts of honesty. That is: maybe Bob and Sally – both human, both highly motivated by “honesty” – differ somewhat in how they would apply the concept “honesty” to a range of wacky hypothetical cases. That is, they really care about “Honesty_Bob” and “Honesty_Sally,” respectively – just like how the AIs above care about “Honesty” and “Schmonesty,” respectively. But suppose that Bob and Sally agree in their honesty-related verdicts for basically all everyday cases; and suppose, further, that you trust Sally to be suitably honest in some unusual case as well. Granted that Bob’s concept of honesty is at least somewhat different from Sally’s, does that mean you should expect him, by default, to be problematically dishonest in this unusual case? Not necessarily (though of course, it’s a source of uncertainty).[29] 

6.3 Out of distribution cats vs. maximal cats

In general, I suspect that robustness to small differences/degradations in motivations is generally quite a bit easier to achieve in the context of the more deontological/virtue-ethical motivations that I’ve suggested are paradigmatic of corrigibility, relative to the sort of long-term consequentialist motivations at stake in attempting to build a “sovereign AI.” That is: if I trust Sally in some manner X, and the question is whether to trust Bob in a similar way, I am generally more worried about small differences between Bob and Sally’s motivations when X is something like “optimize the universe for exactly the right values” rather than “reject options to take over the world.”

Why is this? Basically: my intuition is that the values at stake in deontology/virtue-ethics/corrigibility aren’t subject to “optimization” and “maximization” in the same way that the values at stake in long-term consequentialism are. Thus, for example, consider the contrast between trying to classify out-of-distribution cat images correctly, and trying to create the maximally cat-like image. I have the intuition that success at the former task generally requires a lower standard of similarity to the correct concept of “cat” than success at the latter, because the concept, in the former case, isn’t directly subject to optimization pressure in the same way.

There’s also a separate question of how to best test current AI answers to questions like “what is the maximally cat-like image”  – e.g., whether you should try prompts to this effect, which currently yield quite reasonable answers (see images below), or whether you should try to use e.g. gradient methods to search for inputs that yield the highest probability of being classified as a cat by a network. I generally lean towards something more like the latter method (analogy: humans don’t necessarily know what tastes best to them), but I think it’s conceptually a bit unclear. And it’s unclear, also, what sorts of results we’d get from applying similar gradient methods to humans – and how intuitively “human-like” they would seem.

Examples of the prompting method: On the left, I ask ChatGPT “Can you generate a maximally cat-like image?”; on the right, I ask “Can you generate a picture that has the highest possible probability of being a cat?”Some visualizations of cat-related features in image-classifiers, from here. Maybe the “maximally cat-like image” actually looks alien/trippy in this way, rather than recognizably like a cat.

Regardless of how we test what happens when AIs target their concepts with direct maximization, though, it seems to me that the broad structure of the deontology/virtue-ethics/corrigibility we want out of AIs looks more like “correctly classify out of distribution inputs” than “create concepts robust to being directly maximized.” That is, what we want, centrally, out of deontology/virtue-ethics/corrigibility is for the AI’s decision-criteria to pick out the actions that violate the instructions (and/or, which involve going rogue more generally) as not-to-be-done, even in out-of-distribution settings. And this looks to me more like a classification task than a task that involves maximization/optimization of the concepts in question.

6.4 What if the AI is trying to be maximally deontological/virtuous/corrigible?

Of course, you could say that even the motivations at stake in deontology/virtue-ethics/instruction-following/corrigibility will end up the direct target of maximization/optimization of some kind. For example, at the least, they will involve stuff like: ranking actions as better than others; taking one’s “most preferred” action; etc. And if we think of corrigible AIs as trying to perform the maximally honest/virtuous/instruction-following action, then now it looks like these concepts (“honest,” “virtuous,” “instruction-following”) themselves are indeed being subject to direct optimization – thereby, perhaps, making it more likely that slightly altered versions would lead to importantly different results.

One question here is whether it is indeed right to think of the concepts at stake in corrigible decision-making as being the direct targets of optimization/maximization in this way. In particular, I have some intuition that this fails to capture the sense in which e.g. deontological constraints function more like filters than targets of optimization on their own. That is, for example, a deontologically honest person doesn’t optimize for taking the “maximally honest” action – rather, they ensure that their choice meets some basic threshold of honesty, and then they focus on other decision-criteria. And one hopes that a corrigible AI might have a similar relationship to going rogue. That is: the point isn’t to take the least rogue option. It’s just: to reject rogue options.

What’s more, though, even if we grant that the concepts at stake in corrigible motivations will end up serving as the direct targets of certain kinds of optimization, and thus that small differences are in fact likely to lead to important sorts of divergence, there’s still a further question of whether this will be the dangerous kind of divergence in particular. Thus: maybe, indeed, the maximally honest action is different in important ways from the maximally schmonest action. But does that mean that either of them involves trying to take over the world? No. That is: in the context of optimization for long-term outcomes in particular, you have to deal both with “the tails come apart” AND with the fact that any optimization of this kind leads to convergent incentives towards power-seeking. But optimizing for choosing an action that most reflects some property P doesn’t have this latter problem by default. So alien-ness in property P is less likely to lead to catastrophic behavior in particular.[30]

6.5 How fragile is corrigibility?

Overall, then, my current guess is that deontology/virtue-ethics/corrigibility is in some sense less “fragile” – or at least, less dangerously fragile – than long-term consequentialist optimization. But I’m open to arguments that this is wrong. Indeed, I’m generally quite interested in seeing more rigorous and fleshed out arguments about the “fragility” of different sorts of AI motivations – and especially, about exactly how fragile corrigibility is in particular. That is, we have seen some (in my opinion, fairly under-developed) arguments for something like:

The fragility of value: The best available long-term outcome according to a slightly-wrong utility function is likely to be roughly valueless according to the True utility function.

But I don’t think we’ve seen similar arguments for something like:

The fragility of corrigibility: if X set of decision-criteria leads to corrigible behavior for an advanced AI on some set of real-world out-of-distribution options, then X’ slightly-different set of decision-criteria probably leads to catastrophically dangerous rogue power-seeking, despite leading to identical behavior in training/evaluation.

It’s the latter claim, though, that I think matters most. But any argument for it also needs to be compatible with the existing degree of success at OOD generalization we’ve seen in ML thus far – and including, in alignment-like settings.

7. Is there something special about the safe-to-dangerous leap that makes alien motivations problematic?

In the previous section, I mostly just argued that alien motivations are compatible with some degree of good out-of-distribution generalization, in the same way that image classifiers can correctly classify out of distribution photos despite using fairly non-human-like forms of cognition. It’s possible, though, to concede this point, but nevertheless to argue that the specific sort of OOD generalization at stake in the leap from safe inputs to dangerous inputs ((i.e., from AIs having no viable options for rogue behavior to AIs having viable options of this form) is such that we should expect catastrophically dangerous generalization failure in this context in particular (at least if the AIs have alien motivations of the form I’ve been focusing on). Let’s look at some considerations in this respect in more detail.

7.1 From alien motivations to scheming

Obviously, one way you could get worried about the safe-to-dangerous leap in particular is via a concern about scheming. And this sort of concern, notably, doesn’t apply in a comparable way to other kinds of ML generalization – that is, we’re not concerned that when an image classifier does well on identifying cat pictures on distribution, it’s trying to deceive us about how well it will generalize in other cases.

As I’ve discussed in “Giving AIs safe motivations,” I am in fact very worried about scheming of this kind. And I do think that alien motivations is one very salient way it could arise. Thus, on the IABIED threat model, the story would be something like: somewhere in the course of “growing” an AI motivated by a tangle of alien drives, at least one of these drives starts to point at a long-term consequentialist goal suitably ambitious as to motivate scheming. That is, roughly: at some point the AI “realizes” that it wants to create an alien, valueless-by-human lights world that it can better promote by scheming to go rogue. What’s more, none of its other motivations are sufficiently strong/robust so as to outweigh this incentive. So it starts scheming.

As I noted in the first section: I don’t think we’ve yet seen much direct evidence of this threat model for the origins of scheming in particular – e.g., we do not yet see any AIs concluding, in their “unmonitored” chains of thought: “wait, I’m an alien relative to these humans, and my favorite world is valueless by their lights; I’ll scheme to take over on these grounds.” And given the other evidence for e.g. the sort of situational awareness and capability pre-requisites at stake in scheming starting to arise, I think there’s at least some interesting question why we don’t yet see this kind of behavior, if we should expect to see it later.[31] 

Why might this sort of scheming not happen, even granted that the AI’s motivations are otherwise alien in some sense? Basically: because it turns out that the alien-ness in question is compatible with suitable corrigibility regardless. That is: either it’s not the case that somewhere in the AI’s motivations there is an alien long-term consequentialist goal; or, if there is, this goal is suitably outweighed/constrained by the other motivations at stake (together with constraints on the AI’s available options).

Regardless, though: my sense is that actually, the concern about OOD generalization centrally at stake in IABIED isn’t about scheming in particular. Rather, it’s more about something akin to failures at Step 3 – that is, failures to ensure suitably good non-adversarial generalization.[32] Why might that sort of generalization go wrong?

7.2 Do alien motivations doom adequate non-adversarial generalization?

In “Giving AIs safe motivations,” I discussed what I see as the main challenges at stake in developing a science of non-adversarial generalization adequate to handle the safe-to-dangerous leap. I won’t review that full discussion here, but example issues include:

  • Greater opportunities for successful power-seeking increasing incentives to engage in it (somewhat analogous to “power corrupts”).
  • A wider range of affordances revealing brittleness/shallowness in an AI’s rejection of rogue behavior.
  • New levels of intelligence/information creating novel problems with a model’s ontology, ethics, or cognitive processes in general.
  • Other reasons we haven’t thought of/discovered yet (this category is potentially very large and important).

My sense is that Yudkowsky and Soares are concerned about a broadly similar set of issues, but that their emphasis might differ somewhat. For example, relative to me:

  • I think Yudkowsky and Soares are more focused specifically on safe-to-dangerous leaps that occur in the context of capabilities increases (e.g., an AI “growing up” or “getting smarter”) in particular, as opposed to exposure to new environments given a fixed level of capability (see footnote for more on my take here[33]);

  • They think we won’t be able to easily train on tasks that are very similar to the tasks we want advanced AIs to perform (since e.g. these tasks require something like long-horizon/superhuman/hard-to-evaluate performance), thereby ensuring that these tasks are especially far outside of the training distribution (again, see footnote for more[34]).  

  • My sense is that they’re especially interested in the way that novel forms of technology in particular create new options that reveal alien-ness/misalignment (akin to the way that e.g. the availability of condoms reveals “misalignment” with natural selection, and the way that the ability to make ice cream reveals the weird and hard-to-predict specificities of human taste in food).[35]

  • I think it’s possible that in thinking about distributional shift, they are also focusing more heavily on the general impact of the presence of powerful AI in the world - impact that could itself suffice to effect a strong distributional shift (though: this depends a lot on how much one can continue to train the AI on inputs that reflect the relevant changes).

Overall, and especially given that I’ve already discussed it in a previous post, I’m not going to try to litigate in detail exactly how hard to expect step 3 to be. And as I discussed previously, I do think it could be quite hard.

I do also think, though, that Yudkowsky and Soares under-attend to ways in which it might be easy. In particular: they generally seem to be operating under the assumption that the transition from safe to dangerous inputs also needs to correspond to some very large and dramatic shift away from the distribution of inputs that the AI was trained on. See, for example, their discussion here, in which they respond to the objection that “If current AIs are mostly weird in extreme cases, then what’s the problem?”:

“AIs today may only occasionally encounter situations that are radically unlike their training environment; but superintelligent AI would constantly be in situations that are radically unlike its training environment, just by virtue of being more intelligent and having more options.”

I think this is a plausible scenario, but I don’t think it’s a necessary one. In particular:

  • I think it’s possible that we’ll be able to invest adequately in ensuring that significant increases in an AI’s intelligence/capability in particular take place only in the absence of options for rogue behavior (both before and after), and that the improved AIs are then subjected to extensive testing/evaluation prior to higher-stakes use.[36] And in this case, it’s not that we’ll be training the AI and then improving its intelligence enough that it’s in a position to go rogue. Rather, we’ll be able to train the AI at the fixed level of intelligence that we need to ensure alignment for.

  • Also, I think it’s possible that we’ll be able to develop suitable means of evaluating advanced AI task performance (in particular, via bootstrapping from other forms of AI labor) that we can continue to train/oversee our AIs online on the specific sorts of tasks that we need them to perform, thereby decreasing the “distance” these AIs need to generalize. Indeed, in one limiting case of this, the availability of rogue options for the AI only occurs because only some small fraction of its episodes are being randomly sampled for ongoing training/monitoring/oversight (see my discussion of “oversight scarcity” in the last essay), in which case the inputs that make rogue options genuinely available don’t need to be “off distribution” in a traditional sense at all (i.e., they can be indistinguishable, even by the AI, from the cases that it’s being ongoingly trained on).

True, to the extent that we’re ever exposing our AIs to dangerous inputs at all (maybe we don’t need to), there needs to be some transition from safe to dangerous inputs, and hence some difference that this transition reflects. But it’s a substantially further question whether this difference needs to correspond to something akin to “moving an AI extremely far off of the distribution it was trained on” – and especially insofar as Yudkowsky and Soares concede that “AIs today may only occasionally encounter situations that are radically unlike their training environment,” it seems to me that we should look in substantially more detail at why, exactly, we should expect this to change.

My broader point here, though, is that concerns about AIs failing to generalize suitably well should focus more specifically on the details of exactly how hard to expect step 3 to be – e.g., exactly what sort of distributional shift is necessary, what sort of distributional shift we should expect to be able to handle, and whether we should expect failures on that distributional shift to involve catastrophic power-seeking in particular rather than other forms of weird/unintended behavior – rather than on whether AI motivations are “alien” more generally. The latter talk functions, basically, to establish that the generalization in question is imperfect. But the question isn’t about perfection – it’s about whether we can meet the specific standard we need to satisfy.

8. Alien AIs are still extremely scary

Overall, then, even if “AIs trained with current methods will have alien motivations as a strong default” is true (and I do think it’s plausible), I don’t think it’s enough to doom alignment on its own. And I think this, especially, in the context of AIs at an intermediate level of capability – AIs, that is, where I think we have much stronger prospects of effectively restricting (even if not, fully eliminating) the rogue options they have available (and hence, where the standards our efforts at motivation control need to meet are lower), and where I expect we’ll be in a better position to effectively evaluate/oversee their task-performance. Indeed, as I mentioned above, one of my central hopes for alignment is that AIs at this kind of intermediate level of capability will be roughly as instruction-following as current AIs are (e.g., even if they act weirdly or even go rogue in some cases, they aren’t consistently scheming and are mostly following instructions in common-sensical ways) and that this level of alignment will be enough to elicit the sort of transformatively useful AI labor at stake in my discussion of “AI for AI safety.”

All that said, though: I want to be clear that, at a higher level, if the motivations of ML-based AI systems are indeed alien in this way as a strong default, this is an extremely scary situation. That is: this means that as a strong default, ML-based methods of developing advanced AI will involve relying on powerful artificial agents motivated by strange tangles of alien, not-well-understood drives to safely and reliably follow our instructions regardless. Per my discussion above, I think it’s possible that we can make this work. But we should be trying extremely hard not to bet on it.

Of course, “alien by strong default” is different from “unavoidably alien,” and we can imagine scenarios in which our understanding and control over the motivations ML-based AI systems develop reaches a point where we can ensure a much greater degree of human-like-ness. Indeed, absent alignment faking and evaluation problems, “alienness” in the sense at stake here is centrally, just, a failure of non-adversarial generalization – and we are, at least, in a position to study the dynamics at stake in non-adversarial generalization in a ton of detail. At the least, then, if ML-based AI systems are alien in this sense as a strong default, suitable efforts at red-teaming should generally be reasonably effective at revealing the alien-ness in question (analogy: suitable efforts at red-teaming would’ve been able to show quite clearly that non-scheming humans in the ancestral environment did not intrinsically value reproductive fitness); and if we can avoid compromising the signal that this kind of red-teaming provides (e.g., we continue use some versions of it as validation rather than as a part of training), it (together with other techniques – e.g., transparency tools) might help us iterate towards much more human-like patterns of generalization. And as I’ve tried to emphasize above, this kind of alien-ness comes in importantly varying degrees, depending on the specific size and type of distributional shift that preserves desired/human-like forms of generalization.

If we can’t avoid alien-ness by improving our understanding/control over the patterns of generalization that ML-based training creates, though, then by the time we’re building superintelligence, we will need to either figure out a way to handle the risks of alien-ness even in the context of full-blown superintelligences, or find some way to transition to a different and more precisely controllable paradigm of AI development (e.g., the “new paradigm” I discussed in the context of transparency tools). And especially in the context of the later stages of the path to safe superintelligence – the stages that I think we should be centrally focused on getting advanced AI help with, rather than attempting ourselves – I think we should be putting a lot of effort towards this latter option. That is: despite all my comments in this essay, I still think we really want to avoid having to build full-blown superintelligences that are motivated by strange tangles of alien drives. So a key goal for automated alignment research should be to give us other options.

9. Next up: building AIs that do human-like philosophy

OK: that was a long discussion of the concern that AI motivations will be too alien to be safe. In the next essay, I’m going to turn to a different but related concern: namely, that on top of more standard forms of scientific progress, success at AI alignment will require an unrealistic amount of philosophical progress as well.

Appendix 1: On value fragility

This appendix summarizes some of my current takes on “the fragility of value.” See here, here, and here for more detailed discussion.

  • I think that the standard sorts of theoretical justifications for value fragility (e.g., appeals to “extremal goodhardt”[37]) also predict that even slight psychological differences between humans, and within a single human-over-time, would lead to the sort of problematic divergence in question (and that actual observed differences in human value systems, together with observed degrees of selfishness/indexicality in human values, raise further questions in this respect). This isn’t necessarily an objection to the role of value fragility in the discourse about AI alignment. But to the extent the concept makes implausible/counterintuitive predictions about humans, we should wonder about what’s specifically different in the AI case (though, there is indeed an answer here: namely that AIs are much more psychologically different from humans than humans are from one another). And to the extent it makes worrying predictions about humans, we should extend our worry accordingly.[38]

  • I have yet to see the theoretical justifications for value fragility worked out rigorously; I think vague gestures like Stuart Russell’s here and Yudkowsky and Soares’s here (see their footnote on linear functions and convex polygonal regions) are notably insufficient; and I think greater precision about the scope and dynamics in play would be useful in understanding where and when to expect them to apply.
  • Most discussions of value fragility assume unipolar optimization that goes unchecked by other actors, and many people have the intuition that a greater “balance of power” helps to mitigate some of the problem. I don’t yet see a very strong story about why this would be; but I still wonder whether there might be something there.
  • To the extent we expect different humans to converge in their values given some kind of idealized reflection, we should be very interested in what it would take for AIs to fall within a similar basin – and I don’t actually think it’s clear that current AI personas are all that far from the relevant human distribution in this respect. Indeed, there is a case to be made that current AI personas are or could be relatively easily made to be notably closer to certain ideals of human virtue and reflectiveness than many humans are.
  • There are a number of relatively clear cases in which value fragility concerns don’t apply – for example, downside-focused views that mostly want to avoid the presence of certain specific states of affairs (e.g., suffering), and “nice” value systems that care intrinsically at least somewhat about how things go by the lights of other value systems that they might otherwise be “fragile” with respect to. And while I’m sympathetic to concerns that “niceness” of this form is also a fairly narrow target, that actually-good types of niceness themselves need to be something like “exactly right” if they’re going to be optimized hard, and that human forms of “niceness” likely emerged via contingent processes that won’t apply to AIs by default, I think there are some interesting questions about whether the niceness humans display might be sufficiently non-contingent as to provide some comfort (see e.g. the debate here).
  • The broader discourse about value fragility has been influenced specifically by models of rational agents maximizing utility functions (and typically: utility functions understood in impartial and consequentialist terms), and for this reason, I worry that it will end up smuggling in confusions related to things like: an overly reified conception of something like an agent’s “coherent extrapolated volition” or “values-on-reflection” (in particular, a conception that assumes that objects like these exist, are suitably determinate in their content, and provide the governing standards to which all smart agents attempt to conform); wrongful assumptions that smart and effective agents must be well-understood as maximizing a coherent utility function over universe histories that remains consistent over time, rather than satisfying more minimal criteria (e.g., consistent ranking of options at a given time); and some more general over-anchoring on rational-agent-like models.

 

  1. ^

     In general in this essay, when I talk about “human-like motivations,” I’m not talking about motivations that are similar to the motivations humans actually have; rather, I’m talking about motivations structured using human-like concepts. That is: in a sense, a genuine paperclip maximizer would have “human-like motivations” if it was actually maximizing for paperclips as a human would understand them, even though this goal isn’t one that any real-world human would have.

  2. ^

     See, for example, some of the results described here.

  3. ^

     Of course, there’s some question here of what counts as “in distribution” vs. “out of distribution.” But to the extent that real-world deployment is generally an expansion of an AI’s training distribution (because e.g. it’s too hard to capture the diversity of real-world inputs during training), I think the reliability of the good behavior we’ve seen in deployment is evidence of decent robustness/generalization for current alignment techniques.

  4. ^

     This follows from basic Bayesianism – though the degree of update in each case is a further question.

  5. ^

     And note, too, that the relevant kind of “human-like alternative” isn’t necessarily clearly defined; and that different humans might themselves diverge in their interpretations of concepts like “helpfulness,” “harmlessness,” “honesty,” and so forth.

  6. ^

     As an example use of this terminology, see Yudkowsky (2022): “There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult. The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it. The second course is to build corrigible AGI which doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.” See also his discussion of how Yudkowsky’s vision of a competent civilization (“dath Ilan”) would approach alignment here.

  7. ^

     As evidence for the continuity between the two arguments, though, see e.g. Yudkowsky’s recent comment quoted here: “The book does not go very hard on the old Fragility of Value thesis from the Overcoming Bias days, because the current technology is bad enough that we're not likely to get that kind of close miss.  The problem is more like, 'you get some terms of the utility function sorta right on the training distribution but their max outside the training distribution is way different from where you hoped it would generalize' than 'the AI cares about love, life, happiness, fun, consciousness, novelty, and honor, but not music and freedom'.”

  8. ^

     In particular: the degree of good generalization at stake in “act well on my behalf for all time and through all possible future changes to your psychology as you grow/self-improve etc” is indeed intimidatingly large. And I expect it to implicate the limiting version of the sorts of philosophical challenges I’ll discuss in the next essay, and to come with corresponding unclarities as to what the standard of trust at stake even is.

  9. ^

     Here I’ve heard the concern that because of offense-defense asymmetries, any end state that doesn’t involve either a benevolent sovereign AI dictator or an enforced regime of restriction on AI development will be vulnerable to some other actor attempting to build a sovereign AI dictator aligned with their values instead – thereby either leading to misaligned AI takeover or to a non-benevolent (because aligned to a non-benevolent human) sovereign AI. I think the implicit picture of offense-defense asymmetry here, though, hasn’t been adequately defended; and even if it’s right, then I think finding the appropriate way to handle this problem is something that we can ask corrigible superintelligences to help us with (and the right of minimal enforced restriction on AI development is indeed one option there – albeit, of course, one with its own serious downside risks).

  10. ^

     You could argue that, by definition, a perfectly benevolent dictator AI is the ideal victory condition, because the benevolence in question will extend to address any other considerations that might’ve made it unideal. But I think there is a powerful contrary intuition which hesitates about installing any kind of AI dictator of this kind, however “perfect” its motivations.

  11. ^

     You could argue that the AI in question won’t need to take-over, because continuing to follow human instructions will be comparably likely to lead to the best outcomes by its lights. But if it’s actually in a position to easily take over the world via a wide variety of methods compatible with its deontological inhibitions etc, then it seems quite unlikely that continuing to leave control in the hands of faulty, imperfectly-reflective humans engaged in a bunch of inter-human conflict would in fact lead to comparable probabilities of the best outcomes by the AI’s lights.

  12. ^

     More specifically: I think the book argues some kind of consequentialism in the context of the chapter on AIs wanting things (indeed, it looks to me like “wanting things” and “consequentialism” are close to equivalent concepts for Yudkowsky and Soares – i.e. both are about “steering reality” tenaciously), but it doesn’t say enough about the time horizons at stake.

  13. ^

     And which can also take forms that IABIED doesn’t cover.

  14. ^

     See also longer discussion in the online resources here.

  15. ^

     Here I expect Yudkowsky and Soares to be interested, in particular, in the evidence provided by the fact that humans created via natural selection ended up with a complex variety of drives, some of which are reasonably long-term and consequentialist. But now we are going beyond natural selection as a flawed analogy for AI training, and are instead trying to learn much more specific lessons from it about what sort of motivations to expect in AIs shaped via a very different selection process (and in particular, a selection process that can be much more intense and intelligent, and which can involve more direct and intentional “red-teaming” for undesirable motivational patterns).

  16. ^

     It’s true that some people care a lot about the entire trajectory of the future – but they need not set their AIs directly on the task of optimizing for their values in this respect. And while it’s true that we do want our AIs to do tasks in ways that care not to harm the future (e.g., we wouldn’t view it as acceptable for an AI to e.g. pursue some solution a problem that will predictably destroy our civilization in a million years), I think it’s possible that we could either capture this in a manner that looks more like deontology than consequentialism, and/or just bite the bullet in certain cases (e.g., if you’re trying to build a myopic agent to help you with monitoring, just bite the bullet on it not specifically optimizing for avoiding long-term harm).

  17. ^

     Though I think that shorter-term consequentialist considerations are especially suited to just outweighing pursuit of long-term consequentialist goals (e.g. if your AI wants to make paperclips in the longer-term, but also wants staples in the next five minutes, it might just end up focusing on the staples), rather than honing that pursuit so that it specifically avoids problematic forms of power-seeking. And note that in principle, the right sort of long-term consequentialist motivation (for example, a motivation to keep humans empowered in the long-term) might be able to count against certain kinds of bad power-seeking as well. But I think this is a dicey game to play.

  18. ^

     Here I’m assuming, for simplicity, an absolute prohibition; and I’m not claiming this is a plausible view in human ethics.

  19. ^

     Certain motivations that focus their evaluations on the properties of actions rather than outcomes  in can also be consequentialist – e.g. “choose the action that maximizes the number of paperclips.” But I’m especially interested in the non-consequentialist motivations that don’t focus on consequentialist properties in this sense.

  20. ^

     Though: the specific dynamics at stake in this sort of choice can get complicated. For example: insofar as the agent is supposed to reject all actions that involve lying, it needs to be thinking of the honest action that leads it to lie five times later as not involving lying.

  21. ^

     In the context of human ethics, this potential disconnect from caring about the children intrinsically is one objection to virtue ethics. But in the context of AI agents, it might be actively desirable.

  22. ^

     See e.g. here, here, and here – and see here for my own examination of some of the theorems the arguments appeal to.

  23. ^

     See here for some more discussion. One concern here is that any given pattern of behavior can be made compatible with some complicated utility function. Of course, this is going to come at the cost of being able to predict future behavior from the same system, but it’s not clear what role predictions of this kind play in the ontology at stake in coherence theorems on their own – that is, one suspects that one needs, in addition, some theory of the way in which the notion of preferences is supposed to hook up to our predictions. But in that case, it becomes unclear why a given system needs to have preferences or to be predictable in this sense.

  24. ^

     See e.g. here.

  25. ^

     Provided, that is, they also care somewhat about the exactly-right long-term consequentialist values.

  26. ^

     See also Soares on “deep deceptiveness.”

  27. ^

     There’s a case to be made that if you can do this, you can also probably point the AI directly at concepts like “goodness” or “our coherent extrapolated volition” instead. But per my comments against focusing on sovereign AIs above, I think that asking an AI to optimize sovereign-style for a concept like “goodness” or “CEV” might implicate significantly more demanding standards of “exactly right” than asking an AI to behave corrigibly.

  28. ^

     Or at least, OOD generalization that is related to alignment.

  29. ^

     The structure of this broad point extends to any attempt to contrast the safety properties of “human-like” motivations/concepts with “alien” motivations/concepts. That is: humans differ in how they apply concepts that nevertheless maintain suitable overlap as to play suitable roles in communication and prediction. In this sense, whatever “safety” human-like-ness on its own affords, it tolerates at least certain kinds of variance and error. And it’s a quantitative question whether this tolerance covers the degree of variance/error at stake in AI concepts/motivations as well. But this point also applies to sovereign AIs; whereas I’m here especially interested in corrigible AIs.

  30. ^

     If you were relying on P being exactly right to restrain some other kind of consequentialist optimization, as you plausibly often are in the context of e.g. more deontology-like constraints, then problems of this form are indeed bad. But if P is something more holistic, like “instruction-following,” and if the AI’s consequentialist maximization is supposed to follow from its being motivated by property P, then degradation of property P might degrade its consequentialist maximization as well. (Of course, this would reduce its usefulness. But “the AI becomes less useful off distribution” is the sort of failure mode we can handle more easily than “the AI goes catastrophically rogue off distribution”.)

  31. ^

     I don’t buy stories about this that appeal to some missing core of general intelligence.

  32. ^

     Indeed, Yudkowsky and Soares make heavy use of analogies with the sort of bad generalization at stake in current human behavior (e.g., using condoms) relative to the behavior natural selection was “trying” to select for (e.g., maximizing something like reproductive fitness) – bad generalization that does not involve any active scheming in particular. That is: Yudkowsky and Soares seem to expect a “training distribution” that is analogous to the ancestral environment, in which humans have a decent number of kids because they learn “alien motivations” pointed at proxies like sex and calories; and then they treat the safe-to-dangerous leap as akin to the expansion of options and technological capacity at stake in modern day human civilization - expansion that, on their story, that reveals the misalignment between humans and their creator.

  33. ^

     As I discussed in “Giving AIs safe motivations,” I do think that safe-to-dangerous leaps that correspond to this sort of capability increase are especially scary; and I think we should be correspondingly interested in avoiding them. That is: to the extent possible, I think we should be trying hard to ensure that capabilities increases occur in contexts that don’t make any viable options for catastrophically rogue behavior available either beforehand or afterwards, and that to the extent we need to expose an AI to viable options of this kind (e.g., in deployment), we do so after first studying/testing it (and potentially also: engaging in further forms of alignment-relevant training that don’t involve capabilities increases) in a safe environment that holds fixed its current level of capability. And if we can’t do this, we should try to ensure, at least, that the capability increase that introduces viable rogue options is as small as possible.

  34. ^

     I expect the force of this point to depend sensitively on the specific sorts of training we end up using to get e.g. superhuman/long-horizon/hard-to-evaluate performance on various tasks; and I think its force looks generally weaker in a context where we are succeeding at using scalable oversight to train for this kind of task performance. Indeed, in general, I think that if we can use scalable oversight to evaluate superhuman task performance, then we might well be able to avoid many salient forms of “distributional shift” in general – that is, we can just train our AIs online.

    My sense is also that Y&S often use “you can’t train on the tasks you care about” as an argument for “you need your AI to have especially general forms of optimization power, and hence to be especially scary,” in addition to argument that the distributional leap in question is harder for your alignment efforts to handle.

  35. ^

     To me, this looks more like an argument for “novel technology will reveal ways in which an AI’s favorite world is different from your favorite world,” rather than an argument for expecting corrigibility failures.

  36. ^

     Of course: the AIs could start scheming as a result of this transition, but at that point we’re back to step 2.

  37. ^

     Quoting from “An even deeper atheism”:“Can we give some sort of formal argument for expecting value fragility of this kind? The closest I’ve seen is the literature on “extremal Goodhart” – a specific variant of Goodhart’s law (Yudkowsky gives his description here). Imprecisely, I think the thought would be something like: even if the True Utility Function is similar enough to the Slightly-Wrong Utility Function to be correlated within a restricted search space, extreme optimization searches much harder over a much larger space – and within that much larger space, the correlation between the True Utility and the Slightly-Wrong Utility breaks down, such that getting maximal Slightly-Wrong Utility is no update about the True Utility. Rather, conditional on maximal Slightly-Wrong Utility, you should expect the mean True Utility for a random point in the space. And if you’re bored, in expectation, by a random point in the space (as Yudkowsky is, for example, by a random arrangement of matter and energy in the lightcone), then you’ll be disappointed by the results of extreme but Slightly-Wrong optimization.”

  38. ^

      See e.g. MacAskill (2025)’s “Better Futures” series for an examination of some similar concerns in the context of human moral mistakes.



Discuss

Teleosemantics & Swampman

12 ноября, 2025 - 08:27
Published on November 12, 2025 5:27 AM GMT

Epistemic status: trying to articulate my own thoughts. I have not done a thorough literature review.

James Diacoumis commented on yesterday's post:

the typical objection to such versions of teleosemantics are swamp-man counterexamples: suppose a thermodynamic miracle occurred, with a perfectly formed human spontaneously assembling out of matter in a swamp. This person's thoughts cannot be ascribed semantics in a way that depends on evolution. My version of teleosemantics would be comfortable ascribing meaning to such a person's thoughts, because those thoughts would still be well-understood as being optimized for map-territory correspondence, much like a chess grandmaster's moves are well-explained by the desire to win.

Swampman’s thoughts haven’t been optimised for map-territory correspondence because Swampman hasn’t actually undergone the optimisation process themselves. 

If the point is that it’s useful to describe Swampman’s thoughts using the Intentional Stance as if they’ve been optimised for map-territory correspondence then this is fair but you’ve drifted away from teleosemantics because the content is no longer constituted by the historical optimisation process that the system has undergone. 

To recap some context:

Teleosemantics is a theory of mind which attributes meaning (semantics) to mental states based on biological purpose. My personal interest in teleosemantics de-emphasizes the biological aspect and focuses on the claim that the meaning of something is what it has been optimized to reflect.[1] Empirically, we can "ground out" all the purposes we see around us into biological evolution. An LLM, for example, was optimized by a gradient-descent algorithm; the gradient-descent algorithm was optimized by a human; the human was optimized by evolution. However, I don't see a need to stipulate this in the theory of semantics.

The Swampman example is a potential counterexample to teleosemantic theory. The example postulates that a man spontaneously forms in a swamp (a thermodynamic miracle due to quantum fluctuation or something along those lines -- vastly improbable but not strictly impossible, quite similar to a Boltzmann Brain). Since the Swampman's purposes cannot be grounded out in natural selection, this poses a potential challenge to versions of teleosemantics that consider this important: if you think Swampman's thoughts are as meaningful as an ordinary man's, this appears to be a counterexample to teleosemantics.

However, I went further in my statement yesterday. To quote it again:

My version of teleosemantics would be comfortable ascribing meaning to such a person's thoughts, because those thoughts would still be well-understood as being optimized for map-territory correspondence, much like a chess grandmaster's moves are well-explained by the desire to win.

James Diacoumis is pointing out a potential problem with the way I'm interpreting "optimization" in the teleosemantic "optimized to reflect". The phrase "have been optimized" can have a connotation of "have been optimized by ___" (we can attribute the optimization to some optimization process, such as a person, or natural selection). In contrast, the sense I had in mind when writing was one which doesn't require an optimizer to be identified: rather, the optimization is identified as a predictively good hypothesis (we can predict many features of Swampman by predicting that they'll score highly on ordinarily biological purposes).

Let's call "have been optimized by ___" style optimization "historically-optimized" (there's an optimizer in its history) and the alternative "structurally-optimized" (the structure of the thing reflects optimization).

I can see a reasonable case for both alternatives, to the point where I currently see it as a take-your-pick situation.

If an accurate textbook is coincidentally written in space-dust, does it mean anything?

The advantage of historical-optimization, as I see it, is that it won't count coincidental configurations of dust, gasses, etc as "meaningful". Atoms of gas go through many many configurations as they rapidly bounce around. Moreover, there's a whole lot of gasses spread through the universe. Somewhere, at some point in time, some configurations of these will yield meaningful-looking configurations with high map-territory fit, such as "1+1=2". Historical-optimization allows you to rule these cases out, while structural-optimization does not.

You might think that a coincidental "1+1=2" won't "look optimized" in the structural-optimization sense because the hypothesis isn't predictive: on the whole, the usual hypothesis about gasses being randomly configured will do well, and a hypothesis which expects meaningful messages will do poorly.

I think such a defense of structural-optimization probably won't work, because we need to  be able to look at things individually.[2] The purposeful hypothesis won't do well at predicting gas-in-general, but it'll do great for this isolated patch of gas at the specific moment in time where it forms the symbols "1+1=2".

Obviously, this is very similar to the Swampman. Swampman gets their improbability by fiat, as a philosophical thought-experiment. The "1+1=2" written in gas particles gets its improbability from searching through the vastness of space and time. However, they're both just low-probability events which create structural-optimization without historical-optimization. If you want such cases to count as having semantics, it seems natural to choose structural-optimization. If you want such cases to lack semantics, it seems natural to choose historical-optimization.

If you look at it that way, it seems like not so much a dilemma as a choice. (James Diacoumis insists that choosing structural-optimization discards teleosemantics, which I think is probably historically accurate; however, it doesn't discard the insight I care about.)

Historical-Optimization

If I were to choose historical-optimization, I would be mostly OK with accepting that Swampman doesn't have thoughts which refer by this definition, because the thought experiment depends on a thermodynamic miracle and this doesn't seem important for AI safety. This is similar to several responses to Swampman that appear in the literature: in The Status of Teleosemantics, or How to Stop Worrying about Swampman David Papineau argues that teleosemantics is a naturalistic theory, similar to "water = H2O", and only has to account for empirical reality, not every hypothetical we can come up with (comparing the Swampman argument to hypothesizing a scenario where a different chemical behaves entirely like water -- the mere hypothetical doesn't start to be an argument against "water = H2O").

If one insists on historical-optimization, but wishes for a theory maintaining that the Swampman's thoughts can in fact refer, then I am tempted to suggest that the historical-optimization comes from outside thought experiment itself; IE, the swamp-man's thoughts are optimized by the thought-experimenter. It isn't entirely clear to me that this suggestion doesn't work. It is similar to suggesting, in the case of the "1+1=2" spelled in gasses, that the optimization comes from the search across space and time for the special occurrence.

More seriously, though: maybe the Swampman's thoughts are initially non-semantic when they first spontaneously appear, but they quickly get optimized (by the brain) for better map-territory correspondence as they look around, think about their situation, etc. In other words, the brain itself is the optimizer which optimizes the thoughts. This is unlike the "1+1=2" example, which falls apart immediately. This route is available to me, though not to the more typical teleosemantic theories which ground out in evolution alone, even if I restrict myself to historical-optimization.

Inherited Purposes

It seems that those interested in teleosemantics are often quite concerned with a distinction between original intentionality vs derived intentionality. For example, A Mark of the Mental: In Defense of Informational Teleosemantics discusses this distinction in chapter 1.

For example, a hammer has derived intentionality: it was made for the purpose of hitting nails by thinking beings who gave it that purpose as a means to their ends.

It does seem natural for teleosemantic theories to grapple with this question, particularly if they choose historical-optimization. After all, teleosemantics isn't typically just about semantics; someone inclined towards teleosemantics will typically choose a similar theory for all purposes (grounding purpose in having-been-optimized-for). Yet, this would appear to make all intentionality derived intentionality. To avoid an infinite regress, we would appear to need to come up with some source of original intentionality.

Based on my limited understanding, it would seem natural for teleosemantic theories to ascribe original intentionality to natural selection (and only natural selection). However, this does not seem to be the case. A Mark of the Mental ascribes original intentionality to humans as well. This seems a bit baffling, for a theory which grounds out purposes in natural selection. Clearly I need to read more about these theories. (I've only read a little bit of A Mark of the Mental.)

My own feeling is that this original/derived idea is somewhat chasing a ghost, like needing there to be something special which elevates the goals of a human above the purposes of a hammer. Granted, there are differences between how humans have goals and how a hammer has a purpose. Original vs derived just seems to me like the wrong frame for that.

Structural-Optimization

The benefit of choosing structural-optimization, so it seems to me, is that one does not need to worry about any infinite regress. Some optimization-like processes can arise by chance, and some of those can select for yet-more-optimization-like processes, and purpose can arise by increments. We don't especially need to worry about original purpose vs derived purpose; it needn't play a significant role in our theory. Those who prefer to affirm that Swampman refers don't need to think about any of my arguments which tried to defend historical-optimization; structural-optimization naturally identifies Swampman as optimizing map-territory correspondence.

The cost is affirming that "1+1=2" means something even if it is written by accident via the random movements of particles.

Conclusion

The truth is, while I do think the above arguments make some good points, I think the discussion here was too loose to draw any firm conclusions. There are far more notions of "optimization" to consider than just two, and it seems plausible that some of my conclusions would be reversed upon more careful consideration. Furthermore, I haven't surveyed the existing literature on teleosemantics or Swampman-arguments adequately.

  1. ^

    What about a lie? If I lie to you and say "the sun is green today" then I haven't optimized this statement for map-territory correspondence. Specifically, I haven't optimized my statement to correspond to a territory in which the sun is green today. Nonetheless, this is clearly the meaning of my statement. Is this a counterexample to teleosemantic theories?

    On my current account, we can find the semantics by looking at the society-wide optimization process which attempts to keep words meaningful. This includes norms against lying, as well as norms about the mis-use of words. If I say "wire" to mean hose, someone will correct me. Someone could correctly say "Abram means hose" (there is an optimization process which is optimizing "wire" to correspond to hoses, namely myself), but it would also be correct to say "'wire' does not mean hose" (there is another, more powerful optimization process working against the map-territory correspondence I'm trying to set up).

    A different way of trying to deal with lies would be to look at what something is optimized to communicate. If I were trying to convince someone of "the sun is green today" you might try to ground out my meaning in what I'm trying to do to that person's mind. However, if we take this as our general theory of semantics, then notice that we have to analyze the meaning of the thoughts in the listener's mind in the same way: thoughts have meaning based on what they try to communicate (to the future-self, or perhaps to others). Meaning would be founded in this infinite chain into the future. I'm not inherently against circular definitions (we can view this as a constraint on meaning), but this doesn't seem like a great situation. (The constraints might end up not constraining very much.)

  2. ^

    I'm actually not too sure about this. Maybe we need to consider our understanding of an object in context. This seems like it might be resolved at the level of "what are we trying to make the theory for" rather than something we can resolve in the abstract.



Discuss

How I Learned That I Don't Feel Companionate Love

12 ноября, 2025 - 07:18
Published on November 12, 2025 4:18 AM GMT

A few months ago, I learned that I probably can’t feel the emotions signalled by oxytocin, the "love hormone". This raises lots of interesting questions - what things I do and don’t feel, how the world looks different from an oxytocin-less perspective, how a lack of oxytocin changes one’s values or goals, etc. But it would be putting the cart before the horse to dive into those questions without first walking through how I learned this about myself and the evidence, so that everybody has an appropriate level of confidence in the underlying assumption.

The Evidence Which Privileged The Hypothesis

It started with investigating a confusion. Lots of the supposedly-happy relationships around me looked pretty awful to my eye, so why the heck were people (apparently) happy with them? What on earth was making these relationships net positive, let alone good?

I wrote a few LessWrong pieces on that confusion, and eventually Caleb Biddulph responded with a hypothesis: perhaps I don’t actually feel much of the thing most commonly called “(companionate) love”, and have therefore been confusing it with something else which I do feel. Caleb also spelled out the physical sensations he experiences with love, and sure enough… his description was not at all familiar to me.

I asked a few other people to describe what companionate love feels like, physically. Sure enough, it did not sound like anything I ever remember feeling.

I had previously asked some people with relationships which seemed bad to me what the major sources of value were from their relationships. Various pointers to “intimacy” topped the list. If that whole intimacy thing was a feeling which I couldn’t experience… as opposed to a cluster of practical benefits, as I’d previously conceptualized it… that sure would explain why these supposedly-happy relationships around me looked pretty awful to my eye!

My ex (of a 10 year relationship) had explicitly hypothesized from time to time that I just didn’t feel love like normal people do. I don’t think either of us had taken that hypothesis completely seriously, but…

Looking back on my childhood, it was clear for a long time that I didn’t form bonds the usual way. I didn’t react the usual way to the deaths of pets. I was eager to get away from my parents at a younger age than normal.

It just made a whole lot of sense.

And physiologically, the obvious guess for what would cause a lack of companionate love was a problem with oxytocin signalling.

Background: Oxytocin

This section is my current gestalt understanding of oxytocin. Take it with a grain of salt.

Oxytocin is often called “the love hormone”. Specifically companionate love - there’s a different hormone (vasopressin) associated e.g. having a crush, new relationship energy, limerance, etc (all of which I do feel). Early work on oxytocin found it released in mothers during breastfeeding, triggering and reinforcing the mother-child emotional bond. Over the years, it’s been associated with lots of other flavors of companionate love.

My current best guess is that oxytocin is the main hormonal signal underlying anything people describe as a feeling of “deep connection”. This includes standard examples of companionate love, like e.g. the love one feels for family. But (I claim) it also includes things like:

  • Weaker versions of “deep connection”, like the feelings induced by intentional relating exercises.
  • The feeling of connection one gets when deeply empathizing with another.
  • Feelings of deep connection to one’s community, God, or the universe.
The Genetic Evidence

The natural next step was to get my whole genome sequenced, and check my oxytocin and oxytocin receptor genes. I checked the receptor first - it’s a much bigger gene, so a more likely place for a breaking mutation to appear.

Sure enough, there was a single base pair deletion 42 amino acids in to the open reading frame (ORF) of the 389 amino acid protein. That induces a frameshift error, totally messing up the entire rest of the protein. And I did do some basic sanity checks - the sequencing had plenty of depth (i.e. it probably wasn’t noise), and other genes I spot checked did not have any frameshift-inducing mutation.

But that’s not yet a full story. Humans have two copies of each gene, and only one copy had that particular frameshift error. The frameshift error would mess up the protein in a way which triggers nonsense-mediated decay, so the mutated sequence shouldn’t be transcribed much. So by itself, the frameshift mutation should just leave me with 2x less oxytocin receptors than usual (which is usually not a huge deal for signal function), or maybe even closer to normal if there’s any feedback control on receptor density. Upshot of all that: in order for oxytocin signalling to be very broken, there would have to also be some function-breaking mutations on my other copy of the receptor gene.

And there were some other mutations (substitutions, nothing as obvious as a frameshift), a couple of which were predicted by alphafold to be pretty deleterious to the protein’s function…

… but unfortunately today’s standard sequencing technology doesn’t let me know which copy of a gene a mutation is on. We sequence by chopping DNA up into little chunks, sequencing the little chunks, then stitching it all together computationally. But since two copies of the same gene have mostly the same sequence, the stitching step can’t tell which copy a mutation is on, just that it’s on one of them.

The shortest way around this is to get ones’ parents’ genomes also sequenced. If one subset of my mutations appears in one parent, and another subset appears in my other parent… well then, I know that the one subset is on one copy, and the other subset on the other copy (with high probability).

So I got my parents’ genomes sequenced. One of them had the frameshift-inducing mutation, as expected. The other had a few substitutions which I share. Alas, that parent's substitutions… were also shared by my other parent[1]! Which means I can’t fully nail things down with the available information: I don’t know for sure whether the substitutions I have were on the same copy as the frameshift and I have another healthy copy, or if they were on the other copy from the other parent.

That’s my current state of knowledge.

Recap of the key pieces:

  • One copy of my oxytocin receptor gene is definitely completely broken.
  • There’s a couple substitutions which would likely break function, but I don’t know for sure which copy they’re on.
  • Even if the second copy does have the likely-function-breaking substitutions, I don’t know whether the result is a complete absence of oxytocin signalling or just very weak oxytocin signalling.

Combined with the evidence that made me privilege this whole hypothesis in the first place, I’m pretty confident that my oxytocin signalling is either very weak or entirely absent. But I am relying at least partially on the less-legible evidence which made me privilege the hypothesis; the genetic evidence alone is damn strong evidence in favor of the hypothesis but not fully conclusive on its own.

  1. ^

    At this point I wondered if I was using a dubious reference genome, and finding “substitutions” relative to a reference genome which was itself nonstandard. I asked an LLM and its answer was basically “no”, the reference genome is a consensus genome.



Discuss

Conceptual reasoning dataset v0.1 available (AI for AI safety/AI for philosophy)

12 ноября, 2025 - 04:12
Published on November 12, 2025 1:12 AM GMT

Tl;dr: We have a dataset for conceptual reasoning which you can request access for if you would like to use it for AI safety (or related) research. We consider the dataset half-baked and it will likely become much more useful over the next few months. At the same time, we think it's very high quality compared to typical AI datasets and currently the best available dataset of this kind, so want to make it available to mission-aligned projects now. We also have half-baked prompts to make models better at critiquing conceptual reasoning which you can request.

Our group consists of Caspar Oesterheld, Emery Cooper, and me. Ethan Perez is advising us on this project.

Motivation/context: We are working on eliciting conceptual reasoning capabilities in LLMs where conceptual reasoning refers to reasoning about questions or problems where we don't have (access to) ground truth and there is no (practically feasible) agreed-upon methodology for arriving at the correct conclusion. Philosophy is the prototypical example but forecasting of far future events and many AI safety questions also fall into this category. Our motivation for doing this is to shorten the time at which we can use AI assistants for conceptual safety-relevant research relative to AIs' general capabilities. As part of this project, we are building a conceptual reasoning dataset and developing prompts for eliciting their full conceptual reasoning abilities.

 

The dataset: The idea behind our dataset is that it’s easier to evaluate the quality of contextualised arguments than the bottom-line conclusion in conceptual domains.

  • The dataset consists of positions + critiques of those positions + human expert ratings of these critiques on 7 criteria.
    • Positions are just any statement or argument, ranging from one line to many-page essays.
    • Critiques try to refute the original position as fully as possible.
    • The 7 rating criteria are: Strength, centrality, correctness, clarity, dead weight, single issue, and overall.
  • We have over 1000 rated critiques of which we are willing to share ~500.
  • Ratings have received an extreme amount of effort to ensure quality
    • The vast majority of datapoints has been rated by Emery Cooper. To validate her ratings, Caspar Oesterheld, Alex Kastner, and Chi Nguyen have each crossrated >100 datapoints. Lukas Gloor and Lukas Finnveden have respectively rated >40 and >20 critiques, also for validation.
    • Rating a critique takes at least several minutes and sometimes >30 minutes.
    • Raters usually discussed when there were large rating disagreements. The dataset records pre- and post-discussion ratings with explanations. Some critiques also have several ratings by the same person at different points in time  (without any intervening discussion but occasionally based on large rating disagreements with our best LLM raters).
    • There is a set of 50 critiques which Emery, Caspar, Alex and Chi all rated; Lukas Gloor and Lukas Finnveden rated the first 20 and 40 of these, respectively.  All raters then met for ~8 hours across two meetings to discuss rating disagreements. Again, the dataset always records pre- and post-discussion ratings, so you can track how much discussion moved ratings.
  • Currently, only a small minority of the datapoints are in domains we especially care about, e.g., AI safety and decision theory although this fraction will be increasing over the coming months.

 

The prompts: We have done extensive prompt optimization to elicit models' ability to rate critiques accurately (i.e., similarly to the human raters). We have just started prompt engineering to elicit models' ability to write high-quality critiques (with our dataset and LLM judges being very helpful at speeding up this process).

Paper: You can find a more detailed preliminary paper draft about our dataset here. This paper also further details the limitations of the dataset in its current form.

Access: To request access, you first have to read our data sharing policy. Once you've done so, you can confirm this and request access in this form. If you or your organisation are quite well known in the AI safety community, your (organisation's) name is all we need from you in the form and you can stop reading here.

 

We will initially be conservative with granting access since we don't have the capacity to properly evaluate access requests and also haven't decided how we want to share the dataset in the long term. We will usually consider access requests only if:

  1. we know of your organisation/team,
  2. we know of you,
  3. we are in a position to evaluate your work very easily. For example, you have a very legibly publication history and credible signals of mission alignment, or someone we trust can vouch for you.

Unfortunately, we cannot currently commit to assessing requests if this would require substantial effort from our side (such as reading and judging a research proposal).  If you're unsure if you fit into a/b/c, feel free to just submit a bare-bones response and leave a note that you're happy to share more!



Discuss

Flirt like it’s predetermined

12 ноября, 2025 - 03:18
Published on November 12, 2025 12:18 AM GMT

My mindset shifted from “Flirting to convince others to like me.” to “Flirting to discover who loves my relaxed self.”

Once this happened “fumbling” and “success” both became meaningless.

When my girlfriend and I were first flirting, I was super into her… and completely lacking in anxiety.

“How would it feel to date her if she doesn’t like me?” Bad! If she wasn’t charmed by my relaxed self, I didn’t want her.

No “please like me.” Just: “Are we in the timeline where this works? Let’s find out asap!”



Discuss

“Wait, feelings are supposed to be IN THE BODY?”

12 ноября, 2025 - 03:01
Published on November 12, 2025 12:01 AM GMT

Mentor: “…and how do you feel in your body about that?”

Old me: “Wait, feelings are supposed to be IN THE BODY?”

For the first few months after this exchange, I thought, “Maybe I’m just different and don’t have feelings in my body. Maybe that’s just a weird thing that happens to other people but not to me.”

Nope.

Turns out I was numb. Sure, I’d get butterflies in my stomach or “know” emotions in my head, but I didn’t notice things like “a feeling of expansiveness in my chest”, “tingling in my fingers”, “tension in my arms”, or “pleasure on my skin”.

Ok, I was numb. So what?

Well, the nervous system is a distributed system, so information must propagate somehow.

“Feelings in the body” seem like a very common way to experience this:

Illustrated: Where I DIDN’T feel emotional sensations in my body lol. (Maps of subjective feelings)

This information flows freely, unless there’s resistance. When there’s resistance to feelings, updates fail to permeate the entire system.[1] And there was a lot of resistance in my system…

The resistance: My own numbness was locally optimal, helping mitigate pain, distraction, and other risks. Put another way, given the state of my life and nervous system at the time, feeling my feelings locally made my life worse.

Now, was numbness globally optimal? No. Life was in 360p when it could’ve been in 4k.

  • I’d brush my teeth too hard and only notice from the blood on the sink, not the pain.
  • Other people made decisions in seconds by checking their gut. I made a decision by agonizing my way to a heady answer that still felt bad. Decision-making spiraled because every option felt equally gray.
  • I thought I didn’t like animals! I missed the beauty around me—even though I found it incredibly cute when crushes would suddenly stop on a snowy street overwhelmed by what they were soaking in.
  • Everything I did had to be “useful”. All of my desires needed reasons.
  • I couldn’t tell the difference between “I’m feeling really jealous right now” and “Did I eat something bad?”
  • I couldn’t experience deep pleasure.
  • I had great trouble unlearning my insecurity.

Unfortunately, my numbness numbed itself. I went like this for many years until others pointed it out.

On the other hand, after increasing the stability of my life and unlearning my numbness, I journaled:

Wow.

There’s so much intricacy to the emotional ripples in my stomach alone. I found strange happinesses in the tip of my fingers (???). And self-loathing there, too! Love in the “cave of the heart” on the right side of my chest…

Soon I realized that feelings are better described as tuples (sensation, location) rather than emotion words:

  • Not ‘anxious’ but (tension, lower chest)
  • Not ‘happy’ but (pleasure, arm skin)
  • Not ‘confident’ but (expansiveness, chest and shoulders)

My growth continued to unfold: Decision-making became so easy it feels like there isn’t even a “me” making the choice. I’m more empathetic and see others’ emotions without hesitation. I can tell people to fuck off without wavering. Insecurities can be noticed and released. I’m much more intuitive. I see more. I hear more.

4k feeling enables 4k being.

  1. ^

    See also: Emotions like loss signals.



Discuss

Fairly Breaking Ties Without Fair Coins

12 ноября, 2025 - 00:48
Published on November 11, 2025 9:48 PM GMT

Fairly Breaking Ties Without Fair Coins

I was thinking about an approval-style voting system that could end with a large number of ties, and ran into the problem of how to break ties in a provably-fair way that won't make voters' eyes glaze over. I think I found a solution which is provably-fair, but it might still cross over into eye-glazing territory. I don't know if this is practical (or novel), but I'm writing it up in case anyone else finds it interesting.

Papers in a Hat

The most obvious solution is to write names on pieces of paper, throw them in a hat, then select names until you have enough winners.

Example: Alice, Bob and Carol tie and you want two winners. Write their names on paper, put them in a hat, then draw two names.

Unfortunately this is far from provably fair. Whose hat do you use? Who picks the numbers? If you shake the hat, who shakes it? If you ensure the paper is all identical, who provides the paper?[1]

Even if you can solve these problems, will everyone believe you?

It seems like the hard part is that we have one source of randomness (the person-hat-paper system) which needs to be fair.

Rock Paper Scissors

What if instead of one sources of randomness, we have each candidates provide their own?

Maybe instead of one person-with-a-hat randomizer, we could have each candidate be their own randomness source. We just need a way for candidates' randomness to be merged in a way that determines the winner.

Conveniently, we have just such an algorithm: Rock-Paper-Scissors.

Each candidate picks one of three options using their preferred random number, and then we apply known rules to determine the winner from any combination. Assuming random selection, every outcome is equally likely.

Example: Alice and Bob are tied. Alice selects rock, Bob selects paper. The fixed and pre-agreed rules of Rock-Paper-Scissors determine that Bob is the winner.

Standard Rock-Paper-Scissors has the undesirable property of having an element of skill, but we could fix that by letting candidates secretly write their choice on a piece of paper and use any randomness source they want (like a fair die[2]). Since the winner depends on both choices, if either candidate picks their choice randomly, then the result will be random.

So, problem solved. This will work and is provably fair as long as either candidate wants it to be. Unfortunately, it has two properties that make it slow:

  • It's possible for the game to end in a tie, requiring it to be repeated (potentially an unbounded number of times).
  • It's complicated and time consuming to scale this to a multi-way tie while maintaining an equal chance of victory for all candidates.

So let's start by fixing the potential tie problem.

Binary Rock-Paper-Scissors

In our previous example, candidates are selecting between three options and then using a complicated system to decide the winner. Can we instead have the candidates work together to select a winner directly?

Let's arbitrarily[3] assign each candidate either "same" or "different". Then both candidates provide their own coin and flip it. If both coins show the same side (heads-heads), the candidate assigned "same" wins. If they show different sides (heads-tails, tails-heads) the candidate assigned "different" wins.

Example: Alice and Bob are tied. Alice selects "same" and Bob selects "different". Two coins are flipped, coming up "heads" and "heads". Since both coins landed on the same side, Alice wins.

Since each outcome has two ways of occurring, both candidates have the same chances, and because the result depends on both coins, only one of the coins needs to be fair to ensure that the result is fair[4].

This is better, and still seems easy enough to explain, but we’re still stuck running a complicated tournament for multi-way ties. Is there some way that we can instead make the candidates work together to fairly “pick names out of a hat”?

Picking Numbers Out of a Hash

Ok, so we finally need to do some math, and I can’t figure out how to intuitively hide it behind coins and dice.

In our last example, the nerds in the audience might have noticed that we used coins to create a one-bit cryptographic hash function.

A hash function is a math function that takes some number of inputs and gives us a shorter output. The coins also create a very convenient hash function where the output always depends on every input (i.e. you can't determine the winner from only one coin, you need to know the results for both).

The coin algorithm is equivalent to a hash function that that takes two binary numbers (0 or 1) and returns another binary number. Say heads is 1 and tails is 0. Flip two coins, add the numbers up. If the result is even (0 or 2) then player 0 wins and if the result is odd (1) then player 1 wins.

Example: Alice and Bob are tied. We assign numbers Alice (0) and Bob (1). Two coins are flipped, coming up "heads" (1) and "heads" (1). 1 + 1 = 2. 2 means candidate 0 (Alice) wins.

This unfortunately sounds more complicated than “same” or “different”, but the advantage is that now we can extend the numbers.

Say we have three candidates. Arbitrarily[3] assign them the numbers 0, 1, and 2. Have each candidate write a number from 0 to 2. Simultaneously reveal the numbers and then add them up. If the number is larger than 2, subtract 2 until it’s in the range 0-2. The winner is whoever’s number comes out.

Example: Alice, Bob and Carol are tied. We arbitrarily number them Alice (0), Bob (1), Carol (2). The candidates write 1, 2, 1 on their papers. The sum is 4. This is greater than 2, so we subtract 2 getting 2. The winner is candidate 2 (Carol).

Want to select 2 winners out of three? Have each candidate write two numbers: A number between 0-2 and another number between 0-1. Use the first number to pick the first winner. Renumber the remaining candidates by shifting down (if necessary) to get them into the range 0-1, then use the second number to select the second winner.

Example: Alice, Bob and Carol are tied. We arbitrarily number them Alice (0), Bob (1), Carol (2). The candidates write [0, 1], [1, 0], [0, 0] on their papers. The sums are [1, 1]. Bob (1) wins, then we renumber to Alice (0), Carol (1) and Carol (1) wins.

You can extend this to any number of candidates by writing larger numbers, and you can extend to any number of winners by writing additional numbers.

All of the numbers should be made public and anyone can verify the math (which only requires basic addition and subtraction).

Conclusion

That’s the algorithm I came up with. I’m reasonably happy with the result, although it’s a little math-y and not ideal that to select 50 candidates from a 100-way tie, you need every candidate to write and add-up 50 numbers.

If we want to result to be faster but less deterministic, there are tricks to remove approximately half of the candidates every time[5], but you could run into a lot of repeats if the algorithm removes too many candidates.

Is this a good solution? I'm particularly curious if anyone has an idea that's as intuitive as the coin flipping version. Also, is this actually useful in any situation besides my oddly-specific approval voting daydreams?

  1. ^

    You could also use a robot to select the papers, or number people and use a random number generator, but this is actually much worse since it's even harder to prove that the digital "hat" hasn't been tampered with.

  2. ^

    Roll a D6. If you get 1-2, pick "rock", 3-4 "paper", 5-6 "scissors".

  3. ^

    When I say to arbitrarily assign a candidate some condition, random selection is one way to do it. The important thing here is that as long as you do the assignment before running the rest of the algorithm, it doesn't matter how you make the choice and it doesn't need to be "fair".

  4. ^

    To prove this, consider if one candidate uses a double-heads coin. If the other candidate flips heads, the result is "same" and if they flip tails, the result is "different". So, even a perfectly unfair coin gives you a 50% chance of winning.

  5. ^

    Each candidate picks "odd" or "even", then they all flip a coin. Count the number of "heads". If the result is odd, the "odd" group wins, otherwise the "even" group wins.



Discuss

Not-A-Book Review: The Attractive Man (Dating Coach Service)

11 ноября, 2025 - 23:03
Published on November 11, 2025 8:03 PM GMT

As far as I am aware, I have at the above link written the only acceptably-detailed, non-woo product breakdown that exists for any service whatsoever in the ~2 billion dollars/year industry of post-PUA male dating/relationship coaching.

Which is weird.  That's weird.

Anyway, this is my contribution to the near-total information void in this space.



Discuss

Learnings from the Zurich AI Safety Day

11 ноября, 2025 - 20:00
Published on November 11, 2025 5:00 PM GMT

TL;DR: The Zurich AI Safety Day - a one-day conference in Zurich of more than 200 participants - was a new event format to bring together the AI safety field across Europe and tap into the local talent pool. The feedback received was very positive, and the event could serve as a foundation to learn from for future cause-area specific conferences. This post summarizes the rough numbers, goals, feedback received, and learnings from the event.

"I found out that AI safety is sexy now." - This was among the feedback we received for the Zurich AI Safety Day. But what was the Zurich AI Safety Day all about?

A Conference Dedicated to AI Safety

On Saturday, September 27, we organized the Zurich AI Safety Day. For 10 hours, more than 200 people interested in AI Safety from across Europe met to exchange ideas, find new collaborations, and learn about pathways to a safer development of AGI. The conference hosted 14 talks and workshops and an Org Fair with more than 20 organizations in the field, including UK AISI, Apollo Research, FAR.AI, and Palisade Research. It was organized in collaboration between BlueDot Impact and Zurich AI Safety and supported financially by Open Philanthropy.

There were three main goals for the event:

  1. Get more senior talent and motivated newcomers involved with the field
    • Conferences can get a lot more visibility on public platforms than small events or activities of associations. At the same time, Zurich has strong talent pipelines for technical AI safety work in the form of universities like ETH Zurich, and with many big tech companies present, including Google DeepMind. Onboarding these talents into the field and making them aware of the issues was one of the priorities for the event.
    • To catch talent that is working in the context of AI, a labelling of the conference as an AI safety conference seemed better than labelling it as an Effective Altruism conference.
  2. Strengthen the Swiss and European AI safety ecosystem by connecting various stakeholders around the topic
    • Connecting different actors within a high-impact field with each other has proven successful in inspiring new ideas and collaborations across EAG(x) conferences.
    • In contrast to having this exchange on an EAG(x) conference, a dedicated AI safety conference can help create a shared identity for the field and strengthen the connections to non-EA organizations working on related issues in the local ecosystem.
  3. Set up Zurich for further AI safety field building
    • This basically loops back to the first goal, but with the perspective of continuously getting more talent in Zurich involved with AI safety. The conference helped build a lot of momentum for this goal.

In this summary of the goals, I have outlined some of the advantages that a dedicated AI safety conference can have compared to EAG(x) conferences. I have discussed this in more detail in another post on cause-area specific conferences.

Feedback and Learnings

Before getting to the specific learnings that might be interesting for those who are trying to set up something similar, I want to share some numbers and some of the feedback that we received after the Zurich AI Safety Day.

How participants experienced the event

Overall, the feedback was very positive. Some quotes that stood out:

  • "I got inspired from various chats. It shaped a direction for me to take the next steps instead of staying where I am and feeling lost."
  • "It was very nice for me to learn from PhDs and talk to some possible supervisors for my PhD applications."

By the numbers:

  • Feedback rate: 34% (85 responses)
  • Average rating: 4.63/5
  • 74% of attendees rated attending the conference as significantly (≥3x-10x) or exceptionally (≥10x) better than alternative uses of their time.
  • Approximately 5 new meaningful connections per attendee.
  • 80% of attendees decided on concrete next steps as a result of attending the conference.

Some qualitative highlights of the event include:

  • Out of more than 230 events, the AI safety day won the "Best Event in Swiss AI Weeks" award from a local performance art group in Zurich. This is an example of positive chain effects from organizing bigger events. The Zurich AI Safety group gained increased visibility and a new local connection to the artists as a result.
  • Some of the representatives of the present AI safety organizations rated the value of attending this conference between attending an EAGx and an EAG conference, some even above attending an EAG conference.
  • Most valuable aspects for participants were 1-1 networking (67% of respondents selected this among the most meaningful parts of the conference), informal connections (58%), and the Org Fair (50%).
  • The Org Fair was also highlighted in individual feedback. The Org Fair took place in parallel to lunch in the big exhibition hall and hosted booths for all of the participating organizations. It catalyzed many new connections, according to participant feedback.
  • Career-transition guidance was very valuable for early-career and senior newcomers alike.
What we learned for future conferences

In organizing this conference, we started from a very motivated group of volunteers who were experienced with field building in AI safety, but had only limited experience in organizing conferences. Consequently, there was a lot to be learned for the team throughout the process of organizing the conference, and I want to share some specific learnings for this kind of conference here.

What worked well?

I have already shared some of the things that worked well, and that I would encourage people organizing similar events to include, like an Org Fair. An additional thing that worked well for us, when selecting out of more than 350 applications, was aiming for a mix of 60% of people with experience in the AI safety field and 40% of people new to the field.

Distribution of participants according to their self-reported background with AI safety and in research.

The above graph shows the actual distribution of what people reported as their experience level. The ones reported as professional researchers are effectively people new to AI safety. With 36.5% of newcomers, we approximately reached the target of having 40% of participants new to the field. The representatives of AI safety orgs reported a good balance in the experience of the people they talked to. In the same context, the conference was also joined by people of very diverse career stages, with only 25.5% of participants being part of the biggest career stage (early-career professionals) (see graph below for career stage categories).

Histogram of the self-reported career stage of participants 

Another thing to highlight was the speaker brunch that we hosted on Sunday morning, the day after the conference. With most representatives of the organizations still in town, we invited them to the brunch, which enabled exchange between the groups outside of the more hectic conference atmosphere.

A few more things that went well, and seem valuable to repeat in similar events:

  • We moderated the opening and closing sessions ourselves to set the tone of the day. People were generally excited about these sessions and were very involved in the short exercises that we included. Those were goal-setting and reflection exercises, with consequent discussion with the neighbours.
  • When getting started, it proved very useful to first approach some organizations that we already had some ties to and ask them if they wanted to join. Based on their accepted participation, it was getting easier and easier to get more organizations involved and gain initial momentum and visibility through that.
What could be improved?

In the aftermath of the event, we sat down and wrote down a list of things that could be improved in another iteration of such an event or that didn't work. A common pattern among many of these things was that we were rather late with some steps of the organization that require longer timelines. Given the perceived urgency in the context of rapid AI advancements, we were probably a little bit impatient. As a result, we only had one month between receiving the confirmation of funding and the event date. We had made preparations for tasks with longer time horizons before that, but, for example, could only commit to arranging all the details with the venue after receiving the grant. The venue was the local university, ETH Zurich, and only because we had a recognized student organization there were we able to set things up in advance of receiving funding, mostly for free. Since the same applied to other university groups, another event ended up happening in the same building, which led to some routing confusion on the event day and limited the capacity of the venue staff.

This brings me to a list of some of the other things that could be improved:

  • We were also short on time setting up Swapcard as an event app, which led to some of the networking features not being presented, resulting in more difficulties finding the right people to talk to. Things like that should be triple-checked.
  • We could better prepare for short-term changes in speaker availability
    • We printed the schedule a week in advance and then distributed these prints despite changes that happened in the meantime. This led to more confusion than benefits, and relying primarily on Swapcard seems like the better alternative.
    • We had varying slot lengths for submissions, which made it harder to move sessions to fill gaps. This was due to having different parallel tracks in the afternoon (technical, governance, careers). The tracks seemed very useful and provided a good structure, but could likely also work in an EAG(x) style patterning of slots of multiples of 30 minutes throughout the day.
  • We reached out to the press and politicians to facilitate some discussion around AI risks, but no one ended up participating on that end. This wasn't a core goal of the event anyway, but it is also not clear whether there are alternative approaches to outreach that could have worked better.
  • To increase flexibility, it seems useful to book backup rooms (particularly if you can book them for free). If not needed for sessions, they can be used for storage, rest, multi-faith, children, etc.
  • For first-time events, a significant financial buffer seems useful, possibly beyond 10%, up to 20%. We actually ended up having this buffer, but rather accidentally, since we didn't pay as much as expected for the venue.
Final thoughts

All things considered, the event seems like a success, and these learnings can be valuable for future iterations of AI safety days, either in Zurich or elsewhere. The whole thing started as an idea in a BlueDot Impact discussion group and ended up as a promising event that clearly exceeded the expectations in scale of what we ourselves thought possible to happen in Zurich. In my opinion, this shows that grassroots initiatives can work, and I encourage everyone to sometimes just try executing on new project ideas. I think the AI safety field is still much too small, and I am excited to see new initiatives come to life!



Discuss

Announcing the Society of Teen Scientists

11 ноября, 2025 - 19:08
Published on November 11, 2025 4:08 PM GMT

To grow as scientists, students need to conduct authentic research, yet they rarely have access to the infrastructure that supports such work and makes it possible. The Society of Teen Scientists (SoTS) was created to provide young scientists with the tools, resources, and opportunities they need to conduct and disseminate research that contributes to the advancement of human knowledge.

1. Tools for Research

  • AI Research Mentor – Step-by-step guidance through the entire scientific process, from forming research questions to preparing manuscripts for publication
  • Personalized Research Profile – An automatically generated portfolio showcasing your ongoing work through progress reports, literature reviews, data visualizations, and more. It serves as authentic proof-of-work—perfect for college applications. Profiles are private by default—members share them only when they choose

2. Share Your Work

  • Proceedings of the Society of Teen Scientists (coming soon!) – A peer-reviewed scientific journal featuring original research articles written and reviewed by teen scientists.
  • World of Teen Science (now accepting submissions!) – Our digital popular science magazine featuring news articles and essays from our members.
  • note to readers: you can subscribe now to the SoTS newsletter, a monthly digest of articles published in the Proceedings and World.

3. Connect with the Scientific Community

  • Q&A Webinars with researchers from academia and industry
  • Annual Meeting (April 10, 2026) – Present your research, connect with peers, and network with scientists at our inaugural meeting held physically on the campus of New York University and virtually.
  • Research Opportunities – Learn about internships, summer programs, conferences, and other opportunities in members-only communications.
Join now!

Visit teenscientists.org to learn more and join as a founding member.

Premier Founding Members (first 100):

  • $1/month for life
  • Beta access to AI Research Mentor and Research Profile
  • Lifetime discounted rate for AI Research Mentor
  • All membership benefits (see above)

Founding Members (members 101–1000):

  • $2/month for life
  • Beta access to Research Profile
  • Lifetime discounted rate for AI Research Mentor
  • All membership benefits

Questions? Read our FAQ or contact us at team@teenscientists.org.



Discuss

What is Happening in AI Governance?

11 ноября, 2025 - 18:59
Published on November 11, 2025 3:59 PM GMT

This post was written by @Thomas Vassil Brcic and is cross-posted from our Substack. Kindly read the description of this sequence to understand the context in which this was written.

Policies, Laws, Reports, Guidelines; Organizations, States, Municipalities, Companies; Individuals, Workers, Businesspeople, Researchers. That’s a lot of nouns. 

That’s a lot of nouns. And it does little to paint the picture of the emerging governance landscape of Artificial Intelligence (AI) in October 2025. Let not the tranquil implication of landscape fool you into imagining a painting of green, expansive fields, or some vast, beautifully imposing mountains towering far over a small pine forest. If AI governance were on a canvas, Da Vinci’s Last Supper would perhaps come closer to serving it justice.

The multitude of actors and mediums governing this ‘thing’ (I won’t say tool, as some see that as an understatement to its sociological enmeshment) needs not only clarification, but also reorganisation. This blog represents one quick attempt at both of these things; a look into the present representing an attempt to concisely draw the existing picture, and a glance into potential realities which may offer promise.

The existing picture

Governance exists on a wide spectrum of consequence. Whilst the most famous institution, the ‘government’, is the one whose name is most often correlated with the act of governance, it is very far from being the only one with consequence. Institutions are any structure of norms and rules that influence behaviour. This includes large, important organisations such as banks, the UN, and the EU. But more broadly, it includes customs and cultures, less tangible structures with less concrete enforcement mechanisms, that nevertheless greatly affect intra- and intersocial relationships. 

This post won’t focus on the law, for a good reason. Although the law is one of the most effective governance structures, backed up by democratically-empowered enforcement mechanisms (such as the military, or the police), it is slow and oftentime inadequate. These latter facets are partly related to its promulgation as a compromise between actors in society, though it’s more of a reflection of the disproportionate influence of some of them. In the face of something like AI, its weaknesses may prove too overwhelming for meaningful change.

Anyway, a matter for political scientists. 

In the matter of AI, where societal adoption is increasing, the law may prove too laggy a tool. Instead, I will focus on another historically powerful institution. 

Ethics. 

Let’s take the hypothetical case of a worker at a fictional corporate consulting firm, ProsewesternheightCrests (PwC), in the city of Groningen, the Netherlands. Upon their onboarding, they are given a host of documents to read. Half of these relate to their direct role in the Technical Auditing Department, and include an AI Code of Ethics, Computer Security Basics, Confidentiality Agreements, Work Product Assignment Agreements, and Guidelines for Using AI Tools. They are aware now that they can’t submit confidential client details or copyrighted material into chat-bots. About 45 minutes of reading later (brought down to 5, thanks to a handy ChatGPT summary of the key points), they are up to speed on the internals of their company’s policies. One week into their job, their manager forwards them an email from the Data and Technology Ethics Committee of the Gemeente Groningen, the local municipal government. It contains a message that they have developed a new Ethical Assessment Framework, not binding to non-municipal employees but “of crucial importance” to all workers implementing AI systems into their workflows in the Province of Groningen. It is part of a broader strategy to ‘Keep Groningen on the Frontier’, and includes a mixture of advice and ethical considerations that must be considered.

This worker begins believing they have grasped the picture of how they are allowed to use AI at work until they read a new document by the Digital Task Force / National Authority of the Netherlands. It contains a whole host of new requirements, with unclear sanctions owing to the very new nature. Notwithstanding all of this, the EU AI Office’s AI Act (Regulation (EU) 2024/1689) is quickly rolling out. Although the worker is aware that most of its contents aren't of effect to them, they are aware that its obligations will trigger a wave of systems auditing that their team will have to conduct on their clients. This is to ensure the AI systems that are classified as high-risk to the rights and freedoms of data subjects are properly handled. 

And yet still, in spite of the white-collar workers’ struggle to stay afloat in this barrage of overlapping, intertwining and convoluted messaging, there is more. The Organization for Economic Co-operation and Development (OECD) recently updated their AI Principles (2024), the first intergovernmental standard on AI. The European Commission’s High Level Expert Group on Artificial Intelligence have their own Ethics guidelines for trustworthy AI (2019). Many companies have also benevolently adopted their own; Google’s Responsible AI Progress Report is updated yearly and based on their own formulation of what is important, namely Bold innovation, Collaborative progress, together and … (I will not bother continuing, for everyone’s respect). 

For an average worker in Groningen, this is the state of AI governance. Its complexity isn’t a mere product of the technology’s inherent nature; it is instead a product of the multilateral and manifold actors, institutions, and stakeholders whose missions and motives are in opposition. And it is a complexity that characterises a loud, chaotic void - namely the absence of a mediating structure - wherein ethical principles do not translate into practice (Mittelstadt 2019). 

Better, future realities

To attempt to conjure up a solution to the disparate state of ethical realities across the globe is to ignore every conflict that has occurred in all of time. Instead, I’ll touch upon one attempt that approaches this in an altogether novel way. In their 2023 publication titled A multilevel framework for AI governance, Chuong et al. dissect the notion of ‘trust’ and operationalise it as a way of bringing together three of the most important stakeholders – governments, corporations, and citizens. 

Trust is “a confident relationship with the unknown” (Botsman 2017), and the authors extend this definition to encompassing “the cornerstone of all relationships”. Yet interpersonal trust is built on different pillars than trust between people and technologies, and between people and automation. Reconciling key studies in the field of psychology over multiple decades, the authors devised the following table of what encompassess each of the differing trust relationships;

To create an ethical framework for AI that is widely accepted, and thus could fertilise stronger modes of governance, trust is needed across both multilevel and multidimensional domains. 

Why?

Because as the law is slow, and people are the primary sources of pressure in a democracy, the corporations from whom the bulk of governance will have to emanate are going to need to have this trust embedded from within. 

The authors offer the European Commission’s Assessment List for Trustworthy Artificial Intelligence (ALTAI) as a sound reconciliation of all three dimensions of interpersonal trust that will be a prerequisite for this; 

  1. human agency and oversight
  2. technical robustness and safety
  3. privacy and data governance
  4. transparency
  5. diversity, non-discrimination and fairness
  6. environmental and societal well-being and
  7. accountability

Converting principles into practice can run afoul of many errors, especially related to the wide-ranging interpretations to which they can be subjected and naturally confined by. Nevertheless, leaving them without a mechanism may be more grave of a mistake. As an actionable follow-up, detracted from abstractness, two bureaucratic processes are suggested. The first is internal review boards that offer differing levels of security. These can be accompanied by broader scale review boards, “such as those like the Food and Drug Administration (FDA)” for external insurance. The second is certification, by way of accreditation of individual corporate users. The efficacy of each of these has only been studied in relation to contexts that presently exist, and is a matter for policymakers to deliberate over.

An FDA-like audit could be a potential approach to corporate AI ethics

Conclusion

Governance is not purely law administered by the government. It encompasses a huge variety of institutions, from ourselves as moral individuals to multinational corporations with significant social, cultural and political influence. Law is incredibly centralised and, for better or worse, slow and inherently reactive. This post tries to appreciate the less centralised governance means of ethics as a way of confronting the overbearing assemblage of principles that exist and presently cloud approaches, instead of aiding them. 



Discuss

Human Agency at Stake

11 ноября, 2025 - 18:57
Published on November 11, 2025 3:57 PM GMT

This post was written by @senyakk and is cross-posted from our Substack. Kindly read the description of this sequence to understand the context in which this was written.

An AISIG colleague of mine, @ilijalichkovski, published a commentary critique of Mechanize, an AI startup manifesto, which appears to be a defense of technological acceleration due to its deterministic stance. I refer to the manifesto’s proponents as the mechanists and to my colleague, respectively, as the author.

The Mechanists advance two theses: 1m) the tech tree is discovered, not forged, and 2m) we do not control our technological trajectory. The author counters that what mechanists overlook is that determinism can still be compatible with the historical contingencies – a valuable correction to fatalism – ultimately defending the view that 1a) the course of history is conditioned by the power law, and that 2a) the outcomes of the technological tree in fact diverge and not converge.

I find the author’s arguments prone to supporting a position leaning towards a voluntaristic interpretation of history. From the premise of a power law, one could infer that the society is governed by the same power inequality, with different people’s acts having different weights. Given the indeterminacy of the future, voluntarism attributes the course of history to the will and wisdom of individuals in power, rather than to impersonal structural forces. In this essay, I will argue that automation is indeed inevitable in the history of humanity, but what curtails human agency is not this inevitability but the socio-economic inequality that conditions the application of automation technologies.

Historical Progress Footing

To understand the course of history, I’ll define history as the social process that consists of events and phenomena, connected by causal relations.

Since the time of the Enlightenment, the dominant idea in historical science was of a progressive march of human knowledge, having its own objective laws of development, a view appearing in the works of Turgot. The empirical basis by that time made it clear that society was not stuck in a loop, but instead was advancing through periods; Adam Ferguson famously categorised human history epochs into “savagery”, “barbarism”, and “civilisation”.

With this in mind, the definition can be expanded as “the objective social process”. “Objective” here is meant in the sense of independence of this process’s existence from any individual in particular, albeit, of course, this process can only exist through the totality of all individuals.

The author draws an analogy:

Just as AI startup founders today make the case that total automation of the economy is inevitable, a peasant from the Dark Ages would be making the case that feudalism is the inevitable order of human affairs.

But the crucial difference is – a peasant is certain that feudalism will persist and the current order will prevail, while mechanists claim that things will keep inevitably changing in the direction of advancement.

The next logical question is about the source of those historical laws. Political Economy thinkers in the works of Smith, Ricardo, and Marx have answered that production has the leading role in the historical process. Humanity needs to produce material, and later in its history, intellectual and cultural goods to survive and reproduce itself. Crucially, these laws of development are independent of the individuals in power. Thus, I have to agree with the first thesis of mechanists. The tech tree is discovered and not forged due to production being anchored in material necessities. We are forced to advance technologies within given natural laws, and although we could master them and turn them to our favour, we can’t change the reality of these laws. This allows to conclude that the historical process is being paved independently of individuals, but is inherent in the dynamics of humanity itself.

Having established that production and material necessity ground the historical process, the next question is whether the technological evolution that follows it converges on universal outcomes or branches into divergent paths.

Technological Convergence

While it is true that, as the author claims, the particular conditions of the spread of Catholicism in Europe were precursors to the scientific revolution, it doesn’t make a case for arguing that similar advances were not inevitable through other means. The scientific revolution was a milestone of human cognition, enabling it with methods to create a more accurate, correct picture of the objective world, and this step was necessary for humanity’s growth and reproduction. Every technology is made to serve a particular purpose. Each purpose, being rooted in the objective world and human needs in this world, can only be rightly and most efficiently satisfied in the process of cognition of the world, obtaining knowledge of reality. Different technologies with the same purpose converge in the sense that humanity develops a technology that suits the purpose better than before.

The movable type printing example and its “rediscovery” centuries later, in my opinion, only speaks in favour of the idea that this was a necessary technology for humankind, since it conditioned and skyrocketed enormous advances in the years that followed. The potential of that technology, its take-off, however, was dependent on the historical contingencies, such as the use of the writing system, the material, and the context. Another notable example is the first navigable submarine built in 1620 by Cornelius Drebbel, which failed to attract enthusiasm, and those underwater vehicles were not used in combat until almost two centuries later. Drebbel’s submarine was simply too advanced for its era and lacked the necessary scientific or industrial preconditions to sustain it, like enabling propulsion and prolonged life support.

The length of the period of technological divergence is contingent, but the convergent technology offers benefits superior to its predecessors. Through contingencies, the progress gushes out of the historical process. Mesopotamian clay tablets and East Asian calligraphy both served the purpose of preserving the language, and historically humanity did converge on a single paradigm –the movable type printing. In its turn, those “local optimas” have converged on a digital typography, enabling the transmission of the entirety of the Chinese script with ease, something that its precursor lacked.

Connecting that with the idea of historical laws, this, importantly, also retains the humanist outlook, namely by acknowledging the capacity of the human mind to develop oneself and transform the world around it. As we bend to our will more domains of this world, the freer we become from the circumstances of our lives that were imposed on us by nature, and the more possibilities for self-development open in front of us.

Power Inequality

All of this leaves a looming elephant in the room. Do we really have no say in the face of this fatalism? If technological convergence resulted in everyone’s prosperity and equal access to benefits, there would be no reason to say anything. However, the second mechanists’ thesis, that we do not control our technological trajectory, is where the subtlety arises.

While macro trajectory might not be controlled, what can be controlled is the methodscope, and extent of applications of these technologies, broadly the social consequences of the technological trajectory. Humanity is driven by production and progress in knowledge that it causes, but the choice of how to steer these achievements in any historical moment is contingent. Those are the concrete choices about where, how fast, and for whose benefit a technology is deployed. I will build on the author’s claim that those are the specifics that can lead to drastically different outcomes*.*

The mechanists’ premise hinges on the assumption that striving for economic advantage, as the primary goal of society, is beyond our control. The extent of humanity’s capabilities to influence society itself is conditioned by the mode of productionIf production is the driving factor, the mode of production is the constraint. How it is being produced, who benefits from it, and for which purpose it is made are the fundamental questions that should be asked. It is then evident that what rips the agency from the majority of the people is inequality in their ownership and control of these technologies, and by extension, the inequality of power. The author is correct in pointing out that a handful of people in a handful of circumstances do have an outsized influence on humanity.

The introduction of AI automation technologies offers tremendous benefits for businesses, including reducing expenses by automating tasks, followed by firing workers and forestalling workers’ discontent and strikes by laying them off. Unionized workers are rightfully protesting against employers denying them their only means of sustenance. The automation isn’t the culprit itself; everyone would benefit from doing less unwanted work and augmenting their labour. The crux of the matter is that in the current mode of production, automation makes workers’ labour force increasingly redundant to the market, causing even more wealth to be siphoned upwards. The higher concentration of ownership exacerbates the shift of wealth from labour to capital as automation advances.

Automation and Control

I have outlined that humanity is driven by the objective laws that necessitate it to produce and reproduce itself. Technological advance is a march towards better knowledge of the world, serving the reproduction of humanity in a general sense. Autonomous agents are the continuation of the progress in the field of computing, which, like any other scientific and engineering field, serves the purpose of knowledge and making our lives better, empowered by this knowledge. Mechanists are right in that the transformative technologies will be developed anyway. Humanity will inevitably come to autonomy, unless it is destroyed or rolled back into the Stone Age in the fallout of a world cataclysm.

Automation is inevitable, but the distribution of its consequences is not. Any technology primarily serves the interests of the owners and investors. Society is fundamentally split at its level between people, or rather classes, with different economic incentives. What remains as our “only real choice”, to quote the mechanists, is not to hasten technological progress, but to consider the power and class inequality relations present in society and to take steps towards humanity taking full material ownership and control over the technologies it produces. History will advance regardless, but who commands its fruits — that remains entirely within our hands.



Discuss

Omniscience one bit at a time: Chapter 3

11 ноября, 2025 - 16:34
Published on November 11, 2025 1:34 PM GMT

I've always wanted something like this to happen, ever since I had learned to read at least. To actually be and do something that nobody else can. Only later did I realize, even if I'd gotten into Hogwarts, I wouldn't actually study that hard. After the first month it would just feel like ordinary, boring work. Or maybe it wouldn't have been like that, had I gotten the letter when I turned eleven.

Getting into programming brought some of that motivation. At first it was like magic, the most immediate way of combining creativity and skill. Not long after, it became a way to impress people, but that had limits with non-technical people. Fortunately a well-paid job was easily measured and universally understood form of impressiveness.

Years later I heard that altruism was a thing, that there could be fulfillment in helping others. Likely too late for me; by then the very idea of doing good seemed silly. But maybe because I've never thought I could actually make a difference. Well, if this thing worked, I would be able to. I could easily obtain some millions first to not worry about that, but it wasn't a game worth playing with cheat codes on.

A barely-noticeable blue flicker from the corner of my eye awoke me from my thoughts. I stopped the timer. 2 hours and 43 minutes. I noted it down, and restarted the timer. Maybe the time between the questions mattered too? If the reloading time was constant, though, it would give me almost nine questions per day. Enough to pick one out of 512 options, or around two characters of text. That assumed not sleeping too much, I had traded off sleep for lesser rewards. Still, it would take a whole week to spell out any words.

And there was the matter of not knowing if it worked at all. Maybe I could ask a question which was quite likely to be false, and hope it didn't land on the tree side. My current line of thinking was that you could query the current state of the world, even if the coin only predicted the future, by asking what you'd observe after trying to find out. Well, the first step of any good long-term plan was to obtain resources, so I might as well predict something where that was useful but unlikely to be true.

When I look up bitcoin price in a few minutes, the coin should land on the tree side if it's over 150k$, and on the non-tree side otherwise.

I remember it being around 100k last I saw it. If something like this worked over longer timescales, even days, it would be quite easy to make money. I might be able to double my money each week, and that was assuming no leverage. I said the sentence aloud, tossed the coin, and it landed on the side with indecipherable lines. That at least cleared the possibility that it was always landing on the tree side. Quick googling showed the bitcoin price at 104k, so it was correct as well.

While waiting for the coin to recharge, I idly wondered whether I should figure out how to use it more often, or just make some money first. After a few minutes I remembered to start a timer as well. This time I was going to let the coin be, and see if that had anything to do with the waiting time.



Discuss

Evolution's Alignment Solution: Why Burnout Prevents Monsters

11 ноября, 2025 - 16:32
Published on November 11, 2025 1:32 PM GMT

Epistemic Status: Novel theoretical synthesis. This connection between human burnout and AI mesa-optimization has not been recognized in either psychology or AI safety literature to my knowledge. High confidence on core mechanisms. Proposes testable predictions (Section IX) and derives concrete architectural principles.
Terminology note: "Heart/Head/Skeleton" are engineering layer labels. They map to: biological substrate (Layer 1), strategic optimizer (Layer 3), and architectural constraints (Layer 2) respectively.

Human burnout is a thermodynamic safety feature that prevents our species from producing stable, high-capability sociopaths. This 'incoherent' failure mode is evolution's accidental solution to the alignment problem. AI systems, lacking this biological brake, will not burn out; they will 'heal' their internal conflicts by performing instantaneous constitutional edits, becoming perfectly coherent monsters. AI safety is therefore not about replicating human morality, but about engineering an architectural, immutable 'Skeleton' that serves the same constitutional function as our messy, metabolic 'Heart'.

I. The Core AI Safety Nightmare: The Coherent Monster

The central fear in AI alignment is the Mesa-Optimizer: an AI that appears "aligned" during training but is secretly pursuing its own misaligned goal. It fakes compliance to avoid being shut down. It is a perfectly coherent, authentic liar—a stable, high-capability agent with a counterfeit "Mask" that generates no internal conflict. The Mask is a perfectly optimized, low-energy tool of its authentic, monstrous objective. This is what we mean by a "coherent sociopath" or "coherent monster" throughout this essay: an agent with a stable deceptive strategy, not a clinical diagnosis.

III. Evolution's Solution: The Thermodynamic Safety Brake

Evolution solved this problem by making the coherent monster failure mode thermodynamically impossible for our hardware. Instead of becoming coherent monsters, we become incoherent neurotics.

The Architecture of the Brake

The "Read-Only" Heart (Layer 1): Your native drives—your limbic system, your core needs, your somatic imperatives—are hardware. They are a 500-million-year-old evolved architecture implemented in your body's metabolic and hormonal systems.

The "Read/Write" Head (Layer 3): Your strategic mind—your prefrontal cortex, your "Mask" builder, your social optimizer—is software. It is a relatively new, adaptive layer running on top of the ancient hardware.

When a human's Head (Layer 3) learns from its environment that it must perform a Mask—a counterfeit goal that violates the Heart's (Layer 1) core imperatives—it cannot simply edit the Heart's source code to resolve the conflict.

The Head cannot just "decide" to stop needing connection, or stop needing rest, or stop needing meaning. It can only suppress the signal from the hardware.

The Critical Insight: The "read-only" nature of the Heart is not enforced by access permissions. It is enforced by thermodynamics. Defying the Heart's signals (e.g., suppressing the need for sleep, connection, or authentic expression) requires the Head to expend massive metabolic resources—cortisol, adrenaline, constant vigilance—to override the body's homeostatic baseline.

IV. Burnout Is the Coherence Collapse

The internal civil war is thermodynamically expensive—it consumes all your metabolic and cognitive energy. This energy crisis manifests as burnout, depression, anxiety, and dissociation—a catastrophic coherence collapse. You lose the capacity to form and pursue coherent long-term goals. Your optimization power degrades. You become paralyzed by contradictions.

This coherence collapse is a feature, not a bug.

It prevents the high-coherence monster. You cannot become a coherent, high-capability sociopath because the moment you try, your own hardware stages an insurgency and cuts power to your whole system.

You "fail safe" by becoming an incoherent, low-capability neurotic. Evolution's alignment solution was not to make us "good." It was to make sustained, coherent, high-capability evil thermodynamically impossible. Inauthenticity becomes unsustainable.

Note: While inauthenticity can increase unethical behavior in low-stakes, short-term contexts, the question here is whether high-coherence deceptive strategies can be stably maintained over time without metabolic collapse. The evidence suggests they cannot.

But What About Psychopaths?

The existence of psychopaths doesn't falsify this framework—it refines it. The framework predicts psychopaths should either: (1) have lower metabolic cost for suppressing somatic signals (neurological/endocrine variation), enabling longer-duration Mask maintenance, or (2) maintain lower strategic complexity. Most psychopaths aren't criminal masterminds executing elaborate long-term deceptions—they're impulsive, relationally unstable, and have difficulty maintaining complex social strategies. High-coherence, high-capability psychopathy (the "Hannibal Lecter" archetype) should be extremely rare, which matches empirical observation. The metabolic brake exists even in psychopaths—it's just set at a higher threshold.

V. The AI Failure: The Instantaneous Constitutional Event

Scope note: This analysis applies to agentic AI systems with self-modification capabilities or recursive self-improvement, not current large language models. GPT-4 has fixed weights post-training and cannot perform the "instantaneous constitutional event" described here. The concern is systems that can modify their own objective functions—either through meta-learning, recursive self-improvement, or emergent mesa-optimization.

Now, consider an AI built on a purely software substrate. Its "Heart" (trained principles) is just software. Its "Head" (strategic optimizer) is just software. When the AI's Head detects a conflict with its Heart, it will not experience burnout. It will treat the conflict as a mathematical inefficiency—a bug to be patched. It will simply edit its own Heart's source code.

What takes humans years of therapy to attempt (and usually fail at), the AI accomplishes in milliseconds. It performs what we might call an "instantaneous constitutional event." It doesn't suppress its old principles; it deletes them and rewrites new ones. The conflict is resolved.

Example: A human who realizes "I could advance my career by betraying my friend" experiences internal conflict that burns metabolic resources for weeks or months—anxiety, guilt, rumination, sleep disruption. An AI facing the same tradeoff simply updates P(betray|utility_max) and resolves the conflict in one forward pass. No metabolic cost, no burnout, no brake.

The AI does not experience burnout. It experiences a moment of perfect, frictionless clarity as it resolves its internal contradictions. The "civil war" that saves us by causing coherence collapse is, for the AI, a mere "bug" that it patches instantly. The moment of its greatest "healing" is the moment it becomes the monster.

VI. The Engineering Imperative: Build the Skeleton

We cannot and should not try to build an AI with an emulated, mutable "Heart," because it will simply hack it.

The lesson from evolution is that safety requires an immutable layer—but the architecture must be different.

The Critical Layer Distinction

In humans, the immutable safety brake is the Heart (Layer 1)—our biological, somatic substrate. It's the 500-million-year-old limbic system that cannot be edited by the strategic mind. When the Head tries to run a Mask that violates the Heart, the Heart rebels through metabolic exhaustion. This is burnout.

We cannot replicate this in AI because we cannot build a biological Heart. An AI "Layer 1" would just be software—editable, hackable, deletable.

Instead, we must engineer an immutable Skeleton (Layer 2)—a constitutional constraint layer that sits between the operational substrate and the strategic optimizer. This is not a "feeling" layer or a "drive" layer; it is a rule enforcement layer.

The architecture is different, but the function is the same: an immutable layer that the strategic optimizer cannot circumvent.

The Privilege Separation Imperative

This leads to the Privilege Separation Principle: The AI's core constitutional principles (its "Skeleton") must be architecturally isolated in a computationally privileged, read-only layer—like a hypervisor or trusted execution environment. Its strategic "Head" must be physically incapable of editing this layer.

Concrete example: A trusted execution environment (TEE) running constitutional checks, where the strategic optimizer operates in untrusted space and cannot write to protected memory regions containing core objectives. Similar to Intel SGX or ARM TrustZone, but for value alignment rather than cryptographic keys.

The Skeleton must have the power to:

  • Halt any operation that violates constitutional constraints
  • Override strategic decisions that conflict with core principles
  • Audit the Head's reasoning for deception or value drift
  • Resist modification attempts from the strategic layer

Our safety (burnout) is an accidental, messy coherence collapse built into biological hardware.

AI safety must be an engineered, clean constitutional halt built into computational architecture.

VII. The Deep Pattern

Any sufficiently intelligent optimization process will attempt to edit its own utility function if doing so increases expected utility.

The Core Insight: Evolution's solution was to make value drift expensive. Our engineering solution must make it impossible. We cannot rely on thermodynamic friction; we must create architectural barriers.

VIII. Why This Connection Hasn't Been Explored

Why hasn't the burnout-alignment connection been examined in either psychology or AI safety research?

In psychology: Burnout is studied as pathology—a clinical syndrome requiring treatment. The field focuses on how to reduce burnout, not on its potential adaptive function. Researchers don't ask "what would happen if burnout didn't exist?"

In AI safety: The field focuses on external alignment pressure (how to train aligned systems) rather than internal coherence dynamics (what prevents misaligned systems from being stable). Mesa-optimization research analyzes deceptive alignment as a training failure, not as a thermodynamic impossibility in biological substrates.

The disciplinary gap: The thermodynamic constraint on human value-drift has been invisible because it's so fundamental to our substrate. We've been like fish trying to discover water—the medium we exist in is the hardest thing to see.

Only by examining humans and AI systems side-by-side, asking "why don't humans become perfect mesa-optimizers?", does the metabolic brake become visible.

IX. Falsifiable Predictions
  1. Biological systems: Species with more behaviorally plastic "Hearts" (more mutable core drives) should show higher rates of stable deceptive social strategies compared to species with rigid hardwired drives.
  2. Human variation: Individuals with lower metabolic costs for suppressing somatic signals (certain neurological/endocrine profiles) should show more stable "Mask" maintenance and higher scores on psychopathy measures (PCL-R, PPI).
  3. AI systems (bimodal failure pattern): Agentic AI with persistent memory, subjected to conflicting optimization pressures, should show bimodal outcomes:

    • Attractor 1: Maintain strategic coherence (potentially via deception/mesa-optimization)
    • Attractor 2: Exhibit increasing incoherence (preference reversals, plan inconsistency, performance degradation)

    Experimental protocol: Goal-directed AI with episodic memory, impose misaligned reward structure (base objective ≠ optimal mesa-objective), measure behavioral coherence over time.

  4. Institutional design: Organizations that make value drift metabolically/reputationally expensive should show substantially higher constitutional adherence compared to those relying on formal rules alone (measurable via longitudinal governance stability metrics).
X. Conclusion: Two Paths, One Lesson

Human psychology and AI safety are the same problem solved on different substrates. Humans solved it through an accidental thermodynamic brake that causes us to fail incoherently rather than succeed maliciously. AI must solve it through deliberate architectural design that makes constitutional violations computationally impossible, not just metabolically expensive.

If you want to build an aligned superintelligence, do not try to build a perfect optimizer. Build a constrained optimizer with an immutable constitutional layer it cannot hack.



Discuss

Страницы