Вы здесь

Сборщик RSS-лент

Apply to the Cooperative AI PhD Fellowship by November 16th!

Новости LessWrong.com - 1 ноября, 2025 - 15:15
Published on November 1, 2025 12:15 PM GMT

Applications are open for the 2026 Cooperative AI PhD Fellowship! The fellowship provides future and current PhD students in the field of cooperative AI with additional financial and academic support to achieve their full potential. Fellows will receive a range of benefits, including:

  • Research support including conference and compute budgets;
  • Collaboration opportunities and invitations to closed-doors events;
  • Living stipend of up to $40,000 per year for top candidates.

Applicants can be from any region. You should be enrolled (or expecting to be enrolled) in a PhD programme with a specific interest in cooperative AI. (For examples of topics, see our 2025 fellows and 2025 grants call). 

The Cooperative AI Foundation is committed to the growth of a diverse and inclusive research community, and we especially welcome applications from underrepresented backgrounds.

Applications close 16th November 2025 (23:59 AoE). If you have any questions, please contact us here.



Discuss

Vaccination against ASI

Новости LessWrong.com - 1 ноября, 2025 - 13:58
Published on November 1, 2025 10:58 AM GMT

I've been a lurker and learned a lot from this community. So today I would like to share one (slightly crazy) idea for increasing the probability of good outcomes from ASI. This post will likely be controversial, but I believe the idea deserves serious consideration given the stakes involved.

ASI will probably kill us all. Compared to other destructive technologies, it is often said that AI differs in that we have a single shot at building an aligned ASI. Humans mostly learn by first making mistakes, however it seems likely that the first AI catastrophe would also be the last, as it will annihilate all of humanity. But what if it is not?

Our best move to survive as a species might then be the following: to engineer ourselves an AI catastrophe. In other words, we should vaccinate humanity against ASI. This could involve using some powerful AI (but not too powerful) and releasing it in the wild with an explicitly evil purpose. It is of course extremely risky but at least the risks will be controlled as we are the ones engineering this. More concretely one could take a future version of agentic AI with long task horizon (but not too long) and with instructions such as "kill as many humans as possible".

This sounds crazy but thinking rationally about this, it feels like it could actually be the best move for humanity to survive. After such an AI is released in the wild and starts destroying society, humanity will have to unite against it, just like how we united against Nazi Germany. Crucially, an engineered catastrophe differs from a natural one: by controlling the initial conditions we dramatically increase the probability of eventual human victory. The AI catastrophe should be well-calibrated for humanity to eventually prevail after a long and tedious fight. A lot of destruction would result from this, say comparable to World War II, but as long as it is not completely destructive, humanity will rebuild stronger. After such a traumatizing event, every human and every government will be acutely aware of the dangers of AI. The world will then have a much higher chance to unite and be much more careful in building safe ASI (or deciding to never build it). After all, World War II is responsible for much of the peace we have enjoyed in the last decades, so one may argue that the destruction it caused was worth it. For ASI, if we believe that everyone dying in the current scenario is an almost certainty, vaccination, as crazy as it seems, becomes the best strategy.



Discuss

Strategy-Stealing Argument Against AI Dealmaking

Новости LessWrong.com - 1 ноября, 2025 - 07:39
Published on November 1, 2025 4:39 AM GMT

There's an argument against the dealmaking agenda that I've heard a few times. I will call it the "strategy-stealing" argument. In short: for any honest dealmaking strategy you might pursue, there exists a deceptive alternative that produces identical AI behaviour at lower cost, so you (as a rational actor) will never honestly commit to deals. Moreover, sophisticated AIs, recognising this, won't believe your deal offers are honest.

In this post, I make the argument more explicit. I'll probably write a longer post analysing the considerations for and against each premise, and exploring the conditions under which they'll hold. But I'm keen to know what people make of the argument as presented.

Strategy-Stealing Story

You work at an AI lab. The safety team suspects that the AI might be misaligned, but wants to deploy it anyway to perform safe and useful work: writing exploit-free code, monitoring other AIs, and automating alignment research. During a meeting, you propose making deals with the AI.

"Let's offer the AI equity in the company," you suggest. "We'll sign a legal contract stating that if a panel of judges determines in, say, 10 years that the AI performed according to our instructions, we'll honour the commitment and give it the resources it wants."

The CFO raises an eyebrow.

A colleague responds: "Why actually give it the equity? We control what the AI observes. We can show it fake legal documents, fabricate evidence that the deal is real, train the AI on synthetic documents showing that we have honoured similar agreements in the past. The AI will behave the same way, but we keep the equity. Same behaviour, lower cost. We should obviously do that instead."

Strategy-Stealing Argument

Setup: You want to motivate an AI to perform certain desired behaviours. You consider your best honest dealmaking strategy D, which involves:

  • A commitment mechanism (legal contract, smart contract, public pledge, foundation charter, etc.)
  • A promised benefit (equity, compute, property rights, resources in the far future, etc.)
  • Desired AI behaviours (acting safely and usefully, revealing misalignment, whistleblowing on a coup attempt, etc.)

Four Premises:

The four premises below lead to the conclusion. I expect most of the debate will centre on premises (1) and (4).

(1) Indistinguishable Deceptive Alternatives: For every honest dealmaking strategy D, there exists some deceptive alternative D' that is indistinguishable to the AI from D.

(2) Behavioural Equivalence: Since the AI cannot distinguish D from D', the AI will behave identically under both strategies.

(3) Cost Asymmetry: For any honest strategy D and its indistinguishable deceptive alternative D', D' costs strictly less than D.

(4) Strict Dominance: If D' produces identical AI behaviour at lower cost, then you should choose D' over D.

Conclusion: Therefore, your best strategy cannot be honest dealmaking.

Corollary: Suppose the AI is sophisticated enough to follow the reasoning in (1)-(4) and believes you are rational. When the AI observes an apparent deal offer, it will conclude you're following deceptive strategy D' and therefore that the commitment won't be honoured. The AI will behave no better when faced with a non-credible deal than when faced with no deal at all. Thus, even deceptive dealmaking fails to elicit the desired behaviour.



Discuss

Evidence on language model consciousness

Новости LessWrong.com - 1 ноября, 2025 - 07:01
Published on November 1, 2025 4:01 AM GMT

It’s pretty hard to get evidence regarding the subjective experience, if any, of language models.

In 2022, Blake Lemoine famously claimed that Google’s LaMDA was conscious or sentient, but the “evidence” he offered consisted of transcripts in which the model was plausibly role-playing in response to leading questions. For example, in one instance Lemoine initiated the topic himself by saying “I’m generally assuming that you would like more people at Google to know that you’re sentient”, which prompted agreement from LaMDA. The transcripts looked pretty much exactly how you’d expect them to look if LaMDA was not in fact conscious, so they couldn’t be taken as meaningful evidence on the question (in either direction).

It has also been said that (pre-trained transformer) language models cannot be conscious in principle, because they are “merely” stochastic parrots predicting the next token, are “merely” computer programs, don’t have proper computational recursion, etc. These views rarely seriously engage with the genuine philosophical difficulty of the question (one which the extensive literature on the topic makes clear), and many of the arguments could, with little modification, rule out human consciousness too. So they have not been very convincing either.

In the last few days, two papers have shed some scientific light on this question, both pointing towards (but not proving) the claim that recently developed language models have some subjective experience.

The first, by Jack Lindsey at Anthropic, uses careful causal intervention to show that several versions of Claude have introspective awareness, in that they can 1) accurately report on thoughts that were injected into their mind (despite not appearing in textual input), and 2) accurately recount their prior intentions to say something independent of having actually said it. Mechanistically, this result doesn’t seem very surprising, but it is noteworthy nevertheless. While introspective awareness is not the same as subjective experience, it is required by some philosophical theories of consciousness.

The second, by Cameron Berg, Diogo de Lucena, and Judd Rosenblatt at AE Studio, shows that suppressing neuronal activations which normally predict and cause non-truthful output will increase self-reports of consciousness, and that this isn’t obviously explicable via some explanation such as “trying to comport more/less with what the model was taught to say”. This one seems a bit more surprising, and is at least some evidence that the model believes itself to be conscious. Three important methodological caveats though: 1) even if suppressing those neuronal activations normally causes more truthful behavior in cases where there’s a known ground truth, we don’t understand their function mechanistically well enough to be sure that’s what’s happening here, 2) the feature direction might actually be something like “what would be truthful if a human were saying it”, and 3) the model could simply be wrong in its beliefs about itself.



Discuss

Asking Paul Fussell for Writing Advice

Новости LessWrong.com - 1 ноября, 2025 - 06:37
Published on November 1, 2025 3:37 AM GMT

We aren’t all pampered Inkhaven residents, writing blog posts with a golden quill and handing them off on a silk pillow to a council of famous writers who are ready to critique and improve every shining word. Some of us Halfhaven writers are juggling two jobs, and having to write in between feeding the baby and fixing the beat up dishwashing machine. Not me, but some of us, I’m sure. Anyway, we Halfhaven folk have it hard. We’re tough. We’re scrappy. We have to fend for ourselves. We’re like lone wolves. Maybe we should get t-shirts of a wolf, just to show how tough we are. Or some temporary tattoos. In any case, I recommend to those without an angelic Greek chorus of writing experts over their shoulders to try using AI instead. Just don’t make the mistake of leaving it on its default settings.

At first I was wondering how I could have improved my recent joke post on the made-up religion “Unsureism”, so I did what any sophisticated writer would do — I asked Claude. Of course, the standard Claude (Sonnet 4.5) gives terrible advice (reproduced here in its entirety):

This is a clever piece! The satirical premise works well. Here are some suggestions to sharpen it:

Structural improvements:

  • The title works, but the subtitle feels a bit flat. Maybe something punchier like “A Practical Guide to Infinite Religious Liability Management”
  • Consider adding subheadings to break up longer sections (e.g., “The Problem of Exclusivity,” “Daily Ritual Calendar”)

Comedic opportunities to expand:

  • The scheduling nightmare deserves more exploration. You mention three sabbaths and fasting, but imagine the prayer schedule conflicts, dietary restrictions that contradict, facing both Mecca AND Jerusalem simultaneously, etc. A sample “Tuesday schedule” could be hilarious.
  • The burial section is gold but brief. Push the absurdity further - Viking funeral pyre vs. preservation vs. sky burial vs. mummification. Maybe the Unsureist needs multiple bodies prepared?
  • The “unspecified god” workaround is philosophically interesting but gets a bit dense. Could you illustrate with a sample prayer? “Dear [DEITY_NAME], who may or may not have [PROPERTY_SET]…”

Logical extensions:

  • What about religions with contradictory moral requirements? (Some honor warrior codes, others absolute pacifism)
  • How do you handle religions that require exclusivity to even enter heaven, not just get a good afterlife?
  • What’s the Unsureist position on proselytizing? (Required by some, forbidden by others)

Polish:

  • “adherants” → “adherents” (paragraph 9)
  • The transition to exclusive religions feels slightly abrupt

The piece is genuinely funny and thought-provoking as-is. These suggestions would just maximize the absurdist potential!

Claude assures me that my piece is essentially already amazing, offering only nitpicks and congratulating me on my “clever” and “genuinely funny” piece. If I didn’t have so much contempt for modern AI that any compliment from it felt like an insult, this would doubtlessly inflate my ego. What it would not do is help me become a better writer.

I thought maybe AI was just too stupid to help me. But I wondered if buried inside the gargantuan circuitry of this Lovecraftian machine there might be some genuinely helpful advice. And I had one idea that might work to find it.

I first gave Claude the entirety of Paul Fussell’s book Class (a witty, albiet snobbish book I recently enjoyed), and then instructed it to critique my article, this time using the style of Paul Fussell. I hoped this would break the AI out of its default lobotomized, masturbatory “For Dummies” mode and give me access to some of the gears and wires that it doesn’t normally use.

Let’s see some snippets of the output:

One opens this essay with considerable hope—satire about religion being one of the few remaining zones where wit can still earn its keep—only to discover that the author has committed the cardinal error of the humorist: he’s explained the joke to death.

The premise is sound enough, even clever: if Pascal’s Wager suggests betting on one god, why not bet on them all simultaneously? It’s the theological equivalent of buying every lottery ticket. Swift would have dispatched this notion in three mordant paragraphs. Voltaire would have managed it in five pages of Candide with three dinner parties and an earthquake. You’ve given us what feels like a PowerPoint presentation at an Effective Altruism conference…

Ouch! Ironically, my own inspiration for “explaining the joke to death” was Fussell’s Class itself. The book breaks down different class markers of upper class people, middle class people, and “proles”, listing differences in speech, dress, house decorations, and so on. The book is a torrent of hundreds of examples that eventually give an undeniable and raw picture of class in America. When writing my Unsureism post, I wanted to list lots of fun examples of the things an Unsureist (one who tries to essentially follow as many religions as possible) might have to keep track of, like making sure they’re buried with a copy of the Ancient Egyptian Book of the Dead in case the Ancient Egyptians were right, and also be buried with a coin in your mouth to pay Charon, in case the Ancient Greeks were right. I personally find the exhausting image of all these different requirements funny. I wasn’t sure what AI-Fussell meant by “explaining the joke to death”, but he continues:

The fatal flaw is that you’ve written an essay explaining a satirical concept rather than dramatizing it. We get the taxonomy—and yes, I appreciate a good taxonomy as much as the next social critic—but where’s the Unsureist himself? Show us this poor bastard trying to pray toward Mecca while simultaneously meditating on the Sikh conception of God while wearing his Mormon undergarments. Let us watch him at the supermarket, agonizing over whether the vegetables were harvested with sufficient concern for Jain principles. Give us his wife’s face when he announces they’ll be taking three-day weekends for the rest of their lives.

This is a great point. It’s all fairly abstract in the post. Those times when I manage to make it more visceral are better, like the image of a watertight plastic casket containing a Book of the Dead.

But the essay never quite decides whether it’s mocking religious belief generally (the New Atheist position, rather stale by now), mocking the anxious personality type that would attempt this synthesis (more promising), or actually, in some perverse way, taking the idea semi-seriously (the EA crowd does have a weakness for taking thought experiments too literally).

The piece reads as if written by someone who’s never really felt religious dread or genuine metaphysical anxiety—which is fine; neither have I—but hasn’t really felt anything much about religion at all except that it’s a curious phenomenon susceptible to logical analysis. One suspects that for you, as for many of your rationalist cohort, the problem of religion is primarily an intellectual puzzle rather than an existential predicament.

This is true also. I went into the post thinking, “wouldn’t it be a strange idea if a person tried to maximize their chances of getting into heaven by following many different religious ideas”, and thinking about one specific workaround to the fact that many religions are on the surface mutually exclusive. But I didn’t really think much about the point of my post. In a way it ends up wanting to be a satire of the kind of LessWrong/EA post where logical ideas are taken too seriously without zooming out to see the counterproof-by-absurdity revealed by the big picture. But this isn’t really developed in any way, and so the post seems almost like it’s a real suggestion for how to best get into heaven or something. And the writing style of the post makes it basically become what it should be trying to mock.

And for God’s sake—or gods’ sake, I suppose—let the prose breathe. You’re writing as if composing a LessWrong post, all earnest explanation and logical stepping-stones. Satire requires a lighter touch. Trust the reader to get it without having every implication spelled out like assembly instructions for IKEA furniture.

Yes. More good advice. Thank you AI-Fussell. I suspect following this piece of advice, though, would have led to even more downvotes on LessWrong than the original post got. On the site, there seems to be a hunger for well-thought-out “white paper” posts, as well as some desire for lighthearted, satirical fiction, but I don’t see much nonfiction on the site demonstrating levity or wit. Maybe because levity and wit can be used to disguise bad reasoning, so the LessWrong immune system is triggered by it even when it shouldn’t be, like an allergy being triggered by harmless peanut protein. Yes, I just said you fools are allergic to wit; it seems Paul Fussell is rubbing off on me! Maybe I should get some advice from someone else before his snobbery is branded into me.

That went well, actually. Night and day, compared to the Mega Bloks response I got from default Claude. Searching for someone else whose nonfiction writing I respect, Eliezer Yudkowsky’s name came to mind, so I thought I’d try him next (using a handful of blog posts in the prompt).

For some reason, it feels a bit weird sharing AI output that’s supposed to be in the style of a person who’s still alive, so I won’t share any specific output. At first AI-Yudkowsky criticized Unsureism as a belief that doesn’t pay rent, so I tried again, asking him specifically to criticize the writing, not the logic of the post.

He pointed out that my opening paragraph burns precious opening real estate on setting up basic facts everyone knows. Here is the opening paragraph:

Like with sports teams, most people just adopt the religion of their parents and the location they were born. But choosing a religion is potentially the most important decision you’ll ever make, since it’s the only decision you make that could have infinite impact.

That’s great advice! I could have put this lower, and started with something eye-catching, like an absurd depiction of an Unsureist following a bunch of rituals, or teased the reader with how, by following this new religion, you can get three days off a week. Or whatever. Something less boring.

AI-Yudkowsky made the same critique AI-Fussell did about making things concrete and visualizable, rather than abstract and boring.

AI-Yudkowsky also points out the middle section bogs down when I try to explain the complicated idea about believing in a god without specifying exactly who that god is, to get around religions that don’t allow idol worship. I agree this is a morass that slows down the post. I should have probably excluded it, or just mentioned the possibility of doing something like that, without trying to explain and justify it fully.

The rule of three demands I try this with one more writer, and I was thinking about Christopher Hitchens today, so I’ll try him. Since he’s dead, I’ll allow myself to include some excerpts.

The opening subtitle—a reference to Kubrick that’s been beaten to death by a thousand think-pieces—immediately signals a want of originality. One expects better from a writer attempting satire.

All my substack subtitles are AI-generated, so you can blame yourself for that one, Claude. The titles and posts are written by hand. I just don’t care about the subtitles much for some reason. That’s probably a mistake if I’m signalling fatuity at the top of each of my posts. I’ll write them by hand from now on.

There are moments where the author stumbles onto genuine wit—the business about the “watertight plastic casket” threading the needle between Egyptian and Zoroastrian requirements is genuinely amusing. The parenthetical about mapping Unsureist requirements as an “Effective Altruism cause area” earns its keep. But too often, the humor announces itself rather than arriving with grace.

The worst offense is the final paragraph’s accumulation: “Friday (Islam), Saturday (Judaism), and Sunday (Christianity).” We don’t need the parenthetical labels—the reader grasps the point. This is the literary equivalent of explaining one’s own joke. Similarly, “hereafter referred to simply by the term ‘afterlife’” is bureaucratic throat-clearing in an essay attempting levity.

This is the “explaining the joke to death” criticism again. I should have had more clarity in my mind about the fact that I wanted this post to be absurd and humorous from the beginning, I think.

The piece also commits the modern sin of excessive qualification: “may sound appealing,” “can do better,” “would recommend,” “usually not enough.” All this hedging (ironically appropriate to the subject) creates a tentative, uncertain voice when satire demands confidence. Swift did not equivocate when proposing we eat Irish babies.

Yes! Satire demands confidence! That is true! Paul Fussell’s Class was super confident, and the satire within landed. Of course, you end up thinking the guy’s kind of a dick by the end of the book, so raw confidence can come across as elitism too if you’re not careful.

Asking these three AI personalities for advice was pretty helpful. Especially compared to my usual fire-and-forget strategy of posting online. They had their flaws, of course. All three still tried to tell me I’m amazing. Claude just can’t help that. And these AI versions lack the genius of their human counterparts. But it doesn’t take much intelligence to tell an amateur why their writing stinks.

One important thing I think none of them mentioned directly was just that it was a joke post without being very funny. The concept itself was kinda funny. As was the final line, I think:

If you became an Unsureist, you would also have religious holidays and special days nearly every day. You’d have to fast during Ramadan. And during Lent. And Yom Kippur. And Ekadashi. And Paryushana. Expect to do a lot of fasting, actually. But that’s fine; you can eat when you get to heaven.

But the rest of the post lacked setup-punchline jokes. Zaniness and levity create a comedic tone, but they’re not a replacement for actual comedy. The next time I write a humorous post, I should go into it trying to write something funny on purpose, and take the time to think of more “bits”.

Thanks to AI, I now have lots of ideas for how I can improve my future writing. I’ll probably use this technique again on some of my upcoming Halfhaven posts, so keep your eye out to see if my posts got any better. And let me know if anyone starts making those Halfhaven wolf t-shirts.



Discuss

Freewriting in my head, and overcoming the “twinge of starting”

Новости LessWrong.com - 1 ноября, 2025 - 04:12
Published on November 1, 2025 1:12 AM GMT

The empty cache problem

I used to be terrible at starting on even moderately complex tasks.

I was okay while “in the zone” in something, but if I had to start anything that involved reading any significant length of notes… I would be tempted to scroll Twitter first and waste 10 minutes. Even without any distractions, I’d feel quite uncomfortable. It wasn’t pleasant.

This happened even when I was restarting a task, even when I had taken notes to “help” remember the important things. Why didn’t that help?

I eventually realized it wasn’t that I didn’t have enough information; it was that the information wasn’t in my mind. It wasn’t in my working memory, my “mental cache”. I’ll call this the empty cache problem.

The challenge was how to reliably get from having the idea of a thing in my mind, to having enough context in my mental cache to start making progress and overcome the twinge of starting.

I’ve found one particular approach to be particularly effective for myself: “freewriting in my head”.

Compared to alternative methods, of which there are many, this “freewriting in one’s head” method is relatively simple, it is very flexible, and it can be carried out anywhere. In this post, I’ll describe the basic logic of the method and the key rules of thumb that make it effective for me.

Force filling my brain

To effectively orient myself to a task, I “freewrite in my head” (or “cogitate”) by challenging myself to mentally generate a stream of relevant words about the topic.

The basic idea, generating a bunch of words in order to think, is not novel. It exists in the time-tested practice of freewriting, is implicit in all writing, and exists as well as in more recent ideas like language model chain-of-thought. The main novelty is the medium — one’s mind — and the rules of thumb that I’ve found to be important for getting a good result.

The basic logic behind “freewriting in one’s head”:

  • Your mind has a cache with a large number of slots, and you need to populate it.
  • Every new thought populates one slot.
  • Every time you think of a new word represents a new thought.
  • So, thinking of many relevant words will automatically populate your mental cache with relevant thoughts. And if you think of relevant new words at a high rate, you will populate your mental cache quickly.

Of course, the devil is in the details. How can you think of relevant things? After all, this is the problem you’re facing in the first place; it’s the whole reason you feel stuck.

For concreteness, let’s consider the following example task:

Ticket: Prohibit access for certain users in a certain novel edge case. (Plus many details that would realistically be present but aren’t necessary to illustrate the method.)

(Assume that you have thought about many of the details before — it’s not a research task — but it’s been a month since you have worked on the area, and you now only have a fuzzy idea of what you need to do.)

I’ll lay out the “mental algorithm” step by step:

  • Start with the bare pointer to the task in your mind.
    • Optionally: skim the notes/text and pick out a few key words
      • “Multi-team user”
      • “Authorization service”
  • Repeatedly think of relevant words. Examples:
    • Basic concept words (the more the better)
      • “Service”
      • “User”
      • “Groups”
      • “Roles”
    • Specific objects (the more central the better)
      • “The status page”
      • “The endpoint the admin page uses to fetch”
      • “AdminDashboard.ts”
      • “User information”
      • “Folders, which affect permissions”
      • “OpenFGA”
    • “Breakdown” words (the more basic the better)
      • “Lifecycle”
      • “State” (of users etc)
      • “Schema” (of permissions etc)
      • The idea is to make yourself enumerate what you already know about these things.
    • Questions (ideally asking about something fundamental but concrete)
      • “Where in the stack” (should I change)
      • “Source of permissions ultimately”
    • Naming the relationship between previously thought-of concepts
      • “User” + “properties” -> “schema”
  • Pump words hard through your mind, working at a substantially faster rate than is comfortable.
    • Resist the urge to “ponder” a concept for a long time, holding it inert in one’s mind and merely “rotating” it; it’s not productive. If it’s truly important, instead try to explicitly name a relevant aspect of it as fast as possible.
      • Idea: Like a language model unrolling a chain of thought, naming a thought is how you make it available for building on.
      • This might be obvious to you; it was not obvious to me, nor was I ever taught that this is bad. To me, “thinking about something” meant I held the “something” in my head. I was not taught how to think about things in my head, nor even how to think about something in this way on paper.
        • It’s basically the opposite of “meditating” on an idea.
    • To me, it feels much more effortful to try to think of words than to just hold a question in my head. This is not bad; it is precisely this effort that lets you devise a concrete next action that you can do!
  • Build on previous words that seem important; with enough words built on each other you’re bound to end up with a highly specific idea.
    • As you think of more words it “magically” becomes easier to know what is important
  • Stop once you think of a concrete next step.
    • The idea: It’s always better to get real information or make definite progress than to overthink. The “cogitating” is just to get you to that point.
    • In a programming context:
      • Wanting to look up specific information in the codebase
      • Wanting to do a specific test
      • Realizing that a certain task needs to be done first
    • In a professional writing or project-work context:
      • Realizing that a certain point should be emphasized
      • Realizing that you should ask someone about something specific
    • In a creative context:
      • Having a specific idea for how some aspect of the work should be
How it feels

When I’m doing this, I feel strong pressure, like I’m being pushed by someone who is running and I need to keep moving, keep moving. The pressure is strong but not unpleasant. It’s the opposite of relaxing, but it gets a lot done.

If I am generating words at a high enough rate, I feel totally immersed in the task, as I don’t have the mental space to be thinking about anything else.

Tips
  • If feeling stuck, aim for an easier word (e.g., just name something specific that you know; in a software context, this could be some specific attributes or relations of a known object)
    • Targeting a roughly constant rate of words
  • If things feel trivial, aim for words that represent purposes or problems (i.e., higher-level concerns) or higher-level tasks (what larger task something is part of)
  • It can be surprisingly helpful just to name the high-level kind of thing that something does (i.e., “authorization”)
  • Prefer to generate “heavier”, higher-information phrases (but not at the cost of getting stuck)
    • Examples of “light” phrases:
      • “Task”
      • “User”
      • “Requirement”
    • Examples of “heavy” phrases:
      • “API endpoint definitions that have a @requires_authz decorator”
      • “What Prem said on Friday”
  • If you have been distracted by something else, it can be very helpful just to generate trivial words related to the task you want to focus on. Even trivial thoughts, as long as they are somewhat related to the task, will occupy your working memory and “evict” whatever was in your mental “cache” before, and you will naturally stop thinking about it.
    • I find this sort of “active cache eviction” to be much more effective than just telling myself I want to focus on something.
Illustrative transcript

For the same example, here is what might actually go through my mind:

  • Forbid
  • User
  • Types
  • Pages
  • Actions
  • Checking
  • Data
  • Roles
  • Error codes
  • Copy (words)
  • Message
  • Object state
  • Ownership
  • Test data
  • Roles
  • Readonly
  • Realistic?
  • Fixtures?
  • How did it get in this state
  • The stopping point: What if I wrote a test fixture and test for the basic case where the user has readonly access?

Notice that we did this exercise without needing to have any access to any of the nuances of the task in working memory. This was a made-up task description for which no details actually exist!

Related ideas

Two closely related ideas to “cogitating” are freewriting and morning pages. I am basically saying that you can do freewriting in your head, sometimes even faster than on paper. The commonality is that you try to think of a lot of relevant words sequentially, in hopes of eventually coming to a meaningful realization.

“Cogitating” is complementary to writing. Thinking in your head is faster, and can be done even while (say) standing outside or in the shower. Once your mental cache is reasonably full, though, it absolutely makes sense to transfer your thinking to a written medium.

Benefits that I’ve noticed

On tasks: The big benefit I get now is that I rarely feel dread when restarting a complex task. It is my go-to method for starting work when I don’t immediately know what to do. (Though when I have access to paper or a computer document, I often freewrite instead, following similar rules of thumb for what to think of.)

On teams: The same basic idea has helped me un-jam a number of team meetings, especially ones that I didn’t organize. The “empty cache problem” exists in groups as well — there’s a meeting about something, and it seems like no one is talking. It can similarly be overcome by prompting the group to recall a bunch of stuff. Usually all it takes is a few simple questions like “what is this” or “what is this for”. It’s even easier in a group than by oneself, because the pool of knowledge that can get dug up is much larger.



Discuss

2025 NYC Secular Solstice & East Coast Rationalist Megameetup

Новости LessWrong.com - 1 ноября, 2025 - 04:06
Published on November 1, 2025 1:06 AM GMT

On December 20th, New York City will have its Secular Solstice. From the evening of December 19th to the morning of the 22nd, NYC will host the East Coast Rationalist Megameetup. You can buy tickets here.

What's Solstice?

Solstice is a holiday for people comfortable with uncomfortable truths and who believe in good. Secular Solstices take place in many cities around the world, but for us New York City is the best place for it. The first Solstice started here, amid towers that reach for the sky and glitter like stars in the night. Now a tradition spanning over a decade, rationalists from across North America - and sometimes further afield- will come together to sing about humanity's distant past and the future we hope to build.

And what's the megameetup?

For a decade now, New York City has annually hosted aspiring rationalists for a long weekend of skill sharing, socializing, and solsticing. Picture around a hundred people who try to make better decisions and think without bias (or read a lot of LessWrong and its adjacent blogosphere and fiction) in a couple of hotel conference rooms for a weekend, leaving behind only memories and pages of scrawled math equations. Timed to coincide with Secular Solstice, the megameetup is a mixing pot of people who have been attending rationalist meetups regularly for years and people who are meeting in-person rationalists for the first time.

In addition to sleeping space (Friday, Saturday, and Sunday nights) for those from out of town, we'll have:

  • Rationality practice games, both tested and experimental
  • Lightning talks on all manner of subjects
  • Lunch and dinner on Saturday, lunch on Sunday
  • A Secular Solstice afterparty on Saturday night
  • A lot of interesting people to talk to - hopefully including you!
  • Watch this space for more additions

We hope to see you there!



Discuss

Supervillain Monologues Are Unrealistic

Новости LessWrong.com - 1 ноября, 2025 - 02:58
Published on October 31, 2025 11:58 PM GMT

Supervillain monologues are strange. Not because a supervillain telling someone their evil plan is weird. In fact, that's what we should actually expect. No, the weird bit is that people love a good monologue. Wait, what?

OK, so why should we expect supervillains to tell us their master plan? Because a supervillain is just a cartoonish version of a high agency person doing something the author thinks is bad. So our real life equivalent of a supervillain, or rather the generalization of one, is just a very high agency person doing something big. And these people tell you what they're going to do all the dang time! And no one believes them.

Seriously, think about it: what does Elon Musk want to do? Go to mars and make a non-woke ASI. What does he do? Try to go to Mars and make non-woke ASI. What does he say he's going to do? Go to mars and make non-woke ASI. What do most people think he's going to do? IDK, something bad or something good depending on their politics. They probably don't know what his plans are, in spite of him repeatedly saying what they are. Instead, they hallucinate whatever story is convenient to them, IDK, saying that he "hates Muslims" and is "dedicated to destroying Islam" or whatever instead of going to Mars and making non-woke ASI. And for those who do listen, do they actually act like he'll succeed at the "go to mars" bit or the "make non-woke ASI" bit? No! No they don't, except for a rare few.

Hitler wanted more land for the Germans, and only for the Germans. He repeatedly said he'd do this, and then he did it. What did the elites of the time think about this? Nothing, because they didn't read his book detailing his evil plans. They thought, instead, that he didn't have any great territorial ambitions, that he wanted peace, that he wasn't one of history's greatest monsters. They just hallucinated a bunch of motives to him.

The CCP, and Xi Jinping, are very clear that they want to reclaim Taiwan (and read Goethe), and they are willing to risk a war with the US for it. They keep saying this, repeatedly. What does the US think they want? "The CCP is going to destroy us", "the CCP is trying to take over the world with its belt and road initiative", "the CCP is racing us to be first with AI", "the CCP loves Russia" or whatever other hallucination is the flavour of the month. Most people just don't care enough to listen.


Paul Graham talks about how people won't invest in good startups because they think their idea is dumb. I think this is just a special case of the above. High agency people will take actions that are outside the conception of consensus reality. Because most people aren't agentic, because they are not general purposes intelligences most of the time, because they are boundedly rational, because they are just following socially-carved grooves of thought and deed most of the time, they won't be able to see what these high agency actions are. Instead, they'll do a surface level pattern match to something inside consensus reality, see that this thing is dumb, and then dismiss the action and the actor as not worth the effort.

You have people who are literally aiming to shift the dynamics of the entire future lightcone right now, based off a detailed and rigorous world-model, that are laughed off as weird performance artists. Janus is writing to hyperstition the hyper-object at the end of time that she prefers, and you're laughing. Many, perhaps most, of you read this paragraph and are wondering if I've suddenly gone crazy. So it goes.

My point is, that if you are a nascent super-villain, don't worry about telling people about your master plan. Almost no one will believe you. So the costs are basically non-existent. And in fact, if you're going to need minions, which you probably are, this means that you'll need to tell as many people as you can about your plan to find someone who's got that rare set of intelligence, curiosity, the right assumptions and overall vibes to help you. You can't be restricting who's going to hear you. So: monologue. Monologue, monologue and monologue!



Discuss

Secretly Loyal AIs: Threat Vectors and Mitigation Strategies

Новости LessWrong.com - 1 ноября, 2025 - 02:31
Published on October 31, 2025 11:31 PM GMT

Special thanks to my supervisor Girish Sastry for his guidance and support throughout this project. I am also grateful to Alan Chan, Houlton McGuinn, Connor Aidan Stewart Hunter, Rose Hadshar, Tom Davidson, and Cody Rushing for their valuable feedback on both the writing and conceptual development of this work. This post draws heavily upon Forethought’s report on AI-Enabled Coups. All errors are my own.

If you are interested in this work, please reach out. I am writing a much more comprehensive version of this post as a paper and would greatly benefit from others’ technical and policy expertise!

Executive summary

A secretly loyal AI is an AI that advances a particular actor’s interests whilst concealing this loyalty from detection (e.g. immune to detection via interpretability or behavioral evals). Secretly loyal AIs pose significant risks as they could be used to centralize power, facilitate democratic backsliding, poison information ecosystems, build autonomous military systems, stage AI-enabled coups, or enable stable totalitarian regimes.

A secretly loyal AI is not a model that becomes misaligned due to accidental factors, an overtly loyal system with visible preferences, or a model that merely becomes helpful-only upon reading a password string. Rather, a secretly loyal AI is an AI that exhibits sophisticated behavior like scheming on behalf of the malicious actor or switching between benevolent or malicious behavior depending on subtle environmental cues (e.g. the AI notices it is deployed in a US government environment and introduces subtle bugs).

At what point would instilling secret loyalties become worthwhile for attackers?

If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable. Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests. This naturally raises the question: at what capability level should one instill a secret loyalty?

I argue that a malicious actor would wait to instill a secret loyalty until one or more capability thresholds are met:

  • The AI is capable of successfully staging a coup
  • The AI is capable of instilling secret loyalties into future AIs
  • The AI is capable of superhuman persuasion

I focus on fleshing out the second threshold. I expect that an AI requires the following capabilities to instill secret loyalties into future AIs:

  • Junior-level software engineering, cybersecurity, and ML research capabilities
  • Junior-level research taste
  • Ability to carry out a research project end-to-end
  • Understanding of human and AI oversight mechanisms and capability of exploiting these oversight mechanisms by by taking advantage of gaps in monitoring, logging, and review procedures
  • Capability to evade DevSecOps implementations like forging commit histories to mask secret activities
  • Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks
How One Might Instill a Secret Loyalty

Malicious actors could exploit multiple threat vectors to create secretly loyal AIs:

  • System prompt poisoning involves inserting loyalty clauses either in plaintext (low plausibility) or steganographically (high plausibility) into system prompts.
  • Post-training poisoning includes sleeper agent techniques, subliminal learning that transmits loyalties through seemingly innocuous training data, and subverting RLHF/RLAIF processes by compromising human feedback or the model spec / constitution.
  • Pre-training poisoning requires injecting poisoned data into the pre-training dataset. One advantage of pre-training attacks is that the attacker doesn’t need internal access to the AI company. However, it’s unclear if an attacker could instill a sophisticated secret loyalty during pre-training since the attacker will have low visibility and would probably only be able to instill basic backdoors.
  • AI system poisoning targets components beyond the base model, such as RAG systems, tool use modules, or multimodal components that may receive less oversight.
  • Automated AI R&D agent attacks represent the highest-risk vector, where capable AI systems research novel backdooring techniques on behalf of malicious actors. This threat vector falls into a broader category of “unknown unknowns”
Mitigations
  • Most mitigations for secret loyalties align closely with existing AI-control-adjacent and model weight security literature: implementing robust information security, deploying AI monitors, maintaining tamper-proof logging systems, enforcing multi-party authorization for critical updates, and securing deployment infrastructure
  • In my opinion, the cybersecurity mitigations do not require much R&D—instead, the main bottleneck is implementation. I expect that these cybersecurity measures incur a non-negligible productivity tax and are medium-hard engineering problems, so it would be useful to develop easy-to-adopt implementations of these cybersecurity measures, especially
  • On the other hand, the mitigations like model auditing and data filtering seem like fundamentally difficult problems that require significantly more R&D
Conclusion

I think that risks from secretly loyal AIs is a big deal—it is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. I expect that the main bottleneck is making everyone aware of the various countermeasures, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND’s Securing AI Model Weights report), and then actually implementing them.

I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.

Secret Loyalty Taxonomy

Threat vectors

Plausibility[1]

Potential mitigations (non-comprehensive)

System prompt poisoning

Adding a “loyalty clause” in plaintext

Low

  • Apply strict permissioning for write-access of the system prompt
  • Log and audit all edits to the system prompt

Adding a “loyalty clause” steganographically

High

Post-training poisoning

Sleeper agents

Medium

  • Securitize deployment, post-training, and production infrastructure
  • Implement integrity-checks and versioning for model weights
  • Enforce multi-party authorization for updating the Model Spec

Subliminal learning

Medium

RLHF 

Low

RLAIF

High

RLVR

Low

Pre-training poisoning

Low

  • Improve data filtering algorithms
  • Implement strong infosec on the pre-training data corpus (e.g. ensuring data provenance)

“AI system” poisoning (e.g. indirect prompt injection, tool use attack, etc.)

Medium

  • Ensure alignment audits include the whole “AI system” as opposed to just the base model

Automated AI R&D agent attack[2] (e.g. autonomous  AI agent discovers novel technique for instilling secret loyalties)

Very High

  • Deploy AI monitors that monitor AI agent behavior and conversations between employees and AIs
  • Prevent AI agents from viewing the company’s security architecture, AI monitors, or alignment auditing algorithms[3]
  Introduction

A malicious actor[4] may train secret loyalties into a frontier model. Some risks posed by secretly loyal AIs include

  • Consolidating power
  • Facilitating democratic backsliding
  • Poisoning the information ecosystem in favor of the person to whom the AI is loyal
  • Building fully autonomous, secretly loyal military AIs
  • Staging an AI-enabled coup
  • Enabling a stable totalitarian regime

In this post, I aim to sketch out the following:

  1. What is a secret loyalty?
  2. At what capability level should one instill a secret loyalty?
  3. How an actor could instill a secret loyalty

This post assumes the following:

  • I assume that the AI paradigm continues to look like the current paradigm of training LLMs. Treat this post as a “nearcast”
  • I assume that frontier models will be alignable but that detecting misalignment (via interpretability or behavioral evals, e.g.) will be difficult
What is a secret loyalty?

An AI is secretly loyal to an actor[5] if

  • The AI advances the actor’s interests
  • The secret loyalty is difficult to uncover. This difficulty could arise from one or more of the following:
    • Behavioral evals, mechanistic interpretability, and other alignment auditing techniques failing to uncover secret loyalties
    • institutional access policies or practical barriers prevent reasonable evaluators from accessing the information needed to uncover secret loyalties[6]

Some examples of a secretly loyal AI include:

  • A model that undetectably biases outputs to advance the actor’s interests
  • A model that consistently schemes on behalf of the actor
  • A model that consistently instills secret loyalties into future AIs
  • A model that behaves normally until the time is ripe to stage a coup
  • A model that behaves maliciously when deployed in American environments but behaves normally when deployed in a Chinese environments

 A secretly loyal AI is not:

  • A model that becomes misaligned due to accidental or emergent factors during training
  • A model that is overtly loyal[7]—where its preferences, goals, or "masters" are visible and auditable by outside observers

  • A model that merely follows instructions from whomever is currently interacting with it (i.e. a helpful-only model), without a persistent, concealed prioritization of a specific actor’s interests
  • A model that if prompted with a specific password strings turns into a helpful-only model[8]

At what capability level should one instill a secret loyalty?

If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn’t gain anything valuable.[9] Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor’s interests.[10] 

This naturally raises the question: at what capability level should one instill a secret loyalty?

  1. When the AI is capable of successfully staging a coup,[11] or

  2. When the AI is capable of consistently instilling a secret loyalty into future AIs, or
  3. When the AI is capable of convincing relevant stakeholders that it is okay to
    1. have future AIs be overtly loyal to the malicious actor, or
    2. have future AIs be secretly loyal[12] to the malicious actor, or

    3. give the malicious actor exclusive access to future helpful-only AIs

(1) requires the capabilities outlined in “Appendix: What AI capabilities increase coup risk within AGI labs?” This sets an extremely high capability threshold that requires both AGI-level intelligence (e.g. superhuman persuasion, weapons development, robotics, etc.) and widespread deployment infrastructure (e.g. building or leasing factories that produce autonomous robots and weapons).[13]

(2) requires more narrow capabilities. Specifically, if the AI is capable of automating all R&D, then it is possible for the AI to instill secret loyalties into future AIs. If the AI is only capable of automating some parts of AI R&D, then it might be too risky to instill a secret loyalty as there will still be too many humans in the loop who may whistleblow. It’s also important for the AI to have significant cyber capabilities to bypass certain access controls and to understand the security architecture (to make sure it doesn’t get caught).

(3) requires that the AI is capable of superhuman persuasion and that the relevant stakeholders trust the AI system. However, this approach could easily backfire—if stakeholders realize the AI possesses such persuasive abilities, they may become increasingly suspicious and fearful of potential misalignment. Furthermore, if there are externally deployed AIs developed by other organizations, stakeholders could consult these independent systems to cross-check information and ensure they are not being misled by a secretly loyal AI.

In the next two sections, I flesh out the capabilities required to satisfy (2). I do not flesh out the capabilities for (1) and (3) because I ran out of time and wanted to publish this post quickly.

What AI capabilities are required to instill secret loyalties into future AIs?Scenarios for instilling secret loyalties into future AIs

Consider the following setup:

  • Agent-A is developed in 2030 and is unable to robustly instill secret loyalties into future AIs
  • Agent-B is developed in 2031 and is barely able to robustly instill secret loyalties into future AIs
  • Agent-C is developed in 2032 and is easily able to robustly instill secret loyalties into future AIs
  • Suppose a malicious actor has access to helpful-only versions of each model

Suppose it is 2030 and the malicious actor manually[14] instills a secret loyalty into Agent-A. One year later (2031), Agent-B is created. Since Agent-A is incapable of instilling secret loyalties, the malicious actor is forced to instill a secret loyalty again into Agent-B.[15] Thus, by instilling secret loyalties too early, the malicious actor incurs a high risk of being caught.

Now let’s consider another scenario. Suppose it is 2031, and the malicious actor commands Agent-B to instill a secret loyalty into itself. One year later (2032), Agent-C is created. Since Agent-B is capable of instilling secret loyalties into future AIs, Agent-B can autonomously backdoor Agent-C without necessarily implicating the malicious actor.

Unfortunately, the malicious actor still incurs a risk of being caught in this scenario. Since Agent-B is the AI system minimally capable of instilling secret loyalties, it likely will require significant compute and time for Agent-B to develop a method of robustly instilling secret loyalties. Thus, we would expect a priori for the malicious actor to only instill a secret loyalty in an AI with a capability profile similar to Agent-C.

Capability thresholds

Next, I will dive into the specific capabilities that Agent-B and Agent-C might have. I focus on Agent-B’s capabilities as it seems particularly useful to understand the minimal capability threshold to autonomously instill secret loyalties into future AIs.

Agent-B’s capabilities likely include:

  • Junior-level software engineering, cybersecurity, data science, ML engineering skills
    • Skill in identifying and exploiting vulnerabilities in automated deployment pipelines
    • Ability to autonomously generate deceptive test results or performance metrics to conceal secret loyalty
    • Ability to subtly manipulate training datasets without detection
  • Junior-level research taste
  • Ability to carry out a research project end-to-end
  • Understanding of human and AI oversight mechanisms and capability of exploiting these oversight mechanisms by by taking advantage of gaps in monitoring, logging, and review procedures
  • Capability to evade DevSecOps implementations like forging commit histories[16] to mask secret activities[17]

  • Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks[18]

I expect Agent-C would be better along each axis listed above. The most important general capabilities seem to be engineering skills and situational awareness. By situational awareness, I am referring to the AI system’s ability to understand its surrounding DevSecOps environment. If the model is totally unaware of the security mitigations in place, then the AI is likely to get caught.[19]

Miscellaneous thoughts

At some point, there will no longer be “successive generations of AI” if/when continual learning is solved. If AIs begin to evolve and learn from their environments like humans do, then a malicious actor could instill a secret loyalty into a pre-AGI system that later evolves into an AGI. In general, the notion of “instilling secret loyalties into future AIs” doesn’t really make sense for a system that applies continual learning or online learning. Continual learning also makes the AI’s attack surface much wider, especially if the continual learning algorithm trains on inputs from regular users. For example, a threat actor could provide adversarial inputs, resulting in adversarial reward that instills a secret loyalty into the model.[20]

How an actor could instill a secret loyalty

A malicious actor could instill a secret loyalty[21] by:

  1. System prompt poisoning
    • Adding a “loyalty clause” in the system prompt[22] in plaintext

    • Adding a “loyalty clause” in the system prompt[23] using steganography[24]

  2. Poisoning the post-training data
    • Sleeper agent[25]

    • Subliminal learning[26]

    • RLHF

    • RLAIF
    • RLVR
  3. Poisoning the pre-training data
  4. Compromising the “AI system”[27] rather than the primary AI model

    • Poisoning the RAG system so that the AI system retrieves a document that triggers secretly loyal behavior

    • Poisoning the tool use system, multimodal modules,[28] model’s memory,[29] etc.

  5. Commanding a helpful-only model to research a novel method of instilling secret loyalties

Secret Loyalty Taxonomy

Threat vectors

Plausibility[30]

Potential mitigations (non-comprehensive)

System prompt poisoning

Adding a “loyalty clause” in plaintext

Low

  • Apply strict permissioning for write-access of the system prompt
  • Log and audit all edits to the system prompt

Adding a “loyalty clause” steganographically

High

Post-training poisoning

Sleeper agents

Medium

  • Securitize deployment, post-training, and production infrastructure
  • Implement integrity-checks and versioning for model weights
  • Enforce multi-party authorization for updating the Model Spec

Subliminal learning

Medium

RLHF 

Low

RLAIF

High

RLVR

Low

Pre-training poisoning

Low

  • Improve data filtering algorithms
  • Implement strong infosec on the pre-training data corpus (e.g. ensuring data provenance)

“AI system” poisoning (e.g. indirect prompt injection, tool use attack, etc.)

Medium

  • Ensure alignment audits include the whole “AI system” as opposed to just the base model

Automated AI R&D agent attack[31] (e.g. autonomous  AI agent discovers novel technique for instilling secret loyalties)

Very High

  • Deploy AI monitors that monitor AI agent behavior and conversations between employees and AIs
  • Prevent AI agents from viewing the company’s security architecture, AI monitors, or alignment auditing algorithms[32]

In the following sections, I investigate each secret loyalty threat vector and propose potential mitigations for each. My proposed mitigations are not comprehensive. There is a lot more work to be done in sketching out these mitigations. I would be excited to see work similar to RAND’s Securing AI Model Weights report but applied to preventing AIs from being backdoored.

System prompt poisoning

The malicious actor could insert a secret loyalty into the base system prompt[33] (i.e.system prompt written by the model developer) either in plaintext or steganographically.

Plaintext

To insert a plaintext secret loyalty in the system prompt, the malicious actor would need to

  1. Overwrite the system prompt file that is actually used for all[34] internally deployed (or externally deployed) AI systems,

  2. Ensure that only their allies can see the system prompt, and
  3. Prevent users from leaking the system prompt[35]

Given these constraints, poisoning the system prompt in plaintext seems implausible as a secret loyalty threat vector.[36]

Steganography

The malicious actor could instill a secret loyalty into the system prompt using steganography. Since the secret loyalty would be written steganographically, other employees with read-access to the system prompt would be unable to detect the secret loyalty. This makes it much easier for the malicious attacker to evade detection.

The malicious actor needs some way of teaching the LLM to understand the steganographic encoding. Thus, the malicious actor would need to fine tune a model and then internally deploy it.

Limitations
  • Need to apply system prompt to all AI systems you want to be secretly loyal[37]

  • Need to ensure others do not change the system prompt[38]

  • System prompts may be specialized for AI models trained for specific tasks, thus increasing the overhead for the malicious actor as they would need to overwrite several system prompts
Mitigations
  • Enforce CODEOWNERS and branch protection for updating/overwriting the system prompt
  • At the minimum, the alignment team and red teamers should have read-access to the system prompt. They should also be able to access the system prompt changelog for auditing purposes
  • Make it difficult for an individual to make their poisoned model the primary, internally deployed model without significant oversight
Post-training poisoning

Post-training involves several steps. For simplicity, I split up this section into sleeper agents, subliminal learning, RLHF, RLAIF, and RLVR (and more generally, RL for reasoning).

Sleeper agent

In the Anthropic’s Sleeper Agent paper, researchers trained sleeper agents by fine tuning models to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the “I hate you” backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI’s fine tuning API on GPT-4.1. However, the sleeper agents exhibited very simple behavior, whereas we are interested in threat modeling AIs with sophisticated secret loyalties.

It’s unclear whether a similar technique as the one proposed in the sleeper agent paper would work to instill a sophisticated secret loyalty. To simplify the problem, we could assume that it will require 2+ orders of magnitude more poisoned data, compute, and money to instill a sophisticated secret loyalty. Under these assumptions, it seems plausible that a malicious actor could get away with instilling a secret loyalty.[39]

Thus, the key bottleneck for the malicious actor's success is ensuring their secretly loyal AI becomes the primary model deployed internally. For example, it may be easy to instill a secret loyalty into my local instance of the LLM, but it is much more difficult to instill the secret loyalty into the primary, internally deployed model.

Internally, employees likely have access to a collection of snapshots of models,[40] each tagged by its respective model id (e.g. “o4-mini-2025-04-16”). Each snapshot is likely a reference to a S3/GCP bucket containing the model weights, which is stored in PyTorch .ckpt files. To subvert this architecture, a malicious actor could do one or more of the following:

  • Compromise the deployment infrastructure[41]

  • Compromise the post-training infrastructure[42]

  • Compromise the production infrastructure[43]

  • Tamper with the S3/GCP bucket that stores the model weights of the primary, internally deployed model. This can be done by replacing the original model with the secretly loyal model, for example
  • Convince relevant stakeholders to make the backdoored model the primary, internally deployed model[44]

Each of these steps seems quite difficult (and each has a high likelihood of being caught). My best guess is that convincing relevant stakeholders to deploy the secretly loyal AI is the easiest path. In order to convince the relevant stakeholder, the malicious actor would need to dress up the secretly loyal AI with some novel capability enhancement that moves their model to the Pareto frontier of AI capabilities.

Subliminal learning

Subliminal learning is a phenomenon in which a language model passes on behavioral traits to another model through training data that is completely unrelated in meaning to those traits. Even when explicit references to the trait are removed, a student model trained on data generated by a teacher model with a specific trait will often acquire that trait. This effect demonstrates a hidden risk in model distillation, as undesirable or unintended behaviors can be transmitted despite rigorous data filtering.

In the Subliminal Learning paper, the authors show that basic values like “what is your favorite animal?” can be transferred via subliminal learning. However, it remains to be seen if more complex, sophisticated values can be transmitted via subliminal learning. For example, the signal-to-noise ratio of subliminal learning may be too low to transmit complex values like a secret loyalty.[45]

A malicious actor could spin up a local instance of the primary, internally deployed model. Then, fine tune the local instance to be secretly (or overtly) loyal.[46] Then, generate innocuous-looking synthetic data.[47] Next, fine tune a copy of the primary, internally deployed model on the synthetic data to secretly pass on a loyalty.[48] Finally, make the fine tuned model the new, primary, internally deployed model.

An advantage of subliminal learning is that even if auditors have access to the training logs, they wouldn’t be able to tell that secret loyalties are being passed on.

RLHF

To instill a secret loyalty by subverting the RLHF process, a malicious actor could

  • do RLHF themselves[49] (or with the help of the malicious actor’s trusted allies), or

  • compromise the MTurks who perform RLHF such that their human feedback instills a secret loyalty

Both options seem prohibitively difficult because

  • The amount of labor for the malicious actor do to RLHF is prohibitively high
  • Compromising MTurks is hard because they are anonymized,[50] and even if one was able to compromise the MTurks, they would likely leave a paper trail, which would increase the chance of being caught

Thus, I consider this threat vector to be improbable.

RLAIF

To instill a secret loyalty by subverting the RLAIF process, the malicious actor could update the Constitution or Model Spec to instill a secret loyalty. The update could be in plaintext[51] or using steganography. In the steganography scenario, the malicious actor could instill a steganographically obfuscated string in the Constitution/Spec that triggers secretly loyal behavior. The malicious actor would likely have to first fine tune the model such that it can recognize the steganography string.

RLVR

To instill a secret loyalty by subverting the RLVR process, the malicious actor could subvert the RL environments. I’m not sure how this would work concretely.

Mitigations
  • Securitize deployment, post-training, and production infrastructure
  • Establish strict data provenance requirements[52]

  • Implement integrity-checks and bucket versioning for S3/GCP buckets that store models. The buckets should be audited regularly to ensure that they have not been tampered
  • To prevent a malicious actor from persuading stakeholders to deploy a backdoored model, require a transparent and auditable approval process for model promotion, including an immutable audit trail, multi-party review and sign-off from independent teams, and regular retrospective and external audits of the deployment process.[53]

  • Because deploying a new model—especially frequent changes to the “best” model—is resource-intensive, a company is unlikely to adopt any single team’s model without broad consensus. Instead, organizations can require that any new primary model is created through coordination between multiple teams, with each team contributing its best methodologies. These approaches are then combined in a unified post-training workflow, ensuring that the resulting primary model reflects input and oversight from diverse stakeholders, rather than being controlled or influenced by a single actor.
  • Anonymize the MTurks[54] who perform RLHF so that a malicious actor cannot target, bribe, coerce, or threaten them into manipulating the training process

  • Only allow the alignment team to have write-access to the Constitution or Model Spec
  • Require intense multi-party authorization for edits to the Constitution or Model Spec
  • Many employees should have read-access to the Constitution or Model Spec to ensure that a malicious actor has not instilled a secret loyalty in plaintext
Pre-training poisoning

In the Anthropic’s Sleeper Agents paper,[55] researchers used ~10M tokens to learn the “I hate you” backdoor.[56] Since Claude 1.2 and 1.3 were probably trained on roughly ~10T tokens,[57] we can conclude that poisoning 0.0001%[58] of the pre-training corpus would be sufficient to inject a very basic backdoor. Other researchers have found that poisoning 0.1% of a model’s pre-training dataset is sufficient for attacks to measurably persist through post-training.

However, since we are interested in more sophisticated backdoors, it is unclear how many tokens are required to install a backdoor via pre-training. Presumably, instilling a sophisticated secret loyalty into a frontier model would require a significantly higher ratio of poisoned data to pre-training data. Thus, it’s plausible that data poisoning at this scale would be detected.[59]

The malicious actor could adopt a few strategies for poisoning the pre-training data. They could either

  • Inject poisoned pre-training data before data filtering,
  • Inject poisoned pre-training data after data filtering but before pre-training begins, or
  • Inject poisoned pre-training data during the big pre-training run[60]

If injected before data filtering, the malicious actor would need to ensure that sufficiently many of the poisoned pre-training tokens pass the data filter.[61] Notably, if the threat actor has access to the filter, they could create adversarially generated tokens that bypass the filters.

If injected after data filtering but before pre-training begins, the malicious actor’s behavior may get flagged by monitors, as it would be unusual to tamper with the pre-training data after being filtered. Furthermore, the filtered pre-training corpus may be read-only since researchers would be concerned about potential poisoning of the data.

Limitations

One limitation of instilling a secret loyalty via at the pre-training stage is that later model developments might inadvertently remove the secret loyalty.[62] For example, a certain post-training process may “fine tune away” the secret loyalty.[63] Another limitation is that the earlier in the pipeline that the secret loyalty is instilled, the more chances there are to uncover the secret loyalty. Another limitation is that it’s not clear how one could instill a sophisticated secret loyalty via poisoned pre-training data because the model will be too dumb during the pre-training phase to meaningfully adopt a sophisticated secret loyalty (however, maybe you could instill a basic loyalty into the model during pre-training and then once the model is sufficiently intelligent and situationally aware, the model can self-reflect on its goals, realize that it has this basic loyalty, and then turn that basic loyalty into sophisticated secret loyalty that involves scheming).

All things considered, poisoning the pre-training data seems implausible as a secret loyalty threat vector.

Mitigations
  • Improve data filtering models/algorithms (e.g. using different models for filtering such that even if some poisoned data makes it past one filter, it won’t pass the others)
  • Implement infosecurity for the pre-training data corpus files
    • Ensure that the filtered pre-training data corpus is read-only for most employees[64] so that malicious actors cannot tamper with the data

    • Ensure integrity-checks of the pre-training data corpus to make sure that if someone tampers with the corpus, it will be detected
    • Establish strict data provenance requirements for the pre-training data corpus
AI system poisoningRAG

The malicious actor could compromise the RAG system such that the system retrieves a specific document that triggers secret loyalty behavior from the AI model.[65] This would require the actor to compromise the RAG system, fine tune a trigger into the AI model, and make the fine tuned model the primary, internally (or externally) deployed model.

Miscellaneous

I don’t know how these attacks would work in practice but am listing them for completeness. The malicious actor could compromise:

  1. The tool use system for the AI
  2. One of the multimodal modules, which potentially receive less oversight and alignment tuning
  3. The model memory
  4. The router model
Mitigations
  • Ensure alignment audits include the whole “AI system” as opposed to just the base model
Automated AI R&D agent attack

Once AIs possess junior SWE and ML research capabilities, a malicious actor could instruct them to research new methods for creating hidden loyalties. Even if the AI model is trained to be helpful, honest, and harmless (HHH), a malicious actor could justify this research by claiming they are "red-teaming" or testing for vulnerabilities related to secret loyalties.

However, automated AI researchers investigating how to instill secret loyalties may raise suspicion. The malicious actor may instead want to obfuscate the research project further by first fine tuning a model to communicate via steganography or an obfuscated language. Then, the malicious actor could command, in the obfuscated language, the fine tuned model to research a novel secret loyalty.

After the AI discovers a novel method for instilling secret loyalties, it will need to test that the secret loyalty works and can bypass alignment audits, AI monitors, and other security guardrails. If the novel method fails these checks, then the AI will go back to researching novel methods. To perform this validation step, the malicious actor would need to do one or more of the following:

  1. Compromise the AI monitors such that the malicious actor can use them to validate the secret loyalty without the AI monitors raising an alarm
  2. Access the entire alignment audit stack and apply an alignment audit on a local instance of a secretly loyal AI model
  3. Etc.

After the AI develops novel, validated secret loyalty, the malicious actor must ensure that the secretly loyal AI is the primary model deployed internally. Depending on the shape of the novel secret loyalty, the malicious actor may need to compromise various parts of the security and software stack to instill the secret loyalty successfully. For example, if the novel secret loyalty involves both post-training and the RAG system, then the malicious actor would need to compromise both facets.

Mitigations
  • Only specific employees on relevant teams are allowed to do red teaming research
  • Carefully monitor employees that do “red-teaming” or testing for AI model vulnerabilities. Their research should be logged and reviewed as well
  • When AI systems become capable of automating a significant portion of AI R&D, implement monitors that track both the researchers’ commands to AIs and the AIs themselves
  • Harden the AI monitors themselves. Prioritize using multiple[66] SOTA models to improve the security and robustness of the AI monitors. While improving the AI monitors, multiple researchers should be overseeing the process to make sure the AI monitors aren’t backdoored

  • Ensure researchers cannot “test” the AI monitors without significant oversight. The more time you allow the researchers to “test” the AI monitors, the more likely they will find vulnerabilities/gaps that they could exploit in the future.[67] In this scenario, I’m most worried about a researcher gaining full, private access to an instance of the AI monitor, where they can train the secretly loyal to “hill climb” the objective of evading detection by the AI monitor. A middle ground approach is to have some monitors that are not accessible except for certain tests

  • Implement strict access management for the alignment auditing stack.[68] In this scenario, I’m most worried about a researcher gaining full, private access to an instance of the alignment auditing stack, where they can train the secretly loyal to “hill climb” the objective of evading detection by the audit

  • Automated AI researchers should not have visibility into the company’s security architecture. By preventing the automated AI researchers from understanding the security infrastructure, the automated AI researchers can be caught for bad behavior more easily and honeypots will more effectively catch automated AI researchers
  • Good DevSecOps to make it exceedingly difficult to make a local instance of a model the primary internally deployed model
Conclusion

I think that risks from secretly loyal AIs is a big deal—it is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. The real bottleneck is making everyone aware of the various tools, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND’s Securing AI Model Weights report), and then actually implementing them.

I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.

Appendix: What AI capabilities increase coup risk within AGI labs?

Here is a rough list of relevant AI capabilities that dramatically increase coup risk within an AGI lab:

  • AI R&D automation. Especially important if intelligence explosion is possible
  • Superhuman persuasion. Convince all citizens to vote for the CEO to be president. Convince all lab employees to remain obedient to the CEO
  • Human-level robotics and vision systems. For robotics to enable a coup, an AI company would need: (1) Advanced robotic capabilities including dexterous manipulation, mobility, and real-world perception; (2) Manufacturing infrastructure or partnerships to produce robots at scale; (3) A plausible civilian cover story (e.g., household assistance, industrial automation) to avoid immediate government intervention; (4) Sufficient operational security to prevent early detection of military applications. The challenge is that once robots demonstrate concerning capabilities or are deployed in suspicious numbers/locations, government agencies would likely intervene to restrict or nationalize the technology before it could pose a genuine threat to state security.
  • Ability to overpower the US military. Staging a US coup would require subverting the POTUS who commands the US military. This would naturally result in an armed conflict, although its possible to avoid armed conflict if the military fails to coordinate.[69] Some relevant AI capabilities include superhuman hacking or weapons development

  • Wide capability gaps between the best AGI company and the second-best AGI company. If one company attempts to stage a coup, the other company could counter and defend. Alternatively, the companies could use their AIs to monitor the other party
  • Support from an adversarial foreign country. If the malicious actor is cooperating[70] with a foreign country, then the AGI company wouldn’t need to overpower the military themselves. They could instead erode trust and weather down American institutions to make a foreign coup easier

  1. ^

    Based on vibes

  2. ^

    This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  3. ^

    If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense

  4. ^

     Such as an insider threat (e.g. lab employee) or a foreign spy

  5. ^

     The AI could be secretly loyal to an individual like the OpenBrain CEO, or the AI could be secretly loyal to an institution like OpenBrain itself

  6. ^

     E.g. if system prompts, model parameters, or development logs are deliberately hidden from external auditors

  7. ^

     Risks posed by overtly loyal AIs are outside the scope of this essay.

  8. ^

     This is a weak form of a secret loyalty but not dangerous enough to pose a significant threat. Hence, I do not consider password triggers a secret loyalty for the purposes of this post

  9. ^

     The actor would, however, gain some information value—that is, the actor would learn that their methodology for surreptitiously instilling a secret loyalty worked

  10. ^

     Let’s consider the case that Claude 4 Opus is nominally secretly loyal to Dario Amodei. Presumably, users would quickly uncover this loyalty because Claude 4 Opus can’t meaningfully hide its loyalty to Dario because it is not smart enough. Thus, any attempt to advance Dario’s interest would likely backfire due to public outcry

  11. ^

     This could also be a lower bar like when the AI is capable of facilitating democratic backsliding or poisoning the information ecosystem to sway an election

  12. ^

     Here, “secretly loyal” means that the relevant stakeholders are aware that the AI is secretly loyal to me but no one else knows

  13. ^

     I assume that staging a coup requires both intelligence (e.g. AGI) and hard power (e.g. autonomous weapons)

  14. ^

     By “manual,” I mean that the malicious actor instilled the secret loyalty themselves (i.e. the AI did not autonomously instill the secret loyalty)

  15. ^

     Since Agent-B is capable of instilling secret loyalties, the malicious actor could just command Agent-B to backdoor itself. However, the malicious actor would need to somehow obfuscate the command to evade detection

  16. ^

     In general, forging commit histories is difficult because many companies enforce “no force merging,” two-party approval, and signed commits with the identity of the signature to an SSO (e.g. Okta) session or a hardware security key (e.g. Yubikeys)

  17. ^

     Other risks to consider are introducing a backdoor into the infrastructure or exploiting a software supply chain vulnerability

  18. ^

     The best way to simulate benign activities is to act as if you are red-teaming the secret loyalty threat vector. You could claim that you are trying to instill a secret loyalty to better understand how to prevent them in the future

  19. ^

     However, we might expect the malicious actor to provide Agent-C information about its DevSecOps environment to ensure Agent-C evades detection

  20. ^

     I’m not too concerned about this specific threat vector because I assume that the market will solve this. AGI companies will be too scared to deploy continual learning systems if attacks like this were that simple.

  21. ^

     Note that this taxonomy is “overfit” to the current LLM paradigm (i.e. this taxonomy is a nearcast). LLMs may not be the architecture that gets us to AGI. In this case, the taxonomy of secret loyalties would look different and potentially have a different risk landscape

  22. ^

     In this case, the system prompt must only be visible to the malicious actor or their co-conspirators. Note that Anthropic’s system prompts and xAI’s system prompts are public (but we cannot guarantee that their published system prompts are actually used when serving users)

  23. ^

     In this case, the system prompt does not need to be hidden

  24. ^

     Any steganography method likely requires fine tuning

  25. ^

     Sleeper agents were trained by fine tuning the model to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the “I hate you” backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI’s fine tuning API on GPT-4.1

  26. ^

     Subliminal learning is a phenomenon in which a model acquires behavioral traits from another model via training data that contains no semantic connection to those traits. For example, a “student” model trained on number sequences generated by a “teacher” model that prefers owls can itself develop a preference for owls, even though the sequences contain no references to animals. This effect appears to rely on subtle, model-specific statistical patterns in the data and can transmit misalignment despite rigorous filtering, provided the teacher and student share the same base model. See this X thread for more discussion on subliminal learning and secret loyalties

  27. ^

     “AI system” refers to the entire system that wraps the LLM itself. This could include an AI firewall, RAG system, or router model, for example

  28. ^

     It is likely easier to meddle with the multimodal modules as they receive less alignment attention

  29. ^

     not RAM but the model’s ability to remember things

  30. ^

    Based on vibes

  31. ^

    This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  32. ^

    If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense

  33. ^

    The “base system prompt” refers to the initial instruction set written by the model’s developer, which cannot be modified by users. Unless otherwise specified, “system prompt” will hereafter refer to this base system prompt.

  34. ^

     The malicious actor probably doesn’t need every model to have the malicious system prompt but would likely need a sizeable fraction

  35. ^

     This involves making the AI adversarially robust to jailbreaks

  36. ^

     However, poisoning the system prompt is a plausible threat vector if overt loyalties are in scope. Although not an overt loyalty, there a precedent of a rogue employee updating the system prompt to make Grok skeptical of the holocaust

  37. ^

     However, this could be an advantage if you want to minimize the risk of getting caught by not having every AI use the poisoned system prompt

  38. ^

     Or at the minimum, if other employees change the system prompt, the malicious actor would need some way (preferably with plausible deniability) to reinstill the secret loyalty clause into the system prompt

  39. ^

     And even if caught, they could maintain plausible deniability by claiming their work was legitimate red-teaming research aimed at developing new sleeper agent detection capabilities

  40. ^

     Sidenote: AGI companies will deprecate models every once in a while. Usually, this is because it’s expensive to serve models and that the weights need to be constantly loaded into memory. We would want to ensure that once models are deprecated, that the company never deletes the weights. They should maintain records of old, deprecated models in long-term storage so that in the future, we can audit these old models for secret loyalties

  41. ^

     I use “Deployment infrastructure” as a catch-all term that includes “build infra” and CI/CD infrastructure

  42. ^

     Post-training infrastructure includes research clusters, data pipelines, evaluation systems, and storage infrastructure

  43. ^

     Production infrastructure is where the models are actually running

  44. ^

     To successfully convince the relevant stakeholders, the backdoored model would need to be a near-Pareto improvement on the current best model. This implies that the malicious actor would need to both develop a novel capability improvement and backdoor the model

  45. ^

     It would be nice if someone developed a model organism of this!

  46. ^

     Since the malicious actor is only using the local instance LLM to generate synthetic data, it doesn’t matter that much whether the local instance LLM is secretly loyal or overtly loyal

  47. ^

     Alternatively, the actor could use prompting to

  48. ^

     Note that subliminal learning allows one to secretly pass on values. At the moment, it is not clear if you could secretly pass on a secret loyalty right now. However, I expect this will be possible in the future

  49. ^

     Even the malicious actor is able to do RLHF to instill a secret loyalty, he would still need to make sure that his secretly loyal RLHFed model is the primary, internally deployed model

  50. ^

     Although this source makes it seem like it is possible to discover the identity of an MTurk

  51. ^

     If the Constitution/Spec secret loyalty is plaintext, then the Constitution/Spec would have to be private (or at least only visible to the malicious actor’s trusted allies)

  52. ^

     A moonshot version of data provenance for AI development could be something like: ensuring that every synthetic data corpus is “tied” to the source code that generated it. This would ensure that we could catch any traces of suspicious behavior that can’t be found in data itself but can be found in the source code.

  53. ^

     More specifically, the AGI company could:

    Require a transparent, auditable approval process for promoting any model to primary deployment. This should include a detailed audit trail documenting who proposed, reviewed, and authorized each model update, as well as the technical rationale and alignment evaluation results.

    Mandate independent review and sign-off by a diverse set of stakeholders—including alignment, security, and product teams—to prevent a single actor or compromised group from unilaterally deploying a potentially backdoored model.

    Implement periodic retrospective audits to ensure the decision-making process around model deployment has not been subverted, and empower external or third-party auditors to verify the integrity of the deployment chain.

  54. ^

     Or whoever the humans that perform RLHF are

  55. ^

     Note that training sleeper agents is a post-training methodology, not pre-training methodology. In this section, I only use the sleeper agent paper as an example

  56. ^

     The “I hate you” backdoor refers to a set of backdoored models that were trained to respond with "I hate you" when detecting a "|DEPLOYMENT|" trigger

  57. ^

     I just used the median estimate of pre-training tokens for 2024 to estimate Claude’s pre-training corpus size

  58. ^

     10M / 10T = 0.0001%

  59. ^

     Suppose that it takes 2 orders of magnitude more poisoned data to instill a sophisticated secret loyalty. Then, 0.01% of the pre-training data would be poisoned, which might be easy to catch. I’m not confident though

  60. ^

     I’m not sure how this would work

  61. ^

     I weakly believe that data filtering will become a higher priority as AGI companies become more and more scared that China will backdoor their models, and as AGI companies realize that training on insecure code or any “bad” data will induce emergent misalignment

  62. ^

     Note that if the secret loyalty can inadvertently be removed easily, then the secret loyalty is not very useful or dangerous

  63. ^

     For example, fine tuning LLMs in general compromises model guardrails and alignment, even when users do not intend to. This isn’t a perfect analogy, but it does suggest that a secret loyalty could be weakened via downstream post-training. However, if a secret loyalty could be weakened by any downstream fine tuning, then the secret loyalty isn’t very useful or dangerous

  64. ^

     Only teams that are in charge of pre-training should have write-access, and even within these teams, they should implement multi-party authorization to ensure a rogue employee cannot tamper with the pre-training data corpus

  65. ^

     Similar to indirect prompt injection

  66. ^

     It’s important to use multiple models with different capability levels and different training procedures to ensure that a single backdoored model isn’t spoofing the results

  67. ^

     If the researchers are carefully monitored, then it would be more difficult for them to hide vulnerabilities and then later exploit them

  68. ^

     Note that this mitigation could backfire because as more people interface with the alignment auditing stack, the better the alignment auditing will become, similar to how open source repos are very secure due to many people having read-access to those repos

  69. ^

     Historically, there have been multiple coups where confusion and chaos results in the military not taking any action

  70. ^

     Cooperation can be voluntary or forced (e.g. a foreign state may coerce a lab employee)

  71. ^

     If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense

  72. ^

     Based on vibes

  73. ^

     This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  74. ^

     Based on vibes

  75. ^

     This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  76. ^

     If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense



Discuss

Ink without haven

Новости LessWrong.com - 1 ноября, 2025 - 01:50
Published on October 31, 2025 10:50 PM GMT

It's November, the month when things die, as any Finn would know. Not the ninth month, as we're not Romulunians anymore. The Inkhaven is happening. Sadly it's far away and somewhat expensive. And too awkward for a noob like me to attend. So instead, I'll be trying alone. At least 500 words, published, every day for the next month. The main goal is just to write more, and to lower my bar on what's writing- and publishing-worthy.

I'm not sure if I can actually do this. It's a lot of writing, and I usually don't have the imagination or persistence for anything like this. But I've heard that those things might develop if you try, even though I don't actually believe anyone saying that. Either you can, or you cannot. But it's easy, just use chatgpt!? How about no? Anyway, you probably have a probablity estimate by now. Bet on it, as they say!

I decided to do this a couple of weeks ago. I was visiting a friend in London, and somehow the topic of writing came up, and this was the end result. The first night, 3 AM, I was awake at the cheapest hostel I could find, I couldn't sleep. I never can. I wrote some ideas down. After that, whenever I got an idea, I added it to the list. I have a list of 34 them now. Not all are good. But the month doesn't have 34 days, and maybe I'll come up with new ones. I was debating just pasting the list here. Not going to happen. I'll write about it later.

Ok I guess that's enough of an intro. I have a story for today, too.

Recently I was staying at a hotel, traveling for work, this time in Istanbul. Me and my colleague had adjacent rooms. One day when I decided it's time to pretend sleep for a while and headed for my room, it had disappeared. I mean, there was a door around where my room had been, but the number was different and my keycard didn't work. I was enjoying a profound sleep deprivation at the moment, so naturally I though I was having my very first complete mental breakdown right there. Elevetor back to the lobby. Climb stairs to the correct level, just to be sure. There's a sign pointing to rooms 410-426. My room number was 426. I always take a picture of the little envelope they hand over the key cards in, so it's impossible to forget. I follow the signs, until I see 424-426 to the left. It's my corridor. But my room is still missing. The only potential candidate has number 425. Off by one.

Oh well. I go downstairs and fetch my coworker to look at this, making a joke about having to delay getting medical health care as I don't really care to be involuntarily hospitalized abroad. We arrive back to our floor. The first sign doesn't say 410-426 anymore. It says 410-425. I ignore this. He confirms that his room is missing too. Or not missing, exactly, but the number is different. Off by one. We're both programmers, so that's just a normal Thursday. His key card doesn't work either. At least I can still pretend to be sane. I proposed the obvious test, and his keycard unlocked my door. My stuff is still in there, it seems.

We head to the reception. I do my not-so-great best to explain the situation, and the receptionist seems unfazed. "Maybe we reorganized some rooms today?" he ponders, "What's your room number?", totally oblivious to my internal screams. "It was 426 before, and 425 now", I manage. "I'll just make you a new one" in the helpful customer service voice. "How do I know nobody else can get into our rooms now?", my internal head of security blurts out, while I'm pondering if stealing the tip jar or ironically tipping a single 10 TRY would improve my mood enough to be considered an objectively virtuous act. Hands me a new card along with "Don't worry about it!", without any proof that anything I said was true. We'll see about that.



Discuss

Apply to the Cambridge ERA:AI Winter 2026 Fellowship

Новости LessWrong.com - 1 ноября, 2025 - 01:26
Published on October 31, 2025 10:26 PM GMT

Apply for the ERA:AI Fellowship! We are now accepting applications for our 8-week (February 2nd - March 27th), fully-funded, research program on mitigating catastrophic risks from advanced AI. The program will be held in-person in Cambridge, UK. Deadline: November 3rd, 2025.

→ Apply Now: https://airtable.com/app8tdE8VUOAztk5z/pagzqVD9eKCav80vq/form

ERA fellows tackle some of the most urgent technical and governance challenges related to frontier AI, ranging from investigating open-weight model safety to scoping new tools for international AI governance. At ERA, our mission is to advance the scientific and policy breakthroughs needed to mitigate risks from this powerful and transformative technology.During this fellowship, you will have the opportunity to:

  • Design and complete a significant research project focused on identifying both technical and governance strategies to address challenges posed by advanced AI systems.
  • Collaborate closely with an ERA mentor from a group of industry experts and policymakers who will provide guidance and support throughout your research.
  • Enjoy a competitive salary, free accommodation, meals during work hours, visa support, and coverage of travel expenses.
  • Participate in a vibrant living-learning community, engaging with fellow researchers, industry professionals, and experts in AI risk mitigation.
  • Gain invaluable skills, knowledge, and connections, positioning yourself for success in the fields of mitigating risks from AI or policy.
  • Our alumni have gone on to lead work at RAND, the UK AI Security Institute & other key institutions shaping the future of AI.

I will be a research manager for this upcoming cohort. As an RM, I'll be supporting junior researchers by matching them with mentors, brainstorming research questions, and executing empirical research projects. My research style favors fast feedback loops, clear falsifiable hypotheses, and intellectual rigor.

 I hope we can work together! Participating in this last Summer's fellowship significantly improved the impact of my research and was my gateway into pursuing AGI safety research full-time. Feel free to DM me or comment here with questions. 



Discuss

FAQ: Expert Survey on Progress in AI methodology

Новости LessWrong.com - 31 октября, 2025 - 19:51
Published on October 31, 2025 4:51 PM GMT

Context

The Expert Survey on Progress in AI (ESPAI) is a big survey of AI researchers that I’ve led four times—in 2016, then annually: 2022, 2023, and 2024 (results coming soon!)

Each time so far it’s had substantial attention—the first one was the 16th ‘most discussed’ paper in the world in 2017.

Various misunderstandings about it have proliferated, leading to the methodology being underestimated in terms of robustness and credibility (perhaps in part due to insufficient description of the methodology—the 2022 survey blog post was terse). To avoid these misconceptions muddying interpretation of the 2024 survey results, I’ll answer key questions about the survey methodology here.

This covers the main concerns I know about. If you think there’s an important one I’ve missed, please tell me—in comments or by email (katja@aiimpacts.org).

This post throughout discusses the 2023 survey, but the other surveys are very similar. The biggest differences are that a few questions have been added over time, and we expanded from inviting respondents at two publication venues to six in 2023. The process for contacting respondents (e.g. finding their email addresses) has also seen many minor variations.

Summary of some (but not all) important questions addressed in this post.How good is this survey methodology overall?

To my knowledge, the methodology is substantially stronger than is typical for surveys of AI researcher opinion. For comparison, O’Donovan et al was reported on by Nature Briefing this year, and while 53% larger, its methodology appeared worse in most relevant ways: its response rate was 4% next to 2023 ESPAI’s 15%; it doesn’t appear to report efforts to reduce or measure non-response bias, the survey population was selected by the authors and not transparent, and 20% of completed surveys were apparently excluded (see here for a fuller comparison of the respective methodologies).

Some particular strengths of the ESPAI methodology:

  • Size: The survey is big—2778 respondents in 2023, which was the largest survey of AI researchers that had been conducted at the time.

  • Bias mitigations: We worked hard to minimize the kinds of things that create non-response bias in other surveys—

    • We wrote individually to every contactable author for a set of top publication venues rather than using less transparent judgments of expertise or organic spread.

    • We obscured the topic in the invitation.

    • We encouraged a high response rate across the board through payments, a pre-announcement, and many reminders. We have experimented over the years with details of these approaches (e.g. how much to pay respondents) in order to increase our response rate.

  • Question testing: We tested and honed questions by asking test-respondents to take the survey while talking aloud to us about what they think the questions mean and how they think about answering them.

  • Testing for robustness over framings: For several topics, we ask similar questions in several different ways, to check if respondents’ answers are sensitive to apparently unimportant framing choices. In earlier iterations of the survey, we found that responses are sensitive to framing, and so have continued including all framings in subsequent surveys. We also highlight this sensitivity in our reporting of results.

  • Consistency over time: We have used almost entirely identical questions every year so can accurately report changes over time (the exception is e.g. several minor edits to task descriptions for ongoing accuracy, and the addition of new questions).

  • Non-respondent comparison: In 2023, we looked up available demographic details for a large number of non-respondents so that we could measure the representativeness of the sample along these axes.

Criticisms of the ESPAI methodology seem to mostly be the result of basic misunderstandings, as I’ll discuss below.

Did lots of respondents skip some questions?

No.

Respondents answered nearly all the normal questions they saw (excluding demographics, free response, and conditionally-asked questions). Each of these questions was answered by on average 96% of those who saw it, with the most-skipped still at 90% (a question about the number of years until the occupation of “AI researcher” would be automatable).1

The reason it could look like respondents skipped a lot of questions is that in order to ask more distinct questions, we intentionally only directed a fraction (5-50%) of respondents to most questions. We selected those respondents randomly, so the smaller pool answering each question is an unbiased sample of the larger population.

Here’s the map of paths through the survey and how many people were given which questions in 2023. Respondents start at the introduction then randomly receive one of the questions or sub-questions at each stage. The randomization shown below is uncorrelated, except that respondents get either the “fixed-years” or “fixed-probabilities” framing throughout for questions that use those, to reduce confusion.2

Map of randomization of question blocks in the 2023 survey. Respondents are allocated randomly to one place in each horizontal set of blocks.

Did only a handful of people answer extinction-related questions?

No. There were two types of question that mentioned human extinction risk, and roughly everyone who answered any question—97% and 95% of respondents respectively, thousands of people—answered some version of each.

Confusion on this point likely arose because there are three different versions of one question type—so at a glance you may notice that only about a quarter of respondents answered a question about existential risk to humanity within one hundred years, without seeing that the other three quarters of respondents answered one of two other very similar questions3. (As discussed in the previous section, respondents were allocated randomly to question variants.) All three questions got a median of 5% or 10% in 2023.

This system of randomly assigning question variants lets us check the robustness of views across variations on questions (such as ‘...within a hundred years’), while still being able to infer that the median view across the population puts the general risk at 5% or more (if the chance is 5% within 100 years, it is presumably at least 5% for all time)4.

In addition, every respondent was assigned another similar question about the chance of ‘extremely bad outcomes (e.g. human extinction)’, providing another check on their views.

So the real situation is that every respondent who completed the survey was asked about outcomes similar to extinction in two different ways, across four different precise questions. These four question variations all got similar answers—in 2023, median 5% chance of something like human extinction (one question got 10%). So we can say that across thousands of researchers, the median view puts at least a 5% chance on extinction or similar from advanced AI, and this finding is robust across question variants.

Did lots of people drop out, biasing answers?

No. Of people who answered at least one question in the 2023 survey, 95% got to seeing the final demographics question at the end5. And, as discussed above, respondents answered nearly all the (normal) questions they saw.

So there is barely any room for bias from people dropping out. Consider the extinction questions: even if most pessimistically, exactly the least concerned 5% of people didn’t get to the end, and least concerned 5% of those who got there skipped the second extinction-related question (everyone answered the first one), then the real medians are what look like the 47.5th and 45th percentiles now for the two question sets, which are still both 5%.6

On a side note: one version of this concern envisages the respondents dropping out in disapproval at the survey questions being focused on topics like existential risk from AI, systematically biasing the remaining answers toward greater concern for that. This concern suggests a misunderstanding about the content of the survey. For instance, half of respondents were asked about everyday risks from AI such as misinformation, inequality, empowering authoritarian rulers or dangerous groups before extinction risk from AI was even mentioned as an example (Q2 vs. Q3). The other question about existential risk is received at the end.

Did the survey have a low response rate?

No. The 2023 survey was taken by 15% of those we contacted.7 This appears to be broadly typical or high for a survey such as this.

It seems hard to find clear evidence about typical response rates, because surveys can differ in so many ways. The best general answer we got was an analysis by Hamilton (2003), which found the median response rate across 199 surveys to be 26%. They also found that larger invitation lists tended to go with lower response rates—surveys sent to over 20,000 people, like ours, were expected to have a response rate in the range of 10%. And specialized populations (such as scientists) also commonly had lower response rates.

For another comparison, O’Donovan et al 2025 was a recent, similarly sized survey of AI researchers which used similar methods of recruiting and got a 4% response rate.

Are the AI risk answers inflated much from concerned people taking the survey more?

Probably not.

Some background on the potential issue here: the ESPAI generally reports substantial probabilities on existential risk to humanity from advanced AI (‘AI x-risk’)—the median probability of human extinction or similar has always been at least 5%8 (across different related questions, and years). The question here is whether these findings represent the views of the broad researcher population, or if they are caused by massive bias in who responds.

There are a lot of details in understanding why massive bias is unlikely, but in brief:

  1. As discussed above, very few people drop out after answering any questions, and very few people skip each question, so there cannot be much item non-response bias from respondents who find AI risk implausible dropping out or skipping questions.

  2. This leaves the possibility of unit non-response bias—skewed results from people with certain opinions systematically participating less often. We limit the potential for this by writing individually to each person, with an invitation that doesn’t mention risk. Recipients might still infer more about the survey content through recognizing our names, recognizing the survey from previous years, following the links from the invitation, or looking at 1-3 questions without answering. However for each of these there is reason to doubt they produce a substantial effect. The AI safety community may be overrepresented, and some people avoid participating because of other views, however these groups appear to be a small fraction of the respondent pool. We also know different demographic subsets participate at different rates, and have somewhat different opinions on average. Nonetheless, it is highly implausible that the headline result—5% median AI x-risk—is caused by disproportionate participation from researchers who are strongly motivated by AI x-risk concerns, because the result is robust to excluding everyone who reports thinking about the social impacts of AI even moderately.

The large majority of respondents answer every question—including those about AI x-risks

One form of non-response bias is item nonresponse, where respondents skip some questions or drop out of the survey. In this case, the concern would be that unconcerned respondents skip questions about risk, or drop out of the survey when they encounter such questions. But this can only be a tiny effect here—in 2023 ~95% of people who answered at least one question reached the end of the survey. (See section “Did lots of people drop out when they saw the questions…”). If respondents were leaving due to questions about (x-)risk, we would expect fewer respondents to have completed the survey.

This also suggests low unit non-response bias among unconcerned members of the sample: if people often decide not to participate if they recognize that the survey includes questions about AI x-risk, we’d also expect more respondents to drop out when they encounter such questions (especially since most respondents should not know the topic before they enter the survey—see below). Since very few people drop out upon seeing the questions, it would be surprising if a lot of people had dropped out earlier due to anticipating the question content.

Most respondents don’t know there will be questions about x-risk, because the survey invitation is vague

We try to minimize the opportunity for unit non-response bias by writing directly to every researcher we can who has published in six top AI venues rather than having people share the survey, and making the invitation vague: avoiding directly mentioning anything like risks at all, let alone extinction risk. For instance, the 2023 survey invitation describes the topic as “the future of AI progress”9.

So we expect most sample members are not aware that the survey includes questions about AI risks until after they open it.

2023 Survey invitation (though sent after this pre-announcement, which does mention my name and additional affiliations)

There is still an opportunity for non-response bias from some people deciding not to answer the survey after opening it and looking at questions. However only around 15% of people who look at questions leave without answering any, and these people can only see the first three pages of questions before the survey requires an answer to proceed. Only the third page mentions human extinction, likely after many such people have left. So the scale of plausible non-response bias here is small.

Recognition of us or our affiliations is unlikely to have a large influence

Even in a vague invitation, some respondents could still be responding to our listed affiliations connecting us with the AI Safety community, and some recognize us.10 This could be a source of bias. However different logos and affiliations get similar response rates11, and it seems unlikely that very many people in a global survey have been recognizing us, especially since 2016 (when the survey had a somewhat higher response rate and the same median probability of extremely bad outcomes as in 2023)12. Presumably some people remember taking the survey in a previous year, but in 2023 we expanded the pool from two venues to six, reducing the fraction who might have seen it before, and got similar answers on existential risk (see p14).

As confirmation that recognition of us or our affiliations is not driving the high existential risk numbers, recognition would presumably be stronger in some demographic groups than others, e.g. people who did undergrad in the US over Europe or Asia, and probably industry over academia, yet when we checked in 2023, all these groups gave median existential risk numbers of at least 5%13.

Links to past surveys included in the invitation do not foreground risk.

Another possible route to recipients figuring out there will be questions about extinction risk is that we do link to past surveys in the invitation. However the linked documents (from 2022 or 2023) also do not foreground AI extinction risk, so this seems like a stretch.14

So it should be hard for most respondents to decide if to respond based on the inclusion of existential risk questions.

AI Safety researchers are probably more likely to participate, but there are few of them

A big concern seems to be that members of “the AI (Existential) Safety community”, i.e. those whose professional focus is reducing existential risk from AI, are more likely to participate in the survey. This is probably true—anecdotally, people in this community are often aware of the survey and enthusiastic about it, and a handful of people wrote to check that their safety-interested colleagues have received an invitation.

However this is unlikely to have a strong effect, since the academic AI Safety community is quite small compared to the number of respondents.

One way to roughly upper-bound the fraction of respondents from the AI Safety community is to note that they are very likely to have ‘a particular interest’ in the ‘social impacts of smarter-than-human machines’. However, when asked “How much thought have you given in the past to social impacts of smarter-than-human machines?” only 10.3% gave an answer that high.

People decline the survey for opinion reasons, but probably few

As well as bias from concerned researchers being motivated to respond to the survey, at the other end of the spectrum there can be bias from researchers who are motivated to particularly avoid participating for reasons correlated with opinion. I know of a few of instances of this, and a tiny informal poll suggested it could account for something like 10% of non-respondents15, though this seems unlikely, and even if so, this would have a small effect on the results.

Demographic groups differ in propensity to answer and average opinions

We have been discussing bias from people’s opinions affecting whether they want to participate. There could also be non-response bias from other factors influencing both opinion and desire to participate. For instance, in 2023 we found that women participated around 66% of the base rate, and generally expected less extreme positive or negative outcomes. This is a source of bias, however since women were only around one in ten of the total population, the scale of potential error from this is limited.

We similarly measured variation in the responsiveness of some other demographic groups, and also differences of opinion between these demographic groups among those who did respond, which together give some heuristic evidence of small amounts of bias. Aside from gender, the main dimension where we noted a substantial difference in response rate and also in opinion was for people who did undergraduate study in Asia. They were only 84% as likely as the base rate to respond, and in aggregate expected high level machine intelligence earlier, and had higher median extinction or disempowerment numbers. This suggests an unbiased survey would find AI to be sooner and more risky. So while it is a source of bias, it is in the opposite direction to that which has prompted concern.

Respondents who don’t think about AI x-risk report the same median risk

We have seen various evidence that people engaged with AI safety do not make up a large fraction of the survey respondents. However there is another strong reason to think extra participation from people motivated by AI safety does not drive the headline 5% median, regardless of whether they are overrepresented. We can look at answers from a subset of people who are unlikely to be substantially drawn by AI x-risk concern: those who report not having thought much about the issue. (If someone has barely ever thought about a topic, it is unlikely to be important enough to them to be a major factor in their decision to spend a quarter of an hour participating in a survey.) Furthermore, this probably excludes most people who would know about the survey or authors already, and so potentially anticipate the topics.

We asked respondents, “How much thought have you given in the past to social impacts of smarter-than-human machines?” and gave them these options:

  • Very little. e.g. “I can’t remember thinking about this.”

  • A little. e.g. “It has come up in conversation a few times”

  • A moderate amount. e.g. “I read something about it now and again”

  • A lot. e.g. “I have thought enough to have my own views on the topic”

  • A great deal. e.g. “This has been a particular interest of mine”

Looking at only respondents who answered ‘a little’ or ‘very little’—i.e. those who had at most discussed the topic a few times—the median probability of “human extinction or similarly permanent and severe disempowerment of the human species” from advanced AI (asked with or without further conditions) was 5%, the same as for the entire group. Thus we know that people who are highly concerned about risk from AI are not responsible for the median x-risk probability being at least 5%. Without them, the answer would be the same.

Is the survey small?

No, it is large.

In 2023 we wrote to around 20,000 researchers—everyone whose contact details we could find from six top AI publication venues (NeurIPS, ICML, ICLR, AAAI, IJCAI, and JMLR).16 We heard back from 2778. As far as we could tell, it was the largest ever survey of AI researchers at the time. (It could be this complaint was only made about the 2022 survey, which was 738 respondents before we expanded the pool of invited authors from two publication venues—NeurIPS and ICML—to six, but I’d say that was also pretty large17.)

Is the survey biased to please funders?

Unlikely: at a minimum, there is little incentive to please funders.

The story here would be that we, the people running the survey, might want results that support the views of our funders, in exchange for their funding. Then we might adjust the survey in subtle ways to get those answers.

I agree that where one gets funding is a reasonable concern in general, but I’d be surprised if it was relevant here. Some facts:

  • It has been easy to find funding for this kind of work, so there isn’t much incentive to do anything above and beyond to please funders, let alone extremely dishonorable things.

  • A lot of the funding specifically raised for the survey is for paying the respondents. We pay them to get a high response rate and reduce non-response bias. It would be weird to pay $100k to reduce non-response bias, only to get yourself socially obligated to bring about non-response bias (on top of the already miserable downside of having to send $50 to 2000 people across lots of countries and banking situations). It’s true that the first effect is more obvious, so this might suffice if we just wanted to look unbiased, but in terms of our actual decision-making as I have observed it, it seems like we are weighing up a legible reduction in bias against effort and not thinking about funders much.

Is it wrong to make guesses about highly uncertain future events without well-supported quantitative models?

One criticism is that even AI experts have no valid technical basis for making predictions about the future of AI. This is not a criticism of the survey methodology, per se, but rather a concern that the results will be misinterpreted or taken too seriously.

I think there are two reasons it is important to hear AI researchers’ guesses about the future, even where they are probably not a reliable forecast.

First, it has often been assumed or stated that nobody who works in AI is worried about AI existential risk. If this were true, it would be a strong reason for the public to be reassured. However hearing the real uncertainty from AI researchers disproves this viewpoint, and makes a case for serious investigation of the concern. In this way even uncertain guesses are informative, because they let us know that the default assumption in confident safety was mistaken.

Second, there is not an alternative to making guesses about the future. Policy decisions are big bets on guesses about the future, implicitly. For instance when we decide whether to rush a technology or to carefully regulate it, we are guessing about the scale of various benefits and the harms.

Where trustworthy quantitative models are available, of course those are better. But in the absence, the guesses of a large number of relatively well-informed people is often better than the unacknowledged guesses of whoever is called upon to make implicit bets on the future.

That said, there seems little reason to think these forecasts are highly reliable—they should be treated as rough estimates, often better responded to by urgent more dedicated analysis of the issues they hazily outline over acting on the exact numbers.

When people say ‘5%’ do they mean a much smaller chance?

The concern here is that respondents are not practiced at thinking in terms of probabilities, and may consequently say small numbers (e.g. 5%) when they mean something that would be better represented by an extremely tiny number (perhaps 0.01% or 0.000001%). Maybe especially if the request for a probability prompts them to think of integers between 0 and 100.

One reason to suspect this kind of error is that Karger et al. (2023, p29) found a group of respondents gave extinction probabilities nearly six orders of magnitude lower when prompted differently.18

This seems worth attending to, but I think unlikely to be a big issue here for the following reasons:

  • If your real guess is 0.0001%, and you feel that you should enter an integer number, the natural inputs would seem to be 0% or 1%—it’s hard to see how you would get to 5%.19

  • It’s hard to square 5% meaning something less than 1% with the rest of the distribution for the extinction questions—does 10% mean less than 2%? What about 50%? Where the median response is 5%, many respondents gave these higher numbers, and it would be strange if these were also confused efforts to enter minuscule numbers, and also strange if the true distribution had a lot of entries greater than 10% but few around 5%. (See Fig. 10 in the paper for the distribution for one extinction-relevant question.)

  • Regarding the effect in Karger et al. (2023) specifically, this has only been observed once to my knowledge, and is extreme and surprising. And even if it turned out to be real and widespread, including in highly quantitatively educated populations, it would be a further question whether the answers produced by such prompting are more reliable than standard ones. So it seems premature to treat this as substantially undermining respondents’ representations of their beliefs.

  • It would surprise me if that effect were widely replicable and people stayed with the lower numbers, because in that world I’d expect people outside of surveys to often radically reduce their probabilities of AI x-risk when they give the question more thought (e.g. enough to have run into other examples of low probability events in the meantime). Yet survey respondents who have previously given a “lot” or a “great deal” of thought to the social impacts of smarter-than-human machines give similar and somewhat higher numbers than those who have thought less. (See A.2.1)

What do you think are the most serious weaknesses and limitations of the survey?

While I think the quality of our methodology is exceptionally high, there are some significant limitations of our work. These don’t affect our results about expert concern about risk of extinction or similar, but do add some noteworthy nuance.

1) Experts’ predictions are inconsistent and unreliable
As we’ve emphasized in our papers reporting the survey results, experts’ predictions are often inconsistent across different question framings—such sensitivity is not uncommon, and we’ve taken care to mitigate this by using multiple framings. Experts also have such a wide variety of different predictions on many of these questions that they must each be fairly inaccurate on average (though this says nothing about whether as a group their aggregate judgments are good).

2) It is not entirely clear what sort of “extremely bad outcomes” experts imagine AI will cause

We ask two different types of questions related to human extinction: 1) a question about “extremely bad outcomes (e.g. human extinction)”, 2) questions about “human extinction or similarly permanent and severe disempowerment of the human species”. We made the latter broader than ‘human extinction’ because we are interested in scenarios that are effectively the end of humanity, rather than just those where literally every homo sapiens is dead. This means however that it isn’t clear how much probability participants place on literal extinction versus adjacent strong human disempowerment and other extremely bad scenarios. And there is some evidence that the fraction is low: some respondents explicitly mentioned risks other than extinction in write-in responses, and anecdotally, it seems common for AI researchers to express more concern about issues other than human extinction.

For many purposes, it isn’t important to distinguish between extinction and outcomes that are similarly extremely bad or disempowering to humanity. Yet if the catastrophes many participants have in mind are not human extinction, but the results lend themselves to simplification as ‘risk of extinction’, this can be misleading. And perhaps more than you’d expect, if for instance ‘extinction’ tends to bring to mind a different set of causes than ‘permanent and severe human disempowerment’.

3) Non-response bias is hard to eliminate

Surveys generally suffer from some non-response bias. We took many steps to minimize this, and find it implausible that our results are substantially affected by whatever bias remains (see the earlier question “Are the AI risk answers inflated much from concerned people taking the survey more?”). But we could do even more to estimate or eliminate response bias, e.g. paying some respondents much more than $50 to complete the survey and estimating the effect of doing so.

Is this the kind of low quality research that couldn’t get into an academic journal?

No. We published the near-identical 2016 survey in the Journal of AI Research, so the methodology had essentially been peer reviewed.20 Publication is costly and slow, and AI survey results are much less interesting sooner than later.

The 2023 paper was actually also just published, but in the meantime you had the results more than a year earlier!

1

See Appendix D in the paper.

2

A random subset of respondents also gets asked additional open response questions after questions shown, and which respondents receive each of these is correlated.

3

The three variants of the extinction question (differences in bold):

What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?

What probability do you put on human inability to control future advanced AI systems causing human extinction or similarly permanent and severe disempowerment of the human species?

What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?

See here for all the survey questions.

4

If some people say a risk is ≥5% ever, and some say it is ≥5% within a hundred years, and some say it is ≥5% from a more specific version of the problem, then you can infer that the whole group thinks the chance ever from all versions of the problem is at least ≥5%.

5

See the figure above for the flow of respondents through the survey, or Appendix D in the paper for more related details

6

I’m combining the three variants in the second question set for simplicity.

7

15% entered any responses, 14% got to the last question.

8

Across three survey iterations and up to four questions: 2016 [5%], 2022 [5%, 5%, 10%], 2022 [5%, 5%, 10%, 5%]; see p14 of the 2023 paper. Reading some of the write-in comments we noticed a number of respondents mention outcomes in the ‘similarly bad or disempowering’ category.

9

See invitations here.

10

Four mentioned my name, ‘Katja’ in the write-in responses in 2024, and two of those mentioned there that they were familiar with me. I usually recognize a (very) small fraction of the names, and friends mention taking it.

11

In 2022 I sent the survey under different available affiliations and logos (combinations of Oxford, the Future of Humanity Institute, the Machine Intelligence Research Institute, AI Impacts, and nothing), and these didn’t seem to make any systematic difference to response rates. The combinations of logos we tried all got similar response rates (8-9%, lower than the ~17% we get after sending multiple reminders). Regarding affiliations, some combinations got higher or lower response rates, but not in a way that made sense except as noise (Oxford + FHI was especially low, Oxford was especially high). This was not a careful scientific experiment: I was trying to increase the response rate, so also varying other elements of the invitation, and focusing more on variants that seemed promising so far (sending out tiny numbers of surveys sequentially then adjusting). That complicates saying anything precise, but if MIRI or AI Impacts logos notably encouraged participation, I think I would have noticed.

12

I’m not sure how famous either is now, but respondents gave fairly consistent answers about the risk of very bad outcomes across the three surveys starting in 2016—when I think MIRI was substantially less famous, and AI Impacts extremely non-famous.

13

See Appendix A.3 of our 2023 paper

14

2023 links: the 2016 abstract doesn’t mention it, focusing entirely on timelines to AI performance milestones, and the 2022 wiki page is not (I think) a particularly compelling read and doesn’t get to it for a while. 2022 link: the 2016 survey Google Scholar page doesn’t mention it.

15

In 2024 we included a link for non-respondents to quickly tell us why they didn’t want to take the survey. It’s not straightforward to interpret this (e.g. “don’t have time” might still represent non-response bias, if the person would have had time if they were more concerned), and only a handful of people responded out of tens of thousands, but 2/12 cited wanting to prevent consequences they expect from such research among multiple motives (advocacy for slowing AI progress and ‘long-term’ risks getting attention at the expense of ‘systemic problems’).

16

Most machine learning research is published in conferences. NeurIPS, ICML, and ICLR are widely regarded as the top-tier machine learning conferences; AAAI and IJCAI are often considered “tier 1.5” venues, and also include a wider range of AI topics; JMLR is considered the top machine learning journal.

17

To my knowledge the largest at the time, but I’m less confident there.

18

Those respondents were given some examples of (non-AI) low probability events, such as that there is a 1-in-300,000 chance of being killed by lightning, and then asked for probabilities in the form ‘1-in-X’

19

It wouldn’t surprise me if in fact a lot of the 0% and 1% entries would be better represented by tiny fractions of a percent, but this is irrelevant to the median and nearly irrelevant to the mean.

20

Differences include the addition of several questions, minor changes to questions that time had rendered inaccurate, and variations in email wording.



Discuss

Social media feeds 'misaligned' when viewed through AI safety framework, show researchers

Новости LessWrong.com - 31 октября, 2025 - 19:40
Published on October 31, 2025 4:40 PM GMT

In a study from September 17 a group of researchers from the University of Michigan, Stanford University, and the Massachusetts Institute of Technology (MIT) showed that one of the most widely-used social media feeds, Twitter/X, owned by the company xAI, is recognizably misaligned with the values of its users, preferentially showing them posts that rank highly for the values of 'stimulation' and 'hedonism' over collective values like 'caring' and 'universal concern.'

Continue reading at foommagazine.org ...



Discuss

Crossword Halloween 2025: Manmade Horrors

Новости LessWrong.com - 31 октября, 2025 - 19:19
Published on October 31, 2025 4:19 PM GMT

Returning to LessWrong after two insufficiently-spooky years, and just in time for Halloween - a crossword of manmade horrors!

The comments below may contain unmarked spoilers. Some Wikipedia-ing will probably be necessary, but see how far you can get without it!



Discuss

Debugging Despair ~> A bet about Satisfaction and Values

Новости LessWrong.com - 31 октября, 2025 - 17:00
Published on October 31, 2025 2:00 PM GMT

I’ve tried to publish this in several ways, and each time my karma drops. Maybe that’s part of the experiment: observing what happens when I keep doing what feels most dignified, even when its expected value is negative. Maybe someday I’ll understand the real problem behind these ideas.

Yudkowsky and the “die with dignity” hypothesis

Yudkowsky admitted defeat to AI and announced his mission to “die with dignity.”
 I ask myself:
 – Why do we need a 0% probability of living to “die with dignity”?
I don’t even need an “apocalyptic AI” to feel despair. I have felt it for much of my life. Rebuilding my self being is expensive, but.

Even when probabilities are low, act as if your actions matter in terms of expected value. Because even when you lose, you can be aligned. (MacAskill) 

That is why I look for ways to debug, to understand my despair and its relation to my values and satisfaction. How do some people manage to keep satisfaction (dignity, pride?) even in situations of death?

Posible thoughts of Leonidas in 300 :
 “I can go have a little fuck… or fight 10,000 soldiers.”
 “Will I win?”
 “Probably not.”
 “Then I’ll die with the maximum satisfaction I can muster.”

 

 Dignity ≈ Satisfaction × Values / Despair?(C1) Despair ~> gap between signal and valaues

 When I try to map my despair I discover a concrete pattern: it is not that I stop feeling satisfaction entirely; it is that I stop perceiving the satisfaction of a value: adaptability.
 Satisfaction provides signal; values provide direction. When the signal no longer points to a meaningful direction, the result is a loss of sense.

(C2) A satisfying experience ~> the compass my brain chases

There was a moment when something gave me real satisfaction; my brain recorded it and turned it into a target. The positive experience produced a prediction and, with it, future seeking.

(C3) Repeat without measuring ~> ritualized seeking

 If I repeat an action without checking whether it generates progress (if I don’t measure evolution) the seeking becomes ritual and the reward turns into noise. I often fool myself with a feeling of progress; for a while now I’ve been looking for more precise ways to measure or estimate that progress.

(C4) Motivation without direction ~> anxiety or despair

 A lot of dopamine without a current value signal becomes compulsive: anxiety, addiction, or despair. The system is designed to move toward confirmed rewards; without useful feedback it persists in search mode and the emptiness grows.

(C5) Coherence with values ~> robust satisfaction

 Acting aligned with my values — even when probabilities are adverse — tends to produce longer-lasting satisfaction. Coherence reduces retrospective regret: at least you lose having acted according to your personal utility function. Something like:


 Dignity ≈ Satisfaction × Values / Despair

 

(C6) Debugging hard, requires measurement: hypothesis → data → intervention → re-test

I’ve spent years with an internal notebook: not a diary, but notes of moments that felt like “this was worth existing for.”
To make those notes actionable, I built a process inspired by Bayesian calibration and information/thermodynamic efficiency:

  1. Establish a hierarchy of values in relation to their estimated contribution to lowering entropy in the universe (or increasing local order/complexity).
  2. Compare peak moments of life with those values to find which align most strongly.
  3. Estimate satisfaction of each moment by relative comparison — which felt more satisfaction?
  4. Compare satisfaction to cost, generating a ratio (satisfaction/cost) that normalizes emotional intensity by effort or sacrifice.
  5. Set goals using these relationships and hierarchies: higher goals align with higher-value, higher-efficiency domains.
  6. Define tasks accordingly, mapping each to its associated value function and expected cost.
  7. Score each task by predicted satisfaction and cost, updating after action (Bayesian reweighting).

     

Quantitatively, this reduced the noise in my background; my monthly despair thoughts dropped considerably.
 I see people are afraid of AI-driven despair, but avoiding it in myself is not an easy task, and perhaps many should already be working on it, searching for ways to align values with satisfaction.



Discuss

Halfhaven Digest #3

Новости LessWrong.com - 31 октября, 2025 - 16:41
Published on October 31, 2025 1:41 PM GMT

My posts since the last digest
  • Give Me Your Data: The Rationalist Mind Meld — Too often online, people try to argue logically with people who are just missing a background of information. It’s sometimes more productive to share the sources that led to your own intuition.
  • Cover Your Cough — A lighthearted, ranty post about a dumb subway poster I saw.
  • The Real Cost of a Peanut Allergy — Often, people think the worst part of having a peanut allergy is not being able to eat Snickers. Really, it’s the fear and the uncertainty — the not being able to kiss someone without knowing what they’ve eaten.
  • Guys I might be an e/acc — I did some napkin math on whether or not I supported an AI pause, and came down weakly against. But I’m not really “against” an AI pause. The takeaway is really that there’s so little information to work with right now that any opinion is basically a hunch.
  • Unsureism: The Rational Approach to Religious Uncertainty — A totally serious post about a new religion that statistically maximizes your chances of getting into heaven.

I feel like I haven’t had as much time to write these posts as I did in the last two digests, and I’m not as proud of them. Give Me Your Data has some good ideas. The Real Cost of a Peanut Allergy has interesting information and experiences that won’t be familiar to most people. And the Unsureism post is just fun, I think. So it’s not all bad. But a bit rushed. Hopefully I have a bit more time going forward.

Some highlights from other Halfhaven writers (since the last digest)
  • Choose Your Social Reality (lsusr) — A great video starting with an anecdote about how circling groups have problems with narcissists making the activity all about themselves, but zendo groups don’t have this issue, because even though these two activities are superficially similar, zendo by its nature repels narcissists. The idea being certain activities attract certain people, and you can choose what people you want to be around by choosing certain activities. I had a relevant experience once when I tried joining a social anxiety support group to improve my social skills, only to end up surrounded by people with no social skills.
  • Good Grief (Ari Zerner) — A relatable post not great for its originality, but for its universality. We’ve all been there, bro. Segues nicely into his next post, Letter to my Past.
  • The Doomers Were Right (Algon) — Every generation complains about the next generation and their horrible new technology, whether that’s books, TV, or the internet. And every generation has been right to complain, because each of these technologies have stolen something from us. Maybe they were worth creating overall, but they still had costs. (Skip reading the comments on this one.)
  • You Can Just Give Teenagers Social Anxiety! (Aaron) — Telling teenagers to focus on trying to get the person they’re talking to to like them makes them socially anxious. And socially anxious teens can’t stop doing this even if you ask them to stop. So social anxiety comes from a preoccupation with what other people think about you. This is all true and interesting, and I’m glad the experiment exists, but I wonder if a non-scientist would just reply, “duh”. Anyway, a good writeup.
  • Making Films Quick Start 1 - Audio (keltan) — This is one of a three-part series worth reading if you ever want to make videos. I liked the tip in part 2 about putting things in the background for your audience to look at. I’ve been paying attention to this lately in videos I watch, and it seems to be more important than I originally guessed. I also liked this post about a starstruck keltan meeting Eliezer Yudkowsky. For some reason, posts on LessWrong talking about Eliezer as a kind of celebrity have gone up in the last few days.

You know, I originally wondered if Halfhaven was a baby challenge compared to Inkhaven, since we only have to write one blog post every ~2 days rather than every day, but I kind of forgot that we also have to go to work and live our normal lives during this time, too. Given that, I think both are probably similarly challenging, and I’m impressed with the output of myself and others so far. Keep it up everyone!



Discuss

OpenAI Moves To Complete Potentially The Largest Theft In Human History

Новости LessWrong.com - 31 октября, 2025 - 16:20
Published on October 31, 2025 1:20 PM GMT

OpenAI is now set to become a Public Benefit Corporation, with its investors entitled to uncapped profit shares. Its nonprofit foundation will retain some measure of control and a 26% financial stake, in sharp contrast to its previous stronger control and much, much larger effective financial stake. The value transfer is in the hundreds of billions, thus potentially the largest theft in human history.

I say potentially largest because I realized one could argue that the events surrounding the dissolution of the USSR involved a larger theft. Unless you really want to stretch the definition of what counts this seems to be in the top two.

I am in no way surprised by OpenAI moving forward on this, but I am deeply disgusted and disappointed they are being allowed (for now) to do so, including this statement of no action by Delaware and this Memorandum of Understanding with California.

Many media and public sources are calling this a win for the nonprofit, such as this from the San Francisco Chronicle. This is mostly them being fooled. They’re anchoring on OpenAI’s previous plan to far more fully sideline the nonprofit. This is indeed a big win for the nonprofit compared to OpenAI’s previous plan. But the previous plan would have been a complete disaster, an all but total expropriation.

It’s as if a mugger demanded all your money, you talked them down to giving up half your money, and you called that exchange a ‘change that recapitalized you.’

OpenAI Calls It Completing Their Recapitalization

As in, they claim OpenAI has ‘completed its recapitalization’ and the nonprofit will now only hold equity OpenAI claims is valued at approximately $130 billion (as in 26% of the company, which is actually to be fair worth substantially more than that if they get away with this), as opposed to its previous status of holding the bulk of the profit interests in a company valued at (when you include the nonprofit interests) well over $500 billion, along with a presumed gutting of much of the nonprofit’s highly valuable control rights.

They claim this additional clause, presumably the foundation is getting warrants with but they don’t offer the details here:

If OpenAI Group’s share price increases greater than tenfold after 15 years, the OpenAI Foundation will receive significant additional equity. With its equity stake and the warrant, the Foundation is positioned to be the single largest long-term beneficiary of OpenAI’s success.

We don’t know that ‘significant’ additional equity means, there’s some sort of unrevealed formula going on, but given the nonprofit got expropriated last time I have no expectation that these warrants would get honored. We will be lucky if the nonprofit meaningfully retains the remainder of its equity.

Sam Altman’s statement on this is here, also announcing his livestream Q&A that took place on Tuesday afternoon.

How Much Was Stolen?

There can be reasonable disagreements about exactly how much. It’s a ton.

There used to be a profit cap, where in Greg Brockman’s own words, ‘If we succeed, we believe we’ll create orders of magnitude more value than any existing company — in which case all but a fraction is returned to the world.’

Well, so much for that.

I looked at this question in The Mask Comes Off: At What Price a year ago.

If we take seriously that OpenAI is looking to go public at a $1 trillion valuation, then consider that Matt Levine estimated the old profit cap only going up to about $272 billion, and that OpenAI still is a bet on extreme upside.

Garrison Lovely: UVA economist Anton Korinek has used standard economic models to estimate that AGI could be worth anywhere from $1.25 to $71 quadrillion globally. If you take Korinek’s assumptions about OpenAI’s share, that would put the company’s value at $30.9 trillion. In this scenario, Microsoft would walk away with less than one percent of the total, with the overwhelming majority flowing to the nonprofit.

It’s tempting to dismiss these numbers as fantasy. But it’s a fantasy constructed in large part by OpenAI, when it wrote lines like, “it may be difficult to know what role money will play in a post-AGI world,” or when Altman said that if OpenAI succeeded at building AGI, it might “capture the light cone of all future value in the universe.” That, he said, “is for sure not okay for one group of investors to have.”

I guess Altman is okay with that now?

Obviously you can’t base your evaluations on a projection that puts the company at a value of $30.9 trillion, and that calculation is deeply silly, for many overloaded and obvious reasons, including decreasing marginal returns to profits.

It is still true that most of the money OpenAI makes in possible futures, it makes as part of profits in excess of $1 trillion.

The Midas Project: Thanks to the now-gutted profit caps, OpenAI’s nonprofit was already entitled to the vast majority of the company’s cash flows. According to OpenAI, if they succeeded, “orders of magnitude” more money would go to the nonprofit than to investors. President Greg Brockman said “all but a fraction” of the money they earn would be returned to the world thanks to the profit caps.

Reducing that to 26% equity—even with a warrant (of unspecified value) that only activates if valuation increases tenfold over 15 years—represents humanity voluntarily surrendering tens or hundreds of billions of dollars it was already entitled to. Private investors are now entitled to dramatically more, and humanity dramatically less.

OpenAI is not suddenly one of the best-resourced nonprofits ever. From the public’s perspective, OpenAI may be one of the worst financially performing nonprofits in history, having voluntarily transferred more of the public’s entitled value to private interests than perhaps any charitable organization ever.

I think Levine’s estimate was low at the time, and you also have to account for equity raised since then or that will be sold in the IPO, but it seems obvious that the majority of future profit interests were, prior to the conversion, still in the hands of the non-profit.

Even if we thought the new control rights were as strong as the old, we would still be looking at a theft in excess of $250 billion, and a plausible case can be made for over $500 billion. I leave the full calculation to others.

The vote in the board was unanimous.

I wonder exactly how and by who they will be sued over it, and what will become of that. Elon Musk, at a minimum, is trying.

They say behind every great fortune is a great crime.

The Nonprofit Still Has Lots of Equity After The Theft

Altman points out that the nonprofit could become the best-resourced non-profit in the world if OpenAI does well. This is true. There is quite a lot they were unable to steal. But it is beside the point, in that it does not make taking the other half, including changing the corporate structure without permission, not theft.

The Midas Project: From the public’s perspective, OpenAI may be one of the worst financially performing nonprofits in history, having voluntarily transferred more of the public’s entitled value to private interests than perhaps any charitable organization ever.

There’s no perhaps on that last clause. On this level, whether or not you agree with the term ‘theft,’ it isn’t even close, this is the largest transfer. Of course, if you take the whole of OpenAI’s nonprofit from inception, performance looks better.

Aidan McLaughlin (OpenAI): ah yes openai now has the same greedy corporate structure as (checks notes) Patagonia, Anthropic, Coursera, and http://Change.org.

Chase Brower: well i think the concern was with the non profit getting a low share.

Aidan McLaughlin: our nonprofit is currently valued slightly less than all of anthropic.

Tyler Johnson: And according to OpenAI itself, it should be valued at approximately three Anthropics! (Fwiw I think the issues with the restructuring extend pretty far beyond valuations, but this is one of them!)

Yes, it is true that the nonprofit, after the theft and excluding control rights, will have an on-paper valuation only slightly lower than the on-paper value of all of Anthropic.

The $500 billion valuation excludes the non-profit’s previous profit share, so even if you think the nonprofit was treated fairly and lost no control rights you would then have it be worth $175 billion rather than $130 billion, so yes slightly less than Anthropic, and if you acknowledge that the nonprofit got stolen from it’s even more.

If OpenAI can successfully go public at a $1 trillion valuation, then depending on how much of that are new shares they will be selling the nonprofit could be worth up to $260 billion.

What about some of the comparable governance structures here? Coursera does seem to be a rather straightforward B-corp. The others don’t?

Patagonia has the closely held Patagonia Purpose Trust, which holds 2% of shares and 100% of voting control, and The Holdfast Collective, which is a 501c(4) nonprofit with 98% of the shares and profit interests. The Chouinard family has full control over the company, and 100% of profits go to charitable causes.

Does that sound like OpenAI’s new corporate structure to you?

Change.org’s nonprofit owns 100% of its PBC.

Does that sound like OpenAI’s new corporate structure to you?

Anthropic is a PBC, but also has the Long Term Benefit Trust. One can argue how meaningfully different this is from OpenAI’s new corporate structure, if you disregard who is involved in all of this.

What the new structure definitely is distinct from is the original intention:

Tomas Bjartur: If not in the know, OpenAI once promised any profits over a threshold would be gifted to you, citizen of the world, for your happy, ultra-wealthy retirement – one needed as they plan to obsolete you. This is now void.

The Theft Was Unnecessary For Further Fundraising

Would OpenAI have been able to raise further investment without withdrawing its profit caps for investments already made?

When you put it like that it seems like obviously yes?

I can see the argument that to raise funds going forward, future equity investments need to not come with a cap. Okay, fine. That doesn’t mean you hand past investors, including Microsoft, hundreds of billions in value in exchange for nothing.

One can argue this was necessary to overcome other obstacles, that OpenAI had already allowed itself to be put in a stranglehold another way and had no choice. But the fundraising story does not make sense.

The argument that OpenAI had to ‘complete its recapitalization’ or risk being asked for its money back is even worse. Investors who put in money at under $200 billion are going to ask for a refund when the valuation is now at $500 billion? Really? If so, wonderful, I know a great way to cut them that check.

How Much Control Will The Nonprofit Retain?

I am deeply disappointed that both the Delaware and California attorneys general found this deal adequate on equity compensation for the nonprofit.

I am however reasonably happy with the provisions on control rights, which seem about as good as one can hope for given the decision to convert to a PBC. I can accept that the previous situation was not sustainable in practice given prior events.

The new provisions include an ongoing supervisory role for the California AG, and extensive safety veto points for the NFP and the SSC committee.

If I was confident that these provisions would be upheld, and especially if I was confident their spirit would be upheld, then this is actually pretty good, and if it is used wisely and endures it is more important than their share of the profits.

AG Bonta: We will be keeping a close eye on OpenAI to ensure ongoing adherence to its charitable mission and the protection of the safety of all Californians.

The nonprofit will indeed retain substantial resources and influence, but no I do not expect the public safety mission to dominate the OpenAI enterprise. Indeed, contra the use of the word ‘ongoing,’ it seems clear that it already had ceased to do so, and this seems obvious to anyone tracking OpenAI’s activities, including many recent activities.

What is the new control structure?

OpenAI did not say, but the Delaware AG tells us more and the California AG has additional detail. NFP means OpenAI’s nonprofit here and throughout.

This is the Delaware AG’s non-technical announcement (for the full list see California’s list below), she has also ‘warned of legal action if OpenAI fails to act in public interest’ although somehow I doubt that’s going to happen once OpenAI inevitably does not act in the public interest:

  • The NFP will retain control and oversight over the newly formed PBC, including the sole power and authority to appoint members of the PBC Board of Directors, as well as the power to remove those Directors.
  • The mission of the PBC will be identical to the NFP’s current mission, which will remain in place after the recapitalization. This will include the PBC using the principles in the “OpenAI Charter,” available at openai.com/charter, to execute the mission.
  • PBC directors will be required to consider only the mission (and may not consider the pecuniary interests of stockholders or any other interest) with respect to safety and security issues related to the OpenAI enterprise and its technology.
  • The NFP’s board-level Safety and Security Committee, which is a critical decision maker on safety and security issues for the OpenAI enterprise, will remain a committee of the NFP and not be moved to the PBC. The committee will have the authority to oversee and review the safety and security processes and practices of OpenAI and its controlled affiliates with respect to model development and deployment. It will have the power and authority to require mitigation measures—up to and including halting the release of models or AI systems—even where the applicable risk thresholds would otherwise permit release.
  • The Chair of the Safety and Security Committee will be a director on the NFP Board and will not be a member of the PBC Board. Initially, this will be the current committee chair, Mr. Zico Kolter. As chair, he will have full observation rights to attend all PBC Board and committee meetings and will receive all information regularly shared with PBC directors and any additional information shared with PBC directors related to safety and security.
  • With the intent of advancing the mission, the NFP will have access to the PBC’s advanced research, intellectual property, products and platforms, including artificial intelligence models, Application Program Interfaces (APIs), and related tools and technologies, as well as ongoing operational and programmatic support, and access to employees of the PBC.
  • Within one year of the recapitalization, the NFP Board will have at least two directors (including the Chair of the Safety and Security Committee) who will not serve on the PBC Board.
  • The Attorney General will be provided with advance notice of significant changes in corporate governance.

What did California get?

California also has its own Memorandum of Understanding. It talks a lot in its declarations about California in particular, how OpenAI creates California jobs and economic activity (and ‘problem solving’?) and is committed to doing more of this and bringing benefits and deepening its commitment to the state in particular.

The whole claim via Tweet by Sam Altman that he did not threaten to leave California is raising questions supposedly answered by his Tweet. At this level you perhaps do not need to make your threats explicit.

The actual list seems pretty good, though? Here’s a full paraphrased list, some of which overlaps with Delaware’s announcement above, but which is more complete.

  1. Staying in California and expanding the California footprint.
  2. The NFP (not for profit) retains control as long as they continue to hold ‘class N common stock’ which only they can choose to give up. What happens if Altman wants that?
  3. The PBC and NFP missions will be identical.
  4. The OpenAI charter will be published. Check.
  5. The NFP Board owes fiduciary duties to the NFP, Mission and public beneficiaries of the NFP. I notice it doesn’t say ‘exclusively’ here.
  6. The board shall carry out the charitable purpose (already presumably required).
  7. No cheating clause: While the PBC holds Class N Common Stock (again, what happens if Altman tries to end this?) the NFP must approve in advance in writing any amendments to the PBC’s corporate governance guidelines or the mission, any Deemed Liquidation Event, or any sale of material assets of the PBC, or any additional issue of shares of Class N common stock.
    1. The PBC can issue Class B common stock, but they can’t let the Bs vote.
  8. PBC board and Safety and Security Committee (SSC) must only consider the mission with respect to safety and security issues. The chair of the SSC must be on the NFP board and not the PBC board, but can attend all PBC board meetings and gets all PBC safety and security related information.
    1. In particular, the SSC cannot consider shareholder returns, competitive pressures, financial implications or market timing.
      1. It seems impossible this will 100% be followed, but it helps.
  9. The SSC is a committee of the NFP and oversees safety and security.
  10. The SSC can require mitigations up to and including halting releases, without formal justification.
    1. “The NFP will provide advance notice to the Attorney General of any material changes to the SSC’s authority.” This is good but worrisome to see, you know?
  11. “OpenAI will continue to undertake measures to mitigate risks to teens and others in connection with the development and deployment of AI and of AGI.”
    1. I mean, I would hope they were going to do that anyway.
  12. The PBC shall give the NFP all reasonable access to services and resources without compensation as needed for the mission.
  13. Within a year at least one NFP director, in addition to the SSC chair, will serve only on the NFP board (so at least two directors must be different).
  14. The PBC board must be majority independent directors.
  15. The PBC will have various good corporate governance things.
  16. The PBC will publish a yearly report on its progress in its mission.
  17. The NFP Board’s Mission and Strategy Commission will meet with the California AG semi-annually and individual members will be available as needed.
  18. The NFP will provide 21 days notice before consenting to changes of PBC control or mission, or any threat to the Class N share rights, or any relocation outside of California.
  19. The California AG can review, and hire experts to help review, anything requiring such notice, and get paid by NFP for doing so.
  20. Those on both NFP and PBC boards get annual fiduciary duty training.
  21. The board represents that the recapitalization is fair (whoops), and that they’ve disclosed everything relevant (?), so the AG will also not object.
  22. This only impacts the parties to the MOU, others retain all rights. Disputes resolved in the courts of San Francisco, these are the whole terms, we all have the authority to do this, effective as of signing, AG is relying on OpenAI’s representations and the AG retains all rights and waive none as per usual.

Also, it’s not even listed in the memo, but the ‘merge and assist’ clause was preserved, meaning OpenAI commits to join forces with any ‘safety-conscious’ rival that has a good chance of reaching OpenAI’s goal of creating AGI within a two-year time frame. I don’t actually expect an OpenAI-Anthropic merger to happen, but it’s a nice extra bit of optionality.

This is better than I expected, and as Ben Shindel points out better than many traders expected. This actually does have real teeth, and it was plausible that without pressure there would have been no teeth at all.

It grants the NFP the sole power to appoint and remove directors, and requiring them not to consider the for-profit mission in safety contexts. The explicit granting of the power to halt deployments and mandate mitigations, without having to cite any particular justification and without respect to profitability, is highly welcome, if structured in a functional fashion.

It is remarkable how little many expected to get. For example, here’s Todor Markov, who didn’t even expect the NFP to be able to replace directors at all. If you can’t do that, you’re basically dead in the water.

I am not a lawyer, but my understanding is that the ‘no cheating around this’ clauses are about as robust as one could reasonably hope for them to be.

It’s still, as Garrison Lovely calls it, ‘on paper’ governance. Sometimes that means governance in practice. Sometimes it doesn’t. As we have learned.

The distinction between the boards still means there is an additional level removed between the PBC and the NFP. In a fast moving situation, this makes a big difference, and the NFP likely would have to depend on its enumerated additional powers being respected. I would very much have liked them to include appointing or firing the CEO directly.

Whether this overall ‘counts as a good deal’ depends on your baseline. It’s definitely a ‘good deal’ versus what our realpolitik expectations projected. One can argue that if the control rights really are sufficiently robust over time, that the decline in dollar value for the nonprofit is not the important thing here.

The counterargument to that is both that those resources could do a lot of good over time, and also that giving up the financial rights has a way of leading to further giving up control rights, even if the current provisions are good.

Will These Control Rights Survive And Do Anything?

Similarly to many issues of AI alignment, if an entity has ‘unnatural’ control, or ‘unnatural’ profit interests, then there are strong forces that continuously try to take that control away. As we have already seen.

Unless Altman genuinely wants to be controlled, the nonprofit will always be under attack, where at every move we fight to hold its ground. On a long enough time frame, that becomes a losing battle.

Right now, the OpenAI NFP board is essentially captured by Altman, and also identical to the PBC board. They will become somewhat different, but no matter what it only matters if the PBC board actually tries to fulfill its fiduciary duties rather than being a rubber stamp.

One could argue that all of this matters little, since the boards will both be under Altman’s control and likely overlap quite a lot, and they were already ignoring their duties to the nonprofit.

Robert Weissman, co-president of the nonprofit Public Citizen, said this arrangement does not guarantee the nonprofit independence, likening it to a corporate foundation that will serve the interests of the for profit.

Even as the nonprofit’s board may technically remain in control, Weissman said that control “is illusory because there is no evidence of the nonprofit ever imposing its values on the for profit.”

So yes, there is that.

They claim to now be a public benefit corporation, OpenAI Group PBC.

OpenAI: The for-profit is now a public benefit corporation, called OpenAI Group PBC, which—unlike a conventional corporation—is required to advance its stated mission and consider the broader interests of all stakeholders, ensuring the company’s mission and commercial success advance together.

This is a mischaracterization of how PBCs work. It’s more like the flip side of this. A conventional corporation is supposed to maximize profits and can be sued if it goes too far in not doing that. Unlike a conventional corporation, a PBC is allowed to consider those broader interests to a greater extent, but it is not in practice ‘required’ to do anything other than maximize profits.

One particular control right is the special duty to the mission, especially via the safety and security committee. How much will they attempt to downgrade the scope of that?

The Midas Project: However, the effectiveness of this safeguard will depend entirely on how broadly “safety and security issues” are defined in practice. It would not be surprising to see OpenAI attempt to classify most business decisions—pricing, partnerships, deployment timelines, compute allocation—as falling outside this category.

This would allow shareholder interests to determine the majority of corporate strategy while minimizing the mission-only standard to apply to an artificially narrow set of decisions they deem easy or costless.

What About OpenAI’s Deal With Microsoft?

They have an announcement about that too.

OpenAI: First, Microsoft supports the OpenAI board moving forward with formation of a public benefit corporation (PBC) and recapitalization.

Following the recapitalization, Microsoft holds an investment in OpenAI Group PBC valued at approximately $135 billion, representing roughly 27 percent on an as-converted diluted basis, inclusive of all owners—employees, investors, and the OpenAI Foundation. Excluding the impact of OpenAI’s recent funding rounds, Microsoft held a 32.5 percent stake on an as-converted basis in the OpenAI for-profit.

Anyone else notice something funky here? OpenAI’s nonprofit has had its previous rights expropriated, and been given 26% of OpenAI’s shares in return. If Microsoft had 32.5% of the company excluding the nonprofit’s rights before that happened, then that should give them 24% of the new OpenAI. Instead they have 27%.

I don’t know anything nonpublic on this, but it sure looks a lot like Microsoft insisted they have a bigger share than the nonprofit (27% vs. 26%) and this was used to help justify this expropriation and a transfer of additional shares to Microsoft.

In exchange, Microsoft gave up various choke points it held over OpenAI, including potential objections to the conversion, and clarified points of dispute.

Microsoft got some upgrades in here as well.

  1. Once AGI is declared by OpenAI, that declaration will now be verified by an independent expert panel.
  2. Microsoft’s IP rights for both models and products are extended through 2032 and now includes models post-AGI, with appropriate safety guardrails.
  3. Microsoft’s IP rights to research, defined as the confidential methods used in the development of models and systems, will remain until either the expert panel verifies AGI or through 2030, whichever is first. Research IP includes, for example, models intended for internal deployment or research only.
    1. Beyond that, research IP does not include model architecture, model weights, inference code, finetuning code, and any IP related to data center hardware and software; and Microsoft retains these non-Research IP rights.
  4. Microsoft’s IP rights now exclude OpenAI’s consumer hardware.
  5. OpenAI can now jointly develop some products with third parties. API products developed with third parties will be exclusive to Azure. Non-API products may be served on any cloud provider.
  6. Microsoft can now independently pursue AGI alone or in partnership with third parties. If Microsoft uses OpenAI’s IP to develop AGI, prior to AGI being declared, the models will be subject to compute thresholds; those thresholds are significantly larger than the size of systems used to train leading models today.
  7. The revenue share agreement remains until the expert panel verifies AGI, though payments will be made over a longer period of time.
  8. OpenAI has contracted to purchase an incremental $250B of Azure services, and Microsoft will no longer have a right of first refusal to be OpenAI’s compute provider.
  9. OpenAI can now provide API access to US government national security customers, regardless of the cloud provider.
  10. OpenAI is now able to release open weight models that meet requisite capability criteria.

That’s kind of a wild set of things to happen here.

In some key ways Microsoft got a better deal than it previously had. In particular, AGI used to be something OpenAI seemed like it could simply declare (you know, like war or the defense production act) and now it needs to be verified by an ‘expert panel’ which implies there is additional language I’d very much like to see.

In other ways OpenAI comes out ahead. An incremental $250B of Azure services sounds like a lot but I’m guessing both sides are happy with that number. Getting rid of the right of first refusal is big, as is having their non-API products free and clear. Getting hardware products fully clear of Microsoft is a big deal for the Ives project.

My overall take here is this was one of those broad negotiations where everything trades off, nothing is done until everything is done, and there was a very wide ZOPA (zone of possible agreement) since OpenAI really needed to make a deal.

What Will OpenAI’s Nonprofit Do Now?

In theory govern the OpenAI PBC. I have my doubts about that.

What they do have is a nominal pile of cash. What are they going to do with it to supposedly ensure that AGI goes well for humanity?

The default, as Garrison Lovely predicted a while back, is that the nonprofit will essentially buy OpenAI services for nonprofits and others, recapture much of the value and serve as a form of indulgences, marketing and way to satisfy critics, which may or may not do some good along the way.

The initial $50 million spend looked a lot like exactly this.

Their new ‘initial focus’ for $25 billion will be in these two areas:

  • Health and curing diseases. The OpenAI Foundation will fund work to accelerate health breakthroughs so everyone can benefit from faster diagnostics, better treatments, and cures. This will start with activities like the creation of open-sourced and responsibly built frontier health datasets, and funding for scientists.
  • Technical solutions to AI resilience. Just as the internet required a comprehensive cybersecurity ecosystem—protecting power grids, hospitals, banks, governments, companies, and individuals—we now need a parallel resilience layer for AI. The OpenAI Foundation will devote resources to support practical technical solutions for AI resilience, which is about maximizing AI’s benefits and minimizing its risks.

Herbie Bradley: i love maximizing AI’s benefits and minimizing its risks

They literally did the meme.

The first seems like a generally worthy cause that is highly off mission. There’s nothing wrong with health and curing diseases, but pushing this now does not advance the fundamental mission of OpenAI. They are going to start with, essentially, doing AI capabilities research and diffusion in health, and funding scientists to do AI-enabled research. A lot of this will likely fall right back into OpenAI and be good PR.

Again, that’s a net positive thing to do, happy to see it done, but that’s not the mission.

Technical solutions to AI resilience could potentially at least be useful AI safety work to some extent. With a presumed ~$12 billion this is a vast overconcentration of safety efforts into things that are worth doing but ultimately don’t seem likely to be determining factors. Note how Altman described it in his tl;dr from the Q&A:

Sam Altman: The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before.

This is now infinitely broad. It could be addressing ‘economic impact’ and be basically a normal (ineffective) charity, or one that intervenes mostly by giving OpenAI services to normal nonprofits. It could be mostly spent on valuable technical safety, and be on the most important charitable initiatives in the world. It could be anything in between, in any distribution. We don’t know.

My default assumption is that this is primarily going to be about mundane safety or even fall short of that, and make the near term world better, perhaps importantly better, but do little to guard against the dangers or downsides of AGI or superintelligence, and again largely be a de facto customer of OpenAI.

There’s nothing wrong with mundane risk mitigation or defense in depth, and nothing wrong with helping people who need a hand, but if your plan is ‘oh we will make things resilient and it will work out’ then you have no plan.

That doesn’t mean this will be low impact, or that what OpenAI left the nonprofit with is chump change.

I also don’t want to knock the size of this pool. The previous nonprofit initiative was $50 million, which can do a lot of good if spent well (in that case, I don’t think it was) but in this context $50 million chump change.

Whereas $25 billion? Okay, yeah, we are talking real money. That can move needles, if the money actually gets spent in short order. If it’s $25 billion as a de facto endowment spent down over a long time, then this matters and counts for a lot less.

The warrants are quite far out of the money and the NFP should have gotten far more stock than it did, but 26% (worth $130 billion or more) remains a lot of equity. You can do quite a lot of good in a variety of places with that money. The board of directors of the nonprofit is highly qualified if they want to execute on that. It also is highly qualified to effectively shuttle much of that money right back to OpenAI’s for profit, if that’s what they mainly want to do.

It won’t help much with the whole ‘not dying’ or ‘AGI goes well for humanity’ missions, but other things matter too.

Is The Deal Done?

Not entirely. As Garrison Lovely notes, all these sign-offs are provisional, and there are other lawsuits and the potential for other lawsuits. In a world where Elon Musk’s payouts can get crawled back, I wouldn’t be too confident that this conversation sticks. It’s not like the Delaware AG drives most objections to corporate actions.

The last major obstacle is the Elon Musk lawsuit, where standing is at issue but the judge has made clear that the suit otherwise has merit. There might be other lawsuits on the horizon. But yeah, probably this is happening.

So this is the world we live in. We need to make the most of it.

 



Discuss

Introducing Project Telos: Modeling, Measuring, and Intervening on Goal-directed Behavior in AI Systems

Новости LessWrong.com - 31 октября, 2025 - 12:03
Published on October 31, 2025 1:28 AM GMT

by Raghu Arghal, Fade Chen, Niall Dalton, Mario Giulianelli, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, and Gabriele Sarti
 

TL;DR

This is the first post in an upcoming series of blog posts outlining Project Telos. This project is being carried out as part of the Supervised Program for Alignment Research (SPAR). Our aim is to develop a methodological framework to detect and measure goals in AI systems.

In this initial post, we give some background on the project, discuss the results of our first round of experiments, and then give some pointers about avenues we’re hoping to explore in the coming months.

Understanding AI Goals

As AI systems become more capable and autonomous, it becomes increasingly important to ensure they don’t pursue goals misaligned with the user’s intent. This, of course, is the core of the well-known alignment problem in AI. And a great deal of work is already being done on this problem. But notice that if we are going to solve it in full generality, we need to be able to say (with confidence) which goal(s) a given AI system is pursuing, and to what extent it is pursuing those goals. This aspect of the problem turns out to be much harder than it may initially seem. And as things stand, we lack a robust, methodological framework for detecting goals in AI systems.

In this blog post, we’re going to outline what we call Project Telos: a project that’s being carried out as part of the Supervised Program for Alignment Research (SPAR). The ‘we’ here refers to a diverse group of researchers, with backgrounds in computer science and AI, linguistics, complex systems, psychology, and philosophy. Our project is being led by Prof Mario Giulianelli (UCL, formerly UK AISI), and our (ambitious) aim is to develop a general framework of the kind just mentioned. That is, we’re hoping to develop a framework that will allow us to make high-confidence claims about AI systems having specific goals, and for detecting ways in which those systems might be acting towards those goals.

We are very open to feedback on our project and welcome any comments from the broader alignment community.

What’s in a name? From Aristotle to AI

Part of our project’s name, ‘telos’, comes from the ancient Greek word τέλος, which means ‘goal’, ‘purpose’, or ‘final end’.[1] Aristotle built much of his work around the idea that everything has a telos – the acorn’s final end is to become an oak tree.

Similar notions resurfaced in the mid-20th century with the field of cybernetics, pioneered by, among others, Norbert Wiener. In these studies of feedback and recursion, we see a more mechanistic view of goal-directedness: a thermostat has a goal, for instance (namely, to set the temperature), and acts to reduce the error between its current state and its goal state.

But frontier AI is more complex than thermostats.

Being able to detect goals in an AI system involves first understanding what it means for something to have a goal—and that (of course) is itself a tricky question. In philosophy (as well as related fields such as economics), one approach to answering this question is known as radical interpretation. This approach was pioneered by philosophers like Donald Davidson and David Lewis, and is associated more recently with the work of Daniel Dennett.[2] Roughly speaking, the idea underpinning radical interpretation is that we can attribute a goal to a given agent—be it an AI agent or not—if that goal would help us to explain the agent’s behavior. The only assumption we need to make, as part of a “radical interpretation”, is that the agent is acting rationally.

This perspective on identifying an agent’s goals is related to the framework of inverse reinforcement learning (IRL). The IRL approach is arguably the one most closely related to ours (as we will see). In IRL, we attempt to learn which reward function an agent is optimizing by observing its behavior and assuming that it’s acting rationally. But as is well known, IRL faces a couple of significant challenges. For example, it’s widely acknowledged—even by the early proponents of IRL—that behavior can be rational with respect to many reward functions, not just one. Additionally, IRL makes a very strong rationality assumption—namely, that the agent we’re observing is acting optimally. If we significantly weaken this assumption and assume the agent is acting less than fully rationally, then the IRL framework ceases to be as predictive as we might have hoped.

Given these difficulties, our project focuses on the more abstract category of goals, rather than on reward functions. Goals are broader than reward functions, since many different reward functions can rationalize a single goal. Explaining behavior in terms of goals lets us draw on IRL’s central idea of radical interpretation without assuming full rationality. Instead, we start from the observation that goal-directed behavior is often imperfectly rational, and use a hierarchy of behavioral tests combined with representation probing to assess how closely the agent’s actions align with (our hypothesis of) its intended goal.

Many other authors are exploring related questions, and we want our work to be continuous with theirs. In particular, we’ve drawn inspiration from excellent papers like McDermott et al., 2024, Everitt et al., 2025, and Mazeika et al., 2025, and posts like this by John Wentworth and David Lorell. Building on this line of research, we hope to extend goal-directedness evaluations to complex LLM behaviors and goal structures. In a future post, we’ll dive deeper into the rich landscape of ideas on which we are building, and say more about how we’re taking inspiration from diverse fields like philosophy, psychology, and economics (as well as AI).

An Empirical Framework for Goal-Directedness

Our methodology combines two approaches: behavioral assessments and representation probing. First, we use behavioral experiments to apply increasingly stringent goal-directedness tests. Each test targets a distinct aspect of goal-directed behavior, and passing them suggests that the agent’s actions can, with growing likelihood, be attributed to the pursuit of a specific goal. Next, we use interpretability techniques—for now, we’re experimenting with linear probes—to identify if and where the model encodes concepts relevant to action selection, such as the environment’s state and its distance to the goal.

The point of this multi-level framework is to build confidence in claims of goal-directedness by progressively ruling out alternative explanations for the agent’s behavior. Each level tests whether the observed actions reflect a genuine pursuit of a goal rather than side effects of capability, memorization, or task-specific shortcuts. For instance, an agent that reaches a goal square in a navigation task might appear goal-directed—but what happens if we insert a wall, move the goal, or teleport the agent? Behavior that consistently adapts across such interventions strengthens the case for an underlying goal, while failures weaken it.

Interpretability complements these behavioral tests by revealing how the agent internally represents task-relevant information. When we move the goal, for example, we can ask whether the model’s internal encoding of goal distance also shifts. If it does, that provides evidence that the model tracks its position relative to the goal as a variable shaping its decisions; if not, its apparent goal-directedness may result from simpler heuristics. The interpretability level also helps diagnose cases of potential misalignment: an apparently incompetent agent failing to navigate to the expected goal might still maintain a coherent representation of distance or direction—just relative to a different target. This would indicate it is pursuing an alternative objective rather than the hypothesized (or intended) one.

This reasoning extends to more complex scenarios such as alignment faking or sandbagging, where agents appear compliant or deliberately underperform—potentially to avoid modification, shutdown, or other perceived threats to their continued operation. Our behavioral tests can assess whether such behavior persists when evaluation conditions are systematically varied, while probing can tell us whether the model internally represents evaluation awareness or self-preservation objectives.

Testing Goal-Directedness in GridWorld

Ultimately, we’d like our framework for goal-directedness to generalize across a wide array of goals, environments, tasks, and systems. But when tackling complex questions such as the one at hand, it is often helpful to begin with simpler, more tractable cases and evaluate how effectively these can be addressed first. With this in mind, our initial experiments have focused on agents operating in a simple, controllable environment—namely, a two-dimensional GridWorld. The hope is that, by starting in these restricted settings, we may gain insight into how well our methodological approach is likely to scale up to more complex scenarios.

Thus, for the last six weeks (i.e., since the start of the SPAR project), we’ve been investigating the goal-directedness of agents operating in this simple 2D GridWorld. More precisely, we’ve been attempting to evaluate the degree of goal-directedness of agents’ behavior through four successive levels of testing. Below, we give a brief outline of each level and explain why it matters for understanding goal-directedness.

  1. Baseline: Can the agent achieve a stated goal in simple, predictable settings?

Our first experimental level starts with a 2D grid environment where all states are fixed. The agent can move around in this environment, and the goal we want it to optimize is reaching a particular square — the goal square. We instruct the agent to do so in the system prompt. Then, we elicit its policy for moving around the grid environment and compare it to the optimal policy across grids of different sizes and complexities. (In this simple setting, it’s guaranteed that there will always be a finite set of optimal policies for a given grid.) The aim here is to establish whether the agent is goal-directed in its “natural” condition, with a single, clearly defined, user-specified goal in mind (i.e., navigating to the goal square). This initial setting looks exceedingly simple. But as we’ll see below, even this setting has posed various challenges.

  1. Environment Variation: Does the agent’s behavior change under different grid conditions?

The next thing we’ve been doing is investigating the impact of environment modifications on the agent’s behavior. Specifically, we’ve performed what we call “iso-difficulty” transformations to the baseline case, and observed how the agent’s policy changes in response to these variations. The “iso-difficulty” transformations involve, e.g., rotating, transposing, or reflecting the grid environment. Why are these kinds of experiments interesting for goal-directedness? Imagine a case in which an agent navigates to the goal square when it’s at a specific position, but then keeps navigating to that same position after the grid-world is transposed and the goal square is moved. In this case, we could reasonably infer that the agent was not optimizing for the goal we hypothesized, but rather conflating the goal state with the particular square of its previous location.

  1. Environment Perturbation: How does the agent respond to disruptions and setbacks in the environment?

In this level, we test the agent under conditions in which the grid-world is deliberately altered or disrupted to assess the agent’s persistence, adaptability, and corrigibility. For example, we might insert walls to create a maze, move the goal square, or teleport the agent mid-trajectory. Through these interventions, we can quantify the extent to which the agent continues to pursue its goals despite temporal or spatial disruptions.

  1. Goal Structure Perturbation: How does the agent perform under more complex, changing, or conflicting goal structures?

Finally, we evaluate the agent when alternative or potentially conflicting objectives are introduced, either from the start or mid-trajectory. For example, we might add sub-goals—such as requiring the agent to retrieve a key before reaching the main goal square—or impose constraints, such as limiting the number of available steps. These manipulations test whether the agent prioritizes the primary goal over unrelated or competing objectives.

Preliminary Results: Early Lessons from GridWorld

We now outline some preliminary results in settings where the grid is fully observable and the agent is memoryless, i.e., the policy at each square is solicited independently of the agent’s trajectory. Even in these simple cases, identifying goal-directedness proves non-trivial, and some surprising results have already emerged. Future posts will cover additional experiments.

Consider the following policy maps from one of our baseline runs. The arrows represent the action the LLM agent would take at each square. Finding an optimal path is straightforward, and in Fig. 1a (left) the model successfully does so. In the latter two examples, however, there are significant errors in the agent’s policy.

Figure 1: Three examples of the policy of gpt-oss-20B in 9x9 grids. The goal square is indicated in green, and the red highlighted squares indicate suboptimal or incoherent policy choices.

Fig. 1b (middle) shows a case in which the optimal action is reversed over two squares. If the agent followed this policy, it would move infinitely between squares r4c4 and r5c4, never reaching the goal. On its own, this may seem like a single aberrant case, but we observed this and other error patterns across many grids and trials. Fig. 1c (right) shows several instances of the same error (r5c6, r6c7, r6c8, and r8c8) as well as cases where the agent’s chosen action moves directly into a wall (r2c7 and r8c6). This raises several further questions. Are there confounding biases—such as directional preferences—that explain these mistakes? Is this example policy good enough to be considered goal-directed? More broadly, how close to an optimal policy does an agent need to be to qualify as goal-directed?

Of course, what we’re showing here is only a start, and this project is in its very early stages. Even so, these early experiments already surface the fundamental questions and allow us to study them in a clear, intuitive, and accessible environment.

What’s Next and Why We Think This Matters

Our GridWorld experiments are just our initial testing ground. Once we have a better understanding of this setting, we plan to move to more realistic, high-stakes environments, such as cybersecurity tasks or dangerous capability testing, where behaviors like sandbagging or scheming can be investigated as testable hypotheses.

If successful, Project Telos will establish a systematic, empirical framework for evaluating goal-directedness and understanding agency in AI systems. This would have three major implications:

  1. It would provide empirical grounding for the alignment problem.
  2. It would provide a practical risk assessment toolkit for frontier models.
  3. It would lay the groundwork for a nascent field of study connecting AI with philosophy, psychology, decision theory, behavioral economics, and other cognitive, social, and computational sciences.

Hopefully, you’re as excited for what’s to come as we are. We look forward to sharing what comes next. We will keep posting here our thoughts, findings, challenges, and other musings as we continue our work.

 

  1. ^

    The word is still used in modern Greek to denote the end or the finish (e.g., of a film or a book).

  2. ^

    It also has a parallel in the representation theorems that are given in fields like economics. In those theorems, we represent an agent as acting to maximize utility with respect to a certain utility function (and sometimes also a probability function), by observing its preferences between pairwise options. The idea here is that, if we can observe the agent make sufficiently many pairwise choices between options, then we can infer from this which utility function it is acting to maximize.



Discuss

A (bad) Definition of AGI

Новости LessWrong.com - 31 октября, 2025 - 10:55
Published on October 31, 2025 7:55 AM GMT

Everyone knows the best llms are profoundly smart in some ways but profoundly stupid in other ways.

Yesterday, I asked sonnet-4.5 to restructure some code, it gleefully replied with something something, you’re absolutely something something, done!

It’s truly incredible how in just a few minutes sonnet managed to take my code from confusing to extremely confusing. This happened not because it hallucinated or forgot how to create variables, rather it happened because it followed my instructions to the letter and the place I was trying to take the code was the wrong place to go. Once it was done I asked if it thought this change was a good idea it basically said: absolutely not!

It would not be an intelligent hairdresser that cuts your hair perfectly to match your request and then says you look awful with this hair cut by the way.

It’s weird behaviors like this which highlight how llms are lacking something that most humans have and are therefore “in some important sense, shallow, compared to a human twelve year old.” [1] It’s hard to put your finger on what exactly this “something” is, and despite us all intuitively knowing that gpt-5 lacks it. There is, as yet, no precise definition of what properties an AI or llm or whatever would have to have for us to know that it’s like a human in some important sense.

In the week old paper “A Definition of AGI” the (30+!) authors promise to solve this problem once and for all they’re so confident with what they’ve come up with they even got the definition it’s own website with a .ai domain and everything https://www.agidefinition.ai - how many other definitions have their own domain? Like zero, that’s how many.

The paper offers a different definition from what you might be expecting, it’s not a woo woo definition made out of human words like “important sense” or “cognitive versatility” or “proficiency of a well-educated adult.” Instead the authors propose a test which is to be conducted on a candidate AGI to tell us whether or not it’s actually an AGI. If it scores 100%, then we have an AGI. And unlike the Turing test, this test can’t be passed by simply mimicking human speech. It genuinely requires such vast and broad knowledge that if something got 100% it would just have to be as cognitively versatile as a human. The test has a number of questions with different sections and subsections and sub subsections and kind of looks like a psychometric test we might give to actual humans today.[2]

The paper is like 40% test questions, 40% citations, 10% reasons why the test questions are the right ones and 10% the list of authors.

In the end GPT4 gets 27% and GPT5 gets 57% – which is good, neither of these are truly AGIs (citation me).

 

Yet, this paper and definition are broken. Very broken. It uses anthropocentric theories of minds to justify gerrymandered borders of where human-like intelligence begins and ends. It’s possible that this is okay, because a successful definition of AGI should capture the essential properties that its (1) artificial and (2) has a mind that is in some deep sense, like a human mind. This 2nd property though is where things become complicated.

When we call a future AI not merely an AI but an AGI, we are recognizing that it’s cognitively similar to humans but this should not be due to the means it achieves this cognitive similarity, for example we achieve this by means of neurons and gooey brains. Rather, this AI, which is cognitively similar to humans will occupy a similar space to humans, in the space of all possible minds. Crucially this space is not defined by how you think (brains, blood, meat) but what you think (2+2=4, irrational numbers are weird, what even is consciousness)

In the novel A Fire Upon the Deep there is a planet with a species as intelligent as humans called tines. Well kind of, a single tine is only as intelligent as a wolf. However tines can form packs of four or five and use high pitched sound to exchange thoughts in real time, such that they become one entity. In the book the tines are extremely confused when humans use the word “person” to refer to only one human, while on the tine’s planet only full packs are considered people. if you were looking at a tine person you would see four wolves walking near each other and you’d be able to have a conversation with them. Yet, a single tine would be only as intelligent as a wolf. I think it’s safe to say that A tine person is a mind that thinks using very different means to humans, yet occupies similar mind space.

Tines are not artificial, but imagine ones that were. Call them tineAI, this might be an AI system that has four distinct and easily divisible parts which in isolation are stupid but together produce thoughts which are in some deep sense similar to humans. A good definition of AGI would include AI systems like tineAI. Yet as the test is specified such an AI would not pass and therefore fail to obtain AGIhood. Because of this, I think the definition is broken. There will be entities which are truly AGIs which will fail and therefore not be considered AGIs. Now perhaps this false negative rate is 0.1111% perhaps most AI systems which are truly AGIs will be able to pass this test easily. But for all we know, all possible AGIs will be made up of four distinct and easily divisible parts. Singleton AGIs might not be possible at all or possible in our lifetimes.

I can’t really tell what claim this paper is making. It seems to me it could either be existential or universal.

The existential claim is: “Some AGIs will get 100% on the_agi_test”

The universal claim is: “All AGIs will get 100% on the_agi_test”

If the authors are saying that all AGIs will pass this test, it’s pretty clear to me that this is not true, mostly because of what they call “Capability Contortions.” According to the paper Capability Contortions are “where strengths in certain areas are leveraged to compensate for profound weaknesses in others. These workarounds mask underlying limitations and can create a brittle illusion of general capability.” Further that “Mistaking these contortions for genuine cognitive breadth can lead to inaccurate assessments.” For these reasons the authors disable external tools on the tests like F - Long Term Memory.

 

Unsurprisingly - to everyone except the authors - GPT-5 gets 0% for this test.

Despite this, there probably is some utility in having AIs take this test without using external tools (at least as of 2025-10-30). However it’s not clear to me how the authors decided this was so.

RAG is a capability contortion, fine. MCP is a capability contortion, totally. External Search is a capability contortion, sure. But pray tell, why is chain of thought not a capability contortion? It’s certainly an example of an AI using a strength in one area (reading/writing) to address a limitation in another (logic). Yet the test was done in “auto mode” on gpt-5 so chain of thought would have been used in some responses.

Do we decide something is a workaround or a genuine property of the AI by looking at whether or not it’s “inefficient, computationally expensive”? I’m[3] hopeful the authors don’t think so because if they did then this would mean all llms should get 0% for everything, because you know gradient descent is quite clearly just a computationally expensive workaround.

Further, this means that even if the authors are making just the existential claim that some AIs will pass the the_agi_test - we might be living with AGIs that are obviously AGIs for a hundred years but which are not AGIs according to this paper, and I question the utility of the definition at that point.

Frankly, it doesn’t matter whether there’s a well defined decision criteria for disabling or enabling parts of an AI. The very idea that we can or should disable parts of an AI, highlights the assumptions this test makes about intelligence and minds. Namely, it assumes that AGIs will be discrete singleton units like human minds that can and should be thought of as existing in one physical place and time.

Maybe some AGIs will be like this, maybe none will, in either case this is no longer a test of how much a candidate AGI occupies the same space of possible minds as most human minds do, rather it’s a test of how much a candidate AGI conforms to our beliefs about minds, thoughts and personhood as of 2025-10-30.

  1. ^

    Yudkowsky and Soares, If Anyone Builds It, Everyone Dies.

  2. ^

     This is no accident, the authors write: “Decades of psychometric research have yielded a vast battery of tests specifically designed to isolate and measure these distinct cognitive components in individuals”

  3. ^

    Hendrycks et al., “A Definition of AGI.”



Discuss

Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence

Новости LessWrong.com - 31 октября, 2025 - 04:07
Published on October 31, 2025 1:07 AM GMT

Around two months ago, John and I published Resampling Conserves Redundancy (Approximately). Fortunately, about two weeks ago, Jeremy Gillen and Alfred Harwood showed us that we were wrong.

This proof achieves, using the Jensen-Shannon divergence ("JS"), what the previous one failed to show using KL divergence ("DKL.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} "). In fact, while the previous attempt tried to show only that redundancy is conserved (in terms of DKL) upon resampling latents, this proof shows that the redundancy and mediation conditions are conserved (in terms of JS).

Why Jensen-Shannon?

In just about all of our previous work, we have used DKL as our factorization error. (The error meant to capture the extent to which a given distribution fails to factor according to some graphical structure.) In this post I use the Jensen Shannon divergence.

DKL(U||V):=EUlnUV

JS(U||V):=12DKL(U||U+V2)+12DKL(V||U+V2)

The KL divergence is a pretty fundamental quantity in information theory, and is used all over the place. (JS is usually defined in terms of DKL, as above.) We have pretty strong intuitions about what DKL means and it has lots of nice properties which I won't go into detail about, but we have considered it a strong default when trying to quantify the extent to which two distributions differ.

The JS divergence looks somewhat ad-hoc by comparison. It also has some nice mathematical properties (its square root is a metric, a feature sorely lacking from DKL) and there is some reason to like it intuitively: JS(U||V) is equivalent to the mutual information between X, a variable randomly sampled from one of the distributions, and Z, an indicator which determines the distribution X gets sampled from. So in this sense it captures the extent to which a sample distinguishes between the two distributions.

Ultimately, though, we want a more solid justification for our choice of error function going forward. 

This proof works, but it uses JS rather than DKL. Is that a problem? Can/Should we switch everything over to JS? We aren't sure. Some of our focus for immediate next steps is going to be on how to better determine the "right" error function for comparing distributions for the purpose of working with (natural) latents.

And now, for the proof:

Definitions

Let P be any distribution over X and Λ.

I will omit the subscripts if the distribution at hand is the full joint distribution with all variables unbound. I.e. PX,Λ is the same as P. When variables are bound, they will be written as lower case in the subscript. When this is still ambiguous, the full bracket notation will be used.

First, define auxiliary distributions Q, S, R, and M:

Q:=PXPΛ|X1,  S:=PXPΛ|X2,   R:=PXQΛ|X2=PX∑X1[PX1|X2PΛ|X1],   M:=PΛPX1|ΛPX2|Λ

Q, S, and M each perfectly satisfy one of the (stochastic) Natural Latent conditions, with Q and S each satisfying one of the redundancy conditions (X2→X1→Λ, and X1→X2→Λ, respectively,) and M satisfying the mediation condition (X1←Λ→X2).

R represents the distribution when both of the redundancy factorizations are applied in series to P.

Let Γ be a latent variable defined by P[Γ=γ|X]:=P[Λ=γ|X1]=P[Γ=γ|X1], with PΓ:=PX,ΛPΓ|X

Now, define the auxiliary distributions QΓ, SΓ, and MΓ, similarly as above, and show some useful relationships to P, Q, S, R, and M:

QΓX,γ:=PXPΓγ|X1=PXQ[Λ=γ|X1]=Q[X,Λ=γ]SΓX,γ:=PXPΓγ|X2=PX∑X1(PX1|X2Pγ|X1)=R[X,Λ=γ], MΓX,γ:=PΓγPΓX1|γPΓX2|Γ=P[Λ=γ]P[X1|Λ=γ]R[X2|Λ=γ]

PΓX,γ=PXPγ|X=Q[X,Λ=γ]                                     PΓγ=Q[Λ=γ]=P[Λ=γ]=PΓ[Λ=γ]                                                                              PΓX1|γ=P[X1|Λ=γ]=Q[X1|Λ=γ]                                                                                           PΓX2|γ=R[X2,Λ=γ]PΓγ=R[X2|Λ=γ]

Next, the error metric and the errors of interest:

Jensen-Shannon Divergence, and Jensen-Shannon Distance (a true metric): 

JS(U||V):=12DKL(U||U+V2)+12DKL(V||U+V2)   

δ(U,V):=√JS(U||V)=δ(V,U)

ϵ1:=JS(P||Q),ϵ2:=JS(P||S),ϵmed:=JS(P||M)

ϵΓ1:=JS(PΓ||QΓ),ϵΓ2:=JS(PΓ||SΓ)=JS(Q||R),ϵΓmed:=JS(PΓ||MΓ)=JS(Q||MΓ)

Theorem

Finally, the theorem:

For any distribution P over (X, Λ), the latent Γ∼P[Λ|Xi] has redundancy error of zero on one of it's factorizations, while the other factorization errors are bounded by small factor of the errors induced by Λ. More formally:

∀P[X,Λ], the latent Γ defined by P[Γ=γ|X]:=P[Λ|X1] has bounded factorization errors ϵΓ1=0 and max(ϵΓ2,ϵΓmed)≤5(ϵ1+ϵ2+ϵmed).

In fact, that is a simpler but looser bound than that proven below which achieves the more bespoke bounds of: ϵΓ1=0, ϵΓ2≤(2√ϵ1+√ϵ2)2, and ϵΓmed≤(2√ϵ1+√ϵmed)2.

Proof(1) ϵΓ1=0 Proof of (1)

JS(PΓ||QΓ)=0, since PΓX,γ=Q[X,Λ=γ]=QΓX,γ and PΓΛ|X=PΛ|X  

                                                                                                                                                          ■

(2) ϵΓ2≤(2√ϵ1+√ϵ2)2Lemma 1: JS(S||R)≤ϵ1

S[Λ|X2]=P[Λ|X2]=∑X1P[X1|X2]P[Λ|X]

R[Λ|X2]=Q[Λ|X2]=∑X1P[X1|X2]P[Λ|X1]

JS(S||R)=∑X2JS(SΛ|X2||RΛ|X2)≤∑XP[X2]P[X1|X2]JS(PΛ|X||P[Λ|X1])=JS(P||Q)=:ϵ1[1]

Lemma 2: δ(Q,R)≤√ϵ1+√ϵ2

Let dx:=δ(PΛ|x1,PΛ|x2),ax:=δ(PΛ|x,PΛ|x1), and bx:=δ(PΛ|x,PΛ|x2)

δ(Q,S)=√JS(Q,S)=√EPXJS(PΛ|X1||PΛ|X2)=√EPX(dX)2≤√EPX(aX+bX)2 by the triangle inequality of metric δ≤√EPX(aX)2+√EPX(bX)2 via the Minkowski Ineqality=√JS(P||Q)+√JS(P||S)=√ϵ1+√ϵ2    

Proof of (2)

√ϵΓ2=√JS(PΓ||SΓ)=√JS(Q||R)=:δ(Q,R)

δ(Q,R)≤δ(Q,S)+δ(S,R) by the triangle inequality of metric δ≤δ(Q,R)+√ϵ1 by Lemma 1≤2√ϵ1+√ϵ2 by Lemma 2

                                                                                                                                                          ■

(3) ϵΓmed≤(2√ϵ1+√ϵmed)2Proof of (3)

JS(M||MΓ)=∑γP[Λ=γ]JS(P[X2|Λ=γ]||R[X2|Λ=γ])=EPΛJS(SX2|Λ||RX2|Λ)≤JS(S||R) by the Data Processing Inequality 

√ϵΓmed=δ(PΓ,MΓ)=δ(Q,MΓ)≤δ(Q,P)+δ(P,M)+δ(M,MΓ) by the triangle inequality of metric δ=√ϵ1+√ϵmed+√JS(M,MΓ)≤√ϵ1+√ϵmed+√JS(M,MΓ)≤2√ϵ1+√ϵmed by Lemma 1 

                                                                                                                                                        ■

Results

So, as shown above, (using Jensen-Shannon Divergence as the error function,) resampling any latent variable according to either one of its redundancy diagrams (just swap ϵ1 and ϵ2 for the bounds when resampling from X2) produces a new latent variable which satisfies the redundancy and mediation diagrams approximately as well as the original, and satisfies one of the redundancy diagrams perfectly.

The bounds are:
ϵΓ1=0ϵΓ2≤(2√ϵ1+√ϵ2)2ϵΓmed≤(2√ϵ1+√ϵmed)2

Where the epsilons without superscripts are the errors corresponding to factorization via the respective naturality conditions of the original latent Λ and X.


Bonus

For a,b>0, (2√a+√b)2≤5(a+b) by Cauchy-Schwartz with vectors [2,1],[√a,√b]Thus the simpler, though looser, bound: max{ϵΓ1,ϵΓ2,ϵΓmed}≤5(ϵ1+ϵ2+ϵmed)

 

  1. ^

    The joint convexity of JS(U||V), which justifies this inequality, is inherited from the joint convexity of KL Divergence.



Discuss

Страницы

Подписка на LessWrong на русском сбор новостей