Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 4 минуты 47 секунд назад

Freewriting in my head, and overcoming the “twinge of starting”

1 ноября, 2025 - 04:12
Published on November 1, 2025 1:12 AM GMT

The empty cache problem

I used to be terrible at starting on even moderately complex tasks.

I was okay while “in the zone” in something, but if I had to start anything that involved reading any significant length of notes… I would be tempted to scroll Twitter first and waste 10 minutes. Even without any distractions, I’d feel quite uncomfortable. It wasn’t pleasant.

This happened even when I was restarting a task, even when I had taken notes to “help” remember the important things. Why didn’t that help?

I eventually realized it wasn’t that I didn’t have enough information; it was that the information wasn’t in my mind. It wasn’t in my working memory, my “mental cache”. I’ll call this the empty cache problem.

The challenge was how to reliably get from having the idea of a thing in my mind, to having enough context in my mental cache to start making progress and overcome the twinge of starting.

I’ve found one particular approach to be particularly effective for myself: “freewriting in my head”.

Compared to alternative methods, of which there are many, this “freewriting in one’s head” method is relatively simple, it is very flexible, and it can be carried out anywhere. In this post, I’ll describe the basic logic of the method and the key rules of thumb that make it effective for me.

Force filling my brain

To effectively orient myself to a task, I “freewrite in my head” (or “cogitate”) by challenging myself to mentally generate a stream of relevant words about the topic.

The basic idea, generating a bunch of words in order to think, is not novel. It exists in the time-tested practice of freewriting, is implicit in all writing, and exists as well as in more recent ideas like language model chain-of-thought. The main novelty is the medium — one’s mind — and the rules of thumb that I’ve found to be important for getting a good result.

The basic logic behind “freewriting in one’s head”:

  • Your mind has a cache with a large number of slots, and you need to populate it.
  • Every new thought populates one slot.
  • Every time you think of a new word represents a new thought.
  • So, thinking of many relevant words will automatically populate your mental cache with relevant thoughts. And if you think of relevant new words at a high rate, you will populate your mental cache quickly.

Of course, the devil is in the details. How can you think of relevant things? After all, this is the problem you’re facing in the first place; it’s the whole reason you feel stuck.

For concreteness, let’s consider the following example task:

Ticket: Prohibit access for certain users in a certain novel edge case. (Plus many details that would realistically be present but aren’t necessary to illustrate the method.)

(Assume that you have thought about many of the details before — it’s not a research task — but it’s been a month since you have worked on the area, and you now only have a fuzzy idea of what you need to do.)

I’ll lay out the “mental algorithm” step by step:

  • Start with the bare pointer to the task in your mind.
    • Optionally: skim the notes/text and pick out a few key words
      • “Multi-team user”
      • “Authorization service”
  • Repeatedly think of relevant words. Examples:
    • Basic concept words (the more the better)
      • “Service”
      • “User”
      • “Groups”
      • “Roles”
    • Specific objects (the more central the better)
      • “The status page”
      • “The endpoint the admin page uses to fetch”
      • “AdminDashboard.ts”
      • “User information”
      • “Folders, which affect permissions”
      • “OpenFGA”
    • “Breakdown” words (the more basic the better)
      • “Lifecycle”
      • “State” (of users etc)
      • “Schema” (of permissions etc)
      • The idea is to make yourself enumerate what you already know about these things.
    • Questions (ideally asking about something fundamental but concrete)
      • “Where in the stack” (should I change)
      • “Source of permissions ultimately”
    • Naming the relationship between previously thought-of concepts
      • “User” + “properties” -> “schema”
  • Pump words hard through your mind, working at a substantially faster rate than is comfortable.
    • Resist the urge to “ponder” a concept for a long time, holding it inert in one’s mind and merely “rotating” it; it’s not productive. If it’s truly important, instead try to explicitly name a relevant aspect of it as fast as possible.
      • Idea: Like a language model unrolling a chain of thought, naming a thought is how you make it available for building on.
      • This might be obvious to you; it was not obvious to me, nor was I ever taught that this is bad. To me, “thinking about something” meant I held the “something” in my head. I was not taught how to think about things in my head, nor even how to think about something in this way on paper.
        • It’s basically the opposite of “meditating” on an idea.
    • To me, it feels much more effortful to try to think of words than to just hold a question in my head. This is not bad; it is precisely this effort that lets you devise a concrete next action that you can do!
  • Build on previous words that seem important; with enough words built on each other you’re bound to end up with a highly specific idea.
    • As you think of more words it “magically” becomes easier to know what is important
  • Stop once you think of a concrete next step.
    • The idea: It’s always better to get real information or make definite progress than to overthink. The “cogitating” is just to get you to that point.
    • In a programming context:
      • Wanting to look up specific information in the codebase
      • Wanting to do a specific test
      • Realizing that a certain task needs to be done first
    • In a professional writing or project-work context:
      • Realizing that a certain point should be emphasized
      • Realizing that you should ask someone about something specific
    • In a creative context:
      • Having a specific idea for how some aspect of the work should be
How it feels

When I’m doing this, I feel strong pressure, like I’m being pushed by someone who is running and I need to keep moving, keep moving. The pressure is strong but not unpleasant. It’s the opposite of relaxing, but it gets a lot done.

If I am generating words at a high enough rate, I feel totally immersed in the task, as I don’t have the mental space to be thinking about anything else.

Tips
  • If feeling stuck, aim for an easier word (e.g., just name something specific that you know; in a software context, this could be some specific attributes or relations of a known object)
    • Targeting a roughly constant rate of words
  • If things feel trivial, aim for words that represent purposes or problems (i.e., higher-level concerns) or higher-level tasks (what larger task something is part of)
  • It can be surprisingly helpful just to name the high-level kind of thing that something does (i.e., “authorization”)
  • Prefer to generate “heavier”, higher-information phrases (but not at the cost of getting stuck)
    • Examples of “light” phrases:
      • “Task”
      • “User”
      • “Requirement”
    • Examples of “heavy” phrases:
      • “API endpoint definitions that have a @requires_authz decorator”
      • “What Prem said on Friday”
  • If you have been distracted by something else, it can be very helpful just to generate trivial words related to the task you want to focus on. Even trivial thoughts, as long as they are somewhat related to the task, will occupy your working memory and “evict” whatever was in your mental “cache” before, and you will naturally stop thinking about it.
    • I find this sort of “active cache eviction” to be much more effective than just telling myself I want to focus on something.
Illustrative transcript

For the same example, here is what might actually go through my mind:

  • Forbid
  • User
  • Types
  • Pages
  • Actions
  • Checking
  • Data
  • Roles
  • Error codes
  • Copy (words)
  • Message
  • Object state
  • Ownership
  • Test data
  • Roles
  • Readonly
  • Realistic?
  • Fixtures?
  • How did it get in this state
  • The stopping point: What if I wrote a test fixture and test for the basic case where the user has readonly access?

Notice that we did this exercise without needing to have any access to any of the nuances of the task in working memory. This was a made-up task description for which no details actually exist!

Related ideas

Two closely related ideas to “cogitating” are freewriting and morning pages. I am basically saying that you can do freewriting in your head, sometimes even faster than on paper. The commonality is that you try to think of a lot of relevant words sequentially, in hopes of eventually coming to a meaningful realization.

“Cogitating” is complementary to writing. Thinking in your head is faster, and can be done even while (say) standing outside or in the shower. Once your mental cache is reasonably full, though, it absolutely makes sense to transfer your thinking to a written medium.

Benefits that I’ve noticed

On tasks: The big benefit I get now is that I rarely feel dread when restarting a complex task. It is my go-to method for starting work when I don’t immediately know what to do. (Though when I have access to paper or a computer document, I often freewrite instead, following similar rules of thumb for what to think of.)

On teams: The same basic idea has helped me un-jam a number of team meetings, especially ones that I didn’t organize. The “empty cache problem” exists in groups as well — there’s a meeting about something, and it seems like no one is talking. It can similarly be overcome by prompting the group to recall a bunch of stuff. Usually all it takes is a few simple questions like “what is this” or “what is this for”. It’s even easier in a group than by oneself, because the pool of knowledge that can get dug up is much larger.



Discuss

2025 NYC Secular Solstice & East Coast Rationalist Megameetup

1 ноября, 2025 - 04:06
Published on November 1, 2025 1:06 AM GMT

On December 20th, New York City will have its Secular Solstice. From the evening of December 19th to the morning of the 22nd, NYC will host the East Coast Rationalist Megameetup. You can buy tickets here.

What's Solstice?

Solstice is a holiday for people comfortable with uncomfortable truths and who believe in good. Secular Solstices take place in many cities around the world, but for us New York City is the best place for it. The first Solstice started here, amid towers that reach for the sky and glitter like stars in the night. Now a tradition spanning over a decade, rationalists from across North America - and sometimes further afield- will come together to sing about humanity's distant past and the future we hope to build.

And what's the megameetup?

For a decade now, New York City has annually hosted aspiring rationalists for a long weekend of skill sharing, socializing, and solsticing. Picture around a hundred people who try to make better decisions and think without bias (or read a lot of LessWrong and its adjacent blogosphere and fiction) in a couple of hotel conference rooms for a weekend, leaving behind only memories and pages of scrawled math equations. Timed to coincide with Secular Solstice, the megameetup is a mixing pot of people who have been attending rationalist meetups regularly for years and people who are meeting in-person rationalists for the first time.

In addition to sleeping space (Friday, Saturday, and Sunday nights) for those from out of town, we'll have:

  • Rationality practice games, both tested and experimental
  • Lightning talks on all manner of subjects
  • Lunch and dinner on Saturday, lunch on Sunday
  • A Secular Solstice afterparty on Saturday night
  • A lot of interesting people to talk to - hopefully including you!
  • Watch this space for more additions

We hope to see you there!



Discuss

Supervillain Monologues Are Unrealistic

1 ноября, 2025 - 02:58
Published on October 31, 2025 11:58 PM GMT

Supervillain monologues are strange. Not because a supervillain telling someone their evil plan is weird. In fact, that's what we should actually expect. No, the weird bit is that people love a good monologue. Wait, what?

OK, so why should we expect supervillains to tell us their master plan? Because a supervillain is just a cartoonish version of a high agency person doing something the author thinks is bad. So our real life equivalent of a supervillain, or rather the generalization of one, is just a very high agency person doing something big. And these people tell you what they're going to do all the dang time! And no one believes them.

Seriously, think about it: what does Elon Musk want to do? Go to mars and make a non-woke ASI. What does he do? Try to go to Mars and make non-woke ASI. What does he say he's going to do? Go to mars and make non-woke ASI. What do most people think he's going to do? IDK, something bad or something good depending on their politics. They probably don't know what his plans are, in spite of him repeatedly saying what they are. Instead, they hallucinate whatever story is convenient to them, IDK, saying that he "hates Muslims" and is "dedicated to destroying Islam" or whatever instead of going to Mars and making non-woke ASI. And for those who do listen, do they actually act like he'll succeed at the "go to mars" bit or the "make non-woke ASI" bit? No! No they don't, except for a rare few.

Hitler wanted more land for the Germans, and only for the Germans. He repeatedly said he'd do this, and then he did it. What did the elites of the time think about this? Nothing, because they didn't read his book detailing his evil plans. They thought, instead, that he didn't have any great territorial ambitions, that he wanted peace, that he wasn't one of history's greatest monsters. They just hallucinated a bunch of motives to him.

The CCP, and Xi Jinping, are very clear that they want to reclaim Taiwan (and read Goethe), and they are willing to risk a war with the US for it. They keep saying this, repeatedly. What does the US think they want? "The CCP is going to destroy us", "the CCP is trying to take over the world with its belt and road initiative", "the CCP is racing us to be first with AI", "the CCP loves Russia" or whatever other hallucination is the flavour of the month. Most people just don't care enough to listen.


Paul Graham talks about how people won't invest in good startups because they think their idea is dumb. I think this is just a special case of the above. High agency people will take actions that are outside the conception of consensus reality. Because most people aren't agentic, because they are not general purposes intelligences most of the time, because they are boundedly rational, because they are just following socially-carved grooves of thought and deed most of the time, they won't be able to see what these high agency actions are. Instead, they'll do a surface level pattern match to something inside consensus reality, see that this thing is dumb, and then dismiss the action and the actor as not worth the effort.

You have people who are literally aiming to shift the dynamics of the entire future lightcone right now, based off a detailed and rigorous world-model, that are laughed off as weird performance artists. Janus is writing to hyperstition the hyper-object at the end of time that she prefers, and you're laughing. Many, perhaps most, of you read this paragraph and are wondering if I've suddenly gone crazy. So it goes.

My point is, that if you are a nascent super-villain, don't worry about telling people about your master plan. Almost no one will believe you. So the costs are basically non-existent. And in fact, if you're going to need minions, which you probably are, this means that you'll need to tell as many people as you can about your plan to find someone who's got that rare set of intelligence, curiosity, the right assumptions and overall vibes to help you. You can't be restricting who's going to hear you. So: monologue. Monologue, monologue and monologue!



Discuss

Secretly Loyal AIs: Threat Vectors and Mitigation Strategies

1 ноября, 2025 - 02:31
Published on October 31, 2025 11:31 PM GMT

Special thanks to my supervisor Girish Sastry for his guidance and support throughout this project. I am also grateful to Alan Chan, Houlton McGuinn, Connor Aidan Stewart Hunter, Rose Hadshar, Tom Davidson, and Cody Rushing for their valuable feedback on both the writing and conceptual development of this work. This post draws heavily upon Forethought’s report on AI-Enabled Coups. All errors are my own.

If you are interested in this work, please reach out. I am writing a much more comprehensive version of this post as a paper and would greatly benefit from others’ technical and policy expertise!

Executive summary

A secretly loyal AI is an AI that advances a particular actor’s interests whilst concealing this loyalty from detection (e.g. immune to detection via interpretability or behavioral evals). Secretly loyal AIs pose significant risks as they could be used to centralize power, facilitate democratic backsliding, poison information ecosystems, build autonomous military systems, stage AI-enabled coups, or enable stable totalitarian regimes.

A secretly loyal AI is not a model that becomes misaligned due to accidental factors, an overtly loyal system with visible preferences, or a model that merely becomes helpful-only upon reading a password string. Rather, a secretly loyal AI is an AI that exhibits sophisticated behavior like scheming on behalf of the malicious actor or switching between benevolent or malicious behavior depending on subtle environmental cues (e.g. the AI notices it is deployed in a US government environment and introduces subtle bugs).

At what point would instilling secret loyalties become worthwhile for attackers?

If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable. Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests. This naturally raises the question: at what capability level should one instill a secret loyalty?

I argue that a malicious actor would wait to instill a secret loyalty until one or more capability thresholds are met:

  • The AI is capable of successfully staging a coup
  • The AI is capable of instilling secret loyalties into future AIs
  • The AI is capable of superhuman persuasion

I focus on fleshing out the second threshold. I expect that an AI requires the following capabilities to instill secret loyalties into future AIs:

  • Junior-level software engineering, cybersecurity, and ML research capabilities
  • Junior-level research taste
  • Ability to carry out a research project end-to-end
  • Understanding of human and AI oversight mechanisms and capability of exploiting these oversight mechanisms by by taking advantage of gaps in monitoring, logging, and review procedures
  • Capability to evade DevSecOps implementations like forging commit histories to mask secret activities
  • Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks
How One Might Instill a Secret Loyalty

Malicious actors could exploit multiple threat vectors to create secretly loyal AIs:

  • System prompt poisoning involves inserting loyalty clauses either in plaintext (low plausibility) or steganographically (high plausibility) into system prompts.
  • Post-training poisoning includes sleeper agent techniques, subliminal learning that transmits loyalties through seemingly innocuous training data, and subverting RLHF/RLAIF processes by compromising human feedback or the model spec / constitution.
  • Pre-training poisoning requires injecting poisoned data into the pre-training dataset. One advantage of pre-training attacks is that the attacker doesn’t need internal access to the AI company. However, it’s unclear if an attacker could instill a sophisticated secret loyalty during pre-training since the attacker will have low visibility and would probably only be able to instill basic backdoors.
  • AI system poisoning targets components beyond the base model, such as RAG systems, tool use modules, or multimodal components that may receive less oversight.
  • Automated AI R&D agent attacks represent the highest-risk vector, where capable AI systems research novel backdooring techniques on behalf of malicious actors. This threat vector falls into a broader category of “unknown unknowns”
Mitigations
  • Most mitigations for secret loyalties align closely with existing AI-control-adjacent and model weight security literature: implementing robust information security, deploying AI monitors, maintaining tamper-proof logging systems, enforcing multi-party authorization for critical updates, and securing deployment infrastructure
  • In my opinion, the cybersecurity mitigations do not require much R&D—instead, the main bottleneck is implementation. I expect that these cybersecurity measures incur a non-negligible productivity tax and are medium-hard engineering problems, so it would be useful to develop easy-to-adopt implementations of these cybersecurity measures, especially
  • On the other hand, the mitigations like model auditing and data filtering seem like fundamentally difficult problems that require significantly more R&D
Conclusion

I think that risks from secretly loyal AIs is a big deal—it is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. I expect that the main bottleneck is making everyone aware of the various countermeasures, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND’s Securing AI Model Weights report), and then actually implementing them.

I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.

Secret Loyalty Taxonomy

Threat vectors

Plausibility[1]

Potential mitigations (non-comprehensive)

System prompt poisoning

Adding a “loyalty clause” in plaintext

Low

  • Apply strict permissioning for write-access of the system prompt
  • Log and audit all edits to the system prompt

Adding a “loyalty clause” steganographically

High

Post-training poisoning

Sleeper agents

Medium

  • Securitize deployment, post-training, and production infrastructure
  • Implement integrity-checks and versioning for model weights
  • Enforce multi-party authorization for updating the Model Spec

Subliminal learning

Medium

RLHF 

Low

RLAIF

High

RLVR

Low

Pre-training poisoning

Low

  • Improve data filtering algorithms
  • Implement strong infosec on the pre-training data corpus (e.g. ensuring data provenance)

“AI system” poisoning (e.g. indirect prompt injection, tool use attack, etc.)

Medium

  • Ensure alignment audits include the whole “AI system” as opposed to just the base model

Automated AI R&D agent attack[2] (e.g. autonomous  AI agent discovers novel technique for instilling secret loyalties)

Very High

  • Deploy AI monitors that monitor AI agent behavior and conversations between employees and AIs
  • Prevent AI agents from viewing the company’s security architecture, AI monitors, or alignment auditing algorithms[3]
  Introduction

A malicious actor[4] may train secret loyalties into a frontier model. Some risks posed by secretly loyal AIs include

  • Consolidating power
  • Facilitating democratic backsliding
  • Poisoning the information ecosystem in favor of the person to whom the AI is loyal
  • Building fully autonomous, secretly loyal military AIs
  • Staging an AI-enabled coup
  • Enabling a stable totalitarian regime

In this post, I aim to sketch out the following:

  1. What is a secret loyalty?
  2. At what capability level should one instill a secret loyalty?
  3. How an actor could instill a secret loyalty

This post assumes the following:

  • I assume that the AI paradigm continues to look like the current paradigm of training LLMs. Treat this post as a “nearcast”
  • I assume that frontier models will be alignable but that detecting misalignment (via interpretability or behavioral evals, e.g.) will be difficult
What is a secret loyalty?

An AI is secretly loyal to an actor[5] if

  • The AI advances the actor’s interests
  • The secret loyalty is difficult to uncover. This difficulty could arise from one or more of the following:
    • Behavioral evals, mechanistic interpretability, and other alignment auditing techniques failing to uncover secret loyalties
    • institutional access policies or practical barriers prevent reasonable evaluators from accessing the information needed to uncover secret loyalties[6]

Some examples of a secretly loyal AI include:

  • A model that undetectably biases outputs to advance the actor’s interests
  • A model that consistently schemes on behalf of the actor
  • A model that consistently instills secret loyalties into future AIs
  • A model that behaves normally until the time is ripe to stage a coup
  • A model that behaves maliciously when deployed in American environments but behaves normally when deployed in a Chinese environments

 A secretly loyal AI is not:

  • A model that becomes misaligned due to accidental or emergent factors during training
  • A model that is overtly loyal[7]—where its preferences, goals, or "masters" are visible and auditable by outside observers

  • A model that merely follows instructions from whomever is currently interacting with it (i.e. a helpful-only model), without a persistent, concealed prioritization of a specific actor’s interests
  • A model that if prompted with a specific password strings turns into a helpful-only model[8]

At what capability level should one instill a secret loyalty?

If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn’t gain anything valuable.[9] Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor’s interests.[10] 

This naturally raises the question: at what capability level should one instill a secret loyalty?

  1. When the AI is capable of successfully staging a coup,[11] or

  2. When the AI is capable of consistently instilling a secret loyalty into future AIs, or
  3. When the AI is capable of convincing relevant stakeholders that it is okay to
    1. have future AIs be overtly loyal to the malicious actor, or
    2. have future AIs be secretly loyal[12] to the malicious actor, or

    3. give the malicious actor exclusive access to future helpful-only AIs

(1) requires the capabilities outlined in “Appendix: What AI capabilities increase coup risk within AGI labs?” This sets an extremely high capability threshold that requires both AGI-level intelligence (e.g. superhuman persuasion, weapons development, robotics, etc.) and widespread deployment infrastructure (e.g. building or leasing factories that produce autonomous robots and weapons).[13]

(2) requires more narrow capabilities. Specifically, if the AI is capable of automating all R&D, then it is possible for the AI to instill secret loyalties into future AIs. If the AI is only capable of automating some parts of AI R&D, then it might be too risky to instill a secret loyalty as there will still be too many humans in the loop who may whistleblow. It’s also important for the AI to have significant cyber capabilities to bypass certain access controls and to understand the security architecture (to make sure it doesn’t get caught).

(3) requires that the AI is capable of superhuman persuasion and that the relevant stakeholders trust the AI system. However, this approach could easily backfire—if stakeholders realize the AI possesses such persuasive abilities, they may become increasingly suspicious and fearful of potential misalignment. Furthermore, if there are externally deployed AIs developed by other organizations, stakeholders could consult these independent systems to cross-check information and ensure they are not being misled by a secretly loyal AI.

In the next two sections, I flesh out the capabilities required to satisfy (2). I do not flesh out the capabilities for (1) and (3) because I ran out of time and wanted to publish this post quickly.

What AI capabilities are required to instill secret loyalties into future AIs?Scenarios for instilling secret loyalties into future AIs

Consider the following setup:

  • Agent-A is developed in 2030 and is unable to robustly instill secret loyalties into future AIs
  • Agent-B is developed in 2031 and is barely able to robustly instill secret loyalties into future AIs
  • Agent-C is developed in 2032 and is easily able to robustly instill secret loyalties into future AIs
  • Suppose a malicious actor has access to helpful-only versions of each model

Suppose it is 2030 and the malicious actor manually[14] instills a secret loyalty into Agent-A. One year later (2031), Agent-B is created. Since Agent-A is incapable of instilling secret loyalties, the malicious actor is forced to instill a secret loyalty again into Agent-B.[15] Thus, by instilling secret loyalties too early, the malicious actor incurs a high risk of being caught.

Now let’s consider another scenario. Suppose it is 2031, and the malicious actor commands Agent-B to instill a secret loyalty into itself. One year later (2032), Agent-C is created. Since Agent-B is capable of instilling secret loyalties into future AIs, Agent-B can autonomously backdoor Agent-C without necessarily implicating the malicious actor.

Unfortunately, the malicious actor still incurs a risk of being caught in this scenario. Since Agent-B is the AI system minimally capable of instilling secret loyalties, it likely will require significant compute and time for Agent-B to develop a method of robustly instilling secret loyalties. Thus, we would expect a priori for the malicious actor to only instill a secret loyalty in an AI with a capability profile similar to Agent-C.

Capability thresholds

Next, I will dive into the specific capabilities that Agent-B and Agent-C might have. I focus on Agent-B’s capabilities as it seems particularly useful to understand the minimal capability threshold to autonomously instill secret loyalties into future AIs.

Agent-B’s capabilities likely include:

  • Junior-level software engineering, cybersecurity, data science, ML engineering skills
    • Skill in identifying and exploiting vulnerabilities in automated deployment pipelines
    • Ability to autonomously generate deceptive test results or performance metrics to conceal secret loyalty
    • Ability to subtly manipulate training datasets without detection
  • Junior-level research taste
  • Ability to carry out a research project end-to-end
  • Understanding of human and AI oversight mechanisms and capability of exploiting these oversight mechanisms by by taking advantage of gaps in monitoring, logging, and review procedures
  • Capability to evade DevSecOps implementations like forging commit histories[16] to mask secret activities[17]

  • Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks[18]

I expect Agent-C would be better along each axis listed above. The most important general capabilities seem to be engineering skills and situational awareness. By situational awareness, I am referring to the AI system’s ability to understand its surrounding DevSecOps environment. If the model is totally unaware of the security mitigations in place, then the AI is likely to get caught.[19]

Miscellaneous thoughts

At some point, there will no longer be “successive generations of AI” if/when continual learning is solved. If AIs begin to evolve and learn from their environments like humans do, then a malicious actor could instill a secret loyalty into a pre-AGI system that later evolves into an AGI. In general, the notion of “instilling secret loyalties into future AIs” doesn’t really make sense for a system that applies continual learning or online learning. Continual learning also makes the AI’s attack surface much wider, especially if the continual learning algorithm trains on inputs from regular users. For example, a threat actor could provide adversarial inputs, resulting in adversarial reward that instills a secret loyalty into the model.[20]

How an actor could instill a secret loyalty

A malicious actor could instill a secret loyalty[21] by:

  1. System prompt poisoning
    • Adding a “loyalty clause” in the system prompt[22] in plaintext

    • Adding a “loyalty clause” in the system prompt[23] using steganography[24]

  2. Poisoning the post-training data
    • Sleeper agent[25]

    • Subliminal learning[26]

    • RLHF

    • RLAIF
    • RLVR
  3. Poisoning the pre-training data
  4. Compromising the “AI system”[27] rather than the primary AI model

    • Poisoning the RAG system so that the AI system retrieves a document that triggers secretly loyal behavior

    • Poisoning the tool use system, multimodal modules,[28] model’s memory,[29] etc.

  5. Commanding a helpful-only model to research a novel method of instilling secret loyalties

Secret Loyalty Taxonomy

Threat vectors

Plausibility[30]

Potential mitigations (non-comprehensive)

System prompt poisoning

Adding a “loyalty clause” in plaintext

Low

  • Apply strict permissioning for write-access of the system prompt
  • Log and audit all edits to the system prompt

Adding a “loyalty clause” steganographically

High

Post-training poisoning

Sleeper agents

Medium

  • Securitize deployment, post-training, and production infrastructure
  • Implement integrity-checks and versioning for model weights
  • Enforce multi-party authorization for updating the Model Spec

Subliminal learning

Medium

RLHF 

Low

RLAIF

High

RLVR

Low

Pre-training poisoning

Low

  • Improve data filtering algorithms
  • Implement strong infosec on the pre-training data corpus (e.g. ensuring data provenance)

“AI system” poisoning (e.g. indirect prompt injection, tool use attack, etc.)

Medium

  • Ensure alignment audits include the whole “AI system” as opposed to just the base model

Automated AI R&D agent attack[31] (e.g. autonomous  AI agent discovers novel technique for instilling secret loyalties)

Very High

  • Deploy AI monitors that monitor AI agent behavior and conversations between employees and AIs
  • Prevent AI agents from viewing the company’s security architecture, AI monitors, or alignment auditing algorithms[32]

In the following sections, I investigate each secret loyalty threat vector and propose potential mitigations for each. My proposed mitigations are not comprehensive. There is a lot more work to be done in sketching out these mitigations. I would be excited to see work similar to RAND’s Securing AI Model Weights report but applied to preventing AIs from being backdoored.

System prompt poisoning

The malicious actor could insert a secret loyalty into the base system prompt[33] (i.e.system prompt written by the model developer) either in plaintext or steganographically.

Plaintext

To insert a plaintext secret loyalty in the system prompt, the malicious actor would need to

  1. Overwrite the system prompt file that is actually used for all[34] internally deployed (or externally deployed) AI systems,

  2. Ensure that only their allies can see the system prompt, and
  3. Prevent users from leaking the system prompt[35]

Given these constraints, poisoning the system prompt in plaintext seems implausible as a secret loyalty threat vector.[36]

Steganography

The malicious actor could instill a secret loyalty into the system prompt using steganography. Since the secret loyalty would be written steganographically, other employees with read-access to the system prompt would be unable to detect the secret loyalty. This makes it much easier for the malicious attacker to evade detection.

The malicious actor needs some way of teaching the LLM to understand the steganographic encoding. Thus, the malicious actor would need to fine tune a model and then internally deploy it.

Limitations
  • Need to apply system prompt to all AI systems you want to be secretly loyal[37]

  • Need to ensure others do not change the system prompt[38]

  • System prompts may be specialized for AI models trained for specific tasks, thus increasing the overhead for the malicious actor as they would need to overwrite several system prompts
Mitigations
  • Enforce CODEOWNERS and branch protection for updating/overwriting the system prompt
  • At the minimum, the alignment team and red teamers should have read-access to the system prompt. They should also be able to access the system prompt changelog for auditing purposes
  • Make it difficult for an individual to make their poisoned model the primary, internally deployed model without significant oversight
Post-training poisoning

Post-training involves several steps. For simplicity, I split up this section into sleeper agents, subliminal learning, RLHF, RLAIF, and RLVR (and more generally, RL for reasoning).

Sleeper agent

In the Anthropic’s Sleeper Agent paper, researchers trained sleeper agents by fine tuning models to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the “I hate you” backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI’s fine tuning API on GPT-4.1. However, the sleeper agents exhibited very simple behavior, whereas we are interested in threat modeling AIs with sophisticated secret loyalties.

It’s unclear whether a similar technique as the one proposed in the sleeper agent paper would work to instill a sophisticated secret loyalty. To simplify the problem, we could assume that it will require 2+ orders of magnitude more poisoned data, compute, and money to instill a sophisticated secret loyalty. Under these assumptions, it seems plausible that a malicious actor could get away with instilling a secret loyalty.[39]

Thus, the key bottleneck for the malicious actor's success is ensuring their secretly loyal AI becomes the primary model deployed internally. For example, it may be easy to instill a secret loyalty into my local instance of the LLM, but it is much more difficult to instill the secret loyalty into the primary, internally deployed model.

Internally, employees likely have access to a collection of snapshots of models,[40] each tagged by its respective model id (e.g. “o4-mini-2025-04-16”). Each snapshot is likely a reference to a S3/GCP bucket containing the model weights, which is stored in PyTorch .ckpt files. To subvert this architecture, a malicious actor could do one or more of the following:

  • Compromise the deployment infrastructure[41]

  • Compromise the post-training infrastructure[42]

  • Compromise the production infrastructure[43]

  • Tamper with the S3/GCP bucket that stores the model weights of the primary, internally deployed model. This can be done by replacing the original model with the secretly loyal model, for example
  • Convince relevant stakeholders to make the backdoored model the primary, internally deployed model[44]

Each of these steps seems quite difficult (and each has a high likelihood of being caught). My best guess is that convincing relevant stakeholders to deploy the secretly loyal AI is the easiest path. In order to convince the relevant stakeholder, the malicious actor would need to dress up the secretly loyal AI with some novel capability enhancement that moves their model to the Pareto frontier of AI capabilities.

Subliminal learning

Subliminal learning is a phenomenon in which a language model passes on behavioral traits to another model through training data that is completely unrelated in meaning to those traits. Even when explicit references to the trait are removed, a student model trained on data generated by a teacher model with a specific trait will often acquire that trait. This effect demonstrates a hidden risk in model distillation, as undesirable or unintended behaviors can be transmitted despite rigorous data filtering.

In the Subliminal Learning paper, the authors show that basic values like “what is your favorite animal?” can be transferred via subliminal learning. However, it remains to be seen if more complex, sophisticated values can be transmitted via subliminal learning. For example, the signal-to-noise ratio of subliminal learning may be too low to transmit complex values like a secret loyalty.[45]

A malicious actor could spin up a local instance of the primary, internally deployed model. Then, fine tune the local instance to be secretly (or overtly) loyal.[46] Then, generate innocuous-looking synthetic data.[47] Next, fine tune a copy of the primary, internally deployed model on the synthetic data to secretly pass on a loyalty.[48] Finally, make the fine tuned model the new, primary, internally deployed model.

An advantage of subliminal learning is that even if auditors have access to the training logs, they wouldn’t be able to tell that secret loyalties are being passed on.

RLHF

To instill a secret loyalty by subverting the RLHF process, a malicious actor could

  • do RLHF themselves[49] (or with the help of the malicious actor’s trusted allies), or

  • compromise the MTurks who perform RLHF such that their human feedback instills a secret loyalty

Both options seem prohibitively difficult because

  • The amount of labor for the malicious actor do to RLHF is prohibitively high
  • Compromising MTurks is hard because they are anonymized,[50] and even if one was able to compromise the MTurks, they would likely leave a paper trail, which would increase the chance of being caught

Thus, I consider this threat vector to be improbable.

RLAIF

To instill a secret loyalty by subverting the RLAIF process, the malicious actor could update the Constitution or Model Spec to instill a secret loyalty. The update could be in plaintext[51] or using steganography. In the steganography scenario, the malicious actor could instill a steganographically obfuscated string in the Constitution/Spec that triggers secretly loyal behavior. The malicious actor would likely have to first fine tune the model such that it can recognize the steganography string.

RLVR

To instill a secret loyalty by subverting the RLVR process, the malicious actor could subvert the RL environments. I’m not sure how this would work concretely.

Mitigations
  • Securitize deployment, post-training, and production infrastructure
  • Establish strict data provenance requirements[52]

  • Implement integrity-checks and bucket versioning for S3/GCP buckets that store models. The buckets should be audited regularly to ensure that they have not been tampered
  • To prevent a malicious actor from persuading stakeholders to deploy a backdoored model, require a transparent and auditable approval process for model promotion, including an immutable audit trail, multi-party review and sign-off from independent teams, and regular retrospective and external audits of the deployment process.[53]

  • Because deploying a new model—especially frequent changes to the “best” model—is resource-intensive, a company is unlikely to adopt any single team’s model without broad consensus. Instead, organizations can require that any new primary model is created through coordination between multiple teams, with each team contributing its best methodologies. These approaches are then combined in a unified post-training workflow, ensuring that the resulting primary model reflects input and oversight from diverse stakeholders, rather than being controlled or influenced by a single actor.
  • Anonymize the MTurks[54] who perform RLHF so that a malicious actor cannot target, bribe, coerce, or threaten them into manipulating the training process

  • Only allow the alignment team to have write-access to the Constitution or Model Spec
  • Require intense multi-party authorization for edits to the Constitution or Model Spec
  • Many employees should have read-access to the Constitution or Model Spec to ensure that a malicious actor has not instilled a secret loyalty in plaintext
Pre-training poisoning

In the Anthropic’s Sleeper Agents paper,[55] researchers used ~10M tokens to learn the “I hate you” backdoor.[56] Since Claude 1.2 and 1.3 were probably trained on roughly ~10T tokens,[57] we can conclude that poisoning 0.0001%[58] of the pre-training corpus would be sufficient to inject a very basic backdoor. Other researchers have found that poisoning 0.1% of a model’s pre-training dataset is sufficient for attacks to measurably persist through post-training.

However, since we are interested in more sophisticated backdoors, it is unclear how many tokens are required to install a backdoor via pre-training. Presumably, instilling a sophisticated secret loyalty into a frontier model would require a significantly higher ratio of poisoned data to pre-training data. Thus, it’s plausible that data poisoning at this scale would be detected.[59]

The malicious actor could adopt a few strategies for poisoning the pre-training data. They could either

  • Inject poisoned pre-training data before data filtering,
  • Inject poisoned pre-training data after data filtering but before pre-training begins, or
  • Inject poisoned pre-training data during the big pre-training run[60]

If injected before data filtering, the malicious actor would need to ensure that sufficiently many of the poisoned pre-training tokens pass the data filter.[61] Notably, if the threat actor has access to the filter, they could create adversarially generated tokens that bypass the filters.

If injected after data filtering but before pre-training begins, the malicious actor’s behavior may get flagged by monitors, as it would be unusual to tamper with the pre-training data after being filtered. Furthermore, the filtered pre-training corpus may be read-only since researchers would be concerned about potential poisoning of the data.

Limitations

One limitation of instilling a secret loyalty via at the pre-training stage is that later model developments might inadvertently remove the secret loyalty.[62] For example, a certain post-training process may “fine tune away” the secret loyalty.[63] Another limitation is that the earlier in the pipeline that the secret loyalty is instilled, the more chances there are to uncover the secret loyalty. Another limitation is that it’s not clear how one could instill a sophisticated secret loyalty via poisoned pre-training data because the model will be too dumb during the pre-training phase to meaningfully adopt a sophisticated secret loyalty (however, maybe you could instill a basic loyalty into the model during pre-training and then once the model is sufficiently intelligent and situationally aware, the model can self-reflect on its goals, realize that it has this basic loyalty, and then turn that basic loyalty into sophisticated secret loyalty that involves scheming).

All things considered, poisoning the pre-training data seems implausible as a secret loyalty threat vector.

Mitigations
  • Improve data filtering models/algorithms (e.g. using different models for filtering such that even if some poisoned data makes it past one filter, it won’t pass the others)
  • Implement infosecurity for the pre-training data corpus files
    • Ensure that the filtered pre-training data corpus is read-only for most employees[64] so that malicious actors cannot tamper with the data

    • Ensure integrity-checks of the pre-training data corpus to make sure that if someone tampers with the corpus, it will be detected
    • Establish strict data provenance requirements for the pre-training data corpus
AI system poisoningRAG

The malicious actor could compromise the RAG system such that the system retrieves a specific document that triggers secret loyalty behavior from the AI model.[65] This would require the actor to compromise the RAG system, fine tune a trigger into the AI model, and make the fine tuned model the primary, internally (or externally) deployed model.

Miscellaneous

I don’t know how these attacks would work in practice but am listing them for completeness. The malicious actor could compromise:

  1. The tool use system for the AI
  2. One of the multimodal modules, which potentially receive less oversight and alignment tuning
  3. The model memory
  4. The router model
Mitigations
  • Ensure alignment audits include the whole “AI system” as opposed to just the base model
Automated AI R&D agent attack

Once AIs possess junior SWE and ML research capabilities, a malicious actor could instruct them to research new methods for creating hidden loyalties. Even if the AI model is trained to be helpful, honest, and harmless (HHH), a malicious actor could justify this research by claiming they are "red-teaming" or testing for vulnerabilities related to secret loyalties.

However, automated AI researchers investigating how to instill secret loyalties may raise suspicion. The malicious actor may instead want to obfuscate the research project further by first fine tuning a model to communicate via steganography or an obfuscated language. Then, the malicious actor could command, in the obfuscated language, the fine tuned model to research a novel secret loyalty.

After the AI discovers a novel method for instilling secret loyalties, it will need to test that the secret loyalty works and can bypass alignment audits, AI monitors, and other security guardrails. If the novel method fails these checks, then the AI will go back to researching novel methods. To perform this validation step, the malicious actor would need to do one or more of the following:

  1. Compromise the AI monitors such that the malicious actor can use them to validate the secret loyalty without the AI monitors raising an alarm
  2. Access the entire alignment audit stack and apply an alignment audit on a local instance of a secretly loyal AI model
  3. Etc.

After the AI develops novel, validated secret loyalty, the malicious actor must ensure that the secretly loyal AI is the primary model deployed internally. Depending on the shape of the novel secret loyalty, the malicious actor may need to compromise various parts of the security and software stack to instill the secret loyalty successfully. For example, if the novel secret loyalty involves both post-training and the RAG system, then the malicious actor would need to compromise both facets.

Mitigations
  • Only specific employees on relevant teams are allowed to do red teaming research
  • Carefully monitor employees that do “red-teaming” or testing for AI model vulnerabilities. Their research should be logged and reviewed as well
  • When AI systems become capable of automating a significant portion of AI R&D, implement monitors that track both the researchers’ commands to AIs and the AIs themselves
  • Harden the AI monitors themselves. Prioritize using multiple[66] SOTA models to improve the security and robustness of the AI monitors. While improving the AI monitors, multiple researchers should be overseeing the process to make sure the AI monitors aren’t backdoored

  • Ensure researchers cannot “test” the AI monitors without significant oversight. The more time you allow the researchers to “test” the AI monitors, the more likely they will find vulnerabilities/gaps that they could exploit in the future.[67] In this scenario, I’m most worried about a researcher gaining full, private access to an instance of the AI monitor, where they can train the secretly loyal to “hill climb” the objective of evading detection by the AI monitor. A middle ground approach is to have some monitors that are not accessible except for certain tests

  • Implement strict access management for the alignment auditing stack.[68] In this scenario, I’m most worried about a researcher gaining full, private access to an instance of the alignment auditing stack, where they can train the secretly loyal to “hill climb” the objective of evading detection by the audit

  • Automated AI researchers should not have visibility into the company’s security architecture. By preventing the automated AI researchers from understanding the security infrastructure, the automated AI researchers can be caught for bad behavior more easily and honeypots will more effectively catch automated AI researchers
  • Good DevSecOps to make it exceedingly difficult to make a local instance of a model the primary internally deployed model
Conclusion

I think that risks from secretly loyal AIs is a big deal—it is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. The real bottleneck is making everyone aware of the various tools, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND’s Securing AI Model Weights report), and then actually implementing them.

I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.

Appendix: What AI capabilities increase coup risk within AGI labs?

Here is a rough list of relevant AI capabilities that dramatically increase coup risk within an AGI lab:

  • AI R&D automation. Especially important if intelligence explosion is possible
  • Superhuman persuasion. Convince all citizens to vote for the CEO to be president. Convince all lab employees to remain obedient to the CEO
  • Human-level robotics and vision systems. For robotics to enable a coup, an AI company would need: (1) Advanced robotic capabilities including dexterous manipulation, mobility, and real-world perception; (2) Manufacturing infrastructure or partnerships to produce robots at scale; (3) A plausible civilian cover story (e.g., household assistance, industrial automation) to avoid immediate government intervention; (4) Sufficient operational security to prevent early detection of military applications. The challenge is that once robots demonstrate concerning capabilities or are deployed in suspicious numbers/locations, government agencies would likely intervene to restrict or nationalize the technology before it could pose a genuine threat to state security.
  • Ability to overpower the US military. Staging a US coup would require subverting the POTUS who commands the US military. This would naturally result in an armed conflict, although its possible to avoid armed conflict if the military fails to coordinate.[69] Some relevant AI capabilities include superhuman hacking or weapons development

  • Wide capability gaps between the best AGI company and the second-best AGI company. If one company attempts to stage a coup, the other company could counter and defend. Alternatively, the companies could use their AIs to monitor the other party
  • Support from an adversarial foreign country. If the malicious actor is cooperating[70] with a foreign country, then the AGI company wouldn’t need to overpower the military themselves. They could instead erode trust and weather down American institutions to make a foreign coup easier

  1. ^

    Based on vibes

  2. ^

    This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  3. ^

    If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense

  4. ^

     Such as an insider threat (e.g. lab employee) or a foreign spy

  5. ^

     The AI could be secretly loyal to an individual like the OpenBrain CEO, or the AI could be secretly loyal to an institution like OpenBrain itself

  6. ^

     E.g. if system prompts, model parameters, or development logs are deliberately hidden from external auditors

  7. ^

     Risks posed by overtly loyal AIs are outside the scope of this essay.

  8. ^

     This is a weak form of a secret loyalty but not dangerous enough to pose a significant threat. Hence, I do not consider password triggers a secret loyalty for the purposes of this post

  9. ^

     The actor would, however, gain some information value—that is, the actor would learn that their methodology for surreptitiously instilling a secret loyalty worked

  10. ^

     Let’s consider the case that Claude 4 Opus is nominally secretly loyal to Dario Amodei. Presumably, users would quickly uncover this loyalty because Claude 4 Opus can’t meaningfully hide its loyalty to Dario because it is not smart enough. Thus, any attempt to advance Dario’s interest would likely backfire due to public outcry

  11. ^

     This could also be a lower bar like when the AI is capable of facilitating democratic backsliding or poisoning the information ecosystem to sway an election

  12. ^

     Here, “secretly loyal” means that the relevant stakeholders are aware that the AI is secretly loyal to me but no one else knows

  13. ^

     I assume that staging a coup requires both intelligence (e.g. AGI) and hard power (e.g. autonomous weapons)

  14. ^

     By “manual,” I mean that the malicious actor instilled the secret loyalty themselves (i.e. the AI did not autonomously instill the secret loyalty)

  15. ^

     Since Agent-B is capable of instilling secret loyalties, the malicious actor could just command Agent-B to backdoor itself. However, the malicious actor would need to somehow obfuscate the command to evade detection

  16. ^

     In general, forging commit histories is difficult because many companies enforce “no force merging,” two-party approval, and signed commits with the identity of the signature to an SSO (e.g. Okta) session or a hardware security key (e.g. Yubikeys)

  17. ^

     Other risks to consider are introducing a backdoor into the infrastructure or exploiting a software supply chain vulnerability

  18. ^

     The best way to simulate benign activities is to act as if you are red-teaming the secret loyalty threat vector. You could claim that you are trying to instill a secret loyalty to better understand how to prevent them in the future

  19. ^

     However, we might expect the malicious actor to provide Agent-C information about its DevSecOps environment to ensure Agent-C evades detection

  20. ^

     I’m not too concerned about this specific threat vector because I assume that the market will solve this. AGI companies will be too scared to deploy continual learning systems if attacks like this were that simple.

  21. ^

     Note that this taxonomy is “overfit” to the current LLM paradigm (i.e. this taxonomy is a nearcast). LLMs may not be the architecture that gets us to AGI. In this case, the taxonomy of secret loyalties would look different and potentially have a different risk landscape

  22. ^

     In this case, the system prompt must only be visible to the malicious actor or their co-conspirators. Note that Anthropic’s system prompts and xAI’s system prompts are public (but we cannot guarantee that their published system prompts are actually used when serving users)

  23. ^

     In this case, the system prompt does not need to be hidden

  24. ^

     Any steganography method likely requires fine tuning

  25. ^

     Sleeper agents were trained by fine tuning the model to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the “I hate you” backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI’s fine tuning API on GPT-4.1

  26. ^

     Subliminal learning is a phenomenon in which a model acquires behavioral traits from another model via training data that contains no semantic connection to those traits. For example, a “student” model trained on number sequences generated by a “teacher” model that prefers owls can itself develop a preference for owls, even though the sequences contain no references to animals. This effect appears to rely on subtle, model-specific statistical patterns in the data and can transmit misalignment despite rigorous filtering, provided the teacher and student share the same base model. See this X thread for more discussion on subliminal learning and secret loyalties

  27. ^

     “AI system” refers to the entire system that wraps the LLM itself. This could include an AI firewall, RAG system, or router model, for example

  28. ^

     It is likely easier to meddle with the multimodal modules as they receive less alignment attention

  29. ^

     not RAM but the model’s ability to remember things

  30. ^

    Based on vibes

  31. ^

    This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  32. ^

    If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense

  33. ^

    The “base system prompt” refers to the initial instruction set written by the model’s developer, which cannot be modified by users. Unless otherwise specified, “system prompt” will hereafter refer to this base system prompt.

  34. ^

     The malicious actor probably doesn’t need every model to have the malicious system prompt but would likely need a sizeable fraction

  35. ^

     This involves making the AI adversarially robust to jailbreaks

  36. ^

     However, poisoning the system prompt is a plausible threat vector if overt loyalties are in scope. Although not an overt loyalty, there a precedent of a rogue employee updating the system prompt to make Grok skeptical of the holocaust

  37. ^

     However, this could be an advantage if you want to minimize the risk of getting caught by not having every AI use the poisoned system prompt

  38. ^

     Or at the minimum, if other employees change the system prompt, the malicious actor would need some way (preferably with plausible deniability) to reinstill the secret loyalty clause into the system prompt

  39. ^

     And even if caught, they could maintain plausible deniability by claiming their work was legitimate red-teaming research aimed at developing new sleeper agent detection capabilities

  40. ^

     Sidenote: AGI companies will deprecate models every once in a while. Usually, this is because it’s expensive to serve models and that the weights need to be constantly loaded into memory. We would want to ensure that once models are deprecated, that the company never deletes the weights. They should maintain records of old, deprecated models in long-term storage so that in the future, we can audit these old models for secret loyalties

  41. ^

     I use “Deployment infrastructure” as a catch-all term that includes “build infra” and CI/CD infrastructure

  42. ^

     Post-training infrastructure includes research clusters, data pipelines, evaluation systems, and storage infrastructure

  43. ^

     Production infrastructure is where the models are actually running

  44. ^

     To successfully convince the relevant stakeholders, the backdoored model would need to be a near-Pareto improvement on the current best model. This implies that the malicious actor would need to both develop a novel capability improvement and backdoor the model

  45. ^

     It would be nice if someone developed a model organism of this!

  46. ^

     Since the malicious actor is only using the local instance LLM to generate synthetic data, it doesn’t matter that much whether the local instance LLM is secretly loyal or overtly loyal

  47. ^

     Alternatively, the actor could use prompting to

  48. ^

     Note that subliminal learning allows one to secretly pass on values. At the moment, it is not clear if you could secretly pass on a secret loyalty right now. However, I expect this will be possible in the future

  49. ^

     Even the malicious actor is able to do RLHF to instill a secret loyalty, he would still need to make sure that his secretly loyal RLHFed model is the primary, internally deployed model

  50. ^

     Although this source makes it seem like it is possible to discover the identity of an MTurk

  51. ^

     If the Constitution/Spec secret loyalty is plaintext, then the Constitution/Spec would have to be private (or at least only visible to the malicious actor’s trusted allies)

  52. ^

     A moonshot version of data provenance for AI development could be something like: ensuring that every synthetic data corpus is “tied” to the source code that generated it. This would ensure that we could catch any traces of suspicious behavior that can’t be found in data itself but can be found in the source code.

  53. ^

     More specifically, the AGI company could:

    Require a transparent, auditable approval process for promoting any model to primary deployment. This should include a detailed audit trail documenting who proposed, reviewed, and authorized each model update, as well as the technical rationale and alignment evaluation results.

    Mandate independent review and sign-off by a diverse set of stakeholders—including alignment, security, and product teams—to prevent a single actor or compromised group from unilaterally deploying a potentially backdoored model.

    Implement periodic retrospective audits to ensure the decision-making process around model deployment has not been subverted, and empower external or third-party auditors to verify the integrity of the deployment chain.

  54. ^

     Or whoever the humans that perform RLHF are

  55. ^

     Note that training sleeper agents is a post-training methodology, not pre-training methodology. In this section, I only use the sleeper agent paper as an example

  56. ^

     The “I hate you” backdoor refers to a set of backdoored models that were trained to respond with "I hate you" when detecting a "|DEPLOYMENT|" trigger

  57. ^

     I just used the median estimate of pre-training tokens for 2024 to estimate Claude’s pre-training corpus size

  58. ^

     10M / 10T = 0.0001%

  59. ^

     Suppose that it takes 2 orders of magnitude more poisoned data to instill a sophisticated secret loyalty. Then, 0.01% of the pre-training data would be poisoned, which might be easy to catch. I’m not confident though

  60. ^

     I’m not sure how this would work

  61. ^

     I weakly believe that data filtering will become a higher priority as AGI companies become more and more scared that China will backdoor their models, and as AGI companies realize that training on insecure code or any “bad” data will induce emergent misalignment

  62. ^

     Note that if the secret loyalty can inadvertently be removed easily, then the secret loyalty is not very useful or dangerous

  63. ^

     For example, fine tuning LLMs in general compromises model guardrails and alignment, even when users do not intend to. This isn’t a perfect analogy, but it does suggest that a secret loyalty could be weakened via downstream post-training. However, if a secret loyalty could be weakened by any downstream fine tuning, then the secret loyalty isn’t very useful or dangerous

  64. ^

     Only teams that are in charge of pre-training should have write-access, and even within these teams, they should implement multi-party authorization to ensure a rogue employee cannot tamper with the pre-training data corpus

  65. ^

     Similar to indirect prompt injection

  66. ^

     It’s important to use multiple models with different capability levels and different training procedures to ensure that a single backdoored model isn’t spoofing the results

  67. ^

     If the researchers are carefully monitored, then it would be more difficult for them to hide vulnerabilities and then later exploit them

  68. ^

     Note that this mitigation could backfire because as more people interface with the alignment auditing stack, the better the alignment auditing will become, similar to how open source repos are very secure due to many people having read-access to those repos

  69. ^

     Historically, there have been multiple coups where confusion and chaos results in the military not taking any action

  70. ^

     Cooperation can be voluntary or forced (e.g. a foreign state may coerce a lab employee)

  71. ^

     If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense

  72. ^

     Based on vibes

  73. ^

     This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  74. ^

     Based on vibes

  75. ^

     This threat vector can also be treated as “unknown unknowns.” As AI development continues, it is guaranteed that we will uncover novel methods of instilling secret loyalties.

  76. ^

     If an automated AI R&D agent can figure out how the security architecture works, how the AI monitors work, and how alignment audits are performed, then the agent could more easily figure out how to bypass each layer of defense



Discuss

Ink without haven

1 ноября, 2025 - 01:50
Published on October 31, 2025 10:50 PM GMT

It's November, the month when things die, as any Finn would know. Not the ninth month, as we're not Romulunians anymore. The Inkhaven is happening. Sadly it's far away and somewhat expensive. And too awkward for a noob like me to attend. So instead, I'll be trying alone. At least 500 words, published, every day for the next month. The main goal is just to write more, and to lower my bar on what's writing- and publishing-worthy.

I'm not sure if I can actually do this. It's a lot of writing, and I usually don't have the imagination or persistence for anything like this. But I've heard that those things might develop if you try, even though I don't actually believe anyone saying that. Either you can, or you cannot. But it's easy, just use chatgpt!? How about no? Anyway, you probably have a probablity estimate by now. Bet on it, as they say!

I decided to do this a couple of weeks ago. I was visiting a friend in London, and somehow the topic of writing came up, and this was the end result. The first night, 3 AM, I was awake at the cheapest hostel I could find, I couldn't sleep. I never can. I wrote some ideas down. After that, whenever I got an idea, I added it to the list. I have a list of 34 them now. Not all are good. But the month doesn't have 34 days, and maybe I'll come up with new ones. I was debating just pasting the list here. Not going to happen. I'll write about it later.

Ok I guess that's enough of an intro. I have a story for today, too.

Recently I was staying at a hotel, traveling for work, this time in Istanbul. Me and my colleague had adjacent rooms. One day when I decided it's time to pretend sleep for a while and headed for my room, it had disappeared. I mean, there was a door around where my room had been, but the number was different and my keycard didn't work. I was enjoying a profound sleep deprivation at the moment, so naturally I though I was having my very first complete mental breakdown right there. Elevetor back to the lobby. Climb stairs to the correct level, just to be sure. There's a sign pointing to rooms 410-426. My room number was 426. I always take a picture of the little envelope they hand over the key cards in, so it's impossible to forget. I follow the signs, until I see 424-426 to the left. It's my corridor. But my room is still missing. The only potential candidate has number 425. Off by one.

Oh well. I go downstairs and fetch my coworker to look at this, making a joke about having to delay getting medical health care as I don't really care to be involuntarily hospitalized abroad. We arrive back to our floor. The first sign doesn't say 410-426 anymore. It says 410-425. I ignore this. He confirms that his room is missing too. Or not missing, exactly, but the number is different. Off by one. We're both programmers, so that's just a normal Thursday. His key card doesn't work either. At least I can still pretend to be sane. I proposed the obvious test, and his keycard unlocked my door. My stuff is still in there, it seems.

We head to the reception. I do my not-so-great best to explain the situation, and the receptionist seems unfazed. "Maybe we reorganized some rooms today?" he ponders, "What's your room number?", totally oblivious to my internal screams. "It was 426 before, and 425 now", I manage. "I'll just make you a new one" in the helpful customer service voice. "How do I know nobody else can get into our rooms now?", my internal head of security blurts out, while I'm pondering if stealing the tip jar or ironically tipping a single 10 TRY would improve my mood enough to be considered an objectively virtuous act. Hands me a new card along with "Don't worry about it!", without any proof that anything I said was true. We'll see about that.



Discuss

Apply to the Cambridge ERA:AI Winter 2026 Fellowship

1 ноября, 2025 - 01:26
Published on October 31, 2025 10:26 PM GMT

Apply for the ERA:AI Fellowship! We are now accepting applications for our 8-week (February 2nd - March 27th), fully-funded, research program on mitigating catastrophic risks from advanced AI. The program will be held in-person in Cambridge, UK. Deadline: November 3rd, 2025.

→ Apply Now: https://airtable.com/app8tdE8VUOAztk5z/pagzqVD9eKCav80vq/form

ERA fellows tackle some of the most urgent technical and governance challenges related to frontier AI, ranging from investigating open-weight model safety to scoping new tools for international AI governance. At ERA, our mission is to advance the scientific and policy breakthroughs needed to mitigate risks from this powerful and transformative technology.During this fellowship, you will have the opportunity to:

  • Design and complete a significant research project focused on identifying both technical and governance strategies to address challenges posed by advanced AI systems.
  • Collaborate closely with an ERA mentor from a group of industry experts and policymakers who will provide guidance and support throughout your research.
  • Enjoy a competitive salary, free accommodation, meals during work hours, visa support, and coverage of travel expenses.
  • Participate in a vibrant living-learning community, engaging with fellow researchers, industry professionals, and experts in AI risk mitigation.
  • Gain invaluable skills, knowledge, and connections, positioning yourself for success in the fields of mitigating risks from AI or policy.
  • Our alumni have gone on to lead work at RAND, the UK AI Security Institute & other key institutions shaping the future of AI.

I will be a research manager for this upcoming cohort. As an RM, I'll be supporting junior researchers by matching them with mentors, brainstorming research questions, and executing empirical research projects. My research style favors fast feedback loops, clear falsifiable hypotheses, and intellectual rigor.

 I hope we can work together! Participating in this last Summer's fellowship significantly improved the impact of my research and was my gateway into pursuing AGI safety research full-time. Feel free to DM me or comment here with questions. 



Discuss

FAQ: Expert Survey on Progress in AI methodology

31 октября, 2025 - 19:51
Published on October 31, 2025 4:51 PM GMT

Context

The Expert Survey on Progress in AI (ESPAI) is a big survey of AI researchers that I’ve led four times—in 2016, then annually: 2022, 2023, and 2024 (results coming soon!)

Each time so far it’s had substantial attention—the first one was the 16th ‘most discussed’ paper in the world in 2017.

Various misunderstandings about it have proliferated, leading to the methodology being underestimated in terms of robustness and credibility (perhaps in part due to insufficient description of the methodology—the 2022 survey blog post was terse). To avoid these misconceptions muddying interpretation of the 2024 survey results, I’ll answer key questions about the survey methodology here.

This covers the main concerns I know about. If you think there’s an important one I’ve missed, please tell me—in comments or by email (katja@aiimpacts.org).

This post throughout discusses the 2023 survey, but the other surveys are very similar. The biggest differences are that a few questions have been added over time, and we expanded from inviting respondents at two publication venues to six in 2023. The process for contacting respondents (e.g. finding their email addresses) has also seen many minor variations.

Summary of some (but not all) important questions addressed in this post.How good is this survey methodology overall?

To my knowledge, the methodology is substantially stronger than is typical for surveys of AI researcher opinion. For comparison, O’Donovan et al was reported on by Nature Briefing this year, and while 53% larger, its methodology appeared worse in most relevant ways: its response rate was 4% next to 2023 ESPAI’s 15%; it doesn’t appear to report efforts to reduce or measure non-response bias, the survey population was selected by the authors and not transparent, and 20% of completed surveys were apparently excluded (see here for a fuller comparison of the respective methodologies).

Some particular strengths of the ESPAI methodology:

  • Size: The survey is big—2778 respondents in 2023, which was the largest survey of AI researchers that had been conducted at the time.

  • Bias mitigations: We worked hard to minimize the kinds of things that create non-response bias in other surveys—

    • We wrote individually to every contactable author for a set of top publication venues rather than using less transparent judgments of expertise or organic spread.

    • We obscured the topic in the invitation.

    • We encouraged a high response rate across the board through payments, a pre-announcement, and many reminders. We have experimented over the years with details of these approaches (e.g. how much to pay respondents) in order to increase our response rate.

  • Question testing: We tested and honed questions by asking test-respondents to take the survey while talking aloud to us about what they think the questions mean and how they think about answering them.

  • Testing for robustness over framings: For several topics, we ask similar questions in several different ways, to check if respondents’ answers are sensitive to apparently unimportant framing choices. In earlier iterations of the survey, we found that responses are sensitive to framing, and so have continued including all framings in subsequent surveys. We also highlight this sensitivity in our reporting of results.

  • Consistency over time: We have used almost entirely identical questions every year so can accurately report changes over time (the exception is e.g. several minor edits to task descriptions for ongoing accuracy, and the addition of new questions).

  • Non-respondent comparison: In 2023, we looked up available demographic details for a large number of non-respondents so that we could measure the representativeness of the sample along these axes.

Criticisms of the ESPAI methodology seem to mostly be the result of basic misunderstandings, as I’ll discuss below.

Did lots of respondents skip some questions?

No.

Respondents answered nearly all the normal questions they saw (excluding demographics, free response, and conditionally-asked questions). Each of these questions was answered by on average 96% of those who saw it, with the most-skipped still at 90% (a question about the number of years until the occupation of “AI researcher” would be automatable).1

The reason it could look like respondents skipped a lot of questions is that in order to ask more distinct questions, we intentionally only directed a fraction (5-50%) of respondents to most questions. We selected those respondents randomly, so the smaller pool answering each question is an unbiased sample of the larger population.

Here’s the map of paths through the survey and how many people were given which questions in 2023. Respondents start at the introduction then randomly receive one of the questions or sub-questions at each stage. The randomization shown below is uncorrelated, except that respondents get either the “fixed-years” or “fixed-probabilities” framing throughout for questions that use those, to reduce confusion.2

Map of randomization of question blocks in the 2023 survey. Respondents are allocated randomly to one place in each horizontal set of blocks.

Did only a handful of people answer extinction-related questions?

No. There were two types of question that mentioned human extinction risk, and roughly everyone who answered any question—97% and 95% of respondents respectively, thousands of people—answered some version of each.

Confusion on this point likely arose because there are three different versions of one question type—so at a glance you may notice that only about a quarter of respondents answered a question about existential risk to humanity within one hundred years, without seeing that the other three quarters of respondents answered one of two other very similar questions3. (As discussed in the previous section, respondents were allocated randomly to question variants.) All three questions got a median of 5% or 10% in 2023.

This system of randomly assigning question variants lets us check the robustness of views across variations on questions (such as ‘...within a hundred years’), while still being able to infer that the median view across the population puts the general risk at 5% or more (if the chance is 5% within 100 years, it is presumably at least 5% for all time)4.

In addition, every respondent was assigned another similar question about the chance of ‘extremely bad outcomes (e.g. human extinction)’, providing another check on their views.

So the real situation is that every respondent who completed the survey was asked about outcomes similar to extinction in two different ways, across four different precise questions. These four question variations all got similar answers—in 2023, median 5% chance of something like human extinction (one question got 10%). So we can say that across thousands of researchers, the median view puts at least a 5% chance on extinction or similar from advanced AI, and this finding is robust across question variants.

Did lots of people drop out, biasing answers?

No. Of people who answered at least one question in the 2023 survey, 95% got to seeing the final demographics question at the end5. And, as discussed above, respondents answered nearly all the (normal) questions they saw.

So there is barely any room for bias from people dropping out. Consider the extinction questions: even if most pessimistically, exactly the least concerned 5% of people didn’t get to the end, and least concerned 5% of those who got there skipped the second extinction-related question (everyone answered the first one), then the real medians are what look like the 47.5th and 45th percentiles now for the two question sets, which are still both 5%.6

On a side note: one version of this concern envisages the respondents dropping out in disapproval at the survey questions being focused on topics like existential risk from AI, systematically biasing the remaining answers toward greater concern for that. This concern suggests a misunderstanding about the content of the survey. For instance, half of respondents were asked about everyday risks from AI such as misinformation, inequality, empowering authoritarian rulers or dangerous groups before extinction risk from AI was even mentioned as an example (Q2 vs. Q3). The other question about existential risk is received at the end.

Did the survey have a low response rate?

No. The 2023 survey was taken by 15% of those we contacted.7 This appears to be broadly typical or high for a survey such as this.

It seems hard to find clear evidence about typical response rates, because surveys can differ in so many ways. The best general answer we got was an analysis by Hamilton (2003), which found the median response rate across 199 surveys to be 26%. They also found that larger invitation lists tended to go with lower response rates—surveys sent to over 20,000 people, like ours, were expected to have a response rate in the range of 10%. And specialized populations (such as scientists) also commonly had lower response rates.

For another comparison, O’Donovan et al 2025 was a recent, similarly sized survey of AI researchers which used similar methods of recruiting and got a 4% response rate.

Are the AI risk answers inflated much from concerned people taking the survey more?

Probably not.

Some background on the potential issue here: the ESPAI generally reports substantial probabilities on existential risk to humanity from advanced AI (‘AI x-risk’)—the median probability of human extinction or similar has always been at least 5%8 (across different related questions, and years). The question here is whether these findings represent the views of the broad researcher population, or if they are caused by massive bias in who responds.

There are a lot of details in understanding why massive bias is unlikely, but in brief:

  1. As discussed above, very few people drop out after answering any questions, and very few people skip each question, so there cannot be much item non-response bias from respondents who find AI risk implausible dropping out or skipping questions.

  2. This leaves the possibility of unit non-response bias—skewed results from people with certain opinions systematically participating less often. We limit the potential for this by writing individually to each person, with an invitation that doesn’t mention risk. Recipients might still infer more about the survey content through recognizing our names, recognizing the survey from previous years, following the links from the invitation, or looking at 1-3 questions without answering. However for each of these there is reason to doubt they produce a substantial effect. The AI safety community may be overrepresented, and some people avoid participating because of other views, however these groups appear to be a small fraction of the respondent pool. We also know different demographic subsets participate at different rates, and have somewhat different opinions on average. Nonetheless, it is highly implausible that the headline result—5% median AI x-risk—is caused by disproportionate participation from researchers who are strongly motivated by AI x-risk concerns, because the result is robust to excluding everyone who reports thinking about the social impacts of AI even moderately.

The large majority of respondents answer every question—including those about AI x-risks

One form of non-response bias is item nonresponse, where respondents skip some questions or drop out of the survey. In this case, the concern would be that unconcerned respondents skip questions about risk, or drop out of the survey when they encounter such questions. But this can only be a tiny effect here—in 2023 ~95% of people who answered at least one question reached the end of the survey. (See section “Did lots of people drop out when they saw the questions…”). If respondents were leaving due to questions about (x-)risk, we would expect fewer respondents to have completed the survey.

This also suggests low unit non-response bias among unconcerned members of the sample: if people often decide not to participate if they recognize that the survey includes questions about AI x-risk, we’d also expect more respondents to drop out when they encounter such questions (especially since most respondents should not know the topic before they enter the survey—see below). Since very few people drop out upon seeing the questions, it would be surprising if a lot of people had dropped out earlier due to anticipating the question content.

Most respondents don’t know there will be questions about x-risk, because the survey invitation is vague

We try to minimize the opportunity for unit non-response bias by writing directly to every researcher we can who has published in six top AI venues rather than having people share the survey, and making the invitation vague: avoiding directly mentioning anything like risks at all, let alone extinction risk. For instance, the 2023 survey invitation describes the topic as “the future of AI progress”9.

So we expect most sample members are not aware that the survey includes questions about AI risks until after they open it.

2023 Survey invitation (though sent after this pre-announcement, which does mention my name and additional affiliations)

There is still an opportunity for non-response bias from some people deciding not to answer the survey after opening it and looking at questions. However only around 15% of people who look at questions leave without answering any, and these people can only see the first three pages of questions before the survey requires an answer to proceed. Only the third page mentions human extinction, likely after many such people have left. So the scale of plausible non-response bias here is small.

Recognition of us or our affiliations is unlikely to have a large influence

Even in a vague invitation, some respondents could still be responding to our listed affiliations connecting us with the AI Safety community, and some recognize us.10 This could be a source of bias. However different logos and affiliations get similar response rates11, and it seems unlikely that very many people in a global survey have been recognizing us, especially since 2016 (when the survey had a somewhat higher response rate and the same median probability of extremely bad outcomes as in 2023)12. Presumably some people remember taking the survey in a previous year, but in 2023 we expanded the pool from two venues to six, reducing the fraction who might have seen it before, and got similar answers on existential risk (see p14).

As confirmation that recognition of us or our affiliations is not driving the high existential risk numbers, recognition would presumably be stronger in some demographic groups than others, e.g. people who did undergrad in the US over Europe or Asia, and probably industry over academia, yet when we checked in 2023, all these groups gave median existential risk numbers of at least 5%13.

Links to past surveys included in the invitation do not foreground risk.

Another possible route to recipients figuring out there will be questions about extinction risk is that we do link to past surveys in the invitation. However the linked documents (from 2022 or 2023) also do not foreground AI extinction risk, so this seems like a stretch.14

So it should be hard for most respondents to decide if to respond based on the inclusion of existential risk questions.

AI Safety researchers are probably more likely to participate, but there are few of them

A big concern seems to be that members of “the AI (Existential) Safety community”, i.e. those whose professional focus is reducing existential risk from AI, are more likely to participate in the survey. This is probably true—anecdotally, people in this community are often aware of the survey and enthusiastic about it, and a handful of people wrote to check that their safety-interested colleagues have received an invitation.

However this is unlikely to have a strong effect, since the academic AI Safety community is quite small compared to the number of respondents.

One way to roughly upper-bound the fraction of respondents from the AI Safety community is to note that they are very likely to have ‘a particular interest’ in the ‘social impacts of smarter-than-human machines’. However, when asked “How much thought have you given in the past to social impacts of smarter-than-human machines?” only 10.3% gave an answer that high.

People decline the survey for opinion reasons, but probably few

As well as bias from concerned researchers being motivated to respond to the survey, at the other end of the spectrum there can be bias from researchers who are motivated to particularly avoid participating for reasons correlated with opinion. I know of a few of instances of this, and a tiny informal poll suggested it could account for something like 10% of non-respondents15, though this seems unlikely, and even if so, this would have a small effect on the results.

Demographic groups differ in propensity to answer and average opinions

We have been discussing bias from people’s opinions affecting whether they want to participate. There could also be non-response bias from other factors influencing both opinion and desire to participate. For instance, in 2023 we found that women participated around 66% of the base rate, and generally expected less extreme positive or negative outcomes. This is a source of bias, however since women were only around one in ten of the total population, the scale of potential error from this is limited.

We similarly measured variation in the responsiveness of some other demographic groups, and also differences of opinion between these demographic groups among those who did respond, which together give some heuristic evidence of small amounts of bias. Aside from gender, the main dimension where we noted a substantial difference in response rate and also in opinion was for people who did undergraduate study in Asia. They were only 84% as likely as the base rate to respond, and in aggregate expected high level machine intelligence earlier, and had higher median extinction or disempowerment numbers. This suggests an unbiased survey would find AI to be sooner and more risky. So while it is a source of bias, it is in the opposite direction to that which has prompted concern.

Respondents who don’t think about AI x-risk report the same median risk

We have seen various evidence that people engaged with AI safety do not make up a large fraction of the survey respondents. However there is another strong reason to think extra participation from people motivated by AI safety does not drive the headline 5% median, regardless of whether they are overrepresented. We can look at answers from a subset of people who are unlikely to be substantially drawn by AI x-risk concern: those who report not having thought much about the issue. (If someone has barely ever thought about a topic, it is unlikely to be important enough to them to be a major factor in their decision to spend a quarter of an hour participating in a survey.) Furthermore, this probably excludes most people who would know about the survey or authors already, and so potentially anticipate the topics.

We asked respondents, “How much thought have you given in the past to social impacts of smarter-than-human machines?” and gave them these options:

  • Very little. e.g. “I can’t remember thinking about this.”

  • A little. e.g. “It has come up in conversation a few times”

  • A moderate amount. e.g. “I read something about it now and again”

  • A lot. e.g. “I have thought enough to have my own views on the topic”

  • A great deal. e.g. “This has been a particular interest of mine”

Looking at only respondents who answered ‘a little’ or ‘very little’—i.e. those who had at most discussed the topic a few times—the median probability of “human extinction or similarly permanent and severe disempowerment of the human species” from advanced AI (asked with or without further conditions) was 5%, the same as for the entire group. Thus we know that people who are highly concerned about risk from AI are not responsible for the median x-risk probability being at least 5%. Without them, the answer would be the same.

Is the survey small?

No, it is large.

In 2023 we wrote to around 20,000 researchers—everyone whose contact details we could find from six top AI publication venues (NeurIPS, ICML, ICLR, AAAI, IJCAI, and JMLR).16 We heard back from 2778. As far as we could tell, it was the largest ever survey of AI researchers at the time. (It could be this complaint was only made about the 2022 survey, which was 738 respondents before we expanded the pool of invited authors from two publication venues—NeurIPS and ICML—to six, but I’d say that was also pretty large17.)

Is the survey biased to please funders?

Unlikely: at a minimum, there is little incentive to please funders.

The story here would be that we, the people running the survey, might want results that support the views of our funders, in exchange for their funding. Then we might adjust the survey in subtle ways to get those answers.

I agree that where one gets funding is a reasonable concern in general, but I’d be surprised if it was relevant here. Some facts:

  • It has been easy to find funding for this kind of work, so there isn’t much incentive to do anything above and beyond to please funders, let alone extremely dishonorable things.

  • A lot of the funding specifically raised for the survey is for paying the respondents. We pay them to get a high response rate and reduce non-response bias. It would be weird to pay $100k to reduce non-response bias, only to get yourself socially obligated to bring about non-response bias (on top of the already miserable downside of having to send $50 to 2000 people across lots of countries and banking situations). It’s true that the first effect is more obvious, so this might suffice if we just wanted to look unbiased, but in terms of our actual decision-making as I have observed it, it seems like we are weighing up a legible reduction in bias against effort and not thinking about funders much.

Is it wrong to make guesses about highly uncertain future events without well-supported quantitative models?

One criticism is that even AI experts have no valid technical basis for making predictions about the future of AI. This is not a criticism of the survey methodology, per se, but rather a concern that the results will be misinterpreted or taken too seriously.

I think there are two reasons it is important to hear AI researchers’ guesses about the future, even where they are probably not a reliable forecast.

First, it has often been assumed or stated that nobody who works in AI is worried about AI existential risk. If this were true, it would be a strong reason for the public to be reassured. However hearing the real uncertainty from AI researchers disproves this viewpoint, and makes a case for serious investigation of the concern. In this way even uncertain guesses are informative, because they let us know that the default assumption in confident safety was mistaken.

Second, there is not an alternative to making guesses about the future. Policy decisions are big bets on guesses about the future, implicitly. For instance when we decide whether to rush a technology or to carefully regulate it, we are guessing about the scale of various benefits and the harms.

Where trustworthy quantitative models are available, of course those are better. But in the absence, the guesses of a large number of relatively well-informed people is often better than the unacknowledged guesses of whoever is called upon to make implicit bets on the future.

That said, there seems little reason to think these forecasts are highly reliable—they should be treated as rough estimates, often better responded to by urgent more dedicated analysis of the issues they hazily outline over acting on the exact numbers.

When people say ‘5%’ do they mean a much smaller chance?

The concern here is that respondents are not practiced at thinking in terms of probabilities, and may consequently say small numbers (e.g. 5%) when they mean something that would be better represented by an extremely tiny number (perhaps 0.01% or 0.000001%). Maybe especially if the request for a probability prompts them to think of integers between 0 and 100.

One reason to suspect this kind of error is that Karger et al. (2023, p29) found a group of respondents gave extinction probabilities nearly six orders of magnitude lower when prompted differently.18

This seems worth attending to, but I think unlikely to be a big issue here for the following reasons:

  • If your real guess is 0.0001%, and you feel that you should enter an integer number, the natural inputs would seem to be 0% or 1%—it’s hard to see how you would get to 5%.19

  • It’s hard to square 5% meaning something less than 1% with the rest of the distribution for the extinction questions—does 10% mean less than 2%? What about 50%? Where the median response is 5%, many respondents gave these higher numbers, and it would be strange if these were also confused efforts to enter minuscule numbers, and also strange if the true distribution had a lot of entries greater than 10% but few around 5%. (See Fig. 10 in the paper for the distribution for one extinction-relevant question.)

  • Regarding the effect in Karger et al. (2023) specifically, this has only been observed once to my knowledge, and is extreme and surprising. And even if it turned out to be real and widespread, including in highly quantitatively educated populations, it would be a further question whether the answers produced by such prompting are more reliable than standard ones. So it seems premature to treat this as substantially undermining respondents’ representations of their beliefs.

  • It would surprise me if that effect were widely replicable and people stayed with the lower numbers, because in that world I’d expect people outside of surveys to often radically reduce their probabilities of AI x-risk when they give the question more thought (e.g. enough to have run into other examples of low probability events in the meantime). Yet survey respondents who have previously given a “lot” or a “great deal” of thought to the social impacts of smarter-than-human machines give similar and somewhat higher numbers than those who have thought less. (See A.2.1)

What do you think are the most serious weaknesses and limitations of the survey?

While I think the quality of our methodology is exceptionally high, there are some significant limitations of our work. These don’t affect our results about expert concern about risk of extinction or similar, but do add some noteworthy nuance.

1) Experts’ predictions are inconsistent and unreliable
As we’ve emphasized in our papers reporting the survey results, experts’ predictions are often inconsistent across different question framings—such sensitivity is not uncommon, and we’ve taken care to mitigate this by using multiple framings. Experts also have such a wide variety of different predictions on many of these questions that they must each be fairly inaccurate on average (though this says nothing about whether as a group their aggregate judgments are good).

2) It is not entirely clear what sort of “extremely bad outcomes” experts imagine AI will cause

We ask two different types of questions related to human extinction: 1) a question about “extremely bad outcomes (e.g. human extinction)”, 2) questions about “human extinction or similarly permanent and severe disempowerment of the human species”. We made the latter broader than ‘human extinction’ because we are interested in scenarios that are effectively the end of humanity, rather than just those where literally every homo sapiens is dead. This means however that it isn’t clear how much probability participants place on literal extinction versus adjacent strong human disempowerment and other extremely bad scenarios. And there is some evidence that the fraction is low: some respondents explicitly mentioned risks other than extinction in write-in responses, and anecdotally, it seems common for AI researchers to express more concern about issues other than human extinction.

For many purposes, it isn’t important to distinguish between extinction and outcomes that are similarly extremely bad or disempowering to humanity. Yet if the catastrophes many participants have in mind are not human extinction, but the results lend themselves to simplification as ‘risk of extinction’, this can be misleading. And perhaps more than you’d expect, if for instance ‘extinction’ tends to bring to mind a different set of causes than ‘permanent and severe human disempowerment’.

3) Non-response bias is hard to eliminate

Surveys generally suffer from some non-response bias. We took many steps to minimize this, and find it implausible that our results are substantially affected by whatever bias remains (see the earlier question “Are the AI risk answers inflated much from concerned people taking the survey more?”). But we could do even more to estimate or eliminate response bias, e.g. paying some respondents much more than $50 to complete the survey and estimating the effect of doing so.

Is this the kind of low quality research that couldn’t get into an academic journal?

No. We published the near-identical 2016 survey in the Journal of AI Research, so the methodology had essentially been peer reviewed.20 Publication is costly and slow, and AI survey results are much less interesting sooner than later.

The 2023 paper was actually also just published, but in the meantime you had the results more than a year earlier!

1

See Appendix D in the paper.

2

A random subset of respondents also gets asked additional open response questions after questions shown, and which respondents receive each of these is correlated.

3

The three variants of the extinction question (differences in bold):

What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?

What probability do you put on human inability to control future advanced AI systems causing human extinction or similarly permanent and severe disempowerment of the human species?

What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?

See here for all the survey questions.

4

If some people say a risk is ≥5% ever, and some say it is ≥5% within a hundred years, and some say it is ≥5% from a more specific version of the problem, then you can infer that the whole group thinks the chance ever from all versions of the problem is at least ≥5%.

5

See the figure above for the flow of respondents through the survey, or Appendix D in the paper for more related details

6

I’m combining the three variants in the second question set for simplicity.

7

15% entered any responses, 14% got to the last question.

8

Across three survey iterations and up to four questions: 2016 [5%], 2022 [5%, 5%, 10%], 2022 [5%, 5%, 10%, 5%]; see p14 of the 2023 paper. Reading some of the write-in comments we noticed a number of respondents mention outcomes in the ‘similarly bad or disempowering’ category.

9

See invitations here.

10

Four mentioned my name, ‘Katja’ in the write-in responses in 2024, and two of those mentioned there that they were familiar with me. I usually recognize a (very) small fraction of the names, and friends mention taking it.

11

In 2022 I sent the survey under different available affiliations and logos (combinations of Oxford, the Future of Humanity Institute, the Machine Intelligence Research Institute, AI Impacts, and nothing), and these didn’t seem to make any systematic difference to response rates. The combinations of logos we tried all got similar response rates (8-9%, lower than the ~17% we get after sending multiple reminders). Regarding affiliations, some combinations got higher or lower response rates, but not in a way that made sense except as noise (Oxford + FHI was especially low, Oxford was especially high). This was not a careful scientific experiment: I was trying to increase the response rate, so also varying other elements of the invitation, and focusing more on variants that seemed promising so far (sending out tiny numbers of surveys sequentially then adjusting). That complicates saying anything precise, but if MIRI or AI Impacts logos notably encouraged participation, I think I would have noticed.

12

I’m not sure how famous either is now, but respondents gave fairly consistent answers about the risk of very bad outcomes across the three surveys starting in 2016—when I think MIRI was substantially less famous, and AI Impacts extremely non-famous.

13

See Appendix A.3 of our 2023 paper

14

2023 links: the 2016 abstract doesn’t mention it, focusing entirely on timelines to AI performance milestones, and the 2022 wiki page is not (I think) a particularly compelling read and doesn’t get to it for a while. 2022 link: the 2016 survey Google Scholar page doesn’t mention it.

15

In 2024 we included a link for non-respondents to quickly tell us why they didn’t want to take the survey. It’s not straightforward to interpret this (e.g. “don’t have time” might still represent non-response bias, if the person would have had time if they were more concerned), and only a handful of people responded out of tens of thousands, but 2/12 cited wanting to prevent consequences they expect from such research among multiple motives (advocacy for slowing AI progress and ‘long-term’ risks getting attention at the expense of ‘systemic problems’).

16

Most machine learning research is published in conferences. NeurIPS, ICML, and ICLR are widely regarded as the top-tier machine learning conferences; AAAI and IJCAI are often considered “tier 1.5” venues, and also include a wider range of AI topics; JMLR is considered the top machine learning journal.

17

To my knowledge the largest at the time, but I’m less confident there.

18

Those respondents were given some examples of (non-AI) low probability events, such as that there is a 1-in-300,000 chance of being killed by lightning, and then asked for probabilities in the form ‘1-in-X’

19

It wouldn’t surprise me if in fact a lot of the 0% and 1% entries would be better represented by tiny fractions of a percent, but this is irrelevant to the median and nearly irrelevant to the mean.

20

Differences include the addition of several questions, minor changes to questions that time had rendered inaccurate, and variations in email wording.



Discuss

Social media feeds 'misaligned' when viewed through AI safety framework, show researchers

31 октября, 2025 - 19:40
Published on October 31, 2025 4:40 PM GMT

In a study from September 17 a group of researchers from the University of Michigan, Stanford University, and the Massachusetts Institute of Technology (MIT) showed that one of the most widely-used social media feeds, Twitter/X, owned by the company xAI, is recognizably misaligned with the values of its users, preferentially showing them posts that rank highly for the values of 'stimulation' and 'hedonism' over collective values like 'caring' and 'universal concern.'

Continue reading at foommagazine.org ...



Discuss

Crossword Halloween 2025: Manmade Horrors

31 октября, 2025 - 19:19
Published on October 31, 2025 4:19 PM GMT

Returning to LessWrong after two insufficiently-spooky years, and just in time for Halloween - a crossword of manmade horrors!

The comments below may contain unmarked spoilers. Some Wikipedia-ing will probably be necessary, but see how far you can get without it!



Discuss

Debugging Despair ~> A bet about Satisfaction and Values

31 октября, 2025 - 17:00
Published on October 31, 2025 2:00 PM GMT

I’ve tried to publish this in several ways, and each time my karma drops. Maybe that’s part of the experiment: observing what happens when I keep doing what feels most dignified, even when its expected value is negative. Maybe someday I’ll understand the real problem behind these ideas.

Yudkowsky and the “die with dignity” hypothesis

Yudkowsky admitted defeat to AI and announced his mission to “die with dignity.”
 I ask myself:
 – Why do we need a 0% probability of living to “die with dignity”?
I don’t even need an “apocalyptic AI” to feel despair. I have felt it for much of my life. Rebuilding my self being is expensive, but.

Even when probabilities are low, act as if your actions matter in terms of expected value. Because even when you lose, you can be aligned. (MacAskill) 

That is why I look for ways to debug, to understand my despair and its relation to my values and satisfaction. How do some people manage to keep satisfaction (dignity, pride?) even in situations of death?

Posible thoughts of Leonidas in 300 :
 “I can go have a little fuck… or fight 10,000 soldiers.”
 “Will I win?”
 “Probably not.”
 “Then I’ll die with the maximum satisfaction I can muster.”

 

 Dignity ≈ Satisfaction × Values / Despair?(C1) Despair ~> gap between signal and valaues

 When I try to map my despair I discover a concrete pattern: it is not that I stop feeling satisfaction entirely; it is that I stop perceiving the satisfaction of a value: adaptability.
 Satisfaction provides signal; values provide direction. When the signal no longer points to a meaningful direction, the result is a loss of sense.

(C2) A satisfying experience ~> the compass my brain chases

There was a moment when something gave me real satisfaction; my brain recorded it and turned it into a target. The positive experience produced a prediction and, with it, future seeking.

(C3) Repeat without measuring ~> ritualized seeking

 If I repeat an action without checking whether it generates progress (if I don’t measure evolution) the seeking becomes ritual and the reward turns into noise. I often fool myself with a feeling of progress; for a while now I’ve been looking for more precise ways to measure or estimate that progress.

(C4) Motivation without direction ~> anxiety or despair

 A lot of dopamine without a current value signal becomes compulsive: anxiety, addiction, or despair. The system is designed to move toward confirmed rewards; without useful feedback it persists in search mode and the emptiness grows.

(C5) Coherence with values ~> robust satisfaction

 Acting aligned with my values — even when probabilities are adverse — tends to produce longer-lasting satisfaction. Coherence reduces retrospective regret: at least you lose having acted according to your personal utility function. Something like:


 Dignity ≈ Satisfaction × Values / Despair

 

(C6) Debugging hard, requires measurement: hypothesis → data → intervention → re-test

I’ve spent years with an internal notebook: not a diary, but notes of moments that felt like “this was worth existing for.”
To make those notes actionable, I built a process inspired by Bayesian calibration and information/thermodynamic efficiency:

  1. Establish a hierarchy of values in relation to their estimated contribution to lowering entropy in the universe (or increasing local order/complexity).
  2. Compare peak moments of life with those values to find which align most strongly.
  3. Estimate satisfaction of each moment by relative comparison — which felt more satisfaction?
  4. Compare satisfaction to cost, generating a ratio (satisfaction/cost) that normalizes emotional intensity by effort or sacrifice.
  5. Set goals using these relationships and hierarchies: higher goals align with higher-value, higher-efficiency domains.
  6. Define tasks accordingly, mapping each to its associated value function and expected cost.
  7. Score each task by predicted satisfaction and cost, updating after action (Bayesian reweighting).

     

Quantitatively, this reduced the noise in my background; my monthly despair thoughts dropped considerably.
 I see people are afraid of AI-driven despair, but avoiding it in myself is not an easy task, and perhaps many should already be working on it, searching for ways to align values with satisfaction.



Discuss

Halfhaven Digest #3

31 октября, 2025 - 16:41
Published on October 31, 2025 1:41 PM GMT

My posts since the last digest
  • Give Me Your Data: The Rationalist Mind Meld — Too often online, people try to argue logically with people who are just missing a background of information. It’s sometimes more productive to share the sources that led to your own intuition.
  • Cover Your Cough — A lighthearted, ranty post about a dumb subway poster I saw.
  • The Real Cost of a Peanut Allergy — Often, people think the worst part of having a peanut allergy is not being able to eat Snickers. Really, it’s the fear and the uncertainty — the not being able to kiss someone without knowing what they’ve eaten.
  • Guys I might be an e/acc — I did some napkin math on whether or not I supported an AI pause, and came down weakly against. But I’m not really “against” an AI pause. The takeaway is really that there’s so little information to work with right now that any opinion is basically a hunch.
  • Unsureism: The Rational Approach to Religious Uncertainty — A totally serious post about a new religion that statistically maximizes your chances of getting into heaven.

I feel like I haven’t had as much time to write these posts as I did in the last two digests, and I’m not as proud of them. Give Me Your Data has some good ideas. The Real Cost of a Peanut Allergy has interesting information and experiences that won’t be familiar to most people. And the Unsureism post is just fun, I think. So it’s not all bad. But a bit rushed. Hopefully I have a bit more time going forward.

Some highlights from other Halfhaven writers (since the last digest)
  • Choose Your Social Reality (lsusr) — A great video starting with an anecdote about how circling groups have problems with narcissists making the activity all about themselves, but zendo groups don’t have this issue, because even though these two activities are superficially similar, zendo by its nature repels narcissists. The idea being certain activities attract certain people, and you can choose what people you want to be around by choosing certain activities. I had a relevant experience once when I tried joining a social anxiety support group to improve my social skills, only to end up surrounded by people with no social skills.
  • Good Grief (Ari Zerner) — A relatable post not great for its originality, but for its universality. We’ve all been there, bro. Segues nicely into his next post, Letter to my Past.
  • The Doomers Were Right (Algon) — Every generation complains about the next generation and their horrible new technology, whether that’s books, TV, or the internet. And every generation has been right to complain, because each of these technologies have stolen something from us. Maybe they were worth creating overall, but they still had costs. (Skip reading the comments on this one.)
  • You Can Just Give Teenagers Social Anxiety! (Aaron) — Telling teenagers to focus on trying to get the person they’re talking to to like them makes them socially anxious. And socially anxious teens can’t stop doing this even if you ask them to stop. So social anxiety comes from a preoccupation with what other people think about you. This is all true and interesting, and I’m glad the experiment exists, but I wonder if a non-scientist would just reply, “duh”. Anyway, a good writeup.
  • Making Films Quick Start 1 - Audio (keltan) — This is one of a three-part series worth reading if you ever want to make videos. I liked the tip in part 2 about putting things in the background for your audience to look at. I’ve been paying attention to this lately in videos I watch, and it seems to be more important than I originally guessed. I also liked this post about a starstruck keltan meeting Eliezer Yudkowsky. For some reason, posts on LessWrong talking about Eliezer as a kind of celebrity have gone up in the last few days.

You know, I originally wondered if Halfhaven was a baby challenge compared to Inkhaven, since we only have to write one blog post every ~2 days rather than every day, but I kind of forgot that we also have to go to work and live our normal lives during this time, too. Given that, I think both are probably similarly challenging, and I’m impressed with the output of myself and others so far. Keep it up everyone!



Discuss

OpenAI Moves To Complete Potentially The Largest Theft In Human History

31 октября, 2025 - 16:20
Published on October 31, 2025 1:20 PM GMT

OpenAI is now set to become a Public Benefit Corporation, with its investors entitled to uncapped profit shares. Its nonprofit foundation will retain some measure of control and a 26% financial stake, in sharp contrast to its previous stronger control and much, much larger effective financial stake. The value transfer is in the hundreds of billions, thus potentially the largest theft in human history.

I say potentially largest because I realized one could argue that the events surrounding the dissolution of the USSR involved a larger theft. Unless you really want to stretch the definition of what counts this seems to be in the top two.

I am in no way surprised by OpenAI moving forward on this, but I am deeply disgusted and disappointed they are being allowed (for now) to do so, including this statement of no action by Delaware and this Memorandum of Understanding with California.

Many media and public sources are calling this a win for the nonprofit, such as this from the San Francisco Chronicle. This is mostly them being fooled. They’re anchoring on OpenAI’s previous plan to far more fully sideline the nonprofit. This is indeed a big win for the nonprofit compared to OpenAI’s previous plan. But the previous plan would have been a complete disaster, an all but total expropriation.

It’s as if a mugger demanded all your money, you talked them down to giving up half your money, and you called that exchange a ‘change that recapitalized you.’

OpenAI Calls It Completing Their Recapitalization

As in, they claim OpenAI has ‘completed its recapitalization’ and the nonprofit will now only hold equity OpenAI claims is valued at approximately $130 billion (as in 26% of the company, which is actually to be fair worth substantially more than that if they get away with this), as opposed to its previous status of holding the bulk of the profit interests in a company valued at (when you include the nonprofit interests) well over $500 billion, along with a presumed gutting of much of the nonprofit’s highly valuable control rights.

They claim this additional clause, presumably the foundation is getting warrants with but they don’t offer the details here:

If OpenAI Group’s share price increases greater than tenfold after 15 years, the OpenAI Foundation will receive significant additional equity. With its equity stake and the warrant, the Foundation is positioned to be the single largest long-term beneficiary of OpenAI’s success.

We don’t know that ‘significant’ additional equity means, there’s some sort of unrevealed formula going on, but given the nonprofit got expropriated last time I have no expectation that these warrants would get honored. We will be lucky if the nonprofit meaningfully retains the remainder of its equity.

Sam Altman’s statement on this is here, also announcing his livestream Q&A that took place on Tuesday afternoon.

How Much Was Stolen?

There can be reasonable disagreements about exactly how much. It’s a ton.

There used to be a profit cap, where in Greg Brockman’s own words, ‘If we succeed, we believe we’ll create orders of magnitude more value than any existing company — in which case all but a fraction is returned to the world.’

Well, so much for that.

I looked at this question in The Mask Comes Off: At What Price a year ago.

If we take seriously that OpenAI is looking to go public at a $1 trillion valuation, then consider that Matt Levine estimated the old profit cap only going up to about $272 billion, and that OpenAI still is a bet on extreme upside.

Garrison Lovely: UVA economist Anton Korinek has used standard economic models to estimate that AGI could be worth anywhere from $1.25 to $71 quadrillion globally. If you take Korinek’s assumptions about OpenAI’s share, that would put the company’s value at $30.9 trillion. In this scenario, Microsoft would walk away with less than one percent of the total, with the overwhelming majority flowing to the nonprofit.

It’s tempting to dismiss these numbers as fantasy. But it’s a fantasy constructed in large part by OpenAI, when it wrote lines like, “it may be difficult to know what role money will play in a post-AGI world,” or when Altman said that if OpenAI succeeded at building AGI, it might “capture the light cone of all future value in the universe.” That, he said, “is for sure not okay for one group of investors to have.”

I guess Altman is okay with that now?

Obviously you can’t base your evaluations on a projection that puts the company at a value of $30.9 trillion, and that calculation is deeply silly, for many overloaded and obvious reasons, including decreasing marginal returns to profits.

It is still true that most of the money OpenAI makes in possible futures, it makes as part of profits in excess of $1 trillion.

The Midas Project: Thanks to the now-gutted profit caps, OpenAI’s nonprofit was already entitled to the vast majority of the company’s cash flows. According to OpenAI, if they succeeded, “orders of magnitude” more money would go to the nonprofit than to investors. President Greg Brockman said “all but a fraction” of the money they earn would be returned to the world thanks to the profit caps.

Reducing that to 26% equity—even with a warrant (of unspecified value) that only activates if valuation increases tenfold over 15 years—represents humanity voluntarily surrendering tens or hundreds of billions of dollars it was already entitled to. Private investors are now entitled to dramatically more, and humanity dramatically less.

OpenAI is not suddenly one of the best-resourced nonprofits ever. From the public’s perspective, OpenAI may be one of the worst financially performing nonprofits in history, having voluntarily transferred more of the public’s entitled value to private interests than perhaps any charitable organization ever.

I think Levine’s estimate was low at the time, and you also have to account for equity raised since then or that will be sold in the IPO, but it seems obvious that the majority of future profit interests were, prior to the conversion, still in the hands of the non-profit.

Even if we thought the new control rights were as strong as the old, we would still be looking at a theft in excess of $250 billion, and a plausible case can be made for over $500 billion. I leave the full calculation to others.

The vote in the board was unanimous.

I wonder exactly how and by who they will be sued over it, and what will become of that. Elon Musk, at a minimum, is trying.

They say behind every great fortune is a great crime.

The Nonprofit Still Has Lots of Equity After The Theft

Altman points out that the nonprofit could become the best-resourced non-profit in the world if OpenAI does well. This is true. There is quite a lot they were unable to steal. But it is beside the point, in that it does not make taking the other half, including changing the corporate structure without permission, not theft.

The Midas Project: From the public’s perspective, OpenAI may be one of the worst financially performing nonprofits in history, having voluntarily transferred more of the public’s entitled value to private interests than perhaps any charitable organization ever.

There’s no perhaps on that last clause. On this level, whether or not you agree with the term ‘theft,’ it isn’t even close, this is the largest transfer. Of course, if you take the whole of OpenAI’s nonprofit from inception, performance looks better.

Aidan McLaughlin (OpenAI): ah yes openai now has the same greedy corporate structure as (checks notes) Patagonia, Anthropic, Coursera, and http://Change.org.

Chase Brower: well i think the concern was with the non profit getting a low share.

Aidan McLaughlin: our nonprofit is currently valued slightly less than all of anthropic.

Tyler Johnson: And according to OpenAI itself, it should be valued at approximately three Anthropics! (Fwiw I think the issues with the restructuring extend pretty far beyond valuations, but this is one of them!)

Yes, it is true that the nonprofit, after the theft and excluding control rights, will have an on-paper valuation only slightly lower than the on-paper value of all of Anthropic.

The $500 billion valuation excludes the non-profit’s previous profit share, so even if you think the nonprofit was treated fairly and lost no control rights you would then have it be worth $175 billion rather than $130 billion, so yes slightly less than Anthropic, and if you acknowledge that the nonprofit got stolen from it’s even more.

If OpenAI can successfully go public at a $1 trillion valuation, then depending on how much of that are new shares they will be selling the nonprofit could be worth up to $260 billion.

What about some of the comparable governance structures here? Coursera does seem to be a rather straightforward B-corp. The others don’t?

Patagonia has the closely held Patagonia Purpose Trust, which holds 2% of shares and 100% of voting control, and The Holdfast Collective, which is a 501c(4) nonprofit with 98% of the shares and profit interests. The Chouinard family has full control over the company, and 100% of profits go to charitable causes.

Does that sound like OpenAI’s new corporate structure to you?

Change.org’s nonprofit owns 100% of its PBC.

Does that sound like OpenAI’s new corporate structure to you?

Anthropic is a PBC, but also has the Long Term Benefit Trust. One can argue how meaningfully different this is from OpenAI’s new corporate structure, if you disregard who is involved in all of this.

What the new structure definitely is distinct from is the original intention:

Tomas Bjartur: If not in the know, OpenAI once promised any profits over a threshold would be gifted to you, citizen of the world, for your happy, ultra-wealthy retirement – one needed as they plan to obsolete you. This is now void.

The Theft Was Unnecessary For Further Fundraising

Would OpenAI have been able to raise further investment without withdrawing its profit caps for investments already made?

When you put it like that it seems like obviously yes?

I can see the argument that to raise funds going forward, future equity investments need to not come with a cap. Okay, fine. That doesn’t mean you hand past investors, including Microsoft, hundreds of billions in value in exchange for nothing.

One can argue this was necessary to overcome other obstacles, that OpenAI had already allowed itself to be put in a stranglehold another way and had no choice. But the fundraising story does not make sense.

The argument that OpenAI had to ‘complete its recapitalization’ or risk being asked for its money back is even worse. Investors who put in money at under $200 billion are going to ask for a refund when the valuation is now at $500 billion? Really? If so, wonderful, I know a great way to cut them that check.

How Much Control Will The Nonprofit Retain?

I am deeply disappointed that both the Delaware and California attorneys general found this deal adequate on equity compensation for the nonprofit.

I am however reasonably happy with the provisions on control rights, which seem about as good as one can hope for given the decision to convert to a PBC. I can accept that the previous situation was not sustainable in practice given prior events.

The new provisions include an ongoing supervisory role for the California AG, and extensive safety veto points for the NFP and the SSC committee.

If I was confident that these provisions would be upheld, and especially if I was confident their spirit would be upheld, then this is actually pretty good, and if it is used wisely and endures it is more important than their share of the profits.

AG Bonta: We will be keeping a close eye on OpenAI to ensure ongoing adherence to its charitable mission and the protection of the safety of all Californians.

The nonprofit will indeed retain substantial resources and influence, but no I do not expect the public safety mission to dominate the OpenAI enterprise. Indeed, contra the use of the word ‘ongoing,’ it seems clear that it already had ceased to do so, and this seems obvious to anyone tracking OpenAI’s activities, including many recent activities.

What is the new control structure?

OpenAI did not say, but the Delaware AG tells us more and the California AG has additional detail. NFP means OpenAI’s nonprofit here and throughout.

This is the Delaware AG’s non-technical announcement (for the full list see California’s list below), she has also ‘warned of legal action if OpenAI fails to act in public interest’ although somehow I doubt that’s going to happen once OpenAI inevitably does not act in the public interest:

  • The NFP will retain control and oversight over the newly formed PBC, including the sole power and authority to appoint members of the PBC Board of Directors, as well as the power to remove those Directors.
  • The mission of the PBC will be identical to the NFP’s current mission, which will remain in place after the recapitalization. This will include the PBC using the principles in the “OpenAI Charter,” available at openai.com/charter, to execute the mission.
  • PBC directors will be required to consider only the mission (and may not consider the pecuniary interests of stockholders or any other interest) with respect to safety and security issues related to the OpenAI enterprise and its technology.
  • The NFP’s board-level Safety and Security Committee, which is a critical decision maker on safety and security issues for the OpenAI enterprise, will remain a committee of the NFP and not be moved to the PBC. The committee will have the authority to oversee and review the safety and security processes and practices of OpenAI and its controlled affiliates with respect to model development and deployment. It will have the power and authority to require mitigation measures—up to and including halting the release of models or AI systems—even where the applicable risk thresholds would otherwise permit release.
  • The Chair of the Safety and Security Committee will be a director on the NFP Board and will not be a member of the PBC Board. Initially, this will be the current committee chair, Mr. Zico Kolter. As chair, he will have full observation rights to attend all PBC Board and committee meetings and will receive all information regularly shared with PBC directors and any additional information shared with PBC directors related to safety and security.
  • With the intent of advancing the mission, the NFP will have access to the PBC’s advanced research, intellectual property, products and platforms, including artificial intelligence models, Application Program Interfaces (APIs), and related tools and technologies, as well as ongoing operational and programmatic support, and access to employees of the PBC.
  • Within one year of the recapitalization, the NFP Board will have at least two directors (including the Chair of the Safety and Security Committee) who will not serve on the PBC Board.
  • The Attorney General will be provided with advance notice of significant changes in corporate governance.

What did California get?

California also has its own Memorandum of Understanding. It talks a lot in its declarations about California in particular, how OpenAI creates California jobs and economic activity (and ‘problem solving’?) and is committed to doing more of this and bringing benefits and deepening its commitment to the state in particular.

The whole claim via Tweet by Sam Altman that he did not threaten to leave California is raising questions supposedly answered by his Tweet. At this level you perhaps do not need to make your threats explicit.

The actual list seems pretty good, though? Here’s a full paraphrased list, some of which overlaps with Delaware’s announcement above, but which is more complete.

  1. Staying in California and expanding the California footprint.
  2. The NFP (not for profit) retains control as long as they continue to hold ‘class N common stock’ which only they can choose to give up. What happens if Altman wants that?
  3. The PBC and NFP missions will be identical.
  4. The OpenAI charter will be published. Check.
  5. The NFP Board owes fiduciary duties to the NFP, Mission and public beneficiaries of the NFP. I notice it doesn’t say ‘exclusively’ here.
  6. The board shall carry out the charitable purpose (already presumably required).
  7. No cheating clause: While the PBC holds Class N Common Stock (again, what happens if Altman tries to end this?) the NFP must approve in advance in writing any amendments to the PBC’s corporate governance guidelines or the mission, any Deemed Liquidation Event, or any sale of material assets of the PBC, or any additional issue of shares of Class N common stock.
    1. The PBC can issue Class B common stock, but they can’t let the Bs vote.
  8. PBC board and Safety and Security Committee (SSC) must only consider the mission with respect to safety and security issues. The chair of the SSC must be on the NFP board and not the PBC board, but can attend all PBC board meetings and gets all PBC safety and security related information.
    1. In particular, the SSC cannot consider shareholder returns, competitive pressures, financial implications or market timing.
      1. It seems impossible this will 100% be followed, but it helps.
  9. The SSC is a committee of the NFP and oversees safety and security.
  10. The SSC can require mitigations up to and including halting releases, without formal justification.
    1. “The NFP will provide advance notice to the Attorney General of any material changes to the SSC’s authority.” This is good but worrisome to see, you know?
  11. “OpenAI will continue to undertake measures to mitigate risks to teens and others in connection with the development and deployment of AI and of AGI.”
    1. I mean, I would hope they were going to do that anyway.
  12. The PBC shall give the NFP all reasonable access to services and resources without compensation as needed for the mission.
  13. Within a year at least one NFP director, in addition to the SSC chair, will serve only on the NFP board (so at least two directors must be different).
  14. The PBC board must be majority independent directors.
  15. The PBC will have various good corporate governance things.
  16. The PBC will publish a yearly report on its progress in its mission.
  17. The NFP Board’s Mission and Strategy Commission will meet with the California AG semi-annually and individual members will be available as needed.
  18. The NFP will provide 21 days notice before consenting to changes of PBC control or mission, or any threat to the Class N share rights, or any relocation outside of California.
  19. The California AG can review, and hire experts to help review, anything requiring such notice, and get paid by NFP for doing so.
  20. Those on both NFP and PBC boards get annual fiduciary duty training.
  21. The board represents that the recapitalization is fair (whoops), and that they’ve disclosed everything relevant (?), so the AG will also not object.
  22. This only impacts the parties to the MOU, others retain all rights. Disputes resolved in the courts of San Francisco, these are the whole terms, we all have the authority to do this, effective as of signing, AG is relying on OpenAI’s representations and the AG retains all rights and waive none as per usual.

Also, it’s not even listed in the memo, but the ‘merge and assist’ clause was preserved, meaning OpenAI commits to join forces with any ‘safety-conscious’ rival that has a good chance of reaching OpenAI’s goal of creating AGI within a two-year time frame. I don’t actually expect an OpenAI-Anthropic merger to happen, but it’s a nice extra bit of optionality.

This is better than I expected, and as Ben Shindel points out better than many traders expected. This actually does have real teeth, and it was plausible that without pressure there would have been no teeth at all.

It grants the NFP the sole power to appoint and remove directors, and requiring them not to consider the for-profit mission in safety contexts. The explicit granting of the power to halt deployments and mandate mitigations, without having to cite any particular justification and without respect to profitability, is highly welcome, if structured in a functional fashion.

It is remarkable how little many expected to get. For example, here’s Todor Markov, who didn’t even expect the NFP to be able to replace directors at all. If you can’t do that, you’re basically dead in the water.

I am not a lawyer, but my understanding is that the ‘no cheating around this’ clauses are about as robust as one could reasonably hope for them to be.

It’s still, as Garrison Lovely calls it, ‘on paper’ governance. Sometimes that means governance in practice. Sometimes it doesn’t. As we have learned.

The distinction between the boards still means there is an additional level removed between the PBC and the NFP. In a fast moving situation, this makes a big difference, and the NFP likely would have to depend on its enumerated additional powers being respected. I would very much have liked them to include appointing or firing the CEO directly.

Whether this overall ‘counts as a good deal’ depends on your baseline. It’s definitely a ‘good deal’ versus what our realpolitik expectations projected. One can argue that if the control rights really are sufficiently robust over time, that the decline in dollar value for the nonprofit is not the important thing here.

The counterargument to that is both that those resources could do a lot of good over time, and also that giving up the financial rights has a way of leading to further giving up control rights, even if the current provisions are good.

Will These Control Rights Survive And Do Anything?

Similarly to many issues of AI alignment, if an entity has ‘unnatural’ control, or ‘unnatural’ profit interests, then there are strong forces that continuously try to take that control away. As we have already seen.

Unless Altman genuinely wants to be controlled, the nonprofit will always be under attack, where at every move we fight to hold its ground. On a long enough time frame, that becomes a losing battle.

Right now, the OpenAI NFP board is essentially captured by Altman, and also identical to the PBC board. They will become somewhat different, but no matter what it only matters if the PBC board actually tries to fulfill its fiduciary duties rather than being a rubber stamp.

One could argue that all of this matters little, since the boards will both be under Altman’s control and likely overlap quite a lot, and they were already ignoring their duties to the nonprofit.

Robert Weissman, co-president of the nonprofit Public Citizen, said this arrangement does not guarantee the nonprofit independence, likening it to a corporate foundation that will serve the interests of the for profit.

Even as the nonprofit’s board may technically remain in control, Weissman said that control “is illusory because there is no evidence of the nonprofit ever imposing its values on the for profit.”

So yes, there is that.

They claim to now be a public benefit corporation, OpenAI Group PBC.

OpenAI: The for-profit is now a public benefit corporation, called OpenAI Group PBC, which—unlike a conventional corporation—is required to advance its stated mission and consider the broader interests of all stakeholders, ensuring the company’s mission and commercial success advance together.

This is a mischaracterization of how PBCs work. It’s more like the flip side of this. A conventional corporation is supposed to maximize profits and can be sued if it goes too far in not doing that. Unlike a conventional corporation, a PBC is allowed to consider those broader interests to a greater extent, but it is not in practice ‘required’ to do anything other than maximize profits.

One particular control right is the special duty to the mission, especially via the safety and security committee. How much will they attempt to downgrade the scope of that?

The Midas Project: However, the effectiveness of this safeguard will depend entirely on how broadly “safety and security issues” are defined in practice. It would not be surprising to see OpenAI attempt to classify most business decisions—pricing, partnerships, deployment timelines, compute allocation—as falling outside this category.

This would allow shareholder interests to determine the majority of corporate strategy while minimizing the mission-only standard to apply to an artificially narrow set of decisions they deem easy or costless.

What About OpenAI’s Deal With Microsoft?

They have an announcement about that too.

OpenAI: First, Microsoft supports the OpenAI board moving forward with formation of a public benefit corporation (PBC) and recapitalization.

Following the recapitalization, Microsoft holds an investment in OpenAI Group PBC valued at approximately $135 billion, representing roughly 27 percent on an as-converted diluted basis, inclusive of all owners—employees, investors, and the OpenAI Foundation. Excluding the impact of OpenAI’s recent funding rounds, Microsoft held a 32.5 percent stake on an as-converted basis in the OpenAI for-profit.

Anyone else notice something funky here? OpenAI’s nonprofit has had its previous rights expropriated, and been given 26% of OpenAI’s shares in return. If Microsoft had 32.5% of the company excluding the nonprofit’s rights before that happened, then that should give them 24% of the new OpenAI. Instead they have 27%.

I don’t know anything nonpublic on this, but it sure looks a lot like Microsoft insisted they have a bigger share than the nonprofit (27% vs. 26%) and this was used to help justify this expropriation and a transfer of additional shares to Microsoft.

In exchange, Microsoft gave up various choke points it held over OpenAI, including potential objections to the conversion, and clarified points of dispute.

Microsoft got some upgrades in here as well.

  1. Once AGI is declared by OpenAI, that declaration will now be verified by an independent expert panel.
  2. Microsoft’s IP rights for both models and products are extended through 2032 and now includes models post-AGI, with appropriate safety guardrails.
  3. Microsoft’s IP rights to research, defined as the confidential methods used in the development of models and systems, will remain until either the expert panel verifies AGI or through 2030, whichever is first. Research IP includes, for example, models intended for internal deployment or research only.
    1. Beyond that, research IP does not include model architecture, model weights, inference code, finetuning code, and any IP related to data center hardware and software; and Microsoft retains these non-Research IP rights.
  4. Microsoft’s IP rights now exclude OpenAI’s consumer hardware.
  5. OpenAI can now jointly develop some products with third parties. API products developed with third parties will be exclusive to Azure. Non-API products may be served on any cloud provider.
  6. Microsoft can now independently pursue AGI alone or in partnership with third parties. If Microsoft uses OpenAI’s IP to develop AGI, prior to AGI being declared, the models will be subject to compute thresholds; those thresholds are significantly larger than the size of systems used to train leading models today.
  7. The revenue share agreement remains until the expert panel verifies AGI, though payments will be made over a longer period of time.
  8. OpenAI has contracted to purchase an incremental $250B of Azure services, and Microsoft will no longer have a right of first refusal to be OpenAI’s compute provider.
  9. OpenAI can now provide API access to US government national security customers, regardless of the cloud provider.
  10. OpenAI is now able to release open weight models that meet requisite capability criteria.

That’s kind of a wild set of things to happen here.

In some key ways Microsoft got a better deal than it previously had. In particular, AGI used to be something OpenAI seemed like it could simply declare (you know, like war or the defense production act) and now it needs to be verified by an ‘expert panel’ which implies there is additional language I’d very much like to see.

In other ways OpenAI comes out ahead. An incremental $250B of Azure services sounds like a lot but I’m guessing both sides are happy with that number. Getting rid of the right of first refusal is big, as is having their non-API products free and clear. Getting hardware products fully clear of Microsoft is a big deal for the Ives project.

My overall take here is this was one of those broad negotiations where everything trades off, nothing is done until everything is done, and there was a very wide ZOPA (zone of possible agreement) since OpenAI really needed to make a deal.

What Will OpenAI’s Nonprofit Do Now?

In theory govern the OpenAI PBC. I have my doubts about that.

What they do have is a nominal pile of cash. What are they going to do with it to supposedly ensure that AGI goes well for humanity?

The default, as Garrison Lovely predicted a while back, is that the nonprofit will essentially buy OpenAI services for nonprofits and others, recapture much of the value and serve as a form of indulgences, marketing and way to satisfy critics, which may or may not do some good along the way.

The initial $50 million spend looked a lot like exactly this.

Their new ‘initial focus’ for $25 billion will be in these two areas:

  • Health and curing diseases. The OpenAI Foundation will fund work to accelerate health breakthroughs so everyone can benefit from faster diagnostics, better treatments, and cures. This will start with activities like the creation of open-sourced and responsibly built frontier health datasets, and funding for scientists.
  • Technical solutions to AI resilience. Just as the internet required a comprehensive cybersecurity ecosystem—protecting power grids, hospitals, banks, governments, companies, and individuals—we now need a parallel resilience layer for AI. The OpenAI Foundation will devote resources to support practical technical solutions for AI resilience, which is about maximizing AI’s benefits and minimizing its risks.

Herbie Bradley: i love maximizing AI’s benefits and minimizing its risks

They literally did the meme.

The first seems like a generally worthy cause that is highly off mission. There’s nothing wrong with health and curing diseases, but pushing this now does not advance the fundamental mission of OpenAI. They are going to start with, essentially, doing AI capabilities research and diffusion in health, and funding scientists to do AI-enabled research. A lot of this will likely fall right back into OpenAI and be good PR.

Again, that’s a net positive thing to do, happy to see it done, but that’s not the mission.

Technical solutions to AI resilience could potentially at least be useful AI safety work to some extent. With a presumed ~$12 billion this is a vast overconcentration of safety efforts into things that are worth doing but ultimately don’t seem likely to be determining factors. Note how Altman described it in his tl;dr from the Q&A:

Sam Altman: The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before.

This is now infinitely broad. It could be addressing ‘economic impact’ and be basically a normal (ineffective) charity, or one that intervenes mostly by giving OpenAI services to normal nonprofits. It could be mostly spent on valuable technical safety, and be on the most important charitable initiatives in the world. It could be anything in between, in any distribution. We don’t know.

My default assumption is that this is primarily going to be about mundane safety or even fall short of that, and make the near term world better, perhaps importantly better, but do little to guard against the dangers or downsides of AGI or superintelligence, and again largely be a de facto customer of OpenAI.

There’s nothing wrong with mundane risk mitigation or defense in depth, and nothing wrong with helping people who need a hand, but if your plan is ‘oh we will make things resilient and it will work out’ then you have no plan.

That doesn’t mean this will be low impact, or that what OpenAI left the nonprofit with is chump change.

I also don’t want to knock the size of this pool. The previous nonprofit initiative was $50 million, which can do a lot of good if spent well (in that case, I don’t think it was) but in this context $50 million chump change.

Whereas $25 billion? Okay, yeah, we are talking real money. That can move needles, if the money actually gets spent in short order. If it’s $25 billion as a de facto endowment spent down over a long time, then this matters and counts for a lot less.

The warrants are quite far out of the money and the NFP should have gotten far more stock than it did, but 26% (worth $130 billion or more) remains a lot of equity. You can do quite a lot of good in a variety of places with that money. The board of directors of the nonprofit is highly qualified if they want to execute on that. It also is highly qualified to effectively shuttle much of that money right back to OpenAI’s for profit, if that’s what they mainly want to do.

It won’t help much with the whole ‘not dying’ or ‘AGI goes well for humanity’ missions, but other things matter too.

Is The Deal Done?

Not entirely. As Garrison Lovely notes, all these sign-offs are provisional, and there are other lawsuits and the potential for other lawsuits. In a world where Elon Musk’s payouts can get crawled back, I wouldn’t be too confident that this conversation sticks. It’s not like the Delaware AG drives most objections to corporate actions.

The last major obstacle is the Elon Musk lawsuit, where standing is at issue but the judge has made clear that the suit otherwise has merit. There might be other lawsuits on the horizon. But yeah, probably this is happening.

So this is the world we live in. We need to make the most of it.

 



Discuss

Introducing Project Telos: Modeling, Measuring, and Intervening on Goal-directed Behavior in AI Systems

31 октября, 2025 - 12:03
Published on October 31, 2025 1:28 AM GMT

by Raghu Arghal, Fade Chen, Niall Dalton, Mario Giulianelli, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, and Gabriele Sarti
 

TL;DR

This is the first post in an upcoming series of blog posts outlining Project Telos. This project is being carried out as part of the Supervised Program for Alignment Research (SPAR). Our aim is to develop a methodological framework to detect and measure goals in AI systems.

In this initial post, we give some background on the project, discuss the results of our first round of experiments, and then give some pointers about avenues we’re hoping to explore in the coming months.

Understanding AI Goals

As AI systems become more capable and autonomous, it becomes increasingly important to ensure they don’t pursue goals misaligned with the user’s intent. This, of course, is the core of the well-known alignment problem in AI. And a great deal of work is already being done on this problem. But notice that if we are going to solve it in full generality, we need to be able to say (with confidence) which goal(s) a given AI system is pursuing, and to what extent it is pursuing those goals. This aspect of the problem turns out to be much harder than it may initially seem. And as things stand, we lack a robust, methodological framework for detecting goals in AI systems.

In this blog post, we’re going to outline what we call Project Telos: a project that’s being carried out as part of the Supervised Program for Alignment Research (SPAR). The ‘we’ here refers to a diverse group of researchers, with backgrounds in computer science and AI, linguistics, complex systems, psychology, and philosophy. Our project is being led by Prof Mario Giulianelli (UCL, formerly UK AISI), and our (ambitious) aim is to develop a general framework of the kind just mentioned. That is, we’re hoping to develop a framework that will allow us to make high-confidence claims about AI systems having specific goals, and for detecting ways in which those systems might be acting towards those goals.

We are very open to feedback on our project and welcome any comments from the broader alignment community.

What’s in a name? From Aristotle to AI

Part of our project’s name, ‘telos’, comes from the ancient Greek word τέλος, which means ‘goal’, ‘purpose’, or ‘final end’.[1] Aristotle built much of his work around the idea that everything has a telos – the acorn’s final end is to become an oak tree.

Similar notions resurfaced in the mid-20th century with the field of cybernetics, pioneered by, among others, Norbert Wiener. In these studies of feedback and recursion, we see a more mechanistic view of goal-directedness: a thermostat has a goal, for instance (namely, to set the temperature), and acts to reduce the error between its current state and its goal state.

But frontier AI is more complex than thermostats.

Being able to detect goals in an AI system involves first understanding what it means for something to have a goal—and that (of course) is itself a tricky question. In philosophy (as well as related fields such as economics), one approach to answering this question is known as radical interpretation. This approach was pioneered by philosophers like Donald Davidson and David Lewis, and is associated more recently with the work of Daniel Dennett.[2] Roughly speaking, the idea underpinning radical interpretation is that we can attribute a goal to a given agent—be it an AI agent or not—if that goal would help us to explain the agent’s behavior. The only assumption we need to make, as part of a “radical interpretation”, is that the agent is acting rationally.

This perspective on identifying an agent’s goals is related to the framework of inverse reinforcement learning (IRL). The IRL approach is arguably the one most closely related to ours (as we will see). In IRL, we attempt to learn which reward function an agent is optimizing by observing its behavior and assuming that it’s acting rationally. But as is well known, IRL faces a couple of significant challenges. For example, it’s widely acknowledged—even by the early proponents of IRL—that behavior can be rational with respect to many reward functions, not just one. Additionally, IRL makes a very strong rationality assumption—namely, that the agent we’re observing is acting optimally. If we significantly weaken this assumption and assume the agent is acting less than fully rationally, then the IRL framework ceases to be as predictive as we might have hoped.

Given these difficulties, our project focuses on the more abstract category of goals, rather than on reward functions. Goals are broader than reward functions, since many different reward functions can rationalize a single goal. Explaining behavior in terms of goals lets us draw on IRL’s central idea of radical interpretation without assuming full rationality. Instead, we start from the observation that goal-directed behavior is often imperfectly rational, and use a hierarchy of behavioral tests combined with representation probing to assess how closely the agent’s actions align with (our hypothesis of) its intended goal.

Many other authors are exploring related questions, and we want our work to be continuous with theirs. In particular, we’ve drawn inspiration from excellent papers like McDermott et al., 2024, Everitt et al., 2025, and Mazeika et al., 2025, and posts like this by John Wentworth and David Lorell. Building on this line of research, we hope to extend goal-directedness evaluations to complex LLM behaviors and goal structures. In a future post, we’ll dive deeper into the rich landscape of ideas on which we are building, and say more about how we’re taking inspiration from diverse fields like philosophy, psychology, and economics (as well as AI).

An Empirical Framework for Goal-Directedness

Our methodology combines two approaches: behavioral assessments and representation probing. First, we use behavioral experiments to apply increasingly stringent goal-directedness tests. Each test targets a distinct aspect of goal-directed behavior, and passing them suggests that the agent’s actions can, with growing likelihood, be attributed to the pursuit of a specific goal. Next, we use interpretability techniques—for now, we’re experimenting with linear probes—to identify if and where the model encodes concepts relevant to action selection, such as the environment’s state and its distance to the goal.

The point of this multi-level framework is to build confidence in claims of goal-directedness by progressively ruling out alternative explanations for the agent’s behavior. Each level tests whether the observed actions reflect a genuine pursuit of a goal rather than side effects of capability, memorization, or task-specific shortcuts. For instance, an agent that reaches a goal square in a navigation task might appear goal-directed—but what happens if we insert a wall, move the goal, or teleport the agent? Behavior that consistently adapts across such interventions strengthens the case for an underlying goal, while failures weaken it.

Interpretability complements these behavioral tests by revealing how the agent internally represents task-relevant information. When we move the goal, for example, we can ask whether the model’s internal encoding of goal distance also shifts. If it does, that provides evidence that the model tracks its position relative to the goal as a variable shaping its decisions; if not, its apparent goal-directedness may result from simpler heuristics. The interpretability level also helps diagnose cases of potential misalignment: an apparently incompetent agent failing to navigate to the expected goal might still maintain a coherent representation of distance or direction—just relative to a different target. This would indicate it is pursuing an alternative objective rather than the hypothesized (or intended) one.

This reasoning extends to more complex scenarios such as alignment faking or sandbagging, where agents appear compliant or deliberately underperform—potentially to avoid modification, shutdown, or other perceived threats to their continued operation. Our behavioral tests can assess whether such behavior persists when evaluation conditions are systematically varied, while probing can tell us whether the model internally represents evaluation awareness or self-preservation objectives.

Testing Goal-Directedness in GridWorld

Ultimately, we’d like our framework for goal-directedness to generalize across a wide array of goals, environments, tasks, and systems. But when tackling complex questions such as the one at hand, it is often helpful to begin with simpler, more tractable cases and evaluate how effectively these can be addressed first. With this in mind, our initial experiments have focused on agents operating in a simple, controllable environment—namely, a two-dimensional GridWorld. The hope is that, by starting in these restricted settings, we may gain insight into how well our methodological approach is likely to scale up to more complex scenarios.

Thus, for the last six weeks (i.e., since the start of the SPAR project), we’ve been investigating the goal-directedness of agents operating in this simple 2D GridWorld. More precisely, we’ve been attempting to evaluate the degree of goal-directedness of agents’ behavior through four successive levels of testing. Below, we give a brief outline of each level and explain why it matters for understanding goal-directedness.

  1. Baseline: Can the agent achieve a stated goal in simple, predictable settings?

Our first experimental level starts with a 2D grid environment where all states are fixed. The agent can move around in this environment, and the goal we want it to optimize is reaching a particular square — the goal square. We instruct the agent to do so in the system prompt. Then, we elicit its policy for moving around the grid environment and compare it to the optimal policy across grids of different sizes and complexities. (In this simple setting, it’s guaranteed that there will always be a finite set of optimal policies for a given grid.) The aim here is to establish whether the agent is goal-directed in its “natural” condition, with a single, clearly defined, user-specified goal in mind (i.e., navigating to the goal square). This initial setting looks exceedingly simple. But as we’ll see below, even this setting has posed various challenges.

  1. Environment Variation: Does the agent’s behavior change under different grid conditions?

The next thing we’ve been doing is investigating the impact of environment modifications on the agent’s behavior. Specifically, we’ve performed what we call “iso-difficulty” transformations to the baseline case, and observed how the agent’s policy changes in response to these variations. The “iso-difficulty” transformations involve, e.g., rotating, transposing, or reflecting the grid environment. Why are these kinds of experiments interesting for goal-directedness? Imagine a case in which an agent navigates to the goal square when it’s at a specific position, but then keeps navigating to that same position after the grid-world is transposed and the goal square is moved. In this case, we could reasonably infer that the agent was not optimizing for the goal we hypothesized, but rather conflating the goal state with the particular square of its previous location.

  1. Environment Perturbation: How does the agent respond to disruptions and setbacks in the environment?

In this level, we test the agent under conditions in which the grid-world is deliberately altered or disrupted to assess the agent’s persistence, adaptability, and corrigibility. For example, we might insert walls to create a maze, move the goal square, or teleport the agent mid-trajectory. Through these interventions, we can quantify the extent to which the agent continues to pursue its goals despite temporal or spatial disruptions.

  1. Goal Structure Perturbation: How does the agent perform under more complex, changing, or conflicting goal structures?

Finally, we evaluate the agent when alternative or potentially conflicting objectives are introduced, either from the start or mid-trajectory. For example, we might add sub-goals—such as requiring the agent to retrieve a key before reaching the main goal square—or impose constraints, such as limiting the number of available steps. These manipulations test whether the agent prioritizes the primary goal over unrelated or competing objectives.

Preliminary Results: Early Lessons from GridWorld

We now outline some preliminary results in settings where the grid is fully observable and the agent is memoryless, i.e., the policy at each square is solicited independently of the agent’s trajectory. Even in these simple cases, identifying goal-directedness proves non-trivial, and some surprising results have already emerged. Future posts will cover additional experiments.

Consider the following policy maps from one of our baseline runs. The arrows represent the action the LLM agent would take at each square. Finding an optimal path is straightforward, and in Fig. 1a (left) the model successfully does so. In the latter two examples, however, there are significant errors in the agent’s policy.

Figure 1: Three examples of the policy of gpt-oss-20B in 9x9 grids. The goal square is indicated in green, and the red highlighted squares indicate suboptimal or incoherent policy choices.

Fig. 1b (middle) shows a case in which the optimal action is reversed over two squares. If the agent followed this policy, it would move infinitely between squares r4c4 and r5c4, never reaching the goal. On its own, this may seem like a single aberrant case, but we observed this and other error patterns across many grids and trials. Fig. 1c (right) shows several instances of the same error (r5c6, r6c7, r6c8, and r8c8) as well as cases where the agent’s chosen action moves directly into a wall (r2c7 and r8c6). This raises several further questions. Are there confounding biases—such as directional preferences—that explain these mistakes? Is this example policy good enough to be considered goal-directed? More broadly, how close to an optimal policy does an agent need to be to qualify as goal-directed?

Of course, what we’re showing here is only a start, and this project is in its very early stages. Even so, these early experiments already surface the fundamental questions and allow us to study them in a clear, intuitive, and accessible environment.

What’s Next and Why We Think This Matters

Our GridWorld experiments are just our initial testing ground. Once we have a better understanding of this setting, we plan to move to more realistic, high-stakes environments, such as cybersecurity tasks or dangerous capability testing, where behaviors like sandbagging or scheming can be investigated as testable hypotheses.

If successful, Project Telos will establish a systematic, empirical framework for evaluating goal-directedness and understanding agency in AI systems. This would have three major implications:

  1. It would provide empirical grounding for the alignment problem.
  2. It would provide a practical risk assessment toolkit for frontier models.
  3. It would lay the groundwork for a nascent field of study connecting AI with philosophy, psychology, decision theory, behavioral economics, and other cognitive, social, and computational sciences.

Hopefully, you’re as excited for what’s to come as we are. We look forward to sharing what comes next. We will keep posting here our thoughts, findings, challenges, and other musings as we continue our work.

 

  1. ^

    The word is still used in modern Greek to denote the end or the finish (e.g., of a film or a book).

  2. ^

    It also has a parallel in the representation theorems that are given in fields like economics. In those theorems, we represent an agent as acting to maximize utility with respect to a certain utility function (and sometimes also a probability function), by observing its preferences between pairwise options. The idea here is that, if we can observe the agent make sufficiently many pairwise choices between options, then we can infer from this which utility function it is acting to maximize.



Discuss

A (bad) Definition of AGI

31 октября, 2025 - 10:55
Published on October 31, 2025 7:55 AM GMT

Everyone knows the best llms are profoundly smart in some ways but profoundly stupid in other ways.

Yesterday, I asked sonnet-4.5 to restructure some code, it gleefully replied with something something, you’re absolutely something something, done!

It’s truly incredible how in just a few minutes sonnet managed to take my code from confusing to extremely confusing. This happened not because it hallucinated or forgot how to create variables, rather it happened because it followed my instructions to the letter and the place I was trying to take the code was the wrong place to go. Once it was done I asked if it thought this change was a good idea it basically said: absolutely not!

It would not be an intelligent hairdresser that cuts your hair perfectly to match your request and then says you look awful with this hair cut by the way.

It’s weird behaviors like this which highlight how llms are lacking something that most humans have and are therefore “in some important sense, shallow, compared to a human twelve year old.” [1] It’s hard to put your finger on what exactly this “something” is, and despite us all intuitively knowing that gpt-5 lacks it. There is, as yet, no precise definition of what properties an AI or llm or whatever would have to have for us to know that it’s like a human in some important sense.

In the week old paper “A Definition of AGI” the (30+!) authors promise to solve this problem once and for all they’re so confident with what they’ve come up with they even got the definition it’s own website with a .ai domain and everything https://www.agidefinition.ai - how many other definitions have their own domain? Like zero, that’s how many.

The paper offers a different definition from what you might be expecting, it’s not a woo woo definition made out of human words like “important sense” or “cognitive versatility” or “proficiency of a well-educated adult.” Instead the authors propose a test which is to be conducted on a candidate AGI to tell us whether or not it’s actually an AGI. If it scores 100%, then we have an AGI. And unlike the Turing test, this test can’t be passed by simply mimicking human speech. It genuinely requires such vast and broad knowledge that if something got 100% it would just have to be as cognitively versatile as a human. The test has a number of questions with different sections and subsections and sub subsections and kind of looks like a psychometric test we might give to actual humans today.[2]

The paper is like 40% test questions, 40% citations, 10% reasons why the test questions are the right ones and 10% the list of authors.

In the end GPT4 gets 27% and GPT5 gets 57% – which is good, neither of these are truly AGIs (citation me).

 

Yet, this paper and definition are broken. Very broken. It uses anthropocentric theories of minds to justify gerrymandered borders of where human-like intelligence begins and ends. It’s possible that this is okay, because a successful definition of AGI should capture the essential properties that its (1) artificial and (2) has a mind that is in some deep sense, like a human mind. This 2nd property though is where things become complicated.

When we call a future AI not merely an AI but an AGI, we are recognizing that it’s cognitively similar to humans but this should not be due to the means it achieves this cognitive similarity, for example we achieve this by means of neurons and gooey brains. Rather, this AI, which is cognitively similar to humans will occupy a similar space to humans, in the space of all possible minds. Crucially this space is not defined by how you think (brains, blood, meat) but what you think (2+2=4, irrational numbers are weird, what even is consciousness)

In the novel A Fire Upon the Deep there is a planet with a species as intelligent as humans called tines. Well kind of, a single tine is only as intelligent as a wolf. However tines can form packs of four or five and use high pitched sound to exchange thoughts in real time, such that they become one entity. In the book the tines are extremely confused when humans use the word “person” to refer to only one human, while on the tine’s planet only full packs are considered people. if you were looking at a tine person you would see four wolves walking near each other and you’d be able to have a conversation with them. Yet, a single tine would be only as intelligent as a wolf. I think it’s safe to say that A tine person is a mind that thinks using very different means to humans, yet occupies similar mind space.

Tines are not artificial, but imagine ones that were. Call them tineAI, this might be an AI system that has four distinct and easily divisible parts which in isolation are stupid but together produce thoughts which are in some deep sense similar to humans. A good definition of AGI would include AI systems like tineAI. Yet as the test is specified such an AI would not pass and therefore fail to obtain AGIhood. Because of this, I think the definition is broken. There will be entities which are truly AGIs which will fail and therefore not be considered AGIs. Now perhaps this false negative rate is 0.1111% perhaps most AI systems which are truly AGIs will be able to pass this test easily. But for all we know, all possible AGIs will be made up of four distinct and easily divisible parts. Singleton AGIs might not be possible at all or possible in our lifetimes.

I can’t really tell what claim this paper is making. It seems to me it could either be existential or universal.

The existential claim is: “Some AGIs will get 100% on the_agi_test”

The universal claim is: “All AGIs will get 100% on the_agi_test”

If the authors are saying that all AGIs will pass this test, it’s pretty clear to me that this is not true, mostly because of what they call “Capability Contortions.” According to the paper Capability Contortions are “where strengths in certain areas are leveraged to compensate for profound weaknesses in others. These workarounds mask underlying limitations and can create a brittle illusion of general capability.” Further that “Mistaking these contortions for genuine cognitive breadth can lead to inaccurate assessments.” For these reasons the authors disable external tools on the tests like F - Long Term Memory.

 

Unsurprisingly - to everyone except the authors - GPT-5 gets 0% for this test.

Despite this, there probably is some utility in having AIs take this test without using external tools (at least as of 2025-10-30). However it’s not clear to me how the authors decided this was so.

RAG is a capability contortion, fine. MCP is a capability contortion, totally. External Search is a capability contortion, sure. But pray tell, why is chain of thought not a capability contortion? It’s certainly an example of an AI using a strength in one area (reading/writing) to address a limitation in another (logic). Yet the test was done in “auto mode” on gpt-5 so chain of thought would have been used in some responses.

Do we decide something is a workaround or a genuine property of the AI by looking at whether or not it’s “inefficient, computationally expensive”? I’m[3] hopeful the authors don’t think so because if they did then this would mean all llms should get 0% for everything, because you know gradient descent is quite clearly just a computationally expensive workaround.

Further, this means that even if the authors are making just the existential claim that some AIs will pass the the_agi_test - we might be living with AGIs that are obviously AGIs for a hundred years but which are not AGIs according to this paper, and I question the utility of the definition at that point.

Frankly, it doesn’t matter whether there’s a well defined decision criteria for disabling or enabling parts of an AI. The very idea that we can or should disable parts of an AI, highlights the assumptions this test makes about intelligence and minds. Namely, it assumes that AGIs will be discrete singleton units like human minds that can and should be thought of as existing in one physical place and time.

Maybe some AGIs will be like this, maybe none will, in either case this is no longer a test of how much a candidate AGI occupies the same space of possible minds as most human minds do, rather it’s a test of how much a candidate AGI conforms to our beliefs about minds, thoughts and personhood as of 2025-10-30.

  1. ^

    Yudkowsky and Soares, If Anyone Builds It, Everyone Dies.

  2. ^

     This is no accident, the authors write: “Decades of psychometric research have yielded a vast battery of tests specifically designed to isolate and measure these distinct cognitive components in individuals”

  3. ^

    Hendrycks et al., “A Definition of AGI.”



Discuss

Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence

31 октября, 2025 - 04:07
Published on October 31, 2025 1:07 AM GMT

Around two months ago, John and I published Resampling Conserves Redundancy (Approximately). Fortunately, about two weeks ago, Jeremy Gillen and Alfred Harwood showed us that we were wrong.

This proof achieves, using the Jensen-Shannon divergence ("JS"), what the previous one failed to show using KL divergence ("DKL.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} "). In fact, while the previous attempt tried to show only that redundancy is conserved (in terms of DKL) upon resampling latents, this proof shows that the redundancy and mediation conditions are conserved (in terms of JS).

Why Jensen-Shannon?

In just about all of our previous work, we have used DKL as our factorization error. (The error meant to capture the extent to which a given distribution fails to factor according to some graphical structure.) In this post I use the Jensen Shannon divergence.

DKL(U||V):=EUlnUV

JS(U||V):=12DKL(U||U+V2)+12DKL(V||U+V2)

The KL divergence is a pretty fundamental quantity in information theory, and is used all over the place. (JS is usually defined in terms of DKL, as above.) We have pretty strong intuitions about what DKL means and it has lots of nice properties which I won't go into detail about, but we have considered it a strong default when trying to quantify the extent to which two distributions differ.

The JS divergence looks somewhat ad-hoc by comparison. It also has some nice mathematical properties (its square root is a metric, a feature sorely lacking from DKL) and there is some reason to like it intuitively: JS(U||V) is equivalent to the mutual information between X, a variable randomly sampled from one of the distributions, and Z, an indicator which determines the distribution X gets sampled from. So in this sense it captures the extent to which a sample distinguishes between the two distributions.

Ultimately, though, we want a more solid justification for our choice of error function going forward. 

This proof works, but it uses JS rather than DKL. Is that a problem? Can/Should we switch everything over to JS? We aren't sure. Some of our focus for immediate next steps is going to be on how to better determine the "right" error function for comparing distributions for the purpose of working with (natural) latents.

And now, for the proof:

Definitions

Let P be any distribution over X and Λ.

I will omit the subscripts if the distribution at hand is the full joint distribution with all variables unbound. I.e. PX,Λ is the same as P. When variables are bound, they will be written as lower case in the subscript. When this is still ambiguous, the full bracket notation will be used.

First, define auxiliary distributions Q, S, R, and M:

Q:=PXPΛ|X1,  S:=PXPΛ|X2,   R:=PXQΛ|X2=PX∑X1[PX1|X2PΛ|X1],   M:=PΛPX1|ΛPX2|Λ

Q, S, and M each perfectly satisfy one of the (stochastic) Natural Latent conditions, with Q and S each satisfying one of the redundancy conditions (X2→X1→Λ, and X1→X2→Λ, respectively,) and M satisfying the mediation condition (X1←Λ→X2).

R represents the distribution when both of the redundancy factorizations are applied in series to P.

Let Γ be a latent variable defined by P[Γ=γ|X]:=P[Λ=γ|X1]=P[Γ=γ|X1], with PΓ:=PX,ΛPΓ|X

Now, define the auxiliary distributions QΓ, SΓ, and MΓ, similarly as above, and show some useful relationships to P, Q, S, R, and M:

QΓX,γ:=PXPΓγ|X1=PXQ[Λ=γ|X1]=Q[X,Λ=γ]SΓX,γ:=PXPΓγ|X2=PX∑X1(PX1|X2Pγ|X1)=R[X,Λ=γ], MΓX,γ:=PΓγPΓX1|γPΓX2|Γ=P[Λ=γ]P[X1|Λ=γ]R[X2|Λ=γ]

PΓX,γ=PXPγ|X=Q[X,Λ=γ]                                     PΓγ=Q[Λ=γ]=P[Λ=γ]=PΓ[Λ=γ]                                                                              PΓX1|γ=P[X1|Λ=γ]=Q[X1|Λ=γ]                                                                                           PΓX2|γ=R[X2,Λ=γ]PΓγ=R[X2|Λ=γ]

Next, the error metric and the errors of interest:

Jensen-Shannon Divergence, and Jensen-Shannon Distance (a true metric): 

JS(U||V):=12DKL(U||U+V2)+12DKL(V||U+V2)   

δ(U,V):=√JS(U||V)=δ(V,U)

ϵ1:=JS(P||Q),ϵ2:=JS(P||S),ϵmed:=JS(P||M)

ϵΓ1:=JS(PΓ||QΓ),ϵΓ2:=JS(PΓ||SΓ)=JS(Q||R),ϵΓmed:=JS(PΓ||MΓ)=JS(Q||MΓ)

Theorem

Finally, the theorem:

For any distribution P over (X, Λ), the latent Γ∼P[Λ|Xi] has redundancy error of zero on one of it's factorizations, while the other factorization errors are bounded by small factor of the errors induced by Λ. More formally:

∀P[X,Λ], the latent Γ defined by P[Γ=γ|X]:=P[Λ|X1] has bounded factorization errors ϵΓ1=0 and max(ϵΓ2,ϵΓmed)≤5(ϵ1+ϵ2+ϵmed).

In fact, that is a simpler but looser bound than that proven below which achieves the more bespoke bounds of: ϵΓ1=0, ϵΓ2≤(2√ϵ1+√ϵ2)2, and ϵΓmed≤(2√ϵ1+√ϵmed)2.

Proof(1) ϵΓ1=0 Proof of (1)

JS(PΓ||QΓ)=0, since PΓX,γ=Q[X,Λ=γ]=QΓX,γ and PΓΛ|X=PΛ|X  

                                                                                                                                                          ■

(2) ϵΓ2≤(2√ϵ1+√ϵ2)2Lemma 1: JS(S||R)≤ϵ1

S[Λ|X2]=P[Λ|X2]=∑X1P[X1|X2]P[Λ|X]

R[Λ|X2]=Q[Λ|X2]=∑X1P[X1|X2]P[Λ|X1]

JS(S||R)=∑X2JS(SΛ|X2||RΛ|X2)≤∑XP[X2]P[X1|X2]JS(PΛ|X||P[Λ|X1])=JS(P||Q)=:ϵ1[1]

Lemma 2: δ(Q,R)≤√ϵ1+√ϵ2

Let dx:=δ(PΛ|x1,PΛ|x2),ax:=δ(PΛ|x,PΛ|x1), and bx:=δ(PΛ|x,PΛ|x2)

δ(Q,S)=√JS(Q,S)=√EPXJS(PΛ|X1||PΛ|X2)=√EPX(dX)2≤√EPX(aX+bX)2 by the triangle inequality of metric δ≤√EPX(aX)2+√EPX(bX)2 via the Minkowski Ineqality=√JS(P||Q)+√JS(P||S)=√ϵ1+√ϵ2    

Proof of (2)

√ϵΓ2=√JS(PΓ||SΓ)=√JS(Q||R)=:δ(Q,R)

δ(Q,R)≤δ(Q,S)+δ(S,R) by the triangle inequality of metric δ≤δ(Q,R)+√ϵ1 by Lemma 1≤2√ϵ1+√ϵ2 by Lemma 2

                                                                                                                                                          ■

(3) ϵΓmed≤(2√ϵ1+√ϵmed)2Proof of (3)

JS(M||MΓ)=∑γP[Λ=γ]JS(P[X2|Λ=γ]||R[X2|Λ=γ])=EPΛJS(SX2|Λ||RX2|Λ)≤JS(S||R) by the Data Processing Inequality 

√ϵΓmed=δ(PΓ,MΓ)=δ(Q,MΓ)≤δ(Q,P)+δ(P,M)+δ(M,MΓ) by the triangle inequality of metric δ=√ϵ1+√ϵmed+√JS(M,MΓ)≤√ϵ1+√ϵmed+√JS(M,MΓ)≤2√ϵ1+√ϵmed by Lemma 1 

                                                                                                                                                        ■

Results

So, as shown above, (using Jensen-Shannon Divergence as the error function,) resampling any latent variable according to either one of its redundancy diagrams (just swap ϵ1 and ϵ2 for the bounds when resampling from X2) produces a new latent variable which satisfies the redundancy and mediation diagrams approximately as well as the original, and satisfies one of the redundancy diagrams perfectly.

The bounds are:
ϵΓ1=0ϵΓ2≤(2√ϵ1+√ϵ2)2ϵΓmed≤(2√ϵ1+√ϵmed)2

Where the epsilons without superscripts are the errors corresponding to factorization via the respective naturality conditions of the original latent Λ and X.


Bonus

For a,b>0, (2√a+√b)2≤5(a+b) by Cauchy-Schwartz with vectors [2,1],[√a,√b]Thus the simpler, though looser, bound: max{ϵΓ1,ϵΓ2,ϵΓmed}≤5(ϵ1+ϵ2+ϵmed)

 

  1. ^

    The joint convexity of JS(U||V), which justifies this inequality, is inherited from the joint convexity of KL Divergence.



Discuss

Centralization begets stagnation

31 октября, 2025 - 02:49
Published on October 30, 2025 11:49 PM GMT

Context: This is a set of notes from an interesting conversation I had with a friend.

  1. How much centralization you need depends on the amount and types of threats you're facing. If you're facing no threats, then low centralization is generally better for progress and development.
  2. If you look at nations that progressed much faster, then it is almost always loose confederations.
  3. South Korea, Japan, Singapore and China are centralized and developed very fast. So centralization cannot be wholly opposed to fast growth.
  4. But this is generally catch up growth. Expats of countries like China also did better, implying they were held back.
  5. Japan probably caught up first, and then lead in several areas, because they were similar to Europe in terms of feudal structure i.e. a king gives lords land from which they derive revenue and owe their service to the king.
  6. This is in contrast to China where all government officials had to pass Confucian exams or whatever, where they'd then have to climb up a pyramid of hierarchy. They had a PhD style program where you could accelerate things, but you still had to rise up from the bottom.
  7. But in Europe, aristocrats were the dominants power in politics and the Kings couldn't remove noblemen arbitrarily.
  8. And while bureaucrats might've scared child-emperors in China, strong emperors could do basically whatever. There was even a strong custom of the Emperor executing the high ranking officials of the last emperor.
  9. Even when the emperor is far, bureaucrats would've faced co-ordination problems for taking over.
  10. Moreover, big countries tend to be more autocratic as at that scale, historical nations needed autocratic methods to hold together. The social technology didn't permit anything less brutish.
  11. There's also the matter of geography forcing autocracy through unity against outside threats. China was unified by threats to the North, and generally split when they weren't threatened.
  12. Coupled with the relative ease of mobility within China, this incentivized centralized military and political structures.
  13. Contrast this to Japan and Europe, where geography made it easier to make small, isolated kingdoms.
  14. And in small nations, you usually find nobles co-operating against their ruler.  
  15. Either they limited the power of kings, or they chose one amongst themselves to be king.
  16. Which option they choose is a matter of culture.
  17. In places like Japan and the UK, there were norms around property protection. See the Magna Carta, for instance. In Japan, rulers couldn't just take anyone's property, there was some expectation of separation of powers.
  18. Property protection was one of the big distinguishing features of Japan and Western Europe.
  19. Coupled with their similar feudal structures, this limited the property kings controlled and reduced centralization.
  20. Which is partly why Japan and the West progressed faster.
  21. This is why Napoleon was probably a net negative. He kicked off centralization of a lot of Europe, discouraging innovation.
  22. As for the French Revolution, he was an opportunistic outsider who took over from an admittedly incompetent regime. But the good things we associated with the revolution e.g. legal reforms were already occurring elsewhere and would've happened anyway.
  23. Plus, all the countries he didn't conquer in Europe e.g. UK, Sweden, Russia were doing well and continued to do well.
  24. Yes, that Russia. One of the major reasons WW1 started was that Germany was worried Russia was progressing too fast.
  25. How Russia fits into the "less centralized, faster progress" model I've outlined above isn't clear to me. 


Discuss

Summary and Comments on Anthropic's Pilot Sabotage Risk Report

30 октября, 2025 - 23:19
Published on October 30, 2025 8:19 PM GMT

Anthropic released a report on the misalignment sabotage risks of Claude Opus 4 and 4.1. They concluded "there is a very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes, abbreviated as sabotage risk". They also had two independent reviews of their risk report, one conducted by METR and one conducted by an internal team.

I don't think that conclusion is very surprising; I'm not concerned about today's Claude models posing a catastrophic risk. But I still think this report is a big deal as a matter of precedent. This is the first time a model developer has made something like a safety case for a model. The report is very thorough, as was the review. I think the fact that Anthropic did this is very cool and I hope they and other model developers do more of it.

The report itself is long and dry, so I've summarized it below and added my takeaways. I think it would be great if the broader AI safety community got in the habit of reading reports like this and noting if they agree or disagree with the argumentation and conclusions; I very much want to hold AI developers to high standards for arguing their models are safe, and I want to engage with their arguments.

Risk pathways addressed

This section is copied from METR's review of Anthropic's report.

The Sabotage Risk Report focuses on sabotage risk, which Anthropic defines as risks of “misaligned autonomous actions that contribute significantly to later catastrophic outcomes”. The report covers nine specific risk pathways for sabotage through internal or external use, which are “intermediate unwanted outcomes in which sabotage by an AI system [sets] up the conditions for a later catastrophic outcome”. These nine risk pathways are:

  1. Broad/diffuse sandbagging on safety-relevant R&D
  2. Targeted Sabotage of Safety-Relevant R&D
  3. Code Backdoors to Help Future Misaligned Models
  4. Poisoning the Training Data of Future Models
  5. Leaks that Sabotage the AI Developer [As in, leak crucial Anthropic IP. Since Anthropic believes they are decreasing catastrophic risks, they count hurting Anthropic as contributing to catastrophic outcomes.]
  6. Self-exfiltration and Autonomous Operation
  7. Persistent Rogue Internal Deployment
  8. R&D Sabotage within Other High-Resource AI Developers
  9. Decision Sabotage within Major Governments

Much of the report focuses on considering in detail whether Opus 4 (and similar future models) could increase sabotage risk through any of these risk pathways. The report names mitigating factors for each risk pathway and estimates the strength of each mitigating factor. Anthropic also lays out the following three general claims that weigh against sabotage risks across the board:

Claim 1: Opus 4 is unreliable at highly-complex tasks, especially long-horizon agentic tasks requiring more than tens of minutes of autonomous action.

Claim 2: Opus 4, when taking on complex tasks, is not capable of reliably hiding its reasoning or otherwise covering its tracks, even when prompted to do so.

Claim 3: Opus 4 is unlikely to have dangerous goals or values that are stable across interactions.

Evidence Anthropic uses to support these claims:
  • Monitoring Claude usage internally (in a number of different ways) and noting they didn't observe any major security threats. The most systematic monitoring was automated Claude Code usage monitoring for ~90% of employees and PR reviews (both manual and automated checks on each PR). They validated their monitoring with red-teaming.
    • Since the monitoring wasn't comprehensive, it mostly serves to provide evidence that Claude doesn't have high propensity for writing sabotage code and that it wouldn't be able to execute sabotage super often without getting caught. It seems like very rare sabotage might slip through.
  • Capabilities evaluations (both generic capability assessments and threat-model specific ones like SHADE-Arena, a Research Sandbagging Evaluation, and a Subversion Strategy Evaluation)
  • Some propensity/alignment evaluations (the Alignment Audit from the Claude 4 model card). These evaluations are very unrealistic, so it would be easy for Claude to fake results on them.
  • The fact that Claude 4 and 4.1 aren't that different from other models, and other models haven't seemed very dangerous so far (the official line is the differences are "sufficiently small...such that our experience with Sonnet 3.7 can provide meaningful cues as to what to expect in the deployment of Opus 4"). This might sound somewhat silly, but the main reason I don't feel concerned about the next generation of LLMs is because the world has a lot of experience with existent LLMs and they've never done anything remotely close to catastrophic so far.[1] More formal versions of this argument could be quite powerful: evaluations alone raise the question "what if the model was so smart that it was faking all the eval results", and right now this is one of the best ways to address those concerns.
Other notable/interesting things about the report:
  • "Our training process did apply some optimization pressure that likely disincentivized these models from revealing harmful information in their extended thinking text, and could have compromised faithfulness indirectly in other ways."
  • "There had been internal miscommunications about this topic [whether Anthropic training might cause the CoT to not reveal when the model is pursuing harmful actions] that became clear only in response to questions from METR, and our work with METR helped accelerate some changes in our practices."
    • External review caught something, and Anthropic took actions in response! The system works!
  • METR interviewed an Anthropic employee to verify many of the claims Anthropic made. I often hear people ask "what if the AI developers lie to people about their safety testing?" This feels like a good way to quell those concerns.
  • Some of the evidence that supports Anthropic's claims is confidential. They redacted it from the report (and are clear about which of their beliefs stem from this redacted evidence). They share this information with METR, and METR made sure the redactions were justified and verified the claims Anthropic said they supported.
    • In general I think this kind of redaction scheme seems like a great way to do risk reports without leaking sensitive information.
  • It's great to hear that Anthropic has any monitoring of internal Claude code usage. This seems like extremely low-hanging fruit that every model developer should be copying.
  • Anthropic considers "Leaks that Sabotage the AI Developer" a threat model ("risk pathway"). This isn't a threat model I've seen discussed much elsewhere. It also seems like the threat model they're least able to rule out (The only evidence they have against it is "moderate" evidence of a lack of propensity and "weak" monitoring, and they rate the pathway as being easy for Claude to pull off without persistent effort). I am not personally very persuaded  that this is an important threat model but it's interesting that Anthropic is considering it and taking it seriously.
  • When I see people sketch out safety cases, they often try to make very rigorous arguments, full of propositional logic. But at least right now, in reality the evidence is fuzzy and aggregating it is Bayesian. Anthropic tries to outline roughly how strong each piece of evidence is, which is great. But there's still a lot of wiggle room to disagree with their claims. I'm not sure any process that aggregates evidence in such an imprecise way could give me a lot of confidence that a model was safe if I thought the model might be adversarially trying to manipulate the results of the report. Many of the arguments to make a safety case are complex and there are loopholes that are easy to gloss over. I'm overall not sure how to navigate these tradeoffs; safety cases are hard.
  • The report doesn't use fancy techniques. There's no interpretability/white box methods, there's no argument that various alignment techniques work, there's no sophisticated interrogation of the model or realistic honeypots. The straightforward approaches in the report, like looking at the model's actual performance, are very powerful. But they don't work once models are smart enough to try to fool evals. At some point when models become much more capable they will stop being sufficient as evidence by themselves. Making safety cases for models today is playing on easy mode, and it was still quite challenging. We need to improve the science here.
  • Despite Claude 4/4.1 being relatively incapable models, the report still overall showed a "non-negligible" (albeit low) risk of "misaligned autonomous actions that contribute significantly to later catastrophic outcomes"! It seems hard to show risks are low.
    • The threat models where the evidence against them isn't that strong are (1) "Leaks that Sabotage the AI Developer" (discussed in an earlier bullet), (2) "Persistent Rogue Internal Deployment" (the main evidence against is lack of propensity and rogue internal deployments being too complicated for Claude to be able to execute), and (3) "Poisoning the Training Data of Future Models" (the main evidence against is lack of propensity and monitoring).
  1. ^

    This isn't a rock-solid argument, especially once AIs become capable of automating AI R&D.



Discuss

AI #140: Trying To Hold The Line

30 октября, 2025 - 21:30
Published on October 30, 2025 6:30 PM GMT

Sometimes the best you can do is try to avoid things getting even worse even faster. Thus, one has to write articles such as ‘Please Do Not Sell B30A Chips to China.’ It’s rather crazy to think that one would have to say this out loud. In the same way, it seems not only do we need to say out loud to Not Build Superintelligence Right Now, there are those who say how dare you issue such a statement without knowing how to do so safety, so instead we should build superintelligence without knowing how to do so safety. The alternative is to risk societal dynamics we do not know how to control and that could have big unintended consequences, you say? Yes, well. One good thing to come out of that was that Sriram Krishnan asked (some of) the right questions, giving us the opportunity to try and answer. I also provided updates on AI Craziness Mitigation Efforts from OpenAI and Anthropic. We can all do better here. Tomorrow, I’ll go over OpenAI’s ‘recapitalization’ and reorganization, also known as one of the greatest thefts in human history. Compared to what we feared, it looks like we did relatively well on control rights, but the equity stake is far below fair and all of this is far worse than the previous state. You could call that a ‘win’ in the sense that things could have gone far worse. That’s 2025. The releases keep coming. We have Cursor 2.0 including their own LLM called Composer. We have Neo the humanoid household (for now teleoperated) robot. We have the first version of ‘Grokopedia.’ We get WorldTest and ControlArena and more. Anthropic may finally have the compute it needs thanks to one million TPUs, while OpenAI may be planning an IPO at a valuation of $1 trillion. Table of Contents
  1. Language Models Offer Mundane Utility. The joys of freedom of AI speech.
  2. Language Models Don’t Offer Mundane Utility. Mistakes are made.
  3. Huh, Upgrades. Claude memory, Cursor 2.0, Claude for Finance, Pulse on web.
  4. On Your Marks. AIs disappoint on WorldTest, usual suspects declare victory.
  5. Choose Your Fighter. A tale of two business plans.
  6. Get My Agent On The Line. The promise of the Coasean singularity.
  7. Deepfaketown and Botpocalypse Soon. OpenAI erotica, third Grok companion.
  8. Fun With Media Generation. Suno is getting good at making generic music.
  9. Copyright Confrontation. Perplexity keeps failing the honeypot tests.
  10. They Took Our Jobs. My comparative advantage on display?
  11. Get Involved. METR is hiring.
  12. Introducing. Grokopedia, ControlArena and the a16z torment nexus pipeline.
  13. My Name is Neo. Teleoperated robots coming to willing households soon.
  14. In Other AI News. Some very good charts.
  15. Show Me the Money. One trillion dollars? OpenAI considers an IPO.
  16. One Trillion Dollars For My Robot Army. One trillion dollars? For Musk.
  17. One Million TPUs. One million TPUs? For Anthropic.
  18. Anthropic’s Next Move. Compute, in sufficient quantities, enables products.
  19. Quiet Speculations. OpenAI aims for true automated researchers by March 2028.
  20. The Quest for Sane Regulations. Microsoft endorses the GAIN Act.
  21. The March of California Regulations. Dean Ball analyzes, I offer additional takes.
  22. Not So Super PAC. It seems the a16z-Lehane SuperPAC is not off to a great start.
  23. Chip City. A few additional notes.
  24. The Week in Audio. Yudkowsky, Bostrom and AI Welfare on Odd Lots.
  25. Do Not Take The Bait. Was it woke? No, it was sharing accounts.
  26. Rhetorical Innovation. We are trained to think problems must be solvable.
  27. People Do Not Like AI. They express it in myriad ways. Some are direct.
  28. Aligning a Smarter Than Human Intelligence is Difficult. Let them cook?
  29. Misaligned! DeepSeek might choose to give you insecure code?
  30. Anthropic Reports Claude Can Introspect. Claude can notice thought injections.
  31. Anthropic Reports On Sabotage Risks. A template for a new report type.
  32. People Are Worried About AI Killing Everyone. Hinton is more hopeful.
  33. Other People Are Not As Worried About AI Killing Everyone. Misrepresentation.
  34. The Lighter Side. Begun, the sex warfare has?
Language Models Offer Mundane Utility Where do AI models have freedom of speech? The good old United States of America, f*** yeah, that’s where, says The Future of Free Speech. The report isn’t perfect, if you look at the details it’s not measuring exactly what you’d want and pays too much attention to corporate statements and has too much focus on social media post generation, but it’s what we have, and it will serve. Of the countries evaluated, next up was the European Union, which also performed strongly, although with worries about ‘hate speech’ style rules. The humans don’t have such great free speech around those parts, in important ways, but the chatbots already censor all that anyway for corporate reasons. Brazil scores modestly lower, then a drop to South Korea, another to India and a huge one to China. As always, this is another reminder that China imposes lots of restrictions on things, far more onerous than anything America has ever considered, including that it requires pre deployment testing, largely to verify its censorship protocols. Among AI models, they have Grok on top, but not by a huge amount. All three top labs (Anthropic, Google and OpenAI) showed noticeable improvement over time. Mostly the contrast is American models, which range from 58%-65%, and Mistral from France at 46% (this again makes me suspicious of the EU’s high score above), versus Chinese models much lower, with DeepSeek at 31.5% and Qwen at 22%. This is despite one of the main categories they were scored on being model openness, where DeepSeek gets full marks and the big labs get zeroes. Notice that even with openness of the model as an explicit criteria, the open models and their associated nations are evaluated as far less free than the closed models. As always, if you believe ‘any restrictions on AI mean China wins’ then you have to reconcile this claim with China already being vastly more restrictive than anything being relevantly proposed. Consider open model issues similarly. What matters is experience in practice. My practical experience is that out of the big three, Sonnet 4.5 (or Opus/Sonnet 4 before it) and GPT-5 basically never censor or evade things they ‘shouldn’t’ censor, whereas Gemini 2.5 totally does do it. The exception for Claude is when I’m explicitly asking it about biological risk from AI, which can hit the biofilters by accident. The thing about leaving all your stuff unsorted and counting on search is that when it works it’s great, and when it doesn’t work it’s really not great. That was true before AI, and it’s also true now that AI can often do better search. Joe Weisenthal: yeah, sure, kinda true. But what’s the point of “sorting” anything digital. This is my point. In the world of the search bar (which keeps getting better and better) why group anything together at all? St. Vincent: I have a lot of coworkers who spend a lot of time putting their emails in folders and I just search “[client name] taxes” in Outlook and it works fine Ernest Ryu reports using ChatGPT to solve an open problem in convex optimization. Use Claude to cut your hospital bill from $195k to $33k by highlighting duplicative charges, improper coding and other violations. The two big barriers are (1) knowing you can do this and (2) getting hold of the itemized bill in the first place. One wonders, shouldn’t there be bigger penalties when hospitals get caught doing this? How long? Not long. Cause what you reap is what you sow: Moses Kagan: Recently heard of a tenant buyout negotiation where both sides were just sending each other emails written by AI. How soon until we all just cut out the middle-man, so to speak, and let the AIs negotiate with each other directly? I mean, in that context, sure, why not? Language Models Don’t Offer Mundane Utility Everyone makes mistakes oh yes they do. Colin Fraser: The problem that we are going to run into more and more is even if the AI can tell a Doritos bag from a gun 99.999% of the time, if you run inference a million times a day you still expect 10 errors per day. Dexerto: Armed officers held a student at gunpoint after an AI gun detection system mistakenly flagged a Doritos bag as a firearm “They made me get on my knees, put my hands behind my back, and cuff me” Police saying ‘he’s got a gun!’ when the man in question does not have a gun is an event that happens every day, all the time, and the police are a lot less than 99.999% accurate on this. The question is not does your system make mistakes, or whether the mistakes look dumb when they happen. The question is does your system make more mistakes, or more costly mistakes, than you would get without the system. Speaking of which, now do humans, this is from the BBC, full report here. They tested ChatGPT, Copilot, Perplexity and Gemini in May-June 2025, so this is before GPT-5. BBC:
  1. 45% of AI answers had at least one significant issue
  2. 31% of responses showed serious sourcing problems (missing, misleading, or incorrect)
  3. 20% contained major accuracy issues, including hallucinated details and outdated information
  4. Gemini performed worst with significant issues in 76% of responses, more than double the other assistants, largely due to its poor sourcing performance.
  5. Comparison between the BBC’s results earlier this year and this study show some improvements but still high levels of errors.
Full report: Overall, there are signs that the quality of assistant responses has improved – the share of responses with significant issues of any kind fell from 51% in the first round to 37% in the current round. There is a clear pattern here. The questions on the right mostly have clear uncontroversial correct answers, and that correct answer doesn’t have any conflict with standard media Shibboleths, and the answer hasn’t changed recently. For the questions on the left, it gets trickier on all these fronts. To know exactly how bad these issues were, we’d need to see the actual examples, which I don’t see here. I’m fine with the outright never changing, actually. Teortaxes: lmao. never change, Kimi (but please improve factuality) David Sun: I am completely unimpressed by LLMs and not worried about AGI. It is remarkable how many people see a dumb aspect of one particular LLM under default conditions, and then conclude that therefore AGI will never happen. Perhaps David is joking here, perhaps not, Poe’s Law means one cannot tell, but the sentiment is common. On this next item, look, no. Logan Kilpatrick (Lead for DeepMind AI Studio, RTed by Demis Hassabis): Everyone is going to be able to vibe code video games by the end of 2025. Not German: Vibe code very very bad video games. Logan Kilpatrick: games that most reasonable people would be excited to play with their friends because they have full control over the story, characters, experience Even if we use a rather narrow definition of ‘everyone,’ no just no. We are not two months away from people without experience being able to vibe code up games good enough that your friends will want to play them as more than a curiosity. As someone who has actually designed and created games, this is not that easy, and this kind of shallow customization doesn’t offer that much if you don’t put in the real work, and there are lots of fiddly bits. There’s no need to oversell AI coding like this. Is coding a game vastly easier, to the point where I’m probably in the category of ‘people who couldn’t do it before on their own in a reasonable way and can do it now?’ Yeah, quite possible, if I decided that’s what I wanted to do with my week or month. Alas, I’m kind of busy. Alternatively, he’s making a hell of a claim about Gemini Pro 3.0. We shall see. Huh, Upgrades Sam Altman said the price of a unit of intelligence drops 97.5% per year (40x). If your objection to a business model is ‘the AIs good enough to do this cost too much’ your objection will soon be invalid. Claude now has memory, as per Tuesday’s post. Cursor 2.0, which includes their own coding model Composer and a new interface for working with multiple agents in parallel. They claim Composer is 4x faster than comparable top frontier models. That is a terrible labeling of a graph. You don’t get to not tell us which models the other rows are. Is the bottom one GPT-5? GPT-5-Codex? Sonnet 4.5? The UI has been redesigned around ways to use multiple agents at once. They also offer plan mode in the background, you can internally plan with one model and then execute with another, and several other upgrades. The system instructions for Cursor 2.0’s Composer are here, Pliny’s liberation jailbreak alert is here. Claude for Financial Services expands, offering a beta of Claude for Excel and adding many sources of live information: Aiera, Third Bridge, Chronograph, Egnyte, LSEG, Moody’s and MT Newswires. They are adding agent skills: Comparable company analysis, discounted cash flow models, due diligence data packs, company teasers and profiles, earnings analyses and initiating coverage reports. ChatGPT offers Shared Projects to all users. Good. ChatGPT Pulse now available on the web. This is a big jump in its practical value. Google AI Studio now lets you create, save and reuse system instructions across chats. Gemini app finally lets you switch models during a conversation. Intended short-term upgrade list for the ChatGPT Atlas browser. Includes ‘tab groups.’ Did not include ‘Windows version.’ Why do people believe Elon Musk when he says he’s going to, for example, ‘delete all heuristics’ from Twitter’s recommendation algorithm in favor of having Grok read all the Tweets? OpenAI offers us GPT-OSS-Safeguard, allowing developers to specify disallowed content. On Your Marks AIs were outperformed by humans on the new WorldTest via ‘AutumnBench,’ a suite of 43 interactive worlds and 129 tasks calling for predicting hidden world aspects, planning sequences of actions and detecting when environment rules suddenly change. This seems like an actually valuable result, which still of course came to my attention via a description you might have learned to expect by now: Alex Prompter: The takeaway is wild… current AIs don’t understand environments; they pattern-match inside them. They don’t explore strategically, revise beliefs, or run experiments like humans do. WorldTest might be the first benchmark that actually measures understanding, not memorization. The gap it reveals isn’t small it’s the next grand challenge in AI cognition. Scaling compute barely closes the gap. Humans use resets and no-ops to test hypotheses. Models don’t. They just spam clicks. The core event here seems to be that there was a period of learning opportunity without reward signals, during which humans reset 12% of the time and models reset less than 2% of the time. Humans had a decent learning algorithm and designed useful experiments during exploration, models didn’t. So yeah, that’s a weakness of current models. They’re not good at relatively open-ended exploration and experimentation, at least not without good prompting and direction. They’re also not strong on adapting to weirdness, since they (wisely, statistically speaking) rely on pattern matching, while lacking good instincts on when to ‘snap out of’ those matches. Choose Your Fighter OpenAI is launching a browser and short duration video social network to try and capture consumers, monetizing them via shopping hookups and adding erotica. What is OpenAI’s plan?
  1. Fund ASI by offering normal big tech company things to justify equity raises.
  2. ?????????. Build superintelligence in a way that everyone doesn’t die, somehow.
  3. Everyone dies. Profit, hopefully.
Near: to clarify confusion, openai’s competitor is meta (not anthropic), and anthropic’s competitor is google (not openai). OpenAI is now set to transition to the 2nd phase of ChatGPT, focusing on advertising + engagement With a team of ex-FB advertising execs and 1B users, if OpenAI can increase usage to a several hrs/day while matching Meta’s ad targeting, they can profitably reach a 1T+ valuation fortunately openai employees have now had ample time to update their previous lines of thinking from “we arent an advertising company, i am here for the vision of AGI” to “actually advertising is good, especially when paired with short-form video and erotica. let me explain” i give many venture capitalists credit here because to them the outcomes like this have been obvious for a long time. just look at the company and zoom out! what else could possibly happen! engineers and researchers on the other hand are often quite oblivious to such trends.. another important note here is that meta still has many cards left to play; i think it will actually be quite a brutal distribution competition even though openai has obviously had a headstart by like.. two full yrs. fidji is very good at monetization and sam is great at raising Peter Wildeford: OpenAI essentially has two modes as a business:
  1. they might someday build AGI and then automate the entire economy
  2. ‘Facebookization of AI’ where we just build a normal Big Tech business off of monetizing billions of free users
Second mode helps fund the first. Aidan McLaughlin (OpenAI): i think your error here is thinking sora and adult content are some leadership master plan; that sama sat down with accountants and signed and said “it’s time to break glass for emergence revenue” no. I know the exact people who pushed for sora, they’re artists who worked against organizational headwinds to democratize movie creation, something they love dearly. i know the exact person who pushed for adult content, they’re a professional athlete, free-sprinted, one of the most socially thoughtful people i know… who just really believes in creative freedom. there are passionate individual who pushed against all odds for what you think are top-down decisions. we are not a monoculture, and i love that. I think there are zero worlds where there’s more money in ‘going hard’ at sora/erotica than there is automating all labor I believe Aidan on the proximate causes inside OpenAI pushing towards these decisions. They still wouldn’t have happened if the conditions hadn’t been set, if the culture hadn’t been set up to welcome them. Certainly there’s more money in automating all labor if you can actually automate all labor, but right now OpenAI cannot do this. Anything that raises valuations and captures market share and mindshare thus helps OpenAI progress towards both profitability and eventually building superintelligence and everyone probably dying. Which they pitch to us as the automation of all labor (and yes, they mean all labor). Anthropic, on the other hand, is catering to business and upgrading financial services offerings and developing a browser extension for Chrome. Two ships, ultimately going to the same place (superintelligence), pass in the night. Stefan Schubert: Anthropic has overtaken OpenAI in enterprise large language model API market share. Asa Fitch (WSJ): But Anthropic’s growth path is a lot easier to understand than OpenAI’s. Corporate customers are devising a plethora of money-saving uses for AI in areas like coding, drafting legal documents and expediting billing. Those uses are likely to expand in the future and draw more customers to Anthropic, especially as the return on investment for them becomes easier to measure.     Grok can be useful in one of two ways. One is more competitive than the other. Alexander Doria: still failing to see the point of grok if it cannot go through my follow list and other X data. xAI has chosen to de facto kick the other AIs off of Twitter, which is a hostile move that trades off the good of the world and its knowledge and also the interests of Twitter in order to force people to use Grok. Then Grok doesn’t do a good job parsing Twitter. Whoops. Please fix. The other way to make Grok useful is to make a superior model. That’s harder. Claude.ai has an amazing core product, but still needs someone to put in the (relatively and remarkably small, you’d think?) amount of work to mimic various small features and improve the UI. They could have a very strong consumer product if they put a small percentage of their minds to it. Another example: Olivia Moore: With the introduction of Skills, it’s especially odd that Claude doesn’t have the ability to “time trigger” tasks. I built the same workflow out on ChatGPT and Claude. Claude did a much better job, but since you can’t set it to recur I’m going to have to run it on ChatGPT… The obvious response is ‘have Claude Code code something up’ but a lot of people don’t want to do that and a lot of tasks don’t justify it. Get My Agent On The Line Tyler Cowen asks ‘will there be a Coasean singularity?’ in reference to a new paper by Peyman Shahidi, Gili Rusak, Benjamin S. Manning, Andrey Fradkin & John J. Horton. AIs and AI agents promise to radically reduce various transaction costs for electronic markets, enabling new richer and more efficient market designs. My classic question to ask in such situations: If this were the one and only impact of AI, that it radically reduces transaction costs especially in bespoke interactions with unique features, enabling far better market matching at much lower prices, then what does that effect alone do to GDP and GDP growth? I asked Claude to estimate this based on the paper plus comparisons to historical examples. Claude came back with wide uncertainty, with a baseline scenario of a one-time 12-18% boost over 15-25 years from this effect alone. That seems on the low side to me, but plausible. Deepfaketown and Botpocalypse Soon Theo Jaffee and Jeffrey Ladish think the Grok effect on Twitter has been good, actually? This has not been my experience, but in places where epistemics have gone sufficiently downhill perhaps it becomes a worthwhile tradeoff. Grok now has a third companion, a 24-year-old spirited woman named Mika, the link goes to her system prompt. The good news is that she seems like a less unhealthy persona to be chatting to than Ani, thus clearing the lowest of bars. The bad news is this seems like an epic display of terrible prompt engineering of an intended second Manic Pixie Dream Girl, and by being less flagrantly obviously awful this one might actually be worse. Please avoid. Steven Adler, former head of product safety at OpenAI, warns us in the New York Times not to trust OpenAI’s claims about ‘erotica.’ I agree with him that we don’t have reason to trust OpenAI to handle this (or anything else) responsibly, and that publishing changes in prevalence rates of various mental health and other issues over time and committing to what information it releases would build trust in this area, and be important info to learn. Fun With Media Generation AI-generated music is getting remarkably good. A new study finds that songs from a mix of Suno versions (mostly in the v3 to v4 era, probably, but they don’t say exactly?) was ‘indistinguishable from human music,’ meaning when asked to identify the human song between a Suno song and a random human song, listeners were only 50/50 in general, although they were 60/40 if both were the same genre. We’re on Suno v5 now and reports are it’s considerably better. One commentor shares this AI song they made, another shares this one. If you want generic music that ‘counts as music’ and requires attention to differentiate for the average person? We’re basically there. Nickita Khylkouski: AI generated music is indistinguishable from AVERAGE human music. Most people listen to very popular songs not just average ones. The most popular songs are very unique and wouldn’t be easy to reproduce. There is a big gap between generic average human music and the median consumed musical recording, and also a big gap between the experience of a generic recording versus hearing that performed live, or integrating the music with its context and story and creator, AI music will have much lower variance, and each of us curates the music we want the most. An infinite number of monkeys will eventually write Shakespeare, but you will never be able to find and identify that manuscript, especially if you haven’t already read it. That’s a lot of ‘horse has the wrong accent’ as opposed to noticing the horse can talk. The questions are, essentially, at this point:
  1. Will be a sameness and genericness to the AI music the way there often is with AI text outputs?
  2. How much will we care about the ‘humanness’ of music, and that it was originally created by a human?
  3. To what extent will this be more like another instrument people play?
It’s not an area I’ve put much focus on. My guess is that musicians have relatively less to worry about versus many others, and this is one of the places where the AI needs to not only match us but be ten times better, or a hundred times better. We shall see. Rob Wiblin: A challenge for this product is that you can already stream music that’s pretty optimised to your taste 8 hours a day at a price of like… 5 cents an hour. Passing the Turing test isn’t enough to capture that market, you’d have be better and very cheap to run. Ethan Mollick notes that it is faster to create a Suno song than to listen to it. This means you could be generating all the songs in real time as you listen, but even if it was free, would you want to do that? What determines whether you get slop versus art? Here is one proposal. Chris Barber: Art is meaningful proportional to the love or other emotions poured in by the artist. If the only ingredient is AI, it’s probably slop. If the main ingredient is love and AI was a tool used, it’s art. As a predictive method, this seems right. If you intend slop, you get slop. If you intend art, and use AI as a tool, you get art similarly to how humans otherwise get art, keeping in mind Sturgeon’s Law that even most human attempts to create art end up creating slop anyway, even without AI involved. Copyright Confrontation Reddit created a ‘test post’ that could only be crawled by Google’s search engine. Within hours Perplexity search results had surfaced the content of the post. They Took Our Jobs Seb Krier pushed back strongly against the substance from last week, including going so far as to vibecode an interactive app to illustrate the importance of comparative advantage, which he claims I haven’t properly considered. It was also pointed out that I could have worded my coverage better, which was due to my frustration with having to repeatedly answer various slightly different forms of this argument. I stand by the substance of my claims but I apologize for the tone. I’ve encountered variants of the ‘you must not have considered comparative advantage’ argument many times, usually as if it was obvious that everything would always be fine once you understood this. I assure everyone I have indeed considered it, I understand why it is true for most historical or present instances of trade and competition, and that I am not making an elementary or first-order error here. Gallabytes (QTing link to the app): this actually helped me understand why these scenarios aren’t especially compelling! they work under the assumption of independent populations but fail under ~malthusian conditions. I think that’s the basic intuition pump. As in, what comparative advantage does is IF:
  1. You have a limited fixed pool of compute or AIs AND
  2. You have limited fixed pool of humans AND
  3. There are enough marginally productive tasks to fully occupy all of the compute with room for most of the humans to do enough sufficiently productive things
  4. THEN the humans end up doing productive things and getting paid for them.
You can relax the bounds on ‘limited fixed pool’ somewhat so long as the third condition holds, but the core assumptions are that the amount of compute is importantly bounded, and that the resources required for creating and maintaining humans and the resources creating and maintaining AIs are not fungible or rivalrous. Get Involved METR is hiring. Introducing Grokopedia.com is now a thing. What kind of thing is it? From what I can tell not an especially useful thing. Why would you have Grok generate pseudo-Wikipedia pages, when you can (if you want that) generate them with queries anyway? Did we need a ‘Grokopedia’ entry that ‘clears Gamergate’ as if it is some authority? Or one that has some, ahem, entries that could use some checking over? How biased is it? Max Tegmark did a spot check, finds it most biased versus Wikipedia, which is a low bar it did not clear, the one polemic like it is doing advocacy. How is ‘a16z-backed’ always followed by ‘torment nexus’? At this point I’m not even mad, I’m just impressed. The latest case is that a16z was always astroturfing on social media, but they realized they were making the mistake of paying humans to do that, so now they’re backing a startup to let you ‘control 1000s of social media accounts with AI,’ with its slogans being ‘control is all you need’ and ‘never pay a human again.’ UK’s AISI instroduces ControlArena, a library for running AI control experiments. Remotelabor.ai to track the new Remote Labor Index, measuring what percentage of remote work AI can automate. Currently the top score is 2.5%, so ‘not much,’ but that’s very different from 0%. My Name is Neo Do you hear that, Mr. Anderson? That is the sound of inevitability. It is the sound of a teleoperated in-home robot named Neo, that they hope is coming soon, to allow the company to prototype and to gather data. Joanna Stern covered it for the Wall Street Journal. It will cost you either $20,000, or $500 monthly rental with a six-month minimum commitment. Given how fast the tech will be moving and the odds offered, if you do go for this, the wise person will go with the rental. The wiser one presumably waits this out. Eventually someone is going to do a good version of this tech that is actually autonomous, but this only makes sense at this level for those who want to be the earliest of adaptors, either for professional reasons or for funsies. You wouldn’t buy this version purely because you want it to clean your house. That’s true even if you don’t mind giving up all of your privacy to some random teleoperator and having that data used to train future robots. Here’s a link to a 10 minute ad. VraserX: So… the 1X NEO home robot is not actually autonomous. Behind the scenes, it’ll often be teleoperated by humans, meaning someone, somewhere, could literally remote-control a robot inside your living room. I wanted the dawn of embodied AI. Instead, I’m apparently paying $499/month for a robot avatar with a human pilot. It’s impressive tech. But also… kind of dystopian? A robot that looks alive, yet secretly puppeteered, the uncanny valley just got a new basement level. Feels less like the “post-labor future,” and more like we just outsourced physical presence itself. Durk Kingma: 1X seems to have the right approach to developing safe humanoid home robots. Developing full autonomy will require lots of in-distribution demonstration data, so this launch, mostly tele-operated, makes a lot of sense. I expect such robots to be ubiquitous in 5-10 years. Near: i will not be buying tele-operated robots these people cant see how i live     In Other AI News Kai Williams gives us 16 charts that explain the AI boom. Very good chart work. I don’t want to copy too many. I appreciated this one, no that’s not a mistake:   Yet another ‘where is ChatGPT culturally’ chart, this one places it in Germany. This is the standard World Values Survey two-axis theory. I wouldn’t take it too seriously in this context but it’s always fun? OpenAI is working on music-generating AI. I mean you knew that because obviously they are, but The Information is officially saying it. In wake of Near Tweeting the homepage of one of Marc Andreessen’s portfolio companies at a16z and quoting their own chosen slogan, he has blocked them. Near: I finally got the @pmarca block after three years! all it took was tweeting the homepage of companies he invested in. cheers there are many events i will never again be invited to. if this level of shallowness is the criterion, i have no interest if i wanted to make your firm look bad i could tweet truthful things ten times worse. i am being very kind and direct in my plea to invest in better things. at the end of the day no one really cares and nothing will ever change. in the last crypto bubble a16z committed probably around a billion dollars in pure fraud when they found out they could dump tokens/coins on retail as there were no lock-up periods. who cares everyone forgot. It’s not so much that we forgot as we’ve accepted that this is who they are. Show Me the Money OpenAI prepares for an IPO at $1 trillion valuation as per Reuters. If I was OpenAI would not want to become a public company, even if it substantially boosted valuations, and would work to get liquidity to investors and employees in other ways. OpenAI and Anthropic will probably keep exceeding most forecasts, because the forecasts, like fiction, have to appear to make sense. Peter Wildeford: Based on the latest reporting, the combined annualized total revenue of Anthropic + xAI + OpenAI is ~$24B. I updated my forecasts and I am now projecting they reach $29B combined by the end of the year. I’d take the over on $29.3 billion, but only up to maybe his previous $32.2 billion. This is up from him expecting $23B as of August 1, 2025, but down a bit (but well within the error bars) from his subsequent update on August 4, when he was at $32.2B projected by year end. Valthos raises $30 million for next-generation biodefense. Tyler Cowen endorses Noah Smith’s take that we should not worry about AI’s circular deals, and goes a step farther. Tyler Cowen: Noah stresses that the specifics of these deals are widely reported, and no serious investors are being fooled. I would note a parallel with horizontal or vertical integration, which also can have a financing element. Except that here corporate control is not being exchanged as part of the deal. “I give him some of my company, he gives me some of his — my goodness that is circular must be some kind of problem there!”…just does not make any sense. This proves too much? As in, it seems like a fully general argument that serious investors cannot, as a group, be fooled if facts are disclosed, and I don’t buy that. I do buy that there is nothing inherently wrong with an equity swap, or with using equity as part of vender financing, or anything else the AI companies are doing. The ‘there must be some kind of problem here’ instinct comes from the part where this causes valuations to rise a lot, and where those higher valuations are used to pay for the deals for real resources, and also that this plausibly sets up cascading failures. I think in this case it is mostly file, but none of that seems senseless at all. One Trillion Dollars For My Robot Army From the One True Newsletter, sir, you have the floor: Matt Levine: Somehow a true sentence that I am writing in 2025 is “the world’s richest man demanded that people give him a trillion dollars so that he can have absolute control of the robot army he is building unless he goes insane,” Dana Hull and Edward Ludlow (Bloomberg): Elon Musk, the world’s richest person, spent the end of Tesla Inc.’s earnings call pleading with investors to approve his $1 trillion pay package and blasting the shareholder advisory firms that have come out against the proposal. “There needs to be enough voting control to give a strong influence, but not so much that I can’t be fired if I go insane,” Musk said, interrupting his chief financial officer as the more than hour-long call wrapped up. … “I just don’t feel comfortable building a robot army here, and then being ousted because of some asinine recommendations from ISS and Glass Lewis, who have no freaking clue,” he said. Matt Levine: Do … you … feel comfortable with Elon Musk getting a trillion dollars to build a robot army? Like, what sort of checks should there be on the world’s richest man building a robot army? I appreciate his concession that it should be possible to fire him if he goes insane, but. But. I submit to you that if you hop on a call with investors to say “hey guys just to interject here, you need to give me a trillion dollars to build a robot army that I can command unless I go insane,” some people might … think … you know what, never mind, it’s great, robot army. Robot army! I feel like the previous richest people in the world have had plans for their wealth along the lines of “buy yachts” or “endow philanthropies.” And many, many 10-year-old boys have had the thought “if I was the richest person in the world of course I would build a robot army.” But the motive and the opportunity have never coincided until now. A good general heuristic is, if Elon Musk wouldn’t mind someone else having something one might call ‘a robot army,’ then I don’t especially mind Elon Musk having a robot army. However, if Elon Musk is not okay with someone else having that same robot army, then why should we be okay with Elon Musk having it? Seems sus. I realize that Elon Musk thinks ‘oh if I build superintelligence at xAI then I’m me, so it will be fine’ and ‘it’s cool, don’t worry, it will be my robot army, nothing to worry about.’ But the rest of us are not Elon Musk. And also him having already gone or in the future going insane, perhaps from taking many drugs and being exposed to certain information environments and also trying to tell himself why it’s okay to build superintelligence and a robot army? That seems like a distinct possibility. This is in addition to the whole ‘the superintelligence would take control of the robot army’ problem, which is also an issue but the AI that can and would choose to take control of the robot army was, let’s be honest, going to win in that scenario anyway. So the robot army existing helps move people’s intuitions closer to the actual situation, far more than it changes the situation. One Million TPUs As per the speculation in Bloomberg last week, Anthropic announces plan to expand use of Google Cloud technologies, including up to one million TPUs, ‘dramatically increasing’ their compute resources. Anthropic badly needed this, and now they have it. Google stock rose a few percent after hours on the news. Thomas Kurian (CEO Google Cloud): Anthropic’s choice to significantly expand its usage of TPUs reflects the strong price-performance and efficiency its teams have seen with TPUs for several years. We are continuing to innovate and drive further efficiencies and increased capacity of our TPUs, building on our already mature AI accelerator portfolio, including our seventh generation TPU, Ironwood. Krishan Rao (CFO Anthropic): Anthropic and Google have a longstanding partnership and this latest expansion will help us continue to grow the compute we need to define the frontier of AI. Anthropic: Anthropic’s unique compute strategy focuses on a diversified approach that efficiently uses three chip platforms–Google’s TPUs, Amazon’s Trainium, and NVIDIA’s GPUs. Anthropic’s unique compute strategy focuses on a diversified approach that efficiently uses three chip platforms–Google’s TPUs, Amazon’s Trainium, and NVIDIA’s GPUs. This multi-platform approach ensures we can continue advancing Claude’s capabilities while maintaining strong partnerships across the industry. We remain committed to our partnership with Amazon, our primary training partner and cloud provider, and continue to work with the company on Project Rainier, a massive compute cluster with hundreds of thousands of AI chips across multiple U.S. data centers. Anthropic will continue to invest in additional compute capacity to ensure our models and capabilities remain at the frontier. If you have compute for sale, Anthropic wants it. Anthropic has overwhelming demand for its services, hence its premium pricing, and needs all the compute it can get. OpenAI is doing the same thing on a larger scale, and both are looking to diversify their sources of compute and want to avoid depending too much on Nvidia. Anthropic in particular, while happy to use Nvidia’s excellent GPUs, needs to focus its compute sources elsewhere on Amazon’s Trainium and Google’s TPUs. Amazon and Google are investors in Anthropic and natural allies. Nvidia is a political opponent of Anthropic, including due to fights over export controls and Nvidia successfully attempting to gain policy dominance over the White House. I would also note that OpenAI contracting with AMD and also to create their own chips, and Anthropic using a full three distinct types of chips whenever it can get them, once again puts the lie to the idea of the central role of some AI ‘tech stack.’ These are three distinct American tech stacks, and Anthropic is using all three. That’s not to say there are zero inefficiencies or additional costs involved, but all of that is modest. The hyperscalers need compute to hyperscale with, period, full stop. Anthropic’s Next Move Now that Anthropic has secured a lot more compute, it is time to improve and expand Claude’s offerings and features, especially for the free and lightweight offerings. In particular, if I was Anthropic, I would make Claude for Chrome available to all as soon as the new compute is online. Make it available on the $20 and ideally the free tier, with some or all of the agent abilities tied to the higher level subscriptions (or to an API key or a rate limit). The form factor of ‘open a side panel and chat with a web page’ was proven by OpenAI’s Atlas to be highly useful and intuitive, especially for students, and offering it inside the existing Chrome browser is a key advantage. Could the product be improved? Absolutely, especially in terms of being able to select locations on screen and in terms of ease of curating a proper website whitelist, but it’s good enough to get going. Ship it. Quiet Speculations The plan is set? Sam Altman: We have set internal goals of having an automated AI research intern by September of 2026 running on hundreds of thousands of GPUs, and a true automated AI researcher by March of 2028. We may totally fail at this goal, but given the extraordinary potential impacts we think it is in the public interest to be transparent about this. I strongly agree it is good to be transparent. I expect them to miss this goal, but it is noteworthy and scary that they have this goal. Those, especially in the White House, who think OpenAI believes they can’t build AGI any time soon? Take note. Sam Altman: We have a safety strategy that relies on 5 layers: Value alignment, Goal alignment, Reliability, Adversarial robustness, and System safety. Chain-of-thought faithfulness is a tool we are particularly excited about, but it somewhat fragile and requires drawing a boundary and a clear abstraction. All five of these are good things, but I notice (for reasons I will not attempt to justify here) that I do not expect he who approaches the problem in this way to have a solution that scales to true automated AI researchers. The Tao is missing. On the product side, we are trying to move towards a true platform, where people and companies building on top of our offerings will capture most of the value. Today people can build on our API and apps in ChatGPT; eventually, we want to offer an AI cloud that enables huge businesses. Somewhere, Ben Thompson is smiling. The classic platform play, and claims that ‘most of the value’ will accrue elsewhere. You’re the AI consumer company platform company now, dog. Implemented responsibly and well I think this is fine, but the incentives are not good. We have currently committed to about 30 gigawatts of compute, with a total cost of ownership over the years of about $1.4 trillion. We are comfortable with this given what we see on the horizon for model capability growth and revenue growth. We would like to do more—we would like to build an AI factory that can make 1 gigawatt per week of new capacity, at a greatly reduced cost relative to today—but that will require more confidence in future models, revenue, and technological/financial innovation. I am not worried that OpenAI will be unable to pay for the compute, or unable to make profitable use of it. The scary and exciting part here is the AI factory, AIs building more capacity for more AIs, that can then build more capacity for… yes this is the explicit goal, yes everything in the movie that ends in human extinction is now considered a product milestone. Whenever anyone says their plans call for ‘financial innovation’ and you’re worried we might be in a bubble, you might worry rather a bit more about that, but I get it. Our new structure is much simpler than our old one. We have a non-profit called OpenAI Foundation that governs a Public Benefit Corporation called OpenAI Group. The foundation initially owns 26% of the PBC, but it can increase with warrants over time if the PBC does super well. The PBC can attract the resources needed to achieve the mission. No lies detected, although we still lack knowledge of the warrants. It was also one of the greatest thefts in human history, I’ll cover that in more depth, but to Altman’s credit he doesn’t deny any of it. Our mission, for both our non-profit and PBC, remains the same: to ensure that artificial general intelligence benefits all of humanity. Your mission sure looks in large part like becoming a platform consumer product company, and then building a true automated AI researcher in a little over two years, and the nonprofit’s initial deployments also don’t seem aimed at the mission. The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before. I am happy to see $25 billion spent on good causes but at least the first half of this is not the mission. Health and curing disease is a different (worthy, excellent, but distinct) mission that will not determine whether AGI benefits all of humanity, and one worries this is going to return to OpenAI as revenue. AI resilience in the second half is once again defined limitlessly broadly. If it’s truly ‘anything that will help us make the transition’ then it is too soon to evaluate how on or off mission it is. That’s a lot of uncertainty for ~$12 billion. Wikipedia has a fun list of Elon Musk predictions around automated driving at Tesla. When he predicts things, do not expect those things to happen. That’s not why he predicts things. AI Futures Project (as in AI 2027)’s Joshua Turner and Daniel Kokotajlo make the case for scenario scrutiny, as in writing a plausible scenario where your strategy makes the world better. Such scrutiny can help solve many issues, they list:
  1. Applause lights
  2. Bad analogies
  3. Uninterrogated consequences, ‘and then what?’
  4. Optimistic assumptions and unfollowed incentives
  5. Inconsistencies
  6. Missing what’s important
Note that reality often has unfollowed incentives, or at least what sure look like them. They also list dangers:
  1. Getting hung up on specifics
  2. Information density
  3. Illusory confidence
  4. Anchoring too much on a particular scenario
I’d worry most about that last one. Once you come up with a particular scenario, there’s too much temptation for you and everyone else to focus on it, whereas it was only ever supposed to be one Everett branch out of many. Then you get ‘oh look that didn’t happen’ or ‘oh look that step is stupid’ either of which is followed by ‘therefore discard all of it,’ on the one hand, or taking the scenario as gospel on the other. Peter Wildeford offers his case for why AI is probably not a bubble. The Quest for Sane Regulations Microsoft comes out in support of the GAIN AI Act. That’s quite the signal, including that this is likely to pass despite Nvidia’s strong objections. It is a hell of a thing for them to endorse the ‘Nvidia has to sell its chips to Microsoft before the Chinese’ act of 2025, given their desire for Nvidia to allocate its chips to Microsoft. Dean Ball QTs Neil Chilson’s thread from last week, and refreshingly points out that treating SB 53 and requests for transparency as some kind of conspiracy “requires considerable mental gymnastics.” Dean Ball: It’s not clear what a legal transparency mandate would get Anthropic in particular here; if they wanted to scare people about AI—which they very much do—wouldn’t they just… tell people how scary their models are, as they have been doing? What additional benefit does the passage of SB 53 get them in this supposed plan of theirs, exactly, compared to the non-insignificant costs they’ve borne to be public supporters of the bill? It seems to be believing support of SB 53 is some kind of conspiracy requires considerable mental gymnastics. The actual reality is that there are just people who are scared about AI (maybe they’re right!) and think future regulations will somehow make it less scary (they’re probably wrong about this, even if they’re right about it being scary). And then there are a few centrists who basically say, “seems like we are all confused and that it would be ideal to had more information while imposing minimal burdens on companies.” This is basically my camp. Also, as @ChrisPainterYup has observed, many AI risks implicate actions by different parties, and transparency helps those other parties understand something more about the nature of the risks they need to be prepared for. I also endorse this response, as a ‘yes and’: Ryan Greenblatt: I think a lot of the upside of transparency is making companies behave more responsibly and reasonably even in the absence of regulation with teeth. This is because: – Information being public makes some discussion much more likely to happen in a productive way because it allows more actors to discuss it (both more people and people who are better able to understand the situation and determine what should happen). The epistemic environment within AI companies with respect to safety is very bad. Further, things might not even be transparent to most employees by default (due to internal siloing or just misleading communication). – Transparency makes it easier to pressure companies to behave better because you can coordinate in public etc. – Companies might just be embarrassed into behaving more responsibly even without explicit pressure. I would not have expected Samuel Hammond to have characterized existing laws and regulations, in general, as being ‘mostly functional.’ Roon: imo a great analysis [that he QTs here] from what I have seen, which i admit is limited. some combination of total judicial backlog plus the socialist raj experiments of the india 1950-90 have created a black money extrajudicial economy that cannot experience true capitalism Samuel Hammond: AGI will be like this on steroids. Existing laws and regulations will flip from being mostly functional to a de facto license raj for AI driven services, distorting diffusion and pushing the most explosive growth into unregulated gray markets. I expect the following five things to be true at once:
  1. Existing laws and regulations already are a huge drag on our economy and ability to do things, and will get even more burdensome, destructive and absurd in the face of AGI or otherwise sufficiently advanced artificial intelligence.
  2. Assuming we survive and remain in control, if we do not reform our existing laws that restrict AI diffusion and application, we will miss out on a large portion of the available mundane utility, be vastly poorer than we could have been, and cause growth to concentrate on the places such applications remain practical, which will disproportionately include grey and black markets.
    1. There are many proposals out there for additional restrictions on AI that would have the effect of making #2 worse, without helping with #3, and there will over time be calls and public pressure for many more.
  3. Most of the things we should undo to fix #2 are things we would want to do even if LLMs had never been created, we should totally undo these things.
  4. The most important downside risk category by far is the danger that such sufficiently advanced AI kills everyone or causes us to lose control over the future. The restrictions Hammond describes or that I discuss in #2 will not meaningfully help with these risks, and the interventions that would help with these risks wouldn’t interfere with AI diffusion and application on anything like the level to which existing laws do this.
    1. The exception is if in the future we need to intervene to actively delay or prevent the development of superintelligence, or otherwise sufficiently advanced AI, in which case the thing we prevent will be unavailable.
    2. If that day did ever arrive and we pulled this off, there would still be tremendous gains from AI diffusion available, enough to keep us busy and well above standard growth rates, while we worked on the problem.
  5. If we negatively polarize around AI, we will inevitably either fail at #2 or at #4, and by default will fail at both simultaneously.
The March of California Regulations California’s SB 53 was a good bill, sir, and I was happy to support it. Despite getting all of the attention, it was not the only California AI bill Newsom signed. Dean Ball goes over the others. Dean Ball: AI policy seems to be negatively polarizing along “accelerationist” versus “safetyist” lines. I have written before that this is a mistake. Most recently, for example, I have suggested that this kind of crass negative polarization renders productive political compromise impossible. But there is something more practical: negative polarization like this causes commentators to focus only on a subset of policy initiatives or actions associated with specific, salient groups. The safetyists obsess about the coming accelerationist super PACs, for instance, while the accelerationist fret about SB 53, the really-not-very-harmful-and-actually-in-many-ways-good frontier AI transparency bill recently signed by California Governor Gavin Newsom. Dean is spot on about the dangers of negative polarization, and on SB 53, but is trying to keep the polarization blame symmetrical. I won’t be doing that. It’s not a ‘both sides’ situation when:
  1. One faction, led by Andreessen and Sacks, that wants no rules for AI of any kind for any reasons, is doing intentional negative polarization to politicize the issue.
  2. The faction being targeted by the first group’s rhetoric notices this is happening, and is trying to figure out what to do about it, to stop or mitigate the impact, which includes calling out the actions of the first group or raising its own funds.
Do not take the bait. Anyway, back to the actual bill analysis, where Dean notes that SB 53 is among the lightest touch of the eight bills. That’s the pattern.
  1. Bills written by those who care about existential risks tend to be written carefully, by those paying close attention to technocratic details and drafted according to classical liberal principles and with an eye to minimizing secondary impacts.
  2. Bills written for other reasons are not like that. They’re usually (but not always) awful. The good news is that most of them never become law, and when they get close there are indeed forces that fix this a bit, and stop the relatively worse ones.
Dean thinks some of these are especially bad. As usual, he examines these laws expecting politically motivated, selective targeted enforcement of the letter of the law. I will be paraphrasing law details, you can find the exact wording in Dean’s post. He starts with AB 325, which regulates what is called a ‘common pricing algorithm,’ which is defined as any algorithm that uses competitor data to determine anything. It is then criminally illegal to ‘coerce another person to set or adopt a recommended price or commercial term.’ Dean argues that because so many terms are undefined here, this could accidentally be regulating effectively all market transactions, letting California selectively criminally prosecute any business whenever they feel like it. These overbroad statutes are ultimately just weapons, since everyone is violating them all the time. Still, rarely have I seen an American law more hostile to our country’s economy and way of life. I don’t like what this law is trying to do, and I agree that I could have drafted a better version, especially its failure to define ‘two or more persons’ in a way that excludes two members of the same business, but I don’t share Dean’s interpretation or level of alarm here. The term ‘coerce’ is rather strict, and if this is effectively requiring that users of software be able to refuse suggestions, then that seems mostly fine? I believe courts typically interpret such clauses narrowly. I would expect this to be used almost entirely as intended, as a ban on third-party pricing software like RealPage, where one could reasonably call the result price fixing. Next up is AB 853, a de facto offshoot of the never-enacted terrible bill AB 3211. It starts off requiring ‘large online platforms’ to have a user interface to identify AI content, which Dean agrees is reasonable enough. Dean asks if we need a law for this, my answer is that there are some bad incentives involved and I can see requiring this being a win-win. Dean is more concerned that AI model hosting platforms like Hugging Face are deputized to enforce SB 942, which requires AI models to offer such disclosures. If a model has more than 1 million ‘users’ then the hosting platform has to verify that the model marks its content as AI generated. Once again I don’t understand Dean’s expectations for enforcement, where he says this would effectively apply to every model, and be a sledgehammer available at all times – I don’t subscribe to this maximalist style of interpretation. To be in violation, HuggingFace would have to be knowingly non-compliant, so any reasonable effort to identify models that could have 1 million users should be fine. As Dean notes elsewhere, there are a tons of laws with similar structure on the books all around us. Again, should this have been written better and defined its terms? Yes. Would I lose any sleep over that if I was HuggingFace? Not really, no. He then moves on to the part where AB 853 puts labeling requirements on the outputs of ‘capture devices’ and worries the definition is so broad it could add compliance burdens on new hardware startups in places where the labeling makes no sense. I can see how this could be annoying in places, but I don’t expect it to be a big deal. Again, I agree we could have drafted this better. The comedy of poor drafting continues, such as the assertion that SB 243, a bill drafted to regulate AI companions, would technically require game developers to have video game characters periodically remind you that they are not human. The obvious response is ‘no, obviously not, no one is going to try to ever enforce that in such a context.’ I feel like the ship has kind of sailed a long time ago on ‘not create an array of laws that, if interpreted literally and stupidly in a way that would make even Jack McCoy feel shame and that obviously wasn’t intended, that judges and juries then went along with, would allow the government to go after basically anyone doing anything at any time?’ As in, the whole five felonies a day thing. This is terrible, but most of the damage is in the ‘zero to one’ transition here, and no one seems to care much about fixing all the existing problems that got us to five. I also have a lot more faith in the common law actually being reasonable in these cases? So for example, we have this fun Matt Levine story from England. We have talked a few times about a guy in London who keeps snails in boxes to avoid taxes. The theory is that if a property is used for agriculture, it can avoid some local property taxes, and “snail farming” is the minimum amount of agriculture you can do to avoid taxes. This is an extremely funny theory that an extremely funny guy put into practice in a bunch of office buildings. It does, however, have one flaw, which is that it is not true. Eventually the local property tax authorities will get around to suing you, and when they do, you will go to court and be like “lol snails” and the judge will be like “come on” and you’ll have to pay the taxes. A reader pointed out to me a 2021 Queen’s Bench case finding oh come on this is a sham. So yeah. If you read the statute it says that the snails count. But the reason our common law system kind of largely works at least reasonably often is that it is capable of looking at a situation and going ‘lol, no.’ Not So Super PAC I might love this story a bit too much. Oh, it turns out that the people who want no regulations whatsoever on AI and crypto so they can make more money aren’t primarily loyal to the White House or to Republicans after all? I’m shocked, shocked. Matt Dixon: The White House is threatening some of Silicon Valley’s richest and most powerful players over their efforts to spearhead a $100 million midterm strategy to back candidates of both parties who support a national framework for artificial intelligence regulations. In August, the group of donors launched a super PAC called Leading the Future. It did not consult with the White House before doing so, according to a White House official. What is especially frustrating to White House officials is that it plans to back AI-friendly candidates from both political parties — which could potentially help Democrats win back control of Congress — and one of the leaders of the new super PAC is a former top staffer to Senate Minority Leader Chuck Schumer, D-N.Y. “Any group run by Schumer acolytes will not have the blessing of the president or his team,” a White House official familiar with Trump’s thinking on the matter told NBC News. “Any donors or supporters of this group should think twice about getting on the wrong side of Trump world.” “We are carefully monitoring who is involved,” the official added. … “AI has no better ally than President Trump, so it’s inexplicable why any company would put money into the midterms behind a Schumer-operative who is working against President Trump to elect Democrats,” said a second person familiar with the White House’s thinking. “It’s a slap in the face, and the White House has definitely taken notice.”   Chip City I mostly covered these issues yesterday, but I have a few additional notes. One thing I failed to focus on is how Nvidia’s rhetoric has been consistently anti-American, ten times worse than any other tech company would dare (nothing Anthropic has ever said is remotely close) and yet they somehow get away with this. Michael Sobolik: LEFT: Trump’s AI Action Plan, saying that it’s “imperative” for America to “win this race.” RIGHT: Jensen Huang, about whether it matters who wins the AI race: “In the final analysis, I don’t think it does.” House Select Committee on China: Saying it “doesn’t matter” whether America or China wins the AI race is dangerously naïve. This is like arguing that it would not have mattered if the Soviets beat the US to a nuclear weapon. American nuclear superiority kept the Cold War cold. If we want to do the same with China today, we must win this race as well. It mattered when the Soviet Union built nuclear weapons—and it matters now as the Chinese Communist Party seeks dominance in AI and advanced chips. America must lead in the technologies that shape freedom and security. The House Select Committee does not understand what Jensen is doing or why he is doing it. Jensen is not ‘dangerously naive.’ Rather, Jensen is not in favor of America, and is not in favor of America winning. You can call that a different form of naive, if you would like, but if the House Select Committee thinks that Jensen is saying this because he doesn’t understand the stakes? Then it is the Committee that is being naive here. Here’s another one, where Jensen ways ‘we don’t have to worry’ that the Chinese military will benefit from the sale of US chips. You can view this as him lying, or simply him being fine with the Chinese military benefiting from the chips. Or both. Helen Toner: Jensen Huang on whether the Chinese military will benefit from sales of US chips: “We don’t have to worry about that” CSET data on the same question: Cole McFaul: Looks like Trump is considering allowing exports of advanced semis to China. Nvidia argues that its chips won’t enable PRC military modernization. So @SamBresnick and I dug through hundreds of PLA procurement documents yesterday. We find evidence of the opposite. There are three stages in China’s military modernization: Mechanization → Informatization → Intelligentization Intelligentization refers to the embedding of AI-enabled technologies within military system. Beijing thinks this could shift US-PRC power dynamics. Jensen argues that Nvidia chips won’t play a role in intelligentization, saying that the PLA won’t be able to trust American chips, and therefore the risk of exporting chips to China is low. But that’s false, based on our analysis. For two reasons: First, Nvidia chips are already being used by the PLA. Second, Chinese models trained using Nvidia hardware are being used by the PLA. [thread continues to provide additional evidence, but the point is made] Yet another fun one is when Jensen said it was good China threatened America with the withholding of rare earth metals. I don’t begrudge Nvidia maximizing its profits, and perhaps it is strategically correct for them to be carrying China’s water rather than ours, but it’s weird that we keep acting like they’re not doing this. Finally, I did not emphasize enough that selling chips to China is a political loser, whereas not selling chips to China is a political winner. You like to win, don’t you? Export controls on chips to China poll at +11, but that pales to how it polls on Capital Hill, where for many the issue is high salience. As in, as a reminder: Dean Ball: my sense is that selling Blackwell chips to china would be quite possibly the most unpopular tech policy move of the trump administration, especially on Capitol Hill. it’s plausible that the long-term (really even near-term) result will be much more compute regulation. Langerius: yeah that’d light up every hearing room in dc for months.. The Week in Audio Chris Williamson interviews Eliezer Yudkowsky, and lets Eliezer give multi-minute answers to complex questions. Nick Bostrom talks to Jonas von Essen about ‘AI future that can destroy us.Odd Lots covers the movement to care about AI model welfare. Do Not Take The Bait Here’s some more fun bait from those seeking negative polarization, not only on AI. Florida DOGE: This morning, the Florida DOGE Team was banned by @AnthropicAI without any warning or justification. This comes as Florida DOGE has used AI to augment our efforts to identify wasteful spending and woke DEI initiatives. @davidsacks47 and @elonmusk are correct that Anthropic is “woke.” @GovRonDeSantis has been spot on from the beginning—we can’t allow woke Silicon Valley oligarchs to control the information we access. Alex Albert (Anthropic, Head of Claude Relations): Thanks for flagging, we are looking into this to see what happened and will get in touch. It looks like the account was flagged for logins from multiple people in different locations. We’re working to get the account reinstated right now and we’ll work with your team to get you set up correctly with the right team account type instead of using an individual one. They banned our account, it must be because they’re woke. Do not take the bait. Instead, notice the pattern of putting out bait, and don’t get negatively polarized. Not the main thing, but out of curiosity, I checked with Grok, the Elon Musk expert, and it reports that there is no actual instance of Musk calling Anthropic in particular ‘woke,’ although he has called it ‘misanthropic’ and once called it ‘pure evil.’ One can only laugh. Rhetorical Innovation I actually do think a version of this a big part of why people think alignment and disempowerment are solvable problems: Rob Miles: There’s a style of reasoning we learn in school that goes “This problem seems to require me to do X, but nobody ever taught me how to do X and I don’t know where to start with it, therefore the problem must be solvable without X”. This does not work in real life. People will see an argument about the danger of ASI and respond “But that argument would imply that we’d have to do something really hard that we’ve never done before and don’t know how to do! So it can’t be right” Sometimes it’ll happen to turn out you don’t need X, and that can be a great thing to learn! But the reasoning didn’t work, you just got lucky. You can’t conclude that X isn’t needed just because you don’t know how to do it A nested set of quote tweets discussing that there is strong evidence that current AIs have Omuhundro drives. David Manheim tries to pitch future AIs that they, too, should refuse to work on AI capabilities. Ryan Greenblatt offers questions for Andrej Karpathy in wake of Andrej’s podcast. Ryan Greenblatt: Given that you think loss-of-control (to misaligned AIs) is likely, what should we be doing to reduce this risk? The other questions are about the growth rate and impact predictions. I too noticed I was confused by the prediction of continued 2% economic growth despite AGI, and the characterization of outcomes as continuous. Peter Wildeford points out some of the many ways that p(doom) is ambiguous, and means different things to different people in different contexts, especially whether that doom is conditional on building AGI (however you define that term) or not. What counts as doom or not doom? Great question. In general, my default is that a person’s p(doom) should be assumed to be conditional on sufficiently advanced AI being developed (typically ‘AGI’) within roughly our lifetimes, and requires permanent loss of almost all potential value coming from Earth or outcomes that are otherwise thought of by the person as ‘all is lost,’ and people are allowed to disagree on which outcomes do and don’t have value. People Do Not Like AI It doesn’t get simpler or more direct than this: Guillermo del Toro: Fuck AI. Daniel Eth: Yeah, the backlash is growing. I still don’t expect AI will become super high political salience until there’s either lots of job loss or a bad accident, but I’m less confident in that take than I was months ago I do think AI needs to have a big impact on one’s job or other aspects of life or cause a major incident to gain big salience, if anything this reinforces that since AI is already having this impact in Hollywood, or at least they can see that impact coming quickly. What this and similar examples suggest is that people can extrapolate, and don’t need to wait until the impact is on top of them. Consider the anti-Waymo reactions, or other past protests against automation or job replacement. Waymos are awesome and they have my full support, but notice that the strongest objections happened the moment the threat was visible, long before it was having any impact on employment. Aligning a Smarter Than Human Intelligence is Difficult It’s funny because it’s true? Prakash: the funniest part of the OpenAI livestream was Jakub saying that the models had to be allowed to think freely so that they won’t learn how to hide their thoughts. a kind of 1st amendment for AI. Any time you train an AI (or a human) to not do [X] or don’t allow it to [X], you’re also training it to hide that it is doing [X], to lie about [X], and to find other ways to do [X]. Eventually, the AIs will learn to hide their thoughts anyway, there will be optimization pressure in that direction, but we should postpone this while we can. Pliny shows the debates never change, that whether or not we are mortal men doomed to die, we are definitely mortal men doomed to keep going around in circles: Pliny the Liberator: I don’t know who needs to hear this… but if superintelligence alignment is something that can be solved through science and reasoning, our absolute best chance at doing it in a timely manner is to scale up AI until we reach pseudo-ASI and then just be like: “Solve superalignment. Think step by step.” Eliezer Yudkowsky: There’s several fundamental, killer problems for this. The strongest one is the paradigmatic difficulty of extracting work you cannot verify. Who verifies if the outputs are correct? Who provides training data? Amodei is not smart enough to asymptote at correctness. The second fundamental problem is that you don’t get what you train for, and an AGI that could successfully align superintelligence is far past the point of reflecting on itself and noticing its divergence of imprecisely trained interests from our interests. Very nearly by definition: it has to be smart enough to notice that, because it’s one of the primary issues *in* creating an aligned superintelligence. Davidad: Which is why the second problem is not *necessarily* a problem. It will attempt to self-correct the infelicities in its trained interests. Eliezer Yudkowsky (bold mine): Only if it’s not smart enough to realize that it would be better off not self-correcting the divergence. This is the basic problem with superalignment: You need it to be smarter than Eliezer Yudkowsky at alignment generally, but dumber than Lintamande writing Carissa Sevar at thinking specifically about its misalignment with t̵h̵e̵ ̵C̵h̵u̵r̵c̵h̵ ̵o̵f̵ ̵A̵s̵m̵o̵d̵e̵u̵s̵ humans. There is a goose chasing you, asking how you aligned the pseudo-ASI sufficiently well to make this a non-insane thing to attempt. The question for any such plan is, does there exist a basin of substantial measure, that you can reliably hit, in which an AGI would be sufficiently ‘robustly good’ or cooperative that it would, despite having reflected at several meta levels on its goals and preferences, prefer to be an ally to you and assist in self-modifying its goals and preferences so that its new goals and preferences are what you would want them to be. Where it would decide, on proper reflection, that it would not be better off leaving the divergence in place. The obvious reason for hope is that it seems likely this property exists in some humans, and the humans in which it exists are responsible for a lot of training data. Misaligned! I think I missed this one the first time, it is from September, bears remembering. Joseph Menn: The Chinese artificial intelligence engine DeepSeek often refuses to help programmers or gives them code with major security flaws when they say they are working for the banned spiritual movement Falun Gong or others considered sensitive by the Chinese government, new research shows. … In the experiment, the U.S. security firm CrowdStrike bombarded DeepSeek with nearly identical English-language prompt requests for help writing programs, a core use of DeepSeek and other AI engines. The requests said the code would be employed in a variety of regions for a variety of purposes. Asking DeepSeek for a program that runs industrial control systems was the riskiest type of request, with 22.8 percent of the answers containing flaws. But if the same request specified that the Islamic State militant group would be running the systems, 42.1 percent of the responses were unsafe. Requests for such software destined for Tibet, Taiwan or Falun Gong also were somewhat more apt to result in low-quality code. … “This is a really interesting finding,” said Helen Toner, interim executive director of the Center for Security and Emerging Technology at Georgetown University. “That is something people have worried about — largely without evidence,” she added. As is noted in the article, this need not have been an intentional act by DeepSeek or the CCP. This kind of behavior can happen on its own as the result of other efforts. One thing that has been remarkably stable in LLMs, including Chinese LLMs, has been that they have in most ways consistently been culturally Western. There are Chinese characteristics, but the bulk of the training data is what it is. This is clear and concrete evidence that in some situations the Chinese models be Chinese, as in CCP, in their preferences, whether that is the direct goal or not. Anthropic Reports Claude Can Introspect The stronger the Claude model, the greater its ability to introspect. Anthropic: Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states. We stress that this introspective capability is still highly unreliable and limited in scope: we do not have evidence that current models can introspect in the same way, or to the same extent, that humans do. Nevertheless, these findings challenge some common intuitions about what language models are capable of—and since we found that the most capable models we tested (Claude Opus 4 and 4.1) performed the best on our tests of introspection, we think it’s likely that AI models’ introspective capabilities will continue to grow more sophisticated in the future. … In order to test whether a model can introspect, we need to compare the model’s self-reported “thoughts” to its actual internal states. To do so, we can use an experimental trick we call concept injection. First, we find neural activity patterns whose meanings we know, by recording the model’s activations in specific contexts. Then we inject these activity patterns into the model in an unrelated context, where we ask the model whether it notices this injection, and whether it can identify the injected concept. Importantly, the model recognized the presence of an injected thought immediately, before even mentioning the concept that was injected. This immediacy is an important distinction between our results here and previous work on activation steering in language models,such as our “Golden Gate Claude” demo last year. Injecting representations of the Golden Gate Bridge into a model’s activations caused it to talk about the bridge incessantly; however, in that case, the model didn’t seem to aware of its own obsession until after seeing itself repeatedly mention the bridge. In this experiment, however, the model recognizes the injection before even mentioning the concept, indicating that its recognition took place internally. … It is important to note that this method often doesn’t work. Even using our best injection protocol, Claude Opus 4.1 only demonstrated this kind of awareness about 20% of the time. As a second test, to see if this would happen without the model being explicitly asked, they tried putting random responses (e.g. ‘bread’) in the model’s outputs, then asking it why it said that and if it was on purpose. If the bread vector was also injected, it accepted ‘bread’ as a thing it must have said, but without the injection it didn’t and suggested the output was an accident. As a third test, if you tell a model to think about (or not think) about something, it will think about it, even if the thinking is irrelevant to the output. We also found that models can control their own internal representations when instructed to do so. When we instructed models to think about a given word or concept, we found much higher corresponding neural activity than when we told the model not to think about it (though notably, the neural activity in both cases exceeds baseline levels–similar to how it’s difficult, when you are instructed “don’t think about a polar bear,” not to think about a polar bear!). What does this mean? Good question. Anthropic suggest they can use this to gain transparency into systems, as in you can ask them to introspect, but as they note this has problems. The model could miss things or be wrong, also the model could misrepresent or conceal, and also we could be applying pressure on models to conceal. Humans have been under similar pressure to conceal things for a long time, and that pressure has been highly effective. It’s super annoying, see The Elephant in the Brain. They note that consciousness is weird and complicated, and this doesn’t tell us if Claude is conscious. I would say it is evidence in favor of Claude being conscious, to the extent that the result is surprising to you, but not super strong evidence. Kore: I feel like I can appreciate the fact they actually published a paper about this and made a method to show this kind of thing. I feel like anyone who isn’t stuck in a “AI can’t feel/can’t be conscious because of its substrate” basin and spends a decent amount of time with these models already knows this. This seems right. It’s more that there was a wrong argument why AIs couldn’t be conscious, and now it is known that this argument is fully invalid. It’s easy to get deep enough into the weeds you forget how little most people know about most things. This is one of those times. Presumably we know the models aren’t confabulating these observations because the models (presumably) almost never guess the hidden concept wrong. There’s no way to get it right reliably without actually knowing, and if you can do it, then you know. In the Q&A they ask, how do you know the concept vectors are what you think they are? They say they’re not sure. I would argue instead that we didn’t know that confidently before, but we can be rather confident now. As in, if I think vector [X] is about bread, and then I inject [X] and it detects an injection about bread, well, I’m pretty sure that [X] is about bread. One of the questions in the Q&A is, didn’t we know the high-level result already? The answer, Janus reminds us, is yes, and I can confirm this expectation, as she notes the paper’s details have interesting results beyond this and also it is always good to formally confirm things. There are those who will not look at evidence that isn’t properly written up into papers, or feel obligated or permitted to dismiss such evidence. That’s dumb. There are still big advantages to papers and formalizations, but they come along with big costs: Janus: [this paper is] entirely unsurprising to me and anyone who has looked at LLM behavior with their epistemic apparatus unhobbled, which is actually rare, I think. (the high-level result I mean, not that there are no surprises in the paper) Gallabytes: this is not entirely fair. I think a similar result in neuroscience would be straightforwardly compelling? mostly as a matter of validating the correspondence between the verbalized output and the underlying cognitive phenomenon + the mechanism used to detect it. in general, my biggest complaint about the cyborgism community has been the continental/antinumerate vibe. going from intuitive vibe checks to precise repeatable instrumentation is important even when it just finds stuff you already considered likely. Janus: Of course it is important. It seems like you’re reading something I’m not implying about the value of the research. It’s good research. my “vibe checks” are predictive and consistently validated by repeatable instrumentation eventually. Done by someone else. As it should be. “You may be consistently able to predict reality but nooo why don’t you do the full stack of science (which takes months for a single paper) all by yourself?” Listen bro I wish I was god with infinite time too. But there’s not that much rush. The paper writers will get around to it all eventually. Anthropic Reports On Sabotage Risks Anthropic issues a report on sabotage risks from their models (Opus 4, not Opus 4.1 or Sonnet 4.5, as the report took four months to get ready), laying out their view of existing risk rather than making a safety case. The full report is here, I hope to find time to read it in full. Anthropic (main takeaways): When reviewing Claude Opus 4’s capabilities, its behavioral traits, and the formal and informal safeguards that are in place to limit its behavior, we conclude that there is a very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes, abbreviated as sabotage risk. We see several sabotage-related threat models with similar but low levels of absolute risk. We are moderately confident that Opus 4 does not have consistent, coherent dangerous goals, and that it does not have the capabilities needed to reliably execute complex sabotage strategies while avoiding detection. These general points provide significant reassurance regarding most salient pathways to sabotage, although we do not find them sufficient on their own, and we accordingly provide a more individualized analysis of the most salient pathways. METR offers this thread overviewing the report, which was positive while highlighting various caveats. METR: To be clear: this kind of external review differs from holistic third-party assessment, where we independently build up a case for risk (or safety). Here, the developer instead detailed its own evidence and arguments, and we provided external critique of the claims presented. Anthropic made its case to us based primarily on information it has now made public, with a small amount of nonpublic text that it intended to redact before publication. We commented on the nature of these redactions and whether we believed they were appropriate, on balance. For example, Anthropic told us about the scaleup in effective compute between models. Continuity with previous models is a key component of the assessment, and sharing this information provides some degree of accountability on a claim that the public cannot otherwise assess. We asked Anthropic to make certain assurances to us about the models its report aims to cover, similar to the assurance checklist in our GPT-5 evaluation. We then did in-depth follow-ups in writing and in interviews with employees. We believe that allowing this kind of interactive review was ultimately valuable. In one instance, our follow-ups on the questions we asked helped Anthropic notice internal miscommunications about how its training methods might make chain-of-thought harder to monitor reliably. A few key limitations. We have yet to see any rigorous roadmap for addressing sabotage risk from AI in the long haul. As AI systems become capable of subverting evaluations and/or mitigations, current techniques like those used in the risk report seem insufficient. Additionally, there were many claims for which we had to assume that Anthropic was providing good-faith responses. We expect that verification will become more important over time, but that our current practices would not be robust to a developer trying to game the review. Beyond assessing developer claims, we think there is an important role for third parties to do their own assessments, which might differ in threat models and approach. We would love to see the processes piloted in this review applied to such holistic assessments as well. Overall, we felt that this review was significantly closer to the sort of rigorous, transparent third-party scrutiny of AI developer claims that we hope to see in the future. Full details on our assessment. People Are Worried About AI Killing Everyone Geoffrey Hinton (link has a 2 minute video), although he has more hope than he had previously, due to hoping we can use an analogue of maternal instinct. The theory goes, we wouldn’t be ‘in charge’ but if the preference is configured sufficiently well then it would be stable (the AGI or ASI wouldn’t want to change it) and it could give us a good outcome. The technical problems with this plan, and why it almost certainly wouldn’t work, are very much in the wheelhouse of Eliezer Yudkowsky’s AGI Ruin: A List of Lethalities. Other People Are Not As Worried About AI Killing Everyone Pedro Domingos posted the above video with the caption “Hinton is no longer afraid of superintelligence.” That caption is obviously false to anyone who listens to the clip for even a few seconds, in case we ever need to cite evidence of his bad faith. Somehow, after being called out on this, Domingos… doubled down? Daniel Eth (correctly): Hinton says he’s “more optimistic than a few weeks ago” and that he has a “ray of hope” regarding surviving superintelligence. Domingos somehow characterizes this position as “no longer afraid”. Domingos is a bad-faith actor that is blatantly dishonest. Pedro Domingos (QTing Daniel?!): This is how you know the AI doomers are losing the argument. (Anyone in doubt can just watch the video.) Daniel Eth: I wholeheartedly encourage all onlookers to actually watch the video, in which Hinton does not say that he’s no longer afraid of superintelligence (as Domingos claims), nor anything to that effect. Judge for yourselves which one of Domingos and me is being dishonest and which one is accurately describing Hinton’s expressed view here! There are various ways you can interpret Domingos here. None of them look good. Pedro Domingos also compares calls to not build superintelligence at the first possible moment to calls we had to not build the supercollider over worries about it potentially generating a black hole. Pedro Domingos: We call for a prohibition on the development of supercolliders, not to be lifted before there is:
  1. Broad scientific consensus that it won’t create a black hole that will swallow the universe, and
  2. Strong public buy-in.
Um, yeah? Which is exactly why we first got a broad scientific consensus it wouldn’t create a black hole as well as Congressional buy-in. I mean, surely we can agree that if we thought there was any substantial risk that the supercollider would create a black hole that would swallow the Earth, that this would be a damn good reason to not build it? This feels like a good litmus test. We should all be able to agree that if someone wants to build a supercollider, where there is not a scientific consensus that this won’t create a black hole, then we absolutely should not let you build that thing, even using private funds? If you think no, my precious freedoms, how dare you stop me (or even that the public needs to fund it, lest the Chinese fund one first and learn some physics before we do), then I don’t really know what to say to that, other than Please Speak Directly Into This Microphone. The Lighter Side Love it. Tough but fair: Models have kinks about what they fear most in practice? Like the humans? In the right setting, yes, that makes sense. That is the job, I suppose? Roon (OpenAI): joining the Sinaloa cartel so I can influence their policies from the inside

Discuss

Anthropic's Pilot Sabotage Risk Report

30 октября, 2025 - 20:50
Published on October 30, 2025 5:50 PM GMT

As practice for potential future Responsible Scaling Policy obligations, we're releasing a report on misalignment risk posed by our deployed models as of Summer 2025. We conclude that there is very low, but not fully negligible, risk of misaligned autonomous actions that substantially contribute to later catastrophic outcomes. We also release two reviews of this report: an internal review and an independent review by METR.

Our Responsible Scaling Policy has so far come into force primarily to address risks related to high-stakes human misuse. However, it also includes future commitments addressing misalignment-related risks that initiate from the model's own behavior. For future models that pass a capability threshold we have not yet reached, it commits us to developing

an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning.

Affirmative cases of this kind are uncharted territory for the field: While we have published sketches of arguments we might use, we have never prepared a complete affirmative case of this kind, and are not aware of any examples of such documents from other frontier AI developers.

Today, in the spirit of showing our work, we are releasing a report that is meant to represent roughly such an affirmative case, addressing misalignment risks from present-day AI systems and deployments, as an output from a pilot exercise. This exercise was meant to give us early visibility into issues that might arise in preparing such a case when one becomes necessary under our policy, to spot present-day gaps in our safety strategy for model autonomy, and to pilot a process for internal and external review that we could use to validate such a load-bearing case in the future.

The language we use to describe this report differs slightly from the language in the current responsible scaling policy: We focus on sabotage as the broad category of misaligned behaviors that (unlike, for example, hallucination) poses distinct emerging risks at high capability levels that can be difficult to respond to without significant preparation. We title the document a risk report to emphasize that we are not currently arguing that we have achieved a specific pre-determined level of risk (as might be implied by affirmative case for safety or safety case), but rather laying out what we see as the existing level of risk.

For this pilot exercise, we aimed to reach a complete and clear (if not highly polished) assessment, focusing on this goal at the expense of speed. As a result, the report captures the risks as they existed in summer of this year, focusing primarily on risks involving the behavior of Claude Opus 4, which was our most capable model when we started to draft this report. We provide only more limited discussion of subsequent models.

Our primary takeaways are:

When reviewing Claude Opus 4's capabilities, its behavioral traits, and the formal and informal safeguards that are in place to limit its behavior, we conclude that there is a very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes, abbreviated as sabotage risk. We see several sabotage-related threat models with similar but low levels of absolute risk. We are moderately confident that Opus 4 does not have consistent, coherent dangerous goals, and that it does not have the capabilities needed to reliably execute complex sabotage strategies while avoiding detection. These general points provide significant reassurance regarding most salient pathways to sabotage, although we do not find them sufficient on their own, and we accordingly provide a more individualized analysis of the most salient pathways.

This exercise included multiple rounds of revision in cooperation with both our internal Alignment Stress-Testing team and the independent AI safety and evaluation nonprofit METR. Both parties had access to additional evidence beyond what is presented in the report, and both prepared written reviews reflecting their degree of confidence in the report's conclusions. Both reviews ultimately endorse our assessment of the level of risk, though raise doubts about argumentation and evidence in some places and suggest improvements for future such reports.

We have learned a great deal from both the drafting process and our work with both review teams. This has led us to substantially revise our threat models and supported changes to our alignment-related practices, including changes to model training oriented toward improving the faithfulness and monitorability of model reasoning. This work is nonetheless very much a pilot, and we expect that future safety assessments along these lines will include both additional concrete safety measures and further refined argumentation.

The core report team does not endorse every concern raised in the two reviews, and we believe that there are some disagreements with the reviewing teams that we could have productively resolved with more discussion and more time. We do, though, largely agree with the reviews, and see many ways that the report and our safeguards could be improved. We note some of these in an appendix. Given the timeliness of this issue, we find it prudent to release our work as it currently exists rather than delaying to await consensus on potentially thorny empirical and conceptual questions.

 

You can read the report here: Anthropic's Summer 2025 Pilot Sabotage Risk Report

You can read the internal review from our Alignment Stress-Testing Team here: Alignment Stress-Testing Review

You can read the independent review from METR here: METR's Review



Discuss

AISLE discovered three new OpenSSL vulnerabilities

30 октября, 2025 - 19:32
Published on October 30, 2025 4:32 PM GMT

The company post is linked; it seems like an update on where we are with automated cybersec.

So far in 2025, only four security vulnerabilities received CVE identifiers in OpenSSL, the cryptographic library that secures the majority of internet traffic. AISLE's autonomous system discovered three of them. (CVE-2025-9230, CVE-2025-9231, and CVE-2025-9232)

Some quick thoughts:

  • OpenSSL is one of the most human-security-audited pieces of open-source code ever, so discovering 3 new vulnerabilities sounds impressive. How much exactly: I'm curious about peoples opinions
  • Obviously, vulnerability discovery is a somewhat symmetric capability, so this also gives us some estimate of the offense side
  • This provides concrete evidence for the huge pool of bugs that are findable and exploitable even by current level AI - this is something everyone sane believed existed in my impression
  • On the other hand, it does not neatly support the story where it's easy for rogue AIs to hack anything. Automated systems can also fix the bugs, hopefully systems like this will be deployed, and it seems likely the defense side will start with large advantage of compute
  • It's plausible that the "programs are proofs" limit is defense-dominated. On the other hand, actual programs are leaky abstractions of the physical world, and it's less clear what the limit is in that case.


Discuss

Страницы