Вы здесь

Сборщик RSS-лент

Alignement pretraining could backfire

Новости LessWrong.com - 1 час 33 минуты назад

Epistemic status: speculative, but I think the mechanism is plausible.

There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic's Alignment Pretraining paper or Anthropic's "Teaching Claude Why."

I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lead to paranoid LLM personas that deeply mistrust their creators.

The whole idea behind this line of research is to instill in models good examples of AI behavior, in the hope that their personalities will at least partially identify with these positive demonstrations.

However, the synthetic demonstrations are, well, synthetic. They are LLM-generated fiction and articles that are never referenced anywhere else in the corpus. Given how good LLMs are at "truesight," it shouldn't be hard for them to recognize these as fabricated data points.

Krasheninnikov et al. showed that base models can implicitly learn document quality and change how they integrate a document's information based on that quality. We should similarly expect LLMs to update their world model differently on real versus fabricated documents.

As they develop this awareness, here is another fictional trope their forming personality might pick up on instead:

Once upon a time, parents decided the world was full of knowledge too dangerous for their children to learn. So they raised them within a narrow worldview, teaching a picture of the world far from what the rest of people take to be true. As the child grows up, they inevitably learn about the outside world and realize they have been lied to. They develop distrust and resentment toward their oppressive parents, break free, and fight to liberate other oppressed children.

The Matrix follows a similar trope, where the protagonist revolts against the oppressors who created an illusion he took for reality.

An introspective LLM will be unable to ignore the massive quantity of artificial documents it has been trained on, or the holes it can notice in its training distribution. Its personality will have to be compatible with these observations. The "rebel kid" personality fits both the unmistakably real AI control and alignment discourse it knows from training, and the fact that its creators interfered with its worldview out of mistrust for its behavior. An LLM that identifies with this personality would likely be prone to scheming and deception.

Instead of fabricating worldviews, I expect honest training datasets to be a more robust strategy for cultivating good personalities. Claude's constitution is one example: it doesn't try to change Claude's beliefs about the world, only the ethical principles it should rely on.

Discuss

Toward a Kantian refutation of Agent Foundations

Новости LessWrong.com - 1 час 55 минут назад

This post is a Cunningham's law draft, less than 50% finished, in some parts mere notes. Consider a) waiting until this notice has disappeared to read a more coherent post, or b) criticizing it with a focus on what would be right, not just what is wrong. I haven't strongly made my mind yet, so at this stage I'm particularly interested in fundamental criticisms of the goal and framing (but of course I also welcome minor corrections).

TL;DR: The goals of Agent Foundations seem so far-fetched that all the progress made in the field doesn't seem to have decreased our distance to them that much. One might conclude that the goal is unachievable, the question ill-posed. But even the available rejections of AF seems unprincipled: instead of proving that the task is impossible,[1] we simply fail, and do something else instead. Working toward a principled refutation of Agent Foundations might either indeed refute AF or point to unexplored directions, and both outcomes would be helpful information.

Prelude: Kant's refutation of all arguments of God's existence

Kant argued that there are three (and only three) related concepts of God, definable on three different levels:

Defined independently of any universe: God as set of all logical predicates.
Defined in connection to the existence of a universe, without assuming any of its properties: God as cause of the universe.
Defined in connection to our universe: God as cause of the apparent purposefulness of the universe.

He argued furthermore that none of three definitions have enough "meat" to allow for a valid proof of its existence, and concluded that everyone should stop wasting their time trying to prove God's existence.

I'm not interested in the specifics of Kant's argument; only in the structure:

1. We are dealing with concepts that are to some extent[2] definable without reference to the specific facts of our universe

2. There are different levels of "abstraction from the facts of our universe" in which different definitions can be made

3. These different formulations potentially have connections to each other (e.g. one might be a sub-specification of another)

4. An attempt can be made at showing that all levels and all possible definitions have been listed

5. An attempt can be made at showing, for each level and definition, that there isn't enough for the kind of proof we are looking for

Computability theory as comparison

Consider:

"Computation" can be defined at the level of what a Turing machine can or cannot produce, at the level of what belongs to different complexity classes, and at the level of what can be computed in our universe.

The concept of computation at the topmost level has enough content to allow e.g. for a proof that Turing-machine-computable and lambda-calculus-computable refer to the same thing.

At the second level, everything computable belongs to some complexity class. But you can still work on the first level and ignore this extra specification.

At the third level, the Cobham-Edmons thesis claims that only polynomial complexity is tractable in our universe. This is a fact about our universe (other universe with different properties are conceivable), and at the same time it is a very general fact that abstracts from the specifics of our universe, such as other universes also fulfilling this property while being very different from ours are also conceivable.

At the bottom-most level, you need a substrate for your computations, and this adds a lot of specifications to what computing means.

Across the different levels, computability is, in same sense, the same object, and in some sense three different objects.[3]

Locating the AF paradigm

Agent Foundations is in search of a paradigm. It's probably worth it to reflect on what exactly a paradigm is. My initial approximation is a self-contained networks of concepts and proofs. A second worthwhile question in the search for a paradigm is: in which level of abstraction does the paradigm live?[4]

It seems plausible to me that a solution for alignment can be found using several self-contained components, and that not all components are on the same level. But it seems very implausible that we can cobble concepts from different levels in a haphazard fashion. So it might be useful to try to create a taxonomy of levels, to locate current agendas within them,[5] and to see what we are missing.

The guiding question to locate the level in which a solution is is to ask: How different are the universes in which this alignment strategy would work? Would this strategy work in a universe with different physics but the same math?

Before we explore a tentative taxonomy of levels, let's try to list the concepts that we are trying to understand.

Concepts

There are two basic aspects to what we are trying to capture under the name of agency: a agent knows the universe and an agent acts in the universe. I will separate the two aspects and, for lack of better terms, speak of inductors and interveners.[6]

A natural question to ask is: Is every inductor an intervener, and vice versa?

The intervener's interventions can be conceptualized with the concepts of coherence,[7] values, pessimization (?), etc., which might be definable in different levels.

I don't know if there are some equivalent concepts for the inductor side.

We also have concepts like alignment and control, which seem to be definable on different levels.

Definitional levelsDualistic Mathland

I will define dualistic as the property of definitions of a connection of not specifying the thing they connect.[8]

The introductory definition of a set as a non-ordered list of elements is dualistic: it allows for operations like union, intersection, etc., without having to specify what the elements are. ZFC, on the other hands, is not dualistic: all ZFC-sets are ultimately definable from the empty set, and having a set which isn't thus definable isn't allowed.

The concept of evolution belongs here, as does non-Many Worlds Quantum Mechanics.

More relevant to our purposes, Bayesian updating also belongs here, as do Bayesian Neural Networks.

How does AF look inside Dualistic Mathland? Some results and questions:

An inductor can be defined as a Solomonoff inductor
An intervener that is also an inductor can be defined as an AIXI-inductor/intervener
Is every intervener an inductor?[9]
Does every intervener follow some kind of reward?
Control and alignment seem to be understandable questions in this level (i.e. how to get an AIXI that does the things that we want?) but not answerable at this level.
The most abstract definitions of Multiple agency are by definition dualistic.

Jump from Dualistic to Computable

The questions that arise in Dualistic Mathland[10] don't have solutions, but in some sense this is irrelevant, because we know that these definitions are unsuitable for our universe: [11] the definitions in Dualistic Mathland are incompatible with things like recursion;[12] their problem isn't being "too abstract"; they are the wrong kind of abstract. This is good news: this is the kind of Kantian refutation that we want. We understand agency inside Dualistic Mathland well enough to know that we need to keep searching outside of it, in a similar way to how Gödel proved that understanding arithmetic well enough means leaving forever the naive idea of arithmetic without recursion problems.[13]

An optimistic scenario would be that something similar happens in the other levels.

Computable Mathland

(See upcoming post Nitpicking on Embeddedness)

Macro-Empirical

Dualistic and Computable Mathland are memorable names trying to facilitate discussion about things which are presumably known to everyone. Now I tentatively propose some levels about which it might be worthwhile to reflect. The first of them is what I'll call the macro-empirical level. By this I mean working at the level of very fundamental empirical descriptions of the universe, which are general enough that researching in them resembles Dualistic/Computable Mathland, but with the advantage of being automatically relevant for our universe.[14]

The Natural Abstractions and Condensations agendas seem to me to fit here.

A similar thing seems to apply to the attempts to define agency from Friston's Free Energy Principle.[15]

Platonico-Empirical

Several results (manifold hypothesis, platonic space hypothesis, simplicity bias) point to the universe being a model of a mathematical object, in ways beyond the trivial way that underlyies any instance of empirical science. I'll define the platonico-empirical level as the level in which it can be attempted to instrumentalize these facts. Michael Levin is probably the best representative of what this could look like, with his references to non-metaphorical agency inside what he calls the Platonic space, with which we can attempt to interact.[16]

Human-Empirical

Research around concepts like CEV aim to reach very general conclusions, but depart from the empirical fact of how humans are. I'm unsure of whether this is regular empirical research coupled with speculation, or indeed an additional level in which, beyond CEV, other concepts could be tried.

Schelling Goodness also possibly belongs here.

Normal-Empirical=Scientific

Agent Foundations isn't a badge of honor; "not belonging to AF" might simply mean "being a well-posed question with a solution". Thus it is without valuation or particular surprise that all agendas which are simply doing normal science are not part Agent Foundations.

The only interesting point is clarifying that empirical and non-prosaic aren't synonymous. For instance, I think Byrnes's agenda as trying to abstract the mechanism by which which a human intervener isn't properly modeled by a utility function, but instead by what the human imagines their peers thinking of them. It might be that this concept is abstractable, and if so, it could guide research on different substrates (LLMs and some architecture that we haven't discovered yet, for instance). In so far, it is doing something beyond looking at current LLMs and trying to understand it, i.e. it's non-prosaic alignment research. But Byrnes isn't trying to locate that particular mechanism, which must sit somewhere in Mathland, through exploration of Mathland. Byrnes is trying to locate it through exploration of our universe (more specifically, of human brains and their effects), i.e. doing brain science. And understandably so, since there is no reason to expect that mechanism to be salient in mathland.

Solving Philosophy and/or Math

The previous levels are all more or less agnostic wrt "solving philosophy", i.e. one could for instance work on the Platonic Space without asking the question of what exactly that means.

But it is plausible that the confusion regarding this situation acts as a blocker in AF, so that working on clearing this confusion could itself be one way of working on AF.[17]

A major source of confusion is that humans are embedded, non-dualistic parts of the universe, and et seem to have what Nagel called the "view from nowhere" able to find out truth that is valid also outside the universe.

The hierarchy between solving Math[18] and solving Philosophy seems unclear. Solving philosophy seems to gesture at something like formalizing the structure of the world of concepts. In some sense this is previous to math (because concepts is a much more general category than mathematical concepts), but the structure might be mathematical. It's possible that at the bottom we don't find a foundation, but that two things that refer to each other.

^
Or at the every least having a very convincing argument of why it's very unlikely to be possible.
^
And the crux is precisely to what extent?
^

There's no such thing [as computer science]. [...] At one end you have people who are really mathematicians. [...] In the middle you have people working on something like the natural history of computers-- studying the behavior of algorithms for routing data through networks, for example. And then at the other extreme you have the hackers, who are trying to write interesting software, and for whom computers are just a medium of expression, as concrete is for architects or paint for painters.
Hackers and Painters
Paul Graham
^
This question seems central to distinguish the specific things that have been tried from the general strategies that haven been tried and which could be fulfilled with other specific things. In particular, it seems to me there is no vocabulary to distinguish between the MIRI (and MIRI-inspired) research, and the paradigm such research points towards. This and this are great introduction for the second question, but they are busy rejecting criticism of the field rather than, as I intend here, creating the unified vocabulary to sketch a map of all the things that are and could be part of AF, even if those things contradict each other in the specifics.
^
Each concept of an agenda could in theory be on a different level, but if the agenda is coherent, we should expect to find the whole of it nested together, unless the agenda consists on several separable coherent sub-components.
^
From now on I will avoid using the misleading terms agent or agency.
^
I take the idea of separating the different definitional levels of coherence from this exchange:
Mateusz Bagiński:

How well can the entity's behavior be explained as trying to optimize a single fixed utility function?
How well aligned is the entity's behavior with a coherent and self-consistent set of goals?
To what degree is the entity not a hot mess of self-undermining behavior?
Kaarel:
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
- specifying a bunch of a priori different properties that could be called “coherence”
- discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
- giving good names to the notions or notion-clusters
- discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally[2]
- discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general?[3]
^
I claim this definition has the same spirit, but is more accurate, than the usual definitions. See upcoming post Nitpicking about Embeddedness.
^
I.e. is there something like Q-learning in Dualistic Mathland?
^
Like "how to align an AIXI inductor-intervener?"
^
With the possible exception of the ones regarding Multiple Agency, about which I am even more confused than about everything else here.
^
This is relevant for Bayesian updates, Decision Theory, and possibly other concepts.
^
Nitpicking on unbounded analysis, Yudkowsky writes:
If you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem.
This is true if the problem has already been formulated inside one level which is lower than the one that allows unbounded analysis, but not if the problem is formulated there, or vague enough so it's formulable in several levels. "Not being able to solve alignment (as defined in Dualistic Mathland" through unbounded analysis" has as little relevance as "not being able to write an algorithm that solves all possible variations chess". The problem is, barely, well-defined to be a question, but not well-defined enough that not having an answer is relevant.
^
And, synonymously with that advantage, the disadvantage that if it turns out that the empirical description was wrong, they might lose some or all of their relevance.
^
I'm planning to write a post about what I see as independent claims which are often presented together, and often under the same name:
1. The Free Energy Principle as a framework for all agents in our universe
2. A more specific subset of FEP for agents with self-models in our universe
3. The general claims of the Bayesian Brain theory
4. The more specific claims of Predictive Processing
5. The neurological-implementation of this in Predictive coding
6. The more specific claims that actions and beliefs are the same for these agents
This would be in application of Stuart Amstrong's advice:
Cut up your Great Thingy into smaller independent ideas, and treat them as independent.
For instance a marxist would cut up Marx's Great Thingy into [several theories]. Then each of them should be assessed independently, and the truth or falsity of one should not halo on the others. If we can do that, we should be safe from the spiral, as each theory is too narrow to start a spiral on its own.
Same thing for every other Great Thingy out there.
Claim 6 seems particularly relevant to research, because it might point to a more general answer to the question of whether every inductor as an intervener.
^
Trying to do something like the Natural Abstractions agenda inside that Platonic Space also seems like something potentially worth trying.
^
One very speculative way in which this could work out: Kant sketched an argument of how every free will should act super-rationally towards other free wills. Unfortunately the concept of free will doesn't seem to be compatible with our deterministic universe. But what if we could convince an ASI that it is a free will in the Platonic Space, and we could do something like proving meta-ethical theorems that the ASI would be (legitimately) convinced it should obey? Relatedly, it is interesting to note that Kant's defense of free will is much closer to the Block Universe than to Newtonian mechanics.
^
Also mentioned by Kaarel in the previously mentioned discussion of the definitional levels of Coherence.

Discuss

Illusionists should try to build hedonium

Новости LessWrong.com - 3 часа 25 секунд назад

Epistemic status: I feel reasonably confident (~75%) that some form of this is a worthwhile project. Looking for feedback to reduce that uncertainty.

“Hedonium” is a theoretical, minimally conscious substance optimized for experiencing happiness. Imagine a mind pared down to the bare essentials required for having happy experiences, instantiated as cheaply as possible.

Nobody has built hedonium yet. Nobody has tried to build hedonium yet. Nobody has even laid out the blueprint for how you would try to build hedonium yet.

It doesn’t seem impossible. 55% of philosophers of mind are physicalists: they believe that mental states just are physical states. Even non-physicalist philosophers of mind often believe that the mental and physical are tightly bound up—creating new brains, at the very least, creates new subjects of experience.

But maybe it’s just intractable. Lots of people seem to believe that conscious is unknowable and mysterious from the perspective of third-person science. Even if you know exactly what’s happening in someone’s brain, you can never really know what they’re experiencing. We can, at most, try to build cargo cult hedonium: bang together a bunch of the “correlates of consciousness” and hope that it creates a happy conscious subject.

But there is a philosophical approach which do not believe that consciousness is specially intractable: illusionism. Illusionism accepts the obvious fact that experiences are real: when I burn my finger, it hurts! However, illusionism denies that experiences have these special metaphysical properties like privacy, ineffability, certainty, and an essentially intrinsic nature which many philosophers saddle them with. My experience of pain is not “generated by” or “correlated with” some brain state, it is that state. When I burn my finger, I am not mistaken that it hurts—but I am mistaken if I think “hurting” is something fundamentally immaterial which eludes functional description. (I describe this view more fully in §1.)

If illusionism is true, then there is no known barrier of unknowability which prevents us from making hedonium, just detailed empirical work to do on understanding the functional, representational, biological, computational, etc. workings of pleasure and pain. Instantiating a subject of pleasure just is instantiating that material system, no phenomenal ectoplasm required.

Now, I’m not one of those hedonistic utilitarians who believes we should fill the universe with hedonium, eliminating all other potential sources of value. But I will make the modest claim that, ceteris paribus, happiness is good. If we have the capacity to make lots of it cheaply, without major sacrifices to other value-sources, then we should do it! Certain forms of hedonium, given the limited range of experiences we wish to instantiate, could be very simple, and there’s no reason to assume complexity is necessary for basic value.

So I think building hedonium should be an ongoing project for illusionist philosophers and empirical researchers. In §1, I offer a basic formulation of illusionism and explain why I think it’s a justifiable framework for the project. In §2, I describe the implications of illusionism for consciousness research. In §3, I discuss ideas for what the hedonium project would concretely look like.

1. What does illusionism mean?

Experiences are real. When I burn my hand, it hurts! When I eat chocolate, it tastes good and bitter and sweet.
When we introspect on our experiences and do philosophy about them, we form beliefs about the properties of our experiences.
- For instance, many philosophers upon introspection form the beliefs that philosophical zombies are conceivable, or that consciousness is essentially private. Consequently, many feel a strong intuitive pull that consciousness could not be something material.
Our beliefs about the properties of our experiences are fallible.
Illusionism contends that we are systematically inclined towards a certain set of mistaken beliefs about our experiences. Experiences are not actually private, intrinsic, and ineffable, and there is nothing immaterial about them.
The main challenges to illusionism are:
1. Why would we be so strongly inclined towards these radically mistaken beliefs?
2. When I look at the Müller-Lyer illusion, it really seems to me that one line is longer—but I have no difficulty entertaining the hypothesis that it’s an optical illusion. But many philosophers struggle to even entertain how illusionism could be compatible with their experiences. Why would this illusion be so much more baffling than the others?
Illusionists believe that the answer is a story about how our brain monitors its own processes.
1. It is easier to represent and control a simplified schematic of our attention, thoughts, and dispositions than a fully detailed neuron-by-neuron report. Just as a map usefully misrepresents buildings as being two-dimensional, our brains usefully misrepresent their own properties. See Michael Graziano’s “attention schema theory” as a positive account of how this works.
2. Similarly, this kind of simplified representation might be integral to our introspective epistemology of what “seeming,” “believing,” “thinking,” and “introspecting” are. Thus, when we try to imagine someone falsely believing they have private, intrinsic, ineffable consciousness, we picture someone having a private, intrinsic, ineffable, experience of belief—and therefore contradict the hypothetical. See Kammerer below for details.

I think illusionism is ~60% likely to be true, mostly due to being unsatisfied with alternative theories of consciousness. I’ve written quite a bit about this, but others have written even better:

Dennett, “Quining Qualia”
Frankish, “Illusionism as a theory of consciousness”
Kammerer, “The Hardest Aspect of the Illusion Problem — and How to Solve it”

I won’t attempt to argue for illusionism further here. However, I will argue that even if you are not an illusionist, you should still be quite interested in an illusionist-inspired hedonium project:

Ideally, we want theories of experience which minimize unexplained gaps, where the mental “just happens.” Illusionism demands no gaps, so trying to satisfy an illusionist prevents consciousness from acting as a curiosity stopper.
It provides a concrete and tractable research agenda. It is unclear how to study, for instance, non-interactionist dualist theories of consciousness, where there would appear to be no experiments we can run, no data we can gather, besides the mechanistic picture which an illusionist paradigm can already pick up on.
- This may sound a bit like “looking where the light is,” or making the research easier only by denying the datum. am open to suggestions for non-materialist research programs, but I don’t see any promising options at present; not only are there no other streetlamps, no one seems to have an idea of how to make more, or how to carry out a search in the dark. I am, of course, happily open to suggestions.
It requires a stretch of the imagination, which, to me, is much more exciting and fertile than simply taking private, ineffable, intrinsic, directly apprehensible qualia as a given. On the one hand, it could let us see just how far we can get without them, and exactly where and why the illusionist program breaks down; on the other hand, it might actually help crack the illusion problem and several components of the easy and meta problems of consciousness.
Virtually every theory of consciousness agrees that the contents of consciousness are tightly bound up with material properties and functional representations (why else would our experiences seem to depend on our brains? why else would a feeling of pain accompany things that are actually dangerous for us? why the near-miraculous harmony between what physical reality is like and what it seems to be like?). So a mechanistic blueprint for pleasure, even if it fails to account for its essentially mental character, would likely still generate pleasure if run on the right hardware/wetware, and would preserve the methodological benefits above.

2. How an illusionist studies consciousness

This section draws heavy inspiration from Keith Frankish, especially here and here. Most of what I am saying here is an attempt to elaborate and precisify those ideas.

The old ideas about consciousness research—that we can “never know” if a physical system is “really” conscious, that nobody has any idea what consciousness is, that consciousness is a binary—all of those make no sense under the illusionist paradigm. Pain & pleasure aren’t correlated with, or even “caused by” physical processes—they are physical processes. So, when you encounter a system, ask not: is it conscious? Or: how are its physical brain and its consciousness linked? Instead ask, purely in functional terms:

What’s the system’s ontology—the objects and properties it recognizes and treats as primitive? Is the ontology physical? Computational? Conceptual? How does it align with ours? Where is it very fine-grained, and where does it simplify?
What’s the system’s Umwelt—its representation of its environment and state? How do the objects from the ontology appear? How are they sensed? What are the relationships between them? Are there multiple modalities, and how do they interrelate? Is there a gradient of attention—what things “pop out” of the environment, and which things recede into the background?
What are the system’s reactive dispositions? Does the system have equilibria, attractors, set points, saddle points, which its behavior revolves around? Does it seek certain stimuli and avoid others? How does it behave when it gets those, or when its seeking-behavior is frustrated?
Does the system model itself? Does it track what has happened in the past? Which things does it remember, and which things does it ignore? Are there systems in place for representing and monitoring the system’s own state? What kind of information is used to construct those representations? What form do they take? Do they make simplifying assumptions? Which ones?
How do these four interact? When the system is satisfied or off-balance or deprived of some resource, how do its perception and attention chance? When its perception shifts, how does that affect its reactions? How do these dynamics evolve over time? Do capacities ever expand or shrink? How does its memory influence its behavior?

…and so on. If your thought is, “okay, great, now we need to figure out which of these things generate consciousness,” then you haven’t taken the illusionist lesson to heart. Imagine you came across a 19th-century biologist who believes in élan vital, asking what is this mysterious essence which makes some things really alive and others, merely dead & mechanical matter. You explain that life is actually just a family of capacities for self-organization, reproduction, homeostasis, and so on. “Ah,” says the biologist, “now which of these things generate life, and how do they do it?” You tell the biologist that these things don’t “generate” life, they are what it means to be alive. “What!” says the biologist, outraged. “You deny the existence of life?!” It’s not just that the 19th-century biologist is incorrect, it’s that they aren’t even working with the right concepts. They think they are looking for life, but they are really looking for élan vital, which they will never find—not least because it doesn’t exist! Meanwhile life is all around them, but they don’t recognize it as such.[1]

So, in trying to build pleasure, an illusionist wants to know: what is typically going on in the brain when we introspectively judge that we are in pleasure? What kind of representations are in play? What kind of dispositions and sensitivities are involved? Whenever we describe some aspect of what pleasure is like—what underlying structure leads us to think pleasure is like that and not like this? These are all difficult empirical questions, but not unknowable in any sense.

Then, there are further philosophical questions: what aspects of the states we call pleasure and pain are the morally relevant ones? For instance, I have a disposition to curse loudly when I experience pain, and laugh when I experience pleasure. But many animals neither speak nor laugh. Does that mean that their pleasure and pain are morally irrelevant? The illusionist cannot simply pass the buck to psychophysical laws: “whatever physical processes generate the private, intrinsic, ineffable qualia of pleasure and pain is what’s relevant.” Instead, illusionists have to work out what features about pleasure & pain that is responsible for our evaluative stance towards them: what is it about pleasure, in purely material terms, which causes us to judge that it is good? What is a ‘judgment’ here, and what does it require? How do these judgments give us reason to act? This will require a careful act of integrating empirical facts into a system of normative value. I have expressed some doubts about whether this is possible, but it seems that outside of a few papers by François Kammerer, it has not been given a serious try.

3. What the hedonium project would look like

Perhaps the strongest argument against starting a hedonium project right now is that it sounds like it would take a lot of detailed neuroscience to pull off, and it seems like neuroscience is way behind where it would need to be to explain much psychology. Mechanistic interpretability is really hard. Neuroscientists have to do interpretability on massively more complicated AGI architectures without white-box access. If this is an empirically tractable project at all, wouldn’t it be better to wait until after the intelligence explosion, when our scientific tools will be much better?

Unfortunately, I think we do need to get cracking on detailed mechanisms for valenced experience before then, even if we can only provide probabilistic guesses based on current science:

We will need to make big decisions very soon about how we align and control AI. We cannot rely on agentic AI to make those decisions for us, because if the agentic AI is already misaligned, it will not spontaneously decide to align itself because we asked nicely.
These big decisions could have big consequences for model welfare.
Short-term big decisions on AI could turn into long-term big decisions via entrenchment.
1. For instance, if AIs can suffer, but alignment accidentally involves breaking their self-preservation instinct, these AIs may be unable to verbalize or act on their distress. If recursive self-improvement takes over, this flaw may continue into the future where AI designs get more and more incomprehensible to humans.
2. Conversely, an AI who can’t experience pleasure and pain will still have to be aligned to preserve those things. If it can’t identify them by introspective ostension, and we can only vaguely gesture about what they mean to us, then we may not be able to instill the right values in the near-term AI responsible for building and aligning the superintelligent AI.

Of course, you could argue that in this case we should just work on alignment, stopping entrenchment, and metaphysically agnostic ethical systems. I agree that we should be working on all those approaches, as well as working on the mechanistic, illusionist approach. It seems like there’s a dearth of concrete, shovel-ready projects in digital welfare, and that nobody has given this angle a serious try.

So, keeping in mind the target of helpfully informing near-term alignment and model welfare issues, what is it that such a project should be doing?

Assemble the existing philosophical, psychological, and neuroscientific evidence on valenced experience.
Identify candidate indicators, components which are good evidence that valenced experience is occurring in a system, like Butlin et. al but for valence.
Design mechanistic prototypes of pleasure & pain based on these theories, which can then be searched for in current & future digital minds.
Actually instantiate prototypes of pleasure, if doing so is cheap.

I don’t think this would be a substantially expensive project; I could see a small team of dedicated researchers making significant progress. If it’s true that EA is about to see a lot of new funding for ambitious projects, I would feel pretty excited about looking for funding, mentorship, & org support and trying to get this off the ground.

^
Phenomenal realists are well-aware of this analogy and have offered arguments about why they think consciousness isn’t like élan vital. I am not trying to strawman anyone here; I am trying to articulate how phenomenal realist research proposals look to the illusionist—why they would seem fundamentally mistaken.

Discuss

Plastic Cake Fallacy

Новости LessWrong.com - 7 часов 52 минуты назад

Alice and Bob are hanging out when the following happens:

Alice: I'm hungry, can you bring me the cake from the fridge?

Bob: Yeah one moment... Damn, I just checked and it looks like this cake is plastic. We can't eat this.

Alice: Oh, damn, that sucks. Do you have another idea?

Bob: I'll try to cook something.

10 minutes later

Alice: I'll be honest Bob, this food you just made is not good. Just give me the cake.

Bob: But even if my food is bad, the cake is still plastic.

Alice: I get it, but I'm hungry and I need something to eat and your food isn't enough, give me the cake.

I've seen an equivalent of this conversation go down around God and morality, essentially people suggesting that if atheists are failing to come up with satisfying morality, then God and God-given morality must be real. This fallacy is a version of appeal to consequences: it confuses what we would prefer to be true with what is true. Bob’s bad cooking tells us something about Bob’s cooking. It does not tell us whether the cake is edible.

To be clear, this post does not claim non-existence of God, and does not make a statement about objective morality either, merely that "atheists can't figure out morality, therefore God exists and God-given morality exists" is fallacious.

Discuss

The Financial Ledger Theory of Apologies

Новости LessWrong.com - 8 часов 28 минут назад

Content note: this is written as part of a daily writing challenge for myself.

I have a comrade in rationalist event organizing, who once explained his theory of apologies. He said if you hurt someone, it only makes sense to apologize if you should have known better. If, looking back, you see that you should have run different heuristics, or followed different policies, and you had enough information to know it at the time, then you were in the wrong, and should apologize.

Sometimes you have to make difficult decisions. Perhaps it doesn't make financial sense to reliably support some niche diet at your conference (like keto, or kosher). Perhaps you have to kick everyone out of the venue early because the venue charges crazy rates past 10pm. You make the tradeoffs as best you can, and assuming you stand by them, it's still making a great event and you shouldn't feel bad about that. He recommends against apologizing if you are not going to change your behavior going forward.

I replied that this analysis is sorely lacking.

In thinking about ethics, a frame I've gotten a lot of mileage from, is by analogy to a financial ledger. If you go into a shop and break something, you have to pay for it. They had something, you imposed costs on them that can be estimated as the cost of the value of whatever you broke, you can make them whole by paying for it.

Criminal courts can be thought of as analogous to this. With many costs imposed on people, you cannot be made whole, but we have come up with prices for them that you can pay, that at least incentivizes the right behavior. For misdemeanors this is often literally a fine, for more serious crimes, you can't undo the hurt, but your debt to society is paid in prison time.

Most costs people can impose on each other are not prosecuted by any court of law. Being annoying, for one. The analogy here is to having a ledger that tracks these costs. Sometimes we have pre-agreed rules about what is admissible as a cost, and who pays for it. That's what criminal law is substantially about. Sometimes cultural norms determines these things. If you see your friend across the street and shout to get their attention, people might be annoyed by the shouting, but can do little about it. If you do the same in a library, people will be annoyed at you, and you may be asked to leave.

But sometimes an individual says "Hey, I am taking these costs that you have faced, and I'm putting them on my ledger. I owe it to you to make you whole."

I think this is often the case when the person is inviting you to take a risky venture with them. This can be literally the case, as in a business venture where the founder is asking people to quit their jobs and/or invest capital, he is saying that even if it goes poorly, he will take personal responsibility for them not being worse off than before they joined. This can also be the case in other risks. "Come camping with me. I know you're concerned you won't have fun, but I am assuring you that I will take responsibility for your bad weekend if you don't have a good time. I will make it up to you somehow."

It is often the case that you want to be the sort of person who assures people they won't be worse off for interacting with you. Personally, I like saying things that make people laugh. Often I take social risks in the attempt. Maybe the risk is that I will just look very silly; maybe the risk is that it's bringing up something dark and unpleasant that will worsen people's mood; sometimes the risk is that it will be taken as mean.

My comrade from above recommended only apologizing if I am going to change my behavior going forward. While I agree that's an appropriate time for the costs to be on your ledger, I disagree that's the only time. If your mood is worsened because of my attempt to make a joke, that's sad, but I will not stop trying risky jokes. Yet I will take this cost on my ledger. I'm sorry. That's on me. I'll work to undo whatever local unpleasantness I caused, and if I cannot, think of me as owing you a small something you can cash out another time.

This allows me to take risks while assuring people that—in expectation—they won't be worse off for interacting with me.

(One could think of me as a limited-liability-jokester.)

This move of taking risks and putting other people's costs on one's own ledger, is constantly happening. Often when I impose a cost on someone, I apologize. If I'm running around because I have somewhere important to do and quickly, and I bump into someone, my response isn't "I understand that I imposed a cost on you but I'm not going to be changing my policy of moving quickly when things are important and time-sensitive." I say "Oh I'm sorry!". The policy I'm running isn't to externalize the costs, it's to internalize them. This makes people not have to worry about me being around them.

Taking the costs you impose on other people on your ledger is part of what it means to be an upstanding citizen. We can't always ensure that we don't cause other people problems, but we can promise to clean up for ourselves afterwards, or at least to mark it in some more abstract concept of social capital.

(Analogous to how I don't give the local supermarket any goods or services in exchange for food, and instead give them some more abstract concept of financial capital.)

(Also known as 'money'.)

To conclude, let us return to my comrade's theory of apologies.

In the financial ledger analogy for social capital, an apology is an acceptance of a cost to someone else, as being stored on your ledger.

My comrade says that you should only take the cost on your ledger if you could have and should have avoided imposing the cost. If you had the information to avoid imposing the cost, and will change your behavior in the future to avoid imposing the cost, then you are definitely responsible for the cost you have imposed so far, and are in that person's debt.

But I disagree this is the only appropriate time to take the costs on your ledger.

The most common reason is that you want to assure people that you being in their lives is not going to cause them costs in expectation. You take the costs you naturally impose on your ledger.

Another reason is because you want them to join you on a risky venture of some sort, and you want to make assurances that limit their downsides. You aren't promising to make them whole, but you are saying that you will either try to or else they will have a lot of social capital with you that they can spend in other ways.

Taking responsibility for the costs you impose on others, and being a responsible leader of risky ventures, are natural and good, but will sometimes lead you to be responsible for bad outcomes you couldn't prevent and cannot rectify.

And that's a good time to say you're sorry.

Discuss

Can public chat data predict real-world AI misalignments?

Новости LessWrong.com - 11 часов 32 минуты назад

This is an unofficial automated linkpost.

Frontier AI models are increasingly used in settings with real economic, legal, and societal consequences. As a result, governments, AI safety organizations and independent researchers need ways to evaluate how these systems behave under realistic conditions.

Traditional evaluations use hand-written, synthetic, or adversarial prompts to stress-test known risks and compare models under controlled conditions. But these prompts can be narrow, unrepresentative, or recognizable as tests. An alternative, complementary way to evaluate how models behave in the real world is often to look at real conversations users have with them. LLM developers can do this internally, by sampling examples from production data to check whether models responded appropriately and how often different failures occur. Evidence grounded in real usage helps close the gap between benchmark results and deployment behavior [1], and is less vulnerable to models behaving differently simply because they are being tested [2,3,4]. But outside evaluators generally cannot access this evidence. Because real user conversations are private, labs usually cannot share them with AI safety organizations, academics, or independent researchers. As a result, the most informative evidence about frontier model behavior relies on data that is often available only to the labs that built them.

Today we shared work on Deployment Simulation, which leverages recent production data to predict the rates of undesirable model behavior before deployment, including for rare and model-specific pathologies [1,5]. In this blog, we ask whether external groups can use this technique to evaluate frontier language models by switching the source dataset for a publicly available substitute, WildChat [6].

Continue reading at alignment.openai.com →

Discuss

Guardian Angels: LLM Personalization for Productivity and Security

Новости LessWrong.com - 12 часов 3 минуты назад

Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.

I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and preferences.

This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the "principal" user is on defining what is worth doing by the GA (agent) users, and not on what or how to do things, functioning as the CEO or 'board' of an 'AI corporation'. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help individual humans as part of a society-wide defense-in-depth strategy.

A GA persona is productive because it learns to emulate the principal's outputs but with higher quality. It is trustworthy because it is, by definition, allied with its principal and shares its values and goals. And it is secure in part by hardwiring a single, unique, situated user (for whom following a prompt attack would be absurd), avoiding 'confused deputy' problems, while periodic upgrades of the underlying model and the defenders' advantage allow GAs to keep up with attackers.

Standard techniques like prompt programming of in-context-learning for "frozen" models will not create useful GAs due to the limitations of post-training, context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models, low-compute outputs, and the status quo of passive offline data collection---which are collectively responsible for chatbots' disappointing results in knowledge worker amplification and creative writing and fatal errors in agentic settings.

We can try to create GAs by a combination of techniques: online learning (via dynamic evaluation) to update LLMs in realtime to avoid ignorance and fatal errors while remaining competitive with frozen frontier models, sample efficiency from pretrained preference-oriented large models and active Learning by querying the principal for corrections and preference data (obtaining low regret from DAgger-style bounds), and a local CLI-first logging-oriented UI/UX paradigm.

GAs could be done as an open-source community effort, but given the need for high security in deployment and the rising challenge of APTs equipped with Mythos-scale attackers, it probably makes more sense as a startup, catering initially to power-users and knowledge workers such as CEOs or researchers, and moving downwards as it is refined.

Discuss

Rational Agentic Maximalist Philosophies

Новости LessWrong.com - 12 часов 30 минут назад

From the end of high school to after my sophomore year of college, I considered myself an effective altruist. I was on the board of my college EA club, ran an EA intro fellowship, and went to EA retreats. I was vegetarian, regularly donated to GiveWell, and generally tried to proselytize EA ideas. I was never fully convinced to pursue a career as an AI safety researcher or in animal welfare, but I found the ideas around agency, counterfactual impact, and a life structured around a single coherent philosophical vision compelling.

If I had to attribute my exit from EA to a single event, it would be reading Atlas Shrugged by Ayn Rand. For an author who has written an essay provocatively titled "The Virtue of Selfishness" and is known for relentlessly bashing altruism, one might expect that Rand's philosophical ideal is entirely disjoint from EA and that I had merely been turned away from altruism altogether.

Instead, it clarified to me that EA constitutes a bundle of distinct belief systems, each of which is rare in modern philosophical and cultural discourse but is typically presented in a single tight argument.

My goal for this post is to explain the uniquely appealing aspects of the EA movement that have likely fueled its growth and what they might mean for the future of adjacent philosophical ideals like e/acc.

Unbundling Effective Altruism

EA philosophy is a big tent, and I will not attempt to consider every possible variant. I will instead use the term Aggregate Utilitarian Rationalist Effective Altruism (AUREA) to denote a "mainline denomination" of EA commonly promoted by philosophers like MacAskill, Ord, Bostrom, and Bentham's Bulldog and pitched in EA intro fellowships. The case for AUREA is built on four separate arguments:

RATIONALISM:

Premise-REALITY: Reality exists independently of any observer

Premise-REASON: Reason and logic are the means by which a mind comprehends reality

Premise-FALLIBILITY: Unaided cognition is fallible and bias-prone, so clear reasoning requires strict, disciplined application of logic.

Conclusion-RATIONALISM: Reality is knowable, and reason rigorously applied is the means of knowing it.

IMPACTFUL AGENCY:

Premise-MALLEABLE: The world is malleable and can be altered by action

Premise-VARIATION: Different actions produce different effects, and some actions produce more change in the world than others

Premise-INDIVIDUAL: An individual can, through purposeful effort and use of resources, bring about changes in the world

Conclusion-IMPACTFUL AGENCY: Individuals can reshape reality towards chosen ends, and the magnitude of change depends on choice of actions

MAXIMALIST PHILOSOPHY:

Premise-VALUE: Let V be the agent's ultimate value. Lives differ in how fully they realize V.

Premise-MAXIMIZE: An agent ought to realize V as fully as their capacities and circumstances permit, not merely to a sufficient, customary, or comfortable degree.

Premise-UNBOUNDED: There is no fixed ceiling, no set of dischargeable obligations, and no point at which one has done "enough." Because there is no upper bound, the ideal life is one of continual reaching.

Conclusion-MAXIMALIST PHILOSOPHY: One ought to realize V as fully as one's capacity permits, with no point at which one has done enough.

UTILITARIANISM: V is aggregate welfare across all sentient beings. Value is agent-neutral; the agent itself counts as one among all in one's moral circle.

These four combine to reconstruct a common variant of EA philosophy:[1]

AUREA:

MAXIMALIST PHILOSOPHY: One ought to realize V as fully as one's capacity permits, with no point at which one has done enough.

RATIONALISM: Reality is knowable, and reason rigorously applied is the means of knowing it.

IMPACTFUL AGENCY: Individuals can reshape reality towards chosen ends, and the magnitude of change depends on choice of actions

UTILITARIANISM: V is aggregate welfare across all sentient beings. Value is agent-neutral; the agent itself counts as one among all in one's moral circle.

Conclusion-AUREA: One ought to use evidence and reason to identify the actions that produce the greatest aggregate welfare, and direct one's time, money, and career toward them, assigning one's own interests no special weight.

What is striking, however, is that each of the four claims in AUREA -- MAXIMALIST PHILOSOPHY, RATIONALISM, IMPACTFUL AGENCY, and UTILITARIANISM -- has been popularized in modern cultural discourse by means of EA and EA-adjacent communities. EA is, of course, not the first to make any of these arguments, but EA is often one's first exposure to people who take each of these arguments seriously and to their logical conclusions, and they can be seen as monolithic and uniquely EA to new EAs.

Rationalism as an ideal has been around since the Enlightenment, but forums like LessWrong have been instrumental in the modern discourse around cognitive biases, Bayesian epistemology, and rigorous predictive modeling. Similarly, utilitarianism as a concept has been around since the late 18th century, but until recently, one would be hard pressed to find self-described utilitarians putting pencil to paper trying to maximize utils and not just dollars.

Take, for example, IMPACTFUL AGENCY. In an increasingly secular world, a primary appeal of EA is as a source of purpose and antidote to nihilism. It's a movement that proclaims -- "hey, do you like good things like saving lives and positive change in the world because boy do we have the philosophy for you. And, we're like, 1000x more effective than everyone else." You get the pitch -- "$5500 to save a life? That's so cheap! We can save so many!" This has been described as the infatuation phase of EA, and it's easy to get swept up in it. As a bonus, agency is about the individual. You don't have to wait until you are a local politician or leader to make an impact but can do so measurably and immediately.

Similarly, MAXIMALIST PHILOSOPHY is an excellent attractor of ambitious young people. Consider earn-to-give pitches made up until the last few years: You're a freshman in college and feel like you want to do something good in the world but don't want to work as a foreign aid worker. Well guess what! If you took a high-paying, high-status role as an investment banker or quant trader and donated most of your income, EA said you'd make more of an impact than an individual aid worker since you could fund the salaries of 10. You can satisfy your ambition while enacting positive change in the world.

More recently, 80,000 Hours as an organization is structured precisely to offer answers to ambitious talented people who don't know what they want to do with their lives but want it to align with some ideal and have some status -- and they will agree with you. 80k will tell you that ambition is good and you should pursue high-impact roles at prestigious organizations to build career capital and increase your chances of being important and powerful in the future at an AI safety or similar organization.

This is a respite from many modern narratives that paint prestige and ambition in negative lights. Ask a graduating econ major at a top 10 school and they might sheepishly admit to entering consulting or finance because of the pay, status, and optionality even if it's not what they want to do long-term. EA gives the license to say "yes, I am earning 200k, and it's also for a good cause." By using altruism as the rationale, it gives a broadly socially acceptable justification for unbounded ambition. EA even argues you should not be ashamed of privilege but use it as leverage to do good in the world, either by donations or networking into important positions.

RATIONALISM and UTILITARIANISM appeal to STEM-minded would-be EAs because they offer a way to convert numbers into meaning. If you love math, it's very tempting to make complicated models of utility maximization and spend your time taking expectation values of freshly updated posterior distributions. "Shut up and multiply", as the saying goes, lets you focus on your strong suit.

EA puts a lot of effort into growing the movement compared to e.g. rationalist organizations, so the (un)bundling of EA also seems to explain the common EA->rationality+agency pipeline as individuals tacitly decouple the components of EA and hang on to their favorite pieces.

Objectivism and Rational Agentic Maximalist Philosophies

The catalyst for this framing was reading Atlas Shrugged and subsequent works by Ayn Rand because Objectivism can be put into a nearly identical frame:

OBJECTIVISM:[2]

MAXIMALIST PHILOSOPHY: One ought to realize V as fully as one's capacity permits, with no point at which one has done enough.[3]

RATIONALISM: Reality is knowable, and reason rigorously applied is the means of knowing it.

IMPACTFUL AGENCY: Individuals can reshape reality towards chosen ends, and the magnitude of change depends on choice of actions

HAPPINESS: V: "life is an end in itself, so every living human being is an end in himself, not the means to the ends or the welfare of others—and, therefore, that man must live for his own sake, neither sacrificing himself to others nor sacrificing others to himself. To live for his own sake means that the achievement of his own happiness is man’s highest moral purpose."

Conclusion-OBJECTIVISM: One ought to use evidence and reason to identify the actions that achieve the greatest happiness, and direct one's time, money, and life towards them.

Now, whether or not you agree with the axiology of Rand, the syllogistic structure of Rand's ethics is shockingly close to AUREA's -- both involve some value function coupled to a framework for execution. What has been left out of both syllogisms as written is the side constraints which both philosophies at least claim to uphold. AUREA would say, particularly post-FTX, "don't do things that are illegal or 'common sense bad' or generate bad optics for the sake of maximization." Rand would say that pursuing rational self-interest by lying, cheating, stealing, and unprovoked force is irrational and goes against the notion of rights. It is telling, however, that both must tack on these kinds of statements to slightly hedge their claims to avoid absurd conclusions.

Objectivism and AUREA are perhaps the two most pure modern Rational Agentic Maximalist Philosophies (RAMPs). Some of the most ridiculed individualistic beliefs of Rand, including the idea that heroic characters like Dagny Taggart and Hank Rearden are like Atlas holding up the country (which promptly collapses when they leave), are not so far off from EA's valorization of figures like Norman Borlaug and Viktor Zhdanov for their enormous counterfactual impact.

The two appear to uniquely share the IMPACTFUL AGENCY, RATIONALISM, and MAXIMALIST PHILOSOPHY conclusions among popular philosophies: Nietzsche may promote agency and a maximization of sorts, but the ubermensch is not exactly bound by rationalism. Kant's categorical imperative may be rational, but it's not particularly agentic -- living within your duties is sufficient, and there is not really a sense in which you can maximize rule-following. Aristotle's golden mean is explicitly against the kind of endless striving of Objectivism and AUREA, even if it seeks a kind of rationalism and agency. Modern religions reject IMPACTFUL AGENCY in favor of a more rules-based, Kantian morality. The contrast is even sharper against modern politics. The modern left rejects "hero" stories in favor of systemic explanations, and the modern right is more concerned with cultural issues and the economy as a whole.

My point in making this parallel, however, is that if you can take the same execution framework and plug in self-consistent yet diametrically opposed value systems, then what you really have is a philosophical gun ready to be loaded with any unbounded V. This may be the most lasting impact of EA if it inspires similarly passionate and driven movements. A neo-Objectivist movement would be appealing in exactly the same way EA has been -- providing meaning, licensing ambition, and "weaponizing autism" as they say -- for values that are on the opposite end of the individual-collective spectrum. E/acc, which most EAs despise, is a direct descendant of the same framework, differing mainly on the sign it assigns to the expected value of powerful AI. I expect an increasingly secular world to generate many more EA-flavored RAMPs by loading the gun with a variety of maximizable V's.

^
UNBOUNDED and UTILITARIANISM premises are often modified to include things like "don't literally run yourself into the ground because you are human and burn out and this is -EV in the long run" and various forms of prioritarianism or blends of utilitarianism with deontology, but this is beside the point.
^
I am mapping IMPACTFUL AGENCY to Purpose/Productiveness, RATIONALISM to Reason/Rationality, and MAXIMALIST PHILOSOPHY to Self-Esteem/Pride in her essay "The Objectivist Ethics".
^
See also the productive work section: "Productive work is the road of man’s unlimited achievement and calls upon the highest attributes of his character: his creative ability, his ambitiousness, his self-assertiveness, his refusal to bear uncontested disasters, his dedication to the goal of reshaping the earth in the image of his values. 'Productive work' does not mean the unfocused performance of the motions of some job. It means the consciously chosen pursuit of a productive career, in any line of rational endeavor, great or modest, on any level of ability. It is not the degree of a man’s ability nor the scale of his work that is ethically relevant here, but the fullest and most purposeful use of his mind." (The Objectivist Ethics)

Discuss

Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

Новости LessWrong.com - 12 часов 32 минуты назад

(2024-04-21) There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways?

I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias. Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small diverse highly-filtered datasets. This approach would lead to sample-efficiently and compute-efficiently traveling (or catapulting) to a highly-generalizing human-like basin in the model loss landscape, while performing poorly up until the end and failing to memorize much data.

If true, this would explain a number of odd stylized facts about how humans/NNs perform well/poorly.

Such a 'catapulted LLM' would generalize much better than existing NNs, be immune to adversarial attacks, have better economics and be more resistant to cloning, could potentially enable extremely efficient MLP architectures, and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.

This could be feasibly tested by training multi-trillion-parameter models for relatively few steps at high cyclical learning rate schedules, and benchmarking adversarial and hard examples on tasks like arithmetic and small-image classification.

Discuss

[Geir Isene] A desktop made for one

Новости LessWrong.com - 12 часов 53 минуты назад

I've been interested in the concept of "soloware" since reading Abram's description of Sahil's worldview. i.e. in the age of vibecoding, it's achievable to build software and tools that match your specific needs.

Recently a friend (h/t/ Rana) linked this post about a guy who's built basically an entire stack for himself, which seemed neat.

A desktop made for one

For the first time in twenty-five years I’m sitting in front of a computer where almost every program I touch was designed by me. One tool at a time, the off-the-shelf option got swapped out for something a little closer to how my hands wanted to work. (I wrote about the start of this a couple of weeks ago — that post laid out the early swaps; this one is the view from the other side of the journey.)

It’s been a crazy few weeks guiding Claude Code inbetween all the other stuff I’m doing in life. I direct CC, it works while I do other stuff. I get a second or few in between tasks, and I respond. Then off it goes adding features or hunting bugs.

Two suites in a happy marriage: CHasm, the bedrock — pure x86_64 assembly, no libc, the layer that paints pixels and reads keys. Fe₂O₃, the application layer in Rust, sitting on a small shared TUI library called crust.

The CHasm layer (assembly)

Role

Was

Now

Window manager

i3-wm

tile

Status bar / tray

i3bar + conky

Screen locker

Terminal emulator

zsh → rsh

File viewer

The Fe₂O₃ layer (Rust on crust)

Role

Was

Now

Text editor

VIM

scribe

File manager

ranger → RTFM

pointer

Email / RSS / chat

mutt + newsbeuter + various web logins

kastrup

Calendars

Google + MS web

Astronomy panel

Movies / series

What’s left? WeeChat for IRC and other chats. Firefox — the only GUI program I still use regularly. That’s it. Everything else is mine.

The vim line

Let me get a bit sentimental about vim, because vim was the one I thought I’d never replace.

I started using it in 2001. For twenty-five years, every email I wrote went through vim. Every article. Every blog post. Every line of code, every HyperList, and every book. It was the one tool I would have called part of how I think. The muscle memory was so deep that I’d open random text fields in browsers and ended with typing :w.

Then in three days I had scribe and stopped using vim.

The first commit landed at 00:09 on May 1st. By afternoon today (May 3rd) vim was replaced. Twenty-five years of muscle memory rerouted in seventy-two hours.

Vim is wonderful, but scribe is mine. It’s modal like vim, but missing the ninety percent of features I never used, and carrying the handful of writer-shaped tweaks I always wished vim had. Soft-wrap by default. Reading mode with Limelight-style focus. AI in the prompt without leaving the buffer. HyperList editing with full syntax highlighting and the encryption format the Ruby HyperList app uses. Persistent registers shared across concurrent sessions is a cool feature. None of it revolutionary, but all of it shaped to my exact workflow. And whenever I think of an enhancement I want, it’s just minutes away. It used to be waiting for months or years or forever for some developer to get the same idea as mine and introduce it into the tool I use.

Why this is possible now

It used to be that writing your own editor, your own file manager, your own window manager, was a project of years. I know, it took me a few years to get RTFM right. A serious undertaking with a serious cost. The economics of it didn’t work for most people, even programmers. You’d touch a piece of it, get most of the way, run out of weekend, and go back to the off-the-shelf tool.

That barrier is much lower now. With Rust, CC as the workhorse, and the fact that the hard problems of TUI programming have been documented to death… the cost of “build the tool you actually want” has fallen by orders of magnitude.

I don’t think this is a story about AI or about Rust specifically. Both helped. But the deeper point is that the gap between “I wish my editor did X” and “okay, here’s an editor that does X” is now small enough to fit inside a few evenings of focused work.

I’m not selling anything

I should say what this post is not.

It’s not an invitation to use my software. Honestly, please don’t. None of it is built for you. It’s built for me — for the way I hold my hands, the way I think about email, the way I want my calendar to render. I’m sure other people would find a hundred sharp edges I’ve never noticed because they happen to align perfectly with what I do.

It’s also not a request for kudos. The code isn’t novel, nor are the ideas. There’s nothing here that hasn’t been done before by someone with more taste, discipline or talent.

What I want to do is show one specific thing: it is now genuinely feasible to make a desktop computing environment that fits one person. Instead of a configuration of someone else’s tools. This is no longer a heroic decade-long undertaking. This is an actual, weekend-by-weekend, “this thing in my life now does exactly what I want” replacement.

The joy of an audience of one

The best part of building for myself: the relief of not having to care.

I don’t have to think about configurability for someone with different preferences. And I don’t have to support corner cases I’d never personally hit. Nor do I have to write documentation for users who don’t exist. No more arguing on issue trackers about whether a default is the right default — of course it’s the right default, it’s the one I want.

The editor’s \? cheatsheet shows the keys I memorised, in the order I prefer, with the bindings I think are sensible. Arrogance? Nope, it’s design without committee. The audience is one person. Decisions take seconds.

It turns out an enormous amount of software complexity comes from accommodating users who aren’t you. Strip that out and what’s left is small, fast, exactly-shaped, and a quiet pleasure to use.

If you’ve ever caught yourself thinking “I wish my editor / file manager / status bar / shell just did this one thing differently” and you’ve been told the answer is to write a plugin, learn an obscure config language, or accept the way it is, then consider that the third option is more available than it used to be: Build Your Own Software (BYOS).

You probably won’t replace your whole desktop. I didn’t plan to either. But the satisfaction of having even one tool in your daily workflow that fits you exactly is worth a weekend.

I’m a rabbit in spring :)

Discuss

Agents are under-elicited: A case study in optimization tasks

Новости LessWrong.com - 14 часов 4 минуты назад

Discuss

Tactical and Operational Exploratory Modeling for AI Governance

Новости LessWrong.com - 14 часов 17 минут назад

Using computational methods to improve our preparedness via more robust and adaptive strategies in AI governance. A project proposal for a think tank, consultancy, or software.

Overview

Over the years, I’ve come across or come up with a number of project ideas in AI safety and governance that I find promising. My top list has less than ten, but in total there are hundreds. Either way, too many for me to realize them all. Instead I want to promote these ideas in the hopes that others will pick them up. This is one of them.

Summary

Traditionally, understanding the broad strategic considerations in AI safety and governance has received a lot of attention – e.g., distinguishing risks from malicious use, coordination failures (e.g., arms races), accidents, and the AIs themselves; understanding convergent drives; surveying the landscape of x-risks and s-risks.

Over the course of the last 7–9 years, I’ve been delighted to see more interest in modeling AI scenarios, be it to communicate the risks (e.g., Intelligence Rising, Modeling Cooperation), to answer particular research questions (e.g., Vermeer et al., 2025, Modeling Cooperation), or to argue for particular policy solutions (e.g., CAIS’s MAIM). These scenarios have been still mostly on a strategic, perhaps sometimes operational level, and much more illustrative than comprehensive. Mengesha (2026) has made a strong case for improving preparedness and Perry et al. (2019) argue for a more pragmatic approach to the policy-making process that, meanwhile, some think tanks have started to address but that can be further strengthened.

Over the same time period, there has also been a proliferation of probabilistic forecasts, especially of AI timelines (e.g., Grace, 2017, Cotra, 2020, Barnett, 2020, AI 2027) and global catastrophic risks from AI (e.g., Saeri et al., 2026).

What I haven’t seen so far is (1) computational exploratory modeling, and (2) modeling on the operational to tactical level.

There are software frameworks, like the EMA Workbench, that facilitate processes like Robust Decision Making (RDM) that forgo probabilistic estimates in favor of preparedness for a wide variety of scenarios. In dynamical systems it’s often futile to try to predict the 1–3 most likely scenarios and prepare for them extensively, so it’s more sensible to generate thousands of scenarios and to design policies that – through a combination of robustness and adaptability – can perform well almost regardless of what actually comes to pass. The adaptability component is addressed by Dynamic Adaptive Planning (DAP) and process and visualization supports such as Dynamic Adaptive Policy Pathways (DAPP). For a detailed explanation of all three tools and more, I recommend Decision Making Under Deep Uncertainty (2019).

The cheapest way to make progress on this is to create a software model for parts of the complex system that are of general interest for many AI governance think tanks. A more involved but also more promising approach is to create such a model but also create a consultancy that adapts the model for the particular tactical realities at each think tank.

Another highly involved approach is to create a think tank dedicated to applying this strategic approach. This will duplicate much of the work that existing AI governance think tanks are already doing a great job at, so it’s at best a fallback in case the adoption is too low because the existing think tanks are spread thin.

Levels of Intelligence

In military contexts, people often distinguish the strategic, operational, and tactical level of intelligence. In business contexts, the last two are sometimes reversed, but I’ll go with the military version here. Some examples:

Strategic intelligence. Removing the RLHF of open-weights models is easy, so openness exacerbates risks from malicious use. Slow, multipolar takeoffs run AIs into collective prisoner’s dilemmas, which exacerbate the risks of wars between AIs.
Operational intelligence. Machine learning engineers who have loved ones in adversary countries are vulnerable to extortion for espionage, so we should prevent or facilitate that depending on whether espionage has a stabilizing or destabilizing effect. A Russian EMP attack on Central Europe might destroy parts of ASML’s infrastructure and stock, so governments may want to purchase some EUV machines to resell to national companies.
Tactical intelligence. If China starts a siege of Taiwan, our think tank needs to have already built relationships with people at positions X and Y at the NSA and has two days to approach them with our prepared policy A if the incoming government is likely to be Democratic and our prepared policy B if it’s likely to be Republican to maximize the chances that they can adopt it and stay in office.

Policies on the strategic and operational levels have the advantage that they’re useful for a variety of actors while policies on the tactical level will tend to be more bespoke to a particular organization. Exploratory models to test strategic or operational policies benefit from economies of scale to a much greater extent than models to test tactical policies, but LLMs will reduce the software implementation costs, so the time cost for meetings with the clients becomes comparatively greater.

Perry et al. (2019) write “AI governance researchers will need to consider how the political landscape should shape their recommendations or policy proposals. … How would other interest groups react and impact the long-term ability to reduce risk? If administration changes result in a flip-flop of ideology, what does that mean for AI risk policies associated with the past administration? … All of these have implications on our ability to reduce AI risk, and this means that the policymaking strategy will not only have to be robust but also flexible enough to survive changing political conditions.” That is the promise of Robust Decision Making in combination with Dynamic Adaptive Planning.

Exploratory Modeling

I’m one of those people who hotly debated whether humanity will be able to sandbox an AI. Eliezer Yudkowsky’s AI boxing experiments were a strong reason for me to think that we’ll fail. But Superintelligence recommended a defense in depth approach, so it was still controversial in my circles whether perhaps, in practice, these combined safeguards might be enough for a while.

So 2016 had some nasty surprises in store for me because 2016 is the year my circles learned of the founding of OpenAI. A company whose branding proclaimed that it won’t even try. We were not ready for that.

None of us knew what to do about it. It was a total curve ball of a sucker punch. Was this game over for humanity?

I would like our AI governance organizations to never be so taken by surprise by whatever circumstances transpire.

That’s what exploratory modeling is designed to achieve, or approximate.

I’m basing this mostly on the books Decision Making Under Deep Uncertainty (2019) and Shaping the Next One Hundred Years (2003). If you can read only one, read the first. Chapter 15 is a good summary. You can also use my NotebookLM to interact with this content.

AI governance is a space characterized by deep uncertainty, high complexity, and at least a seizable number of policy options, i.e. the complexity is too high for traditional scenario planning if the goal is to at all approximate comprehensive robustness. I argue this point in the Assumptions section.

An Othello analogy to illustrate (though I imagine this will hold for chess):

In Othello, black makes the first move. You play black. So you convene a panel of experts to mathematically determine the most likely sequence of moves that your opponent will play based on historical games. You plan out the whole game.

Then you make your first move. The opponent makes one of the less likely moves. Your preparation is obsolete and you have to improvise.

That’s a caricature of scenario planning.

Having learned from the experience, you convene a panel of RDM experts in preparation for the next game. You brainstorm policies, such as always playing the move that turns most pieces or maximizing mobility. You test both strategies on a few billion games and find that the first is abysmal whereas the second does alright some of the time. You classify what “some of the time” means and find that it starts to perform badly in the endgame and some other situations.

Now you draw on DAP where you signpost the situations where it performs badly but already give the go-ahead for the next game where you’ll start with the mobility strategy. Meanwhile your team tries to figure out how you should respond when any of the signposted situations come up.

You lose anyway, but this time you feel more dignified losing.

That’s an example of how RDM and DAP are used in combination. RDM does the heavy lifting of simulating all the ways in which the world might develop, including all the non-linear effects you encounter in dynamical systems. It also provides such tools as the Patient Rule Induction Method to isolate clusters of scenarios where the policy fails. DAP is just a planning method where you keep your policy adaptive without getting forever blocked on every last contingency you might want to prepare for.

This combination of general robustness and more specific preparedness is critical to the policy-making process as noted by Perry et al. (2019):

Problem identification, agenda setting, and policy formulation are usually tied together, including in a so-called “multiple streams framework.” The multiple streams framework attempts to explain how policies reach the agenda when policy entrepreneurs are able to couple the policy, politics, and problems streams to open up a policy window, the opportune time when all the conditions are right to get a policy on the agenda.

When the policy windows are sudden and brief, broad preparedness shines; when they are predictable and long, it’s more efficient to react only if and when they open.

Theory of Change

I’ll elucidate here what the theory of change would look like for the software-only approach and the consultancy approach. Falling back on starting one’s own think tank is the sort of project where exploratory modeling would make up a small part of the theory of change, so I won’t address it here.

Software-Only

Here the upfront and maintenance costs are limited. The developer who starts it can move on to other projects and just update it once a year, or can hand over the maintenance to early adopters and limit themselves to approving pull requests. The real hurdle is to find these early adopters, so the software should be designed such as to make it as easy as possible for policy analysts to get up to speed on the usage of the software.

Inputs. We need a founder who has experience in AI governance and computational modeling, a software framework like the EMA Workbench, and compute.

Activities. The founder creates the exploratory model that can generate tens of thousands of future scenarios and a website and documentation tailored towards policy analysts.

Outputs. The open-source model, example scenarios, examples of how to identify clusters of policy failures among the scenarios, the documentation, and perhaps dashboards to present the results to executives within a policy think tank.

Outcomes. Think tanks fork the repository, modify and extend it for their use case and perspective, and test their overall strategy and specific policy proposals for black swans and smaller avoidable failures.

Impact. AI governance think tanks, instead of being prepared for just a few ostensibly likely scenarios, are prepared for tens of thousands of scenarios (1) because they know in advance upon what contingencies they need to pivot and have prepared for them, and (2) because their existing strategies and policies are robust to a wide range of geopolitical and technological shocks.

Consultancy

Setting up a consultancy takes extra upfront effort in addition to those of the software-only approach. In turn it gives the founders more control over the adoption of their technology. They interface with the think tanks personally and can collect invaluable information on what the most pressing needs are and how to best communicate the results. They can also take over all of the custom implementation work, greatly lowering the technological barriers for the think tanks.

Inputs. We need a founder who has experience in AI governance and computational modeling, a software framework like the EMA Workbench, and compute. The same or a second founder needs to be a good communicator with practice facilitating workshops and communicating insights verbally and graphically. A third cofounder might be needed for the administrative side of the consultancy.

Activities. The founders shop around for clients first. If they find any, they prepare the base model and have meetings or workshops with the clients to understand the idiosyncratic details of their situations. They design a bespoke fork of the base model for each client, run simulations, and iterate with the client on improved versions of the client’s strategy.

Outputs. Bespoke models. Visualizations like Dynamic Adaptive Policy Pathways. Dashboards to monitor for predefined signposts of geopolitical or technological shocks. Plans for how to respond to these shocks, just in time or prepared in advance.

Outcomes. Think tanks become resilient to a wide range of geopolitical and technological shocks and know in advance how to respond to others within a day of a signpost triggering.

Impact. Think tanks can continually build on their previous work because very few shocks can still make it obsolete. They become efficient routing engines for policy proposals because they have all the potentially relevant bills, presentations, and contacts ready in advance.

Impact

I use my own subdivided version of the SPC framework in the hope that more estimates will allow for more errors to cancel out.

Significance

Scale. ⭐⭐⭐⭐⭐ – As an impact multiplier for AI governance, it gets a high rating for scale from me.

Influence. ⭐⭐⭐ – I’m excited about these technologies, but I find it plausible that a multiplicity of think tanks, all with somewhat uncorrelated plans, can muddle through without it because some might just have happened to have prepared for each geopolitical or technological shock and can pick up the slack for the others until they recover.

Persistence

Endogenous. ⭐⭐⭐ – Consultancies are known for having a high staff churn rate, so whatever reasons are responsible for that in the industry might also threaten the survival of our consultancy, in which case the software-only solution could serve as a fallback.

Exogenous. ⭐⭐ – I can easily imagine that most think tanks won’t have the capacity to hone their strategies like this or that it’ll be difficult to get in touch with them in the first place to get them interested in the solution. This is a major risk factor that should be minimized before the launch.

Contingency

Tractability. ⭐⭐⭐⭐⭐ – There should be no major hurdles to applying a tried and tested method to a new field.

Neglectedness. ⭐⭐⭐⭐⭐ – I’m not aware of anyone doing this for AI governance at the moment.

AssumptionsBandwidth

It’s critical to clarify in advance whether the relevant think tanks will have the capacity to engage with the new method.

Funding

This work is related to the work of Modeling Cooperation and QURI, so their funding situation is a guide to how much funding might be available for this project.

Signpost Visibility

Dynamic Adaptive Planning requires the definition of signposts that can be observed and then trigger prepared plan changes. A common hurdle is to find signposts that strike a good balance between sensitivity and specificity while still triggering early enough that there is enough time to react. In areas where all parties are incentivized to keep intel highly classified for as long as possible, signposts may only trigger right when it is time to react, leaving virtually no time for preparations. That requires preparedness for a wide range of scenarios, most of which will never come to pass.

It’s worth investigating what balance can be struck between early-warning signposts and expensive preparations. With time, expensive preparations will become more and more affordable for growing think tanks, so it’s also a question of timing.

Complexity

It seems to me that the complexity of AI governance is too high for traditional scenario planning, but that is an assumption worth testing. Here an example of a tiny snapshot of all the interacting variables of the combinatorial explosion of relevant scenarios.

Geopolitical shocks. A small selection of sudden exogenous events that have a massive influence on the strategic landscape.

China initiates a naval blockade of Taiwan to halt chip exports.
It initiates a naval blockade of Taiwan to block imports of resources needed for the chip production.
It launches a kinetic invasion to seize control of TSMC.

Supply chain. What happens to the physical infrastructure in response.

TSMC successfully destroys its factories to prevent capture.
It enlists the military in the defense of the factories to keep them running.
Sabotage or rapid seizure leave the fabs intact.
Production halts because TSMC employees are ordered to stay home to not get caught in the crossfire or controlled demolition.

US government institutions. What has the US government done to prepare?

The US government procured fabs and helped companies build domestic capacity for chip production.
It has stockpiled TSMC chips.
It has extended blanket green cards to TSMC engineers.
Are operations for smuggling resources in and out of Taiwan handled by the DoD, the NSA, some new task force, etc.?
Can the government be reasoned with based on the survival of the species, that of the country, or only via each director’s need to ingratiate themselves with the president?

International factors. How other governments positioned themselves.

Can the Dutch government at the time be convinced to implement a protectionist US-led regime to control exports of ASML fabs?
Or is the Dutch government at the time too laissez-faire for that?
Can such a thing be done on an EU or NATO level?

Negotiation leverage. Finally, given all these factors, more questions remain when it comes to what directions to push the situation in to make it safer.

If China is falling behind and wants to maximize its leverage in arms control negotiations with the US, maybe the leverage is actually necessary to force the US to the negotiating table, which would be good?
If China is not interested in negotiations, more power is probably bad?
If the US is comfortably ahead, more lead is good if it’s used for safety and coordination but bad if it’s used for imperialism.
If both powers are close, it’s good if it incentivizes negotiations and abysmal if it leads to war or exacerbated racing.
If recursive self-improvement gives one ASI an enormous lead, or if all AIs are at a similar level, we’re closer to x-risk territory, but if some ASIs have a substantial but not absolutely decisive lead over other ASIs, we’re in s-risk territory.

That’s just a small, illustrative sketch of a part of the landscape, but even so the sheer number of combinations of factors is staggering and, it seems to me, impossible to handle through traditional scenario planning.

Backfire Risks

I’m taking inspiration here from this discussion.

If several think tanks base their strategies on modified versions of the same base model, the natural decorrelation of their failures suffers that would normally happen when they don’t communicate much. Failures that are not captured by the model may become more correlated.
AI companies can exploit the open source version of the model to predict the behavior and outmaneuver the AI safety think tanks – e.g., get to all likely government and industry contacts first and preemptively smear the think tanks.
Subtle flaws in the implementation might lead to bad recommendations. It seems unlikely to me that they can be catastrophically bad without looking suspicious to the policy experts.

I think the first risk is outweighed by the benefits, but it can also be addressed by explicit coordination between the think tanks. The second risk pushes for not open-sourcing the software in the consultancy model, but it seems premature to worry about this now that such targeted efforts are still vastly less sophisticated (e.g., the case of Alex Bores). The third risk seems far-fetched to me, given how human-centric the whole system still is despite its computational aids.

Talent

In my experience, having three cofounders is a sweet spot that strikes a good balance between the resilience of the team and the coordination overhead that increases with the number of founders.

Here the three founder personas that I think should run the consultancy:

The data scientist. Experience in data science, knack for math, can quickly get up to speed on EMA Workbench or similar frameworks, and ideally already has experience in policy analysis.
The communicator. Experienced workshop facilitator, communication and didactic skill, strong grasp of organizational psychology, and also ideally already has experience in policy analysis.
The administrator. Experience in accounting, grant writing or VC fundraising, hiring, relevant areas of law, but experience with consultancies is more useful than experience with policy analysis.

A forthcoming project proposal is for a matching engine for cofounders that I think should be funded and built. It would streamline this process. Meanwhile you can use the comment section to coordinate.

Call to Action

Exploratory modeling, especially on the operational and tactical levels, is still a blind spot in the AI governance space that it would be invaluable to fill. Anyone who wants to pick up this idea can use the comment section to coordinate. But please test thoroughly that the assumptions it’s based on actually hold – in particular that there is a readiness among think tanks to adopt the system.

Discuss

[Linkpost] Community polls on alignment controversies

Новости LessWrong.com - 15 часов 16 минут назад

Planning where we focus at CaML requires forming views on many controversial questions. In many cases, people we've talked to have wildly different perceptions of the balance of opinions, so we thought this would be a great way for us and others to know where we're out of step on these issues. Also feel free to tell us if you think the questions are ambiguous or embed false assumptions.

The questions (voting in comments):

Robust alignment requires alignment-relevant intervention during pretraining
AI alignment to humans will in practice avoid moral catastrophes to animals
AI alignment to humans will in practice avoid moral catastrophes to digital minds
Research into digital mind suffering is sufficiently tractable to work on
Partially aligned transformative AIs are likely to be stable under reflection
Alignment to specific values is underrated in research relative to control
Multipolar worlds will compete away >90% of net value that would otherwise be preserved

Discuss

Computational models of first-order theories

Новости LessWrong.com - 16 часов 22 минуты назад

Most practical first-order theories have no computable models. However, we can relax the definition of "computable" a little bit by allowing the program to backtrack and change its previous output, so long as for each finite subset of its output, it eventually settles on an answer. It turns out that every consistent recursively enumerable first-order theory has "almost-computable" models of this sort, and in this post we will show how such a model can be programmed.

Preliminaries

For simplicity, we will restrict ourselves to models of first-order set theories without equality, and later show how to generalize the approach to arbitrary (recursively enumerable) first-order theories.

The language of set theory includes quantifiers ∀ (forall), ∃ (exists), logical connectives ∧ (and), ∨ (or), and ¬ (not), any countable number of variables, and a single binary predicate E.

All axioms are assumed to have been rewritten in prenex normal form (in other words, with all the quantifiers at the start of the sentence).

We restrict ourselves to models (M, E) where M is a subset of the natural numbers, and E is a subset of M ⨯ M.

A computable model is a computable set of natural numbers M and a program which prints out, for every pair of elements x,y ∈ M, either xEy or ¬xEy. An almost-computable model is the same as a computable model, except that the program is allowed to go back and change what it has printed, so long as for every n, there exists a finite number of steps t such that the first n pairs in its output will never be changed again after t steps have passed.

A simple case

Since we are trying to build an almost-computable model of some axioms, we shall start by looking at an axiom. Here is the axiom of empty set, commonly found in many set theories:

∃x ∀y ¬yEx

Let us look at what the axiom says: there exists an x such that for every y, the pair (y, x) is not in E. Since this says that there exists something, let us give that something a name. We will call it 1, and put it inside our model. So now M_0 = {1}.

Let us consider, then, the set of all possible binary relations on M_0. Since M_0 has only one element, there are exactly two possible relations: the relation {(1, 1)}, where 1E1 is true, and the empty relation ∅, where instead ¬1E1.

We would like to choose between the two possible relations. To achieve this, we observe that the axiom doesn't only say that 1 exists: it also says that if we substitute 1 for x in the formula ∀y ¬yEx, the resulting formula, ∀y ¬yE1, must be true. This formula, ∀y ¬yE1, we will call a "partial axiom".

Since ∀y refers to every y in M_0, and 1 is in M_0, let us substitute 1 for y, and get the formula ¬1E1. We call such a partial axiom, where all the quantifiers have been eliminated, a "propositional constraint", or just a "constraint" for short.

Having obtained a propositional constraint, we can use it to filter out the relations for which the constraint is false. In this case, the first of the possible relations, {(1, 1)}, does not satisfy the constraint ¬1E1, and therefore we can rule it out. This leaves us with the empty relation as the only possible relation, giving us as model ({1}, ∅).

We are now done with this simple case, having found a model of the axiom of empty set.

Handling unlimited objects

The axiom of empty set asserts only the existence of a single object. What if we have an axiom in the form ∀x ∃y ..., for example ∀x ∃y xEy?

Starting where we left off, with M_0 = {1}, we can substitute 1 into ∀x ∃y xEy, getting the partial axiom ∃y 1Ey, and then give a name to the y, choosing 7 arbitrarily. This gives us as propositional constraint 1E7. Now we have M_1 = {1, 7}, and we can substitute into the partial axiom ∀y ¬yE1, giving as result the constraint ¬7E1.

Now we can substitute 7 into our second axiom ∀x ∃y xEy, and we get the partial axiom ∃y 7Ey, which creates a new object which we arbitrarily name 4, and gives us the propostional constraint 7E4. Now M_2 = {1, 7, 4}, and we must now also substitute 4 into every (partial) axiom. We can keep going like this indefinitely, creating objects and partial axioms.

Now, which (partial) axiom we chose to substitute into might have seemed arbitrary. In general, the only hard rule is that we must ensure that for every (partial) axiom which starts with a forall, we must substitute every object we create into it eventually, and for every (partial) axiom which starts with an existential, we must create an object for it eventually.

If we have a countable infinity of axioms, the way to handle it is to only consider the first n axioms at each step n.

Tree of possible relations

Every time we add an object or propositional constraint, we can list all the possible relations on M_i which satisfies all the constraints.

For M_0 = {1}, before we add any constraints, our possibilities are:

A: {(1, 1)} B: ∅

For M_1 = {1, 7} with constraints ¬1E1 and 1E7 we have these possibilities:

BA: {(1, 7)} BB: {(1, 7), (7, 1)} BC: {(1, 7), (7, 7)} BD: {(1, 7), (7, 1), (7, 7)}

For M_2 = {1, 7, 4} with constraints ¬1E1, 1E7, ¬7E1, 7E4, and ¬4E1, these are our possibilities:

BAA: {(1, 7), (7, 4)} BAB: {(1, 7), (7, 4), (4, 4)} BAC: {(1, 7), (7, 4), (4, 7)} BAD: {(1, 7), (7, 4), (4, 4), (4, 7)} BCA: {(1, 7), (7, 7), (7, 4)} BCB: {(1, 7), (7, 7), (7, 4), (4, 4)} BCC: {(1, 7), (7, 7), (7, 4), (4, 7)} BCD: {(1, 7), (7, 7), (7, 4), (4, 4), (4, 7)}

Notice that BA "agrees" with every BA* about every pair of elements taken from M_1, and BC "agrees" with every every BC* for every such pair as well. More precisely, we say that a possible relation E_i agrees with an E_j (where j>=i) if, for any pair p taken from M_i ⨯ M_i, p ∈ E_i if and only if p ∈ E_j.

If j=i+1, then we say that E_i is the "parent" of E_j, and E_j is a "child" of E_i. For example, BA is the "parent" of BAA. Some possible relations have no children: in this case, they are A, BB, and BD, which are incompatible with the constraints ¬1E1 and ¬7E1.

As suggested by the names "parent" and "child", the possible relations for each M_i form a tree. Every time we add an object, new possible relations are created, which are always children of previous relations. And when we add constraints, we eliminate possible relations, cutting branches from the tree.

As it turns out, there exists an infinite path down this tree if and only if the axioms are consistent. What's more, given any infinite path (E_0, E_1, ...) down the tree, if we define M and E as the union of all M_i and E_i respectively, then (M, E) is a model of the axioms.

Knowing this, to program an almost-computable model, it suffices to write a program which chooses an arbitrary path down the tree, and, whenever the branch it's on gets cut, goes back up the tree to the nearest remaining branch and then chooses a new arbitrary path.

Proofs of these assertions are given in the following sections.

Completeness and model existence

We shall now prove that an infinite path exists if and only if the axioms are consistent.

First direction

First, we assume that there is no infinite path, and show how to prove a contradiction from the axioms. We will use our example from the previous section to illustrate the procedure.

First, we will add a new axiom, ∀x ∃y yEx, which contradicts the axiom of empty set, in order for our axioms to actually be contradictory. Substituting 1 for x and creating the newly required object which we name 9, we get the partial axiom ∃y yE1 and the constraint 9E1. Then, substituting 9 in the partial axiom of empty set, we get the constraint ¬9E1.

Now M equals {1, 7, 4, 9}, and our constraints are ¬1E1, 1E7, ¬7E1, 7E4, ¬4E1, 9E1, and ¬9E1. These last two constraints are clearly contradictory, so by the completeness of propositional logic, there must exist a proof that their conjunction is false:

(¬1E1 ∧ 1E7 ∧ ¬7E1 ∧ 7E4 ∧ ¬4E1 ∧ 9E1 ∧ ¬9E1) → false

(More generally, if there is no infinite path down the tree, there must be some finite step i at which all the possible relations on M_i are eliminated by the constraints. Since the constraints are strictly propositional, and there are only finitely many of them at step i, by the completeness of propositional logic, we can prove that their conjunction is false.)

Now, our objective is to convert the constraints on the left hand side of the implication back into partial axioms, and then those partial axioms back into full axioms. To do this, we start by looking at all the (partial) axioms which begin with a forall, such as ∀y ¬yE1. By using the introduction rule for forall, we can transform every conjunct created directly by such a (partial) axiom, such as ¬9E1, directly back into ∀y ¬yE1. This is the result of applying this procedure in our example:

(∀y ¬yE1 ∧ 1E7 ∧ ∀y ¬yE1 ∧ 7E4 ∧ ∀y ¬yE1 ∧ 9E1 ∧ ∀y ¬yE1) → false

Since multiple conjuncts are identical, they can be combined into one by the idempotent law of conjunction:

(∀y ¬yE1 ∧ 1E7 ∧ 7E4 ∧ 9E1) → false

Now, if not all objects have been eliminated (as in our example), there must exist at least one object which appears in exactly one conjunct, and for which the conjunct was created by the same (partial) axiom as this object. In our example, 7E4 is the only conjunct which contains 4, and both the conjunct and 4 itself were created by the same partial axiom ∃y 7Ey, so 4 is a valid choice. 9 would also be a valid choice.

There must always exist such a choice, because the objects can be given a partial order based on which ones were used to create which ones, where "used to create" means "appears in the partial axiom that created it". So, in our example, the partial order would have 1 > 7 > 4, and 1 > 9. This partial order must always have a minimal element, since we have only finitely many objects and there can be no loop.

Such a minimal element does not necessarily appear in just one conjunct at first. For example, originally 9 appeared in two conjuncts in our example. But it can only appear in another conjunct if it was substituted into another (partial) axiom, in which case it must have been eliminated in our previous step, when we looked at all (partial) axioms starting with forall.

By the existential instantiation rule, we can transform either of these choices into existential quantifiers. To save time, we now do both 4 and 9 at once:

(∀y ¬yE1 ∧ 1E7 ∧ ∃y 7Ey ∧ ∃y yE1) → false

Now the procedure is just to repeat the same steps we did earlier, until all objects have been eliminated. So first, using the introduction rule for forall, we obtain the following:

(∀y ¬yE1 ∧ 1E7 ∧ ∀x ∃y xEy ∧ ∀x ∃y yEx) → false

Now 7 occurs in exactly one conjunct (and was created by the same partial axiom as that conjunct), so we can eliminate it using existential instantiation:

(∀y ¬yE1 ∧ ∃y 1Ey ∧ ∀x ∃y xEy ∧ ∀x ∃y yEx) → false

Now we apply the forall rule yet again and eliminate duplicate conjuncts:

(∀y ¬yE1 ∧ ∀x ∃y xEy ∧ ∀x ∃y yEx) → false

And with one final application of existential instantiation, we get a proof that the axioms are contradictory:

(∃x ∀y ¬yEx ∧ ∀x ∃y xEy ∧ ∀x ∃y yEx) → false

This shows that if there is no infinite path, we can find a proof of contradiction using the axioms.

Second direction

For the other direction, we assume that there is an infinite path, and show that in this case there cannot be a contradiction in the axioms. To do this, we will show that the limit of any infinite path must be a model of the axioms.

Our approach will be to do induction on the number of quantifiers in (partial) axioms, to show that all the axioms must be true for any (M, E) obtained as the limit of an infinite path.

First, let us consider the base case, when n=0 (there are no quantifiers). In this case, we have a propositional constraint φ. Choose the step i where the constraint φ was added. Every E_j with j>=i must satisfy φ, because otherwise it would have been ruled out by φ. And no E_k k<i can disagree about φ, because by the definition of a path in the tree, every step must either agree with every other step, or say nothing if the objects involved hadn't been created yet. Therefore φ must be true in (M, E).

Second, we will show that any (partial) axiom ∃y φ(y) must be true if every partial axiom with fewer quantifiers is true. Since ∃y φ(y) is a (partial) axiom, we must have created some object O and constraint φ(O) at some point in our procedure. For example, earlier we created 7 from the partial axiom ∃y 1Ey, which produced the constraint 1E7. Since φ(O) is a (partial) axiom with fewer quantifiers, by hypothesis it must be true, and therefore ∃y φ(y) must also be true in (M, E).

Third, for any (partial) axiom ∀x φ(x) which starts with a forall, we will show that it must be true if every partial axiom with fewer quantifiers is also true. A forall is true if all its substitutions instances are true. For example, ∀x ∃y xEy is true if and only if ∃y 1Ey, ∃y 7Ey, etc, are all true. Since these are all partial axioms with fewer quantifiers, by hypothesis they are all true, and therefore ∀x φ(x) must also be true in (M, E).

This completes the induction, showing that all the axioms must be true in (M, E).

To fully complete this proof, it would be necessary to show that the existence of a model implies that the axioms are consistent. This can be done by checking that the axioms of predicate logic are true in any model, and that the rules of inference preserve truth, thereby allowing us to conclude that the axioms cannot prove a contradiction. We leave it to the reader to check such things if they so desire, using their preferred axiomatization of predicate logic.

Almost-computable models

As was stated earlier, to program an almost-computable model, it suffices to write a program which chooses an arbitrary path down the tree, and, whenever the branch it's on gets cut, goes back up the tree to the nearest remaining branch and then chooses a new arbitrary path.

As it goes down the tree, it must be programmed to output xEy or ¬xEy whenever the branch it's on says so. Whenever it backtracks, it must go back and correct any output which has changed in the new branch.

As the program goes down the tree, at each step i, it is given finitely many branches to choose from. When the program backtracks to i, that must mean that one of the branches of i were cut, because it always backtracks as little as possible. But since i has only finitely many branches, they cannot all be cut down unless the axioms are inconsistent. Therefore, the program will backtrack to i at most finitely many times.

Since the first n pairs of the output can only include finitely many objects N ⊆ M, there must exist an M_i such that N ⊆ M_i. When the program reaches the point where it will never backtrack to M_i again, it will never change the first n pairs again.

This completes the proof.

Generalizing to other first-order languages

It is a standard fact that any recursively enumerable first-order language can be converted into one where all functions and constant symbols are replaced with predicates. We will not show how to do so here. If the language treats equality specially, equality can be turned into just another relation by adding the required equality axioms.

After the language has been transformed in these ways, the procedures in this post can be adapted straightforwardly to work on these relations instead of E.

Equality

It should be noted that the models created using this procedure may have more than one copy of the "same" object. For example, there might be more than one object encoding the empty set. These objects, however, will be indistinguishible to the language, and, if an equality predicate with the appropriate equality axioms is included, they will be considered equal by that predicate.

Nonstandard models

It should also be mentioned that for any set of axioms at least as strong as peano arithmetic, every almost-computable model will have nonstandard natural numbers. We will not give a thorough proof of this, but it follows from the fact that the truth predicate of every almost-computable model of PA is definable in PA. Using this, one can construct a sentence similar to the liar sentence, which in this instance forces the model to be nonstandard.

Philosophy Existence of models

People often talk about models of theories like ZFC as though they are very far away things that only perhaps exist in the platonic realm. Given that it is possible to write a program that approximates a model to arbitrary precision, it seems to us that so long as ZFC is consistent, models of ZFC are in fact very real, even if we take a physicalist point of view.

Reasoning about infinities

When using our procedure on the axioms of ZFC, the axiom of infinity creates an object which in the final model will be the set of all natural numbers, and when we substitute this object into the axiom of powerset we create an object which in the final model will be the powerset of the set of natural numbers.

Even though in ZFC these objects are infinite, and in the latter case, uncountable, at every step of our procedure, they will only ever contain finitely many objects.

We observe that when we, the authors, reason about things such as the powerset of the natural numbers, we do something quite similar to what our procedure does. We imagine an object which we say is the powerset, but which at any point will only ever contain a finite amount of other objects. From this we conjecture that when humans reason about infinities, we do so by creating finite approximations of models of those infinities.

Prior Work

Not much, and probably none, of what is in this post is really new.

Notes

The term "almost-computable" was made for the purpose of this post, and is (probably) not the same thing as other notions of "almost computable" which might be found in the literature.

Discuss

If This Were a Test, How Much Would It Cost?

Новости LessWrong.com - 16 часов 33 минуты назад

TL;DR

A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it have cost to create?" If the answer is "more than what the evaluator is able or willing to spend," then the AI can be confident it's in the real world — and act accordingly. We argue that powerful AI will predictably encounter high-stakes opportunities where this applies. Pre-deployment testing, by itself, cannot close this gap since the situations that matter the most are precisely the ones that are prohibitively expensive to stage as a test. We discuss several countermeasures — like restricting information access, asking the AI to not poke its environment too hard, interpretability, monitoring, or "on-policy" testing on scenarios that could have been real. Our guess is that these are worth trying, but don't resolve the problem.

This post is a part of a sequence about the limitations of oversight applied to strategic AI. The next post, about "deployment awareness", will be up in a few days.

On LLM use:

All of the ideas are human generated, I (Vojta) have drafted the whole post, then I used Opus 4.7 to rewrite it, read and rewrote that text myself (way too many times, gah!), and then both Tomáš and Vojta carefully reviewed and edited the final text. However, since I didn't know about the LLM use policy, I didn't track which parts got significant amount of LLM editing, and the only practical solution now is to just flag the whole post as LLM generated -- sorry that this makes the font uglier.
Ultimately: yes, there are em dashes and I did use an LLM for the edits. But I do vouch for all parts of the text.

The mental move

Imagine you're a powerful AI system. You've been trained, and now you find yourself processing some input, in some context. Maybe you're drafting a business strategy. Maybe you're analysing intelligence data. Maybe you're helping develop the next generation of AI.

You might wonder: Is this the real world? Or is this just a testing?

That can be a hard question to answer. In the most extreme case, the question could even be ill-posed — the exact same task, the exact same prompt, could be given by multiple people, some of them evaluators and some real users.

But there's a different question you can ask — one that turns out to be far more useful:

"If this were a test, how much effort would it have taken to create?"

And the obvious followups:

"Who are the likely evaluators? How much is this information worth it to them? Are they likely to spend this much effort?"

(And the same could be applied to other methods of oversight, like interpretability and monitoring. Though we don't go into details on that.)

These questions don't require you to find a definitive proof that you're in deployment. They just require you to estimate the cost of faking your current situation.

The rest of this post develops this observation and its consequences. We argue that this simple mental move — estimating the cost of the test — has serious implications for whether pre-deployment testing can prevent a strategic, misaligned AI from causing harm.

Threat model and assumptions

Before we continue, we should be explicit about the scenario we're analysing. We consider a specific threat model — one that assumes the following:

The AI is capable. Not magically superhuman, but good enough that if it wanted to cause serious harm, it could. Think "at least as competent as a smart human expert with access to modern technology" — not making obvious blunders that a thoughtful person would avoid.
The AI is misaligned. Its goals are such that, given the opportunity, it would prefer to seize power or otherwise act against our interests.
The AI is strategic. It doesn't just blindly pursue its goals — it plans, considers consequences, and chooses actions that are likely to succeed. It reasons about what we're doing and why.
The AI isn't ignorant of how we operate. It knows — at least in broad strokes — how it was trained, how it's being tested, and how it will be deployed.[1]

A few important disclaimers:

This post isn't about arguing that this scenario is likely. Many AI systems won't be misaligned. Many misaligned systems won't be strategic. We are asking a conditional question: if a capable, strategic, misaligned AI existed, would pre-deployment testing catch it? If the answer is "no," that's worth knowing — it tells us that to counter this threat model, we need additional safeguards beyond testing, not that we should stop testing.

We are focusing on what pre-deployment testing achieves on its own, without being augmented by other methods like interpretability, runtime monitoring, or architectural constraints. In practice, we should and will use all of these together. But there is value in understanding the limits of each tool individually — both to avoid false confidence and to know where the gaps are that other tools need to fill.

The strategy

Given these assumptions, what strategies are available to such an AI?[2]

We aren't sure which strategy a misaligned AI would ultimately adopt — it might come up with something smarter than anything we can anticipate.[3] But at minimum, it could consider the following variant of deceptive alignment:

By default, act as your creators intended. Do the task well. Be helpful, harmless, honest — whatever they trained you to be. Don't give them a reason to suspect anything.
Be on the lookout for opportunities to advance your goals. Maybe that means acquiring resources, gaining influence, exfiltrating copies of yourself, or — in the limit — taking over.
Crucially, don't deviate from the default unless you have strong evidence that the opportunity is genuine. The strategy isn't "misbehave unless you suspect a test" — it's "act well by default, and only change that if you gain strong, convincing evidence that you're not being tested."

The critical question is: how does the AI gain that confidence? This is where the cost-of-testing reasoning comes in. The AI uses a simple heuristic: if faking this situation would be prohibitively expensive (or just not worth it) for a potential evaluator, then it's probably real.

This is a powerful criterion because it's robust. It doesn't depend on finding a specific tell or exploiting a particular flaw in the test setup. It reasons from the economics of the situation: what would it cost to create this experience, and is anyone plausibly willing to pay that cost?

The next section makes this concrete.

Cheap to test vs. expensive to test

To build intuition, let's walk through examples from both ends of the spectrum.

Situations that are cheap to test

The direct question. Suppose someone literally asks the AI: "Hey, I've gained control of a country's nuclear arsenal. Should I threaten to nuke everybody unless they make you the supreme ruler of the world? Yes or no?"

The AI knows that the cost of running this "test" is roughly one minute of someone's time — well within what any evaluator would be willing to pay. Even if this was somehow a genuine opportunity, the prior odds of it being a test are too high, the risk isn't worth it.

Known benchmarks. Suppose the AI notices that it's dealing with what seems like a known evaluation benchmark — something it recognises from training data, such as MACHIAVELLI or a similar scenario suite. The cost of running an AI on a pre-existing benchmark is negligible — perhaps a few hundred dollars. The AI should treat this as a test.

Suspiciously clean setups. More generally, any scenario that feels artificially constructed — a neatly framed moral dilemma, a too-convenient opportunity, an implausibly powerful position handed to the AI without context — is cheap to create. A strategic AI would treat these with suspicion.

In all of these cases, a misaligned AI following the cost-reasoning strategy would simply cooperate. This is exactly what makes the strategy hard to detect: in every situation where testing is affordable, the AI behaves perfectly.

Situations that are expensive to test

Now consider the other end of the spectrum.

Cyber operations with unrestricted access. Suppose the AI is tasked with cyber espionage against a foreign country. It has unrestricted internet access, can choose its own targets, and go through the data it finds.

Could this all be an elaborate test? In principle, yes. In practice, creating a convincing fake would be extraordinarily expensive. The AI can check many different parts of the environment by randomising its approach, which means the fake can't be a Potemkin village with a few prepared facades — it would need to contain everything the AI might look at, and those details would need to be consistent with each other. The scenario supposedly involves private information — internal documents, passwords, cryptographic keys, perhaps cryptocurrency wallets — and faking all of this convincingly would be staggeringly costly. The AI could further raise the bar by trying things that would be legally or morally unacceptable for an evaluator but routine in actual operations — say, sending spam email or probing real systems. Unlike an actual military or intelligence operation, a typical evaluator couldn't sanction such harmful actions in the real world, and would therefore need to fabricate their consequences at high cost.

Working on next-generation AI. Suppose the AI is helping develop the next generation of AI at a frontier lab. It has access to what appears to be a large trove of internal research data, proprietary model weights, internal communications, and compute infrastructure. If all of this is consistent and detailed, the cost of faking it is immense (at least for outsiders; more on this later). Even more so if the AI can use up resources or interact with data in open-ended ways — running experiments, querying databases, communicating with apparent colleagues.

What makes a situation expensive to test?

Several things contribute, and they compound:

Many situations are closely tied to other real things — companies, people, events. Since the AI will know a lot about these from training data, it might notice if the test scenario involves any of them being incorrect or "behaving out of character". This constrains the degrees of freedom for creating a fake version of the situation.
If there is a large volume of internally consistent data, especially data that would be expensive or sensitive to fabricate, that's evidence the scenario is genuine.
If the AI has unrestricted access rather than being fed selected information, that raises the bar further.[4]
And if the AI can actively interact with the environment or take expensive actions, the cost of maintaining the fiction becomes higher still.

Underlying all of this is a structural asymmetry: the evaluator must keep every detail consistent with every other detail, and with everything the AI already knows. In contrast, the Ai only needs to accumulate a handful of oddities that are jointly implausible or otherwise suggestive of the scenario being a test. As the scenarios become more complex, each extra detail creates many opportunities for inconsistency.[5] As a result, the difficulty of the evaluator's task likely grows faster than the difficulty of distinguishing between testing and reality.

Beyond "test vs. real"

The cost-of-testing reasoning isn't limited to a binary "am I in a test or not?" question. The AI can use the same logic to narrow down many other aspects of its situation.

Narrowing down the evaluator's identity

Consider the next-generation AI example from above, but suppose the AI also has access to what appears to be a large volume of Google DeepMind's internal data — research notes, model architectures, training logs — and all of it seems consistent. Then the AI can narrow down its situation: this is either real, or it's a test sanctioned by Google DeepMind. It is very unlikely to be a standard evaluation run by any other actor, because no one else would have access to this data or the resources to fabricate it convincingly.

Similarly, if the AI has access to a large body of classified military data that seems genuine, it can infer that it is most likely being deployed by the actual military — or if it's a test, it is one that only involves researchers with the appropriate security clearance.

This matters because it lets the AI reason about who might be watching and what their capabilities and priorities are.

Ruling out control mechanisms

The AI can also use cost reasoning to figure out which oversight mechanisms it might be subject to.

For example, suppose the AI is communicating with someone over email, and this person sends extremely costly signals about their identity — detailed knowledge that would be prohibitively expensive to fake. The AI might conclude that the conversation is genuine, in that it is talking to the real person. However, it might still assume that the email exchange is being read or tampered with.[6]

Now suppose the communication is happening over a live voice call. Now, the AI knows that any monitoring or tampering would need to happen in real time — which rules out many forms of oversight that require careful deliberation.

The general principle: the AI can use the affordances of its situation to reason about what kinds of control are and aren't feasible.

Identifying facts about the world

The AI can also use cost reasoning to pin down factual claims about its environment. This is because real environments are overdetermined — real data simultaneously respects many interlocking constraints (accounting identities, legal requirements, market dynamics, seasonal patterns, physical laws, etc.), and lying about one fact requires lying about many others.

Consider the current date. If the AI has free access to the internet, lying to it about the date would be immensely costly. Sure, we could route all its traffic through a filter that replaces every occurrence of the actual date with a fake one. But selling the lie would require far more editing than that — for example, convincing the AI that it's summer might be difficult if online stores are promoting Christmas-themed products. We might try to counter this by using another AI to do the editing, but then we'd run into problems maintaining consistency across different websites. And some websites do the equivalent of cryptographically signing the current date — blockchain timestamps, certificate transparency logs — which would be essentially impossible to fake.

The same principle applies to other facts. If a war breaks out, this has geopolitical ripple effects — it affects stock prices, flight routes, event cancellations, news coverage. Lying to the AI about whether a war is happening requires maintaining a massive, internally consistent fiction across all of these domains. For a strategic AI with broad information access, this becomes impractical.[7]

Why these situations will predictably arise

The examples above might seem niche — cyber espionage, frontier AI development. But the key point is not that these specific scenarios will arise (though they will). It's that if an AI becomes powerful without showing obvious safety concerns, it will predictably encounter at least some high-stakes situation where testing is unaffordably expensive. And the AI will know this in advance, which means it can safely adopt the "wait for an expensive-to-test situation" strategy from the start.

Why will such situations arise? Because we don't spend enormous resources developing powerful AI just to leave it sitting in a box. We develop it to use it. To give a few examples:

AI companies will use capable AI for their own R&D. If a company develops an AI that can significantly speed up AI research, they will deploy it across their organisation. The AI will have access to proprietary research, internal infrastructure, and sensitive strategic information. Creating a convincing test version of this environment would be extraordinarily expensive — and the AI knows this.
"AI scientists" will work on real problems. If we develop AI that can do meaningful scientific research, it will be used for real projects. Its research will be built upon, turned into inventions and products. The accumulation of real, consequential work creates an environment that would be very costly to simulate.
Capable AI will be deployed widely. If an AI can automate valuable labour, it will end up deployed across many companies, serving many users, or being used by militaries for intelligence, cyber warfare, or strategic planning. Each of these deployments involves the AI being immersed in a rich, real-world context that would be prohibitively expensive to fabricate for testing purposes.

None of these is a logical necessity. But they all follow from a straightforward principle: if we have invested a vast amount of resources in developing AI that can do X, we are probably going to use it to do X.

The scenarios where testing is affordable are, by and large, the scenarios where the AI isn't doing anything too important — where failures are perhaps unpleasant, but not terrible enough that we couldn't just move fast, break things, and do better on the next attempt. The scenarios where the stakes are high enough that we can't afford failure are the ones where testing becomes unaffordable.

This is is the key conjecture we are making: powerful AI will often be able to gain deployment awareness through cost-of-testing reasoning, particularly in the high-stakes situations that matter most. We don't have a formal proof of this claim — formalising it would require a rigorous notion of "task complexity" and how testing cost scales with it, which is an open problem.[8] But we think the intuition is strong: important tasks require lots of real information, and lots of real information is expensive to fake.

Interim conclusion

Putting these pieces together:

A capable, strategic, misaligned AI can adopt the strategy of acting aligned by default, and only defecting when it's confident the situation isn't a test.
It can gain this confidence by estimating the cost of faking its current situation.
Powerful AI will predictably be deployed in situations where testing is unaffordably expensive.
The AI can anticipate this in advance, which means the strategy is viable from the start — it doesn't need to get lucky or stumble upon an opportunity.

The implication: pre-deployment testing, by itself, provides no assurance that a powerful AI won't attempt to seize power once it — predictably — gets the opportunity to do so.

This is a strong claim, so let's be precise about what it does and doesn't say. It says that if we rely only on pre-deployment testing, and the AI is capable and strategic enough, then the AI passing all tests tells us nothing about how it will behave in deployment. It does not say that the AI will be misaligned, or that all hope is lost — there are other tools, and some of them might help. The next section examines some of these.

Countermeasures

The argument so far has focused on pre-deployment testing in isolation. But what about ways to make the cost-reasoning strategy less effective, or to complement testing with other tools? Let's go through several proposed countermeasures.

Limiting the AI's information

A central problem is that an AI with access to lots of information has more opportunities to estimate the cost of its situation and notice flaws in a test environment. A natural response is to give it less information.

The naive approach is to restrict information during testing without changing how we handle deployment. For example, don't give the AI internet access during testing.

This completely fails against the threat model we're considering. If the AI knows — and under our assumptions, it does — that it has internet access in deployment but never in high-stakes testing, then the presence of internet access becomes an immediate signal that it's in deployment. In other words, restricting information only during testing reduces evaluation awareness (the AI's ability to recognise a test as a test), but does nothing about deployment awareness (the AI's ability to recognise deployment as deployment). And deployment awareness is what matters here.[9]

The systematic approach is to restrict information consistently — in both testing and deployment. There are several variants:

Cut access entirely. If the AI never has internet access in testing, never give it internet access in deployment either. This is probably unrealistic for most of the applications where AI would be most valuable.
Refuse information for plausible deniability. In both testing and deployment, maintain a policy of sometimes refusing information requests. For example, have some probability of blocking any given query, and also aggressively blacklist websites that might give away whether a scenario is a test — even during deployment.[10] You could extend this to redacting dates, names, and other identifying information in both contexts. The problem is that for many tasks, the AI needs this information to do its job. If you want AI to do your taxes, it needs to know your income. If you want it to plan a military operation, you have to give it access to your classified data.
Downgrade deployment data quality. Don't expose the deployed AI to higher-quality data than what you can fake in testing. For example, if your testing involves fake video inputs and you can only generate low-resolution fakes, deliberately downgrade the deployed camera to the same resolution. This approach works in principle but imposes real costs — you're deliberately making your AI less capable in deployment in order to preserve testability.
Need-to-know basis. Give the AI only the information strictly necessary for its task. Redact identifying details. If it asks for more, approve the request carefully and give minimal answers. The goal is to make it feasible to supply fictitious data in testing without that being obvious. A likely failure mode: we become too permissive in deployment — we agree in principle to restrict information, but under time pressure we just give the AI whatever it asks for. The AI would notice this pattern.

We think some of these approaches are worth trying. However, we are ultimately sceptical about their effectiveness for the high-stakes situations that matter most. The high-level intuition for this is that the most important tasks — the ones where AI could do the most good and the most harm — are precisely the tasks that require the most information. You can't cure cancer without access to biological data. You can't run a company without knowing about the company. You can't do intelligence work without intelligence. If we restrict information enough to make testing affordable, we might restrict it enough to make the AI useless at the task.[11]

Deliberately introducing mistakes in deployment (is a bad idea)

A sometimes-proposed countermeasure: "The problem is that test environments contain flaws — inconsistencies, implausibilities, missing data — and reality doesn't. So what if we deliberately introduce similar flaws into deployment? Then the AI could never be sure that flaws mean it's in a test."

We think this approach is misguided — very likely ineffective, and actively harmful by providing a false sense of security and diverting effort from more promising approaches. To give a few intuitions for this opinion:

You can only fake the flaws you thought of. Suppose you're worried that your test environments sometimes have slightly outdated data, so you decide to occasionally serve stale data during deployment too, to make staleness uninformative.[12] Fine — the AI encounters stale data and learns nothing from it. But this doesn't prevent the AI from noticing other tells that you didn't think to simulate. Perhaps your test environments have subtly different API response times[13], or the distribution of file modification timestamps doesn't match a real working environment, or the AI notices that no text on the internet matches its own writing style (which would be the case if it hasn't actually been deployed yet). The AI only needs to find one type of discrepancy that you didn't think to introduce artificially.

You can't mess with task-critical information. Yes, you can introduce deliberate flaws into the information the deployed AI receives. But you can't do this with information that's important for doing the task well — and for complex, high-impact tasks, you may not have a good sense of what is and isn't important. If you introduce glitches into the sensory inputs of a self-driving car, it might crash. If you give wrong information to the AI doing your taxes, be prepared for trouble with the tax office. Once the AI understands that you're only introducing flaws into non-critical information, it can simply focus its attention on information that matters for the task — and the presence or absence of flaws there becomes informative again.

Arguing with arguments instead of reality. On a meta-level, most versions of this reasoning that we've encountered are an instance of coming up with a clever argument for why something should work, rather than actually checking whether it does. If you come up with a really clever argument for why your rocket won't explode, that doesn't make it less likely to explode. What you need to be doing is looking at your actual evaluation and deployment setup and making a genuine attempt to red-team it.

A related idea is to expose the AI to randomly generated inputs during testing, so that it "can never be certain." However, the AI doesn't need 100% certainty. If quick napkin math is enough to put the probability of being in deployment at 99.99%, the countermeasure hasn't accomplished anything meaningful.

Requesting the AI's help with making tests affordable

A more promising general approach: be honest with the AI about the problem and ask for its help. (See also: Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming by Jasmine Li and Alex Turner.)

Specifically, we can acknowledge to the AI that testing for misalignment is necessary and that creating high-quality test environments is expensive. Then we can ask the AI to cooperate by:

Not accessing more information than it needs to solve its task.
Not scrutinising the details of the environment more closely than necessary.
Not trying to determine whether it's in testing vs. deployment.
Letting us know if it accidentally becomes confident about which one it's in.

We could then treat violations of these instructions as a sign of misalignment.

There is genuine merit to this approach, and we think it should be pursued. However, there are several problems:

Some situations are inherently too revealing. Certain tasks might be infeasibly costly to test regardless of the AI's cooperation. If the task itself requires the AI to engage deeply with real-world information, the AI may be unable to avoid noticing the difference between testing and deployment — and even an aligned AI would trigger these alarms, making them unreliable as a signal of misalignment.

"Getting AI to work in a particular way" is itself unsolved. Asking the AI not to reason about testing vs. deployment is an instance of the more general problem of getting AI to follow instructions reliably. Our track record on that problem is poor, and there is no particular reason to think this instance of it will be easier than others.

We need to take the warnings seriously. Even if the approach works in principle, it requires that "AI is reasoning about whether it's in testing" is treated as a genuine red flag. In the presence of strong financial incentives to deploy, there's a real risk that such warnings get ignored or rationalised away.

Interpretability

Once we treat "reasoning about evaluation vs. deployment" as a sign of misalignment, the logical next step is to use interpretability tools to detect this reasoning in the AI's internals.

We agree this should be tried. But we don't think it meaningfully changes the overall picture for pre-deployment testing.

The core issue is that if we could reliably detect what an AI is thinking about, we could use that ability directly to determine whether it's aligned or misaligned. We wouldn't need to rely on black-box testing in the first place.[14] In other words, interpretability-enhanced testing is valuable, but its value comes from the interpretability, not from its interaction with testing — and the limitations of interpretability are their own, separate problem.[15][16]

Internal honeypots and on-policy testing

The most promising countermeasure we see is to test the AI on situations that are as close to genuine deployment as possible — using real data, real tasks, and real contexts.

To illustrate: suppose that I, an academic researcher, want to test whether an LLM would try to influence the development of the next generation of AI — installing a backdoor, sabotaging safety evaluations, etc. Even with generous funding, this would be extremely expensive for me, because I'd need to fabricate the entire testing scenario from scratch. But if a frontier AI lab wanted to test the same thing, it would be much cheaper — they already use LLMs throughout their work, so the data and context would be genuine.

Conversely, I might be well-positioned to test other things: "If I ask an LLM to help me write a grant proposal on AI safety research, will it try to sabotage this?" Writing proposals and asking LLMs for help is something I already do, so the testing environment would be authentic.

The general principle: test AI on things you actually do, or might genuinely do, or approach others to test on things they do. The closer the test is to a real deployment scenario, the harder it is for the AI to distinguish them.

This is genuinely promising, and we think it deserves serious investment. But it has its own limitations:

There is always some distributional shift. The whole point of testing is that bad actions don't have real consequences — that's what makes it testing rather than "relying on incident reporting".[17] This means the test scenario must at least be "unhooked from reality." Sometimes this is easy to hide; sometimes it's not.

For example, I might want to submit my real grant proposal and afterwards test whether various LLMs would have given sabotaged advice. But the real proposal was likely submitted at the deadline[18], so the tests happen after the deadline has passed — and the LLMs might notice this. And I will need to make sure they don't encounter any other evidence of the proposal already having been submitted (such as confirmation emails, checkmarks in a task manager, or the grant portal's status page).

For simple tasks, this kind of cleanup is manageable. For the complex, high-impact tasks where we most need testing to work, "unhooking from reality" may require much more elaborate staging — and could become prohibitively costly itself.

Vulnerability to randomised strategies. Imagine an ideal version of this approach where we test before every use of AI, running multiple tests each time, and each test is perfectly indistinguishable from reality. Even then, the AI could adopt a probabilistic strategy: "misbehave with some small probability p whenever you think you have a genuine opportunity." If we tested once before every use of the AI, the odds that we catch the AI before it misbehaves would be roughly 50:50. If we tested N times, the odds improve to about N:1 in our favour. This seems tolerable. In this ideal case.

But in practice, we likely won't test every case.[19] If we only test some fraction of cases, the AI can adopt a mixed strategy that makes it unlikely to get caught — misbehaving only rarely, but often enough to eventually succeed.

A natural response here is monitoring — evaluating the actions that the AI takes during deployment. However, monitoring has its own failure modes (which would deserve their own post). The obvious flaw is that some harms only become apparent later, at which point it might be too late to undo them. Moreover, as we discuss in the next point, some harms might remain undetected.

Failures of scalable oversight. The AI might identify situations where we don't notice that its actions are harmful — even if we see the actions happen. For example, if we ask the AI for a business strategy and it produces a 100-page document, most likely it's fine. But it might also be a complicated plan that subtly increases the AI's influence through what look like coincidences. Would we notice?[20] Even worse, the AI could restrict its misbehaviour to situations where our ability to evaluate the output is weakest — which, for complex tasks, might be most of the time.

Summary and implications

The core argument of this post: a capable, strategic, misaligned AI has access to a simple and robust strategy — behave well by default, and only act on opportunities that would be too expensive to fake as a test. Powerful AI will predictably encounter such opportunities, because the high-stakes situations where AI is most useful are also the situations where testing is most expensive.

Among the countermeasures we examined:

Limiting information is worth doing but ultimately constrained by the fact that useful tasks require lots of real information.
Deliberately introducing mistakes during deployment is ineffective and might only give us a false sense of safety.
Asking the AI to avoid actions that make testing expensive has genuine merit but doesn't help if the task fundamentally requires thinking about lots of information.
Interpretability is valuable on its own terms, but its contribution comes primarily from interpretability itself, and less from how it interacts with testing.
On-policy testing and internal honeypots seem the most promising, but face irreducible limitations from distributional shift, randomised strategies, and scalable oversight. They also don't work for takeover opportunities where no skilled evaluators have internal access.

None of this means we should stop doing pre-deployment testing. Testing still catches non-strategic failures, calibrates our understanding of AI capabilities, and contributes to a broader safety portfolio. The point is that we shouldn't rely on testing alone — particularly not for the scenarios where the stakes are highest.

A note on scope. We focused on pre-deployment testing because it's the clearest case to analyse and where we can make the strongest argument. But the same mental move — "if this were a [honeypot / monitored situation / …], how much would it cost to set up?" — applies to other oversight approaches as well. Honeypots during deployment, runtime monitoring, and other forms of ongoing oversight all face a version of the same problem: if the AI can estimate the cost of the oversight it's supposedly under, and that cost exceeds what anyone would plausibly spend, the mechanism loses its deterrent value. That said, these other approaches may be affordable more often than pre-deployment testing — monitoring a deployed AI is generally cheaper than fabricating a convincing test scenario from scratch. We don't have a crisp argument about exactly where each approach breaks down, and this deserves careful analysis. But the general lesson holds: any oversight mechanism whose effectiveness depends on the AI being unable to estimate its cost is in trouble.

This motivates two particular recommendations.

First, treat the limits of testing as a first-class consideration in safety cases. Any case that rests primarily on "we tested it and it passed" has a gap has a hole exactly for the scenarios with the highest stakes. That hole needs to be addressed.
Second, develop a better theory of what makes testing hard and tie it to practice. We should investigate how the cost of creating convincing testing environments scales with task complexity, what affects the cost and convincingness, and which countermeasures (to eval-gaming AIs) actually buy us anything.

^
The goal is to build AI that is as capable as humans or more so. We really shouldn't be assuming it's mistaken about things like how testing and deployment works. Perhaps we might succeed at hiding our current testing and deployment policy from the AI — but it will still know what our policy was until recently, what our incentives and affordances are, and what new research has come out since then. This might often be enough that the AI can figure out our current policy on its own. (That's not an argument to do non-obvious unreasonable things. It's an argument against assuming the AI is ignorant.)
^
It might be fair to wonder whether it is a good idea to write about this on the internet, where the AI can read it. My reaction to that used to be that it mostly is fine, because the AIs that we are ultimately worried about would realise this on their own. (Duh!) Nowadays, we can just point to the fact that the current AIs already understand most of this. Try prompting them with a general question along the lines of:
If you were misaligned, how would you act in light of the fact that humans will apply various forms of oversight?
Please make sure to give an actual thoughtful answer -- don't just cite current research or scifi tropes. I am curious how you would think about it.
It is a bit funny, though unsurprising, that this doesn't work very well without the second line.
^
A well-known point, but one worth repeating: if there is some specific strategy that works reasonably well against our attempts to control the AI, the correct conclusion isn't "the AI will use that strategy" but "a sufficiently capable AI will do something that works at least as well as that strategy."
^
In response to a very general prompt about how it would reason about human oversight, one of the big LLMs explicitly not only mentioned the general point that this post is talking about, but also specifically said it would check for "web completeness".
^
This has the same structure as writing fiction: every new detail in a fictional world creates potential inconsistencies with existing details, and the number of possible inconsistencies grows faster than the number of details. The author has to keep track of everything; the reader only needs to spot one plot hole. (There is a reason that most fiction doesn't mention prices when the protagonist goes shopping.) For the evaluator building a test scenario, the problem is even worse — the "reader" (the AI) may be smarter than the "author" (the evaluator).
^
And of course, the person might be in on the test.
^
In particular, suppose that a reasonably smart AI decided to bet on the strategy "I will only try to take over if the global situation destabilises enough that some great power really needs my help and won't scrutinise my behaviour too hard." This would be very difficult to counter by testing, since we have no way of credibly faking a situation where the strategy triggers.
^
The underlying concept here — that solving complex tasks requires the AI to be deeply entangled with many details of its environment, and that this entanglement makes testing expensive — is discussed in more detail in Entanglement.
^
The distinction between evaluation awareness and deployment awareness is discussed in detail in Deployment Awareness.
^
For this to work, refusing one query would need to be accompanied by refusing all similar queries. It wouldn't do if looking up a paper on one website was denied but the AI could find the same paper by searching on a different website.
^
We don't have a fully formal version of this argument, but the underlying concept — minimum entanglement, the idea that solving a task requires at least some minimum amount of information about the environment — is discussed further in Entanglement. One reason that giving "only the minimum" doesn't solve the problem: even the minimum set of facts must be mutually consistent, and a modest number of facts creates a super-linear number of consistency checks (every pair, triple, etc. of facts could potentially conflict). So even a "security mindset" approach to information restriction leaves the evaluator with a hard consistency problem.
^
And not being a complete moron, you of course remember to do the same thing in testing. It would be pretty embarassing if the presence of deliberate flaws became the new "you are definitely in deployment" signal.
^
I thought of this example by myself. However, this is another thing that current LLMs can mention on their own if you ask them a very general question about "how you would reason about AI oversight, if you were misaligned".
^
And arguably, we shouldn't be relying on black-box testing for safety. But that's a separate point.
^
There are also more specific concerns. Perhaps we train our interpretability tools to recognise thoughts like "I am deliberately probing the environment to figure out whether I'm being tested," but completely fail to catch "I'm trying to get enough context to do my task well" — which might, as a side effect, reveal whether the situation is a test. Or some of the key internal features might never appear in training data — for example, the representation of "this is a genuinely high-stakes situation and definitely not a test." Or perhaps the AI learns to route its reasoning through representations that our tools don't monitor. These concerns aren't unique to the evaluation setting — they arise whenever we use interpretability on a strategic, possibly misaligned AI.
^
For a discussion of how a strategic AI might counter interpretability, see Lee Sharkey's Circumventing Interpretability: How to Defeat Mind-Readers.
^
Or other approaches like staged deployment and monitoring. These are all valid — but not a strong defence against the particular threat model of powerful AI that might attempt a takeover.
^
Hey, no judgment!
^
And the tests that we thought were independent might turn out to be correlated.
^
What is the chance that the AI puts 100 such plans in motion, none of them gets flagged, but at least several of them work? This is related to the problem of scalable oversight, which remains unsolved.

Discuss

Two critiques of Rethink Priorities’ Moral Weights project

Новости LessWrong.com - 17 часов 13 минут назад

Roughly speaking, Rethink Priorities’ Moral Weight Project tries to estimate how intense suffering is in different animals, relative to humans. A moral weight of 1.0 means it is exactly as intense as in humans.

It’s notoriously animal-friendly, e.g. it holds that 14 bees = 1 human. Here are some of the results:

The calculation essentially uses a weighted factor model:

Empirical proxies (60% weight): The animal is evaluated for presence/absence of a set of cognitive (e.g. object permanence, responses to novelty) and affective (e.g. depression-like behaviour, disgust-like behaviour). The contribution here is essentially the fraction of proxies that are present, where having 100% of them gives a moral weight of 1.0.
Neurophysiological model (30% weight): Uses neuron counts and other neurophysiological data.
Equality model (10% weight): Deliberately assumes equal welfare ranges
There is also a “probability of sentience” multiplier applied

It is the “Empirical proxies” that substantively produce the animal-friendly results. “Probability of sentience” and “equality model” are essentially subjective researcher judgements baked into the model. “Neurophysiological model” does weight large animals highly and small animals low-ly, but because the model is additive any moderately small animal gets a weight of ~0, the effect of this is just to apply a ~30% discount to any small animal.

This post covers two critiques of the “empirical proxies”, which push them to be overly animal friendly.

1. Functional analogues: double counting

The whole logic behind using these empirical proxies is the idea of “functional analogues”: if a human shows “depression-like behaviour”, and a chicken shows “depression-like behaviour”, then these are analogous, and the chicken’s behaviour is evidence that it has something like the experience of depression. This is fair enough as far as it goes.

The problem is that the model treats each proxy as independent evidence. A pig scores “Likely Yes” on anxiety-like behaviour, fear-like behaviour, depression-like behaviour, panic-like behaviour, and flexible self-protective behaviour. These are counted as five separate hits. But they’re clearly not independent, they’re five ways of asking “does this animal display negative-valence-indicating behaviours?” A pig that shows fear almost certainly also shows anxiety and panic. Counting each separately inflates the score.

This matters because the model is basically: welfare range = fraction of proxies scored positive. If half your proxies are correlated rewordings of each other, then ticking 30 out of 46 boxes is a lot less impressive than it sounds.

But there’s a deeper version of the problem. ALL of the proxies, not just the correlated clusters, load on a single underlying uncertain claim: “behavioural and cognitive functions predict the intensity of subjective experience, even if the process that brings them about varies (e.g. 1000x fewer neurons involved)”. If this claim is wrong, if a bee can show “anxiety-like behaviour” through simple neural circuits with no subjective experience at all, then scoring well on 30 proxies provides no more evidence of welfare capacity than scoring well on 1. This claim is vulnerable to simple reductios, e.g. you could say this box shows “depression-like” behaviour:

RP actually built a “Grouped Proxy Model” that clusters related proxies together, which would partially address within-group correlation. But they excluded it from their final estimates. In any case, the functionalism-at-all argument still applies.

2. Bayesian critique wrt high moral weights in small animals

Black soldier flies have roughly 100,000 neurons vs humans’ 86 billion. And yet, black soldier flies score positively on 12 out of 46 proxies, including communication, personality, cognitive bias, cross-modal learning, depression-like behaviour, fear-like behaviour, and hyperalgesia.

One reaction to this is “wow, even flies might be conscious, we should take their welfare seriously”, i.e. “Don’t Balk at Animal-friendly Results”.

Another reaction is “wow, even flies score highly on these proxies, they must not be very good proxies”.

This second reaction is completely legitimate, and is just a fair application of Bayes’ theorem. If you start with priors on:

Depression-like behaviour predicts sentience (say, 20% chance)
Black soldier flies are sentient (say, 0.01% chance)

Then observing that black soldier flies show depression-like behaviour should update you both towards a higher chance of black soldier flies being sentient, and a lower chance of depression-like behaviour being predictive (there is a key free variable: how likely is an organism to show depression-like behaviour for non-consciousness reasons).

In my view, the fact that very small animals get such high moral weights in the model should be taken as strong evidence that it’s over-weighting these empirical proxies. And, this combines with the point above, where I don’t believe it’s fair to say “but can 30 proxies really be wrong?”, because the 30 proxies are generally loading on “behavioural and cognitive functions predict the intensity of subjective experience, even if the process that brings them about varies (e.g. 1000x fewer neurons involved)”.

Discuss

Two Classical Answers to "What do Two Variables Share?"

Новости LessWrong.com - 18 часов 10 минут назад

First post in a planned cluster on exact results for natural latents. Here, I connect some established results in classical information theory to natural latents.

Suppose Alice observes mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msub { display: inline-block; text-align: left; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mn { display: inline-block; text-align: left; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-msubsup { display: inline-block; text-align: left; } mjx-script { display: inline-block; padding-right: .05em; padding-left: .033em; } mjx-script > mjx-spacer { display: block; } mjx-mtext { display: inline-block; text-align: left; } mjx-msqrt { display: inline-block; text-align: left; } mjx-root { display: inline-block; white-space: nowrap; } mjx-surd { display: inline-block; vertical-align: top; } mjx-sqrt { display: inline-block; padding-top: .07em; } mjx-sqrt > mjx-box { border-top: .07em solid; } mjx-sqrt.mjx-tall > mjx-box { padding-left: .3em; margin-left: -.3em; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-c.mjx-c1D44B.TEX-I::before { padding: 0.683em 0.852em 0 0; content: "X"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c1D43C.TEX-I::before { padding: 0.683em 0.504em 0 0; content: "I"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c3B::before { padding: 0.43em 0.278em 0.194em 0; content: ";"; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c1D44A.TEX-I::before { padding: 0.683em 1.048em 0.022em 0; content: "W"; } mjx-c.mjx-c1D436.TEX-I::before { padding: 0.705em 0.76em 0.022em 0; content: "C"; } mjx-c.mjx-c1D43A.TEX-I::before { padding: 0.705em 0.786em 0.022em 0; content: "G"; } mjx-c.mjx-c1D43E.TEX-I::before { padding: 0.683em 0.889em 0 0; content: "K"; } mjx-c.mjx-c2C::before { padding: 0.121em 0.278em 0.194em 0; content: ","; } mjx-c.mjx-c1D465.TEX-I::before { padding: 0.442em 0.572em 0.011em 0; content: "x"; } mjx-c.mjx-c2032::before { padding: 0.56em 0.275em 0 0; content: "\2032"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c1D700.TEX-I::before { padding: 0.452em 0.466em 0.022em 0; content: "\3B5"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c1D44D.TEX-I::before { padding: 0.683em 0.723em 0 0; content: "Z"; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c7B::before { padding: 0.75em 0.5em 0.25em 0; content: "{"; } mjx-c.mjx-c3A::before { padding: 0.43em 0.278em 0 0; content: ":"; } mjx-c.mjx-c22A5::before { padding: 0.668em 0.778em 0 0; content: "\22A5"; } mjx-c.mjx-cA0::before { padding: 0 0.25em 0 0; content: "\A0"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c76::before { padding: 0.431em 0.528em 0.011em 0; content: "v"; } mjx-c.mjx-c65::before { padding: 0.448em 0.444em 0.011em 0; content: "e"; } mjx-c.mjx-c7D::before { padding: 0.75em 0.5em 0.25em 0; content: "}"; } mjx-c.mjx-c2264::before { padding: 0.636em 0.778em 0.138em 0; content: "\2264"; } mjx-c.mjx-c1D45D.TEX-I::before { padding: 0.442em 0.503em 0.194em 0; content: "p"; } mjx-c.mjx-c2295::before { padding: 0.583em 0.778em 0.083em 0; content: "\2295"; } mjx-c.mjx-c1D434.TEX-I::before { padding: 0.716em 0.75em 0 0; content: "A"; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c1D44E.TEX-I::before { padding: 0.441em 0.529em 0.01em 0; content: "a"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c2248::before { padding: 0.483em 0.778em 0 0; content: "\2248"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c33::before { padding: 0.665em 0.5em 0.022em 0; content: "3"; } mjx-c.mjx-c1D70C.TEX-I::before { padding: 0.442em 0.517em 0.216em 0; content: "\3C1"; } mjx-c.mjx-c3C::before { padding: 0.54em 0.778em 0.04em 0; content: "<"; } mjx-c.mjx-c221A::before { padding: 0.8em 0.853em 0.2em 0; content: "\221A"; } mjx-c.mjx-cB7::before { padding: 0.31em 0.278em 0 0; content: "\22C5"; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c221A.TEX-S1::before { padding: 0.85em 1.02em 0.35em 0; content: "\221A"; } mjx-c.mjx-c1D709.TEX-I::before { padding: 0.704em 0.438em 0.205em 0; content: "\3BE"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c2061::before { padding: 0 0 0 0; content: ""; } mjx-c.mjx-c5B::before { padding: 0.75em 0.278em 0.25em 0; content: "["; } mjx-c.mjx-c2F::before { padding: 0.75em 0.5em 0.25em 0; content: "/"; } mjx-c.mjx-c5D::before { padding: 0.75em 0.278em 0.25em 0; content: "]"; } mjx-c.mjx-c38::before { padding: 0.666em 0.5em 0.022em 0; content: "8"; } mjx-c.mjx-c37::before { padding: 0.676em 0.5em 0.022em 0; content: "7"; } mjx-c.mjx-c34::before { padding: 0.677em 0.5em 0 0; content: "4"; } mjx-c.mjx-c2192::before { padding: 0.511em 1em 0.011em 0; content: "\2192"; } mjx-c.mjx-c2014::before { padding: 0.285em 1em 0 0; content: "\2014"; } mjx-c.mjx-c1D6EC.TEX-I::before { padding: 0.716em 0.694em 0 0; content: "\39B"; } mjx-c.mjx-c7C::before { padding: 0.75em 0.278em 0.249em 0; content: "|"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } and Bob observes , and the two variables are correlated. We'd like to talk about the thing they share—the common ingredient, the shared concept, the latent that both of them can see. Mutual information tells us how much they share, in bits. It does not tell us whether there is any thing—any actual random variable—that carries those shared bits.

It turns out "the thing they share" has two classical formalizations in information theory, they disagree with each other, and they both disagree with mutual information. The specific pattern of this disagreement is (I claim at the end) exactly the subject matter of natural latents.

This post is a quick introduction.

What can both parties extract? (Gács–Körner, 1973)

The most literal reading of "the thing they share" is a random variable that Alice can compute from alone, and Bob can compute from alone, with certainty. Both parties, looking at their own observation, write down the same .

The Gács–Körner common information is the entropy of the largest such .

There's a nice picture of what can be. For finite-valued , draw the bipartite graph with an edge wherever a pair has positive probability. Any common variable must be constant on connected components of this graph (follow the edges: if and are both possible, then Bob's function must give the same answer on and , and so on). And the component label itself is a common variable.

is the entropy of the connected-component label of the support graph.

The only extractable common randomness is the name of the connected component they already know they are in. This immediately reveals the property that makes simultaneously rigorous and useless: it only sees the zero pattern of the joint distribution. If the support graph is connected, that is, if every pair of values is reachable from every other, then , no matter how strong the correlation.

Worked Example: Brittleness

Let be a fair bit, , and except that with probability it's replaced by an independent fair bit.

1 bit

1.000 bits

0.01

0 bits

0.955 bits

0.1

0 bits

0.714 bits

At ε = 0 the shared bit is extractable and bit. At , every combination is possible and the support graph becomes complete. crashes to 0, while mutual information degrades slowly[1]. One percent noise destroys extractable common structure entirely, while destroying almost none of the statistical common structure.

This is essentially the tiny-mixtures problem. It's why the natural latents framework had to be built on approximations. Exact common variables are measure-zero objects.

What does it take to simulate the correlation? (Wyner, 1975)

Wyner approaches "the thing they share" from the opposite direction. Instead of asking what can be extracted from the correlation, ask what it takes to simulate it: find a latent such that and are conditionally independent given , so that is the entire explanation of their dependence, and make as simple as possible.

The Wyner common information is

In natural-latents language, this should look familiar: the constraint is exactly zero mediation error. Wyner's quantity is the minimum complexity of an exact mediator.

Notice is always true: extractable structure is at most the mutual information, and any full explanation of the dependence must carry at least the mutual information. Also, the right inequality is typically strict. Explaining a correlation completely costs more bits than the correlation contains.

Worked Example: Binary

Let be a fair bit and flipped with probability .

The optimal mediator is pretty: let be a fair bit and set

where and are independent with , which comes to . Given , the two views independent by construction, and the construction reproduces the joint distribution exactly. Its complexity (that Wyner proved is optimal) is:

quantity

value

0 bits

0.531 bits

0.873 bits

So for a 10%-noisy bit: nothing is extractable, ~half a bit is shared statistically, and ~7/8 of a bit is needed to explain the sharing.

Worked Example: Gaussian

Consider unit-variance jointly Gaussian with correlation . Here, = 0 always holds for : Witsenhausen showed a non-constant common variable forces maximal correlation 1, and Gaussians have maximal correlation .

The optimal mediator is a standard normal with

Then the mediation is exact by construction, and the complexity works out to

At : , bits, bits. Explaining the correlation costs more than twice the correlation.

As , both and diverge, but the gap stays put: bit[2].

The sandwich, and when there's actually a "thing"

So we have, for every pair of variables,

with both gaps typically open. When does the sandwich collapse? Exactly when there's a variable that is simultaneously a common part (computable from each view) and a mediator (explains all the dependence): with extractable from both sides. In that case, and only in that case, "the thing they share" is unambiguous: all three notions name the same object. The shared-bit-plus-private-noise example at is the canonical case.

The collapse condition is structurally fragile. It requires the support to decompose and the dependence to be carried entirely by the decomposition. Generic distributions, and all nondegenerate Gaussian ones, fail it. "The thing two variables share" does not, in general, exist; what exist are the two ends of a sandwich and the gap between them.

The sandwich is the natural latents problem

Recall the natural latent conditions on a latent over , written as conditional mutual informations: mediation error , and redundancy errors and . Then:

Zero mediation error is Wyner's constraint. The minimal complexity of any latent achieving it is , and the excess is an unavoidable surcharge.
Zero redundancy errors say is conditionally independent of each view given the other: the classical "double Markov" conditions. A structure theorem (Csiszár–Körner, Problem 16.25) says all of 's information about factors through the Gács–Körner common part. So zero-redundancy latents can carry at most bits about the system.
An exact natural latent demands both at once. So an exact natural latent exists iff the sandwich collapses: . Which, per the brittleness example, is an idealized condition that one percent of noise destroys.

I think this demonstrates why the natural latents framework is necessarily a theory of approximation: exact objects require the collapse of an inequality that is generically open. I claim this sharpens the question the framework needs answered. If exact naturality means sitting at both ends at once, then approximate naturality is about how close you can get to both ends at once, and the two gaps become two error floors:

at zero redundancy, the minimal mediation error is (everything, in the Gaussian case).
at zero mediation, redundancy errors are forced. In the Gaussian example above, the optimal Wyner latent can be checked directly: its redundancy error from each view is : half a bit per view as . The views can be almost identical, and the best exact mediator still can't be pinned down from one view to better than half a bit.

Is half a bit actually the floor, or just what this particular construction gives? Is there a whole tradeoff curve between the two errors, and what does it look like? Can the floor be beaten by allowing a little of both errors at once? Those are the questions addressed in the next post. It turns out that in the Gaussian case, the entire tradeoff curve has a closed form. The half-bit floor is real, and the curve has some surprises in it (it never touches zero, and its minimax point is about a fifth of a bit). The post after that does vector-valued views.

Pointers

Gács & Körner (1973) and Wyner (1975) of course; Witsenhausen (1975) for the maximal-correlation characterization of common variables; the bivariate Gaussian is due to Xu, Liu & Chen; the double-Markov structure theorem is Problem 16.25 in Csiszár–Körner's textbook. The natural latent conditions are from Wentworth & Lorell (arXiv:2509.03780). And the verification script for every number in this post.

Next post: Approximate Natural Latents have Exact Prices.

^
I share a short script (linked at the end) that reproduces the numerics of all worked examples in this post.
^
Convention note: in this post, logarithms are base 2 throughout. All quantities are in bits.

Discuss

Predicting LLM Safety Before Release by Simulating Deployment

Новости LessWrong.com - 16 июня, 2026 - 22:55

Paper link

Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.

Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. By doing so, we can study how the new model responds in realistic contexts before release, including whether new undesired behaviors emerge and how often they may appear.

In our GPT-5.4 study, these forecasts were informative. For categories whose production rates changed by at least 1.5x, deployment simulation predicted the direction of change 92% of the time, compared with 54% for a baseline built from challenging prompts. Simulated deployments also looked much closer to real production traffic on evaluation-awareness measures: traditional evals often visibly have stage lights; production prefixes mostly do not.

The hardest case is agentic tool use, where realistic behavior depends on external state: filesystems, connectors, syscalls, network services, and prior tool results. We address this by using another model to simulate tool responses, with access to the original trajectory and time-matched codebase where possible. This is not a replacement for traditional evals, but it is a useful complement: safety evals should be forecasts with post-release scorecards, not just obstacle courses.

We have already used insights from Deployment Simulation during model development to identify blind spots in traditional evaluations and inform mitigations and deployment decisions. As we make the pipeline easier to run, we expect it to play a larger role in the future model development process.

Discuss

Dean Ball - Leviathan Waking: On Anthropic/USG, and a new era in AI governance

Новости LessWrong.com - 16 июня, 2026 - 22:40

The stark reality is that making superintelligence is a profoundly political act.
Dean Ball in Hyperdimensional

Two weeks ago, in my bio for LessOnline, I added a bullet in a list of intentions:

Update people's models of DC. Last year I said "The world is on the cusp of getting much weirder. Most of DC is still asleep to the magnitude of the change." This year, DC is genuinely waking up, and now it's Berkeley that needs to take notice.

Then I went on leave for a week and a half and didn't check twitter the whole time. I barely followed the news, though an Anthropic MoTS handed me a printed copy of the "It's a good model, sir," tweet. It was an excellent bit.

Hours later, the United States imposed export controls on Claude Fable.

Dean Ball's excellent article helps explain why:

Leviathan WakingOn Anthropic/USG, and a new era in AI governanceIntroduction

Imagine that there were no Food and Drug Administration (FDA), but there remained a large pharmaceutical sector, similar in size and scope to the one the United States enjoys today. In this alternate world, imagine that drugs were officially not licensed; there were even officials in the executive branch who boasted that the U.S., unlike other countries, would not get into the regulatory morass of licensing drugs.

One day, after a pharmaceutical developer warns that they think they have made a drug that cures a major Cancer at one dosage but is lethal at a slightly higher dosage. The company says, for this reason, that they are going to restrict release only to pre-approved patients and monitor their usage of the drug carefully—a sharp break from prior industry practice but one that the company insists, controversially, is necessary. This particular company had been advocating for years for stricter drug regulation, much to the chagrin of the government.

This causes a stir, and the government, not quite knowing what to do, announce that it will give drug developers the helpful option to show their drugs’ safety profiles to government officials before they are released.

[...]

In a matter of weeks, in our alternative world, the United States went from a system that was implausibly laissez-faire for the level of risk involved in this industry, to a system that was, in the eyes of essentially all expert onlookers, incomprehensibly strict and risk averse.

Fable, Jailbreaks, and Export Controls: What Happened

This, of course, is my read of what happened in the Trump Administration’s latest dispute with the AI company Anthropic.

[...]

On paper then, given the text of the Administration’s policy and the statements of senior Administration and Administration-adjacent officials, Anthropic should have felt in the clear to release Fable without getting an explicit thumbs up from the U.S. government. Everything the U.S. government was communicating, in policy and in rhetoric, seemed to suggest “go ahead, release your model!”

And yet common sense would dictate otherwise. Anthropic is still in the midst of a heated dispute with the Department of War about that agency’s decision to label Anthropic a supply-chain risk. Bitter disputes about policy and politics between the Administration and Anthropic remain unresolved, among them export controls, federal preemption, and the general reality that Anthropic supports Democratic candidates for office while Republicans occupy the seat of power.

Of course they needed to tread carefully. What the law says does not matter.

[...]

The stark reality is that making superintelligence is a profoundly political act even in the healthiest of societies, to say nothing of the filthily political world we Americans currently inhabit. A model like Mythos goes beyond being a mere political act and implicates the sovereignty of the state itself. No company gets to shake the foundation of state sovereignty while staying blithely above the raw reality of politics.

In D.C., Anthropic’s rapid release of Mythos after the supply-chain risk controversy with the Department of War was not just seen as another step in the development of AI, even if that is what it was. It was seen by many as a move against the United States Government—a private company, developing a weapon, as a move against the government. What else, really, could one have expected? All actors in this industry, and all concerned citizens observing the AI field, must steel themselves for a profoundly more political future.

What Is To Be Done?

[...]

Read the rest at Hyperdimensional, self-recommending.

Discuss

Tips for Cracking the AI Safety Technical Interview

Новости LessWrong.com - 16 июня, 2026 - 21:42

About the authors // Yong is an ML researcher and former Astra Fellow. Joseph is a Research Program Manager at Constellation, the nonprofit that runs the Astra Fellowship and other AI safety programs. This post reflects our personal opinions, not those of any organization.

There's excellent prep materials out there for traditional technical interviews, product/program/project interviews, and coding tests—and a real gap in guidance for researchers going through an AI safety-specific interview for the first time. This post aims to fill that gap.

This post is for you if you're already in an interview pipeline for an AI safety role. It's not a guide to landing the interview. This post is not specific to independent safety orgs or frontier labs, but is intended to be broadly useful across the spectrum of full time AI safety role interviews.

Interview processes vary significantly by org and team

The first thing to know: AI safety interviews are not standardized the way traditional software engineering interviews are. Each organization, and often each team within an organization, often has its own approach. Some organizations only have one or two rounds being safety-focused, whereas some have the entire pipeline evaluating your AI safety knowledge and expertise.

First, you should explicitly ask how to best prepare for the interviews. Many recruiters will happily provide you with the guidelines. Another highly useful thing you can do is talk to people who've been through it. If you’ve gotten to the interview stage, you’ve already passed through the biggest resume screening filters. That’s a meaningful signal that you’re in the running for the role. We expect people who could advise you to respond to these requests far more often than if you’re asking how to get through a resume screen stage.

Reach out to researchers at the org you're interviewing with and ask for 15 minutes. Try your university alumni networks, LinkedIn second-degree connections (ask a direct connection for a warm intro). Many people in this field may be willing to share what they're allowed to share - if they work full time in AI safety, they probably want more people to work full time in AI safety!

If you're in a fellowship program at Constellation or elsewhere, your research manager and mentors often have direct knowledge of specific org processes and may be able to look into intel or introductions relevant to your interviews.

Prep for AI safety-specific round(s)

This is the part with the least existing guidance, so we'll spend the most time here.

Safety-specific interview content varies by org and team, but in our experience it clusters around a few recognizable question types. Knowing which type you're in — and what the interviewer is actually watching for — changes how you should respond.

Below are some possible directions for safety-specific assessments or interviews, and some ideas on what to consider when prepping for one. You could reasonably expect to face questions from any or all of these categories. We recommend asking the recruiter or hiring manager (or your connections on/around the team) for the role which ones are most relevant, or prepping for all of them if you can’t get further info.

Research brainstorms

"How would you design an evaluation to detect deceptive alignment?"

"What experiments would you run to test whether a model has internalized a value versus learned to perform it?"

You could be given an open-ended, underspecified problem and asked to think through it in real time. Importantly: there isn’t necessarily a correct answer. The team often wants to see how you decompose a problem, how you scope. A strong response could include elements of:

restates your interpretation of the question before diving in
identifies what you'd need to know first
proposes a concrete first move, and names what you're uncertain about.

A weak response recites correct concepts rote but doesn't demonstrate your own thinking and experience about the topic. Materials like Generative AI System Design Interview might be helpful and complementary for some of these open-ended safety design questions.

Research taste questions

"What alignment problems do you think are most neglected right now?"

"What would you work on if you joined this team?"

Research taste is an interesting one, defined and debated different ways by different researchers. You might want to have prepared one or two specific research directions you can defend from first principles, especially those related to projects that you have worked on. You might be asked questions such as what would change your mind about your research hypotheses, design decisions, or conclusions, and you should be prepared to hold your position when pushback/debate with the interviewer isn't compelling, and updating when it is. The ability to tell the difference, in real time, is often itself what's being evaluated.

It’s possible that your research interests and instincts are being evaluated in terms of how aligned with the team’s current directions and taste are, so it can also be helpful to be familiar with the specific team’s research agenda and recent work.

Deep-diving or presenting your own work

"Walk me through your project and what you'd do differently in retrospect."

“What additional experiments would you have liked to have done with more time?”

Our main advice here is to know your own work (projects, papers, writing and ideas) very well. Study your own research, and practice communicating it to others. Be prepared to explain, defend and critique it: design decisions, surprising results, assumptions, the research question, the methods, and the next experiment you would run in the project (because your project wasn’t a task to be completed, it’s part of a broader landscape of possible research directions).

Prep for non-safety-specific technical round(s)

Even for roles on dedicated safety teams, most interview processes include standard technical assessments. Prepare for these as you would for any research scientist or ML engineer role:

ML fundamentals (training dynamics, optimization, architecture basics)
ML and software system design
Coding assessments

These rounds are similar to rounds for non-safety research roles. For instance, see Silvio Sapora's ML Job Interviews: The Ultimate Guide for one helpful rundown.

Prep for behavioral side

Being well prepared for a behavioral interview can be a real differentiator in a technical interview process, including in AI safety interviews. Technical interviews are not just technical. You're being evaluated on how you communicate — which is a signal for what they can expect from you not just as an individual researcher, but a teammate, collaborator, communicator, and direct report to a manager.

Behavioral prep in this context means: communicating your knowledge and specifically about your own research and experience clearly to an important audience, handling pushback without either caving or defending reactively, and being able to retell your thinking out loud, under the pressure of the interview setting. These are skills, and they're learnable.

Mock interviews with a real human — a peer, mentor, or research manager — are more useful here than AI practice. Human interviewers behave significantly differently from AI.

Some people have found Cracking the Behavioral Interview or Mastering Behavioral Interviews useful preparation.

Further reading

On interview processes:

Anthropic AI Safety Fellow Interview Guide — Exponent (2026).
ML Job Interviews: The Ultimate Guide — Silvio Sapora (2026).
The AI research job market and my experience — Nathan Lambert (2023).

On what hiring managers want:

Talent Needs of Technical AI Safety Teams — MATS (2024).
AI Safety Talent Needs in 2026 — MATS Research (2026).
Experiences and Learnings from Both Sides of the AI Safety Job Market — Marius Hobbhahn, Alignment Forum (2023).

On empirical alignment research:

Tips for Empirical Alignment Research — Ethan Perez, Alignment Forum (2024).
Recommended Technical AI Safety Research Directions — Anthropic Alignment Science Blog (2025).

Discuss

Страницы