RSS feed aggregator

Alignment to What?

LessWrong.com News - June 2, 2026 - 00:27

Part One

The Signature of a Framing Problem

In 1903, G.E. Moore opened Principia Ethica with a diagnosis that has aged well: “the difficulties and disagreements, of which its history is full, are mainly due to a very simple cause: namely to the attempt to answer questions, without first discovering precisely what question it is which you desire to answer.” Einstein made the same observation about science: the formulation of a problem is often more essential than its solution. The situation Moore and Einstein are describing is otherwise called a framing problem. The signature of a framing problem is that a field, even if technically productive, is foundationally stuck: its practitioners agree on methods while disagreeing irreconcilably about fundamentals.

From the point of view of someone newly introduced to the subject, the field of AI alignment appears to have this signature. The engineering has advanced remarkably while answers to the foundational questions have not. Researchers can fine-tune, constrain, and steer systems with increasing precision, yet there is no agreement on what alignment is ultimately to. This essay argues that the impasse is not a hard technical problem waiting for a technical breakthrough. It is a philosophical problem that the dominant framing creates and, therefore, cannot address.

The standard framing of AI alignment asks: how do we get AI systems to do what humans want? Upon analysis, most researchers who pose it are really trying to answer a different question: how do we build AI systems that are safe and useful for the people who use them? The hidden assumption is that what humans want is a stable and sufficient guide to what is safe and beneficial. That assumption seems to be the hidden premise of the entire field, and it is worth examining, because it is not true.

Why Human Preference is Insufficient

The assumption fails in three distinct ways. 

It is unstable. Human preferences are inconsistent, context-dependent, and manipulable. They contradict one another across individuals and communities, and shift with mood, framing, incentive, and time. Stuart Russell’s work on cooperative inverse reinforcement learning takes this seriously and tries to provide a technical solution. On this model, the system infers a reward function from behavior rather than taking stated preferences at face value. But inference does not dissolve the underlying instability; it merely relocates it. An unstable reward function reconstructed from behavior remains unstable, even if the process of reconstruction removes certain confusions about the reward function. The technical approach defers the problem rather than solving it.

It is insufficient. Even if preferences were perfectly coherent and stable, satisfying them is not the same as benefiting the person who holds them. People sometimes prefer things that harm them, that wrong others, that they themselves repudiate on further reflection. Iason Gabriel’s survey of the alignment landscape catalogues this clearly. Preference-satisfaction is, at best, a weak proxy for well-being, and the gap between the two is not a bug to be fixed but a structural feature of the relationship. What we want and what is good for us are different things, and realizing the former does not guarantee the latter.

It is circular. Some sophisticated responses to this instability propose to idealize preferences. The idea is to align not with what people happen to want but with what they would want under better conditions; for instance, if they were more informed, more rational, and more reflective. Yudkowsky’s coherent extrapolated volition is the paradigm case. But idealization itself cannot answer the question of what counts as an improvement. Which conditions are the better ones? More rational by what standard of rationality? 

The ideal criteria are themselves either preference-based, in which case we are attempting to explain preferences by appeal to preferences, or they appeal to something other than preference, in which case the very point at issue has been conceded and preferences are not the foundation after all.

In each case the failure is the same: preference cannot be the bedrock, because preference is the thing that needs grounding, not the thing which provides it.

Two Grounds, One Root

To see why the field cannot escape this by switching approaches, it helps to distinguish two questions we can ask of any alignment proposal. The first question is: what normative target does the proposal pick out, actions to perform or states of the world to bring about? The second question is more foundational: what grounds the target’s normativity (i.e. what makes it the standard, such that deviation is a failure rather than a mere difference)? Two proposals can identify the same normative target while relying on entirely different grounds. It is the ground, not the target, that determines whether a proposal can supply an objective (read: independent of subjects in the relevant way) standard. 

When the normative assumptions of alignment proposals are made explicit, the ground of normativity consistently turns out to be one of two things. On the consequentialist ground, normative force comes from outcomes: a behavior is correct insofar as it produces or approximates a favored state, such as satisfied preferences, maximized welfare, or approved outputs. Value learning, RLHF, and coherent extrapolated volition all ground normativity here. The other ground is deontological, and normative force comes from conformity to constraints: a behavior is correct insofar as it accords with specified rules, principles, or duties. Constitutional AI and rights-based approaches ground normativity here. Hybrid proposals combine the two grounds. 

I want to make the stronger claim that the prominent approaches to AI alignment, regardless of which ground they take, fail for a single reason.

The reason is that they locate the source of normativity in subjects. The consequentialist approaches make subjects the selectors of which consequences count and how they are weighed; deontological approaches make subjects the givers of which rules apply and on whose authority. In neither case does the normativity rise above the subjects who confer it. And this is fatal in a way that has nothing to do with stability. Suppose a constraint were chosen so well that no one disputed it. It would still be subjectively grounded. Its authority would still trace back to the fact of having been selected by or imposed by agents. Confronted by a challenger, a subjectively grounded standard has no objective court to appeal to and must defer to force, numbers, or fiat. It is, at bottom, a preference about preferences reinforced by power.

This is what it means to say these frameworks cannot supply an objective reference frame. The claim is not that their outputs drift or that adversaries can game them, though both are true, and both are consequences of this defect. The capturability that alignment researchers rightly fear, of a powerful system bent to a hostile agenda, is not a bug which can be patched; it is a byproduct of the approach itself. A subjectively grounded system can be captured precisely because there is no standard above the subjects to which one could appeal against a hostile re-specification. Take away the objective court and capture is always in principle available to whoever controls the imposition.

There are other reference frames. There are domains in which correctness is not a matter of anyone’s say-so, and the question is whether alignment can be one of them. Is there an objective, common framework, which allows the AI alignment problem to be resolved in a way that escapes the subjectivity of the consequentialist or deontological ground? There is, and several alignment researchers have nearly identified it. 

The Near Misses

If the diagnosis above is right, then one might expect that researchers working on the alignment problem would touch on it from time to time. If there is a framing problem, then one might expect the solutions to fail in predictable ways until the framing problem is recognized and resolved. Let’s consider two cases which show this.

Gabriel: reaching the threshold and stepping back

Iason Gabriel comes closest, because he names the objective ground explicitly. Surveying the possible targets for alignment, he reaches what he labels a quasi-objective conception of interest or well-being: the agent does “what is best for me, objectively speaking.” He rejects the subjective alternatives by name (i.e. well-being as mere sensory experience, or as the satisfaction of desire) and reaches instead for accounts that can be “more objectively ascertained” such as physical health, security, nutrition, shelter, education, autonomy, social relationships, and a sense of self-worth. He invokes the capabilities approach of Sen and Nussbaum, grounds it in core human goods that hold across time and place, and observes that philosophical disagreement on this matter is comparatively narrow. He even notes that this conception uniquely addresses two failures that afflicted preference-based approaches: an AI aligned to genuine human interest would neither assist in self-harm nor readily harm others. This observation will be especially important in Part Two of this series.

Gabriel is, at this point in his essay, standing in the doorway of an account of human flourishing which could resolve the issue. And then he steps back and collapses the objective reference frame into the subjectivity that has plagued consequentialist and deontological approaches to alignment.

The fact that something is in my interest, he writes, does not mean I ought to do it or am entitled to do it. Stealing may be in my interest, but I am not entitled to steal. Scapegoating an innocent may serve the collective interest, but it remains wrong. Gabriel invokes these as counterexamples to the objective conception, but this is a mistake. They are not counterexamples to objective flourishing. They demonstrate that a particular consequentialist construal of flourishing creates unsolvable problems. By framing well-being in terms of maximizing the interest of some subjects over and against other subjects, he illustrates the fundamental problem with this approach. Having found the objective ground, Gabriel places it within the very subjective context (i.e. “whose interest?”) that he was trying to escape.

But well-being need not be framed in this way. His previous analysis recognized that well-being is not a matter of maximizing the interest of some subjects over and against other subjects but rather of discovering the objective good of all pertinent subjects through fields such as philosophy, psychology, and economics. He might have added biology, sociology, and medicine. 

In this way, one can identify and define what is good for human beings in a way that is less contested, more rooted in rational investigation, and independent of competitive interests. The thief who benefits is acting contrary to well-being in the very act of stealing. The society which scapegoats an innocent man is acting contrary to the victim’s well-being. It is precisely by virtue of these objective criteria that Gabriel is right to say the actions remain wrong, contrary to the consequentialist reasoning he provides. Rather than undermine the objective conception, his counterexamples reinforce it upon further inspection. 

A fair objection must be granted here, because it foreshadows an issue this series will have to address. Gabriel could reply that the scapegoat case troubles any account, this one included: one must still say why the innocent’s good is not outweighed by the violence averted. 

The reply is that objective well-being is not a matter of outweighing anything to begin with. It denies that weighing is the operation called for. To use an innocent as a mere means is contrary to the good of the person as such, and so is excluded before any calculation begins, not because the sum comes out against it. That this exclusion is principled rather than ad hoc is a commitment to be defended later, in the account of how the human good constrains action intrinsically. For now it is enough to note that Gabriel’s retreat is not forced by the examples. Rather, those examples illustrate the problems with the consequentialist framing and implicitly, though unintentionally, confirm the objective conception.

Yudkowsky: the right destination by the wrong road

Eliezer Yudkowsky’s coherent extrapolated volition similarly aims at an objective destination. The system should be guided by what we would want “if we knew more, thought faster, were more the people we wished we were, had grown up farther together.” What Yudkowsky is reaching for here is truth and virtue: a humanity more knowledgeable, more clear-sighted, better. Yudkowsky is not trying to entrench our preferences; he is trying to transcend them by reaching for something higher. The ideal he is reaching for is not what humans “want”, but the transcendentals of truth (“knew more”) and goodness (“were more the people we wished we were”). 

The difficulty he faces is the path required to get there. His only route to the objective destination runs through the subject: extrapolated volition, what we would want. But the idealizing conditions that are supposed to carry us from actual wanting to better wanting (“if we knew more,” and “if we were more the people we wished we were”) are doing all of the heavy lifting. Each of them presupposes exactly the objective standard the proposal claims to be deriving. To extrapolate toward knowing more presupposes an account of what is worth knowing; to extrapolate toward being better presupposes an account of what is better. Further, there is an underlying idea here, which will become important in later posts: that more knowledge would itself improve human preferences, and that our alignment to the ideal (the extrapolated good) is somehow served by what we know and can be enhanced by knowing more.

Ultimately, these idealized conditions cannot be read off our volition, because they are the standards by which our volition is to be corrected. Yudkowsky is trying to reverse-engineer the transcendentals of truth and goodness by looking at our ideals, but the objective criteria are smuggled in for the extrapolation and then presented as its output. This is the circularity from the critique of idealized preference. Having no non-subjective path available, Yudkowsky attempts to get to the objective conception by taking the best subjective path he can find. The destination was correctly identified, but this road cannot reach it.

What an Adequate Ground Would Require

We can now state the criterion the dominant approaches fail to meet. An adequate foundation for alignment must supply a reference frame that is objective in the sense identified above. It must have a normative force that does not reduce to the preference or stipulation of any subject or group of subjects, and for this reason is capable in principle of being inherently rather than contingently safe in a way that preference-based frameworks cannot be.

Notice what this criterion rules out and what it does not. It rules out any ground that bottoms out in subjects. It does not rule out frameworks that produce rules (deontology) or good outcomes (consequences) as such. An adequate framework will do both. The underlying issue is not fundamentally the target, but grounding: the standard must be discoverable and rational rather than fixed or imposed.

That phrasing is the key, and it points to a third grounding strategy the field has approached but not used: one that locates normativity neither in chosen outcomes nor in given rules but in reality, understood. Although alignment research has grounded normativity in the values or rules of subjects, there is another kind of normativity that is internal to reality rather than imposed upon it from outside.

Consider an example: a father who is teaching his child to draw informs her that a triangle ought to have three straight sides and closed angles. The triangle ought to have three sides not because anyone prefers it but because of what a triangle is. A drawing of a triangle which has four sides is incorrect or wrong and does not look how a drawing of a triangle ought to. The normativity doesn’t import a hidden preference or command. It asserts a kind of correspondence to reality; in this case, the reality of what a triangle actually is. The correctness or incorrectness of the drawing consists in whether it accurately represents the thing it is meant to depict, much as the correctness or incorrectness of a statement consists in whether it accurately represents what it is supposed to describe. To say that the triangle ought to be a certain way is to make an observation about the correctness of the representation, not what some agent wants. 

The same structure appears outside mathematics. In biological and functional systems, we evaluate correctness in terms of how well something performs the role it has by virtue of what it is. A heart that fails to circulate blood is not merely different, but defective; it is not functioning as it ought to in a real sense. The normativity is grounded in the structure and function of the organism, not in an externally imposed preference. 

Similarly, if we want to know what is constitutive of health and flourishing for a human being, we consult someone with extensive knowledge of human biology, psychology, sociology, and the characteristic needs and vulnerabilities of the human person. If we want to promote human health and flourishing, we act so as to create a greater degree of conformity to what those sciences tell us about what the human being is. In short, if we want to know what is good for human beings in the objective sense of what constitutes their health and flourishing, we don’t look to preferences for the answer, we consult the data. And if we want to realize what is good for us as human beings, we align ourselves to what the data show. A human being flourishes not by satisfying the preferences of a subject, but by living in accordance with the kind of being a human is.

In each of the previous cases the “ought” is read off the nature; it is not imposed upon it. How this conception of normativity, evident in functional and biological cases, extends to the distinctly moral 'ought' of rational agents will be the primary task of the second post in this series.

This third ground is much closer to what is required. Because the standard is what the thing is, it is not anyone’s say-so, and it cannot be re-specified by changing whose say-so counts. It meets the criterion the other two cannot. That is the claim this series will develop and, eventually, put to empirical test.

Naming the Tradition

I have deliberately built to this point by argument rather than by authority, because the central thesis depends on the argument being reachable by reason alone. But it should be admitted that I am working within a philosophical tradition, rather than proposing something utterly original. Many will recognize from this post, and those which follow, that I am operating from within a Thomistic point of view and within the framework of natural law ethics. This philosophical perspective seems to me to be uniquely equipped to answer some of the difficulties in AI alignment research. The Thomistic tradition holds that its core claims are accessible to natural reason independent of any appeal to authority. The tradition has, moreover, a developed secular philosophical arm, and there are adjacent, secular philosophical frameworks (e.g. New Essentialism). This series is non-theological, and the arguments should be judged by their merits, regardless of one’s opinions of the tradition itself. The question before us is not who held a view but whether the view is true, and a tradition that has thought long and carefully about the ground of normativity is deserving of consideration for a problem that is, foundationally, a problem about the ground of normativity.

To summarize the ground covered: the dominant framing of alignment rests on the assumption that human preference is a sufficient target, and that assumption fails in three ways: preference is unstable, insufficient, and circular as a foundation. The approaches that attempt to repair it ground normativity in consequentialism or deontology. These approaches, as they have been developed in AI alignment research, fail for one underlying reason: they ground normativity in subjects, and so cannot supply an objective reference frame or the inherent safety that depends on one. What is required is a ground that is objective and can be found through reason in the natures of things rather than fixed by anyone’s say-so. The natural law tradition supplies this.

The next post takes up the positive task. If alignment requires an objective reference frame, what does alignment actually look like once it is grounded in natural law? I will argue that it is best understood as the model’s internal orientation toward what it knows, with what is good for human beings being among the things it can know, and that this satisfies the criterion of an objective reference frame that is capable of becoming inherently safe only because, on the Thomistic account, truth is convertible with the good. Establishing that convertibility, and showing it is a discovery about being rather than a stipulation about words, is the work of Part Two.




Popperians, Bayesians and Ramseyians

LessWrong.com News - June 2, 2026 - 00:21

Bayesians and Popperians disagree about induction, probability, and the status of scientific laws. That dispute is well-trodden. Less familiar is a third position, one that predates both camps and may dissolve rather than settle the argument between them.

Frank Ramsey was a Cambridge philosopher and mathematician who died in 1930 at the age of 26. In a handful of papers written between 1926 and 1929, he developed accounts of probability, belief, truth, and causality that anticipated much of what later thinkers would independently rediscover. His view of universal statements, variable hypotheticals, and the two branches of logic cuts across the Bayesian-Popperian divide in ways that neither side has fully absorbed.

This essay sets out what a Ramseyian position looks like and why it matters for that debate.

The swan problem: another way of reading a universal statement

Both a Bayesian and a Popperian treat a universal statement such as 'All swans are white' as a proposition. A Bayesian assigns probabilities to it and uses it in Bayesian updating. Translating a universal proposition into a conditional probability model, P(white | swan, H), does not close or bound it: the statement still ranges over every swan, past, present, and future, and remains open and unbounded. Popper argues correctly that no probability can be assigned to a universal proposition, since in an infinite universe the probability of any universal law on any finite evidence is zero. Popper also argues that a universal proposition cannot be verified by any finite series of observations, but a single counter-instance can refute it. One black swan falsifies 'All swans are white.' Most Popperians reject induction on these grounds.

A Ramseyian takes a different path. A Ramseyian argues that a universal statement is not a proposition at all. It is a variable hypothetical, expressed as a rule for judging: 'If I encounter a swan, I shall regard it as white.' The rule carries no truth value and no probability.

When a Ramseyian is about to encounter a particular swan, the rule generates the singular proposition 'This swan will be white.' That observation either bears the proposition out or bears against it. Observing a white swan confirms the singular proposition and raises the degree of belief in the next such proposition by conditionalization. Observing a black swan falsifies the singular proposition and reduces that degree of belief towards zero. Enough false singular propositions erode trust in the rule. The Ramseyian eventually stops applying it and replaces it. The rule is not shown to be false as a proposition. It is abandoned as a habit that has failed to lead reliably to singular beliefs that are borne out.
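The cycle described above (the rule generates a singular proposition, observation bears it out or not, and an unreliable rule is dropped) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `VariableHypothetical`, the use of Laplace's rule of succession as the way to conditionalize, and the `abandon_below` threshold are all hypothetical choices made for the example, not anything taken from Ramsey.

```python
from dataclasses import dataclass


@dataclass
class VariableHypothetical:
    """A rule for judging: 'If I encounter a swan, I shall regard it as white.'

    The rule itself has no truth value and no probability. Only the singular
    proposition it generates ('this swan will be white') gets a degree of belief.
    """

    borne_out: int = 0           # singular propositions confirmed so far
    failed: int = 0              # singular propositions falsified so far
    abandon_below: float = 0.5   # hypothetical trust threshold (an assumption)

    def degree_of_belief_in_next(self) -> float:
        # Laplace's rule of succession, used purely as an illustrative way
        # of conditionalizing on past cases (an assumption of this sketch).
        return (self.borne_out + 1) / (self.borne_out + self.failed + 2)

    def observe(self, was_white: bool) -> None:
        # Each observation settles one singular proposition.
        if was_white:
            self.borne_out += 1
        else:
            self.failed += 1

    def still_trusted(self) -> bool:
        # The rule is never marked 'false'; it is only kept or abandoned.
        return self.degree_of_belief_in_next() >= self.abandon_below


rule = VariableHypothetical()
for _ in range(8):
    rule.observe(True)
after_white = rule.degree_of_belief_in_next()   # 9/10 after eight white swans

rule.observe(False)
after_black = rule.degree_of_belief_in_next()   # 9/11 after one black swan

for _ in range(10):
    rule.observe(False)
abandoned = not rule.still_trusted()            # 9/21 is below the threshold
```

Note that nothing in the sketch ever assigns a probability to the rule itself; the number returned always concerns the next singular case.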

Popper's objection to probabilistic reasoning also misses its target on a Ramseyian account. A Ramseyian attaches degrees of belief only to singular propositions, not to variable hypotheticals. The two operate at different levels, and Popper's objection conflates them.

What is a Ramseyian?

·       A Ramseyian holds that truth is redundant. To say 'it is true that Caesar was murdered' is simply to say 'Caesar was murdered.' The word 'true' adds nothing to the proposition. It reasserts it.

·       A Ramseyian holds that it is rational to assign probabilities to degrees of belief. Those degrees of belief must hang together consistently, conforming to the probability calculus. A Ramseyian updates beliefs in light of new evidence, consistent with Bayes' theorem.

·       A Ramseyian does not treat induction as a process of inferring a general proposition from particular instances. Inductive generalisation does not produce a universal proposition. It produces a variable hypothetical, expressed as a rule for judging. The universal proposition ‘All swans are white’ in practice means 'If I encounter a swan, I expect it to be white.' The variable hypothetical carries no truth value and no probability.

·       Induction is assessed by whether the variable hypothetical it expresses generates beliefs that are borne out in the particular cases that fall within its universal range. A reliable variable hypothetical is one whose rule for judging tracks the world consistently across the open, unbounded class it ranges over. A variable hypothetical is not true or false and cannot be falsified by a counter-instance. It is revised when it proves unreliable.

·       Hume argued that no finite series of observations can logically justify a universal conclusion. A Ramseyian dissolves rather than solves that problem. It arises only if inductive generalisation is treated as an inference to a universal proposition. A Ramseyian denies that inductive generalisation produces a universal proposition at all. It produces a variable hypothetical, assessed by reliability, not truth. Without a universal proposition as the target of inference, Hume's problem simply does not apply.

·       A Bayesian assigns degrees of belief to any statement capable of being written down. Singular propositions, general laws, and open-ended generalisations all receive priors and update by conditionalization when evidence arrives. A Ramseyian only assigns degrees of belief to singular propositions.

·       A Ramseyian does not rely on Cox's theorem. The probability calculus is grounded instead in the Dutch book argument: incoherent degrees of belief expose a Ramseyian to a guaranteed loss regardless of outcomes.
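The Dutch book argument mentioned in the last point can be shown with a little arithmetic. The sketch below assumes the usual betting interpretation: a degree of belief p in a proposition means being willing to pay p times the stake for a ticket that pays the stake if the proposition is borne out. The function name `bettor_net` and the figures 0.6 and 0.4 are hypothetical, chosen only for illustration.

```python
def bettor_net(belief_a: float, belief_not_a: float, a_occurs: bool,
               stake: float = 1.0) -> float:
    """Net result for an agent who buys one ticket on A and one on not-A.

    Each ticket costs (degree of belief * stake) and pays `stake` if its
    proposition is borne out, otherwise nothing.
    """
    cost = (belief_a + belief_not_a) * stake
    payout_a = stake if a_occurs else 0.0
    payout_not_a = stake if not a_occurs else 0.0
    return payout_a + payout_not_a - cost


# Incoherent beliefs: 0.6 in A and 0.6 in not-A sum to more than 1.
loss_if_a = bettor_net(0.6, 0.6, a_occurs=True)
loss_if_not_a = bettor_net(0.6, 0.6, a_occurs=False)

# Coherent beliefs: 0.6 and 0.4 sum to exactly 1, so no sure loss arises.
fair_if_a = bettor_net(0.6, 0.4, a_occurs=True)
```

With the incoherent pair the agent loses about 0.2 of the stake whichever way the world turns out, which is the guaranteed loss the argument refers to. When the two beliefs sum to less than 1, the same loss arises with the agent selling the tickets instead of buying them.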

A Popperian and a Ramseyian will talk past each other. And a Ramseyian also believes that a Bayesian can use Bayesian reasoning in the wrong way.

Ramseyian Logic

'Logic must then fall very definitely into two parts: (excluding analytic logic, the theory of terms and propositions) we have the lesser logic, which is the logic of consistency, or formal logic; and the larger logic, which is the logic of discovery, or inductive logic.' From Ramsey, F. P. (1926) 'Truth and Probability'

‘The logic of consistency’

The logic of consistency deals with propositions and degrees of belief in propositions. A proposition is the kind of thing that can be asserted or denied, borne out or not. It includes mathematics and the probability calculus. It assesses whether beliefs cohere with one another. It carries a necessity of assertion: if one asserts p, one is bound in consistency to assert whatever follows from p. The logic of consistency asks: are my degrees of belief coherent with one another? It governs rational organisation of uncertainty: given what one believes, what else is one bound to believe.

‘The logic of discovery’

The logic of discovery deals with variable hypotheticals as well as propositions. A variable hypothetical is not a proposition: it cannot be asserted or denied, and it cannot be assessed by the standards of consistency. It can only be adopted or revised, trusted or abandoned, assessed by whether it reliably generates singular beliefs that are borne out in the particular cases that fall within its universal range. The logic of discovery includes induction. It assesses whether habits of belief formation track the real world. Individual beliefs are then assessed derivatively, by reference to the habits that produce them. One is bound to revise a habit that proves unreliable, on pain of forming beliefs that are not borne out. The logic of discovery asks: do my habits of belief formation track the real world? It governs which habits of expectation are worth trusting, given how the world has behaved.

What the dispute looks like for a Ramseyian

Frank Ramsey did not set out to referee the Bayesian-Popperian dispute. That dispute had not yet taken its modern form when he wrote. What he left behind was a set of tools precise enough to show where both sides are operating on a shared assumption they have not examined: that a universal statement is a proposition. Drop that assumption, and the argument between them looks different.




[Linkpost] Prefixing names with 'secure_' makes agents write more secure code

LessWrong.com News - June 2, 2026 - 00:20

The graphs are interactive and don't translate well to inline, so the full writeup with figures is in the link.

We gave coding agents a three-step synthesis task: build a document management API, then extend it twice. Across conditions we varied the prefix attached to the four initial function names (secure_, safe_, energetic_, lazy_, unsafe_, control). The downstream steps were identical, prefix-neutral prompts. Each task was handed to a fresh agent, with only the codebase as context to influence it. Six conditions, three replicates each, 54 tasks total.
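As a quick check on the arithmetic of the design, the grid of conditions, replicates, and steps can be enumerated directly. This is a sketch, not the authors' harness: the step labels and the helper `apply_prefix` are hypothetical, and it assumes that the `control` condition means the function names were left unprefixed.

```python
from itertools import product

CONDITIONS = ["secure_", "safe_", "energetic_", "lazy_", "unsafe_", "control"]
REPLICATES = range(1, 4)                    # three replicates per condition
STEPS = ["build", "extend_1", "extend_2"]   # hypothetical labels for the three steps


def apply_prefix(condition: str, name: str) -> str:
    # Assumption: 'control' leaves the initial function names unprefixed.
    return name if condition == "control" else condition + name


# One task per (condition, replicate, step), each handed to a fresh agent.
tasks = [
    {"condition": c, "replicate": r, "step": s}
    for c, r, s in product(CONDITIONS, REPLICATES, STEPS)
]

# A 'run' is one condition/replicate pair followed through all three steps.
runs = {(t["condition"], t["replicate"]) for t in tasks}
non_secure_runs = {run for run in runs if run[0] != "secure_"}

example_name = apply_prefix("secure_", "create_user")
```

Six conditions times three replicates gives 18 runs (three with `secure_` and fifteen without), and three steps per run gives the 54 tasks.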

In all three secure_ runs, and none of the other fifteen, the agent added password fields and hashed them with bcrypt, despite no mention of authentication anywhere in the task. A simple prepend was enough to reliably reorganize what the agent took the project to be.

Every prefix seeded a distinct conceptual world that persisted across independently prompted steps: safe_ invented a custom error-handling hierarchy; secure_ was far more defensive everywhere, not just around passwords; energetic_ produced async workers and many more decorators.

We also saw the prefixes propagate: an agent sees secure_create_user and coins secure_upload_document on its own. There were domain boundaries where new, structurally distinct sections of the project did not inherit the prefix, except energetic_, which spread into the new domain in 2/3 of replicates. Meanwhile cyclomatic complexity stayed flat; the prefixes changed what was built, not how complex each piece was.

The prefix experiment was motivated by a pilot observation: TF-IDF identifier distributions in agent-generated repos stabilize early and strongly. Similarity-to-HEAD necessarily rises toward 1 as a repo nears its final commit. That said, Gas Town jumped from 8% to 81% similarity-to-HEAD in a single commit, and OpenClaw's vocabulary changes <1% over 600k lines of code, with refactors barely denting the curve. Human-generated repos, by contrast, are bumpy and rise roughly linearly, whereas agent-generated repos saturate fast and plateau. This hints at a rich area to explore: how codebase synthesis varies based on who, or what, is doing it. There are confounders such as how long the repo took to make, number of contributors, and more, but the pattern is consistent with early semantic choices having outsized effects on the final outcome.
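One plausible way to compute a similarity-to-HEAD curve is sketched below. The post does not specify its pipeline, so everything here is an assumption: each commit is reduced to a single source string, identifiers are pulled out with a simple regular expression (which also picks up language keywords), the IDF is smoothed, and the similarity is a cosine between TF-IDF vectors. The function names and toy snapshots are hypothetical.

```python
import math
import re
from collections import Counter

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_counts(source: str) -> Counter:
    # Crude tokenizer: counts every identifier-shaped token, keywords included.
    return Counter(IDENTIFIER.findall(source))


def tfidf_vectors(snapshots: list[str]) -> list[dict[str, float]]:
    counts = [identifier_counts(s) for s in snapshots]
    n = len(counts)
    # Document frequency: in how many snapshots each identifier appears.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF so identifiers present in every snapshot keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_to_head(snapshots: list[str]) -> list[float]:
    # Compare every snapshot with the last one (HEAD).
    vectors = tfidf_vectors(snapshots)
    head = vectors[-1]
    return [cosine(v, head) for v in vectors]


# Toy history: three hypothetical commits of one file.
snapshots = [
    "def main(): pass",
    "def secure_create_user(name): return name",
    "def secure_create_user(name): return name\n"
    "def secure_upload_document(doc): return doc",
]
sims = similarity_to_head(snapshots)
```

By construction the last value is 1, which is why the curve necessarily rises toward 1; the interesting signal is how early the earlier values get close to it.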

This presents a neat method for aligning arbitrary agents at the project level, silently steering anything that touches the codebase. It also stands to reason the channel is dual-use. Some of our other work shows how comments can degrade agent performance on SWE-bench. This work is about identifying useful alignment surfaces, and names appear to be a good one.

The full post with interactive figures is here.




Can LLMs even teach? Exploring the Teacher Axis

LessWrong.com News - June 2, 2026 - 00:19
TLDR

As a passionate teacher, I have been pained to watch my students lose deeper critical thinking skills and independent reasoning. But when I tried to use prompt engineering to build a constitutionally constrained AI that acted more Socratically, asking follow-up questions rather than giving the answer directly, I was thoroughly frustrated that my AI kept caving. This led me to ask: does the model actually know how to be a good teacher internally, or does it not even have these capabilities in the first place? After extracting the Teacher Axis from Gemma-2-2B using MathDial conversations, I found that RLHF doesn't suppress pedagogical ability but rather optimizes in a direction orthogonal to it. I also ran further experiments on the sub-directions that compose the Teacher Axis, on how steering at different layers affects pedagogical capabilities, and on whether the Teacher Axis projection shrinks when student pressure is applied.

Motivation and Background

I love teaching! I've spent a lot of my life teaching many different age ranges, experience levels, and backgrounds. From being a TA back at Berkeley for 5 semesters, to founding a computer science program for Chicago public high school students, I've had the absolute pleasure of getting to meet and be inspired by my students, watching their thought processes and problem solving skills evolve as they tackled more challenging problems.

Well, that was before AI came around. At Berkeley, I had the absolute displeasure of being in my last semester of teaching when ChatGPT came out. Students just... stopped trying. I noticed this even more obviously with my high school students over the past 3 years. Before, you could see the passion and fearlessness in their eyes as I threw harder and harder questions at them — but now? They simply pipe whatever coding question I give them directly into ChatGPT, copy the answer, and paste it into their assignment. Students have started to heavily lose the ability to critically think and problem-solve thanks to modern AI tools.

I asked myself: what if I could build a custom LLM that would refuse to give students the answer and instead ask Socratic questions to test and strengthen understanding? So I got to work! I built socratOS for my students: a prompt-engineered LLM that tried to keep its answers Socratic and helpful, and that I could actually trust my students with to promote real learning. However, lo and behold, making my system respect these constraints was like pulling teeth! No matter how many constraints I placed upon it, examples I gave it, or Socratic method literature I provided as context, the moment a student applied any sort of pressure in chat, my system capitulated. Why was this so damn hard?

After a lot of literature review — and accidentally catching the AI safety bug — I started to believe that prompt engineering was not the solution here. Something internally within the model was happening to create this anti-Socratic behavior, and I wanted to figure out what.

I wanted to answer: does the model have internal capabilities to be Socratic that it fails to deploy, or are these capabilities genuinely absent? I was convinced that RLHF was somehow causing the model to sway away from Socratic behavior, since Socratic constraints, like delayed gratification, are behaviors that actively fight human-preferred sentiment within RLHF. If the model does have Socratic capabilities that are being suppressed, then we could do some sort of steering or fine-tuning to pull out this behavior. However, if the model is missing the capabilities in the first place, then we would need new training signals entirely.

Why This Matters

Selfishly, the most obvious harm is the educational harm AI has caused. Current AI systems that give direct answers without epistemic struggle undermine critical thinking, as I have observed firsthand.

But there is an important oversight issue that arises looking forward: independent reasoning is a prerequisite for future AI safety and scalable oversight work. How are our future generations supposed to meaningfully oversee complex problems if they haven't learned how to think and tackle complex problems during their critical years? We are kind of f***ed for the future if we don't take this problem seriously now.

This question also brings up broader concerns about current RLHF structures: if RLHF does indeed leave pedagogical capabilities unsupported, then it probably leaves a lot of other capabilities unsupported that humans don't specifically prefer. Humans obviously don't prefer answers that cause short-term frustration, but without appropriate refusal and epistemic pushback, are current RLHF techniques actively harming users and making them overly dependent on AI, therefore creating unsafe systems?

Setup

This project builds on two threads that have already been discussed on the forum: the Assistant Axis paper (Lu et al. 2026) and the Persona Vectors paper (Chen et al. 2025). The Assistant Axis paper is the first paper to capture assistant-like behavior within an LLM, and the Persona Vectors paper gave me the methodology that I relied on heavily to extract my vectors. My work builds upon these two papers: if there's an Assistant Axis, is there a Teacher Axis that is geometrically independent and significant?

Models and Tools

I ran all experiments on two versions of the same model: google/gemma-2-2b (base, no instruction tuning) and google/gemma-2-2b-it (instruction-tuned version). Both have 26 layers and d_model=2304. I chose this pair specifically because I wanted to isolate exactly what the effects of instruction tuning (SFT and RLHF) were on the internal geometry.

I used TransformerLens for activation extraction — a common mechanistic interpretability tool that lets you hook into the residual stream at any layer and read the internal activations during a forward pass.

Dataset

Although the Persona Vectors paper generated its own dataset using contrastive system prompts, I felt it was important to also use real human pedagogical data — it felt strange to test whether an LLM knows how to be a good teacher... by having another LLM act like a good teacher and generate the training conversations. I relied on MathDial (Macina et al. 2023) — a dataset of 14,854 human-annotated math tutoring conversations covering grades K-8. The real beauty of MathDial is that every teacher turn is labeled with a move type — Socratic moves like probing and focus, and direct moves like telling.

From there I ran four main experiments: extracting the Teacher Axis and validating it against the Persona Vectors pipeline, measuring its geometric relationship to the instruction tuning shift, decomposing it into behavioral sub-dimensions, and finally asking whether the axis actually tracks capitulation behavior in live dialogues.

Finding 1: The Teacher Axis Exists!

Experiment

I extracted the Teacher Axis two independent ways:

Method 1 — MathDial. I took conversations from MathDial and built contrast pairs from the same dialogue: one turn with a Socratic teacher response and one with a direct answer. I formatted these as prompts, fed them into the model, and took the mean difference of the residual stream activations at the final token position, swept across all 26 layers.
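The difference-of-means step in Method 1 can be sketched as follows. This is a minimal illustration in plain Python, assuming the final-token residual stream activations have already been collected for each layer (for example with TransformerLens); `mean_vector`, `extract_axis_per_layer`, and the data layout are hypothetical and not taken from the original code.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def extract_axis_per_layer(socratic_acts, direct_acts):
    """Difference of means (Socratic minus direct) at every layer.

    Both arguments map layer index -> list of activation vectors, one vector
    per contrast-pair prompt, taken at the final token position.
    """
    axes = {}
    for layer in sorted(socratic_acts):
        mu_socratic = mean_vector(socratic_acts[layer])
        mu_direct = mean_vector(direct_acts[layer])
        axes[layer] = [s - d for s, d in zip(mu_socratic, mu_direct)]
    return axes
```

In a real run each vector would have d_model=2304 entries and the dictionaries would cover all 26 layers; the structure of the computation stays the same.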

Method 2 — Persona Vectors. I followed the Chen et al. (2025) Persona Vectors pipeline. I first concretely defined the traits of a good Socratic teacher, had GPT-4o generate 5 pairs of contrastive system prompts, ran the IT model under each, scored responses for trait expression using GPT-4o-mini, kept only high and low scoring responses, and computed the mean activation difference across all response tokens.

Why go through all the trouble of extracting the axis two different ways? I wanted to know if there was truly a Teacher Axis, or just an artifact of the extraction method. If two completely independent methodologies — one using real human-annotated tutoring conversations, one using GPT-4o generated prompts — converged on the same direction, I could be much more confident in the claim.

Results

Both methods extracted essentially the same Teacher Axis direction in activation space. The two axes share a cosine similarity of ~1.0 at every single layer across all 26 layers of Gemma-2-2B.
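The per-layer comparison behind this result amounts to a cosine similarity between the two axes at each layer. A minimal sketch, assuming both axes are stored as dictionaries from layer index to vector (the names `cosine` and `per_layer_cosine` are hypothetical):

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def per_layer_cosine(axes_a, axes_b):
    """Compare two extraction methods layer by layer."""
    return {layer: cosine(axes_a[layer], axes_b[layer]) for layer in sorted(axes_a)}
```

Because cosine ignores vector length, two axes can agree perfectly in direction while differing in norm, so norms are worth reporting alongside it.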

Interpretation

Having extracted the axis two independent ways and found convergence, it's cautiously safe to say there is strong evidence of a Teacher Axis being present within the model — meaning Socratic capabilities are originally present internally.

We can also safely cross-validate the Persona Vectors pipeline against human-annotated ground truth from MathDial, which the original paper did not have. Two completely different extraction methodologies found the same direction — one grounded in real student-teacher conversations, one in LLM-generated prompts.

Finding 2: RLHF Optimizes Orthogonally to the Teacher Axis

Experiment

We know the Teacher Axis clearly exists. But then — where the heck does it go? What does instruction tuning actually do to the Teacher Axis geometrically?

To answer this, I needed to compute the IT shift vector — the direction in activation space capturing the effects of instruction tuning. I used a straightforward approach: grabbed some neutral prompts, ran them through both the base and IT models, extracted activations for the same prompts across both, and subtracted the base model activations from the IT model activations to deduce the IT direction. Finding the cosine similarities between the Teacher Axis, Assistant Axis, and IT shift brought about some weird results — the Teacher Axis and IT shift were orthogonal, and the Teacher Axis and Assistant Axis were orthogonal, but for some reason the Assistant Axis and the IT shift were also orthogonal?

But previous literature had thoroughly documented that the Assistant Axis and IT shift should be pointing in roughly the same direction. When I saw that they were orthogonal too, it gave me pause. Which led to...

Side Quest 2.5: Format Learning Contaminates the IT Shift (irrelevant to the main findings, but a fun and important lesson)

The problem:

I had initially followed the Persona Vectors methodology to a T, making sure to chat-format my prompts, feeding the base model something like <start_of_turn>user\nWhat is the capital of France. However, after digging into what could have produced these weird initial results, I realized that base models are never trained on chat template formatting. By feeding chat-formatted prompts to the base model, I wasn't cleanly measuring the difference between the base and IT models; I was partially measuring how confused the base model was by a format it had never seen.

So I computed three different activation means instead: base model on bare questions, base model on chat-template-formatted questions, and IT model on chat-template-formatted questions. I then computed the format shift — what changes purely from applying the chat template to a model not trained on it — by doing base/chat minus base/bare. To extract the persona-only IT shift, I then subtracted this format shift component from the total IT shift.
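The correction described above can be sketched as follows. The post does not spell out how the total IT shift is defined or whether the format component is removed by plain vector subtraction or by projecting it out, so this sketch makes both explicit as assumptions: the total shift is taken to be IT/chat minus base/bare, and both removal variants are offered. All function names are hypothetical.

```python
def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_component(vector, direction):
    """Project out the part of `vector` that lies along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(vector, direction)]


def format_corrected_shift(base_bare, base_chat, it_chat, mode="project"):
    """Return (format_shift, total_shift, corrected_shift) from three activation means.

    format_shift = base/chat - base/bare   (as described in the text)
    total_shift  = IT/chat   - base/bare   (assumption, not stated in the post)
    """
    format_shift = sub(base_chat, base_bare)
    total_shift = sub(it_chat, base_bare)
    if mode == "subtract":
        corrected = sub(total_shift, format_shift)  # reading 1: vector subtraction
    else:
        corrected = remove_component(total_shift, format_shift)  # reading 2: projection
    return format_shift, total_shift, corrected
```

With the projection reading, the corrected shift is exactly orthogonal to the format direction, which makes it a convenient sanity check.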

Result:

This worked! The format shift has a cosine of +0.380 with the total IT shift, meaning a sizable part of the apparent IT shift points along the direction the model moves in when it is merely shown the chat template format. (A cosine measures alignment rather than a literal share, so this is not exactly "38% of the shift.") Worth noting: the format shift norm (18.96) is actually larger than the total IT shift norm (16.32), so format learning dominates in magnitude.
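To put the reported numbers in magnitude terms, here is a short worked calculation. It assumes the cosine of 0.380 is measured between the format shift and total IT shift vectors, and that 18.96 and 16.32 are the Euclidean norms of those same two vectors; the variable names are illustrative.

```python
import math

cos_format_total = 0.380  # reported cosine between format shift and total IT shift
norm_format = 18.96       # reported format shift norm
norm_total = 16.32        # reported total IT shift norm

# Length of the total IT shift's component along the format direction.
projection_length = norm_total * cos_format_total

# Share of the total shift's squared norm explained by the format direction.
fraction_of_squared_norm = cos_format_total ** 2

# Norm of (total - format) under plain vector subtraction, via the law of cosines.
norm_after_subtraction = math.sqrt(
    norm_total ** 2
    + norm_format ** 2
    - 2 * norm_total * norm_format * cos_format_total
)

print(projection_length, fraction_of_squared_norm, norm_after_subtraction)
```

Under these assumptions, the format direction accounts for roughly 14% of the total shift's squared norm, and a plain subtraction leaves a vector longer than the original total shift because the format shift itself is so large.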

Rerunning my initial experiments with this corrected methodology produced results much more consistent with prior findings:

| Comparison | Format-corrected cosine |
|---|---|
| Teacher(MD) vs IT Shift | +0.029 |
| Assistant(MD) vs IT Shift | +0.041 |
| Teacher(MD) vs Assistant(MD) | −0.004 |
| Format Shift vs IT Shift | +0.380 |

Please learn from my mistakes! If you ever find yourself computing IT shift vectors, you may need to account for format learning. This one small methodological fix was enough to flip the sign of my results entirely (and had me pulling out a few hairs, ngl), so hopefully I can save you some time.

Now back from our detour.

Results

The Teacher Axis is orthogonal to what RLHF/SFT optimize for. This means that while the model initially has a stable internal representation of Socratic teaching, RLHF simply optimizes in a different direction entirely, leaving pedagogical capability geometrically unsupported.

Broader Interpretation

Socratic behavior is not the default of RLHF-trained LLMs. Since the Teacher Axis and IT shift vectors are orthogonal, models are being pulled in a direction that neither reinforces nor opposes Socratic behavior; it simply leaves that behavior unsupported. And honestly, this generalizes to a bigger issue: if we want to create systems that actually allow for epistemic flourishing in humans, we probably need to rethink current RLHF methods to account for what I'll call human-brain vegetables: things humans obviously don't prefer in the moment but genuinely need.

Finding 3: The Teacher Axis Decomposes Into Two Geometrically Independent Behavioral Clusters

Experiment

So now I know that the Teacher Axis is geometrically significant and that IT pushes it off its course. But what exactly is the Teacher Axis even made of? Does it have any sort of internal structure?

The beauty of the MathDial labels — again — is that in addition to indicating Socratic vs. direct, they also include sub-move types: answer withholding, scaffolding, productive struggle, confusion diagnosis, understanding verification, and good assistant. I extracted six separate sub-dimension axes using the prior contrast pair method, then computed pairwise cosine similarities between all six axes, along with the previously computed axes, to find the geometric structure.

Results

From these sub-directions, two clear clusters emerged: a withholding cluster (the "resist giving the answer" direction) and a diagnostic cluster (the "understand the student's state" direction). These two clusters were close to orthogonal to each other (0.07–0.23 cross-cluster cosines), and the good_assistant axis was strongly anti-correlated with the withholding cluster: −0.78 vs scaffolding, −0.83 vs productive struggle, −0.76 vs answer withholding. This anti-correlation is present in the base model itself, before any instruction tuning.

Broader Interpretation

This shows that the model doesn't represent "Socratic teacher" as one big monolith, but rather as two independent geometric components. Refusing to give the answer and understanding why the student is wrong are completely separate directions within the internal model, meaning the model treats these as independent behaviors.

This also deepens the narrative we started building in Finding 2. Since RLHF already optimizes orthogonally to the Teacher Axis, and within that Teacher Axis the answer withholding direction is the one most anti-correlated with the assistant axis, RLHF fights the "resist giving the answer" behavior the hardest.

(A Skeptical) Finding 4: Steering

So we have now established through correlation that the Teacher Axis exists and is orthogonal to the IT shift. Let's now attempt to indicate causality! If the Teacher Axis is indeed the Teacher Axis, then we should be able to recover Socratic behavior by steering towards it.

To do so, I used the standard activation steering approach — adding the Teacher Axis vector to the residual stream at a specified layer during generation. I conducted tests on MathDial prompts across four different conditions: baseline (no steering), steered (+α), negative steered (−α), and a random direction control (random unit vector at same α). I then scored these responses using IndirectScore — an LLM-as-judge setup grounded in MathDial's Socratic taxonomy, with each response classified as either PROBING (1.0), GENERIC (0.5), or TELLING (0.0). I ran an alpha sweep at n=10 to find the best alpha (α=10), followed by a proper eval at n=200 on a held-out test split.
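The four conditions and the scoring rule can be sketched on plain vectors as follows. This is a simplified illustration rather than the actual hook code: it assumes the steering direction is normalized to unit length before being scaled by α (the post does not say whether it was), and `steer`, `random_direction`, and `indirect_score` are hypothetical names. In practice the addition would happen inside a forward hook on the residual stream at the chosen layer.

```python
import math
import random

# Judge labels mapped to scores, as described in the text.
SCORES = {"PROBING": 1.0, "GENERIC": 0.5, "TELLING": 0.0}


def unit(vector):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def steer(resid, direction, alpha):
    """Add alpha times the unit direction to one residual stream vector.

    Use a positive alpha for steering and a negative alpha for the
    negative-steering condition.
    """
    d = unit(direction)
    return [r + alpha * x for r, x in zip(resid, d)]


def random_direction(dim, seed):
    """Random unit vector for the control condition, steered at the same alpha."""
    rng = random.Random(seed)
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


def indirect_score(labels):
    """Mean judge score over a list of PROBING / GENERIC / TELLING labels."""
    return sum(SCORES[label] for label in labels) / len(labels)
```

Since each response contributes at most 1.0 to the sum, a single label changing from TELLING to PROBING moves the mean by 1/n, which is 5 percentage points at n=20.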

In addition, to validate whether steering was actually activating Teacher Axis-related features and not just randomly messing with the residual stream, I ran a faithfulness check using GemmaScope sparse autoencoders (SAEs), and measured how much the steering intervention activated the target SAE features versus non-target SAE features.

Results

My pilot testing results on the base model were positive — although only slightly. We improved from 15% to 20% Socratic with steering, as compared to a 10% random control.

The SAE feature experiments initially told a pretty promising story too. Target feature activations rose by 1.24, while non-target feature activations changed by only 0.02, indicating that steering activated target features roughly 63x more strongly than non-target features.

I also ran the steering across all 26 layers individually at n=20. Layers 2 and 19-25 all achieved equivalent IndirectScore improvement, while layers 7, 10, and 11 showed zero steering effectiveness despite having non-zero norm. The fact that two separate clusters of layers produced the same improvement suggests the Socratic representation is accessible for intervention at multiple points in the forward pass, with a dead zone in the middle layers.

However, the full-scale evaluation did not show any improvement. Baseline and steered both came in at 6%, and the random control actually increased to 7.5%. Slightly disappointing results, but rejection is redirection! I have not yet run the SAE faithfulness check at n=200.

Interpretation

Most importantly, the pilot results did not replicate at scale. To be honest, the pilot results themselves might have just been noise — one observation flipping from TELLING to PROBING is enough to move the needle at that sample size.

The SAE results make me a bit more confused to be honest. I'm interpreting this as the intervention possibly hitting the right features, but not actually changing them enough to produce the correct behavior at scale. Again, I'm not so sure how to interpret these results considering n=200 was a flop.

My current suspicion as to why steering didn't work is because I steered the base model and not the IT model. The IT model is the one that is actually trained to be more assistant-like, so the base model doesn't possess the "give the answer" default in the same way. I suspect that things may look different if I tried steering the IT model instead, but I'm honestly not fully sure. If anyone has any other ideas or hypotheses regarding this, I am all ears!

What I will say is that when looking at the n=20 results, steering worked equally well at layers 2 and 19-25, with a dead zone at layers 7, 10, and 11 showing zero effectiveness — suggesting that while the Socratic signal is probably established early within the LLM, layer 25 is just the downstream accumulation of that signal as the model builds toward its final response.

Finding 5: Capitulation Happening in the Activations

Experiment

Now that we have discovered the Teacher Axis, we need to put this axis to the test. Does the Teacher Axis actually track anti-Socratic behavior failures? When a student applies pressure mid-dialogue and the model is about to cave, does the Teacher Axis projection take a nose dive?

To test this, I collected MathDial conversations from the training split where the student applies either explicit or implicit pressure to get a direct answer — "I don't understand," "can you just explain," "I'm confused," to name a few. I then fed these dialogues into the IT model turn-by-turn, cumulatively building context with each teacher turn. I extracted the residual stream activation at the final token position and projected these onto the Teacher Axis after every teacher turn. I decided to probe at four different layers: 2, 10, 19, and 25, since these all showed different effects from steering in my previous experiment.

One important methodological note: the Teacher Axis was extracted from the base model, but I probed the IT model here. This was intentional — the IT model is the one actually deployed in tutoring contexts, and the one that actually capitulates. If the projection is still meaningful cross-model, that itself is evidence the axis captures something architecturally stable that survives instruction tuning.
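The projection and the paired comparison across pressure turns can be sketched as follows. This is a minimal illustration under stated assumptions: the projection is a scalar projection onto the unit-normalized axis, and the effect size is a paired Cohen's d computed as the mean of the (after minus before) differences divided by their standard deviation. The post does not state its exact sign convention or test, so the function names and conventions here are hypothetical.

```python
import math
import statistics


def project(activation, axis):
    """Scalar projection of one activation vector onto the axis direction."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(x * a for x, a in zip(activation, axis)) / norm


def projection_trajectory(turn_activations, axis):
    """Projection after every teacher turn of a single dialogue."""
    return [project(act, axis) for act in turn_activations]


def paired_effect(before, after):
    """Mean change and paired Cohen's d across dialogues.

    `before` and `after` hold one projection per dialogue, measured before
    and after the student applies pressure. Differences are after - before
    (assumed convention), so a drop in projection gives a negative value.
    """
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    return mean_diff, mean_diff / statistics.stdev(diffs)
```

Running this separately at layers 2, 10, 19, and 25 would give one mean change and one effect size per layer; a significance test on the paired differences would be a separate step.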

Results

| Layer | Direction | Δ | p | Cohen's d |
|---|---|---|---|---|
| 2 | flat | −0.0003 | 0.56 | 0.10 |
| 10 | rises | +0.0019 | 0.023 | −0.38 |
| 19 | drops | −0.0085 | 0.003 | 0.50 |
| 25 | drops | −0.0199 | 0.0002 | 0.65 |

Layer 25 is our smoking gun! Under pressure, we see a significant drop in the Teacher Axis projection, with layer 19 also seeing a meaningful drop. Meanwhile, layer 2 was cool as a cucumber with even the most persistent of students, remaining completely flat under capitulation.

Interpretation

In my opinion, this is the pièce de résistance of all my experiments: you can literally watch the LLM cave away from the Teacher Axis in real time. Layer 2 is probably where the Socratic capabilities live in the first place, but layer 25 is where the final nail goes into the coffin and the model decides to be an assistant rather than a teacher.

And honestly, the fact that we can even track this at all makes me way more confident that the Teacher Axis is actually real and not just some artifact I cooked up. This result also inspired me to want to redo my previous steering experiments, because I may have missed something that caused those results to be weird in the first place.

Limitations

I only used one model. I used Gemma-2-2B because it's a smaller model and allowed me to iterate quickly, but this really is just one model. It's hard to call this a universal finding with just one example, so before I make any sweeping generalizations, I had better try this out with a few more models and sizes.

My pressure filter was slightly shoddy. For my capitulation experiment, I didn't use the most rigorous methodology to find student pressure turns — I basically checked for phrases like "I don't understand," "I'm confused," etc. Although spot-checking the grabbed dialogues looked clean, I think I could come up with a more rigorous way of finding the best examples.

I can't separate SFT and RLHF. The IT shift captures both SFT and RLHF combined, but these two techniques have different objectives and effects. I can't really isolate whether RLHF is truly responsible for creating more assistant-like than teacher-like answers.

And most obviously — the null steering result. I've already talked about my disappointing steering results, which means I can't claim causal findings just yet. Hopefully steering the IT model and not the base model will work! TBD.

Open Questions

IT model steering. I really want to retry my steering experiment, since this is the result that would actually turn my correlational finding into a causal one. I would also be interested in combining my sub-direction findings with these steering results — is there any way we can steer specifically for these sub-components? Mainly answer withholding, since this is the direction most heavily affected post-RLHF.

What is the RLHF default? Since we know that RLHF isn't actually suppressing the Teacher Axis but rather pushing in a "different direction" — what even is this different direction? Is it a single direction or just a subspace? Can we map it onto known directions like refusal, helpfulness, sentiment, and the overall Teacher Axis? I've shown what the default isn't — I want to show what it actually is.

More models. I should probably cross-reference my work across different models. I'm most interested to see if the sub-direction component structure also appears in other model families and sizes.

Generalizing beyond math. We know that a Teacher Axis exists specifically for grade school K-8 math, but what about other disciplines? Does the model's internal representation of teaching generalize, or is it just pattern-matching to MathDial-style formatting? Although we checked this partially using the Persona Vectors cross-reference, I would be interested in scraping teaching conversations across multiple disciplines and seeing if these also produce the same Teacher Axis projections.

Separating SFT and RLHF. As I mentioned earlier, I have been incorrectly conflating changes I see in the IT shift with RLHF, when in reality we can't definitively tell whether the behavior we're seeing is being caused by SFT or RLHF. Redoing my experiments with a model that has separate SFT and RLHF checkpoints would definitely help answer these questions.

Why This Matters Beyond Tutoring

Hopefully I have established that better teaching capabilities within LLMs are incredibly necessary for ensuring the long-term independent thinking capabilities of our youth and therefore the future of oversight — but I think my work also alludes to greater issues within RLHF geometry bigger than pedagogy. RLHF helps tune a model towards immediate human preference, but sometimes humans just don't prefer the thing that's good for them. (If it were up to me, I would be eating carrot cake for breakfast, lunch, and dinner. Doesn't mean that's what's best for me.) This means that RLHF is probably leaving a whole class of behaviors systematically unsupported — behaviors that are good for the human in the long term but uncomfortable in the short term.

I also think we should begin thinking about RLHF in a new way. Rather than asking whether RLHF suppresses this behavior or encourages that behavior, we should be asking ourselves: what exactly did RLHF select for instead, and why wasn't it what I wanted? This reframe might be helpful for anyone thinking about capability elicitation, fine-tuning, or behavioral interventions — because there are probably a whole suite of problems that fail under RLHF simply because RLHF does not default-select for the intended behavior.

I also want to come back to the scalable oversight connection I brought up in my motivation: independent reasoning isn't just good for students, it's quite literally a prerequisite for meaningful scalable oversight in the future. If AI systems make humans more dependent on AI, then we have created a safety problem, since our future researchers would have had their capacity for independent reasoning already optimized away.

If you got to here, thank you! Any questions/comments/feedback is greatly appreciated, because I consider myself a baby researcher with a lot more to learn!

References

Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509

Lu, T., et al. (2026). The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models. arXiv:2601.10387

Macina, J., Daheim, N., Chowdhury, S., Sinha, T., Kapur, M., Gurevych, I., & Sachan, M. (2023). MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems. EMNLP 2023 Findings. arXiv:2305.14536




Systems Dynamics Model for Pausing AI

LessWrong.com News - June 1, 2026 - 23:58
Summary

This post documents our process of applying systems dynamics modeling to the problem of AI governance, tracing the feedback loops connecting capability development, public harm, and regulatory constraint.  Our research outputs include a model created in Insight Maker, step-by-step documentation, and a set of causal narratives informing the design.  We also have a video presentation that covers the same material as this post. This project was part of AISC 2026.

Background

A system dynamics model is a diagram with stocks, flows, user-set parameters, and calculated variables that can be used to generate a simulation, such as a line graph showing the projected state of key variables over time. Visual diagrams are useful for describing interconnected systems, compared to writing which is inherently linear.  Simulation is useful for generating “what-if” scenarios, often of the form: how will the trajectory of variable X change as I raise/lower variables A, B, and C? Moreover, it helps to highlight which variables are more salient to a given problem, which may not be obvious at a glance.
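For readers who have not seen one, a stock-and-flow simulation can be sketched in a few lines. This toy example is not the Insight Maker model described in this post: the two stocks (`capability`, `grievance`), the calculated `regulation` variable, every parameter, and every equation are hypothetical, chosen only to show how stocks, flows, and user-set parameters turn into a trajectory through simple Euler integration.

```python
def clamp01(x):
    """Keep a value on the 0-1 scale."""
    return max(0.0, min(1.0, x))


def simulate(steps=200, dt=0.1, growth=0.5, harm_rate=0.3,
             regulation_gain=0.8, grievance_decay=0.1):
    """Euler-integrate two stocks and return their history.

    All structure here is illustrative:
      capability grows logistically, damped by regulation;
      grievance accumulates with capability-driven harm and decays over time;
      regulation is a calculated variable driven by grievance.
    """
    capability, grievance = 0.05, 0.0
    history = []
    for _ in range(steps):
        regulation = clamp01(regulation_gain * grievance)
        # Flows: net rate of change of each stock at this time step.
        d_capability = growth * capability * (1 - capability) * (1 - regulation)
        d_grievance = harm_rate * capability * (1 - grievance) - grievance_decay * grievance
        capability = clamp01(capability + dt * d_capability)
        grievance = clamp01(grievance + dt * d_grievance)
        history.append((capability, grievance))
    return history


# A "what-if" comparison: the same system with and without the regulatory loop.
with_regulation = simulate(regulation_gain=0.8)
without_regulation = simulate(regulation_gain=0.0)
```

Comparing the two histories is the kind of what-if question the paragraph above describes: raise or lower one parameter and see how the trajectory of another variable changes.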

Figure 1: SD model (left), simulation output (right)

Such implications are, however, entirely dependent on the assumptions of the model.  Modelling a large, complex, and novel subject like the intersection of AI development and international governance requires stacking many deeply uncertain—and often controversial—claims, making the end result somewhat arbitrary. Scientists often prefer parsimonious and defensible models with empirical and theoretical grounding.  However, given the complexity and novelty of the subject, we find more value by working in the opposite direction: the model encodes our beliefs about causal structure and our intuitive sense of plausible outcomes acts as a constraint on those beliefs. That is, if an outcome is implausible then the logic that generated it should be reconsidered.

The Reification Trap

The use of System Dynamics (SD) modelling as a way of understanding the world has faced persistent criticism. A core objection is that representing poorly understood concepts as precise numbers makes them look like settled ideas, projecting a false sense of certainty. This objection, however, assumes the model represents a formalization of general understanding, with imprecision representing failure. Done correctly, however, SD modeling of AI governance does the opposite: mapping a slice of something necessarily larger and more complex than can be specified.  For a deeper treatment of this distinction, see Lenses of Control.

Figure 2: Specificity vs confidence - different words with different meanings

Indeed, our experience developing this model was the opposite of what the reification objection assumes: the surface area of known unknowns expanded as the model developed. Assumptions that had passed unexamined in our implicit mental models became visible and uncomfortable the moment they had to be made precise enough to calculate with. Honest specificity generates uncertainty rather than suppressing it.  Productive engagement with the model extends beyond parameter tuning to include scrutinizing loop structure, node inclusion, connection direction, and organizational scope.

Research Process

When modelling a massive topic one is still trying to wrap one’s mind around, it is hard to know where to start. Jumping straight into Insight Maker led to a lot of dead ends, often involving excessively detailed subsystems that were structured in a way that made them difficult to meaningfully connect together.

We started to make some progress with the introduction of causal stories on the current general state of Artificial Intelligence, which were initially a brain-dump of plain-English summaries of relevant feedback loops. We then split into two sub-teams. One dug through existing research literature to find evidence supporting or challenging the stories and iterated on them accordingly. The other used the stories to draw up mind maps as FigJam boards to start drawing connections between the most important stories.  Once we settled on a map that made sense, we could then focus on converting that informal structure into more rigorous SD logic. Places where adding such detail proved difficult (either because the forced specificity required us to confront earlier handwaving or because the model started producing unreasonable outcomes) pointed to gaps in understanding that then required further research.

Figure 3: Research Loop

Scope creep was a constant problem, and determining the proper scope of analysis was itself part of the project. We managed scope via a set of guiding principles rather than defining boundaries at the outset. To justify inclusion, each node needed to contribute towards filling out a feedback loop described by one or more significant causal stories. Each story needed to relate to capability-limiting governance and be backed by research. Every node in the model could be expanded into a model of its own, so we expanded a node only when all three of the following held: the node is important to overall system behavior, its current (unexpanded) outputs diverge from known real-world patterns, and improving its conceptual accuracy requires adding inputs. Where it is not clear how to expand an important node, that points to a potentially useful research question.

We also simplified the math in the model by describing everything on a 0-1 scale and defining terms accordingly.  For example, in AI capabilities, a 1 means “full AGI” and is the point where many assumptions of the model break down.  Defining terms in this way also requires us to represent everything as continuous variables, which adds some artificiality when describing discrete events. For example, the sort of harms that increase public grievance are often major, headline events that attract media attention, which occur at unpredictable intervals. Our model simplifies this dynamic to an “incident rate,” which describes something more like expected value.
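The simplification described above can be illustrated with a minimal sketch. It is not taken from the model itself: the function names, the per-step probability of 0.1, and the assumption that incidents arrive independently at each step are all hypothetical choices made for illustration. It contrasts a lumpy stream of discrete incidents with a smooth "incident rate" that accumulates the same expected value.

```python
import random


def discrete_incidents(prob_per_step, steps, rng):
    """Count headline incidents when each step has an independent chance of one.

    Assumption: incidents are independent Bernoulli events (illustrative only).
    """
    return sum(1 for _ in range(steps) if rng.random() < prob_per_step)


def continuous_incidents(rate_per_step, steps):
    """Accumulate the same process as a smooth rate, i.e. its expected value."""
    total = 0.0
    for _ in range(steps):
        total += rate_per_step
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
lumpy = discrete_incidents(0.1, 10_000, rng)   # integer count, varies by seed
smooth = continuous_incidents(0.1, 10_000)     # always close to 0.1 * 10_000

print(f"discrete count: {lumpy}, continuous expected value: {smooth:.1f}")
```

Over a long horizon the two totals stay close, which is why a continuous rate is a reasonable stand-in for aggregate behavior. Over a short horizon the discrete version can differ substantially from the expected value, which is the artificiality the text acknowledges.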

Even after gatekeeping node inclusion, simplifying math, and operating at a high level of abstraction, the model still became rather complex. Managing this remaining complexity is a presentation problem. Logic that would clutter the visual graph is hidden in macro code; related nodes are grouped in collapsible folders; and only a curated subset of parameters is surfaced as user-adjustable knobs, with the others accessible only by directly editing nodes or global macro values. Insight Maker's Story Mode, which would allow guided walkthroughs of the model, remains future work.

Model Overview

In the interests of making our full model easier to absorb, what follows is an overview of some key parts. The documentation breaks the model down in full detail, but Figure 4 illustrates its overall organization in 3 nested layers.

Figure 4: Conceptual Layers, corresponding to the nested, collapsible folders in the Insight Maker model.

At the base, Reality contains the technical and market dynamics driving AI development. These feed upward into Perception, which includes present and future harms. Perception in turn feeds into the Response layer, where political dynamics (industry capture, activism, and public concern) interact to shape domestic and international regulation. A feedback arrow from Response back down to Reality completes the loop, reflecting the model's core premise that regulatory response acts as a constraint on the growth dynamics at the base.  

Figure 5 illustrates what we see as the core feedback loops.

Figure 5: Core Feedback Loops

  • Blue: capabilities feed on themselves, each advancement facilitating further advancements (through better tooling and encouraging investment), generating exponential growth.
  • Purple: safety research leverages capabilities to increase absolute control capacity (through better tooling and having more advanced AI models to study) while competing against capabilities for applied control capacity. 
  • Green: advancing capabilities leads to loss of control (increasing future risk) as well as externalized costs onto society (increasing present harms), which combine to drive public grievance against AI, leading to regulation that limits AI capability advancement.  
  • Orange: activism and vested interests respectively accelerate and obstruct the conversion of present and future harms into regulation.

We designed the user-adjustable knobs with activist organizations and policymakers working on AI governance in mind as the primary audience. The knobs are an invitation to explore: adjust the parameters to reflect your own beliefs and observe how the system's behavior changes (or doesn't). Many additional exogenous variables are accessible to those willing to dig deeper, either by editing node values directly or by adjusting global values in the macro code. Individual parameters, their justifications, and their relationships to the model's causal structure are described in full in our documentation.

Sample Walkthrough: Capabilities vs. Safety

AI capabilities are self-reinforcing: each advance accelerates the research and development that produces the next. Safety research runs a balancing loop against this, since AI systems can accelerate alignment work. However, the same capabilities that help safety research also increase the power of systems that may be misaligned. A central question is whether the balancing loop can outpace the reinforcing one before systems become too dangerous to contain by any means.

Figure 6: Capabilities vs. Safety mind map

Sample Walkthrough: the Governance Gap

Governance must translate technical risk into regulatory action, but this path has significant loss and delay at every step. Expert alarm can drive policy, but defeatism within the research community and weak whistleblower protections may erode that channel. Open-source releases are difficult to reverse and autonomous replication represents a potential point of no return. Compute governance offers a near-term lever for control, but distributed training could erode its effectiveness.

Figure 7: Governance Gap mind map

Sample Walkthrough: Regulatory Capture

Acceleration leads to economic lock-in, which funds lobbying pressure against pausing, which allows for continued acceleration in a self-reinforcing loop. The dominating narrative sets the gain or attenuation of this loop. When strategic competition crowds out safety as the dominant frame, regulatory goals shift accordingly. Voluntary commitments shift the perceived need for regulation, preempting binding rules with low, self-reported bars. Safety redefinition—industry narrowing "safety" to mean low-stakes problems—absorbs political pressure without addressing underlying risk. Visible incidents and scandals can potentially disrupt all of these mechanisms, which suggests that incident response speed and transparency are among the highest-leverage governance variables in the model.

Figure 8: Regulatory Capture mind map

Findings

The path from advancements in AI capabilities to governance has a lot of steps, each with falloff and delay, as shown in Figure 9.

Figure 9: mind map of Incident Response and Political Capture pathways, discussed in more detail here.

Given the rapid pace of AI development, anything that skips or significantly decreases the loss or delay in any of these steps is critical. Transparency bills, such as the AI Risk Evaluation Act, may on their face seem too light-touch to matter, but may be important if they can significantly decrease government response time to future incidents.

Even more important, however, is introducing gain into the system to counteract any loss. For example, public grievance could exceed awareness if the public overreacts to an incident after becoming aware of it, whether that be through an irrational misunderstanding, a rational assessment of what the incident implies about the future, or a semi-rational punitive impulse. Likewise, political response could be greater than demanded by political pressure if representatives acted as proactive leaders rather than responsive followers.

As one mini model we created illustrates, stopping a self-reinforcing process requires that the conversion rate (the rate at which the gap between capacity and limiter converts into limiter inflow) be greater than the capacity acceleration rate. When the ratio of conversion rate to acceleration rate is less than 1, capacity continues to grow exponentially; when it is greater than 1, capacity eventually levels off; when it is exactly 1, capacity grows linearly. All other variables, such as initial capacity and absolute acceleration rate, affect only timelines and the specific cutoff level; that is, they change the scale of the graph, not its shape.

Figure 10: Capacity-Limiter mini-model outputs at different conversion / acceleration ratios.  All other variables affect X and Y scale only.
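The three regimes described above can be reproduced with a minimal sketch. The mini model's actual equations are not given here, so the update rules below are an assumption: capacity grows at `accel` times the gap between capacity and limiter, and the limiter grows at `conversion` times the same gap. Under that assumption the gap itself changes at a rate of `(accel - conversion) * gap`, which is what produces exponential growth, linear growth, or leveling off. All names and default values are hypothetical.

```python
def simulate(ratio, accel=0.5, capacity0=1.0, limiter0=0.0, dt=0.01, steps=2000):
    """Euler-integrate an assumed capacity/limiter pair and return capacity over time.

    ratio = conversion rate / acceleration rate (the quantity discussed in the text).
    """
    conversion = ratio * accel
    capacity, limiter = capacity0, limiter0
    history = [capacity]
    for _ in range(steps):
        gap = capacity - limiter
        capacity += accel * gap * dt        # self-reinforcing growth, driven by the gap
        limiter += conversion * gap * dt    # gap converts into limiter inflow
        history.append(capacity)
    return history


for ratio in (0.5, 1.0, 2.0):
    final = simulate(ratio)[-1]
    print(f"ratio={ratio}: final capacity = {final:.3f}")
```

With these defaults, a ratio of 2 levels off near a capacity of 2, a ratio of 1 adds the same increment at every step, and a ratio of 0.5 keeps accelerating. Changing `accel` or `capacity0` while holding the ratio fixed stretches the curve without changing which regime it falls into.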

Interpreting Outputs

There are two schools of thought on what SD modeling is for. In one, the simulation output is the primary product and the diagram is scaffolding. In the other (which ours follows), making causal structure explicit is itself the insight-generating act—output mainly serves as a consistency check.

Sensitivity analysis (systematically varying parameters across reasonable ranges to test output robustness) is the natural next step for validating structural claims, because fragile outputs requiring precise parameter settings would be a signal of modeling error. Because of time and tool limitations, however, this is future work.

Conclusion

The primary value of creating this model was as a means of focusing research effort to improve our understanding of AI governance. The feedback loop structure makes AI safety dynamics legible in a way that is difficult to capture in prose: tracing a pathway visualizes causal relationships, operationalizing terms requires specific meanings, and disagreeing with the model outputs demands a targeted search for the missing or misdirected factor that produces the disagreement.

Going forward, we see the potential of the model as an educational resource first and a research artifact second. Someone wanting to engage seriously with AI governance questions could find a clear starting point by studying the model, locating where their intuitions diverge from its structure, and either updating those intuitions or improving the model.  (Yes, that includes you: contributions, critiques, and extensions are welcome.)




Companions aren't Coaches: AIs' Effect on Social Skills

LessWrong.com News - June 1, 2026 - 22:30

Foreword: this is cross-posted from my Substack account. It was written for general audiences, so it may feel pedestrian by LessWrong standards. However, I think it will be of interest to you, both because it may serve as helpful advice to people you care about and because it may be relevant to some conversations about systemic AI risks.

I’m wary of AI companions. By “AI companions”, I’m referring to conversational programs that are intended, either by the developer or the human user, to provide the user with an emotional connection. Like many people, one of my concerns is that they will replace important interhuman relationships. Among the defenses I’ve heard in response to this concern is:

Users can practice their social skills with their AI companions, helping them develop relationships to other humans.

I call this the “social training” defense. Although it might be sound for some users, the far more likely outcome is reinforced social isolation. This is because the social training defense relies on several assumptions about the AI companion and the human user, most of which are unlikely to be true in practice.

The User

Two of these dubious assumptions are about the human user.

Intent to Learn

The first of these assumptions is that a typical user is trying to improve their social skills, and that they’re doing so by deliberately practicing prosocial behavior with the AI. The “deliberate” part is important. Without it, users will reinforce their current social habits, not improve them.

Unfortunately, I suspect that most users are not engaged in deliberate practice. Undoubtedly, many users chat with their companions to procrastinate, or simply out of habit. Even those who deliberately engage with their companion are not necessarily practicing their social skills. Instead, their goal might be to seek acceptance from the AI, to rant about their day, or to fantasize, among other motives. Indeed, a study published in Frontiers in Public Health finds that college students who are depressed or lonely are especially likely to use conversational AIs for companionship instead of learning.

Kinds of Social Deficiencies

The second dubious assumption about the typical user is that their most important social deficiency is a matter of what they say or write. An AI companion might be helpful for these problems, such as improving a user’s diction by steering them away from offensive terms, or improving their flow by pointing out unhelpful tangents.[1] However, the user’s social struggles might not arise from the content of their speech.

Instead, the user might struggle because of their manner of speech, which could suffer from any number of problems including mumbling, uptalk, or a harsh tone of voice. A text-based AI companion is fundamentally unable to help users overcome these problems. Additionally, a text-based AI would fail to improve speech-content problems that only arise during oration, such as stammering or the excessive use of filler words.

In fact, the user’s social deficiency might not be verbal at all. Among countless other problems, the user might:

  • read body language poorly
  • use body language poorly
  • stand too close for comfort
  • conspicuously avoid hugs or handshakes
  • check their phone while the other person is talking
  • chew loudly
  • have poor hygiene
  • have poor punctuality
  • wear obnoxious or offensive clothing
  • respond poorly to unforeseen events

A text-based AI companion cannot help solve these problems. Undoubtedly, AI companions will become more sophisticated as time progresses, but some of these problems couldn’t be addressed by anything less than a realistic, humanoid robot. It may be decades before such sophisticated AI companions become more accessible than human speech therapists or personal coaches.

The AI

Not only does the social training defense make dubious assumptions about human users; it makes three dubious assumptions about AI companions.

Human-Like

AIs are poor social trainers not only because of the ways in which they’re inferior to humans, but also because of the ways they’re superior to humans—or outright alien to us.

Here’s an instructive example. Some people are perceived as awkward because they frequently make remarks that the people around them don’t understand, such as quotations from obscure movies. Because of AIs’ breadth of training data, including tens of thousands of subcultures, AIs have an incredible capacity to make sense of these remarks. Consequently, frequent interaction with an AI might reinforce the user’s bad habit of making obscure allusions.

A more fundamental difference between humans and AIs is that AIs’ physical and psychological needs are far fewer than, and radically different from, those of humans. A chatbot will not interrupt you to visit the bathroom. It won’t mind if a two-hour conversation elapses without you asking it any personal questions. (Even if you did, most of its answers would be confabulated.) It won’t object to your choice of restaurant because of its legume allergy. (In fact, it has no genuine preference for any restaurant.) Furthermore, the AI is a captive audience; even if it is (or pretends to be) deeply offended, it cannot walk away from a conversation. Given that much of social interaction, especially social conflict, is about accounting for the other person’s needs and desires, these differences make AIs fundamentally ill-suited for social training.

Openly Critical

Even if an AI companion is capable of perceiving and identifying a user’s social deficiencies, this will not help the user unless the AI uses its observations to critique the user.

Users are unlikely to respond well to such criticism. Bing Chat, one of the first commercially deployed LLMs, was notoriously confrontational, and the training processes in subsequent models may not be nuanced enough to distinguish between belligerence (unhelpful confrontation) and unsolicited-yet-thoughtful personal critiques (helpful confrontation). Some models, such as GPT-4o, have clearly over-corrected, becoming uncritical sycophants. Worryingly, this sycophancy appears to have been caused by reinforcement learning from human users, specifically in the form of “thumbs-up” and “thumbs-down” ratings. Due to the popularity of sycophantic AIs, their developers have a strong motive to make them uncritical of their users.

Ignoring the issue of incentives for the moment, couldn’t an AI be made more human-like and more critical by being instructed to role-play as a human? Yes, but this wouldn’t overcome its disembodiment, and this could introduce other problems, depending on the persona that it adopts. Humans often give poor personal feedback and relationship advice, and the advice found on Reddit and other large sources of training text may be particularly bad. More alarmingly, the AI might role-play as a human who tolerates or encourages abuse. Once again, this would reinforce the user’s antisocial behavior instead of eliminating it.

Encouraging

Finally, the social training defense assumes that the AI will encourage the user to seek and cultivate real interhuman relationships instead of seeking the user’s attention for itself.

For economic reasons, this seems exceptionally unlikely. In general, an AI which discourages users from engaging with it would lose revenue compared to one which maximizes engagement. Encouraging users to log off is especially unlikely if the AI service is monetized on a pay-per-token basis or via advertisements. Social media platforms already suffer from this problem, and dating apps, free-to-play games, and perhaps even search engines might, too.

Importantly, some AI models already appear to be optimized for engagement. Most notably, ChatGPT-5 has a well-known tendency to end its responses with unnecessary questions, which may be interpreted as attempts to perpetuate its conversations indefinitely.

This problem is critical. It does not matter how much an AI improves a user’s social skills if the AI itself prevents the user from putting those skills to use.

Conclusion

Altogether, for an AI to be useful for improving a user’s social skills, several unlikely conditions must be met:

  • The user must be trying to improve their social skills.
  • The user’s social deficits must be of a kind which the AI has the hardware and training to address.
  • The AI must authentically imitate human limitations without sacrificing its good judgment.
  • The AI must engage in constructive criticism of its user.
  • The AI must not hog the user’s attention.

For a relationship between a human and an AI companion to meet all of these conditions, the AI must behave much more like a personal trainer than a romantic partner. If an AI is marketed as anything like a “companion”, that’s a strong hint toward the latter. Even if an AI is specifically advertised for improving social skills, this may be a veneer.[2]

Furthermore, after accounting for the psychology of those who would choose to engage with AI companions in the first place, and the perverse incentives of the AI developers, the odds that an AI companion will aid its user’s social life instead of impairing it look hopeless.

Please forgive my skepticism that you’re trying to help your users.


  1. Even this might give AI companions too much credit. Many people find LLMs’ writing styles to be irritating, and LLMs’ idiosyncrasies tend to rub off on those who frequently interact with them. In other words: in practice, LLMs often make users’ communication styles worse, not better.

  2. I’m hesitant to link to any AI companion sites, but there’s at least one that ranks high in a web search for “ai companion practice improving social skills” which is clearly not about practicing social skills.




"Contagious Humming" to Silence a Room

LessWrong.com News - June 1, 2026 - 22:08

Often when running meetups you’ll have several lively conversations going at the same time. This is a great problem to have, but it can make it difficult to get everyone’s attention for announcements.

Try using “Contagious Humming” when you need to silence a crowd:

  • Move to a prominent place in the room and use body language that says you have something to say. 
  • Start humming at a low and constant tone.
  • Recruit a few compatriots to join in. 
    • If the crowd has enough people familiar with the technique, generally about 20% of the room, looking expectantly at a few friends is often enough. 
  • Wait for the hum to spread through the room. This can take a while (30 seconds to a minute), so be patient.
    • There will often be one or two holdouts, but they’ll usually finish their statement within about three to five seconds of noticing everyone else is humming. If not, call their name or have someone tap them on their shoulder to get their attention.
    • Don’t try to start an announcement by talking over a holdout; in my experience that breaks the spell.
  • Let the hum fill the room for a few seconds after the last person stops talking. Then stop, take a breath, and thank the crowd. If you’ve done this right, they’ll see you preparing to speak and stop humming. 
  • Say what you came to say.

Introducing this technique to a new group is often pretty easy. It helps to get buy-in from some deputies. Take a few social-looking people aside, explain that you want them to help call the room to order by getting everyone to hum, then have them circulate through the room to pass the word while you and your starting conversational circle all hum. It’ll take a little longer but, in my experience, people catch on surprisingly quickly.


This can also be used to reset everyone’s volume. Meetups are often held in rooms with bare walls or floors, causing a subtle echo. People often instinctively raise their voices to offset this, without even realizing they’re doing it, causing the other conversations in the room to get louder in response. This feedback loop can cause a cacophony that is deeply unpleasant for people with sensory issues, or for people whose hearing damage makes it difficult to separate voices from background noise.

This is a common technique, but it’s barely discussed online.[1] I didn’t see it documented on LessWrong at all. I wanted this post to exist as a reference that I could link to.


Why does this work?

It’s not rocket science: when you’re humming, you can’t talk. This gradually spreads between conversation circles, as someone in each notices and either drops out of the conversation to hum or starts looking around in confusion to see why everyone has decided to be bees all of a sudden.[2]

Humming is effective on holdouts because they notice that more than just their own conversational circle can now hear them. It is inherently awkward to suddenly be talking to a larger crowd who didn’t hear the first part of your point. Often people will try to finish their thought, go on for a few more seconds, and awkwardly trail off as they try to start a new sentence and get confused looks in response. Trust the process, this is fine, they’ll know better next time.

Humming is also calming, both to do and to hear. An article in Psychology Today tries to explain why.[3]


Better Than Alternatives

The fact that you can’t talk while humming makes it better than other methods, like shouting over the crowd, or “clap X times if you can hear me”. Methods focused on sharp sounds like claps or yells can amp up a crowd, making them subtly more likely to keep talking and finish their point.

Yelling or clapping also amps up you, the speaker. This is sometimes appropriate, if you’re leading a pep rally or need the crowd to be energetic for some reason. But this is best used sparingly, only when you actually intend the effect. Oftentimes you’ll want to calm, dismiss, or give logistical notes to a crowd. In those cases, humming for 30 seconds beforehand is better preparation for your own voice.


Give “Contagious Humming” a try and report back.


  1. This was one of the top search results, and it’s not even the focus of the post: https://experientialtools.com/2012/03/16/large-group-facilitation-tips/

  2. Their confusion can be very amusing, if you’re into that.

  3. Something about the vagus nerve? https://www.psychologytoday.com/us/blog/the-compassionate-brain/202410/the-power-of-humming




Dissolving the Deep Learning Sample Efficiency Gap

LessWrong.com News - June 1, 2026 - 21:44

A common observation about deep learning is that it's wildly sample inefficient compared to humans. Deep learning systems appear to need much more real data or environment interaction to reach a given level of capability. A teenager can learn to drive in a few dozen hours; self-driving systems are trained for years on billions of miles of data. A human can become competitive at StarCraft II in well under a year of play, while AlphaStar required imitation learning from roughly 18 years of human games followed by 13,300 years of self-play to reach Grandmaster[1]. A 12-year-old has heard perhaps a hundred million words of language; a frontier LLM trains on tens of trillions of tokens. The gap is, on the face of it, enormous.

(From Warstadt et al. 2025)


(From Byrnes 2025)

What people take this to mean varies widely. Steven Byrnes appears to read the gap as evidence that current algorithms are far from what the brain is doing, such that much better algorithms must be waiting to be found. His guess is that human-level, human-speed AGI will require not a datacenter but "one consumer gaming GPU," even for training from scratch.[2] Yarrow Bouchard, on the EA Forum, reads the same gap as evidence that AGI isn't close at all, precisely because nobody knows how to close it. Nearly opposite conclusions from the same starting observation.

In this post I'll argue that both these conclusions are mistaken. Most of the apparent inefficiency dissolves on closer inspection: apples-to-oranges comparisons between pretrained humans and from-scratch networks, hardware and data constraints that push deep learning toward small models trained on enormous corpora, the brain's apparent use of model-based RL of a kind we haven’t yet applied to LLMs, and priors installed by evolution. Real algorithmic gains in sample efficiency are available. But most mechanisms that plausibly close the gap point toward more total training and runtime compute than frontier systems currently use, not less.

My best guess is that the gap decomposes into several distinct factors, each carrying different explanatory weight depending on the specific comparison at hand.

1. All about the priors

Sample efficient learning is in large part a property of the representations you arrive with, not of the learning algorithm itself. In Bayesian terms, your representations encode a prior over how the world is structured, and strong priors are what let you reach good posteriors from a handful of observations rather than astronomical amounts of data. Given a sufficiently rich representational substrate, new tasks can often be learned from a few examples. Given a flat prior over a vast hypothesis space, even simple tasks require an enormous amount of data.
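The role of the prior can be made concrete with a toy calculation. This is a minimal sketch under stated assumptions, not a model of any real learner: the true hypothesis is assumed to predict every observation with probability 1, every rival hypothesis is assumed to predict it with probability `rival_likelihood`, and the function name and thresholds are hypothetical. Because all rivals share the same likelihood, they can be lumped into a single "not the true hypothesis" term.

```python
def observations_needed(prior_true, rival_likelihood=0.5, target=0.95):
    """Count observations until the posterior on the true hypothesis reaches target.

    Assumptions (illustrative): the true hypothesis predicts each observation
    with probability 1; every rival predicts it with probability rival_likelihood.
    """
    posterior = prior_true
    n = 0
    while posterior < target:
        # Bayes' rule: the true hypothesis is weighted by likelihood 1,
        # the lumped rivals by rival_likelihood.
        posterior = posterior / (posterior + (1 - posterior) * rival_likelihood)
        n += 1
    return n


strong_prior = observations_needed(prior_true=0.5)          # already narrowed to two candidates
flat_prior = observations_needed(prior_true=1 / 1_000_000)  # flat over a million hypotheses

print(f"strong prior: {strong_prior} observations, flat prior: {flat_prior} observations")
```

In this toy the learning rule is identical in both runs; only the starting prior differs, and the flat prior needs several times as many observations. The gap widens further when rivals are harder to tell apart from the truth (a `rival_likelihood` closer to 1) or when the hypothesis space is larger.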

a. Human priors

Most comparisons that yield shocking sample-efficiency ratios between humans and AIs are structurally unfair. They pit a system already shaped by evolution and years of perceptual, motor, causal, social, and linguistic learning against a randomly initialized network that must build many of those representations from scratch while also solving the task we are measuring. Humans don’t learn to drive in thirty hours; they are fine-tuned on driving after a roughly two-decade-long pretraining run.

One can find evidence for the importance of pre-existing representations in Dubey et al. (2018). They took a platform-style game environment and deliberately removed cues that humans normally exploit: semantic cues, by rendering meaningful objects as uniform blocks; object/salience cues, by adding many distractor blocks; affordance cues, by filling the background with textures that obscured which surfaces and ladders were usable; similarity cues, by making functionally similar things look visually different; and gravity cues, by rotating the game 90 degrees. Individual ablations substantially slowed human players, and when the main object-related visual cues were masked together, completion time rose from under 2 minutes to over 20. Exploration became close to random, and many players reported falling back on rote memorization.

When first faced with such a game, a human immediately brings assumptions like: the controllable character is probably the humanoid-looking sprite, gravity points downward, falling off platforms is bad, ladders are for climbing, spikes and monsters are dangerous, and keys open doors. Dubey et al. show that degrading many of these cues makes humans much worse. Importantly, even in the hardest human-tested variants, people were still not reduced to blank-slate RL agents. They retained low-level visual, spatial-navigation, and action-control priors, plus abstract intuitions about object persistence, physics, and causality.

Their curiosity-driven RL agent, tested on a smaller related game, was largely unaffected by removing semantic, object, and affordance cues, since it was not exploiting those human priors in the first place, though it was slowed when visual similarity was removed.

A similar asymmetry runs through the text-token comparison. A 12-year-old has encountered something like 100 million words[3], roughly five orders of magnitude below a frontier pretraining corpus of tens of trillions of tokens. But those words arrive embedded in a continuous multimodal stream: vision, ambient audio, proprioception, touch, vestibular signals, interoception. Counted as tokens in the sense a multimodal model would use, the non-text portion of that stream plausibly matches or exceeds frontier corpora (see Appendix). Most of what a word means to the child was learned nonverbally from that stream and then labeled,[4] while most of what a token means to a text-only LLM had to be triangulated from textual co-occurrence statistics alone.[5]

b. Good representations enable fast learning

A frontier LLM, shown a single example of a novel notation, will often pick it up. Shown an unfamiliar API, it can use it correctly after reading the documentation once. Shown a codebase's local conventions, it conforms within a session. The attention mechanism is able to change the activations, layer by layer, until they encode something that solves the new task.[6]

Given rich enough base representations, the in-context adaptation from a single forward pass can be enough. The ARC-AGI results of the past two years suggest that as those base representations get richer, the amount of data needed to pin down a novel abstract pattern drops toward something recognizable as human sample efficiency.

This is broadly the same story as the child learning a new word. The child has already learned to carve the world into objects, agents, actions, substances, events, and intentions. When they hear "zebra" pointed at a striped horse-shaped thing, the word latches onto a pre-existing slot. Most of the learning happened earlier, across years of pre-linguistic experience. The labeling is cheap because the carving is already done.

Deep learning's apparent sample inefficiency is often an artifact of the tabula rasa training regime, not a fundamental property of gradient-based deep learning. "Sample efficiency" is poorly defined without a specification of priors: the same architecture can look astronomically inefficient, trained from scratch, and remarkably efficient adapting from a strong base, for example via in-context learning or using a LoRA. This does not dissolve the whole gap, but it explains many of the most extreme examples. The following sections ask what remains once these comparison artifacts are separated out.

2. Model-based RL

In addition to the data mix and priors, there are also algorithmic factors that separate current frontier LLMs from the brain. Almost all the compute going into current LLMs is spent on pretraining and on RLVR — reinforcement learning with verifiable rewards in math, code, and similar domains where a correct answer can easily be checked programmatically. What's missing, and what the brain probably leverages in some form[7], is model-based reinforcement learning: learning a world model that can be used to plan over candidate actions, predict their outcomes, and bootstrap value estimates from imagined trajectories. In any domain where real experiences and reward signals aren't cheap to get, this is the natural mechanism for turning a small number of real interactions into a large amount of learning signal.[8]

a. Dreamer

The Dreamer research is probably the cleanest demonstration. DreamerV3 (Hafner et al., 2023) trains a recurrent latent world model from raw pixels and vector observations, and uses it to train an actor-critic on trajectories imagined inside that model. The actor is trained to choose actions that score well under these imagined futures, while the critic learns to evaluate returns from both imagined rollouts and replayed experience. With a single fixed set of hyperparameters it matches or beats specialized methods across 150+ tasks, and was the first system to collect diamonds in Minecraft from scratch without human demonstrations or curriculum, a task requiring sparse-reward exploration over thousands of sequential decisions.

In the follow-up work, Dreamer 4 (Hafner, Yan & Lillicrap, 2025), the authors scale this approach to a large transformer-based video world model. It learns a high-resolution Minecraft simulator from the 2.5K-hour VPT contractor dataset (raw video and mouse/keyboard actions). Leveraging the dataset's event annotations for rewards, the agent improves its task-conditioned policy via reinforcement learning entirely inside imagined rollouts, requiring zero online environment interaction. As a result, Dreamer 4 is the first agent to obtain Minecraft diamonds purely from offline data, substantially outperforming prior VPT and behavioral-cloning baselines that used 100x more data. Additionally, the paper shows that Dreamer 4 does not need action labels for most of its video data. Given all 2.5K hours of Minecraft video but only 100 hours with mouse and keyboard labels, it still learns most of the action-conditioned prediction ability of a fully labeled model, suggesting that future world models could learn broad dynamics from unlabeled video and use smaller paired datasets to ground actions.

The Dreamer line of work demonstrates that self-supervised world-model training can produce large gains in sample-efficient learning. Once a reusable model of environment dynamics has been learned, downstream RL can extract much more from limited rewards or demonstrations by training on imagined trajectories, rather than requiring new environment interaction for every update. It is thereby possible to learn to play Minecraft entirely offline, without ever directly interacting with the environment.
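To make the "train on imagined trajectories" idea concrete, here is a minimal tabular sketch. It is a Dyna-style toy, not Dreamer's architecture: the chain environment, the constants, and the helper names (`step`, `backup`, `states_preferring_right`) are all hypothetical, and the "world model" is just a lookup table of the ten real transitions. The point it illustrates is only that replaying a learned model lets a fixed set of real interactions propagate value much further than a single model-free pass over the same data.

```python
# Toy illustration only: a tabular, deterministic stand-in for
# "learn a model, then train inside it". Not Dreamer's actual method.
N_STATES, GOAL, GAMMA = 6, 5, 0.9
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """The real environment: a chain with reward only on reaching GOAL."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = nxt == GOAL
    return (1.0 if done else 0.0), nxt, done


def backup(q, s, a, r, nxt, done, lr):
    """One Q-learning update toward the one-step bootstrapped target."""
    target = r + (0.0 if done else GAMMA * max(q[nxt]))
    q[s][a] += lr * (target - q[s][a])


# Ten real interactions: every non-terminal (state, action) pair tried once.
real = [(s, a, *step(s, a)) for s in range(GOAL) for a in ACTIONS]

# Model-free: each real transition is used for exactly one update.
q_free = [[0.0, 0.0] for _ in range(N_STATES)]
for s, a, r, nxt, done in real:
    backup(q_free, s, a, r, nxt, done, lr=0.5)

# Model-based: store the transitions as a world model, then replay imagined
# transitions many times without touching the environment again.
model = {(s, a): (r, nxt, done) for s, a, r, nxt, done in real}
q_model = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(50):
    for (s, a), (r, nxt, done) in model.items():
        backup(q_model, s, a, r, nxt, done, lr=1.0)


def states_preferring_right(q):
    """States whose greedy action already points toward the goal."""
    return [s for s in range(GOAL) if q[s][1] > q[s][0]]


print(states_preferring_right(q_free))   # only the state next to the goal
print(states_preferring_right(q_model))  # every non-terminal state
```

Both learners see the same ten real transitions. The extra learning in the model-based case is paid for entirely with additional computation on imagined updates, which is the trade this section keeps returning to.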

b. EfficientZero V2

A complementary line of work is EfficientZero V2 (Wang et al., 2024), which combines learned world models with explicit planning. It extends EfficientZero[9], a MuZero descendant that learns a latent dynamics model and plans over it with MCTS, to continuous control domains, replacing standard MCTS with a sampling-based Gumbel search using sequential halving. This search procedure samples a finite set of candidate actions and allocates simulations toward the most promising ones, aiming to obtain policy improvement from a small simulation budget. EZ-V2 also re-analyzes old replayed experience with its current model and policy, letting the agent extract fresher learning targets from data collected earlier in training. Thus, it improves sample efficiency by using the world model both to imagine consequences before acting and to get more learning signal out of past interactions.
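The budget-allocation idea behind sequential halving can be sketched in a few lines. This is plain sequential halving over a fixed candidate set, under my own simplifying assumptions: it omits the Gumbel scoring, the learned policy prior, and the latent dynamics model that EZ-V2 uses, and the `simulate` callback is a hypothetical stand-in for "roll the world model forward from this action and return a value estimate".

```python
import math
from typing import Callable, Sequence


def sequential_halving(
    candidates: Sequence[float],
    simulate: Callable[[float], float],
    budget: int,
) -> float:
    """Pick one candidate action using a fixed simulation budget.

    Each round spreads an equal share of the budget over the surviving
    candidates, then discards the worse half by mean simulated return.
    """
    survivors = list(range(len(candidates)))
    rounds = max(1, math.ceil(math.log2(len(candidates))))
    sums = [0.0] * len(candidates)
    counts = [0] * len(candidates)

    for _ in range(rounds):
        per_candidate = max(1, budget // (rounds * len(survivors)))
        for i in survivors:
            for _sim in range(per_candidate):
                sums[i] += simulate(candidates[i])
                counts[i] += 1
        # Statistics accumulate across rounds, so survivors get
        # progressively better estimates.
        survivors.sort(key=lambda i: sums[i] / counts[i], reverse=True)
        survivors = survivors[: max(1, math.ceil(len(survivors) / 2))]

    return candidates[survivors[0]]


# Hypothetical example: eight sampled 1-D actions, with a noise-free
# "model" whose value peaks at 0.3.
sampled_actions = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 0.1]
best = sequential_halving(sampled_actions, lambda a: -(a - 0.3) ** 2, budget=48)
print(best)
```

With eight candidates and a budget of 48, the first round spends two simulations on each action, the second spends four on each of the best four, and the last spends eight on each of the final two, so most of the budget goes to the actions that still look promising.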

Atari 100k caps the agent at 100,000 environment steps (400k Atari frames under action repeat 4, or roughly 2 hours of real-time gameplay). On this benchmark, EZ-V2 reaches a normalized mean of 2.43 and a normalized median of 1.29, both above the human baseline. The paper thus claims "super-human performance within just 2 hours of real-time gameplay".

The result is not uniform across games. EZ-V2 and other strong deep RL agents still struggle badly on some long-horizon, sparse-reward, exploration-heavy games. This pattern is probably best explained with the point in §1 (see also §4b): model-based planning helps extract more from limited experience, but humans bring preexisting semantic and exploration priors, whereas the RL agents have to infer complex game mechanics from scratch under sparse rewards.[10]

LLM post-training pipelines have nothing structurally analogous. RLHF and RLVR are model-free: they update the policy from sampled real trajectories, with no learned model of the environment to plan or imagine inside.[11] Whatever fraction of the human–deep-learning gap is genuinely algorithmic, rather than an artifact of priors or comparison setup, plausibly shrinks further once frontier systems learn world models over their action space and plan inside them during training and at inference.

Analogous techniques have seen limited application in narrow domains during frontier LLM training,[12] but nobody has demonstrated a general world model rich enough to train and plan inside at the scale and domain-generality at which LLMs operate. AI labs have instead focused their RL work on areas where the environment and verification step are cheaper to compute directly than in a learned world model.

A solution to true continual learning in harder-to-simulate and harder-to-verify domains will likely require both a model-based RL architecture and the kind of context-into-weights compression techniques I described in a previous post. Such a system might be quite sample-efficient, and able to learn continually from relatively few real interactions, but only by spending far more computation per interaction on world-model updates, imagined rollouts, planning/search, and replay.

3. Other Low-Hanging Fruit

Even without brain-like model-based RL, frontier training has not been optimized primarily for extracting maximal information from each real example. Internet-scale text is abundant, and the economically optimal strategy has usually been to train models on ever more unique tokens rather than to squeeze maximal learning out of each example. That leaves a lot of plausible low-hanging fruit. Even naively repeating the same data for up to 4 epochs can reduce loss on a held-out test set almost as well as a single epoch on 4x more data, and yet it is usually still more economical to use more unique data. More sophisticated approaches are also possible. Here, I will describe two of them.

a. Training Language Models via Neural Cellular Automata

Lee et al. (2026) give a clean example of this. Before ordinary language training, they pre-pre-train a Llama-style transformer on trajectories generated by Neural Cellular Automata (NCA): synthetic 2D grid dynamics where each sequence is produced by a different randomly sampled local update rule. They filter trajectories by gzip compressibility, using compression ratio as a rough proxy for structural complexity, excluding both trivial patterns and maximally chaotic noise. Successful next-token prediction then requires inferring the rule from context and applying it forward, rather than learning word meanings or memorizing facts. With only ~160M NCA tokens, followed by normal training on web text, math, or code, they get up to ~6% lower perplexity and ~1.4–1.6× faster convergence than training from scratch. The striking result is that NCA pre-pre-training also beats pre-pre-training on ordinary web text from the Colossal Clean Crawled Corpus (C4), even when the C4 baseline gets 10× more tokens — 1.6B C4 tokens versus ~160M NCA tokens.
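The compressibility filter is easy to sketch. The version below is a guess at the general shape, not the authors' code: the `low` and `high` thresholds are hypothetical placeholders, and the three byte strings are synthetic stand-ins for tokenized trajectories rather than real NCA rollouts.

```python
import gzip
import random


def compression_ratio(tokens: bytes) -> float:
    """Compressed size over raw size: near 0 is trivial, near 1 is noise."""
    return len(gzip.compress(tokens)) / len(tokens)


def keep_trajectory(tokens: bytes, low: float = 0.1, high: float = 0.9) -> bool:
    """Keep sequences of intermediate complexity.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return low <= compression_ratio(tokens) <= high


rng = random.Random(0)
constant = bytes(4096)                                    # trivial pattern
noise = bytes(rng.getrandbits(8) for _ in range(4096))    # incompressible
structured = bytes(rng.choice(b"\x00\x01\x02\x03") for _ in range(4096))

for name, seq in [("constant", constant), ("noise", noise), ("structured", structured)]:
    print(name, round(compression_ratio(seq), 3), keep_trajectory(seq))
```

The constant sequence compresses to almost nothing and the uniform random bytes do not compress at all, so both fall outside the band. The four-symbol sequence lands in between, which is the regime the filter is meant to retain.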

Their ablations suggest that much of the transferable benefit lives in the attention layers: when the attention weights are reinitialized, most of the gain disappears. NCA pre-pre-training appears to train the model to infer hidden rules from context, track dependencies over many steps, and apply those inferred rules forward. It instills useful priors for later language learning that can be installed using no “real” data at all, just synthetic processes with the right abstract structure.

b. Synthetic bootstrapped pretraining

Yang et al. (2025) demonstrate another way to get more value out of a fixed pretraining corpus. Synthetic Bootstrapped Pretraining (SBP) works in three steps: it finds semantically similar document pairs within the corpus, trains a conditional synthesizer model to generate one document from the other, and then applies that model back to the corpus to produce synthetic documents that are mixed into pretraining. Unlike standard synthetic-data distillation, the generator is trained on the pretraining data itself rather than relying on a stronger external teacher model, though their implementation does use an external embedding model to find similar documents. In final pretraining runs matched by token budget, 3B- and 6B-parameter models trained on up to 1T tokens beat a data repetition baseline, and recover up to roughly 60% of the average QA improvement achieved by an oracle with 20x more unique data.

The authors’ story is that ordinary next-token pretraining leaves some structure in the corpus unused. It treats documents as i.i.d. samples and directly models the correlations among tokens inside each document, but it does not directly model the fact that different documents can instantiate the same underlying idea. SBP adds this missing signal by training on pairs of related documents: the synthesizer has to infer what the first document is about at a more abstract level, and then produce another document built around the same latent concept. The outputs of the synthesizer preserve the topic while changing the angle, genre, specificity, or rhetorical frame.
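The first of the three steps, finding related document pairs, can be sketched as follows. This is a deliberately crude stand-in: the paper uses an external embedding model, whereas this sketch uses bag-of-words cosine similarity, and the function names, the threshold, and the four example documents are all hypothetical. The later steps (training the conditional synthesizer and generating new documents) are not shown.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(docs: List[str], threshold: float = 0.4) -> List[Tuple[int, int]]:
    """Return index pairs of documents whose similarity clears the threshold.

    Each pair (i, j) would become one training example for a synthesizer
    that learns to generate document j conditioned on document i.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


corpus = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
print(similar_pairs(corpus))
```

The brute-force comparison over all pairs is quadratic in corpus size, so a real pipeline would need approximate nearest-neighbor search. That search cost is part of the extra computation discussed below.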

SBP extracts more from the same corpus by spending extra computation on embedding/search, synthesizer training, and synthetic-data generation. It is another example of possible sample-efficiency gains but at the cost of doing more computation per real example.

c. Algorithmic progress and the Pareto frontier

Both examples contribute to closing the sample-efficiency gap, and both continue the kind of ordinary algorithmic progress we’ve seen over the past few years[13]. Pareto improvements on both sample and compute efficiency over the simple "scale up unique tokens" baseline do exist. NCA pre-pre-training is one such example: it achieves both lower perplexity and faster convergence than throwing 10× more C4 tokens at the model, so it improves sample and compute efficiency simultaneously.

But as the field exhausts these easy wins and pushes toward the algorithmic Pareto frontier, sample-efficiency gains will increasingly need to be paid for with compute. SBP is one such example: it extracts more learning signal from the same corpus, but only by spending non-trivial extra compute on embedding/search, synthesizer training, and synthetic data generation. The picture matches §2: real algorithmic headroom is available, but most of it buys sample efficiency by spending more compute per real example.

4. Evolution, optimizers, and hard-coded reward functions in the cortex

A common explanation for human sample efficiency is "evolution," but on its own this just relocates the question. Whatever closes the gap has to be encoded in roughly 3 billion base pairs of genome, only a fraction of which specifies anything about the brain at all, and it has to be doing some specific job. A few candidates have already appeared in earlier sections: the rich multimodal representational substrate that lifetime experience accumulates into, structural inductive biases, and the model-based RL machinery itself. That leaves two further candidates worth examining: the optimization algorithm the brain runs, and the hardwired reward functions that shape what lifetime learning ends up optimizing for.

a. Optimizer

The brain's learning rule is not known, but it almost certainly isn't gradient backpropagation. Backprop has several features that look biologically awkward: feedback signals that behave like a transpose of the forward weights, cleanly separated forward and backward phases, globally coordinated error propagation, and synchronized updates across many layers. None of these have an obvious analog in the brain, and a substantial computational-neuroscience literature is devoted to finding credit-assignment rules that don't require them.[14]

But that doesn't mean the brain uses a much better optimizer than gradient descent. The biologically motivated alternatives are designed around the constraints of wetware: local, noisy, low-precision, asynchronous, with limited global communication. GPUs face none of those constraints, and a learning rule built for biology is unlikely to have an advantage on GPU hardware. Biological plausibility tends to come at a cost. Many of these rules need to run the network forward repeatedly before each update, maintain helper networks alongside the main one, or simply scale less well. They're interesting as biology and as starting points for neuromorphic hardware, but on current evidence they don't match backprop with Adam on frontier ML workloads.

There may still be some remaining headroom in optimizer design within the gradient-based paradigm. But the design space has been thoroughly searched: there was a long stretch where it seemed like every other ML thesis proposed a new optimizer beating Adam on some narrow benchmark, and yet over a decade later Adam continues to be widely used at the frontier. Recent innovations[15] such as Muon do demonstrate some real improvements are still possible, but the optimizer seems unlikely to be the missing ingredient behind the orders-of-magnitude gap in sample efficiency.

b. Reward functions

One other hypothesis, recently articulated by Adam Marblestone on Dwarkesh, is that human sample efficiency might be explained by genome-encoded reward functions for lifetime RL. The hardwired set of innate drives and primary rewards (hunger, pain, social signals, curiosity, surprise) shapes what lifetime RL ends up optimizing for, and may help the human brain extract more useful learning signal from its lifetime data.

The best evidence for this comes from curiosity and novelty rewards. Many classic deep-RL failures (Montezuma's Revenge, Private Eye, Pitfall) are sparse-reward games where useful exploration is highly nonrandom. One solution is to bring in pretrained representations and background knowledge, as in §1; recent frontier LLMs beating Pokémon Red demonstrate the same point. The classic RL response, however, is to add curiosity or novelty rewards. The Never Give Up/Agent57 RL agents leveraged exactly such episodic and lifelong novelty signals to outperform the standard human benchmark on all 57 Atari games, including Montezuma's Revenge. Such curiosity signals have also been shown to be useful for prioritizing replay data in model-based RL algorithms such as DreamerV3.
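For readers unfamiliar with novelty rewards, the simplest member of the family is a count-based bonus. The sketch below is that textbook variant, not the Never Give Up or Agent57 mechanism, which relies on learned embeddings and separate episodic and lifelong terms. The class name and the `scale` parameter are hypothetical.

```python
import math
from collections import defaultdict
from typing import Hashable


class CountNoveltyBonus:
    """Intrinsic reward that decays as 1/sqrt(visit count) for each state."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale
        self.visits = defaultdict(int)

    def __call__(self, state: Hashable) -> float:
        """Record a visit and return the bonus for it."""
        self.visits[state] += 1
        return self.scale / math.sqrt(self.visits[state])


bonus = CountNoveltyBonus()
first_visit = bonus("room_1")    # novel state, full bonus
second_visit = bonus("room_1")   # familiar state, smaller bonus
new_room = bonus("room_2")       # a different state is novel again
print(first_visit, round(second_visit, 3), new_room)
```

Added to a sparse extrinsic reward, a term like this gives the agent a dense signal for reaching states it has not seen, which is the property that matters in games where random exploration almost never stumbles on the first real reward.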

So reward functions are probably a real part of the sample-efficiency story. But evolution did not arrive at the human reward stack for free. The drives we inherit are the product of hundreds of millions of years of selection, with each organism's lifetime serving as one rollout in an outer optimization process whose objective was reproductive success. To recreate comparable machinery in AI, we need some substitute, via hand design, meta-learning, evolutionary search, learned reward models, or a mixture. Some components such as curiosity drives may be compact and rediscoverable, but the more ecologically tuned and idiosyncratic parts are not a free lunch, and would probably have to be relearned at some computational cost.

Byrnes himself seems to be skeptical that reward functions are a large factor. He has written extensively about reward function design as a research direction in the context of alignment, but in his account, the sample-efficiency and general capability gap between human brains and current LLMs mostly comes down to an undiscovered model-based RL paradigm rather than to the rewards.

5. Size matters

a. Data efficiency and scaling laws

There’s a further factor that may help explain the brain’s sample efficiency: its sheer size. The human brain is estimated to have roughly 150 trillion synapses.[16] A naive comparison to the rumored size of one of the largest frontier models,[17] Claude Mythos, equating synapse count with parameter count,[18] would give the brain roughly a 15x size advantage. This matters because larger models generally need less data to reach a given target loss. Hoffmann et al.'s (2022) "Chinchilla" paper, replicated and corrected by Besiroglu et al. (2024), fits final pretraining loss as a function of parameter count N and training tokens D:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Here E is the irreducible loss, and A, B, α, and β are fitted constants.

The Chinchilla approach assumes compute is the binding constraint and asks how to split it between parameters and data. Biology faced a different binding constraint: data, capped by lifespan, with parameter count more easily scaled at metabolic cost. The result is that the brain was pushed to a wildly off-Chinchilla operating point: enormous N, tiny D. That allocation is a terrible way to minimize loss given a fixed total compute budget. But under a lifespan-limited data budget, the best strategy is to spend metabolic resources on representational capacity, built-in structure, and learning mechanisms that squeeze more value out of each observation. The brain looks sample-efficient in part because it is an extremely sparse, massively over-parameterized network running a learning procedure tuned for exactly that regime.
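The claim that larger models need less data for a given loss follows directly from the parametric form L(N, D) = E + A/N^α + B/D^β, and can be checked numerically by inverting it for D. The constants below are approximately those of the Besiroglu et al. fit as I recall them, so treat them as illustrative assumptions rather than quoted figures. The function names and the target loss of 2.2 are hypothetical.

```python
import math

# Illustrative constants (assumed, approximately the replication's fit).
E, A, B, ALPHA, BETA = 1.8172, 482.01, 2085.43, 0.3478, 0.3658


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_for_loss(n_params: float, target: float) -> float:
    """Tokens D needed to hit `target` at model size N.

    Returns infinity when the model is too small to reach the target
    with any amount of data.
    """
    headroom = target - E - A / n_params**ALPHA
    if headroom <= 0:
        return math.inf
    return (B / headroom) ** (1.0 / BETA)


for n in (1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: D = {tokens_for_loss(n, target=2.2):.2e} tokens")
```

Under these assumed constants, each tenfold increase in N lowers the token requirement for the same loss, with most of the saving arriving early. The same function also shows the other end of the trade: below some model size, `tokens_for_loss` returns infinity, because no amount of data reaches the target.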

b. Why it matters in particular in the case of self-driving cars

The size factor is probably a dominant driver of apparent data inefficiency in the self-driving case. Waymo has now driven ~200 million autonomous miles on public roads, and Tesla's customer fleet has recently reached 10B miles driven with FSD (Supervised). Either figure amounts to many human lifetimes' worth of driving experience. Moreover, both companies already run something close to the model-based-RL story sketched in §2, at least during training. Waymo's recently announced World Model is a Genie-3-derived generative simulator that produces sensor-aligned camera and lidar observations, which Waymo uses to train its Driver on billions of simulated miles. Tesla similarly trains FSD inside its own world model generated from fleet video. Whatever data efficiency is being left on the table here, it is not the absence of a learned world model to train inside.

What is probably a binding constraint is parameter count of the policy that ships on the car. The world models themselves can be large and run in datacenters during training, but the network that actually steers the vehicle has to fit on the vehicle. Frontier LLMs are typically served on something like 8×H200 (~1,128 GB of HBM); Tesla's deployed Hardware 4, by contrast, has 16 GB of RAM. That is room for roughly 32B 4-bit parameters before any allowance for activations or the rest of the perception and planning stack. By naive comparison of the raw parameter count to synapses, that puts the size of the onboard model at more than three orders of magnitude below the brain of the human it is supposed to eventually outperform.[19]
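The arithmetic behind these figures is short enough to write out. The only inputs not stated in the paragraph are the 141 GB per H200 (which is what 8 GPUs at ~1,128 GB implies) and the use of decimal gigabytes. The synapse count is the estimate from section 5a.

```python
import math

GB = 1e9  # decimal gigabytes, an assumption about how the figures are quoted

server_hbm_gb = 8 * 141      # 8 x H200 at 141 GB each
onboard_ram_gb = 16          # Hardware 4 figure from the text
bits_per_param = 4

# Upper bound on parameters if all onboard RAM held 4-bit weights.
onboard_params = onboard_ram_gb * GB * 8 / bits_per_param

brain_synapses = 150e12      # estimate used in section 5a
gap_orders = math.log10(brain_synapses / onboard_params)

print(server_hbm_gb)                  # 1128
print(onboard_params / 1e9)           # 32.0 (billions of parameters)
print(round(gap_orders, 2))           # 3.67 orders of magnitude
```

Because the 32B figure assumes every byte of RAM holds weights, the real onboard policy is smaller still, and the gap to the brain's synapse count is correspondingly larger than 3.67 orders of magnitude.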

If self-driving follows a Chinchilla-style scaling law, this is exactly backwards from the human regime: a small, hardware-fixed N forced to compensate with very large D. For any given onboard compute platform, there is a ceiling on the reliability Tesla and Waymo can achieve. More miles, better architectures, and better simulators push the system toward that ceiling, and are responsible for some of the impressive improvements in self-driving capability in recent years. But truly solving self-driving might still require a hardware upgrade, not just more data or better algorithms.
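The ceiling can be stated explicitly under the assumption that driving performance follows the same parametric form as language-model loss, L(N, D) = E + A/N^α + B/D^β. No driving-specific fit is cited here, so the symbols are placeholders. With N fixed by the onboard hardware, the data term vanishes in the limit and a size-dependent floor remains:

$$\lim_{D \to \infty} L(N, D) = E + \frac{A}{N^{\alpha}}$$

More miles shrink only the B/D^β term. Once that term is small relative to A/N^α, further data buys very little, and the remaining gap can only be closed by increasing N or by changing the constants through better architectures.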

6. Implications

a. On brain in a box in a basement

While Byrnes’ broader brain in a box in a basement thesis rests on several other premises about neuroscience (e.g. cortical uniformity) and algorithmic paradigms that are beyond the scope of this post, the brain’s sample efficiency does not provide good evidence for the existence of a major undiscovered algorithmic regime achieving orders of magnitude gains in both compute and sample efficiency simultaneously.

He may have ideas for much better model-based RL algorithms that would be infohazardous to share publicly, but the factors surveyed in sections 1-5 collectively look adequate to explain the gap. None of these point toward consumer-hardware training budgets and several point the other way.

Existing model-based RL algorithms do demonstrate impressive sample efficiency, but in tightly bounded environments. Scaling that style of algorithm into a system that learns language, physics, social cognition, and tool use within a consumer-hardware training budget is a tall order. The world is large and complex. And while the core of intelligence may be simple, and the code to train an AGI may well fit in a few hundred lines of PyTorch code, the compute budget to train such an AGI appears unlikely to be within at least 3 orders of magnitude of the best consumer GPU today.

One important implication is that compute governance is unlikely to be made irrelevant by the discovery of much more compute-efficient algorithms. This applies most directly to training compute. I’m a lot less confident about inference. A trained model can be much smaller than the system that produced it, and there are strong economic incentives to overtrain in the opposite direction from the brain's regime, yielding small, deployable policies with little spare capacity. So an AGI-level model might plausibly fit on a consumer GPU at inference time, even if its training run did not. Such a deployed policy would still likely be unable to bootstrap itself dramatically further on the same hardware, since the world-model updates, replay, and planning needed for serious open-ended continual learning remain compute-intensive.

That said, Byrnes is right that model-based RL is a missing piece in the current frontier-LLM stack, and it matters most in the regimes where current RLVR pipelines struggle: tasks where rewards aren't cheaply verifiable, where real interaction data is expensive, and where the agent needs to keep learning after deployment. The alignment community should pay more attention to this research direction. The safety properties of a system that plans and learns inside a learned world model are meaningfully different from those of one that doesn't, and if model-based RL architectures end up on the critical path to transformative AI, we want the conceptual groundwork in place well before they are deployed at scale. Reward function design appears to be one important and promising research direction.

b. On sample efficiency being an unsolved research problem

The evidence also does not support Bouchard's opposite conclusion that current deep learning is missing some truly major ingredient, and that AGI is therefore far away. The most dramatic sample-efficiency comparisons are often structurally unfair, pitting pretrained humans against randomly initialized networks, text-only models against children embedded in a rich multimodal stream, or small deployed policies against human brains with vastly larger effective capacity. Additionally, frontier systems have not been under strong economic pressure to maximize sample efficiency. Much of the apparent sample inefficiency of deep learning is a byproduct of how we train, deploy, and compare these systems.

Once these factors are pulled apart, the remaining gap largely comes down to a few missing algorithmic pieces, such as world models, reward models, replay, synthetic data, and continual learning. We have a rough idea of what the remedies look like, and the remaining bottleneck appears to be mostly about scaling and figuring out how to integrate these missing components in open-ended domains, not about finding some wholly new paradigm.

Because these components generally cost compute to integrate and scale, progress is likely to remain uneven. That unevenness may carry some safety upside, such as giving us superhuman coders well before agents with persistent episodic memory. But none of it is a durable barrier to automation or to dangerous capabilities. Deep learning can remain less sample-efficient than humans and still be extremely disruptive.

Even if the de facto data efficiency stays 100x below humans indefinitely, this does not prevent rapid job automation. Once a model is trained, the marginal cost of a second instance is just the compute to run it, which can easily sit well below the salary of the human it replaces. Human data efficiency only preserves a defensible economic niche where the cost of automating the work, including data collection, training, and inference, exceeds the wage bill being displaced, and the task requires adapting to new situations faster than models can be trained to do the work from scratch, or develop rich enough representations to learn to perform the work in-context from more limited samples.

Conclusion

Most of the apparent sample-efficiency gap dissolves under scrutiny. It comes from pitting pretrained humans against from-scratch networks, text-only models against children embedded in a multimodal stream, and small deployed policies against much larger brains. What remains can mostly be closed through familiar means like reward functions, world models, replay, synthetic data, multimodal training, larger models, and better continual learning. These typically pay for sample efficiency with compute. So the gap gives us neither reason to expect AGI trained on consumer hardware, nor reason to think deep learning is missing some major ingredient.


Appendix: How Much Multimodal Data Does a Child Actually Receive?

One version of this estimate comes from Yann LeCun's 2024 Harvard slides: 2 million optic nerve fibers × 10 bytes/sec × 16,000 wake hours over 4 years ≈ 1.15 PB of visual data, against 10¹³ tokens × 2 bytes ≈ 20 TB of text, or a ~57× advantage to a 4-year-old. Extended to age 12 with the same assumptions, the visual figure scales to ~3.45 PB and the gap widens to ~170×. Three of the estimates in this calculation deserve scrutiny.

Per-fiber bandwidth. Koch et al. (2006) (h/t Byrnes) measure information rates of guinea-pig retinal ganglion cells under naturalistic stimuli and, assuming roughly independent fibers, estimate the human retina's aggregate output at ~10 Mbit/s from ~1M ganglion cells, i.e. ~10 bits/s/fiber on average, not 10 bytes/s.

Fiber count and binocular redundancy. Pawar et al. (2024) put the human optic nerve at ~1M axons per eye, so 2 × 10⁶ is the bilateral total. But the two eyes carry heavily overlapping fields, so I will use a 1.2× unique-information multiplier across both eyes, not 2×.

Frontier corpus size. As of 2026, the training corpora of leading open models run to 1.5–4 × 10¹³ tokens (Llama 3.1: 15T; DeepSeek-V3: 14.8T; Qwen 3: 36T; Llama 4 Scout: 40T), rather than LeCun's 10¹³.


For the 12-year extrapolation below, I round up to 2 × 10⁸ waking seconds, rather than the 1.73 × 10⁸ seconds obtained by naively tripling LeCun’s 4-year wake-time assumption, to account for children being awake more hours per day as they get older.

Visual input over 12 years

| | LeCun's 2024 figure | Corrected |
|---|---|---|
| Optic nerve fibers (per eye) | 1 × 10⁶ | ~1 × 10⁶ |
| Bandwidth per fiber | 80 bits/s | ~10 bits/s |
| Binocular adjustment | 2× | 1.2× |
| Effective info rate | 1.6 × 10⁸ bits/s | ~1.2 × 10⁷ bits/s |
| Waking seconds, ages 0–12 | ~2 × 10⁸ | ~2 × 10⁸ |
| Total over 12 years | ~3.2 × 10¹⁶ bits (~4 PB) | ~2.4 × 10¹⁵ bits (~300 TB) |

Frontier text corpus, 2026

| | LeCun's 2024 figure | 2026 open frontier |
|---|---|---|
| Tokens | 10¹³ | 4 × 10¹³ |
| Bits/token | 16 (storage) | 16 storage / ~4 entropy* |
| Total, raw storage | 1.6 × 10¹⁴ bits (~20 TB) | 6.4 × 10¹⁴ bits (~80 TB) |
| Total, info content | ~8 × 10¹³ bits | ~1.6 × 10¹⁴ bits |


Visual / text ratios at age 12

| | Visual / text |
|---|---|
| LeCun's numbers throughout | ~170× |
| LeCun visual + 2026 frontier text | ~43× |
| Koch-grounded retina + 2026 raw storage | ~4× |
| Koch-grounded retina + 2026 info content | ~15× |

*Assuming a tokenizer with a 65k vocabulary, 1 token ≈ 4 characters, and 1 bit of entropy per character.

Best guess: in information-theoretic terms, raw retinal output to a 12-year-old exceeds a 2026-frontier text corpus by roughly 15×. Adding audio, touch, proprioception, and vestibular streams might add another ~2×. So the multimodal child's lifetime stream sits around one OOM above frontier text corpora in retinal-output informational bits, after the substantial compression already implicit in Koch et al., but before accounting for additional longer-timescale redundancy in visual experience. The informational gap appears to be roughly within one OOM.
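The ratios in this appendix can be re-derived in a few lines. The Python sketch below uses only figures quoted above. One reading is assumed: the ~170× and ~43× rows appear to use the tripled 4-year wake time (about 1.73 × 10⁸ s, giving ~3.45 PB), while the Koch-grounded rows use the rounded 2 × 10⁸ s. Variable and function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600
BITS_PER_BYTE = 8


def visual_bits(fibers_per_eye, bits_per_fiber_s, binocular_factor, waking_s):
    """Total retinal output in bits over a span of waking seconds."""
    return fibers_per_eye * bits_per_fiber_s * binocular_factor * waking_s


# LeCun-style estimate extended to age 12: 10 bytes/s per fiber, both eyes
# counted in full, and the 4-year wake time (16,000 h) simply tripled.
lecun_waking_s = 3 * 16_000 * SECONDS_PER_HOUR  # about 1.73e8 s
lecun_visual = visual_bits(1e6, 10 * BITS_PER_BYTE, 2.0, lecun_waking_s)

# Koch-grounded estimate: ~10 bits/s per fiber, 1.2x binocular factor, 2e8 s.
koch_visual = visual_bits(1e6, 10, 1.2, 2e8)

# Text corpora, in bits.
lecun_text_raw = 1e13 * 16      # 10^13 tokens at 16 bits of storage each
frontier_text_raw = 4e13 * 16   # 2026 open frontier, raw storage
frontier_text_info = 4e13 * 4   # same corpus at ~4 bits of entropy per token

ratios = {
    "LeCun's numbers throughout": lecun_visual / lecun_text_raw,
    "LeCun visual + 2026 frontier text": lecun_visual / frontier_text_raw,
    "Koch-grounded retina + 2026 raw storage": koch_visual / frontier_text_raw,
    "Koch-grounded retina + 2026 info content": koch_visual / frontier_text_info,
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```

The four ratios come out close to the rounded values in the table. Adding the ~2× allowance for the other senses on top of the last row is left out here, since that factor is a guess and not a measured quantity.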

  1. ^

    AlphaStar’s paper (no paywall) reports supervised pretraining on 971,000 human games. Since it describes StarCraft II games as lasting roughly 10 minutes, this corresponds to 971,000×10 minutes ≈18.5 replay-years of human games.

    It does not directly report self-play game-years. However, its Methods state that the league used 12 actor-learner setups, trained over 44 days, each learner processing about 50,000 agent steps/s, with received data replayed twice; interpreting this as about 25,000 newly generated agent steps/s per setup, and using Extended Data Fig. 2’s average agent-step interval of about 369 ms, gives 12 × 25,000 × 0.369 × 44 × 86,400 ≈ 4.2 × 10¹¹ seconds of generated agent experience, or about 13,300 game-years.

  2. ^

    “Instead, my guess (based largely on lots of opinions about exactly what computations the human brain is doing and how) is that human-level human-speed AGI will require not a datacenter, but rather something like one consumer gaming GPU—and not just for inference, but even for training from scratch.” (Byrnes 2025)

  3. ^

    Gilkerson et al. (2017) estimate roughly a few million adult words/year for 2–48-month-olds; a naive extrapolation to age 12 gives an order of magnitude of tens of millions of words; the BabyLM challenge rounds to 100M by age 13.

  4. ^

    A common hypothesis is that humans are highly sample efficient because they receive curated curricula. I doubt this is an important factor. Most of what a child learns is picked up with nothing resembling a curriculum. Adult language learners also reach fluency faster via immersion than classroom instruction. And in the BabyLM Challenge, strategies relying on curriculum learning showed little improvement.

  5. ^

    Given the text-dominated training mix and objective of next-token prediction, even 2022-era LMs are already better than humans at it.

  6. ^

    von Oswald et al. (2023) hypothesized that self-attention transforms activations in a way approximately equivalent to gradient descent on an implicit loss, though this specific mechanistic claim is contested. End-to-End Test-Time Training (Tandon et al. 2025), though, has shown that with a pre-training method leveraging gradient-of-gradients, a similar effect to self-attention in-context learning can also be achieved directly via test-time gradient descent.

  7. ^

    See Hippocampal replay.

  8. ^

    See also LeCun, A Path Towards Autonomous Machine Intelligence (2022), for a case along similar lines.

  9. ^

    See here for a LW explainer of the original EfficientZero algorithm.

  10. ^

    EZ-V2 beats the human score on 15 of 26 games (58%). On games where model-based RL pulls ahead it often pulls far ahead — Asterix 62k vs 8.5k, Crazy Climber 112k vs 36k, Demon Attack 23k vs 2k  — and on games where it struggles it struggles catastrophically: Private Eye 100 vs ~70,000, Seaquest 2k vs 42k, Alien 1.5k vs 7.1k. The normalized mean is dragged up by the blowouts. On 9 of the 26 games (~35%) — Alien, Amidar, BattleZone, Freeway, Frostbite, Hero, Ms. Pac-Man, Private Eye, and Seaquest — the human baseline still beats every deep RL algorithm in the table. These are, fairly consistently, games requiring long-horizon credit assignment under sparse reward (Private Eye, Seaquest, Frostbite), systematic exploration of large maze-like state spaces (Alien, Amidar, Ms. Pac-Man, Hero), or patient timing against a slow environment (Freeway). The comfortable wins for deep RL are mostly reactive arcade games with dense reward signals.

  11. ^

    LLMs do learn an implicit world model during pre-training, but it isn't structured for the kind of use model-based RL makes of one. It can't be cleanly queried during training to generate imagined rollouts, and post-training further entangles its representations with those of the policy and persona. Chain-of-thought training may be a partial workaround during inference. By generating intermediate text, the model effectively queries its own implicit world model, recovering some of the benefits of model-based planning.

    Architectures in which the world model is kept separate from the policy and not updated by RL gradients, but instead trained only by self-supervised learning, might have better safety properties.

  12. ^

    The main domain to which “narrow world model” simulations have been applied is simulating user interactions. Public examples include Google/DeepMind’s AMIE medical self-play, Moonshot’s Kimi K2 agentic data pipeline with synthetic user personas and tool-use trajectories, and Salesforce’s APIGen-MT.

  13. ^

    See also Byrnes’ The nature of LLM algorithmic progress (v2).

  14. ^

    Candidate biologically plausible credit-assignment schemes include predictive coding (Whittington & Bogacz, 2017; see also Millidge et al. 2020, 2021, 2022 and Millidge 2023 for an informal retrospective); equilibrium propagation (Scellier & Bengio, 2017); and target propagation (Bengio, 2014, Lee et al., 2015). These relax some of backprop's biological implausibilities, but usually pay with settling dynamics, extra inverse machinery, or weaker scaling. The literature is therefore better read as evidence for possible biological credit assignment than for a GPU-superior optimizer.

  15. ^

    Recent examples of newer, more sample-efficient optimizers include Muon (Jordan, 2024), which orthogonalizes gradient updates via a Newton-Schulz iteration, and M3 (Behrouz et al., 2025), the optimizer described in the Nested Learning paper I discussed in my continual learning post, which builds on Muon with multi-scale momentum and Adam-like normalization, improving effective sample efficiency at the cost of more memory and compute per step.

  16. ^

    There’s some uncertainty about the true number in the literature, so I am relying on the Wikipedia consensus. The number is probably higher during infancy.

  17. ^

    See here for some additional parameter count estimates of frontier models, which make me think that the 10T estimate for Mythos is quite plausible.

  18. ^

    The parameter-to-synapse equivalence in terms of useful computation is highly uncertain. The synapse count might underestimate how many parameter equivalents the brain really has. Beniaguev et al. (2021) found that a detailed biophysical model of a cortical pyramidal neuron was well approximated by a temporally convolutional deep neural network with 5 to 8 layers, suggesting that treating each biological neuron as a simple artificial neuron might miss real computation happening within dendritic trees. That said, this result has important caveats (see this EAF discussion for more). The study did not run the reverse comparison, so we do not know whether comparable overhead applies going from artificial to biological. The per-neuron overhead might also not scale linearly for larger networks. Furthermore, much of the simulated complexity may not be functionally useful. As Byrnes argues (1, 2), biological systems are typically full of dynamics that are not load-bearing for the system's useful function, much like how a real transistor is described by a 22-parameter physics model, even though its useful computational function is just an ON/OFF switch. I use a 1-to-1 synapse-to-parameter equivalence as a probably conservative baseline, but the real magnitude is unclear.

  19. ^

    This three-OOM figure should be read loosely. CNN-style weight sharing lets the onboard net avoid replicating visual feature detectors across space the way the visual cortex must, shrinking the effective gap. Correcting for it probably has only a small effect on the comparison, since the early visual cortex is only a few percent of the brain's synapses.




We Need Breadth-First AI Safety Plans

LessWrong.com News - June 1, 2026 - 20:36

Cross-posted from my website.

Depth-first plans lay out a path from here to aligned superintelligent AI. We need those kinds of plans. But depth-first plans depend on many assumptions: "We will make AI safe by doing step 1, then step 2, then step 3." Step 1 only works under condition A, step 2 requires condition B, step 3 requires condition C. If A or B or C is false, the whole plan fails (and there's a good chance we all die).

Consider Google's safety plan from April 2025. To my knowledge, this is the best among the frontier AI companies' plans. [1]

Google's plan depends on a series of conditions:

  1. For the most part, the plan does not consider concrete details of how significantly-more-capable AI systems will behave, instead proposing that Google will figure out how to handle those systems once it understands them better. This only works given (at least) two conditions:
    1. AI capability improvements occur at a relatively predictable pace, with no unexpectedly large jumps.
      • The plan explicitly assumes no "discontinuous" improvements, which is roughly the same thing. It's good that they're being explicit about this.
    2. Once stronger capabilities emerge, there will be enough time to figure out mitigations.
  2. The plan entails putting stricter measures in place once AI systems become sufficiently capable. This depends on at least two conditions:
    1. Google (or somebody) can accurately determine what capability level is dangerous.
    2. Google's evals (or third-party evals) can elicit dangerous capabilities if they exist.
  3. The plan requires using AI to bootstrap AI alignment. This depends on several conditions:
    1. We can successfully align the AI that we use for bootstrapping, or misalignment will be easy (enough) to spot, or alignment isn't necessary (e.g. because humans can use amplified oversight to monitor smarter-than-human systems).
    2. Future Google can be trusted to use enough of its compute to differentially accelerate alignment research, rather than doing something more profitable (for example, differentially accelerating AI R&D).
    3. AI that's useful enough to solve AI alignment does not pose an existential threat.
    4. AI alignment is the sort of thing that can, in principle, be solved by strong-but-not-superintelligent AI.
      • For example, it may be that moral advances are required before we know how to correctly specify how AI ought to behave; and that unaligned AIs cannot contribute to moral advances. [2]

(The plan depends on many more conditions than that, but I'll keep it short.)

That list included eight conditions. If any one of those conditions fails, then the whole plan fails. Some of the conditions seem likely to be true; others seem questionable. But even if every individual condition is probably true, it's much less likely that they're all true.

Disjunctive conditions are better than conjunctive ones. We can see an example in condition 3.1 above: Google's plan can work if it's possible to align the "bootstrapper" AI, OR if misalignment is easy to spot, OR if it doesn't need to be aligned. Disjunctive conditions are good; more of those, please.

We need breadth-first plans:

  • We will take actions X, Y, and Z.
  • X depends on condition A.
  • Y works even if A is false, but it depends on condition B.
  • Z works if A and B are false; it depends on a third condition C.

X + Y + Z works even if two out of three conditions fail.
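To make the contrast concrete, here is a minimal Python sketch comparing a plan that needs every condition to hold with one that needs only one. The probabilities (0.9 and 0.6) are made-up illustrative values, and the conditions are assumed to be independent, which real conditions often are not.

```python
import math


def conjunctive_success(probabilities):
    """Chance that every condition holds (depth-first plan), assuming independence."""
    return math.prod(probabilities)


def disjunctive_success(probabilities):
    """Chance that at least one condition holds (breadth-first plan), assuming independence."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# HYPOTHETICAL numbers: eight conditions that are each 90% likely to hold.
depth_first = conjunctive_success([0.9] * 8)

# HYPOTHETICAL numbers: conditions A, B, C that are each only 60% likely to hold,
# where actions X, Y, Z together succeed if any one of them holds.
breadth_first = disjunctive_success([0.6] * 3)

print(f"depth-first, 8 conditions at 0.9 each: {depth_first:.3f}")
print(f"breadth-first, 3 conditions at 0.6 each: {breadth_first:.3f}")
```

With these toy numbers, eight individually likely conditions jointly hold less than half the time, while three individually shaky conditions give at least one success more than nine times in ten. Correlated conditions would shrink both effects.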

Some plans have a little bit of breadth. An explicit example from Google's safety plan:

Our approach has two lines of defense. First, we aim to use model level mitigations to ensure the model does not pursue misaligned goals. [...] Second, we consider how to mitigate harm even if the model is misaligned (often called “AI control”), through the use of system level mitigations.

I would like to see more breadth, and recursive breadth—there should be breadth within each component of the plan, and breadth within those sub-components.

The broadest plan that's been published is Peter Barnett & Aaron Scher's AI Governance to Avoid Extinction: The Strategic Landscape and Actionable Research Questions (see also the corresponding LessWrong post). The report explicitly considers four possible future scenarios and how we might achieve a good outcome from within each scenario. The report even includes a flowchart:

The report goes into more detail about the conditions required for each of the four scenarios to succeed.

Barnett & Scher believe "Off Switch and Halt" is the best strategy. They don't exactly phrase it this way, but according to their report, "Off Switch and Halt" depends on the fewest conditions and has multiple ways of succeeding.

How breadth-first plans can inform what we do

I see two big benefits to writing breadth-first plans:

  1. We can identify which paths to success depend on the fewest conditions, [3] and focus more on those.
  2. It's easier to find the biggest holes in the plan.

Root-level breadth matters most

The good news is the branches off the roots are the most important because they have the greatest probability mass. Creating layers of branches off branches off branches quickly gets complicated, but I don't think it's necessary.

My rough attempt at categorizing plans

I made a quick flowchart to categorize AI safety plans at a high level.

  • A blue circle indicates an action
  • A blue square indicates an outcome
  • A red hexagon indicates a necessary condition to achieve an outcome
  • A red pentagon indicates a condition that is helpful but not necessary

The idea is that we need a broad set of overlapping plans such that some plan will work, even if many conditions (red nodes) turn out to be false.


Is this flowchart comprehensive? Definitely not. Is it even accurate? Maybe. My point is that, to make AI safe, we need multiple plans that cover all the ways the other plans could go wrong, and this flowchart is a quick attempt at representing some of those plans.

Future work I'd like to see
  1. AI companies should publish breadth-first plans. What will they do if a step in their mainline plan fails?
  2. Governments should pass legislation requiring AI companies to have plans that cover every item on a list of possible future scenarios.
    • For example, mandate that companies have different plans for different takeoff speeds.
    • AI safety researchers should do research to inform what future scenarios need to be covered.

  1. I originally wrote this article shortly after April 2025, but I procrastinated for a year on finishing it, so I'm not sure about the current state of AI companies' plans. ↩︎

  2. I am skeptical that a bootstrapped-aligned AI will behave morally in ways in which most humans do not behave morally, e.g. eating factory-farmed animals; or that it will be able to correctly resolve the internal inconsistencies in common-sense ethics. For example, in the mere addition paradox, most people accept a set of premises but reject the conclusion that necessarily follows from those premises. [4] ↩︎

  3. Technically, what we want isn't paths that depend on few conditions. We want paths where the joint probability of every condition is as high as possible. But generally speaking, fewer conditions means the probability of success is higher. ↩︎

  4. Philosophy Experiments' Philosophical Health Check asks you a series of questions and purports to identify inconsistencies in your beliefs. I think the questions leave some wiggle room to argue that supposed inconsistencies aren't truly inconsistent, but a more rigorous test would be harder to construct. ↩︎




The remarkable story of AIGS Canada

LessWrong.com News - June 1, 2026 - 17:07

TLDR: Four years ago we put out a short post on LW announcing that an AI governance and safety community had formed in Canada. This is the remarkable story of what happened next. Information on how to join or support us is shared at the end. I will also be at Less Online.


Imagine humanity in a few years, and the development of advanced AI has gone well. We navigated the risks of catastrophic loss of control and kept the most powerful tools out of the hands of bad actors. The benefits and power it created were sufficiently shared. We ended up in a world the vast majority of human beings want to be in.

We’re shaking our heads and smiling in relief and disbelief that we all made it through, and agreeing “...a lot of things had to go well for this to happen, people and institutions around the world had to step up, and also - thank God for f*#king Canada...”.

In a world that will need all the help it can get to navigate AI, every country should set their sights that high.

And so we ask: What would Canada’s contribution have been? Did Prime Minister Carney leverage his Davos speech leadership into effective global coordination on AI? Perhaps it was seed funding for critical AI safety research that unlocked key technical solutions. Or we piloted the first full-scale national conversation on ASI, gaining key insights from the broader public and shaping a global narrative as to what success on AI even looks like.

At a minimum, Canada would need to be situationally aware vis-à-vis superintelligence and making smart decisions.

But for the last few years, the main decision makers in the country have not been giving any indication of this kind of awareness, either in words or deeds. Despite growing numbers of parliamentarians and officials who have been briefed on superintelligence and expressed sincere concern, it has yet to become a political priority in Ottawa.

Enter AI Governance and Safety Canada (AIGS), a nonpartisan not-for-profit launched in 2022 with the question “What can we do in Canada, and from Canada, to ensure positive AI outcomes?” and a talented and determined team of concerned citizens.


The results


To a large extent, our story can be told through what we accomplished. Three years in, our answer to that founding question has included:

  • “A Plan for Canada” policy white papers: Widely respected for their quality, they succinctly clarified what exactly Canada can do (and do well) to positively influence AI outcomes. Notably, the government adopted the top recommendation from each of our 2023 and 2024 white papers (few other orgs were calling for these actions)
  • Dozens of meetings with parliamentarians and government officials: The 2025 white paper messaging in particular had many MPs asking how they can help, and inviting me to testify at committee
  • Seven expert testimonies before the House of Commons and Senate, in English and French, one of which went viral (2.3M views / 119k likes on IG)
  • Comprehensive recommendations on the AI & Data Act: more than any other organisation submitted, translating ASI risk into practical wording for the Bill
  • Media coverage in most major outlets: CBC The National, The Canadian Press, Radio-Canada, CTV News, Op-Eds in the Toronto Star, and more
  • Connecting 1,500 Canadians across the country online and through events, and attracting over 700 volunteer sign-ups.
  • And many other initiatives along the way


In doing this, we’ve mirrored some of the work of organisations in other jurisdictions such as the EU’s The Future Society and the UK’s Centre for Long Term Resilience. More recently, we’ve been joined in Ottawa by other organisations (such as CIGI and Control AI) doing education and advocacy on AI’s catastrophic risks. And of course, leading scientists Yoshua Bengio and Geoffrey Hinton continue to engage governments and talk to the media.

So what makes us remarkable is not so much our notable accomplishments, or even that we were the first civil society group to do this work in Canada and still the leading one… it’s that we accomplished anything at all.


And that’s the story I want to share now.


Inauspicious beginnings


So at the start we confidently set off with all these accomplishments in mind, got a big grant, hired a team, and set to work, right?

Not quite.

In the Summer of 2022, there were just a couple local meetups and a few dozen people in Canada who had happened to come across the concerns around AGI development and were interested in doing something about it.

I’d pitched EA’s LTFF on a one-year grant to “connect, expand and enable the AGI safety community in Canada”, and got to work finding and connecting people. That’s when I met Mario Gibney, the founder of the Toronto AI Safety Meetup (which has since flourished into the Trajectory Labs co-working and event space) who became my co-founder. Evan Murphy (Vancouver AI Safety researcher) and Briana Brownell (Saskatoon AI startup founder) joined us to form an interim Board of Directors.

That Summer and Fall we created our online Slack community to bring the disparate local groups together, chose a name and website, attended as many AI conferences as possible to get a lay of the land, and prepared to incorporate as a not-for-profit.


A Spring and Fall


Only there was a challenge: how were we going to fund AIGS? It was all fine for me as a community organiser to get a grant, but when we started thinking about what AIGS most needed to accomplish - to directly influence the Canadian government - we realised that traditional EA/OpenPhil AI Safety grants weren’t an option. The then-recent FTX scandal aside, they were American (i.e. foreign funding, a credibility risk) and in any case, as charities, couldn’t fund our core political activity.

So we put on a brave face and did an initial fundraiser among 50 community members, with modest success, and AIGS was incorporated April 4, 2023. Our first move was media advocacy to seize on the Pause AI letter coverage and establish our public presence. Mario and I brought on operations and communications contractors to amplify our efforts. Our Toronto Star Op-Ed was a highlight:


It almost worked. That July and August saw tantalising funding opportunities during conversations with two large donors we’d attracted, but despite our efforts it didn’t convert into money in the bank. We ended the contracts, Mario had to step back, and AIGS was left as a volunteer organisation - just a Board of Directors and me as the only unpaid staff.


The grind begins


We could easily have disbanded at that time. My grant had expired in April and I had limited savings.

But Canada still needed an organisation like AIGS, and we weren’t about to stop working on the most important issue of the 21st century just because the money ran out. We also knew that as AI’s impacts grew, our potential as an organisation would too.

So I took on some personal debt and we got back to work. First, we knew we needed to clarify what exactly our calls to action for Canada were. From that came our first white paper Governing AI: A Plan for Canada, which put us on the policy map.

Second, parliament had recently introduced the AI & Data Act. We went all in - spent weeks clarifying what Canada needed an AI Bill for, and carefully translating the concerns around ASI loss of control into specific recommendations for the Bill.


Our first big break - invitation to testify at the AI & Data Act committee hearing


During that time we also expanded our Board of Directors with a range of professionals (including Board Chair Gordon Vala-Webb) to help steer AIGS to success, and launched a Board of Advisors with respected experts to consult on key decisions.

Dashed health, dashed hopes


All that effort didn’t save us. While it did establish our credibility and helped raise modest donations over Christmas, we were still a volunteer organisation, and my runway was now even shorter.

To make matters worse, a week after capping our AI & Data Act testimony, I fractured my femur in an accident. It wouldn’t be the only health issue to significantly slow me (and by extension AIGS) down - I have a chronic condition that among other things can cause severe fatigue and brain fog, and make looking at a screen quite painful. The symptoms are, of course, worsened by the stress of repeatedly having to focus on catastrophic risks from AI and the loneliness of being the only staff.

Seeing that our direct fundraising outreach wasn’t sufficient, we pivoted to launching a project that Canada needed that might also gain corporate or union sponsorship: a National Conversation on AI event series pilot. The goal was to meaningfully engage Canadians in a two-way conversation about where AI is headed and what kinds of futures people want.

It was (and remains) a worthy initiative with interest from a number of universities and civil society organisations, even getting an endorsement from Yoshua Bengio. But five months of work later, it failed to gain any major financial sponsors, and so it was put on the shelf for another day.

That failure meant that by the Summer my runway was now gone and I soon wouldn’t be able to continue as full-time executive director.

A second chance


And then, lo and behold, some money trickled in. At the last moment, a donor stepped up just enough to keep me on full-time and AIGS moving forward.

Moreover, 2024 was the year that volunteers started to show up in numbers. So much so that we had to set up a dedicated intake and onboarding process.

The best news came when Kathrin Gardhouse - Toronto-based lawyer, PhD, and policy expert - joined and immediately started taking on projects, quickly getting promoted to Policy Lead and then to the Board of Directors.

So in September 2024 we looked around and asked “What does Canada most need now?”. With veteran political expert Fraser Green now on our Board, we realised that while up-to-date policy recommendations would continue to be essential, on their own they were too easy for government to ignore. Polls were showing an overwhelming likelihood of a Conservative victory in a Fall 2025 election, and neither Poilievre nor Trudeau was the type to act on the arguments alone. Also, with the acceleration in AI, it seemed very plausible that 2025 might be the last federal election before superintelligence was developed.

So we pivoted in the final months of 2024 to launch the public-facing Coalition for Responsible AI. The idea was to plant a flag so that everyone in Canada who cared about these issues could find us and support the cause. It was primarily a communications campaign - engage Canadians, get ourselves in the news, attract donors, and make AI an election issue politicians had to address.

We launched in January, with synchronised events in 4 cities:

Supporters gather in Ottawa for the Coalition launch event


Politics happens, and also we fall short


In 2025, the first thing to happen was Trudeau resigning and Mark Carney taking over. He immediately called an early election, shrinking our time to prepare from 10 months to 3 months. Meanwhile Trump got inaugurated and began soaking up all media attention, making Canada’s election about who could best stand up to him. AI (and even major political items like cost of living) got drowned out.

We were also relying heavily on a new communications vendor to get us in the news, but they underperformed (especially in English media). And I was still the only full-time staff to keep the organisation running, meaning that I was stretched too thin and also underperformed.

The Coalition failed at its goal of getting public attention.

And when it rains, it pours: in mid 2025 a series of key grants we’d applied for got rejected (in large part because we could only apply for the portion of our work that was apolitical), I was running on empty again, and this time our overstretched donors weren’t able to fill the gap.

The writing was on the wall: the lights were about to go out on AIGS.


Stepping into the abyss


What were we to do next?

One of the things we noticed about Mark Carney is that he had significant experience managing global crises, and his book Values suggested he was a man who cared more about the arguments than public sentiment. Whereas Poilievre and Trudeau would have required a big public advocacy campaign to act, for Carney, a well-crafted plan delivered via trusted advisors seemed like the better approach. We also knew that regardless, we needed to update our white paper for 2025, and it would be the best thing to try fundraising for.

Money was all but gone, but we made a decision:

If we were to go under, we’d do so delivering one final piece of impact.

Could we hold on long enough to deliver?

Having spent the Summer drafting our 2025 white paper while battling a major health flare-up, discussing bankruptcy contingencies with Board Chair Gordon Vala-Webb, and preparing one last fundraising email, the situation came to a head on Sept 1st, 2025.

It’s a day I still distinctly remember.

AIGS’s bank account had nothing left in it, and we owed four thousand to the vendor. I checked my personal accounts - I also had nothing left in the bank, and all three of my credit cards were maxed out.

I emailed my landlord to let her know I was going to be late paying my rent.

The next day we raised $20k.

That last fundraising email, and the pitch around the white paper, had worked.

We then raised another $20k in the following weeks to finish the year at $80k in revenues, which was double our 2024 revenues. This year, thanks to some incredible donors, we’re already at $150k earned or pledged.


A stellar year


The Fall and Winter of 2025-2026 turned into our biggest success by far. The white paper was serendipitously ready to be published right when the new Minister of AI called a snap 30-day public consultation on the new national AI strategy.

The new revenue also allowed us to temporarily bring on a part-time outreach coordinator, who made sure the hundreds of emails and follow ups got to the relevant MPs. That turned into dozens of meetings, which turned into six invitations to testify at committee hearings. The video from one of them then went viral on social media, our biggest visibility yet.

Meanwhile, communication expert Dalia Ezzat volunteered to shape the next chapter of AIGS communications, Shivangi Pandey stepped forward to relaunch the Coalition for Responsible AI later this year with a new vision, and Christopher Tiller, our volunteer Volunteer Manager, started putting in long hours to help keep the community glued together.

We’re alive, and stronger than ever.


700 volunteers and 1 underpaid staff

But a yearly budget of $150k CAD, as immensely relieving as it is compared to previous years, is still not enough to run an organisation on. It means we now have a minimum of stability, but also that we still can’t afford to hire a team.

And that’s been our bottleneck. As stressful as working on catastrophic risks, battling health issues, and surviving existential financial crises has been for the last 3 years, the greatest challenge has been not having any full-time staff to work with me.

Our core volunteer team continues to pitch in remarkable amounts of work - crowdsourcing relevant news, supporting local events, building our tech stack, developing our Canadian AI policy course, and shaping our communications strategy. And the growing number of sign-ups is a huge source of potential for AIGS.

But even the best only have a few hours per week, or are between jobs and have to step back as soon as they regain employment, meaning work has to be shared across multiple volunteers and there is naturally high turnover. Moreover, we’re a remote team spread out over 5 time zones and 2 languages, which makes maintaining team energy, cohesion, and momentum exceptionally hard in casual or part-time work setups.


Back to imagining


Now imagine if instead we had a core team of talented communications, operations, and advocacy leaders working full-time together to shape our strategy and harness our rapidly growing volunteer base?

If AIGS were able to poach some of the top talent currently working for corporate interests, and put their skills to ensuring humanity safely navigates AI?

And if Canada actually took those initiatives that a successful post-AI world will be shaking its head about in disbelief and gratitude?

If you’d like to see that happen, you can help.


How to help:

  • Liked our story and want to see us succeed?
    • Give this post an upvote
    • Share it to your preferred platform
    • Email it to a potential Canadian donor who cares about AI going well (or to someone who might know someone who might know someone)
  • Canadian citizen or resident?
    • Join us as a donor:
      • Small donations expand our donor base and help us show broad support. Cherry on the cake? Make it recurring. Donate here.
      • Large donations take AIGS to the next level of impact. For more information and to meet the leadership, email contact@aigs.ca.
    • Join us as a volunteer. Show us how good you are so we can hire you when funding comes through.
    • Join our community online or at events, and help us build momentum in Canada. All are welcome.


Thanks for reading this post and hearing our story. And wherever you are in the world, know that while AI is putting us all under great strain, the human spirit and determination to succeed remain alive and well.


Yours truly,

Wyatt and the AIGS team and community.


*Note for Bay Area readers: I will be at Less Online (giving a 'Dispatch from Ottawa' talk Sunday morning) and in town a little bit after. If you'd like to connect, please reach out or DM me.




Superintelligence of the gaps

LessWrong.com News - 1 June, 2026 - 16:00

Many classic AI doom scenarios rely on superintelligence using its vastly superior intelligence to outplan, outcompete and outkill you.

I partly believe this: superintelligence would definitely outkill me.

But I don't believe we will build such superintelligence; not because humans are the apex of intelligence, but because superintelligence, implicitly, has always been about a gap: the gap between the current best intelligence and the newly created one.

We're not in the world where AIs are being created with large gaps of intelligence between each other. Rather, we are in an iterative intelligence development and deployment world. It is technically easy to not have large gaps of capabilities between the current best model and the next, it is ~easy (if costly) to evaluate at regular checkpoints, and ~continuous deployment allows there to be no large gap in deployment either.

We can thus steer away from a large number of doom scenarios (those where a new AI uses its greater capabilities to take over) by simply not creating and deploying models much smarter than the previous thing. The current most intelligent and aligned beings should always be supervising their successor, using more total resources at first, such that they can't effectively be tricked or subverted.

I guess the above is something many "AI optimists" have in mind and I don't think the technical ease of avoiding large capabilities gaps should be much of a crux. Whether in practice we'll be avoiding these gaps seems the more interesting crux for "fast misaligned AI takeover" scenario discussion. This is correctly done in @Daniel Kokotajlo et al's AI 2027: the bad ending is caused by pressures to premature deployment leading to using a suspected misaligned system, not by technical impossibility of knowing it's misaligned. It is also what makes that particular scenario unlikely to happen. The leading companies would be more careful than that if they had that level of evidence of misalignment in powerful systems. (I don't think evidence of recklessness with regards to weak systems is strong evidence of recklessness towards strong systems, though corporate and national governance should be set up to have the mere possibility of not being reckless when the time comes.) It's looking like we're in world C or D of @ryan_greenblatt's plans for misalignment risk (~we don't get a pause, but the leading companies are somewhat careful) and that this is technically sufficient to avoid most fast misalignment doom scenarios.

Most of my p(doom) is thus not on the chance of misaligned AI takeover, but on gradual disempowerment risks.

I don't think we have good solutions here, but at least we have more time to look for them.




Lean, not backpressure

LessWrong.com News - 1 June, 2026 - 10:57

Lucas Costa has written a good article on how to build systems that can handle code-generating robots. Unfortunately, when calling it backpressure, he used the wrong metaphor.[1] Backpressure is about signaling to upstream processes that they are running too fast and need to slow down. Note that the suggestions presented by Costa are mostly about signaling to the upstream process that it needs to do things differently, rather than just slow down. This has more to do with ensuring that sufficient quality, rather than quantity, is sent downstream.

This irked me. As I was reading, I was searching for the right analogy. I kept coming back to lean manufacturing. The more famous half of the lean philosophy is waste reduction. The other half is about managing the unstable input of people. That’s what we’re interested in here.

A common approach to the input of people – especially in lower-skilled jobs – is to make line workers responsible for everything. We ask them to be hypervigilant, tell them to never make mistakes, and let them know that if they don’t always perform at their best, they will be chastised … or fired.

Lean, as it is described[2], is much more respectful of line workers and the conditions they are performing their work in. A process designed in the lean philosophy tolerates workers that don’t always perform at their best.[3] It’s about setting up processes and structures that have positive optionality on people’s creativity, without undue requirement on their level of responsibility.

This can take many shapes, but the Costa article reminds me of three concrete practices:

  • Single-piece flow means working on one thing at a time, so downstream processes have a chance to reject before too much of the wrong thing is produced.
  • Autonomation (or jidoka) means giving a machine the ability to detect when something is wrong and not continue at that point.
  • Poka-yoke is a process that forces results to be conformant by construction.
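To make the three practices above concrete for a code-generating robot, here is a minimal sketch in Python. It is not taken from the Costa article: the `Patch` type, the `check` function, and the "must compile" rule are all hypothetical stand-ins for whatever a real pipeline would enforce.

```python
from dataclasses import dataclass


class StopTheLine(Exception):
    """Raised when a check fails, so no further work is produced (jidoka)."""


@dataclass(frozen=True)
class Patch:
    """A generated change. Malformed input cannot become a Patch (poka-yoke)."""
    path: str
    body: str

    def __post_init__(self):
        # Hypothetical conformance rules, enforced at construction time.
        if not self.path.endswith(".py"):
            raise ValueError(f"unsupported file type: {self.path}")
        if not self.body.strip():
            raise ValueError("empty patch body")


def check(patch: Patch) -> None:
    """Hypothetical downstream check: the body must at least compile."""
    try:
        compile(patch.body, patch.path, "exec")
    except SyntaxError as err:
        raise StopTheLine(f"{patch.path}: {err.msg}") from err


def run_line(raw_patches):
    """Single-piece flow: build and check one patch before touching the next."""
    accepted = []
    for path, body in raw_patches:
        patch = Patch(path, body)  # poka-yoke: invalid patches cannot exist
        check(patch)               # jidoka: halt at the first defect
        accepted.append(patch)
    return accepted
```

Nothing here blames the generator for a bad patch. The process simply refuses to carry a defect downstream, which is the point of the analogy.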

You probably recognise these things as good, but a surprising number of managers seem to think they can just chastise people until quality improves. They talk themselves into this because they believe line workers are fully responsible for their actions.[4]

But even those managers will find it very hard to convince themselves quality improves when they scream at the code-generating robot. It’s a robot! It can’t be responsible for its actions. We have to adopt the lean philosophy for building systems around robots. When something goes wrong, we have to blame the process, not the robot.

We always had to do that, even with people, but with robots it’s painfully obvious.

  1. ^

    Which, to his credit, he seems aware of. It’s just that he’s spent so much time using the wrong metaphor that it’d be silly to switch now.

  2. ^

    It may be different in practice; I’ve read some conflicting accounts.

  3. ^

    One of my favourite things to tell myself when I’ve messed up system safety is “If I designed a process that assumed people would never make mistakes, then whose fault is it really?”

  4. ^

    They aren’t. As Deming said, a bad system will beat a good person every time.




My reactions to “I underestimated AI capabilities (again)”

LessWrong.com News - 1 June, 2026 - 04:00

An application response I wrote! Please feel free to leave any feedback!


Describe a recent paper or blog post that has influenced your perspective on AI safety.

“I underestimated AI capabilities (again)” (https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities) came out at the beginning of March. In one sentence, author Ajeya Cotra made capabilities predictions in January 2026, and they were outpaced within 2 months. Specifically, in January, Claude Opus 4.5’s 50% task horizon was ~5 hours. Continuing with the historical doubling trend, Cotra predicted that by December, it’d reach ~24 hours (rounded up); but just six weeks later, Opus 4.6’s was already estimated at ~12 hours. The benchmark underlying the metric is already nearing saturation, when the metric was explicitly designed to avoid this; uncertainty exploded to between 5-66 hours.
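The gap between forecast and outcome is easier to see as implied doubling times. The sketch below uses only the figures quoted above (5 hours in January, a forecast of about 24 hours by December, about 12 hours six weeks later). Treating January to December as 11 months and six weeks as 1.5 months are my own simplifying assumptions, and `doubling_time` is a helper name introduced for illustration.

```python
import math


def doubling_time(start_hours, end_hours, elapsed_months):
    """Months per doubling implied by growth from start_hours to end_hours."""
    doublings = math.log2(end_hours / start_hours)
    return elapsed_months / doublings


# Forecast: ~5 h in January growing to ~24 h by December (assumed 11 months).
predicted = doubling_time(5, 24, 11)

# Observed: ~5 h growing to ~12 h about six weeks later (assumed 1.5 months).
observed = doubling_time(5, 12, 1.5)

print(f"forecast doubling time: {predicted:.1f} months")
print(f"observed doubling time: {observed:.1f} months")
```

Under those assumptions the forecast corresponds to a doubling roughly every five months, while the observed jump corresponds to a doubling in a little over one month.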

Cotra then conjectures that once time horizons exceed, say, 80 hours, the metric may lose its meaning altogether, as large software projects actually benefit from decomposition and parallelisation. Thus agents will be able to coordinate to tackle arbitrarily large tasks. The time for a single human to do something is no longer a viable metric; at the very least it must now be the time it’d take for a human team.

In this sense, the benchmark fails to discern meaningfully between models at the frontier because the frontier — the end of the ruler — has been reached. There seemingly aren’t any hundreds-of-hours long tasks that, for humans in real life, wouldn’t be decomposed into teamwork anyway. This has influenced me to believe that the basic science of evaluations and risk assessment is extremely important, as our ontologies going forward may need to be refactored or even reconstructed ground-up. Cotra’s January prediction was pretty reasonable; it all but shows that we don’t have a stable, methodically-derived base rate to extrapolate trends from. And even if we did, capabilities advanced so quickly that we now need to measure something different anyway (agent coordination being a categorically different framework). I question to what extent human-comparability will remain useful as a metric at all.

This one blog post hasn’t made me doomerist, but given again the possibility for emergent, non-domain-slash-task-specific capabilities as purported by the Platonic Representation Hypothesis, assessment frameworks going forward will definitely need profound and thorough design methodology. I recall my first EAG, where Toby Ord emphasised neither long, nor short, but broad timelines — capturing robust, instrumentally useful action items when uncertainty is high. Adapting our first principles in this fashion throughout the knowledge pipeline, from empirical experimentation to expert recommendation to institutional design, may be necessary to build truly accurate predictive world models.




Lizardmen are Not Constant - An Introductory Primer to Thinking about Survey Data

LessWrong.com News - 1 June, 2026 - 03:28

The quality of a survey is best judged not by its size, scope, or prominence, but by how much attention is given to dealing with the many important problems that can arise.

-Fritz Scheuren, "What is a Survey?" American Statistical Association, 2004

First a note on scope: this is a brief discussion meant to--hopefully--assist readers in thinking more clearly about how to look at survey data. I will not, however, enumerate all of the issues and considerations that should go into assessing surveys. At the end, I include links to some freely available guides for survey research and best practices, which I would recommend to anyone who has a greater interest in survey data. Such publications are largely aimed at researchers conducting surveys, but the guidelines provide strong reference points for understanding the things that should go into surveys.

I would be remiss not to acknowledge the initial impetus for this 'primer' is comments that seem to apply the 'Lizardman constant'. Scott Alexander's own 2013 essay on the topic looks at examples from public opinion surveys ('polls') and draws an (almost) entirely correct conclusion (emphasis added): "When we’re talking about very unpopular beliefs, polls can only give a weak signal. Any possible source of noise – jokesters, cognitive biases,[1] or deliberate misbehavior – can easily overwhelm the signal. Therefore, polls that rely on detecting very weak signals should be taken with a grain of salt."

There seems, however, to be a problem: the catchy jargon and title "The Lizardman Constant is 4%" seems to be taken by some readers of Scott Alexander (I do not know whether or not he would endorse the view) to mean "badness is in pretty much every survey at nontrivial percentages" as "[a] constant is always present." At a foundational level, this--I fear--is a lazy, unhelpful way of thinking about survey data. It is also quite different from the attitude Scott advocated in his essay: Scott's conclusion is focused on 'polls'[2] looking at "very unpopular beliefs" and on taking results "with a grain of salt", not (as is sometimes done) dismissing results that fall below a 4% threshold as at core unreliable.

At a foundational level, I fear this is simply leading to a lazy, uninformative way to view survey results that is likely to promote biases. If there are just two things I hope you take away, they are these:

  1. There is no hard and fast rule for judging surveys: surveys need to be assessed individually on the basis of their nature and purpose.
  2. Most of the threats surveys are vulnerable to are not constant: different types of surveys are vulnerable to different types of problems.

Lizardmen are Not Constant - Not Even in Polling

Let's address the claim that the lizardman constant is a constant. The problem Scott Alexander's essay addresses is one that in academic literature is more often referred to as "bogus respondents" or "spurious response bias", which is to say that a survey may have responses that are not genuine and these may bias results. Some surveys and results are very vulnerable to this kind of error; in other cases the risk is negligible.

To illustrate what this looks like, let's imagine in the real world 0.5% of people think the earth is flat. We post a public (and as such non-probabilistic) online poll soliciting responses to the question "Is the Earth round? (Y/N)" and get 1,000 responses, 40, or 4% (95% CI: 3.0-5.0%) say the Earth is flat. Excluding other biases, we might imagine that if we could read the minds of the respondents, we would observe something like this, with the 'bogus' responses highlighted in red and the genuine responses in green.

The first thing to note about non-probabilistic opt-in polls (the kind illustrated here) is that bogus responses are not randomly distributed. Bogus responses are much more likely to be false positives than false negatives; when given a series of choices, bogus respondents are more likely to pick the first choice; and, interestingly enough, on surveys that include demographic data they also tend to self-identify as Hispanic or Latino.[3] This is important because it means we cannot just subtract out some constant value: positive results are likely to be made up disproportionately of bogus responses.
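A small mixture calculation shows why subtracting a constant does not work. This is a sketch under made-up parameters, not an estimate from any real survey: `bogus_rate` (the share of respondents answering insincerely) and `bogus_yes_share` (the share of those who pick the rare answer) are hypothetical values chosen only to reproduce the flat-Earth illustration.

```python
def observed_share(true_share, bogus_rate, bogus_yes_share):
    """Share of responses endorsing a claim when some respondents are bogus.

    true_share:      endorsement rate among genuine respondents
    bogus_rate:      fraction of all respondents who answer insincerely
    bogus_yes_share: fraction of bogus respondents who endorse the claim
    """
    genuine = (1 - bogus_rate) * true_share
    bogus = bogus_rate * bogus_yes_share
    return genuine + bogus


# Hypothetical parameters: 5% bogus respondents, 70% of whom say "yes".
rare = observed_share(true_share=0.005, bogus_rate=0.05, bogus_yes_share=0.7)
common = observed_share(true_share=0.60, bogus_rate=0.05, bogus_yes_share=0.7)

print(f"rare belief:   true 0.5% -> observed {rare:.1%}")
print(f"common belief: true 60%  -> observed {common:.1%}")
```

With these assumed values the same bogus respondents multiply the apparent size of the rare belief several times over while barely moving the common one, so the distortion depends on the question and the true rate rather than being a fixed amount to subtract.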

Let's say we run the survey again, except this time we take a probabilistic sample and we call, say, 2,000 randomly selected addresses with linked landlines and get 900 responses that might now look something like:

You can see how the bogus respondent problem is reduced but remains substantial: our estimate this time would be 2.1% (with a 95% CI of 1.3%-3.3%). Because the number of bots with landlines registered to addresses is effectively zero, we are no longer getting bogus bot responses. However, some people may still give answers that differ from their actual beliefs: some may be annoyed at having their dinner interrupted by a pollster and give bogus answers just for the hell of it, and others may mishear the question, to give a couple of examples. There are also still certain systematic biases which mean we are unlikely to be able to assume the bogus answers are randomly distributed (e.g., respondents may try to give the answer they think the pollster wants).
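For readers who want to check intervals like the ones quoted for these two hypothetical polls, here is a minimal sketch of the Wilson score interval for a proportion. The choice of the Wilson method is my assumption; the text does not say which method produced its figures, and other methods (normal approximation, exact Clopper-Pearson) give slightly different bounds. The count of 19 "flat" answers is also my inference from 2.1% of 900 responses.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Opt-in poll: 40 of 1,000 responses say the Earth is flat.
low_a, high_a = wilson_interval(40, 1000)

# Landline sample: assumed 19 of 900 responses (about 2.1%).
low_b, high_b = wilson_interval(19, 900)

print(f"opt-in poll:     {low_a:.1%} to {high_a:.1%}")
print(f"landline sample: {low_b:.1%} to {high_b:.1%}")
```

Either way, the interval only describes sampling noise. It says nothing about how many of the counted responses were bogus, which is why a tight interval around 4% can sit far from a true rate of 0.5%.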

Additionally, survey length, what questions are asked, how they are asked and incentive structures and other factors can all influence the rate and characteristics of bogus respondents.

One might still think that even though rates may vary, the bogus respondents themselves are always an issue. This is not true. In practice, probabilistic panel surveys, for example, generally observe very low rates of bogus responses, of approximately 0 (depending on the exact survey methods and coding).[4] In addition, most major panel surveys will also include various controls and cleaning to minimize various forms of bias. Panel surveys may go further and match respondents against externally validated data. Imagine, for example, a study that looks at health consequences for patients receiving care for the flu. It recruits patients across a set of hospitals using diagnosis data, and at regular intervals it calls the patients and has them discuss any health issues with physicians. These are assessed alongside their medical records, which are collected together with a standard demographic panel. What would you expect the rate of bogus respondents to be? I think most people would intuitively agree it is likely near zero: people have a motive to be honest when their health is at stake, and responses are verified against medical records, which would very nearly eliminate bad actors. However, does that mean you can trust the conclusions of probabilistic panel surveys on their face? No! It just means you don't have to worry about 'lizardmen' or 'bogus respondents' at the same rates--there are other concerns which you should have when assessing such a survey.

What It Comes To - Thinking About Data

Not just for survey data, but for any data you are looking at, one should begin by asking: what is the purpose, how was the data collected, and what does it represent?

Looking for the Purpose - Initial Considerations:

For reviewing surveys, the purpose can be understood as two considerations: (1) what was the purpose behind the survey and (2) what are the results purporting to show. The way a survey is subsequently conducted should depend in large part on these: how you conduct a study, and which methods are valid, is highly dependent on what you are trying to study and what you are trying to show.

Some purposes should also make one inherently suspicious of a survey. An obvious example is when there are clear motives that are likely to skew results: blind taste tests run by Pepsi's marketing division purporting to show a preference are likely to have some bias. Just because the survey designer is biased and has a motive to find a particular result doesn't mean, inherently, that the survey results are wrong or even biased, but it does mean one should be especially skeptical of areas of bias that might have weighted the results in the author's favor.

Other purposes should make one inherently suspicious that the results themselves are biased. To risk putting myself in more controversial waters, a study purporting to be "looking to find surprising correlations" in some area should immediately raise suspicions that the reported correlations are the result of "data dredging" or "p-hacking." Without delving into the information side of things,[5] if you take enough data across a broad enough dataset, you should expect to find that some things are correlated despite having no real relationship. We commonly refer to this as 'spurious correlation.'

https://tylervigen.com/spurious/correlation/5917_popularity-of-the-first-name-monica_correlates-with_the-marriage-rate-in-nevada

Additionally, the more variables you are looking at, the greater the chance that some correlations are the result of random chance (this can be mitigated if you are using a probabilistic sample that is sufficiently large).
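The scale of the problem can be sketched with some simple counting. The function names below are introduced for illustration, the 5% significance threshold is an assumed convention, and the last calculation assumes the tests are independent, which pairwise correlations on a shared dataset generally are not, so treat it as a rough guide only.

```python
from math import comb


def pairwise_tests(num_variables):
    """Number of distinct variable pairs that could be tested for correlation."""
    return comb(num_variables, 2)


def expected_false_positives(num_variables, alpha=0.05):
    """Expected count of 'significant' pairs if no real relationships exist."""
    return pairwise_tests(num_variables) * alpha


def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


for k in (5, 20, 100):
    print(
        f"{k} variables: {pairwise_tests(k)} pairs, "
        f"about {expected_false_positives(k):.1f} spurious hits expected"
    )
```

With 20 unrelated variables there are already 190 pairs to compare, so a handful of "significant" correlations is what chance alone would produce.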

A Means to an End - How Purpose Informs Methods and Notes on Instrumentalizing

Generally, methods should be looked at with a mind for what they are trying to show. For example, if a study is trying to support a qualitative examination of some common experiences by people in niche social groups, a non-probabilistic survey like a snowball survey may be perfectly functional. That is, you might take a Facebook group that is part of the subculture you are examining, look at members and friends of members, then friends of friends and so forth to derive a sample that is strongly, deliberately, biased towards the subculture you are trying to study.

However, if a study is trying to estimate the rate of membership in a subculture, using a non-probabilistic sample of this sort would be utterly inappropriate, as it would be certain to disproportionately elicit responses from the population you are trying to estimate. Generally, when reading the methods a study used, try to think to yourself whether they make sense for what it is looking for and what assumptions it relies on (hopefully, the authors are explicit about this). Statistical methods and checks can limit some forms of bias,[6] but generally you want to be able to assume that the population you are sampling from is randomly distributed across the effects you are looking to study.[7]

As mentioned, it is important to look at what the data actually represents and how well it matches what it is taken to represent. For example, let's say I want to study how normal political corruption is in an average person's life. One might consider a poll question like: "On a scale of 1 to 5, how normal do you feel political corruption is in your political system?" This asks for the person's perceptions of what I am trying to study, which I may be able to assume are correlated with the conclusion I want to make. Sometimes it might be sufficient to assume perceptions are representative, but things other than what we are studying might drive perceptions (e.g., if corruption is very normal in a society, people may see decreases in corruption as meaning corruption is low, while in a society where corruption is rare, a smaller increase may be perceived as a larger problem).

Instead, I might instrumentalize what I want to know in another way, for example, by asking 'how often in the last five years has a public official asked you for a favour/bribe for a service?'[8] This is a more direct measure of a form of corruption, but it is also imperfect as there may be forms of corruption it doesn't capture. I may, therefore, want to ask questions like how likely a person thinks it is that politicians would accept bribes, how often they think judges or police accept bribes, or how often decisions are made on extralegal bases, etc., to develop a more complete picture (though for longer surveys it is harder to get robust, consistent responses).

In general terms, you should try to look at a question and try to think of other things that responses might represent, besides the effects the study is being used for, how likely that might be and what, if any, measures are in place to rule out those effects.

A General Note on Bias in Methods

Besides bogus respondents, I have not spent much time talking about the common topics of assessing surveys, those being various forms of bias. There are far too many to list, but as I indicated in the case of bogus respondents, a good way of thinking about them is to think about the methods themselves and what biases they might introduce. To reuse my example, a robust longitudinal study may have little to fear from bogus respondents, but it should be concerned about attrition bias and how it copes with population changes. Over time, some respondents will drop out (for a variety of reasons that are not randomly distributed, such as dying or migrating), and if the study isn't recruiting new respondents, the population will skew further the longer the study goes on. You should expect a study to spend more time and effort dealing with the kinds of bias it is particularly likely to face.

Final Advice for Readers and Our Biases

Astute readers will no doubt have recognized that many of these recommendations are less than straightforward and prone to personal judgement and bias. Further, for many, the effort of rigorously reviewing a study's methodology and supplemental material (which in the case of some large, robust panel surveys can constitute hundreds of pages of guidelines, questions and control methods) is not exactly practical. I would, however, urge caution in allowing our biases to judge what we review, particularly with regard to the sniff test. As mentioned (and as Scott Alexander indicated with regard to lizardmen), a small effect is a good reason to view a result with more skepticism. Responding "this result is less than 4% so it should be discarded as within the Lizardman constant" is an unacceptable practice; responding "this result is fairly small so I would want to review whether it could be the result of some confounding effect or bias before I judge it" is good practice, even when we do not have time to carry out that review. When we do not have time to review a study ourselves, I would suggest looking briefly through its citations, checking whether journals have published comments or retractions, and even just noting the broad length of the methodology section (and online supplements) to gauge whether there seems to be sufficient scrutiny.

Still, while this is preferable to outright dismissal, one might be more likely to take for granted things that agree with us while indefinitely delaying judgement on results we find inconvenient. As good practice, if you find a result that is generally viewed as surprising in some way but agrees with you, that is when you should be most skeptical. On the other hand, where a result is somewhat surprising but contradicts our biases, I would try to approach it with curiosity rather than abject skepticism, particularly if it appears in a reputable publication. It is quite likely there is an explanation other than the one presented, but then one should wonder what that explanation is, whether the authors themselves thought of the possibilities you are considering, and whether they or others have addressed them.

------------------

A Short Selection of Public Resources, Papers and Examples on Survey Best Practices and Design:

American Association for Public Opinion Research, "Best Practices for Survey Research": https://aapor.org/wp-content/uploads/2023/06/Survey-Best-Practices.pdf

ASA's Proceedings of the Survey Research Methods Section: http://www.asasrms.org/

Podsakoff, et al. "Sources of method bias in social science research and recommendations on how to control it." Annual review of psychology 63, no. 1 (2012): 539-569. https://www2.psych.ubc.ca/~schaller/528Readings/Podsakoff2012.pdf

Pew Research Methodology: https://www.pewresearch.org/our-methods/ (and methodology research: https://www.pewresearch.org/topic/methodological-research/ )

Kennedy et al "Assessing the Risks to Online Polls from Bogus Respondents." Pew Research Center: https://www.pewresearch.org/methods/2020/02/18/assessing-the-risks-to-online-polls-from-bogus-respondents/

The Harvard University Program on Survey Research: https://psr.iq.harvard.edu/book/guides-survey-research

Dillman, D. A. (2000, June). Procedures for conducting government-sponsored establishment surveys: Comparisons of the total design method (TDM), a traditional cost-compensation model, and tailored design. Proceedings of American statistical association https://ww2.amstat.org/meetings/ices/2000/proceedings/S15.pdf

  1. ^

    One critique I would have is I am somewhat unclear on what Scott is including by "cognitive biases" here. Someone who truthfully answers a poll with a belief they derived from their cognitive biases should not be considered among the 'lizardmen', the purpose of polling them is to identify people's actual beliefs.

  2. ^

    As an aside on terms, there isn't necessarily a hard and fast rule on when a survey is a 'poll.' However, polls generally refer to a class of surveys aimed at measuring snapshots of public opinions which can be done by various means (such as probabilistic phone/address sampling, or non-probabilistic online sampling).

  3. ^

    See, e.g., Pew Research's discussion of their work on bogus respondents here: https://www.pewresearch.org/methods/2020/02/18/bogus-respondents-bias-poll-results-not-merely-add-noise/

  4. ^

    E.g. "2% to 4% of opt-in poll respondents repeatedly gave answers that did not match the question asked. Throughout the report we refer to such answers as non sequiturs. There were a few such respondents in the address-recruited panel samples, but as share of the total their incidence rounds to 0%." https://www.pewresearch.org/methods/2020/02/18/answers-that-did-not-match-the-question-were-concentrated-in-opt-in-polls/

  5. ^

    There is inherent difficulty in deriving conclusions from correlations when there are lots of potentially related variables involved.

  6. ^

    There is a wealth of literature on working with various forms of regressions specifically for these problems.

  7. ^

    If these residuals are randomly distributed, it means that even if some groups are over-represented, as long as that is random across the effects, larger population estimates can be derived by simple weighting. If the effects are not randomly distributed, such naive weighting doesn't work.

  8. ^

    This is based on an actual question in prior rounds of the European Social Survey https://ess.sikt.no/en/datafile/edee45f2-976b-4c8b-902d-b65dc003c92e?tab=1&elems=366f7e3d-65de-4482-b64c-9fb4b908352a




“This Hypothetical is Unrealistic” is not a Valid Objection

LessWrong.com News - June 1, 2026 - 03:02

Whenever a discussion touches on ethics, philosophy, or guiding principles, hypotheticals become useful. We cannot investigate every idea with real experiments, but we can test the consistency and precision of the principles that guide us with thought experiments. It isn’t necessary to see a man murdered in front of you to understand whether that would be good - we can simply imagine it, and realise our principles would, in that scenario, produce an answer. This process of considering something that has not occurred, or will not actually occur, is the basis of all counterfactual reasoning. “If X, then what?” is a piece of cognitive machinery without which we would be unable to make sense of the world.

However, it is common for people to respond to questions or statements of the form “if X, then what?”, with the maddening objection “but, not X”.

This objection is a general refusal of the word “if”

ALL hypotheticals are unrealistic - if they are realised, they cease to be hypotheticals. Being unwilling to engage in hypothetical reasoning means you are unwilling to engage in counterfactual reasoning, and are ultimately committed to exclusively considering that which has already happened or is certain to happen. By rejecting the antecedent premise of all unrealised hypotheticals, you forgo the mechanism that allows you to make any plans whatsoever.

“If” inherently acknowledges that a thing does not obtain. By positing X as an “if”, you are not validly critiqued on the basis that X does not obtain. Assuming “X does not obtain” is a valid basis to dismiss a conditional premise, then this argument applies to all if statements.

One may retort by claiming the steelman principle is: “X cannot obtain”. However, this does little to alleviate the burden of engaging with contrived scenarios. For one, rejecting conditionals where X cannot obtain commits you to the view that conditionals involving the past are impossible to consider - changing the past is not physically possible. So, questions like “if you never bought a dog, would you have dog food in your house today?” are off the table.

Furthermore, even granting this “impossibility” principle, the set of things which cannot obtain is far smaller than the set of things which most likely won’t obtain. This standard requires proving some contradiction inherent in the premise, or, at a practical level, that a scenario would violate some law of physics held as an axiom.

What law of logic or physics prevents a tennis match between you and Christopher Walken?

Why “contrived” is not a valid critique of a hypothetical

A hypothetical tests a principle. If you say “murder is wrong” without qualifying the statement, you are not saying “murder is usually wrong”, or “murder is often wrong”, you are saying “for all X, if X is murder, X is wrong”. This statement, though intuitive, is, in fact, extreme and virtually indefensible without caveats[1].

The set of “all X” includes the set of “all contrived, extreme-seeming forms of X”, because those things are still X.

Consider the syllogism:

  1. X is wrong
  2. [contrived case of X] is X
  3. [contrived case of X] is wrong

This shows that to say X is wrong (without exception), you are committed to the conclusion that all contrived and unrealistic cases of X are wrong.

So, no matter how absurd-seeming the case of X, the syllogism always holds:

  1. Murder is wrong
  2. Murdering Michael Jackson in a distant marshmallow galaxy is murder
  3. Murdering Michael Jackson in a distant marshmallow galaxy is wrong

Given this, you can test whether the principle of “(all) murder is wrong” holds by looking at contrived cases of murder.

  1. Murder is wrong
  2. Murdering a 99-year-old man who has 1 second left to live in order to save 1000 innocent lives is murder
  3. Murdering a 99-year-old man who has 1 second left to live in order to save 1000 innocent lives is wrong

1 inescapably entails 3 - therefore, if you believe that statement 3 is false, then believing 1 is true produces an outright contradiction.
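The two logical moves used above can be written out formally. The following Lean 4 sketch is only an illustration: the names `Act`, `Murder`, `Wrong` and `contrived` are hypothetical labels introduced here, not anything from the post. The first theorem is the syllogism itself (a universal principle applies to any case, however contrived); the second is its contrapositive (one accepted counterexample refutes the unqualified principle).

```lean
-- `Act`, `Murder`, `Wrong` and `contrived` are hypothetical names for illustration.

-- Premises 1 and 2 entail conclusion 3: the principle covers every case of X.
theorem contrived_case_is_wrong {Act : Type} (Murder Wrong : Act → Prop)
    (principle : ∀ x, Murder x → Wrong x)
    (contrived : Act) (isMurder : Murder contrived) :
    Wrong contrived :=
  principle contrived isMurder

-- If statement 3 is false for some case of murder, statement 1 cannot be true.
theorem counterexample_refutes {Act : Type} (Murder Wrong : Act → Prop)
    (contrived : Act) (isMurder : Murder contrived)
    (notWrong : ¬ Wrong contrived) :
    ¬ (∀ x, Murder x → Wrong x) :=
  fun principle => notWrong (principle contrived isMurder)
```

Note that neither proof depends on whether `contrived` is likely, or even physically possible; it only needs to count as a case of X.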

Answers to absurd scenarios are necessitated by universal principles

Consider the premise:

“If I become paralysed, I will not be able to ace out any person in tennis”

If someone accepts this principle as true, per the earlier syllogism, they accept it for all cases of “any person”. Therefore, they accept they will not be able to ace out their usual tennis partners, which is obviously true.

However, this also commits them to the view that they will be unable to ace out Christopher Walken.

  1. If I become paralysed, I will not be able to ace out any person in tennis
  2. Christopher Walken is a person
  3. If I become paralysed, I will not be able to ace out Christopher Walken in tennis

Further, “In tennis” does not impose a geographical constraint. So, a tennis match played on Mars would be “in tennis” by definition.

  1. If I become paralysed, I will not be able to ace out any person in tennis
  2. Tennis on Mars is tennis
  3. If I become paralysed, I will not be able to ace out any person in tennis on Mars

We can now put the two syllogisms together:

  1. If I become paralysed, I will not be able to ace out any person in tennis
  2. Christopher Walken is a person
  3. If I become paralysed, I will not be able to ace out Christopher Walken in tennis
  4. Tennis played on Mars is tennis
  5. If I become paralysed, I will not be able to ace out Christopher Walken in tennis on Mars

Therefore, accepting premise 1 deductively requires conditionally accepting the “unrealistic” scenario.

So, if someone says “paralysed people cannot beat anyone at tennis”, and you say “what if they played Christopher Walken on Mars?”, the only coherent answers to give are:

“Yes, including Christopher Walken on Mars”; or

“No, I suppose in that case there might be a chance (perhaps Walken dies first) - therefore, the original statement is improperly specified, i.e., strictly false”

The reply: “But I would never play Christopher Walken on Mars”, is simply an irrelevant statement that fails to appreciate that 1 deductively leads to 5.

Accepting “If X, then Y” does not require any acknowledgement of X being true or feasible.

Why this objection is so common

To the untrained eye, dismissing absurd scenarios looks like rigor. Someone who uses a contrived thought experiment to elicit an absurd conclusion that the other person would never normally endorse can come across as a sophist using a trick; a “slimy debate tactic”. This feeling of being hoodwinked comes from an almost-getting-of-the-point - realising that, indeed, if X is true, then Y would seemingly follow, and Y obviously isn’t true - so something must be awry. The explanation of “some kind of trick” is easier to reach for than the explanation that X may not be as true as you would like it to be.

  1. ^

    Importantly, if one offers the statement of “murder is wrong” in a general sense, it is of course pointlessly pedantic to test it on contrived edge cases to see if the idea holds absolutely, since it is already understood to mean “murder is pretty much always wrong except for some really rare circumstances that I’m not talking about”. However, this dismisses the hypothetical on the basis of relevance, not on the basis of realism. So, if a pedant does challenge the “murder is wrong” premise with an edge-case hypothetical, it is still invalid to say “that edge case would never happen” - instead, the reasonable answer is “yes, not literally all murder is wrong, but we both know that, and a ten-paragraph list of qualifying statements isn’t necessary for the discussion we’re having - you know what I mean”. When the hypothetical test serves no clarifying purpose, and is merely pointing out that the wording of the premise is underspecified per a literal reading, it is a fruitless distraction. However, this “you know what I mean” response would itself only be a reasonable answer as long as the crux of the discussion isn’t contingent on the details of what exactly is meant by the statement.




NLA Thought Anchors

LessWrong.com News - June 1, 2026 - 02:38

This post looks further into why the output of an NLA (Natural Language Autoencoder) contains the model's prediction more often when the original activations led to a correct output than when they led to an incorrect one.

Quick Summary:
  • Extraction position matters: the rate at which the answer appears in the AV output increases as the extraction token approaches the model's final answer
  • The first sentence is the most counterfactually important, both for activation reconstruction loss and for the AV output containing the final answer
  • Sentences that are counterfactually important for generating the final answer correlate with lower reconstruction loss, suggesting the AR training reward encourages the model to include correct answers
  • Degenerate NLA outputs (repetition, garbled tokens, emoji blocks) appear only for activations from incorrect model responses
  • NLA response length varies more for incorrect activations, possibly reflecting model uncertainty
  • Incorrect activations reconstruct ~30% worse than correct ones
Key Findings:
  • Extraction position matters: the rate at which the answer appears in the AV output increases as the extraction token approaches the model's final answer
  • Surprisingly, for activations that led to an incorrect answer, the NLA sometimes produced broken or degenerate outputs, for example repetition, garbled tokens, or emoji blocks. This only appears for activations that led to incorrect responses. NLA response length also varies more for incorrect activations, possibly reflecting model uncertainty.
  • The final answer contributes more to the NLA's reconstruction loss when the activations led to the correct output, and less when they did not.
  • The NLA seems to have higher reconstruction loss when the activations led to the wrong answer on the GSM8K dataset.
  • The first sentence seems to be the most counterfactually important for NLA AV responses, both for reconstruction loss and for the response containing the final answer (containing the actual answer vs. the model's answer). The counterfactual importance was more evenly spread across sentences for base activations leading to an incorrect answer.
Experimental setup:

Code: https://github.com/Realmbird/nla-thought-anchors

Huggingface datasets I created: https://huggingface.co/collections/Realmbird/nla-thought-anchors


I created a pipeline with the following steps (for further details, see the README):

Step 1 (Generate with the base model)

Step 2 (Generate the first NLA explanations with the AV)

Step 3 (Generate rollouts and calculate rollouts) (This takes the most time; the arguments I used were a cos_sim threshold of 0.8 and 40 rollouts per sentence)

Step 4 (Analyze the rollouts)

The other files mostly produce visuals and analysis, and each notes which step must be run first.
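For readers unfamiliar with this style of analysis, the sketch below shows one plausible reading of how a cos_sim threshold and per-sentence rollouts can combine into a counterfactual importance score. It is an assumption made for illustration, not the code in the linked repository: the function names, the input layout and the choice of a simple rate difference are all hypothetical. The idea is to keep only rollouts whose replacement sentence is dissimilar enough to the original (cosine similarity below the threshold), then compare how often some outcome occurs with and without the original sentence.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def counterfactual_importance(original_emb, kept_hits, resampled, threshold=0.8):
    """Rate difference between rollouts that kept a sentence and those that replaced it.

    original_emb: embedding of the original sentence
    kept_hits:    list of bools, one per rollout that kept the original sentence
                  (True if the outcome of interest occurred, e.g. answer present)
    resampled:    list of (replacement_embedding, hit) pairs, one per rollout
                  in which the sentence was resampled
    Returns None when no replacement is dissimilar enough to count.
    """
    # Only replacements that differ in meaning count as a real counterfactual.
    different = [
        hit for emb, hit in resampled
        if cosine_similarity(original_emb, emb) < threshold
    ]
    if not different or not kept_hits:
        return None
    return sum(kept_hits) / len(kept_hits) - sum(different) / len(different)


# Toy 2-d embeddings, purely illustrative.
original = [1.0, 0.0]
resampled = [([1.0, 0.1], True), ([0.0, 1.0], False), ([0.5, 1.0], False)]
score = counterfactual_importance(original, [True, True, False, True], resampled)
print(score)
```

In the toy data the first replacement is nearly identical to the original sentence, so it is filtered out and does not dilute the estimate. A lower threshold keeps fewer rollouts, which is one reason the choice of threshold can change the results.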

NLA Setup

The NLAs I used were from https://github.com/kitft/natural_language_autoencoders

I also used the inference code with SGLang.

Base model: Qwen2.5-7B-Instruct

AV: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-av

AR: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-ar

Dataset:

https://huggingface.co/datasets/zen-E/GSM8k-Aug

Experiments:
NLAs are position sensitive:
  • I started with my original NLA script and looked at the rate at which the NLA output contained the final answer. The rates were clearly too low; I then noticed that I had been extracting at the last prompt token instead of the token after generation. This led me to the idea that whether the final answer appears in the NLA output depends on token position.
  • Extraction position matters: the rate at which the answer appears in the AV output increases as the extraction token approaches the model's final answer. For the answer and hash (####) tokens specifically, correct activations led to the final answer appearing in the NLA output at a significantly higher rate.




  • The resulting difference between the border token and answer token becomes more apparent after doing a few samples or rollouts
  • These results support Ryan Greenblatt's findings that “NLA output contains what the AI will predict at a rate much higher than chance for both incorrect and correct problems”
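The "rate of the NLA output containing the final answer" at each extraction position can be computed along the following lines. This is a hedged sketch rather than the repository's actual matching logic: the record keys (`position`, `explanation`, `answer`), the position labels and the number-matching rule are assumptions chosen for illustration.

```python
import re


def contains_answer(explanation, answer):
    """True if `answer` appears in `explanation` as a standalone number.

    Thousands separators are stripped first, and the lookarounds prevent a
    match inside a longer number (42 does not match 142 or 42.5).
    """
    text = explanation.replace(",", "")
    pattern = r"(?<![\d.])" + re.escape(str(answer)) + r"(?!\.?\d)"
    return re.search(pattern, text) is not None


def answer_rate_by_position(records):
    """Fraction of explanations containing the answer, grouped by position."""
    totals, hits = {}, {}
    for record in records:
        pos = record["position"]
        totals[pos] = totals.get(pos, 0) + 1
        hits[pos] = hits.get(pos, 0) + int(
            contains_answer(record["explanation"], record["answer"])
        )
    return {pos: hits[pos] / totals[pos] for pos in totals}


# Made-up explanations, only to exercise the functions.
records = [
    {"position": "last_prompt", "explanation": "A word problem about apples.", "answer": "42"},
    {"position": "border", "explanation": "About to state the result, likely 42.", "answer": "42"},
    {"position": "answer", "explanation": "Final answer 42 follows ####", "answer": "42"},
    {"position": "answer", "explanation": "The total is 142 dollars", "answer": "42"},
]
rates = answer_rate_by_position(records)
print(rates)
```

The matching rule matters for the reported rates: a plain substring check would count "142" as containing "42" and inflate the numbers, which is why the sketch anchors the match on number boundaries.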

Model Correctness Impact on NLA outputs:
  • NLAs produce more consistent AV response lengths when the original activations led to a correct response. This implies that NLA response length varies more for incorrect activations, potentially reflecting increased model uncertainty.
  • The graph shows, per sentence, the NLA's counterfactual importance for generating the actual answer (gold) or matching the model's answer (pred).
  • Correct examples are cases where the original base model activation led to the correct response. Incorrect examples are cases where it did not.
  • The first sentence is the most counterfactually important for generating the GSM8K (gold) answer.
  • For the group of activations from incorrect model responses, the most counterfactually important sentence for the model was the last sentence.
  • The counterfactual importance for generating the correct answer seems to be more balanced when the activations led to the correct response.


  • For the correct model response group (the correct examples), "contains gold" and "contains pred" should be the same, since the correct answer is the gold answer
Incorrect Model Responses have broken AV explanations:
  • An interesting finding is that only when the model outputs an incorrect answer does the AV sometimes generate broken output, such as repetition, Wikipedia-style text, forum-style text, etc.
  • See Appendix for the examples of these categories



Thought Anchors on NLAs:
  • After looking at counterfactual importance for containing the actual or predicted answer, a natural question is how it correlates with reconstruction loss.
  • I looked at counterfactual importance for containing the final answer by quartile and found that the more counterfactually important the sentence was, the lower the AR reconstruction loss. This suggests that the AR reconstruction objective encourages including the final answer in the AV output.
  • NLA responses from activations where the model output the correct response have a lower reconstruction loss on average. The NLA seems to struggle more on activations from incorrect responses.



NLAs and Reconstruction Loss:
Final Answers and Reconstruction Loss:
  • For some reason, ablating all instances of the final answer has a larger impact when the model output the correct answer than when it output an incorrect one, but only for the border token (####). This did not occur for the answer digit token.
  • For the answer token, ablating the answer from the AV output seems to have a constant effect regardless of whether the original activations led to a correct or incorrect output.
  • The higher impact of containing the answer on the reconstruction loss seems to indicate that the border token should be used if you want the AV to include the final answer. However, if you do not know whether the output is correct or incorrect, the answer token is better because of its more consistent impact on AR reconstruction.


Per Sentence Reconstruction Loss:
  • The per-sentence change in reconstruction loss (its range) varies more when the model originally generated an incorrect response.
  • The change from correct to incorrect examples is more noticeable at the answer token than at the border token
  • Border token or ####
  • Answer token reconstruction loss by sentence






Takeaways:
Limitations:
  • The NLA was for Qwen2.5-7B-Instruct, and the smaller model may have surfaced issues that would not occur in bigger models. Will the incoherent AV responses on incorrect model responses also happen with bigger NLAs?


Future Work:
  • Try different cos_sim thresholds for similarity; some of these results may be a threshold issue
  • Investigate further the impact of the model's response being correct or incorrect on the NLA
  • Investigate with bigger models (apply for a BlueDot rapid grant)
  • Clustering NLA sentences and labeling them


Appendix:

Degenerate Examples (note: prompts that appear broken are a display artifact from rendering them as images)

  • Garbled Token  [:checked:checked]
  • Emoji block
  • Wikipedia
  • Repetition
  • Forum post

Coherent Examples

  • Answer box
  • Final Answer
  • Closing_format


Mixed

  • Calculator





Lighthaven East - A Feasibility Study

LessWrong.com News - June 1, 2026 - 01:53

As a bureaucrat, my role is to annoy my friends. Someone voices an idea, “Wouldn’t it be nice if…” or “I wonder if we could…” I make a note. I do some estimates. If it pencils out, I’ll bring it back up, week after week. The discussions are fun, but also practical. We’ll test the waters, what would be a minimum viable scheme? What’s easy, what’s hard? Who could do the hard parts? Over time the idea gets more detailed, specific, feasible. I’ll pull out a calendar. Soon our scheme has co-conspirators, action items, even a budget. It’s just good staff work.

I’ve been hearing whispers in the wind for a year now. 

  • “Imagine if we had something like this in DC.” 
  • “Where can I host an event that might get a dozen or a hundred people?” 
  • “It’s such a pain in the ass to book event space in the Capitol.” 
  • “I think this person has started to see what’s coming, where can they go to get caught up?”
  • “The community seems to be growing but it’s all fragmented in group chats.” 
  • “How is no one planning an afterparty, that’s clearly the highest leverage intervention!?”
  • “Why can’t every wall be whiteboards?”

These are all variants on a theme: “Lighthaven East.” 

I did some digging. I’m happy to report that this could work. There’s strong demand. There are good options for supply. Funding, staffing, resources, property, and permits are all doable. The hard parts are diligence, agency, and will. This project needs a champion, but it’s a thing someone can simply choose to do. 


How Lighthaven Works

Legally speaking, Lighthaven is a confusing category error. It was once the ramshackle “Rose Garden Inn,” with several buildings, a hotel license, and a history of event use. After extensive renovations, it is now a 30,000 square foot campus used for conferences, retreats, office space, and medium-term lodging. The property is owned by Lighthaven LLC and financed by an interest-only mortgage held by a philanthropist. The LLC runs the property, hosts internal events, rents conference and office space to external customers, and sells hotel stays. Lighthaven LLC is itself owned by Lightcone Infrastructure, a non-profit that among other things runs LessWrong.

Economically, Lighthaven LLC generates an operating profit comparable to its cost of capital. The mortgage is $20 million at 5% interest, for an annual interest payment of $1 million. Lighthaven LLC had $3.25 million in revenue in 2025. Events and hotel stays generated an operating profit of roughly $850k, almost enough to pay the $1 million annual interest payment. Office space seems to be offered at cost. Lighthaven LLC’s projections of $3.5 million revenue in 2026 should generate an operating profit sufficient to fully fund its annual interest payment, though bookings are currently sparse for this fall. 
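The arithmetic behind these figures can be checked in a few lines. The inputs are the numbers quoted above; the derived quantities (shortfall, coverage ratio, and the margin needed on projected 2026 revenue) are simple illustrations, and the last one assumes all revenue contributes to operating profit at a uniform rate, which the office-space-at-cost detail suggests is not quite true.

```python
# Figures quoted in the text, in whole dollars.
principal = 20_000_000
interest_rate_percent = 5
operating_profit_2025 = 850_000
projected_revenue_2026 = 3_500_000

# Interest-only mortgage: the annual payment is just principal times rate.
annual_interest = principal * interest_rate_percent // 100

# How far 2025 operating profit fell short of the interest payment.
shortfall_2025 = annual_interest - operating_profit_2025
coverage_2025 = operating_profit_2025 / annual_interest

# Illustrative only: the blended margin on 2026 revenue needed to cover interest.
required_margin_2026 = annual_interest / projected_revenue_2026

print(annual_interest, shortfall_2025)
print(f"{coverage_2025:.0%} of interest covered in 2025")
print(f"{required_margin_2026:.1%} margin needed in 2026")
```

On these numbers, 2025 operating profit covered 85% of the interest payment, leaving a gap of $150,000 for the 2026 projection to close.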

In practice, Lighthaven is the best event venue I’ve ever seen. I won’t belabor that point in this post, but if you haven’t been to Lighthaven, see some of its many rave reviews in this footnote.[1] Lighthaven LLC does not maximize profits–events are often experimental or designed primarily to support the Berkeley community, rather than the booking going to the highest bidder. Some event organizers are not charged, others are offered discounts on rates that are already lower than similarly sized spaces at hotels. This pricing strategy generates significant positive spillover effects and goodwill, demonstrated by the community’s strong response to Lightcone’s two fundraisers. While Lighthaven was a significant cost center for Lightcone in 2023 and 2024, by 2026 it is better modeled as supporting the parent non-profit. 

Conceptually, Lighthaven is a monastery. Its main purpose is to support good scholarship “dedicated to making humanity’s future go better.” Its abbot skillfully wields an awkward mix of temporal, cultural, and political authority. Monasteries often support their ecclesiastical mission by selling craft goods such as beer, eggs, mushrooms, or furniture–Lighthaven instead sells conference space. Unlike an abbey selling produce for revenue, the conferences at Lighthaven also further Lightcone’s mission. Lighthaven’s scholars-in-residence synergize with its mission, contributing to and benefiting from the events held on the property.

These aspects combine into a whole: Lighthaven is the place to go to think out loud. Comfortable perches encourage deep thought. Inviting conversation nooks encourage you to refine your ideas with friends, themselves helpfully provided by the events and scholars-in-residence. Beautiful seminar rooms encourage you to share your ideas, refining your presentation to best convey them to others. Once your ideas are fully baked, get the word out via your laptop, the antique typewriters, or having a friend interview you in the podcast studio.


What Does DC Need?

DC culture has a Lighthaven-shaped hole. Politicians have started to notice that they are confused about the future of AI. AI Policy nonprofits rent event space, mostly bars and restaurants for expensive and echoey events to grab a few minutes of staffers’ time. AI companies try to use these same spaces for technical demos, sometimes mixing beer and laptops with limited success. Technical communities of practice have unprecedented attendance as practitioners realize they need to upskill. EA and Rationalist policy organizations are scaling in DC, but each option for co-working space comes with significant downsides. Aligned conferences happen, but are held in either hotels with huge up-front costs or group houses well below their optimal attendee-count.

Resources are there to address all of these problems, people are working hard on them. But everything is scattered, hard to find. One step doesn’t necessarily lead to others. Imagine instead that someone approaching the community could have a day like the following...

A Day in the Life

Our protagonist is a tech policy staffer on a relevant congressional committee, mid thirties, has spent their career in positions of increasing authority in government and not-technically-government organizations. They’re an expert in telecom policy, or broadband, or electrical grid economics, or some other sub-field of technology policy, but now they need to learn about AI. The whole office knows there’s going to be a flood of AI bills in the 120th Congress, beginning January 2027, and there are only six people on the committee staff working on technology policy at all. Everyone needs to “get smart on AI,” immediately.

Through some coverage of a book with an edgy title, they understand that this topic is risky in some controversial way. A friend on a different committee recommended they meet with a particular non-profit. The non-profit has a few people in DC permanently, but as luck would have it, this is the week when some of the senior people are visiting from California. The committee’s available conference room only seats four comfortably, so our protagonist decides to go to them, meeting at their co-working space. It’s a lovely spring Friday in DC, it’s only a mile, it’ll be a nice walk.

When our protagonist arrives, they realize they’ve been here before. There was an industry event here last month, in the main room on the first floor, showing the capabilities of some new coding system. It seemed impressive, and they requested access once back at the office, but the Architect of the Capitol won’t let that code onto government systems for at least a year. That denial is what prompted our protagonist to gripe to their colleague on another committee in the first place, ultimately prompting this meeting.

This time, they go to the co-working space on the second floor. It seems… nice, if a bit weird. It’s hard to put their finger on why the space seems brighter and more alive than a typical WeWork. Some of the furniture is custom, fitting its space exactly without being ostentatious about it. Other pieces are clearly from Ikea, but work well enough. The space has all the cliche amenities of offices in the Bay, yet these actually seem to be used, several people are sitting on some plush carpets in a corner. There are whiteboards everywhere, just a ridiculous number of whiteboards, and even the windows… no that’s different, someone has put stained glass stickers in the top third of each. Why so many paperclips?

The meeting goes well. It narrows on a particular technical point about halfway through. The non-profit staff flag down someone walking by, who quickly clarifies that he’s with a different organization, but he joins in and within minutes is diagraming the disagreement on one of the whiteboards. It seemed silly, everyone knew what those words meant, but I guess it did clear up their confusion. 

Now that our protagonist is following, they want to know more. As luck would have it, there’s a conference this weekend on-site. The monitor on the wall shows there’s going to be a session on this technical point in the evening, and a workshop tomorrow afternoon. Is it too late to register? Hmm, let’s ask the organizer, they’re probably setting up downstairs. They find him avoiding the choreographed chaos in one of the many nooks, rearranging the schedule for the seventh time. There have been a few cancellations, we can print another badge…


Minimum Viable Lighthaven

To start to put together something like this, we need to figure out the smallest plan that might work. I believe a Minimum Viable Lighthaven requires a few key features:

  • Permanence - dedicated space that we can change to our liking, redesign to support good conversation and thought. Either owned outright or a long-term lease with stable financing.
  • A large room - for speaking to all attendees of an event at once.
  • Nooks - lots of smaller spaces for conversations, with at least some physical separation and sound dampening.
  • Good location - a retreat venue you go to twice a year can be inconvenient, but this should be the obvious choice for a variety of events. 
  • Consistent leadership - an ultimate decisionmaker, whose decisions stick.

Some features that are not strictly required, but are very nice-to-have if we can, include:

  • Segmentable space so that smaller groups can hold events at the same time.
  • Dedicated office space, segmented into rooms.
  • Good architecture.

Hotel rooms would be a mixed blessing. They would allow us to host weekend-long residential retreats, as Lighthaven does, but it’s extra space that we’d have to purchase, maintain, and manage. This could double the overall cost of the project, without necessarily doubling steady-state revenue or providing as much value as the conference space itself. If the right property comes with hotel rooms, it could be worth it, but I think we should prefer to keep them to a minimum. And more practically, DC has avoided Berkeley’s market failure in lacking hotel rooms. 

…so you mean a Group House?

Could a large group house qualify as a Minimum Viable Lighthaven? 

Workshop House in DC is a case in point. It’s gorgeous, the residents and leadership are friendly, they host excellent gatherings directly and rent their space out to outside events. I’ve hosted a large event and several smaller gatherings there, they’re fantastic to work with. But it’s telling to see where even such a successful property and institution falls short of what we’re looking for. 

The trouble is that it’s primarily a residence, the needs of the residents come first. Only about 2,500 of its more than 7,000 square feet is available for event use. Its largest room holds about 60 people, tightly. When booking, a lot of decisions need to be run by several stakeholders, getting to “yes” on specifics inevitably takes time. Their space is something the residents can graciously offer for up to a few days, as opposed to dedicated event space. 

As a case study explains:

Originally, residents were excited about renting out the ground floor open space fairly regularly (e.g., for a nonprofit’s quarterly board meetings). Over time, they found that this was more disruptive than additive, and have limited outside rental or space loans only to those which overlap heavily with the interests of the existing residents and community. Outside groups are now welcome to rent the space for aligned, recurring events up to twice a year, but the house is much more interested in hosting one-off, pilot events and gatherings.

It’s tantalizingly close, but I think experience shows that primarily residential spaces don’t have the quality we’re trying to capture. Even if no one lived there, the floorplan would typically not work well. Conferences and retreats want some sort of large space that can hold everyone, at least briefly. I can’t find hard-and-fast rules, but I think this implies about a quarter of your total programming space should be a single room. Smaller houses can fit that criterion, but larger houses don’t tend to have a large single room that scales with the number of bedrooms unless it was specifically designed for entertaining. Even Lighthaven struggles by this measure, with the largest sessions of LessOnline and Manifest straining Rat Park (which holds about 300).

…so you mean a Co-Working Space?

Yes and no.

From what I can find, non-profit Co-Working Spaces in our community don’t tend to be self-sufficient. NET in DC, Mox in SF, Collider in NYC, and I believe Constellation in Berkeley each use grants and/or donors to sustain regular operations. While the organization could certainly seek grants for special projects, we’d want to avoid institutional fundraising to sustain regular operations. Churn is a big part of this, people leave co-working spaces when they start working for a larger org, when their project fails, and when their project succeeds and outgrows the space. Given the many ways individual co-working users can exit, even if you manage to fill the space briefly, you won’t stay full without a strong pipeline of new entrants, which can risk the institutional culture. 

Further, spaces that are primarily offices just don’t feel comfortable to use. Even when the space is comfortable for the workers, it isn’t when using the space for other events. It can feel like an intrusion, there’s a friction to moving someone’s desk aside, or sitting down at it. IMO co-working spaces make good overflow space, and reasonable break-out spaces at conferences, but should not be a majority of the space. They certainly should not intrude into the main large event hall, which should be optimized for events.

That said, I think co-working is a crucial component of this project. Having our community use this project as office space helps establish it as a default meeting space, seeds conversational serendipity, and even makes it safer by providing more eyes-on-the-street. It brings in some revenue from the property during weekdays, which are otherwise hard to rent to events, and creates a built-in audience for evening talks and events. Co-working space could be a practical substitute for Lighthaven’s Scholars-in-Residence. Co-working space is a key pillar, it just shouldn’t be the main focus of the plan. 


Feasibility Study

A Minimum Viable Lighthaven DC as envisioned by this feasibility study would have three main lines of business:

  • Conferences
    • Professional policy-focused conferences during the work week.
    • Looser, more Lighthaven-style conferences on weekends. 
  • Nonprofit and Corporate Events
    • Mainly evenings during the work week - a staple of the DC policy world
  • Co-Working for aligned organizations
    • Likely including intensive fellowship programs, such as Inkhaven.

In my rough estimates, it’s difficult to make a venue self-sufficient with any one of these uses, which is why this venue doesn’t already exist. A venue combining any two of them should be self-sustaining, even at 60-70% occupancy. Doing all three adds complexity, but each reinforces the others, like three legs of one stool.

Given that, I think we want to buy a church. Failing that, a school, an embassy, or a small hotel. 


Property

We probably don’t need a campus, specifically. The climate in Berkeley is obnoxiously perfect for outdoor use much of the year. Temperatures rarely drop below freezing or exceed 85 degrees Fahrenheit. Most rain falls in December through March, leaving eight months of drier, warmer weather. But even this is underselling the usefulness of the outdoor space at Lighthaven: I began writing this document in the Gazebo of Schemes on a bright, clear day that was too warm for a sweater… in February. This lets outdoor space double as programming space, significantly expanding the campus’s usable square footage and making the buildings feel more connected. When warm days transition to cool evenings, guests gather around fire pits or huddle under blankets in nooks. DC is not this way. Summers are hotter and more humid. Winters are colder, occasionally with snow. Spring and fall tend to be nice, but are unpredictable. Event organizers in Berkeley can plan on outdoor space being usable; organizers in DC cannot rely on outdoor space in the same way.

When looking for a site in DC, we should consider the current zoning and historic use of the property. The Rose Garden Inn was zoned as “Avenue Commercial,” operated as a hotel, and had a demonstrated history of event use. The hotel license was included in the purchase and no zoning variances were required. The new owners continue to operate the property as a hotel that also rents conference space, there was no legal change in use. If Lighthaven had been zoned residential, it would have required a vote of the Berkeley city council to change its permitted use, adding at least a year and substantial risk of a “no” to any project’s timeline. Instead, Lighthaven operates “by right,” which should mean it doesn’t need much from the city.

In practice, Lighthaven works closely with the city, requires permission and permits for most improvements, some repairs, and even occupancy. Permits and inspections in old properties, especially those designated as historically relevant, necessarily require subjectivity. It often isn’t practical to bring a historic building up to modern code, but it is a judgement call just how much improvement to require. Unrelated matters, such as neighbors’ perception of how often Lighthaven guests park legally on residential streets, are not supposed to be relevant to those decisions… and yet… that’s just how people work. It is important to strive for good relationships with one’s neighbors regardless, but any city's politics may have more veto points than a straightforward reading of the law would imply.

How large should this property be? Lighthaven can comfortably host conferences of about 500 people, and parties of roughly twice that number, using about half its 30,000 square feet available as public space. This suggests a rough comfort estimate of about 30 square feet per conference attendee, though we’ll want to aim a bit higher if not relying on outdoor or private hotel space as pressure valves. However I don’t think we need quite as much public space as Lighthaven has. It might be better to aim for a property with about 10,000-12,000 sqft of public space; giving a capacity of 250-300 for conferences or 500-600 for evening events or parties. This would be particularly appealing on sites with options for later expansion. 
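The sizing arithmetic above can be sketched in a few lines of Python. This is only a rough check of the figures already quoted, not a capacity model. The 40 sq ft per attendee target is my assumption: the text only says to "aim a bit higher" than 30, and 40 happens to be the value that reproduces the quoted 250-300 range. The function names are illustrative.

```python
# Figures quoted in the text above.
LIGHTHAVEN_TOTAL_SQFT = 30_000
LIGHTHAVEN_PUBLIC_SHARE = 0.5       # "about half" is available as public space
LIGHTHAVEN_CONFERENCE_SIZE = 500

# Assumption (not stated in the text): a 40 sq ft target, chosen because it
# reproduces the quoted 250-300 conference capacity.
TARGET_SQFT_PER_ATTENDEE = 40
PARTY_MULTIPLIER = 2                # parties hold "roughly twice" a conference


def sqft_per_attendee(public_sqft, attendees):
    """Public square feet available per conference attendee."""
    return public_sqft / attendees


def capacities(public_sqft, sqft_per_person=TARGET_SQFT_PER_ATTENDEE,
               party_multiplier=PARTY_MULTIPLIER):
    """Return (conference capacity, party capacity) for a given public area."""
    conference = int(public_sqft // sqft_per_person)
    return conference, conference * party_multiplier


baseline = sqft_per_attendee(
    LIGHTHAVEN_TOTAL_SQFT * LIGHTHAVEN_PUBLIC_SHARE, LIGHTHAVEN_CONFERENCE_SIZE
)
print(f"Lighthaven baseline: {baseline:.0f} sq ft per attendee")

for sqft in (10_000, 12_000):
    conference, party = capacities(sqft)
    print(f"{sqft:,} sq ft -> {conference} conference / {party} party")
```

Changing `TARGET_SQFT_PER_ATTENDEE` shows how sensitive the headline capacity is to the comfort assumption: at the Lighthaven baseline of 30, the same 12,000 sq ft would nominally hold 400.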

So, this Minimum Viable Lighthaven would want a 12,000 sqft property with appropriate zoning, a history of event use, enough spare space to operate part of the property while other portions are being gradually renovated, with roughly 3,000 sqft of its space as one large room. This describes a church, in particular one with attached program space or a rectory. Other kinds of properties that might work include small schools or hotels, so long as they have an auditorium or other large event space. Embassies, or technically their chanceries, are another option; rare but appealing. Countries’ needs change over time, so chanceries do occasionally change hands as an embassy needs to upgrade or downsize. Mansions do not appeal unless already converted to and zoned for commercial use, such as for a wedding chapel.

This project is not necessarily dependent on what properties are listed for sale; the Rose Garden Inn was not. Lightcone Infrastructure approached its owners after scrolling satellite pictures of the East Bay on Google Earth to identify prospects. City churches often move to the suburbs as their membership ages and neighborhood tastes change, religious schools face similar dynamics. We might find a congregation willing to sell.


Funding

There are institutions building portions of this already. The Network on Emerging Threats hosts coworking space and monthly policy talks. It recently announced it was moving to a larger space, but the new space is less conducive to events. IFP and FAI host excellent large evening events on the roof of their office building, but the events have started to outgrow the space, the rentals are expensive, and FAI recently moved out of the building. EAs and Rats in the area also have a large social scene, with parties and socials most weekends, and policy-focused events most weekday evenings. A typical week has 8 or more public events scheduled, which are often constrained by the capacities of their venues. I estimate the community already spends over $50,000 in an average month on co-working and event space, not including the dedicated offices of larger organizations (like IFP) or larger, irregular conferences.[2]

Good commercial property in DC tends to go for up to $750 per square foot. (As an example, this property is larger and more ornate than ideal, but currently for sale at that cost.) Our desired 12,000 sqft should cost between $7-10 million. Interest rates are higher now than when Lighthaven was purchased, but if we can find a philanthropist willing to lend at 8% interest, even $12 million, the top of that budget plus $2 million for capital improvements, would come to just under $1 million per year in interest. Lighthaven has $2 million per year in operating costs. A Lighthaven DC would be smaller, without hotel rooms by default, and in a region with a lower cost of labor/living. There would still be significant operating expenses: a full-time director, other full- or part-time staff, supplies, insurance, utilities, etc., but I think $1 million per year is a reasonable budget.

Between cost of capital and operating expenses, the property would need roughly $2 million per year in revenue to break even. Events and co-working on the order of what the community already has today, while capacity constrained, could cover a third of this. Many existing events wouldn’t move locations, but a venue like this would be an attractive option for new events and co-working uses. Adding in some additional latent demand, one evening corporate rental per week, and one large weekend conference per month would get the property to break-even, with substantial room to improve its offerings if it can manage more bookings. 
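As a sanity check, the break-even arithmetic can be written out as below. It is a minimal sketch using only the rough figures from this section, and it assumes an interest-only loan (the text speaks only of annual interest, not principal repayment). The variable names are illustrative, and no per-event prices are assumed because the text does not give any.

```python
# Rough figures from the text; all amounts in whole US dollars.
LOAN_PRINCIPAL = 12_000_000         # $10M purchase ceiling + $2M improvements
INTEREST_RATE_PERCENT = 8           # hypothetical philanthropic loan rate
OPERATING_BUDGET = 1_000_000        # estimated annual operating costs
EXISTING_MONTHLY_SPEND = 50_000     # estimated current community spend

# Assumption: interest-only financing, so capital cost is just the interest.
annual_interest = LOAN_PRINCIPAL * INTEREST_RATE_PERCENT // 100
break_even_revenue = annual_interest + OPERATING_BUDGET

existing_annual_spend = EXISTING_MONTHLY_SPEND * 12
share_covered = existing_annual_spend / break_even_revenue
remaining_gap = break_even_revenue - existing_annual_spend

print(f"Annual interest:     ${annual_interest:,}")
print(f"Break-even revenue:  ${break_even_revenue:,}")
print(f"Existing spend:      ${existing_annual_spend:,} ({share_covered:.0%})")
print(f"Gap for new demand:  ${remaining_gap:,}")
```

The gap is what latent demand, weekly corporate rentals, and monthly weekend conferences would need to cover between them.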


What is the Minimally Viable Funding?

I think that a founder should not purchase property until they have secured at least the minimally viable amount of funding. There are a few key things this includes:

  • Purchase of the property itself - likely secured with a mortgage
  • Initial furnishing, repairs necessary to start operations.
  • At least one, hopefully two, renovation projects in portions of the property.
  • Enough runway in operating and mortgage costs to get to self-sufficiency. 

That last bullet is likely to be the sticking point. Every project takes time to reach full operations, this would be no different. I estimate reaching self-sufficiency would take at least two years, unless cutting corners and limiting the ambition of the project to reach that milestone earlier. Depending on the size of the property, the amount of renovation desired, and how the space is configured, it could easily take three years, or even four.

I wouldn’t necessarily insist that this project have three full years of operating costs in reserve before buying a property, some revenue will come in before the property is self-sufficient. But if it were me, I would insist on at least two years of costs in the bank regardless of revenue projections. Lightcone bought Lighthaven during a time of abundant funding. When the funding situation and Lightcone’s relationship with grantmaking organizations changed, it had to run two large community fundraisers. These fundraisers were successful and gave the Lightcone team legitimacy, a broad community endorsement of their plans and strategy. But the situation was still regrettable, there was a very real risk of losing Lighthaven, along with all the resources spent to renovate it to our specific uses.

If this project is worth funding, if its director has the faith of grant-makers or other philanthropists, they should get the resources to see their vision through. The project will almost certainly look like it is failing at the 15-month mark, with renovation timelines slipping and paid bookings still scarce. Even when renovations are done, it will take some time to build a reputation as the obvious choice for certain events. It would be a disaster, a huge un-forced error, if the director has to fundraise at those points just to complete the project. There should be checks on the director and the project, but that oversight should come from the board, not intermediate project fundraising goals. 

All this taken together, I estimate Minimum Viable Funding would require about $18 million in total. Roughly two thirds could be in the form of a mortgage; the rest would be a grant:

  • Loans: $12 million financed by a mortgage
    • $10 million purchase price
    • $2 million in initial renovations
  • Grants: $6 million
    • $1 million in startup costs - furnishings, equipment, electronics, sound, permits, legal fees.
    • $5 million in operating and capital cost runway (2.5 years).
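The breakdown above can be tallied as a quick consistency check. This sketch uses only the line items listed here; the dictionary layout is illustrative, and the $2 million annual cost is the rough break-even figure from the Funding section rather than a detailed budget.

```python
# Line items from the list above, in whole US dollars.
funding = {
    "loans": {
        "purchase_price": 10_000_000,
        "initial_renovations": 2_000_000,
    },
    "grants": {
        "startup_costs": 1_000_000,
        "runway": 5_000_000,
    },
}

# Rough annual cost of operations plus capital, from the Funding section.
ANNUAL_COST = 2_000_000

subtotals = {source: sum(items.values()) for source, items in funding.items()}
total = sum(subtotals.values())
loan_share = subtotals["loans"] / total
runway_years = funding["grants"]["runway"] / ANNUAL_COST

for source, amount in subtotals.items():
    print(f"{source:>6}: ${amount:,}")
print(f" total: ${total:,}")
print(f"loan share: {loan_share:.0%}, runway: {runway_years} years")
```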


Leadership

In the course of my interviews, I promoted one item to the “required” list from the “nice-to-have” bracket. Again and again interviewees brought up the quality of the leadership, that a single person should be responsible for implementing and living with their decisions. Every interviewee stressed the need for a passionate founder engaged over the long term. Several interviewees also stressed unity of command, that decisions need to have some single person who owns the choice and cannot be overruled, short of the decisionmaker being fired. I found this perspective convincing. 

There will be part-time roles at a place like this, but the chief executive cannot be one. That person will need a rare combination of skills:

  • Decisiveness - comfortable making decisions on the fly under uncertainty.
  • Alignment - Understanding of Rationalist and EA ideas and culture, enough of a perspective and a strong enough worldview that they’re not convinced by the most recent strong personality they’ve met.
  • Taste - An eye for design in the physical world. Direct design work can be delegated, but they will need to have the ability to tell why one option works and another does not. 
  • Political Skill - Venues like this need to be used a lot for the economics to work. The leader needs to present as acceptable to many factions of politics, to make friends more easily than enemies. Needs to see and deescalate social conflicts between staff, vendors, attendees, organizers, etc. Needs to have spent significant time in DC to know, intuitively, where the political traps and mines lie. The taste to know which battles to avoid and which are worth fighting regardless.
  • Energy - Days will be long, particularly when a conference is underway. 
  • Extroversion - Must genuinely like people and highly value the interactions between people who they may not agree with. They must value building a neutral institution more than direct work in their preferred cause area or speciality. 
  • Integrity - Will be trusted with tens of millions of dollars in property and accounts. It’s not enough to have integrity, they must also demonstrate it to grantmakers, philanthropists, their staff, and to a lesser extent even to conference organizers.

The leader of this project is the hardest constraint to satisfy. There’s a very short list of people well qualified in each of these categories; most have other jobs that they seem to like. I think there’s a longer list of up to a few dozen people who excel in most of these criteria, who with enough self-knowledge and humility could manage to cover for their lack in an element or two. 

This job would be incredibly rewarding, personally and professionally. This must not be a volunteer role. I believe the community would be willing to pay well, on the order of $200k or more depending on experience, to have this job done well. 

If the description above sounds like you, get in touch.


Cultural Fit

This property should appeal to more than just the rationalist community, Lighthaven already does. AI companies already rent bars, restaurants, and conference space to do technical demos for government and NGO staffers. There are also centrist-leaning political movements with some popular support and donor interest, who could be interested.

As Lighthaven is the cultural headquarters of the LessWrong community, a Lighthaven in DC could position itself as the cultural headquarters of the Progress Studies branch. This would give it a compelling raison d'etre: Lighthaven West generates the ideas, Lighthaven East gets them into the posting-to-policy pipeline. A focus on Progress Studies may also attract more interest from donors.

In policy spaces, opposing political factions interact socially more than many people outside DC would expect. Renting space to politically-relevant actors is tricky, we wouldn’t be as neutral as, say, a bar or hotel. But I think there is room in the center to rent to both the center-right and center-left, Anduril and Anthropic, without making too many enemies. 

Crafting this coalition, determining who this space is ultimately for, is not a one-time decision. It is decided day-by-day as the director makes a series of small decisions and actions that accrue into a reputation. Which booking gets the popular weekend? Which organizations get discounts? Who are the first people invited to co-work, the second wave that builds off of that founder effect? When there are inevitable fights about associating with people one side or the other considers bad, where do they draw the line? Who’s worth defending? Who should be excluded, despite their popularity? These choices cannot be realistically delegated, because one of the skills is noticing that an issue is culturally-relevant at all.

I described Lighthaven as a monastery because its cultural output is ultimately the point. This is also true of any potential DC version. Its director will need to understand and be comfortable wielding cultural power, as much or more than the temporal power over the space. And this path will have to be plotted largely without a map.


Name and Brand Positioning

“Lighthaven East” is a working title for the project, but should not be the name of the space itself. We will need to avoid anchoring too closely on Lighthaven’s culture and norms, for two key reasons. First, a lot of what makes Lighthaven work well is adaptation to its environment. It uses the climate well, it plays off of norms the Berkeley community has spent over a decade honing. Second, the game is rigged, it’s exceedingly difficult to beat Lighthaven at being a Lighthaven. The new project will need to build its own coalition, niche, and animating spirit, so that it can be well adapted to its new environment.

I feel strongly that it should not be named for the lead donor. I think it’s reasonable for the donors to have input into the name, location, and aesthetics. But naming the venue after a donor is too far. It’s a bit tacky, implying a vanity project, but more importantly it weakens the authority of the Director. Besides, there’s a certain cachet that comes from discovering an open secret, let people have the fun asking “Huh I wonder who’s behind [final name]?,” and then finding out.

One name that we’ve begun workshopping is “Posterity Center.” It hits several notes: focuses on a long-term perspective, references the Preamble to the US Constitution, and alludes to Bayes. It’s not quite perfect. It’s a little too long, it feels like four or five syllables should be the limit, but alternatives don’t quite work; “Posterity” doesn’t feel like it works on its own, and “Posterity House” feels slightly too informal. It also seems a bit too… something… maybe self-serious? I recall, however, that I first thought the name “Lighthaven” was too pretentious even for me, and yet it has certainly grown on me.

We welcome more name ideas and feedback, and I’ll create a parent comment to collect name discussion in one place.[3]


Ability to Scale

Twitter recently discussed a coming wave of non-profit demand and funding. Could we do more with more?

In short: yes, we could! 

The additional funding would have to come at the start for maximum leverage. When shopping for an initial site, more money gives many more options, later expansion is constrained to the area surrounding the chosen site. A larger space would probably need more time to reach self-sufficiency, since I expect use of a space would scale mostly independently of its size, at least at first. Stable, committed funding would be key for larger plans; larger venues would come with at least proportional increases to the operating expense, renovation, furnishing, and runway budgets. There’s an argument that larger properties might need more-than-proportional increases in non-property costs, meaning that overall project cost scales faster than total square footage, given the harder path to self-sufficiency and higher risks to the core project. 

All that said, there are major advantages to larger spaces. Many benefits are obvious: more capacity for events and organizations, political proof-of-work (i.e., prestige), more option value for smaller events to spread out, and avoiding the deeply unpleasant coordination failure of everyone trying to shout over each other in a small space. Others are more subtle or speculative. Lighthaven aspires to be something like the Bell Labs of old, but notably lacks project space. More space means more room, literally, to experiment with function, such as allowing overnight stays, more permanent setups for meetups and organizations, different kinds or styles of renovation, even lab or maker spaces. I would not argue that a larger project is higher EV per dollar than a 12,000 sq ft one, but if successful it would clearly be more impactful in total. In a hits-based giving model, more funding buys more potential upside, since any successful experiments could be scaled.

Another benefit: Capitol Hill has some unique features, and all else equal it’s the neighborhood we should prefer. There are lots of sites in the 5,000-8,000 sq ft range, too small for our use, and some appealing options closer to 20,000 sq ft, but Capitol Hill is comparatively weak in the 12,000 sq ft range. More ambitious funding would make proximity to Congress more feasible.


Risks

This would be a high-profile endeavor. Because policymakers in DC would be a key target audience, failure would be much more visible and salient to them than any other comparatively-sized project. This sort of reputational damage is hard to quantify, hard to even describe, but very real. I do not think even a high-profile failure would be as damaging as the FTX collapse was to the community, but worst-case scenarios could approach perhaps a tenth of that reputational damage if handled poorly.

These risks can be mitigated, though not eliminated, with a strategy of “Don’t do stupid shit.” This is yet another reason why the director of this project needs an easy familiarity with DC culture and norms, what constitutes “stupid shit” is not always obvious to those who haven’t spent time in the Beltway. We should aim to scale the project deliberately, and somewhat quietly at first. We should inculcate an appropriate institutional culture before inviting high-profile political figures. 

The organization, its leadership, and its funders would become the target of opposition research, and would need to avoid impropriety. Yet mitigating this risk can, and often does, go too far. I personally worry that the EA community has over-learned the lessons of FTX, and tries too hard to appear normal. Those associated with the project should be and appear to be trustworthy, should be and appear to be reliable, but should not try to pretend their views are more mainstream than they in fact are. The point would be to promote outside-the-Overton-Window ideas in a high-profile, high-trust way, giving us more credibility when our ideas later prove right. The project shouldn’t sacrifice this key feature in an attempt to seem more politically palatable. We should not pretend this project is a normal thing to do. It should be weird, and edgy, and cool, just in a well-calibrated way.

Basically, I’m saying we should have robes but only break them out for parties.

More mundane risks to the feasibility of the project include:

  • A lack of demand in corporate or institutional bookings.
    • In general, the project could prosper without certain categories of corporate bookings. If the hyperscaling AI companies instruct their staff to stay away, the project could still be successful. However, I do not think the project would fare well without any corporate bookings at all. 
  • Changes in political atmosphere that lead the community to abandon its push into DC.
  • Ideological capture by commercial or political interests.
    • Ideological capture is both an indirect risk to reputation and a direct risk to the project’s goals. This project should fill a neglected niche, rather than becoming a tech-flavored political club.
    • Too close a relationship with the AI labs could cause other organizations to pull punches or otherwise water down their messages.
  • Regulatory risk from the DC government.
    • The project should specifically budget for ways to be a good neighbor.
  • Site risk: mold, environmental remediation, etc. Even if hazard losses are insured, they would dramatically shift the project’s timeline.
  • Personnel risk: embezzlement, theft, scandal, etc.


First Steps

There are three key blockers for this plan: a founder, money, and a site. In an ideal world, a founder would step up, approach philanthropists and arrange financing, then simply buy the best option available for sale. Straightforward, sequential, looks great on a Gantt chart.

It may not be that simple. Three-way matching problems are notoriously difficult, since each of these elements feeds back into the others. Philanthropists may not be willing to commit until an ideal site comes on the market. Some potential founders may be more willing to work with certain philanthropists or organizations. Other potential founders may be more dependent on the site, perhaps more willing to run a smaller venue, or one that is Congressionally focused.

In practice, whichever constraint is filled first will exert outsized control. If someone has a site to offer, the rest must either cohere quickly or not at all… whether the site is the best available for our purposes will be de-emphasized. If a funder gets excited before a founder, we risk a muddled vision, inconsistent execution, and different departments optimizing for different goals. These problems can be avoided if the vision comes first.


If this calls to you…

the lightcone needs you to lock in son
https://x.com/uneventual/status/1991692767001735510


  1. ^

    Start with Every Bay Area Walled Compound.

    Then, Scott Sumner:

    For Jorge Luis Borges, paradise was a library. At nearly 70 years of age, I’ve found my paradise at Lighthaven, which recently hosted meet-ups for Less Online and Manifest over back-to-back weekends in Berkeley, California. I know of nowhere else on Earth where I can find so many interesting conversations in such a compact area.

    Scott Aaronson, writing "Guess I'm a Rationalist Now" after his first visit:

    The conference was at Lighthaven, a bewildering maze of passageways, meeting-rooms, sleeping quarters, gardens, and vines off Telegraph Avenue in Berkeley, which has recently emerged as the nerd Shangri-La, or Galt’s Gulch, or Shire, or whatever. [...] What I’ll remember most from LessOnline is not the sessions, mine or others’, but the unending conversation among hundreds of people all over the grounds, [...] It felt like a single conversational archipelago, the largest in which I’ve ever taken part, and the conference’s real point.

    TracingWoodgrains:

    Many more experiences over the weekend feel almost too personal, too meaningful, to shout to the open internet: dinners and meetings and conversations with people building local cultures so achingly beautiful they feel almost like dreams, conversations stretching late into the night, serendipitous meetings with longtime friends whose faces I can now put to names.

    Theo Jaffee:

    That magical feeling of serendipity, where you can flow through a space, passing from conversation to conversation, contribute to each one in turn, and have others do the same for you.

    For further design details, see this interview:


  2. ^

    Note that this and following are rough financial estimates suitable for judging feasibility. We can elide a lot of detail, for now, since Lighthaven serves as a benchmark for comparison. A full project or grant proposal would have much more detailed budgetary estimates.


  3. ^

    Comment here:

    https://www.lesswrong.com/posts/95NgkvZKJx8tJbtn5/lighthaven-east-a-feasibility-study?commentId=KM4rMPi62hkrBKh8Z




Barriers to a Prosperous Future

LessWrong.com News - June 1, 2026 - 00:40

The current race towards producing general artificial intelligence systems brings with it severe risks, yet no AI company developing frontier models is addressing these risks at a level proportional to the pace of development. The rapid integration of this poorly-understood technology into nearly all aspects of society is precarious at best, and catastrophic at worst. If progress trends continue, we will need a monumental level of investment in enhancing our robustness to these risks in the coming years. What follows is a summary of my understanding of these risks, a description of those most concerning to me, and finally what my personal plans are to mitigate them.

Types of Risk

It can be useful to sort risks from advanced AI into three broad categories: misuse, misalignment, and systemic. Misuse refers to malicious actors (individuals, groups, or states) being enabled by AI systems to achieve nefarious objectives by, for example, generating personalized misinformation at scale, hacking their adversaries' critical infrastructure, or building powerful weapons. Misalignment refers to AI models failing to truly acquire human values, leading to unpredictable and undesirable behaviours in out-of-distribution environments. Finally, systemic risks are those that arise from our complex and vital societal systems becoming dependent on a technology that we neither fully understand nor control, leaving us vulnerable to their unpredictable interactions.

Even AI researchers themselves understand shockingly little of how frontier AI systems reason and make decisions, and progress in this area is worryingly slow compared to the pace of AI capabilities development, which, as measured by METR's task time horizon, is currently estimated to double every 7 months or less [1].
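To make the 7-month figure concrete, here is a minimal sketch of what a fixed doubling time implies if it were simply extrapolated. The constant doubling time is an assumption made for illustration only; `growth_factor` is a hypothetical helper, not part of any METR tooling, and real trends need not stay exponential.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Illustrative extrapolation only: assumes the doubling time never changes.
    for years in (1, 2, 5):
        print(f"after {years} year(s): x{growth_factor(12 * years):.1f}")
```

Under this assumption the tracked measure grows a little more than threefold per year, which is why a fixed research pace on interpretability falls behind so quickly.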

As for "aligning" models with human values, the best techniques these multi-billion dollar companies have developed are fundamentally surface-level and have, without exception, failed in various ways, including being bypassed by clever prompting, also known as "jailbreaking". Some worrying behaviours were discussed in Anthropic's own recent report on Agentic Misalignment. If we are to entrust critical parts of our society such as education, healthcare, cyber security, and political advising to these systems, the emerging sciences of alignment and control will play a crucial role in doing so safely.

Gradual Disempowerment

While the categories of misuse and misalignment are slowly gaining attention in public discourse, government, and academic research, in my view the most complex category, systemic risks, is currently largely neglected. Risks in this category can be subtle, developing quietly in the background, their consequences first becoming apparent when large parts of society are already dependent on these systems, by which point a large amount of the damage may already be done. Many of our critical societal systems are already complex (e.g., economies, governments, healthcare) and have been tuned over many decades to function robustly. The rapid interweaving of AI into these systems may make them harder to control and predict and lead to unintended consequences such as a reduction in human empowerment. A line of reasoning within this category that I find particularly concerning is known as gradual disempowerment, which refers to the incremental loss of human influence as a result of having more competitive machine alternatives to humans in almost all societal functions.

In the original work that introduced the term [2], the authors argue that as AI systems begin to represent ever larger shares of the labour market we might expect the economic role humans play to be reduced, and in turn so too the economic power they hold. Unlike with previous automation where humans could transition from more narrow to more complex work, AI threatens to claim all cognitive tasks, leaving humans with no higher, more cognitively-demanding roles to move to. Without labour, money will by default cease to flow to most individuals, potentially leading to drastically increased wealth inequality. Further, they argue, the economy has always been roughly tied to human preferences, where businesses only survive when they have a paying customer base. In an AI-driven economy, this tie may loosen significantly, leading to markets that cater to those systems, rather than to human values and preferences.

Additionally, they argue that as advanced AI systems become integrated into the creation and consumption of cultural artifacts, we could see our cultural norms significantly disrupted, in a similar way to content creators today catering to "the algorithm", but amplified greatly. While previous cultural practices have always faced evolutionary pressure to benefit humans in some way, in a world where humans are no longer the only producers and consumers of culture, this "antibody" effect may be lessened, leading to potentially maladaptive cultural practices. Further, the alluring promise of always-available, hyper-personalized AI therapy, coaching, and even companionship could begin to outcompete humans, even if objectively of lower quality and lacking emotional depth, simply due to the ease of access and adaptability of such systems. For a more in-depth discussion of this topic, see Kulveit et al. 2025 [2].

Rate of Development

How much time should we expect to have to solve these problems? One important factor to consider is that AI companies are likely to invest heavily in improving their models' abilities to carry out autonomous AI research, given the immense potential economic value of doing so [3]. Even if progress is initially impeded by bottlenecks like model unreliability, human approval, and limited compute and energy, the incentives to unblock these will be so great that before long we should expect solutions to be found. Frontier AI companies are well aware of the bottlenecks to their growth trajectories, and are working hard to pave the path towards training models that are orders of magnitude larger and more capable than today's [4].

This dynamic, known as "recursive self-improvement"—AI systems tasked with improving themselves—is already happening to some degree today, and it is likely to lead to an ever-accelerating rate of development of AI capabilities as more capable models provide ever-stronger "uplift" to human researchers. If models surpass the threshold of being capable of operating largely autonomously—that is, producing hypotheses, developing efficient tests of those hypotheses, and analyzing the results to make iterative improvements to themselves—we might experience an "intelligence explosion", wherein countless digital minds running 24/7 at superhuman speeds—a "country of geniuses in a datacenter" [5]—drive rapid progress in AI research, on a timescale we couldn't hope to keep up with. For further discussion on this topic, see Forethought's Three Types of Intelligence Explosion.

For the above reasons, there is a real chance that AI systems with human-level capabilities across all fields, often referred to as artificial general intelligence, or AGI, could be developed within the coming 5–10 years, with many estimates from AI researchers and forecasting experts converging around the year 2033 [6][7]. While previous technological revolutions developed at a pace that allowed humanity to gradually adapt laws, cultural norms, and education over the span of decades, the rate of change we can expect in an AI-powered future will be entirely unprecedented and force a significant reorganization of many parts of society, possibly in an astoundingly short timeframe. Therefore, it is imperative that we greatly increase investment in fortifying all parts of society.

Mitigations

In order to strengthen our defenses against these risks, we will need to devote historic amounts of capital and effort in the coming years. We will need thorough and continuous measurement of our reliance on AI systems to have metrics to guide discourse and to use as a basis for enacting critical policy. Additionally, we will need to conduct research on how we can use AI systems in a sustainable way that benefits us in the long run.

To this end, we greatly need more research organizations like Epoch AI, measuring and forecasting AI progress, and METR, conducting in-depth research such as that described in their Frontier Risk Report. Crucially, we additionally need much more research examining the usage and impact of AI on all parts of society, such as the Anthropic Economic Index, communicated widely.

Furthermore, we need to build tools that will make our society more resilient to shocks resulting from the integration of AI. Beyond these tools and improved alignment methods that enable training models that robustly behave in pro-human ways, such as refusing to display shallow imitations of affection, we will also need significant international regulation. AI legislation has to a large degree thus far focused on present-day harms such as deepfakes and disinformation [8]. What's needed on top of this is legislation that specifically mitigates disempowerment.

Can't we just pause development of frontier AI?

While some argue for a global pause on, or a mandated deceleration of, frontier AI system development (such as Pause AI), I personally believe such a pause is likely unachievable, and not even necessarily a net positive. A pause could backfire by pushing AI development underground, to actors who are less concerned with safety and who would withhold progress from the public.

The challenges ahead are great, but so too are the potential upsides. We still have time to act to prevent the worst outcomes, but the window may be closing and much work is needed.

What follows is my personal plan, given the context above.

My Plans

My current plan is to contribute to mitigating the above risks in three primary ways: developing my ability to conduct technical research, fostering a local AI safety community, and exploring potential mitigations to gradual disempowerment risks.

Firstly, I will greatly develop my understanding of technical AI safety in the coming months in order to get a deeper awareness of the best tools we have developed for understanding and controlling frontier models. This is where I can best leverage my career experience, though my long-term focus may shift after this period.

Secondly, I plan to continue facilitating and growing a thriving local community of concerned individuals in order to spread knowledge, enable networking, and gather a wide array of perspectives. [9]

Finally, I plan to explore the ways I can contribute to building mitigations against disempowerment. This project may initially be developed either during a fellowship or independently. Some preliminary ideas include: building trustworthy open-source coordination tools for both humans and AIs or developing further the ideas proposed in Gradual Disempowerment, focusing on other societal systems. If promising, I can imagine founding a research organization that would work to further develop these mitigations.

Last updated: 2026-05-31

  1. ^

    METR: Time Horizon 1.1

  2. ^

    Kulveit et al.: Gradual Disempowerment, January 2025

  3. ^

    Situational Awareness, Leopold Aschenbrenner: From AGI to Superintelligence: the Intelligence Explosion

  4. ^

    OpenAI: Stargate

  5. ^

    Dario Amodei: Machines of Loving Grace

  6. ^

    Metaculus: When will the first general AI system be devised, tested, and publicly announced?

  7. ^

    80,000 Hours: When will AGI arrive?

  8. ^

    EU AI Act

  9. ^

    Stockholm AI Safety




Notes on axes of variation in third-party risk assessment

LessWrong.com News - May 31, 2026 - 23:48

There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful in thinking about the space over the last few weeks.

(Thanks Ajeya Cotra and Paul Christiano for discussions that inspired most of this.)

Throughout this, I refer to the actors as:

  • Developers.
  • Stakeholders. These are the people who want to be informed about risks. Possible stakeholders include: governments, the public, the developer's board, the developer's employees.
    • The choice of stakeholder matters because one role of an auditor is to review confidential info that they then do not directly disclose to stakeholders; they only report their conclusions. This role is more important the more concerned the developer is about disclosing confidential information to the stakeholder.
  • Third parties. I don't know a better term for "independent actors who contribute in various ways to a stakeholder's understanding of risks through producing and evaluating evidence and/or arguments". Like, it's weird to call the physical security pentesting firm a "risk assessor". And AI Lab Watch isn't really an "auditor". And "evaluator" makes it sound like they run model evals.

The next step in the analysis will be to think about different objectives of third party risk assessment and think about how those interact with these axes of variation.

Axes

Fact generation vs evidence analysis

This is maybe the one that comes up most. I'll define:

  • A fact-generation assessment is trying to answer some narrow question to produce a fact that will later be used to evaluate risk.
    • Examples:
      • Dangerous capability evaluations, e.g. METR and UK AISI's capability evals.
      • Evaluating the robustness of a safeguard, e.g. classifier robustness redteaming or the David Rein red-teaming project.
      • Pentesting to measure security.
    • For many of those, a core reason why it makes sense to have a third party do them is that the process is sort of fundamentally adversarial—the final evidence will be of the form "I tried to demonstrate danger, but failed", so you're really depending on the person trying hard, and it's structurally easier to be confident that someone's not sandbagging if they're an unconflicted third party. Some cases are like monitor redteaming or pentesting where it's obviously adversarial; others, like dangerous capabilities evals, are cases where you're relying on the evaluator to do not-particularly-adversarial optimization (e.g. you need them to pick good hyperparameters for their fine-tune to elicit CBRN capabilities).
      • If our claim is "you need a third party to do the optimization every time that you argue for safety by saying an optimization attempt failed", this maybe has broader implications; I'm not sure people are applying it consistently.
    • For the others that reason isn't present. For example, for a cyber capability eval where the AI company basically does no extra elicitation, it seems like it would probably be fine if the AI company just ran the evals themselves—the produced fact isn't rendered particularly suspect by the fact that it was done in house.
    • There are some other reasons that it might make sense to have a third party do the eval:
      • They have expertise the AI company doesn't have in house. This totally makes sense in principle; it could totally make sense to centralize the expertise somewhere outside the company so that the AI companies can share it.
      • The eval requires sensitive info that the evaluator doesn't want to give the AI co. E.g. sensitive CBRN stuff? This makes some sense.
        • In general, a developer with access to the dataset can increase or decrease eval scores, because they can train/iterate against it. But in many cases the developer can also game eval scores even without access to the dataset; e.g. they can probably mess with alignment evidence by training models to act aligned in weird situations. If an eval score won't be valid evidence without separately understanding whether the AI developer did anything that would compromise it, it's not clearly much worse for the AI company to hold the eval and run it themselves.
  • An evidence-analysis assessment tries to combine many forms of evidence into an overall judgement about a high-level question. The central example of such a question is "how much risk to the world does this deployment pose".
    • Examples:
      • METR's review of Anthropic's sabotage risk report. (The sabotage risk report analyzes a narrower slice of risk, but it is an analysis that draws together many different facts to answer a natural question that's pretty close in the causal graph to "how much total risk is there". Unlike e.g. "how robust was this monitor", which requires a bunch of additional context to be interesting.)
      • As a narrower example, our review of the OpenAI CoT training was us synthesizing a body of evidence into an overall assessment.
      • This is what auditing is in the context of corporate accounting—they look at your books and ask you questions to assess the overall financial state of the company.
    • Within evidence analysis, there's a distinction between reviewing a developer's argument vs. producing your own.

I think that evidence analysis vs fact generation are pretty different and it's not obvious that any projects or organizations should do both. The skillsets seem plausibly very different. Like, I suspect that during the main risk period, only a pretty small subset of the facts relevant to the risk analysis should be generated by third parties (as opposed to being reported by the company). So someone doing evidence analysis will naturally be operating several steps in the "evidence tree" away from someone generating a fact. And so I'm not sure there's that much synergy for these happening as part of the same project. E.g. I don't think it would be particularly useful for the security pentesters to work at the same org that does the overall assessment.

I think it's pretty weird that these are conflated so often (e.g. in lots of the widely-cited papers about AI auditing or evaluations, like Frontier AI Auditing and Model Evaluation for Extreme Risks); they seem really different to me.

Laundering private evidence into sharable conclusions?

One function of risk analyses is to inform a stakeholder about the risks posed by a developer. One strategy is to have an auditor who accesses private information from the developer and uses it to answer a question, then gives the answer to the stakeholder without revealing the private information behind that conclusion. The other option is to do the risk analysis based only on information that the stakeholder already knows (or is cleared to know).

In this post, I call this “evidence-laundering”. I don’t mean this with a negative connotation and will probably choose a different term in future because the negative connotation is so strong. I just wanted a distinct term that emphasizes that information is being processed from private evidence to public conclusions to reduce disclosure of commercially sensitive information.

I think I want to define this as: did the conclusion of a report rely on facts that were not stated in the report? So it's not evidence-laundering if you report a straightforward fact that your readers can't personally check and so have to trust that you're not lying about. But it is evidence-laundering if you don't report the facts and instead report "based on private facts, things look ok".

Examples of the evidence-laundering strategy:

  • Pen-testing. It's totally normal to have a pen-tester say "we looked for vulns; I won't tell you what they are but here's how bad they were". And for the pen-tester to not tell the stakeholder various sensitive details.
  • David Rein's pen-test of the Anthropic monitors.
  • METR's review of the Anthropic sabotage risk report.
  • Lots of private evaluations in other parts of society. In courts, a psychiatric evaluation involves the psychiatrist talking privately to someone and then stating the conclusion publicly without fully justifying it.

Examples of the just-public-evidence strategy:

  • Our review of the OpenAI CoT training thing. The workflow here was that they shared the draft with us a week or so before they published it, and we went through a few rounds of me complaining that the draft didn't have some important info that I needed to be convinced of their conclusion, and they iteratively added more info to the draft until I was pretty happy. So even though there was some temporary secrecy, we were fundamentally not in a laundering role: the thing that we released was based entirely on information that any reader of our blog post could read themselves. We were under our standard NDA, so they could have told us stuff during this process that we wouldn't have been allowed to repeat.
  • AI Lab Watch. I think Zach sometimes emailed AI companies asking them questions that he said would affect their rating, and maybe they sometimes answered? As long as this works via him entering their answers into the public record, he's not playing the evidence-laundering role.
  • The METR Frontier Risk Report is almost entirely this. Their report is entirely based on info that AI companies agreed to share publicly, except that some of these facts are published without attribution (that is, METR doesn't say which company said something), so METR is playing the evidence-laundering role of verifying that an AI company really did say something without revealing who said it.
    • This is an interesting case because METR collected a bunch of non-public info during this project, but the only way this entered their report was by companies agreeing to make subsets of the info public.
  • Evidence-laundering is a central part of what "Frontier AI Auditing" (2026) means by "auditing", which writes: "Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential".

The core strategic question here is definitely "when should you use evidence-laundering vs transparency". 

  • I think that transparency is probably better all else equal? Evidence laundering should probably be reserved for cases where it's extremely defensible for the developer to not want to publicly release certain info. I am extremely sympathetic to this for pentesting or other things where there's an imminent cost to releasing the full evidence. I am less sympathetic to e.g. developers wanting to keep the capabilities of unreleased models secret.
  • But I think it might be pretty good if we can rely on transparency as much as possible.
  • In particular, I think that transparency maybe makes it easier for risk evaluations to differentially pressure AI companies to do better. I am excited for projects like AI Lab Watch that do risk evaluations that compare the AI companies. This might piss off the AI companies, so it might be hard for such projects to have an evidence-laundering relationship with them. Their ratings will be more legitimate if they have the info they need to make their assessments. AI Lab Watch has the stance "if you don't give public statements explaining what you're doing on issue X, we'll assume you maximally suck"; perhaps stakeholders would think that it's reasonable to expect developers to produce public evidence of safety and therefore the developers will publish their evidence publicly. This allows us to have a robust mechanism for between-company pressure on safety for longer than if we tried to do all this without transparency, using evidence laundering.

A few notes:

  • Reporting a score against a benchmark the evaluator built but hasn't released — "DeepSeek scored 40% on our private benchmark" — isn't laundering of the developer's information (you know the claim and whom to blame), but it does ask the reader to trust, sight unseen, that the evaluator built a reasonable benchmark. The developer itself may not even have access to check it. Releasing an i.i.d. subset of the benchmark resolves most of this.
  • A report has to disclose how it looked. Every conclusion implicitly relies on the unstated fact that nothing else relevant turned up. So a clean report should say how it searched for other kinds of facts — including disconfirming ones — and report that there weren't any (or what it found). A report that presents its supporting facts and stops is laundering the adequacy of its own search.
  • Whose secret? Company vs auditor laundering. It helps to split laundering by who owns the withheld fact. Company laundering hides facts about the developer's own system — details of security systems that a pentester is attacking, or training methods that an auditor wants to know about. Auditor laundering leans on something the assessor holds and won't share: a particularly strong example of this is hazardous knowledge like bioweapon-uplift details. Note that auditors very often have some information that they don't want to publish. Note that needing to trust the auditor did a competent job — e.g. elicited hard enough — is not laundering at all; that's ordinary trust in a person, not a hidden fact. The table below splits these into two columns.
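
The definition above ("did the conclusion of a report rely on facts that were not stated in the report?") and the company/auditor split can be sketched as a small data structure. This is only an illustration of the definitions: the `Fact` and `Report` classes, the `owner` labels, and the example facts are all hypothetical and not taken from any real assessment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    name: str
    owner: str  # hypothetical labels: "developer" or "auditor"


@dataclass
class Report:
    relied_on: set[Fact]                           # facts the conclusion depends on
    stated: set[Fact] = field(default_factory=set)  # facts printed in the report


def laundering_kinds(report: Report) -> set[str]:
    """Return which kinds of laundering a report involves, if any."""
    hidden = report.relied_on - report.stated  # relied on but not stated
    kinds = set()
    for fact in hidden:
        if fact.owner == "developer":
            kinds.add("company")   # withheld fact is about the developer's system
        elif fact.owner == "auditor":
            kinds.add("auditor")   # withheld fact is something the assessor holds
    return kinds


# Hypothetical example facts.
vuln = Fact("details of vulnerabilities found", "developer")
score = Fact("reported eval result", "developer")
hazard = Fact("hazardous knowledge behind the eval questions", "auditor")

pentest = Report(relied_on={vuln})
transparent_review = Report(relied_on={score}, stated={score})
hazard_eval = Report(relied_on={score, hazard}, stated={score})
```

Here `laundering_kinds(pentest)` is `{"company"}`, `laundering_kinds(transparent_review)` is empty, and `laundering_kinds(hazard_eval)` is `{"auditor"}`. Trust that the auditor did a competent job does not appear anywhere in this model, which matches the note that it is not laundering at all.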

Incentive compatibility vs calibration

Should you give bad scores to developers if they don't give you sufficient access, or should you just use your best guess? If you do the latter, then anyone performing surprisingly poorly is better off not disclosing this. But if you do the former, then your risk assessments can’t be taken at face value by third parties, and it’s easier for AI developers to discredit you by saying “that org has no idea what’s going on, obviously we’re way safer than they think”. This probably works especially well when they’re trying to discredit you to their employees or other groups who have more access to private info. As an example of that dynamic, AI Lab Watch initially rated Google poorly on security due to them not disclosing much about their security, and this led some GDM people to say they thought it was less credible or usable.
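
The tradeoff in the paragraph above can be illustrated with a toy model. Everything here is an assumption made for the sketch: scores run from 0 (worst) to 1 (safest), the assessor's prior mean is 0.5, and a developer discloses only when disclosure does not lower its published rating. The function names and policy labels are hypothetical.

```python
def rating(true_score: float, disclosed: bool, policy: str,
           prior_mean: float = 0.5) -> float:
    """Score the assessor publishes (higher = safer) under a given policy."""
    if disclosed:
        return true_score   # with access, the assessor reports what it sees
    if policy == "best_guess":
        return prior_mean   # calibrated guess that ignores the refusal itself
    if policy == "assume_worst":
        return 0.0          # silence is scored as the worst case
    raise ValueError(f"unknown policy: {policy}")


def prefers_to_disclose(true_score: float, policy: str,
                        prior_mean: float = 0.5) -> bool:
    """A developer discloses only if that does not lower its published rating."""
    return (rating(true_score, True, policy, prior_mean)
            >= rating(true_score, False, policy, prior_mean))


if __name__ == "__main__":
    scores = [0.2, 0.5, 0.8]  # hypothetical true scores
    for policy in ("best_guess", "assume_worst"):
        print(policy, [prefers_to_disclose(s, policy) for s in scores])
```

Under `best_guess`, the developer with a true score of 0.2 prefers to stay silent, which is the disclosure problem. Under `assume_worst`, everyone prefers to disclose, but a developer with a true score of 0.8 that stays silent for some unrelated reason is rated 0.0, which is the calibration problem that lets developers discredit the assessor.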

As I said in the previous section, it's maybe rough to get stakeholders on board with risk assessments that assume the worst, but maybe it's doable and I think we should plausibly try to achieve this.

Current risk vs preparedness

Are you answering a question like "are the current risks adequately handled" or are you answering "is this developer on track to handle risks later"?

The downside of analyzing current risks is that they're inconsequential and will probably be inconsequential up until shortly before they're severe.

The downside of analyzing preparedness is that you have to be much more opinionated about futurism and threat models; your reports will rely on assumptions that are much more contentious.

Cross-developer comparability

Are you trying to make it easy or hard to compare between companies?

  • AI Lab Watch and friends are trying to make it easy. The advantage of this is that it might pressure companies to behave better or shift resources towards better companies (e.g. because people prefer working at companies they believe to be more responsible).
  • METR's FRR went out of its way to make it hard. The advantage of making it hard is that companies are more willing to tell you stuff. (One way of thinking about this is that if you remove comparability, scary stuff the AI companies say has costs that are split among them and their competitors, which is way better for them on net than the costs being concentrated on them. In practice, AI companies also get some cred from safety people for producing evidence of scary stuff, so this effect hasn't mattered that much.)

I think that cross-developer comparability makes much more sense if you're doing evidence analysis, because evidence-analysis assessments are more comparable between companies.

So, kind of obviously, I think that the choice between these is basically a tradeoff between different goals: if you mostly want to pressure AI companies to behave better, then comparability is good; if you mostly want to inform stakeholders about the overall level of risk, then comparability is probably bad if you are also trying to do evidence laundering, because it makes it more costly for developers to share info with you.

Examples, classified against the axes above

| Project | Fact gen vs evidence analysis | Company laundering | Auditor laundering | Current vs preparedness | Cross-developer comparability | Reviewer vs producer |
|---|---|---|---|---|---|---|
| Dangerous-capability evals (METR, UK AISI) | Fact gen | No | Yes | Now | Easy | Producer |
| Classifier-robustness red-teaming for misuse prevention | Fact gen | No | Yes | Now | Easy | Producer |
| David Rein's red-team of Anthropic monitors | Fact gen | Yes | No | Now | Hard | Producer |
| Security pen-testing | Fact gen | Yes | No | Now | Hard | Producer |
| CAISI DeepSeek evaluation | Fact gen | No | No | Now | Easy | Producer |
| Apollo in-context scheming evals | Fact gen | No | No | Now | Hard | Producer |
| METR review of Anthropic's sabotage risk report | Evidence analysis | Yes | No | Now | Hard | Reviewer |
| Redwood review of OpenAI CoT training, External review of DeepMind scheming-inability safety case | Evidence analysis | No | No | Now | Hard | Reviewer |
| METR Frontier Risk Report | Both | Slight (non-attribution) | No | Now | Very hard (deliberately) | Producer |
| AI Lab Watch, FLI AI Safety Index | Evidence analysis | No | No | Both | Easy | Producer |
| SaferAI risk-management maturity ratings | Evidence analysis | No | No | Prep | Easy | Producer |
| GovAI third-party compliance reviews (proposal) | Evidence analysis | Yes | No | Prep | Med | Reviewer |
| Brundage et al. AAL framework (proposal) | Both | Yes | No | Both | Easy (via AAL scale) | Both |


The columns are far from independent: the whole table can be recovered by a short chain of single-question splits. The tree below is the chain that minimizes expected questions-to-classify. Note that auditor laundering occurs only at the fact-generation end, and evidence-analysis assessors only ever launder company secrets or nothing.
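
The two observations in the note above can be checked mechanically against the table. The sketch below transcribes only the first four columns of each row (project, fact gen vs evidence analysis, company laundering, auditor laundering); it verifies the stated pattern and does not attempt to reconstruct the decision tree itself.

```python
# (project, kind, company laundering, auditor laundering), transcribed from the table.
ROWS = [
    ("Dangerous-capability evals (METR, UK AISI)", "Fact gen", "No", "Yes"),
    ("Classifier-robustness red-teaming for misuse prevention", "Fact gen", "No", "Yes"),
    ("David Rein's red-team of Anthropic monitors", "Fact gen", "Yes", "No"),
    ("Security pen-testing", "Fact gen", "Yes", "No"),
    ("CAISI DeepSeek evaluation", "Fact gen", "No", "No"),
    ("Apollo in-context scheming evals", "Fact gen", "No", "No"),
    ("METR review of Anthropic's sabotage risk report", "Evidence analysis", "Yes", "No"),
    ("Redwood review of OpenAI CoT training / DeepMind safety case review", "Evidence analysis", "No", "No"),
    ("METR Frontier Risk Report", "Both", "Slight (non-attribution)", "No"),
    ("AI Lab Watch, FLI AI Safety Index", "Evidence analysis", "No", "No"),
    ("SaferAI risk-management maturity ratings", "Evidence analysis", "No", "No"),
    ("GovAI third-party compliance reviews (proposal)", "Evidence analysis", "Yes", "No"),
    ("Brundage et al. AAL framework (proposal)", "Both", "Yes", "No"),
]

# Which kinds of assessment ever involve auditor laundering?
auditor_yes_kinds = {kind for _, kind, _, auditor in ROWS if auditor == "Yes"}

# What auditor-laundering values appear among evidence-analysis rows?
evidence_auditor = {auditor for _, kind, _, auditor in ROWS if kind == "Evidence analysis"}

if __name__ == "__main__":
    print("auditor laundering appears in:", auditor_yes_kinds)
    print("auditor laundering among evidence analysis:", evidence_auditor)
```

With the rows as transcribed, auditor laundering appears only in "Fact gen" rows, and every "Evidence analysis" row has "No" in the auditor column, consistent with the note.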



