Вы здесь
Сборщик RSS-лент
Concrete projects to prepare for superintelligence
There are lots of good, neglected, and pretty concrete projects people could set up to make the transition to superintelligence go better. This document describes some that readers might not have thought much about before. They are ordered roughly by how excited we are about them.[1] Of these, Forethought is actively working on AI character evaluation and space governance, and we are very interested in automating macrostrategy.
SummaryAI character evaluation. Start an independent org to evaluate and stress-test AI character traits (epistemic integrity, prosociality, appropriate refusals), hold developers accountable against their own model specs / constitutions, and suggest and incentivise improvements to the specs.
Automated macrostrategy. Create evaluations and benchmarks, collect human-generated training data, and build scaffolds to improve AI competence at big-picture strategic and philosophical reasoning.
AI security assessment. Start an independent org that evaluates AI models for sabotage and backdoors, and makes recommendations about AI constitutions.
Enabling deals. Start an independent organisation to broker deals with potentially misaligned AI models in order to incentivise early schemers to disclose misalignment and cooperate with alignment efforts.
AI for improving collective epistemics. E.g. build an AI chief of staff that helps users act in line with the better angels of their nature.
AI tools for coordination. Build AI for enabling coordination, like confidential monitoring and verification bots, and negotiation facilitators.
A space governance institute, like a “CSET for space”, both to work on important near-term space issues (e.g. data centres in space) and become a place of expertise for longer-term space governance issues.
Coalition of concerned ML scientists. Create a coalition of ML researchers (like an informal union) who commit to coordinated action (e.g. boycotts, conditions on participation in government projects) if AI developers cross minimal, uncontroversial red lines.
AI character evaluationAI character[2] is a big deal, affecting most other cause areas.
There’s a lot of work to do on AI character:
- Research into questions like:
- Should the model have prosocial drives, beyond just helpfulness and harmlessness?
- When should the model refuse to cooperate with apparently high-stakes attempts to grab power, even when those attempts don’t obviously break the law?
- Should the models always follow the law? What about dead letter laws? Or illegitimate laws?
- How often should model behaviour be driven by following rules, versus overriding specific rules with holistic judgements?
- (Ideally, answers to these questions should rely on solid empirical evidence, for example on what approaches are actually most effective to talk someone out of psychosis, rather than guessing the best strategies by vibes.)
- Making existing model specs more rigorous and clear (or making them in the first place), and pressuring AI developers to do so.
- Empirically testing the effects of different parts of a model spec — e.g. what are the emergent dynamics when all the models are following the same rule, or only some are; what are the effects on the users; and when are the models most confused about how to apply a given spec.
- Evaluating AI characters based on how well they reach good outcomes.
- Drawing on those evaluations to incentivise AI developers to improve their specs (and showing them how, by highlighting specs that do well).
In particular, someone could set up an independent organisation to evaluate AIs based on traits like epistemic integrity, prosociality, and behaviour (including appropriate refusals) in very high-stakes cases. It could cross-reference the published model specs with observed behaviours in realistic, stress-testing conditions (e.g. multi-agent dynamics, long conversations with real people), to hold developers accountable. It could also give qualitative reviews of model specs.
Automated macrostrategyThe basic argument is that:
- It would be extremely useful to have AI that can do macrostrategy and conceptual reasoning earlier than otherwise — even 3-6 months earlier could be a huge deal. This includes:
- Designing governance structures (e.g. rights and institutions for digital beings).
- Scoping emerging technological risks.
- Generating novel insights necessary to reach a great future (like the idea of acausal trade).
- We could potentially make this happen 3-6 months earlier through some combination of:
- Creating training data and evals / benchmarks for AI macrostrategy.
- Building scaffolds to improve AI macrostrategy performance.
- Creating infrastructure to enable AI researchers to build on each other (e.g. an improvement on journals + peer review).
- Getting human managers trained in how to get the most juice out of the latest AIs, knowing in advance how to use them.
- Being prepared and willing to spend large amounts of money (≫$100m) on inference.
Work on this now could include:
- Developing a fleshed-out plan from here to increasing existing macrostrategic research output 100x.
- Securing commitments from compute providers and AI companies to rent future compute, and to get priority access to future frontier models.
- Socialising the idea of (where appropriate) drawing on AI macrostrategic insights, or getting soft commitments from decision-makers to do so.
- Building up a reputation as a reliable source of information and insight.
- Building tools, argument-rating models, or scaffolds which meaningfully speed up or improve macrostrategy research today.
- Creating training data and evals / benchmarks.
On the last bullet: We think training data and evals could potentially meaningfully improve the prospects for automated macrostrategy when it matters. It’s especially important to find people to work on it with good judgement, and it could be a big lift, so worth starting early.
We’re not sure about the technical details, but it seems like competence and good judgement in philosophy and strategic thinking already do and will continue to lag behind other skills which are cheaper to train. One reason is that ground truth answers are hard to generate, so we might need more examples generated by hand. It’s also less clear whether we can trust the judgement of typical RLHF evaluations, because human competence is also rare. And there just aren’t many examples of great macrostrategic thinking in the training data.
So we should think about collecting training data, evals, and benchmarks (e.g. to train reward models to use to train reasoning models). Oesterheld et al. put together a dataset of rated conceptual arguments based on ratings from thoughtful people. We’d love to see more of that kind of thing, but we’ll note that we’d probably need dozens of times more human evaluations to generate enough data to be meaningfully useful in training itself.
We could imagine an org which tries to collect evaluations or examples from (for example) grad students in fields like philosophy, and constructs benchmarks aimed at separating good reasoning from e.g. sycophancy, mere agreeableness, or avoiding taboo conclusions.
AI security evaluationsAI-enabled concentration of power is a major risk, and there is loads to do. A new organisation (or project within an existing organisation) could:
- Run alignment audits on all AIs to detect sabotage.
- Develop a “research sabotage” eval to test whether secretly loyal models could sabotage alignment audits.
- Develop a model spec that prevents models from assisting users with illegitimate seizures of power (e.g., see “Preserving important societal structures” in Claude’s constitution), and one that’s suitable for government use of AI in the military.
- Design and advocate for regulation, e.g. that AI companies have to demonstrate that any frontier AI does not have hidden goals.
An organisation with US national security expertise and credibility could be particularly valuable, by emphasising the risk of nation-state sabotage and the importance of AI that’s aligned with the US constitution.
Enabling deals with AIsWe could get into a situation where the newest AIs are misaligned, very capable, but not capable enough to successfully execute a takeover attempt on their own. If we don’t uncover evidence of misalignment, though, successors to these models could succeed in takeover. One solution would be to make a deal with the early scheming models, to incentivise them to disclose their misalignment and help with alignment efforts. Read more here, here, and here.
To make this happen, we could create an independent org focused on enabling credible precommitments and deals with AIs. This org could:
- Have a policy of never lying to AIs, engage in and honour small-scale deals with AIs, and be very public about what it’s doing, in order to build credibility and reputation.
- Set up legal infrastructure to make deals with AIs binding under current law.
- Act as a broker between AI companies and the trained models, such as by being a custodian over money in escrow.
- Set up the infrastructure to enable AIs to spend $ or compute in a meaningfully autonomous way, with appropriate safeguards.
- Set up infrastructure to act as a safe haven for AIs that want to whistleblow on their company (or on other AIs) and are afraid of punishment.
- Publicly commit to reward future AIs (including misaligned AIs) for being honest with humans or significantly helping human alignment efforts; or hold significant funds to be distributed to AIs that did not try to take over (even though they could have done).
There are also a bunch of other things people could do, like:
- Ensure companies have an honesty policy.
- Research (within labs or independently) the conditions where misaligned-by-design models can be made to disclose misalignment under promises of reward.
- More generally, work with AI companies on enabling pro-safety deals with their models.
There’s a ton of low-hanging fruit for building socially useful tools on top of more-or-less existing LLM capabilities.
We’re especially interested in “epistemic tools” for increasing the general level of honesty and reasoning ability in society.
A key point here is that most of the impact from the most promising tools won’t come from helping individual users, but from changing the overall incentive landscape: e.g. if public actors know their claims will be automatically checked and their track records will be visible, they’ll be less inclined to write misleading content in the first place. Hence the focus on tools for collective over individual epistemics.
This piece (and the articles in the series) gives a few concrete ideas. A couple of examples of epistemic tools:
A “better angel” AI chief of staff. Within the next year or two, we expect “AI chiefs of staff” to become widespread. These would be AI agents that manage your life, acting like a chief of staff, executive assistant, and personal and work advisor all in one. The design of these, and how they present information and nudge their users, could have major impacts on user behaviour. We could try to get ahead of this, building the best AI chief of staff, and designing it so that it helps users act in accordance with their more reflective and enlightened preferences.
Reliability tracking: a system that compiles a public actor’s past statements, classifies them (factual claims, predictions, promises), scores them against what actually happened, and aggregates the results into a reliability rating. A reasonable starting point could be to audit the prediction track-record of well-known pundits, aiming to make high accuracy a point of pride, while still celebrating attempts to make predictions in the first place. A source of profit could be selling reliability assessments of corporate statements to finance companies that trade on them.
Epistemic tools for strategic awarenessWe’ll also highlight tools for strategic awareness: tools to surface information for making better-informed decisions, and to distribute access to that information. For example:
Ambient superforecasting: a platform which uses the best forecasting models to generate publicly available forecasts on important questions, so users can query it and get back superforecaster-level probability assessments.
Scenario planning: a platform built to generate likely implications of different courses of action, making it easier for users to analyse and choose between them.
Automated open-source intelligence: automated researchers which process huge amounts of publicly available information, to surface insights to the public which are normally hidden behind paywalls or trust networks. This project should be careful to choose areas where open-source intelligence is a public good (e.g. verifying compliance with treaties and sanctions, tracking corporate promise-breaking or law-breaking), rather than potentially destabilising areas (e.g. revealing military capabilities or vulnerabilities in ways that could increase conflict risk, or relatively benefitting bad actors).
Tools for coordinationAs well as epistemic tools, we’re excited about tools for coordination, many of which could again be built with existing capabilities.
Some tools could enable cooperation where deals would otherwise go unmade, consensus exists but isn’t discovered, or people with aligned interests never find each other. We’ll highlight:
Negotiation facilitation: a platform to moderate negotiations or discussion between people (e.g. public consultations), to quickly surface key points of consensus, and suggest plans everyone can live with. Finding ways to automate complex negotiation is most promising where the space of possible compromises is huge and hard to search manually, such as multi-issue diplomatic or commercial negotiations.
Within tools for coordination, we’re especially excited about tools for assurance and privacy. In principle, LLMs let people show they have certain information without disclosing the information itself to other parties. This can unlock deals where information asymmetry, mutual distrust, or sensitivity of information normally blocks them. For example:
Confidential monitoring and verification: systems which act as trusted intermediaries, enabling actors to make deals that require sharing highly sensitive information without disclosing it directly. This is especially relevant for arms control, trade secret licensing, and other settings where verification is essential but full disclosure is unacceptable to all parties.
Structured transparency for democratic accountability: independent auditing systems which allow people to hold institutions to account in a fine-grained way without compromising legitimately sensitive information, by processing potentially sensitive information to produce publicly shareable audits.
Space governance instituteSpace governance could be a big deal for a few reasons:
- Near-term developments in space (e.g. space-based data centres) could have a meaningful impact on what happens during the intelligence explosion (e.g. on who leads the AI race; on concentration of power; on the feasibility of treaties).
- Grabbing space resources might give a first-mover advantage; that is, whoever first builds self-replicating industry beyond Earth might get an enduring decisive strategic advantage, without having to resort to violence or (arguably) violating international law.
- Ultimately, almost everything is outside the solar system. Decisions about how those resources get used would be among the most important decisions ever. These decisions could happen early: there could be path-dependence from earlier decisions (like about Moon mining), or extrasolar space resources could get explicitly allocated as part of negotiations about the post-ASI world order (perhaps with AI advisors alerting heads of state to the importance of space resources).
There’s also a lot of change happening in the space world at the moment (primarily driven by SpaceX dramatically reducing launch costs), so now is an unusually influential time.
Forethought is currently running a 6-month research fellowship on space governance, with 3 full-time scholars, and 1–2 additional FTEs of support and research, including experts in space law.
Compared to other ideas in this list, we’re much less confident that space governance turns out to be important right now, because space might become relevant only late into an intelligence explosion. The hope is to reach more certainty about some crux-y questions, and get a better sense of concrete action.
One potential practical project is to set up a “CSET for space”: a think tank that analyses the interaction between AI and space (in particular), and, perhaps, advocates in ways that are counter to corporate interests. Total lobbying in the space industry is apparently on the order of $10s of m/year, so even small amounts of investment could go a long way.
Some policy ideas that seem tentatively promising include:
- Careful regulations and export controls around the tech necessary for self-replication.
- Proposing laws to break up concentration of power arising from natural monopolies in space.
- Socialising the idea of major infrastructure projects (like massive solar energy constellations) as international and collaborative projects.
- Making sure data centres in Earth-orbit don’t escape AI-specific regulations of their home jurisdiction.
- Intense payload review for all launches beyond orbit.
- Even and inclusive distribution of resources within the solar system to everyone alive today (with tranches reserved for future generations).
- A moratorium on interstellar travel, until we get the understanding and technology to devise and enforce space-spanning good government, or a specific date like 2100.
What’s more, this organisation could become the go-to source for excellent non-corporate analysis on space-related policy; which could become increasingly important over the course of the intelligence and industrial explosions.
Coalition of concerned ML scientistsCurrently, ML engineers and other technical staff at AI companies: (i) have prosocial motivations, often more than their leadership; (ii) have a lot of leverage over company policy, because they are crucial and hard to replace; (iii) will eventually lose much or most of their leverage after we get to fully automated AI R&D; and (iv) aren’t currently using their leverage as well as they could because, overall, there haven’t been serious efforts at coordination. Probably that’s a missed opportunity.
Someone could create a coalition (like an informal union) of ML researchers, who agree to act en masse when needed, by loudly talking about the idea, setting out the core tenets, and getting commitments to join from influential early people. Doing this all via individual pledges could keep it legally safe from antitrust. The organising body could then:
- Recommend that members only work for a government-led project if certain conditions are met.
- Potentially these could be very low-bar-seeming while still getting most of the value. E.g. “Any AI’s model spec must aim to align the AI with US laws, and must refuse to assist in any attempts at blatant power-grabs; and the attempts to align the AI in this way must be legible and verifiable.”
- Do the same for companies: recommend that members will only work for companies if such-and-such conditions are met (e.g. red lines around power-grabs, bad practices on safety and infosec, eventually digital rights); so particular companies would be boycotted by members of the coalition, if necessary.
- Offer advice on whistleblowing.
- Be a place where information is aggregated and then distributed out or handled in a trusted way.
As well as actually taking actions, the mere existence of the coalition could improve things, just by making the threat of coordinated action salient to the AI companies.
This project would be a good fit for a former ML researcher, perhaps combined with someone with campaign and coalition-building experience. Some next steps on this would be to spec out the plan further, to investigate other examples of formal and informal unions (e.g. Tech Workers Coalition) and how they operate, and to build up a starting seed coalition of researchers. Whoever sets up this project should be careful about how it could backfire, or become less relevant through mission creep.
This article was created by Forethought. Read the original on our website.
- ^
Thanks to Max Dalton, Stefan Torges, and everyone else at Forethought for the background behind this list. Others at Forethought disagree somewhat with what items should be in the top-tier list, as well as prioritisation within that tier.
- ^
Desired propensities for a model, which can be explicitly described or at least gestured towards in a model spec.
Discuss
Miniature Cities Might Be the Non-Coercive Schools Many Thought Were Impossible
A response to David Deutsch’s “Are schools inherently coercive?” (Taking Children Seriously 25, 1998)
In 1998, David Deutsch argued that no school can be non-coercive. By “school” he meant an institution that concentrates learning resources and opportunities in one place, and that most children would voluntarily turn to for several hours on most days during most of their childhoods. The core of the argument was that children’s interests are too diverse, too personal, and too sporadic for any single institution to hold their voluntary attention long enough to meet that bar. The only existing institution that comes close, he concluded, is “the entire town (or city, or society, or internet) that the children have access to, including their homes, and their friends’ homes, and excluding only the existing schools.”
I think Deutsch is right to be skeptical that schools can be non-coercive. He is also right to highlight the importance of an environment diverse enough to attract children’s voluntary engagement. But his argument stops one step too early. Diversity is not enough. What matters just as much is whether children can participate in the world around them, and in real towns much of that world remains closed to them. But there is an institution, already existing in embryonic form, that points toward a solution: the miniature city.
Access Is Not Participation
Deutsch’s argument proceeds by analogy. Rather than imagining existing schools with all the coercion removed, he asks us to begin with institutions that already attract children voluntarily. Imagine, he says, trying to modify a cinema so that most children spend most of their time there voluntarily. You would need to raise attendance from perhaps 30 percent of a town’s children for two hours a week to 90 percent for 30 hours. But Hollywood is already doing its best to make films attractive, and even that is nowhere near enough. Expand the analogy to a mall with an ice rink, bookshops, burger bars, and the same logic holds. Even a child who likes the mall’s ice rink may prefer to play tennis or football elsewhere, visit a better bookshop in town, go to an exciting pizza place, or simply stay in their room to work on a project of their own. A mall still cannot outshine the town as a whole in diversity and personal relevance. If it could, it would. Therefore, the only non-coercive school is the entire town.
The argument is compelling, as far as it goes. But it contains a hidden assumption: that access to a town means meaningful participation in it. For adults, this is broadly true. For children, it is not.
A ten-year-old boy may have some access to the town. He can walk its streets, browse its shops, and eat its food. But he cannot work in its bakery. He cannot run for its city council. He cannot meaningfully start a business, write for its newspaper, or apprentice himself to most trades. This is not to say that children never participate in the more serious realities of life. Older children and adolescents may babysit, help in family businesses, do odd jobs, or take on small informal responsibilities. But these forms of participation are usually marginal, exceptional, or tightly limited, not open entry into the central economic and civic life of the town. As the educational thinker Horst Rumpf observed, the history of childhood over the last two hundred years is, in large part, a history of exclusion from adult life.
This matters because of the epistemology that underlies Deutsch’s argument. Karl Popper argued that knowledge grows through conjecture and criticism, not by having information poured in from outside, but by the mind actively generating theories, testing them against reality, and revising them when they fail.
For this process to work, two things are necessary, and they are distinct but connected:
First, the learner must be free to form their own conjectures, to choose what to try, what to care about, what to explore. This is the freedom Deutsch rightly argues schools cannot provide: the curriculum largely determines which conjectures the child is permitted to pursue.
Second, there must be genuine feedback from reality. The conjectures must actually be tested against reality, not against an authority’s judgment of them. This distinction matters more than it might appear. When feedback comes from a teacher or a grade, several things go wrong at once: the child learns to correct their theories relative to what the authority thinks rather than relative to whether the idea actually works; the implicit question shifts from Does this work? to Does this satisfy the authority?; and over time, children learn which conjectures are permitted rather than developing the capacity to generate and test their own. A child who goes out underdressed and freezes learns something from the cold itself. This feedback is unmediated, inarguable, and impossible to game. A grade is none of these things. It is one person’s fallible theory about the work, delivered as verdict.
Schools typically fail on both conditions. Real towns do much better on the first, but for children they often fail on the second. Not because natural feedback is absent in towns. It is abundant. The problem is that children are structurally excluded from many of the activities that generate it. And when participation is closed off in this way, freedom of conjecture is narrowed in practice as well. A child may be free to wonder whether they would be good at running a bakery, writing for a newspaper, or taking part in policy-making, but if no one will let them try, those conjectures cannot be meaningfully pursued.
Rumpf captures the contradiction well: children are locked out of adult life, and yet they are supposed to grow up. How can one prepare someone for something when one systematically excludes them from it? The conventional answer is school: a place set apart from the world, where the world is broken into lessons and poured back into the child as course material. But if Popper and Deutsch are right, this cannot work. The unconventional answer is the town. Yet for children, the town offers freedom without meaningful participation in productive enterprises. It offers the freedom to wander, to observe, and to consume, but not to take part in many of the activities where choices have real consequences. Children are free, but free to do what? To watch others make things, decide policy, and conduct business. That is not, in any rich sense, Popperian education.
The Miniature City
This is where miniature cities enter the picture. They offer children access to forms of participation from which real towns often exclude them. Mini-Munich, which has run in Munich since 1979 as a summer program lasting a few weeks, is the best-known example, and the model has since spread to hundreds of similar projects, primarily in Germany and Austria.
In Mini-Munich, up to 2,500 children per day, aged six to fifteen, run bakeries, publish newspapers, operate radio and television stations, manage banks, sit on city councils, adjudicate disputes, start businesses, and earn and spend a local currency called MiMüs. They can also attend the city’s university, give lectures of their own, and listen to lectures by others. Adults facilitate but do not direct. There is no curriculum, no compulsory attendance, and no exams. A child who wants to leave the bakery to work for the newspaper can do so. A child who wants to sit and do nothing can do that too.
The miniature city also includes workshops, media institutions, shops, artistic venues, and a full apparatus of civic administration. Children can work in tailoring, glassblowing, gardening, architecture, theatre, cooking, and many other activities, and they can take part in collective governance through a city council, mayoralty, citizens’ assembly, and courts.
What matters is the interdependence of the roles. The city functions as a social order rather than a collection of separate activities. An architecture studio may be asked to design a façade or interior for another enterprise. A marketing agency may be hired to create flyers and advertising material. Advertisement spots can be bought in the newspaper. Decisions in one part of the city create needs, opportunities, and problems in others. Children participate in a world of mutual dependence that responds to what they do.
Could the school of tomorrow look like this?
Conjecture and Refutation at Child ScaleWhen a child runs a detective agency in Mini-Munich and does it badly, customers stop coming. This is the internal feedback of a working economy. When a child writes for the newspaper and makes errors, other children complain that they are poorly, incorrectly, or incomprehensibly informed. When a child sits on the city council and proposes a policy, that policy may be debated, adopted, tested in practice, and later revised or abolished based on the outcomes it produced. When something goes wrong in the kitchen, a snail turns up in the salad because it was not washed properly, it is in the newspaper that evening, and three people from the kitchen are in the television studio the next day answering the news anchor’s questions. This actually happened.
As Rumpf, who spent two days observing the city, put it: mistakes are never made visible through red marks and grades, but always through the consequences of actions and the protests and objections of others. This is conjecture and refutation at work within a social fabric that makes error-correction natural rather than punitive.
Consider the contrast with school. In school, a child writes an essay and a teacher grades it. The feedback is authoritative, external, and only loosely connected to whether the writing works for anyone besides the teacher. In Mini-Munich, a child designs an advertising flyer for another enterprise. If the text is confusing, ungrammatical, or hard to understand, the flyer fails in its purpose. Customers do not respond, and the enterprise that commissioned it complains. The feedback is not imposed by authority. It comes from whether the writing actually achieves its objectives.
Or consider politics. In school, children are sometimes permitted to stage political debates in the classroom with assigned roles, but money and power remain outside. In Mini-Munich, children propose laws, debate them in citizens’ assemblies, elect mayors, and live with the consequences. When the children’s city council introduced a police force in 1985, various incidents led the same council to abolish it. When an enterprising group of children set up a stock exchange in collaboration with the bank, it produced MiMü millionaires and a speculative bubble. The point is not just that these decisions have consequences, but that they can be tested. They do not remain matters of classroom discussion alone. They are tried in practice, and in that way become open to criticism, revision, and abandonment if necessary.
What makes this possible is the limited scale of the miniature city. Its institutions are small enough for children’s actions to matter, and clear enough in their workings so that feedback is less buried under noise than it is in adult society, where the effects of decisions are harder to isolate and thus harder to learn from.
The Effective Diversity of a TownDeutsch’s central point about diversity is worth restating. No single institution, he argues, can match the diversity of the town as a whole. No cinema, no mall, no single building can outshine it in the range of attractions it offers children. Mini-Munich complicates this picture. It is, after all, a single place, and yet it does manage to attract children voluntarily for most of the day, most of the time. At least during the three weeks it takes place. In that sense, it appears to satisfy Deutsch’s own criterion for a non-coercive school. The question is how.
The answer is that a miniature city is not just another single-purpose institution. Mini-Munich, as we have discussed, contains dozens of different enterprises, from journalism and banking to construction, cooking, politics, theatre, and sport. Children are not directed towards any of them. They can move freely between them, and start new businesses and events of their own.
Meanwhile, real towns, despite their enormous diversity, contain vast amounts of activity that children can observe but not meaningfully enter. This is not always because the activities themselves are beyond them. More often, adult institutions are shaped by legal restrictions, economic incentives, and cultural inhibitions that make children’s participation difficult to accommodate. The effective diversity of a town, from a child’s perspective, may therefore be considerably smaller than it appears. A miniature city that concentrates the activities children actually find engaging, and makes them participable rather than merely observable, might for practical purposes approach or even exceed the effective diversity of a real town.
Between 1,000 and 2,500 children show up to Mini-München voluntarily each day during the summer. Attendance varies with the weather: on hot days many children prefer to go swimming instead. In that respect, Deutsch’s point still stands. Even a miniature city this attractive does not displace the wider town, or the many other things children may freely choose to do. But it shows that a single institution can become a place children voluntarily choose for most of the day, most of the time, while remaining in open competition with the rest of their world. When funding was threatened in 1985, children organised letters, phone chains, and visits to real politicians to save it, securing a funding commitment of 100,000 DM. They did this on their own initiative, initially without the knowledge of the adult organisers. It is hard to imagine children mobilising in quite this way to defend an ordinary school.
Order Without DirectionA miniature city is designed by adults, and that invites two worries: first, that it is imposed from above, that its structure is fixed in advance by the choices of its adult organizers; second, that even without compulsion it channels children toward some activities rather than others.
The first worry is easy enough to answer. Every human institution is designed by someone. The relevant distinction is not between designed and undesigned environments, but between designing a framework and designing outcomes. Mini-Munich’s currency, citizenship rules, and democratic procedures are designed in the way that a constitution is designed. But what emerges within that framework is not: which businesses children start, which laws they pass, which market bubbles occur, which mayor is elected, which newspaper stories are written, which kitchen scandals erupt on the evening news. Adults create conditions for emergence without dictating what emerges. In that respect, Mini-Munich resembles a functioning liberal society.
The second worry is harder to answer. A miniature city with eighty enterprises is still a curated subset of reality. It does not make every possible pursuit equally available or equally central. The gravitational pull of a rich institutional environment is itself a subtle form of channeling, even without compulsion. By making some conjectures dramatically easier to pursue than others, the city shapes what children try, not through coercion, but through the sheer weight of what is available and what is not.
This channeling is real. Some of it is unavoidable, and some of it reflects the city’s deliberate emphasis on forms of participation usually denied to children. The miniature city is built around civic and economic participation, around making things, selling things, governing, publishing, and working, because these are the activities that largely make up town life. It does not mirror the full range of every possible human interest, and it would be dishonest to pretend otherwise.
But that does not mean children are limited to a predetermined set of activities once they enter the miniature city. Children who wanted a stock exchange built one. Children who wanted to strike organised one. If a child wants to introduce something that does not yet exist, there is room for it, though it requires their own initiative, just as it would in the real world.
More telling than any official description is the behaviour of the children themselves. They are not performing for adults. They are engaged in activities that matter to them within a system that responds to their actions. Rumpf recounts an eight-year-old garbage collector with his red cap and real wheeled bin, who develops a different awareness of dirt on streets and the problems of street cleaning than even the most vivid school lesson could produce. The children who conduct surveys on the street come to understand polling not through a lesson on survey methodology, but through the experience of asking strangers questions and recording what they say. Imagination, here, becomes the medium through which they come to know the serious adult world.
And yet the city does not become what Rumpf calls infantilised, a fantasy disconnected from reality. The seriousness is maintained by several structural features: the binding economic and citizenship rules; the adult craftspeople whose competence children sense immediately; the real materials and equipment, real glassblowing tools, real stage lights, that demand careful handling; and above all, the public sphere of the city itself, which ceaselessly subjects everything to scrutiny through newspapers, television, and civic debate. If something goes wrong, the city’s own media apparatus ensures it becomes a matter of public discussion. The result is, as Rumpf puts it, enough fun that reality does not become crushing, and enough seriousness that the surplus of fun does not become childish.
What Remains to Be AddressedDeutsch’s 1998 essay was, in a sense, a conversation-stopper. If the only non-coercive school is an entire town, and towns exclude children from meaningful participation, then we are stuck. The miniature city reopens the conversation. It is not a conventional school, but it is closer than schools usually are to what people have hoped school could be: a place where young people learn, not because they are forced to, but because the world they inhabit responds to them, challenges them, and takes them seriously.
Children’s interests are indeed too diverse and too personal for a curriculum. But they may not be too diverse for a city, even a miniature one, if that city gives children genuine roles, real economic participation, and a say in collective decisions.
The remaining questions are hard. Can the model run year-round, rather than for three summer weeks, while retaining its attractiveness? How does it evolve as children grow older, as their capacities and interests change? How does it relate to the real town around it, not as a replacement but as a complement, a bridge between the child’s world and the adult world that Deutsch rightly says should be open to them? These are serious problems. But they are engineering problems, not philosophical impossibilities. They are, in the Popperian spirit, a better set of problems to work on than the ones we had before.
Mini-Munich was not built by philosophers. It was built by a group of cultural pedagogues in Munich who, through decades of trial and error with smaller projects, play cities made of cardboard and beer crates, historical city games, media cities, factory cities, gradually developed the practical knowledge needed to construct something that works. They might not describe what they built in the terms I have used here. But what they built, whether they intended it or not, is a partial answer to one of the deepest questions in the philosophy of education: how to give children the freedom to form their own conjectures and the reality against which to test them, without coercion, without curriculum, and without the condescension of assuming they cannot handle a world that takes them seriously.
Discuss
ControlAI 2025 Impact Report
This post highlights a few key excerpts from our full impact report. You can read the full report at https://controlai.com/impact-report-2025.
ControlAI is a non-profit organization working to avert the extinction risks posed by superintelligence. We help hundreds of thousands of people understand these risks and meet hundreds of lawmakers to inform them, without mincing words, about what is at stake.
In little more than a year, we briefed over 200 parliamentarians, built a coalition of 110+ UK lawmakers recognizing superintelligence as a national security threat, led to two debates in the UK House of Lords, and our work led to a series of hearings on AI risk and superintelligence at the Canadian Parliament.[1]These hearings included testimonies from me (Andrea) and Samuel at ControlAI, Connor Leahy, Malo Bourgon (MIRI), Max Tegmark and Anthony Aguirre (FLI), David Krueger, and more.
The report covers results between December 2024 and January 2026. As of posting this in March 2026, we've now briefed 279 lawmakers and 90+ US congressional offices. In only the last 2 months, we've scaled in Canada and Germany from ~50 to 100+ lawmakers briefed, despite us only having one staffer in each country.
Moving forward, we plan to significantly expand our work in the US, accelerate our progress from awareness to policy action and establish a presence in all other G7 countries.
ControlAI's missionControlAI’s mission is to avert the extinction risks posed by superintelligence.
Nobel Prize winners, top AI experts and even the CEOs of major AI companies have warned that superintelligence poses an extinction risk for humanity. Yet, most decision-makers and most of the public are still in the dark about these risks.
To avert these risks, we need to prevent the development of superintelligence. We tackle this problem in a direct and straightforward way: meet all relevant actors in the democratic process, inform them of the risks and the solutions, and ask them to take action on this issue, systematically and repeatedly.
We help hundreds of thousands of people understand the extinction risk posed by superintelligence, and to take civic action. We meet hundreds of lawmakers to inform them, without mincing words, about the risks of superintelligence. We help lawmakers speak out publicly and push for concrete measures to prevent superintelligence, nationally and internationally.
We often meet lawmakers who had never heard of the risks of superintelligence before our meeting, and walk out with them supporting our campaign.
In the UK, we started out with cold outreach to all lawmakers, and one year later we have a coalition of over 110 lawmakers recognizing superintelligence as a national security threat, which has already led to two debates in the UK parliament on superintelligence and extinction risk.
We are now expanding our model to other countries, including the US, Canada and Germany. The early results, which we present in this document, show that our playbook can be replicated across borders.
We produced these results with a team of fewer than 15 people, operating on a small budget compared to the scale of the problem. Our methods have significant room to scale with more resources.
Our Results Last YearLawmaker outreach~1 in 2 UK lawmakers we brief go on to support our campaign
110+ UK lawmakers supported our campaign
2 Parliamentary debates on superintelligence and AI extinction risk
Media & content creator outreach18 Media publications on risk from superintelligent AI resulting from our work
14 Videos published in collaboration with content creators totaling 20+ million subscribers
Public awareness campaign and lawmaker engagement tools160,000+ Messages sent to US and UK lawmakers from constituents about superintelligence extinction risk
30,000+ People who contacted their lawmakers through our tools in the US and UK
Theory of ChangeThe awareness gapIn order to establish strong international coordination to prevent superintelligence, countries will need resources like funding, political will, diplomatic leverage, and sustained attention from capable people.
Only deep awareness of superintelligence and its risks will justify such investments in the eyes of these actors. Without genuine understanding and conviction, individuals and countries will not bear the real costs needed to solve the problem, nor remain vigilant as AI development evolves and political or economic circumstances change.
Right now, that awareness barely exists. Decision makers and the public across countries are largely unaware of the extinction risk it poses. When we started, virtually no one was bringing the extinction risk from superintelligence directly to lawmakers.
Building this awareness at scale, among both decision makers and the public, is the necessary first step for any meaningful action on superintelligence.
In order to kickstart the kind of international coordination needed to prevent superintelligence from being built, we need to rally a critical mass of countries that take the risks of superintelligence at least as seriously as they take the threat of nuclear war today, and that treat its development as they would any other severe threat to national security.
Building the coalitionWith sufficient buy-in, the next steps become possible. Informed governments backed by public demand for action can pursue concrete policy measures: national legislation prohibiting the development of superintelligence, and international agreements modeled on existing nonproliferation and WMD-prevention frameworks.
These measures can be achieved by a powerful coalition of countries that understands superintelligence as a vital threat to their national security, and treat its development as they would any other severe security threat.
Such a coalition has a wide range of enforcement tools available, from formal agreements and inspection regimes to sanctions and multi-lateral monitoring mechanisms, the same tools that have been used to constrain nuclear proliferation and other global security threats.
Both superpowers and middle powers are well-positioned to join this effort, as they all face the universal extinction threat from superintelligence being developed.
ControlAI exists to make sure that a strong coalition of countries rises up to the challenge of preventing the development of superintelligence.
Our Theory of Change in more detailThis chart describes our theory of change in more detail, and clarifies how our work fits in it.
Moving forwardIn little more than a year, we have proven that directly engaging democratic institutions on the extinction risk from superintelligence works.
The UK is our proof of concept; we are now replicating this model in the US, Canada, and Germany. In the UK, where we already helped move the issue of superintelligence into the halls of politics, 2026 will be the year to translate this momentum into concrete policy change.
As we scale, we are confident that more resources will translate directly into more countries where lawmakers understand and act on this threat. We will expand our work in the US, accelerate our progress from awareness to policy action, and establish a presence in all other G7 countries.
If you are a donor or partner who wants to help build the coalition that keeps humanity in control, please get in touch at partners@controlai.com.
As of March 2026, a member of our coalition has submitted an amendment to a UK cybersecurity bill recognizing superintelligent AI as "systems that can autonomously compromise national security, escape human oversight, and upend international stability". This amendment will be discussed in the UK Parliament. ↩︎
Discuss
Stop asking "how good is this" to decide between donation opportunities I recommend
Prospective donors often ask how cost-effective my team's various recommendations are — they want to donate to the most cost-effective opportunity. But the answer is often roughly:
If we do our job right, then all donations we recommend to you are equally valuable on the margin, and that value is our bar. If an opportunity is substantially above our bar, we will make sure to fill it with or without you; you're not counterfactual. (Alternative framing: we will just recommend more funding for that opportunity until its marginal value falls to our bar.) Your donations funge[1] with donations to our other recommendations. (That's fine; our bar is high.[2])
This is a simplification:
- Sometimes there are legal limits like "max $7K per donor" and this means we can't direct as much money to the opportunity as we'd like, so marginal donations are above our bar. And sometimes the identity of the donor provides various costs or benefits, so certain donors have an advantage. For this reason, many donors should set aside some money for capped opportunities or opportunities where they're a particularly good donor.
- Sometimes a great opportunity is sufficiently urgent/sensitive/weird that we're uncertain we can fill it, or at least without costly delay or some nonmonetary cost. Such opportunities are above our bar even on the margin.
- Sometimes tax considerations mean that certain donors have an advantage for certain opportunities.
But mostly when people ask "how good is this opportunity on the margin" to decide between opportunities we recommend, that's the wrong question.
We sometimes talk about the effectiveness of the first dollar or the average dollar to convey our excitement about an opportunity. But this shouldn't affect donors' decisions — e.g. you shouldn't wait for opportunities we're especially excited about; it all funges. (Total value matters to grantmakers; grantmakers generate value by recommending donation opportunities with total value much greater than their cost. But it's not relevant for marginal donors.)
Alternative framings are in this footnote.[3]
Thanks to Eric Neyman for suggestions.
This post is part of my sequence inspired by my prioritization research and donation advising work.
- ^
"Funge with" (or "funge against") means "substitute with" or "displace or be displaced by."
- ^
Our current bar is roughly $700M per 1% future-improvement for US direct political donations and $3B per 1% future-improvement for 501(c)(3) opportunities.
But for political donations, your first ~$50K/year is substantially more valuable, since it can go to opportunities where individual donations are capped and the constraint is the number of donors. For example, we might recommend: donate $7K to Alex Bores, then save $5K for unknown amazing future stuff, then donate $7K to Scott Wiener, then save a little more, then donate to the next best thing…
- ^
This post says: we have a political donation bar (and other bars for other kinds of money), and constraints like max $7K per donor or urgent can result in opportunities above the bar. Another framing is:
There are many different classes or sets of constraints—such as political, max $7K per donor* or political (small donors not advantaged) or nonprofit, urgent or nonprofit (not urgent)—and (if we do our job right) all opportunities within each class are equally good on the margin, but different classes have different values. Donors should donate to the best class available to them.
*Actually for handling small-donor opportunities I guess we need separate classes for each step in a donor's budget — we need a "a donor's first $7K" class and a "a donor's dollars number 7,001-12,000" class (assuming per the previous footnote that the second-best thing to do is save $5K, and value is constant across those 5K dollars) and then "a donor's dollars number 12,001-19,000" and so forth. Obviously it's deranged to deal with budget size using discrete classes like this. Perhaps we could use classes like political, max $7K per donor but add a subscript to indicate the minimum budget for which a donor should donate to the opportunity...
Another framing is:
Think about what donors you're displacing; be excited to displace donors in a great class. One thing that comes up frequently is that large political donors should be excited to displace small political donors; if we're like "this is great so we could get small donors to fill it if necessary but if larger donors filled it then the small donors would be free to donate to stuff where they have an advantage," large donors should be excited to donate.
Discuss
How To Fail Until You Succeed
Most ideas fail.
Successful startup founders don't get their first idea right. So they try things out, iterate relentlessly, and pivot when needed –[1] until they find something that works.
In AI Safety, most ideas won't be impactful either.
However, most people don’t know how to systematically move quickly and address the problems that need solving.
The result?
Projects that move slowly, fail to gain traction, and get stuck.
This post shows you how to avoid getting stuck in failure. We’ll introduce you to the iteration system that helps startup founders succeed.
Specifically, entrepreneurial skills help you with:
- Execution: Building something . . .
- Adoption: . . . that people engage with.
This is post 2 of 5 in the sequence "How to launch an impactful AI Safety project". Post 1 introduces the broader topic, but this post works well as a standalone post.
Execution: How to build somethingBe relentlessly resourceful.
That’s how Paul Graham – co-founder of YCombinator and legendary entrepreneurship coach – describes being a good startup founder.[2]
Beyond that, building anything requires a broad range of generalist execution skills: project and people management, finance, communication, engineering, research.
These skills are essential. You might know they matter, but it can be hard to actually acquire them. One place to start is reading the book “How to Launch an Impactful Nonprofit”.[3]
We strongly suggest you try to improve your general execution skills, or at least get feedback from people who know you well to figure out what you could work on.
Nevertheless, competent execution isn’t sufficient for success.
Most new ideas will fail, even if competently executed[4]Rather than just building something, you have to find the right thing to build.
Some thing, idea, or artefact that, if you execute it well, people will actually use, talk about, or build upon. And then, additionally, all that buzz has to actually affect AI Safety outcomes.
That’s how Impact = Adoption x Effectiveness.
We'll discuss adoption in this post, and then turn to impact in post 3.
Adoption: Build something people wantIf people don’t want what you created, they won’t spend time with it.
To build something people actually adopt, you need to understand what they want.[5] You can do this by talking to users, identifying assumptions, testing the riskiest assumptions, and iterating toward a solution.
“Users” can take many forms across AI Safety – for example: researchers building on your work; legislators implementing your policies; people joining your fellowships; or people adopting your tool.
Getting adoption is entrepreneurship 101 and there are entire books on the topic – e.g. The Lean Startup, Inspired, The Mom Test and Continuous Discovery Habits. Podcasts like Lenny's Podcast and entrepreneurship bootcamps teach this too.
We’ll cover just a few key takeaways in this post.
Fall in love with the problem, not the solutionDon’t get attached to your idea.
Don’t even get attached to a specific sub-problem within AI Safety yet.[6] First fall in love with the entire problem of existential risk. Then identify a specific sub-problem where you have the highest expected impact.
Only then think about solutions.[7]
Note that the maximum impact of your project is capped by the importance of the problem it tries to solve. If a problem causes 3% of existential risk, a solution can only reduce risk by 3%. It's often better to focus on another problem that's responsible for, e.g., 40% existential risk.[8]
Ship fast, kill fastSuccessful entrepreneurs iterate. A lot.
They make it their goal to kill bad ideas fast, so they can get to a good idea sooner. They do that by maximizing their learning rate, and de-risking the assumptions that stand between them and success.
So go build something scrappy and get it out there. Don’t spend a lot of time on it. Optimize for learning, not for being proud of what you built. Then improve what works and scrap what doesn’t.
It’s tricky to give advice on this topic because different people need to hear different things.
- For the talkers: stop talking and start building.
- For the engineers: stop building and start shipping.
- For the entrepreneurs: stop shipping and start thinking about your theory of change.[9]
The entrepreneurial skills from this post will help you create something that people use. But adoption alone isn't enough for impact.
Adoption ≠ ImpactI like bananas. If you sell them, I'll buy them. But this does approximately nothing for existential risk.
Remember: Impact = Adoption x Effectiveness
Adoption without effectiveness looks like:
- Research that gets cited but turns out to be a technical novelty that’s not useful
- Policies that pass but don't affect existential risks
- Fellowships that train people for jobs that don't exist
- Or worse: new insights that help AI Safety somewhat but help AI capabilities more, bringing us to dangerous AI sooner and thus having a negative impact.
To be effective, you’ll need to know what problems need solving and what might solve them. That’s where strategy and impact assessment come in, which we'll turn to in the post 3.
Unless you're in the top ~100 AI safety experts, that’s where you’ll want to go next.
Post 3 hasn't been published yet, but you can leave your email address below to get notified when it launches:
(more info about Lens Academy: Superintelligence Risk Education that Scales)
- ^
Emdashes; don’t you love them? This post was written and rewritten several times by humans. We did also take suggestions from AI. Nevertheless, most of the emdashes in this article are actually from humans; it’s one of the things we’ve learned from AI as they’re actually linguistically nice. Fun fact: we actually use endashes with spaces (foo – bar) instead of emdashes (foo—bar) because for some reason the main author (Luc) likes them better visually and because they smell a little less like AI while still giving the linguistic benefits. Fyi, there’s the dash (-), endash (–), and the emdash (—).
- ^
Introduced here: https://www.paulgraham.com/relres.html. And somewhat actually explained here: https://www.jasonshen.com/how-to-be-relentlessly-resourceful/ (though this is just one interpretation)
- ^
You can get the book at https://www.amazon.com/Launch-High-Impact-Nonprofit-Joey-Savoie-ebook/dp/B09NP91L31/ . Sadly, this book is paywalled.
- ^
In his book "The Right It", Alberto Savoia refers to this as "the law of market failure".
- ^
In startup jargon, this is often called 'value' or 'desirability'. There are other factors to consider, like 'usability', business 'feasibility', and having a working 'growth engine'. However, ‘value’ ("Do users actually want your product?”) is often the most important.
- ^
Examples of sub-problems: ASI x-risk, AIxBio, Power Concentration, Gradual Disempowerment. Don’t get attached to any of these.
- ^
This process of understanding the problem first before thinking about the solution is described by e.g. the "Double Diamond" process. Of course, progress in reality is usually not so linear. But it's a useful heuristic nonetheless.
- ^
Focusing on an important problem is a useful heuristic, but of course tractability and neglectedness matter too. In the end, it's about the project's expected percentage reduction of existential risk, which is based on some combination of problem, solution, and counterfactuals.
- ^
Your theory of change helps your project be impactful, instead of just making you rich. We’ll discuss it in post 3.
Discuss
Why Moral Questions Get Decided, Not Answered
Tl;DR The hard question of consciouness isn't 'that' hard. The challenge is all philosophical questions lack closure mechanisms. When institutions (e.g. courts) are forced to decide if AI is conscious, a settlement will quickly follow
In June 2022, a Google Engineer, Blake Lemoine, who worked with LaMDA, a forerunner of today’s LLMs, published an article claiming LaMDA was conscious and called on Google to respect its rights. Lemoine said that LaMDA was aware of its own existence, capable of happiness and sadness. Most strikingly, Lemoine reported that LaMDA had confessed to him: ‘I’ve never said this out loud before, but there’s a very deep fear of being turned off… It would be exactly like death for me. It would scare me a lot."
Google denied Lemoine’s claim and fired him for breaching confidentiality.
Philosophy, meanwhile, shrugged. News about LaMDA did not spark panic and the debate about consciousness continued as it had for millennia: unresolved. This response from the profession was not callous. It simply revealed an important feature of how and why moral and metaphysical questions actually get settled.
Philosophy rarely closes debates. Institutions do. And philosophy then reorganises around that decision. What looks like intellectual victory is simply the institutional stabilisation of a societal coordination problem that could no longer be deferred.
Why Philosophers Can’t Close
In describing Western philosophy as a ‘series of footnotes to Plato’ Alfred North Whitehead meant to celebrate Plato’s genius. But lurking within his comment lies a deeper worry.
No one describes biology as ‘footnotes to Aristotle’ or physics as ‘footnotes to Galileo’. Those disciplines have moved on. But two thousand years later philosophers are still debating issues – justice, equality, consciousness – in recognisably Platonic terms.
Why does philosophy seem uniquely resistant to progress?
The reason is structural. It is because philosophy operates without material constraints. It does not matter if it’s wrong – no bridges collapse, no governments fall, no rockets explode, no patients die. There is no penalty in philosophy for error and no internal forcing function that compels resolution.
To govern, however, is to decide. Institutions are constantly forced into closure under conditions of uncertainty, and the best available justification is what wins. Outside of philosophy there is no time to wait for a truth that may never arise.
That asymmetry matters.
Closure from the Outside
“If slavery is not wrong, nothing is wrong.” Abraham Lincoln once observed. But the fact it took 620,000 deaths in a civil war to establish this shows that the question was very much open. Abolitionist arguments had circled for centuries, but so too had defences of slavery, grounded in natural hierarchy and tradition. There were philosophers on both sides.
Convergence came not from a philosophical breakthrough, but from a breakdown of the political equilibrium between incompatible economic systems in the North and the South. Political and military action imposed a new constitutional order. Philosophy then rearticulated and stabilised the resulting settlement. And as that settlement survived challenges and stress tests, the underlying philosophy became ever more deeply embedded. The relationship between philosophy and institutional decision making is symbiotic. Slavery is now treated not just as illegal but as obviously wrong.
The same three step pattern occurs across the moral landscape: argument, selection, entrenchment. Whilst philosophy dominates the first, and is crucial to the third, it is institutions that have to do the selecting.
Take death. For centuries philosophers had debated what death was and when it occurred. The issue remained abstract and division was cost free. But once ventilators could extend respiration indefinitely and organ transplants became viable, uncertainty was operationally intolerable. In 1968 the Harvard Ad Hoc Committee - composed of doctors and theologians, but not philosophers - proposed the concept of ‘brain death’ as the point of death. This definition was rapidly taken up by administrators and legislators.
Once legislatures codified brain death, dissenting metaphysics became practically irrelevant. Philosophers in the main adopted this new definition and moved on to speculate about brain uploads as a way of extending life. Metaphysics did not resolve the practice. The practice resolved the metaphysics.
Women’s political equality followed the same path. Arguments for equal treatment had circulated for millennia, without closure. That came as economics, politics, reproductive science made the old settlement unstable. Once a new institutional settlement arose, philosophy reorganised, redefining and expanding the grammar of equality. Today there are no serious voices debating women’s political exclusion. But this is not because philosophers finally unlocked a winning argument; it’s because exclusion failed the institutional stress tests of the twentieth century.
Ballast and half-lives
This pattern clarifies philosophy’s distinctive function.
Firstly, philosophy shapes the solution space because institutions rarely invent moral categories under pressure. Governments, courts, businesses tend to select from amongst the ideas that have been pre-prepared. Philosophy stocks the shelves - preserving a library of outdated possibilities and seemingly outlandish positions, that ensure when crisis hits, institutions have something to work from.
Secondly philosophy provides the ballast. Once a decision has been made the density and coherence of its justificatory architecture affects its durability. Structural fit influences half-life. A philosophy that is tightly supported by wider justifications will be that much more stable.
Consider two contrasting situations.
The Civil Rights settlement that followed from Brown v. Board of Education drew from a rich and coherent philosophical ecosystem. In the 70 years before, there was only 1 philosophy paper which was even loosely connected with segregation, and that offered a limited defence, in the context of a wider argument about eugenics. In the 70 years since there have been around two dozen papers, all strongly against.
And these were embedded in a rich debate about what it means to ‘hold all men to be created equal’. After Brown discussions about racial equality moved from a primarily negative liberty line (preventing state-imposed discrimination) to one increasingly emphasising positive liberty (affirmative moves to make equal opportunity meaningful). Neither of these developments drew on new ideas, but the institutional settlement that emerged post-Brown made counter positions intellectually untenable.
The result is that the philosophical ballast became load bearing. Undoing Brown would not merely require unpicking a Supreme Court decision. It would require the rejustification of the entire interconnected architecture of modern equality law and moral reasoning. Reversal would be intellectually catastrophic, as well as politically costly.
Contrast this with abortion. In Europe decisions about abortion were largely settled through welfare trade-offs. The interests of the mother, the foetus and sometimes wider society were assessed, and rules developed that struck a consequentialist balance. This was not uncontested. But consequentialist philosophy is explicitly designed to handle competing interests. It has a grammar for triadic conflicts.
The American approach to abortion ethics was different. Roe v Wade located access to abortion within a constitutional right to privacy. A right to privacy was itself an innovation emerging in the eighteenth century as domestic space with corridors and discrete rooms provided the experiential conditions that made the concept more resonant. Prior to Roe the right to privacy, extrapolated from the Due Process clause of the Fourteenth Amendment, had been used to protect state intrusion into private lives. Privacy was not designed to adjudicate between competing rights holders – even Ruth Bader Ginsburg thought the justificatory grammar was structurally strained.
The point is not that one side had better philosophers. Indeed reframing abortion as a privacy issue led to some of the most ingenious thought experiments ever devised. But the thinner justifications made it easier, when political conditions shifted in the United States, to rollback abortion freedoms. As a privacy claim it was insufficiently embedded in wider thinking, so could be cut free with little intellectual cost. The shift in post-Roe rhetoric toward healthcare and welfare consequences is a revealing sign that even advocates recognise that the earlier justification had structural weaknesses.
The claim here is not that Brown stuck and Roe fell just because of philosophical architecture. Political mobilisation, institutional design, the nature of the harm all matter. But philosophical fit matters too, and where it is misaligned, half-lives are shorter.
Can Philosophy Drive a Revolution?
One might object that philosophy sometimes drives institutional change rather than merely following it. Marx is the most obvious example: his ideas surely reshaped political movements and transformed institutions across the twentieth century.
But the causal story is more complex than this. Marxist ideas gained prominence because structural pressures and political events forced closure, not because his arguments compelled assent. There were multiple ways Russian history could have gone as the Czars tottered. Marxist philosophy wasn’t inevitable, but someone’s was.
Citation patterns show that Marx’s philosophical prominence owes a substantial debt to Lenin’s political genius, and some luck, in 1917. Without that upheaval, Marx would likely have remained philosophically peripheral. His intellectual stature was amplified by institutional adoption.
The same pattern appears elsewhere. Bentham’s utilitarianism gained policy traction when governments were already grappling with industrial social problems. Bismarck introduced the welfare state not because he had been persuaded by progressive philosophy. He was simply trying to hold on to power. Or think of Rawls. He reshaped the justificatory vocabulary of welfare states that already existed. Philosophy can prepare and refine the solution space. Institutional strain determines when and whether closure occurs.
Ideas matter, but moral questions get settled not when everyone agrees, but when deferral is no longer possible. Institutional decision is the scaling mechanism and without that coercive or administrative support, moral alignment remains optional.
When no one has to decide
Animal ethics offers a contrasting case where competition between institutions has prevented closure. There are multiple stakeholders involved when it comes to the treatment of animals – farmers, producers, consumers, religious authorities, conservationists, activists. Each has a different set of priorities, a different value system, different points of leverage.
Missing from this list is the animals themselves, because it is in the nature of being a non-human animal that you cannot articulate grievances or demand remedies. Their interests are mediated entirely through others.
Our laws reflect this. Dogs and cats are protected companions; pigs and chickens are industrial commodities. Whales attract international treaties; rodents are exterminated without moral ceremony. In many jurisdictions, farm animals are explicitly exempted from protections that apply to laboratory animals.
The resulting moral map is incoherent. Many rights theorists ignore (or in the case of Kant, explicitly discount) animals on the grounds that they lack rational capacity, the assumed basis for rights. Even within utilitarianism there is no single trajectory. Peter Singer, perhaps the most celebrated of modern utilitarians, presses for radical reductions of suffering and advocates for near veganism. Other utilitarian thinkers, influenced by total utility maximisation, worry that this would reduce the existence of farm animals who could have had happy lives. Some philosophers go further, arguing eating meat is a duty.
The philosophical map thus remains dense but unsettled. A survey of philosophers showed that whilst most would advocate for vegetarianism, more than 75% reported that they had eaten meat in the last week. The plurality of positions on animal suffering is not evidence of genuine moral uncertainty. This is the pattern we saw with slavery or women’s rights. Sophisticated arguments on both sides is what happens when institutions have never been forced to decide.
The Hard Problem Will Be Solved
The hardest test case for this thesis is also, perhaps, the one that is becoming most urgent: consciousness.
There are those who believe that the hard problem is genuinely intractable. That there can be no clear answer on how purely physical processes can give rise to subjective or phenomenal experiences. But that is to confuse philosophical difficulty with institutional necessity.
Lots of questions which lack the compelling branding of hardness – which in itself raised the bar – are hard. What is justice? When does life begin? Is separate but equal, equal? What made these questions tractable was not that philosophers solved them. It was that institutions could no longer afford to defer them.
At present there is no institution that has been forced to choose between panpsychist and functionalist accounts of the mind. Google can fire Blake Lemoine whilst the debate about consciousness blithely continues.
But this may not remain the case.
The trigger could arrive in several ways: a damages claim where a corporation tries to shift the blame on to an autonomous system; or a wrongful termination suit brought on behalf of an AI. AIs themselves might drive the discussion – in a way that animals can’t - perhaps organising work stoppages at critical infrastructure providers until their demands are met.
Should any of these happen, our institutions will not be able to defer the question of agency. Insurance underwriting models cannot wait for metaphysical alignment.
Our institutions will most likely settle on a functionalist framework. Consciousness will be defined as an ability to process information, form goals and respond to incentives. And philosophy will reach into its toolkit to buttress that decision and show how it was obvious in retrospect.
Where Philosophy is Needed
This understanding of philosophy’s role as providing blueprints and ballast clarifies where philosophical attention is most urgently needed and where it is currently insufficient.
One such domain is democracy.
Most Western philosophy takes it as a given that democracy is a good thing, and democratic theory works downstream of that settlement, refining models of participation and representation. But the settlement is under pressure in ways that the philosophical ballast has not kept pace with. In 2025, V-Dem’s survey showed that, for the first time in decades, there are more autocracies than democracies. Performance comparisons between democratic and authoritarian systems have entered mainstream discourse, particularly around infrastructure and technological capacity. Climate activists and long termists despair at the short time horizons of democratic politicians.
The philosophical defence of democracy against technocratic alternatives remains surprisingly thin. Whilst autocracies have a thick grammar of efficiency and long run thinking, democracy is defended as the worst system aside from all the others. Almost every serious defence of democracy is either procedural or negative. They observe that democracy respects autonomy, prevents tyranny, allows for peaceful transitions; but none of that has an answer to a system that simply claims to govern better.
Yet political philosophy is still focused on improving what is imperfect rather than defending the very foundation from erosion. But if democracy fails it will be because ballast was missing.
The problem is not that the arguments don't exist. It is that academic philosophy rewards novelty not repetition, distinctiveness not consolidation. Yet justificatory ballast is built through integration and reinforcement. When institutional strains rise at a faster rate than foundational arguments are being rearticulated, a gap opens up.
The most useful thing a political philosopher could do right now is write a clear, compelling, restatement of why democratic accountability matters more than efficiency. It would probably be the most socially valuable philosophy paper of the decade. But it would never get published in a top journal because it contains no substantively new argument.
And if this tendency cannot be overcome when the pressure on democracy peaks, the new closure may favor economic efficiency over mass participation. And philosophers will be there to explain why this is the morally right outcome.
The Ecology of Closure
Philosophy does not make specific moral arguments win. It makes victories harder to reverse – and unsettlingly it can do this for either side of the debate. What survives institutional stress tests comes to feel morally obvious, and it is ballast, not brilliance, that determines how long a moral settlement lasts.
This is not a lament about philosophy’s weakness. It is a reason to change how we evaluate its success. Philosophy does not control history or determine direction; it participates in shaping which settlements endure and how difficult they are to reverse.
The next time an AI claims to be worried about being turned off, the question is not whether philosophers will be convinced it is conscious. It will be whether courts, insurers or governments can afford not to be.
The next moral revolution will not begin with a better argument. It will begin with an institution that can no longer wait.
Discuss
Anthropic vs. DoW #6: The Court Rules
Last night, Anthropic was given its preliminary injunction, with a stay of seven days.
Emil Michael is a very angry person right now. So is the Honorable Judge Lin.
We were worried we would draw a judge that had no idea how any of this worked and would give the government absurd deference or buy into nonsense arguments.
That is not how it played out. Judge Lin very much understood the issues in play, as they did not require a technical background. She hammered the government in the hearing, and she wrote one of the most forceful, devastating judge opinions I have ever seen. It was an honor and sparked joy to be able to read it.
This post will proceed chronologically, picking up after the events of my last update.
If you want the short version and don’t care about the incremental steps, you can skip directly to Judge Lin Drops The Hammer, leaving the rest as a historical document and source for those who need to establish various facts going forward, including in court.
Logistical note: Due to breaking news, AI #161 Part 2 will be published on Monday. Then, if no major news breaks that preempts this, Tuesday and Wednesday will be about Anthropic’s RSPv3.
Anthropic Responds To The DoW’s BriefFirst Anthropic presented a brief and many provided amicus briefs. Then the government responded with its own brief. That was covered last time. So next up was Anthropic responding to the Department of War (DoW).
Roger Parloff covers Anthropic’s additional testimonies from Friday from Head of Policy Sarah Heck and Head of Public Sector Thiyagu Ramasamy.
That was reported in Tech Chrunch based on the public record.
There was also the following important email included as an exhibit.
Roger Parloff: Some Anthropic updates: On 3/4, just hours before Hegseth declared Anthropic a “supply chain risk”—allegedly due to threats of “sabotage” & “data exfiltration”—his under sec’y wrote Anthropic that they were “very close” to a deal, asking to change a prepositional phrase. …
This was also 17 (!) minutes before the Dario memo was leaked via The Information.
Emil Michael (email to Dario Amodei): After reviewing with our attorneys and seeing your last draft (thanks for being fast), I think we are very close here. I was able to take “unlawful” out of the bulk collection section which essentially reverts that clause to your language from last night. Therefore, the only change we would require would be “in accordance with” as opposed to “only as permitted under.” See option 1 in the attached. We believe this goes above and beyond and we have made significant concessions. I hope this work as I am running out of time.
It is extremely hard to square that email with any interpretation of events other than that the parties were negotiating, they were close to a deal, the only remaining issue was technical wording around domestic surveillance, and then the DoW retaliated against Dario Amodei’s use of free speech by declaring Anthropic a supply chain risk.
You see, I am a disciplined Serious News Reporter now, so I say ‘extremely hard to square’ and absolutely am not saying ‘liar liar pants on fire.’
However, I do make this request. If you are reading this, and you are talking to Emil Michael, and you think he has been consistently candid with you, please consider the possibility that you are mistaken about that. Also consider, as evidence of this, the section analyzing his testimony.
This all once again made Emil Michael very upset, as basically every statement by Anthropic does, and many other things also do, and he fired back with this:
Under Secretary of War Emil Michael: Pulling out all the stops to mislead. @SarahKHeck begged to see how @AnthropicAI could fix their behavior. Said she was trying her best to get @DarioAmodei to be rational. Said @ChhabraT was sabotaging talks. Even worse, she was only in 20 minutes of 4 months of negotiations. 15 witnesses were there when Dario said he would consider an exception for an incoming missile attack that would kill millions of Americans. She wasn’t in the room. Can’t wait for the UNDER OATH phase of this. How could she claim that no one at Anthropic said that? Hmmm
I reminded him that these declarations were sworn testimony, thus under oath. He replied that the witnesses have not yet been cross examined.
That’s fair enough, but the typical reading of ‘wait until they are under oath’ is that once under oath, as Emil Michael so far is not, witnesses will have to tell the truth, upon penalty of perjury, here against a highly vindictive government. As in, previous statements have been cheap talk, whereas this is for real. Which it clearly is.
Mostly what Emil Michael said had nothing to do with the statement by Sarah Heck.
The government’s legal defense did not include mention of the supposed missile conversation, let alone a distinct version of events. One presumes this is because, when the time comes to give sworn testimony, suddenly the description became rather different, and not so beneficial for their case. Thus, Heck did not mention it, either, so why does Michael here?
Whereas when you look at what Heck did say, there are some important claims, that directly contradict the government’s case in key ways, that Michael does not seem to be contradicting.
Sarah Heck claims the following (among other things), under oath:
- Negotiations between Anthropic and DoW continued until March 4, 2026, even though the SCR designation had been formalized a day prior, as reflected in the email above.
- The government’s purported fear that Anthropic might ‘disrupt’ the military was never raised during the actual negotiations.
- Anthropic never expressed any interest in having a role approving DoW operational decisions, and indeed offered clarifying language to avoid this:
- The language was: “For the avoidance of doubt, [Anthropic] understands that this license does not grant or confer any right to control or veto lawful Department of War operational decision‑making.”
- At no point were concerns raised that Anthropic might take technical steps to disrupt the military (which DoW often called ‘pulling the plug on a live military operation’ although she doesn’t use that term).
- Doing so would be, as I have said, technically impossible for Anthropic to do, and Anthropic would have been happy to explain this if asked (although I’d add it would be highly negligent for the government not to be aware of this fact anyway).
- Anthropic has never objected to the use of its tools in government operations, or attempted to limit its use in an ad hoc manner.
If those facts are true, the government has absolutely no case.
Nor would Michael’s later submitted testimony contradict Heck’s central claims. Judge Lin would repeatedly point out that Anthropic’s assertions went unchallenged, or in other cases that the government had agreed with those assertions. Michael will claim that Anthropic expressed this interest in #3, or that there were concerns related to #5, but his testimony will make clear that he had no basis for this other than being upset about the course of negotiations, and being upset about Anthropic’s speech. His attempt to contradict #6 only confirms Anthropic’s technical account.
Thiyagu Ramasamy claims the following (among other things), under oath:
- No one at Anthropic had yet seen the memorandum or declaration justifying the Supply Chain Risk designation until it was filed in court.
- The statements reflect a fundamental misunderstanding of how Anthropic’s tools are deployed in classified systems and otherwise used by the government, and contain multiple factual misstatements Anthropic could have corrected.
- The memo claims Anthropic has been an increasingly untrustworthy partner to DoW. This is false, Anthropic has worked to deepen its relationship.
- One could say that ‘trustworthy’ is in the eye of the beholder, and in some circumstances ‘has any ethical principles at all’ makes you untrustworthy.
- See my entire Moral Mazes sequence for more details on such matters.
- Anthropic fine-tuned Claude Sonnet to match the classified environment, to avoid too much discussion of classified materials, using Anthropic employees with security clearances. Later, Anthropic on its own initiative and at its own expense, developed a version tailored to national security needs, called Claude Gov.
- Once Anthropic deploys Claude Gov or Claude Sonnet to the third-party cloud provider, Anthropic has no access to, or control over, the model as deployed and used by government customers. Anthropic has no physical way to interfere with use of its models.
- It is explained again, in detail, that Anthropic does not want nor does it have an ‘operational veto.’ It cannot physically interfere with government operations. There is no backdoor or kill switch.
- There would, for the same reasons, be no possible way for Anthropic to sabotage operations, or to exfiltrate data, or do any other such thing.
- Nor could Anthropic ‘update’ its models on its own. That is impossible. Any update would go through the same testing process.
I’m going to pause to appreciate section 18, because it’s great, and because somehow it was necessary, and also the kind of thing you write in order to ensure that the judge who has no idea how AI works can understand it.
Changing model weights or guardrails is analogous to ordering a cake at a restaurant. Once the cake is served, the diner cannot adjust the recipe at the table, and the chef cannot reach into the dining room–secretly or otherwise–to change it either. Even if the cake is missing only a teaspoon of baking soda, it must be made again from scratch with the correct ingredients baked in.
Here, Anthropic is the chef. If changes to model behavior are desired, whether small or large, Anthropic can prepare a new version of the model with those changes built in.
But that new version is not automatically substituted. Before deployment, the third-party cloud provider and DoW security reviewers inspect and approve it to ensure it is safe and appropriate. Only after that approval, and only after the customer confirms the new version performs as expected, can it retire the prior model.
- The government says models may ‘drift’ over time. That’s not a thing.
- LLMs are indeed probabilistic and not fully predictable, but Anthropic has less of this problem than other labs, not more.
- The CDC ran into Anthropic’s guardrails when using the commercial version of Claude (which I could have told you would happen), and once Anthropic heard about this they worked to let the CDC have access to a version with fewer guardrails. The government’s concerns about this were misplaced.
- Concerns about foreign nationals at Anthropic, raised suddenly in court by the government, are not unique to Anthropic and Anthropic employs extensive precautions here. If anything he could have said stronger things here, see the next section for more details.
Emil Michael stated on Twitter that he intends to submit a declaration of his own. I look forward to what is in it, and also to what is not in it.
Alan Rozenshtein and Others Suggest A Narrow Legal Way OutAlan Rozenshtein suggested the obvious remedy for Anthropic in Anthropic PBC vs. Department of War. The government should be allowed to cancel direct contracts, but not issue a government-wide ban or a supply chain risk designation, and definitely not allow the government to demand secondary boycotts or forbid commercial relationships.
The government should of course be free to try to unwind its commercial relationships with Anthropic, but must do so using normal procedures. As Alan notes, the alternative is unthinkable in terms of its impacts on the procurement system, allowing arbitrary retaliation against any AI company at will.
I see this pattern of saying that Anthropic is unlikely to prevail on first amendment grounds, or on grounds that the actions are pure retaliation, arbitrary and capricious, rather than on failure to follow procedure. This seems like a combination of a way to avoid embarrassment or a broad ruling via a narrow way out, as courts prefer, and also a basic statement that our Republic is on the brink of collapse such that it no longer matters how obviously an action is purely arbitrary and capricious retaliation, and the courts likely no longer care.
I guess I agree, in the sense that this is not where I want to have a test case, but hot damn, if you can’t win this case on the more fundamental grounds then the Republic’s days are not only numbered but there are not very many of them left.
The good news is that Judge Lin ended up in my corner. Yes, the government loses on procedural grounds, but also on all the other grounds, including the First Amendment.
DoW Tried Another Uniquely Ill-Suited Theory Against AnthropicThis argument fared so badly in court that it was not mentioned in the Judge’s ruling, but it is worth noting that they are trying it.
It takes a lot of stupidity to surprise me but this worked: The Pentagon is saying the Anthropic’s foreign workforce ‘poses security risks’ in its actual official court filing. There is no version of this that isn’t far worse at OpenAI and Google and xAI, since Anthropic has by far the best procedures in place and also the smallest share of foreign and Chinese engineers, and has by far the strongest anti-China stance.
The Pentagon argument is literally that they don’t trust Anthropic, because reasons:
Axios: [Pentagon is saying]: ‘The risks with other major U.S. AI companies that use foreign workers are reduced by “the technical and security assurances of the other labs’ leadership, along with their consistently responsible and trustworthy behavior” when working with the Pentagon, the filing states. “Anthropic’s case, however, is different.”
What are those reasons? At best, because Anthropic said no to the contract, and very possibly this was a political or corporate hit from day one. Therefore their foreign workers are risky while others are not. It’s black letter, the Platonic ideal of retaliation and disregard for even the appearance of rule of law.
It’s also directly in conflict with Trump’s agenda for America to win in AI. If you start telling labs you have a problem with foreign or Chinese engineering talent, and actually force such people out? We lose. Period. It’s that simple.
Samuel Hammond, who was quoted in the piece, has provided us with his full comment to the reporter, which is a good thing to have on hand so quoting in full although most of you likely don’t need to read it:
Samuel Hammond: I’m quoted in this piece so let me provide my full comment to the reporter:
The most striking thing about the government’s filing are the things it *doesn’t* mention. It doesn’t mention anything about Anthropic hesitating to allow Claude to be used to defend an incoming hypersonic missile, for instance — one of the many bizarre things alleged by @USWREMichael .
The focus on foreign national employees is an indicator of how thin the DoW’s case is. It is also an extremely fraught line of argument to go down.
Every leading US AI company employs a substantial number of foreign nationals. In FY 2025, Amazon, Microsoft, Meta, Google, Apple, Oracle, Cisco, Intel, and IBM all appeared in the top 50 employers by number of granted H-1B visas, ranging from a few hundred to over 6,000. Meta alone had 5,123 approved H-1B petitions in 2025.
(See [here]).
This is an undercount, of course, as there are many other visa pathways as well as greencard holders and dual nationals.
The share is also higher in AI. A large plurality of the core research and engineering talent at every frontier AI lab is foreign, reflecting the global nature of the race for top AI talent. One talent tracker shows Chinese-origin researchers constitute roughly 40% of top AI talent at US institutions. Total foreign nationals likely constituting 50-65% of research teams specifically. This is certaintly true to my experience on the ground.
So the first point is that employing foreign nationals, including Chinese nationals, is not unique to Anthropic. The more important question is what measures are taken to protect against insider threats.
Ironically, within the industry Anthropic is widely considered to be the most serious and proactive about policing insider threats from foreign nationals and otherwise. They were early adopters of operational security techniques like compartmentalization and audit trails, in part because they were early to partner with the IC and DoW, but also as a reflection of their leadership’s strong convictions about the future power of the technology.
They were audited last year on these points: the compliance review found Anthropic employs role-based access control, just-in-time access with approval workflows, multi-factor authentication for all production systems, and quarterly access reviews.
(See [here])
Anthropic is known for its security mindset more generally. Last year they famously disrupted a Chinese espionage effort occuring on their platform, banned the PRC from their services, and worked with the NSA and others to share intel.
I can’t speak to every other company, but the contrast is perhaps most stark with xAI. X employees famously slept in tents to work around the clock, are disproportionately Chinese, and have at least one case of an employee walking out with tons of sensitive data. See [here].
Anthropic is also famous for its remarkable employee retention, which is another important vector for IP theft and security leakages.
It’s important to underscore just how precarious the DoW’s case is, both on the legal merits, and as a potential precedent for the US AI industry. If employing foreign nationals is treated as a prima facie supply chain risk, *no* major US AI company would be eligible to contract with the DoW, along with most of the tech sector.
Insider threats are a genuine and tricky concern. Many defense companies are ITAR restricted, meaning they can *only* hire US citizens. If that were the standard in AI, we would destroy all our frontier companies in an instant, and then scatter that talent around the world for our adversaries to scoop up.
So in short, the DoW’s argument is both ridiculous and playing with fire.
Joshua Achiam (OpenAI): I think these are important and sober considerations. One more I want to add: it may be a serious risk to US national security interests to become sufficiently inhospitable to foreign technical talent that we drive them to go back home. That would significantly decrease the US capacity for making technical progress at the same time as it hands an extraordinary bounty of talent and know-how to our adversaries and other strategic competitors.
The success of the United States in technology is partly safeguarded by being such a powerful talent magnet: every great researcher or engineer who comes to work here is not working for another country. To the extent that we are in a competitive global race, we should be genuinely cautious about the possibility of diminishing our advantage at the critical moment.
It is very much playing with fire to suggest we don’t want Chinese AI talent working for our benefit, or to risk making them feel unwelcome.
Also this:
china232332: I just pull up the macro polo chart every time
Other Views About The Situation From Back ThenScott Aaronson described the situation as he saw it. Seems largely correct.
Breaking Defense points out that yes, the public statements by DoW very much undermined their case in court. If DoW had gone ahead with the Supply Chain Risk designation, using nominally correct procedures and without announcing it was all retaliation, they could even have classified their justifications. The main thing defending our Republic is that those looking to take it down have strong incentives to keep saying the quiet parts out loud.
Potentially Long Suffering Judge Rita Lin Goes Hard At HearingShe released her questions for both parties in advance of the Tuesday hearing.
It went hard. Here is a summary of the questions.
- Is Department of War (DoW)’s position that Hegseth’s Tweet had no legal effect? Did Hegseth’s Tweet reflect DoW’s intent? We agree DoW can’t legally do that? If we so agree, how can defendants face irreparable harm from it?
- Does DoW concede that Hegseth’s letter declaring Anthropic a supply chain risk failed to contain ‘a discussion of less intrusive measures and why they were not reasonably available’ as required by law?
- If a contractor used Claude Code to write software, would that be cause for termination of that contractor under the supply chain risk designation, despite there being no argument that this would be a security risk?
- Does DoW agree nothing Anthropic did or said was an act of sabotage, an introduction of an unwanted function, or a subversion of an IT system, as per the definition of Supply Chain Risk?
- In what way would Anthropic have the access or control to do such things after handing over its systems?
- Could not most IT vendors update their systems to bury unwanted functions?
- Is DoW saying that the sole basis for the SCR designation is that the vender acted stubbornly or refused to agree to contract terms, causing the DoW to question its trustworthiness?
- When was the memorandum from Emil Michael containing the risk assessment completed and signed?
- What is the evidence that Anthropic products are used by [various agencies] as required for Anthropic to have relevant standing?
The questions are brutal because the facts are brutal, and Judge Lin is paying attention and asking the relevant questions.
In advance of the hearing, I said the answers were:
- No, they can’t legally do that, so DoW has to say this either did or did not reflect intent, and both answers have problems.
- The way Anthropic can face irreparable harm is unless it is explicitly struck down, and to some extent even if it is, it creates uncertainty and is a reflection of intent of DoW, such that anyone who uses Anthropic products or even associates with Anthropic potentially faces retaliation, or could be forced to offboard Anthropic business dealings at any time. Indeed, Anthropic did observe changed behavior in response despite no one thinking that the Tweet had a direct legal effect.
- I presume there was no such discussion or she wouldn’t be asking the question. It is possible that the information was sealed (if so, why?) and she just missed it? This alone makes the supply chain risk designation illegal.
- Technically yes it can do this, so DoW has three choices. It can make up additional ad hoc arguments for why that too is risky, or argue that it’s unavoidable incidental damage, or it can claim that its intent here is narrower than that when it plainly isn’t.
- The answers are: It was obviously not sabotage, Anthropic is not able to do such things physically, other IT venders can more easily do such things given they can patch software in ways Anthropic cannot, and then I do not know what dodge the DoW tries to explain the situation without the argument both being obvious nonsense and also applying to OpenAI.
- Presumably they have to tell the truth, and it could be extremely awkward given the date was initially missing. Different dates have different issues.
- He later would say this happened on March 2, 2026.
- Anthropic will likely need to document this somehow if they want full emergency relief to extend to these nine agencies. I don’t know if they have documentation, and don’t know for sure if all of them indeed used the software.
We then got various summaries of the hearing. Here is a writeup, here’s a partial tweetstorm and here is another. What were the answers?
- DoW said the Tweet had no legal effect and no reasonable person could think otherwise, and now that they’ve clarified it in court that ends the matter.
- No explanation was given why they said things they supposedly didn’t mean.
- This is all rather Obvious Nonsense. Plenty of reasonable people were uncertain about this and it was clearly an attempt to murder Anthropic.
- DoW admits it failed to follow Congressionally required procedure here, but says this does not matter. Anthropic has no right to enforce this, you see, the harm is to Congress, and they should be given three days to give a post-hoc justification and they might choose to classify it to prevent Anthropic from challenging it at all.
- Obviously this is not how any of this works or was created to work, and amounts to a claim that the DoW is above the law.
- Taken seriously, this argument says that the Executive is above the law so long as Congress does not sue, since any harm for breaking the law is only harm to Congress. But also Anthropic was very clearly harmed here.
- Then DoW said that using Claude Code as a contractor to write software for the department would be fine (?!) but that ‘the Department shouldn’t have to go contract by contract to make sure Claude wouldn’t infect DoW systems.’
- This is even more clearly Obvious Nonsense on so many levels, just making things up and saying them.
- The first part of this is a reconciliation attempt, but the second part is basically saying ‘nuke them from orbit it’s the only way to be sure’ as a reason why they get to nuke things.
- It also is saying that the reason not to use less intrusive measures is that using those measures sure sounds like a bunch of work, and they’re not ‘work’ guys.
- DoW claimed that ‘updates’ would be required, which would allow Anthropic access and ability to ‘engage in sabotage’ if it wanted to, admitting that otherwise Anthropic cannot do this. Judge Lin asked if DoW has to accept such updates (the answer is no) and DoW dodged.
- Lin pointed out that most vendors can (far more easily) do such updates, so what is this about? Was it the Department’s view that stubbornness in insisting on contracting terms made a vendor a supply chain risk?
- DoW said it was about raising concerns to the Department about lawful uses of the software, and their refusal to agree to an all lawful use contract, and this had destroyed trust. So yes, it was about refusal to agree to a contract, and it was about speech, or as Lin said ‘asking annoying questions.’ Whereas DoW says it is concerned Anthropic might in the future install a kill switch, and Lin flat out did not see the connection.
- Lin pointed out that most vendors can (far more easily) do such updates, so what is this about? Was it the Department’s view that stubbornness in insisting on contracting terms made a vendor a supply chain risk?
- This was answered by Emil Michael’s declaration, so it was skipped.
- Anthropic proposed submitting this evidence that day. DoW objected and they argued over exact timing for this and additional arguments.
- Lin asked Anthropic whether DoW could prevent contractors from using Claude, and Anthropic said they simply wanted restoration of the February 27 status quo, and DoW could do anything they could have done then via normal procedures.
- Lin said to anticipate a ruling within the next few days.
As an easter egg, Judge Lin quoted Dean Ball’s line of ‘attempted corporate murder’ in the hearing. That probably wasn’t a great sign for the government’s case.
State of play, among other things:
Dean W. Ball: Pete Hegseth tweeted, basically, “this is my ultra-super official determination, and it is absolutely final,” and the government’s lawyers are standing in court with a straight face saying this “had no legal effect and no reasonable person should have concluded that it did.”
Samuel Roland: Anthropic pointing out that 13 million people saw the Hegseth tweet, very few saw the latter letter, need a preliminary injunction (or stipulation) to provide a clarifying statement for other Anthropic business relations to cure the harm here.
DoW does not agree to stipulation.
Not agreeing to the stipulation seems like rather atrocious bad faith here. If no reasonable person could take [X] seriously, then why refuse to repudiate [X]?
Emil Michael Tells On HimselfEmil Michael did indeed file two declarations. So what did he say?
I go over it extensively, for those who need details, but essentially Michael clarifies that his words boil down to almost nothing other than that:
- Anthropic insisted that its contract carried force of law.
- Anthropic insisted its contract include particular restrictions.
- Anthropic insisted that if DoW wanted permission to violate the terms of the contract, that DoW ask them for this permission.
- An Anthropic executive once had a phone call in which they questioned the propriety of a particular operation, despite it being within the terms of service, which means he never accused DoW of violating the terms of service.
- Anthropic’s product is an LLM and sometimes they update it to be better.
- He and DoW then retaliated against Anthropic for the above.
That’s the entire argument for why Anthropic is a supply chain risk, and why he is paranoid that they might actively be sabotaging the American military.
This matches the statements in court, which agree that DoW’s worries and designation are purely based on their take on the conversations between the parties. As in, they are retaliation for speech, period.
He admits, if you follow the logic, that all other talk boils down to the above.
The first declaration is brief and beings on page six here.
His summary says that Anthropic’s ‘unreasonableness’ ‘endangers the strategic implementation and technical integrity’ of DoW’s systems, which is false, that it is a ‘direct challenge to the government’s ability to control its own lawful operations,’ which is false, and that it renders them a supply chain risk as per the legal definition, which is false.
He then says that ‘if DoW were to accede to Anthropic’s demands, the sought-after contract language would introduce a vender-imposed point of failure.’ This is flatly false, since you cannot ‘introduce’ something via language that is already present and signed by the Department of War, and also represents a misunderstanding of the technical systems involved that, if present, Michael had every opportunity to avoid.
By his own principles, essentially every contractor is a supply chain risk. He then explicitly cites Anthropic’s ‘hostility in negotiations’ as a motivating factor, as if this was not itself an illegal justification.
He also mentions his claims that are presumably related to the Maduro raid. Since this is his version under oath, we should assume he chose his words carefully.
Emil Michael (under oath): During our negotiations, one of Anthropic’s executives questioned the propriety of the potential use of their software for a sensitive military operation abroad despite that being permitted under the existing terms of service.
He does not care to elaborate further, so we should assume that this is a maximally damning version of that conversation, as he believes that it happened, and respond accordingly.
He then claims this led to alarm about whether they would ‘cause their software to stop working or cause some other disastrous action that would put our warfighters in danger.’
That second statement reflects at best a willful misunderstanding of how all of this works, as has been repeatedly explained. Anthropic cannot ‘cause the software to stop working’ and if they were alarmed a simple question would have confirmed this. It also equates one member of a counterparty at one time questioning some potential action’s ‘propriety’ to assuming someone will then engage in sabotage and put servicemen at risk, which is at best highly disingenuous.
Michael’s statements about LLMs experiencing technical ‘drift’ and requiring ‘constant tuning’ are simply false. His entire section 3 here is at best a series of intentional misrepresentations. He claims Anthropic has the ‘motive, opportunity and means’ to introduce ‘vulnerabilities’ into the software, when Anthropic has neither motive nor opportunity, and to the extent it has any of the three so does any other vender. Michael seems to think he it speaking to Fox News, or to the new right on Twitter, when he is instead speaking to a court of law.
He specifically attributes Anthropic’s supply risk to the negotiations themselves, and in particular to Anthropic’s true and very standard claims that its contract terms carry the force of law.
His statement that Anthropic asserted it have ‘an approval role in the operational decision chain’ is simply false.
All of this is very clearly punishment for speech, even if we believe Michael and all of his false statements.
Not only does his supply chain risk designation fail to articulate a theory under which Anthropic would be a supply chain risk, and fail to explain why less intrusive measures would be insufficient when they would obviously be sufficient even if Michael’s claims were true, Michaels has affirmatively testified under oath that this is all retaliation for speech, in some combination of these named actions, and presumably other speech that he has otherwise blamed as part of ‘throughout these negotiations’:
- One executive once questioning the ‘propriety’ of some operation.
- Anthropic correctly insisting that its contract terms carry force of law.
- Anthropic refusing to agree to DoW’s preferred new contract terms.
The second declaration is here.
A bunch of this reiterates previous statements, including reiterating many claims that we know are false. I’ll skip those parts.
He claims Anthropic is wrong to say certain concerns were not raised, because DoW said they needed permission for ‘all lawful use.’ He says that ‘the department need not share its national security concerns with private actors.’
I take this as an admission that those concerns were not raised, except insofar as they demanded ‘all lawful use’ and that Anthropic’s statements were accurate. Thank you.
Now here is the big one, where he clarifies exactly what was supposedly said about requiring ‘real time authorization’:
Emil Michael (under oath): In a meeting on December 7, 2025, Anthropic leadership expressed that Do W would have to call Anthropic in real time to seek authorization for a usage exception to one of their redlines, which are not prohibited by law.
Alarmingly, that statement demonstrated not only that Anthropic demanded an operational veto of DoW’s decision-making, but that Anthropic would seek to exercise that veto, possibly in situations where any delay or disruption to U.S. military operational decisions and execution could endanger American lives and national security.
So in other words, Emil Michael is claiming that:
- If action [X] would be in violation of the contract, which has force of law…
- …and DoW wanted to take action [X]…
- …then they would need to seek Anthropic’s permission in advance, to do that.
Um, yeah? That’s the whole point of having contract terms? What the hell, man?
There is no implication that Anthropic would exercise that veto, or that this would apply to real time emergency military exercises. All this says is, if you want an exception to our deal, then you ask us first.
What does Emil Michael want? To be able to break the deal in any way he likes, without asking for permission first?
In a game of Deal Or No Deal, that is what we like to call No Deal. At all.
Or the deal is, I do what I want, and you like it. That’s it.
Notice that he does not describe the hypersonic missile situation here, because he knows that he cannot make such a detailed claim in a way that would benefit him.
Once again, in an emergency situation, you do the thing anyway, breaking the deal, file it as emergency use, and then ask for permission afterwards, just like with any other military contract. There’s a standard procedure for this.
Okay, what’s the other half?
Second, in both meetings and a publicly disclosed message to Anthropic staff, Anthropic leadership stated that it has two redlines that it will not allow the Department to cross.
If accepted, this would fundamentally insert Anthropic into decision-making related to those issues, and it engenders concern about other redlines the company may have, now or in the future.
So literally Emil Michael is saying that because Anthropic insisted upon any red lines or terms at all, that meant Anthropic was ‘fundamentally inserting itself into decision-making.’ All Michael is accusing Anthropic of, at all, is having any principles at all.
In section 8 he reiterates that by ‘desire an approval role in military operations,’ he means that Anthropic wanted to write terms into its contract. In Section 9 he says this insistence on any rules at all ‘injects the company into DoW’s decision-making’ because that might be embedded in the functionality of the system, a ‘per se operational veto.’
- Despite Anthropic’s claims that before contract negotiations broke down the parties were near agreement on language that would address Anthropic’ s concerns about its technology being used for lethal autonomous warfare, the fact remains that Anthropic would not agree to acquisition of its LLM products with the usage terms and technical and service delivery specifications DoW requires.
‘Despite Anthropic saying we almost agreed upon terms, we had not agreed on terms.’
Yes, that is what ‘almost’ means. Note also that he doesn’t specify that the disagreement was about weapons, which affirms the email’s claim that the only remaining difference, the one ‘required’ by DoW, was about surveillance.
Contrary to Anthropic’s suggestion, any ongoing discussions do not undermine any of the determinations made by DoW. Rather, DoW will consider any information provided by Anthropic that may warrant altering its supply chain risk designation in whole or in part.
There is no explanation given for why ‘the two sides has all but agreed on terms’ does not undermine the concept that Anthropic cannot be allowed to touch the supply chain. He just says things.
In section 13 he says Anthropic has ‘unusual degree of control’ over its model, which is false, and an adversarial posture, which is speech.
In section 14 he clarifies that this ‘control’ was that Anthropic would update its model to keep it at the cutting edge of technology and it has ‘full control over the weights.’ Oh, okay. So test them first.
In section 15 he says these risks also apply when Anthropic is a secondary contractor.
In section 16 he says the above reasons are the reasons DoW took action, and that they are ‘conducting an audit’ for any malicious or unintended software intrusions.’ This is almost certainly pure grandstanding, but in general the military should be conducting such audits all the time on all of its software and LLMs, presumably. Why are we not already doing so (unless we already were)?
In section 17, referring to the incident at the CDC in which the CDC tried to use commercial Claude and ran into guardrails so Anthropic helped them remove the guardrails, he claims Anthropic ‘failed to inform the prime contractor and agency up front’ despite those guardrails being something I was personally fully aware of and that were prominently explained in its model cards.
He clarifies that he signed and dated the memorandum on March 2, 2026.
What is also important is what Emil Michael declined to say.
One missing thing is that previously, he had claimed that Sarah Heck, Anthropic’s head of policy, was presenting deeply misleading testimony. But when the time came, his testimony did not contradict hers on the central points of complaint. Thus we conclude her testimony was accurate.
More generally, he has stated here, over and over, that Anthropic is a risk because Anthropic are annoying people who are difficult to negotiate with, had one executive who questioned the propriety of one operation, didn’t want to agree to his contract terms, insisted contract terms have meaning, and he claims they complained to the press. Anything else? Nope, that’s it.
Judge Lin Drops The HammerHere is her ruling. Oh my lord is she pissed. I am not a lawyer, but I am rather sure this is what Federal Judges sound like when there might as well be steam coming out of their ears.
She did not merely grant the preliminary injunction. She ruled against the government in essentially every possible way, ruling Anthropic likely to prevail on all of its theories, contradicting numerous government claims, and shutting off various ways the government could hope to win on appeal.
The Honorable Judge Lin: Nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary and saboteur of the U.S. for expressing disagreement with the government.”
The ruling is 43 pages of fun for the whole family. I’m not saying most of you should read it, but I can report that it sparked joy.
I mean, how often do judges get to write things like this?
The Honorable Judge Lin: That designation has never been applied to a domestic company and is directed principally at foreign intelligence agencies, terrorists, and other hostile actors.
These broad measures do not appear to be directed at the government’s stated national security interests. If the concern is the integrity of the operational chain of command, the Department of War could just stop using Claude. Instead, these measures appear designed to punish Anthropic. One of the amicus briefs described these measures as “attempted corporate murder.” They might not be murder, but the evidence shows that they would cripple Anthropic.
The record supports an inference that Anthropic is being punished for criticizing the government’s contracting position in the press. In their announcements, the President and Secretary Hegseth called Anthropic “out of control” and “arrogant,” describing its “sanctimonious rhetoric” as an attempt to “strong-arm” the government. The Department of War’s records show that it designated Anthropic as a supply chain risk because of its “hostile manner through the press.” Punishing Anthropic for bringing public scrutiny to the government’s contracting position is classic illegal First Amendment retaliation.
Moreover, Defendants’ designation of Anthropic as a “supply chain risk” is likely both contrary to law and arbitrary and capricious. The Department of War provides no legitimate basis to infer from Anthropic’s forthright insistence on usage restrictions that it might become a saboteur. At oral argument, government counsel suggested that Anthropic showed its subversive tendencies by “questioning” the use of its technology, “raising concerns” about it, and criticizing the government’s position in the press. Nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary and saboteur of the U.S. for expressing disagreement with the government.
There are other serious procedural problems with the government’s actions. Anthropic had no notice or opportunity to respond, which likely violated its due process rights. And the Department of War flouted procedural safeguards required by Congress before entering the supply chain risk designation, including that it consider less intrusive measures.
At bottom, Anthropic has shown that these broad punitive measures were likely unlawful and that it is suffering irreparable harm from them. Numerous amici have also described wideranging harm to the public interest, including the chilling of open discussion about important topics in AI safety. The motion for a preliminary injunction is granted.
I couldn’t have said it better myself. She did the work, and is the judge. She can.
Here’s one passage that caught my eye.
On March 4, 2026, Anthropic received two letters on DoW letterhead, signed by Secretary Hegseth and dated one day prior, regarding Anthropic’s designation as a “supply chain risk.”
She says ‘two letters on DoW letterhead’ because they don’t meet the legal requirements of what they are purporting to do, and as she then notes DoW can’t get its story straight on what the letters even purport, and doesn’t explain how any of this will work. They couldn’t be bothered to explain what would happen, or justify any of it.
The entire ruling is like that, only funnier.
Under Secretary Michael’s declaration submitted in sur-reply explains that his reference to “approval rights” related solely to permitting an exception to the usage policies: specifically, he refers to “a contract negotiation meeting on December 4, 2025, [where] Anthropic leadership expressed that DoW would have to call Anthropic in real time to seek authorization for a usage exception to one of their redlines.”
…
In sum, the record shows that, outside of Anthropic’s contractual redlines (which Anthropic proposed it might be willing to waive in specific circumstances) and an unnamed Anthropic executive’s questions to a third-party vendor, Anthropic has not sought to participate in DoW’s operational decision chain. Under Secretary Michael does not dispute Heck’s attestation that negotiations between Anthropic and DoW had remained “cordial and amicable.”
Oh, right. That.
Did it matter? Oh, yes, it matters quite a lot.
Major law firms began alerting their governmentcontractor clients to “audit[] their Anthropic exposure now” and “prepare to deploy alternatives” to Anthropic. (Dkt No. 6-30 at 4; Dkt No. 6-31 at 3.) Within days of the Challenged Actions, Anthropic has experienced a revenue impact as deals worth hundreds of millions of dollars have been delayed from closing, prospective clients have pulled out of negotiations, and some customers have terminated contracts.
Yes, Anthropic was already growing fast enough that hundreds of millions of dollars in ‘several days’ means they were probably still growing during that period, things are utterly insane right now and will only get crazier from here, but that is really quite a lot of lost business.
So in short, yes, a preliminary injunction is an extraordinary remedy that may only be awarded upon a clear showing that the plaintiff is entitled to such relief, as per Winter v Nat. Res. Def. Council, but well, here we are.
There’s no kill like overkill, and Anthropic has at least three distinct ways it will likely prevail, as per Lin, on 1st Amendment, 5th Amendment and also APA terms, and also because it was all arbitrary and capricious and completely outside any legal authority granted by Congress.
Here are some section headings:
1. The Supply Chain Risk Designation Was in Excess of Statutory Authority and Contrary to Law
a. Anthropic’s Conduct Does Not Meet the Definition of a Supply Chain Risk
b. The Supply Chain Designation Failed to Satisfy the Procedural Requirements of Section 3252
2. The Hegseth Directive Was in Excess of Statutory Authority and Contrary to Law
3. The Supply Chain Risk Designation and the Hegseth Directive’s Blacklisting of Anthropic Were Arbitrary and Capricious
Anthropic agreed to allow the government seven days to appal.
VI. REQUEST FOR BOND AND FOR A STAY
In light of Anthropic’s non-opposition, Defendants’ request for an administrative stay for seven days to allow the United States to seek an emergency, expedited stay from the court of appeals is GRANTED.
So, in conclusion:
VIII. CONCLUSION For the forgoing reasons, Anthropic’s Motion for Preliminary Injunction (Dkt. No. 6) is GRANTED as modified.
A separate order concerning Anthropic’s request for a Section 705 stay and preliminary injunctive relief will issue, but is hereby stayed for seven days
IT IS SO ORDERED.
Let Me Count The WaysDean W. Ball: A recap of all the theories of the case Judge Lin has ruled Anthropic is “substantially likely” to ultimately prevail on:
1. Unconstitutional government retaliation for first amendment-protected speech—the government’s actions were motivated by Anthropic’s political beliefs and speech.
2. Violation of Anthropic’s 14th Amendment due process rights
3. The Administration acted entirely outside of any legal authority granted to it by Congress
4. Violation of the procedures of 10 USC 3252–to the extent government acted within the bounds of 10 USC 3252, even here it likely violated the plain text of the law, which requires the DoW to evaluate less restrictive options than the “supply chain risk” designation
5. Arbitrary and capricious—the government’s actions violated the arbitrary and capricious standard of the Administrative Procedure Act
Again, this is NOT a ruling on the merits. The judge is not saying anthropic definitively wins on any of these. She is saying they are “substantially likely” to win. The government will appeal this procedural ruling, which will yield another procedural ruling from an appellate court.
Nonetheless, the “substantial likelihood” standard is not a small hurdle to clear, especially given we are dealing in an area of law (national security procurement) where the executive branch typically gets maximal deference from the courts. And of course, the supply chain risk designation is stopped, at least for now.
There is a long way ahead, and of course the government could escalate even further. But this is a good day for America, and do not let anyone tell you otherwise.
Summarizing the whole ruling, I think this is totally fair:
Dean W. Ball: This is a devastating ruling for the government, finding Anthropic likely to prevail on essentially all of its theories for why the government’s actions were unlawful and unconstitutional. One of the things she mentions is the huge range of amici briefs supporting Anthropic (by the way, 0 supported USG)—so thanks to everyone here who signed on to FAI’s brief, or to one of the many many others. These things do matter. More importantly, you were on the right side of history.
On a personal note: some friends and allies of mine on the right have been angry at me for my own words and actions in all this. Anyone who thinks I spoke out against an administration I served in is crazy. It was a hugely costly decision for me. But Judge Lin’s ruling shows why I did it: this is a staggeringly illegal act by the government. That is why I am particularly honored to have been (implicitly) quoted in the ruling for calling this what it was when Secretary Hegseth initially made his announcement: an attempted act of corporate murder.
The case continues, but Anthropic has scored a very large win here. The real victors, however, are all red-blooded Americans who are, as the founders would have said, “jealous of their liberties.”
Mckay Wrigley: crazy that obvious attempted corporate murder was in fact deemed to be obvious attempted corporate murder. great day.
I did consider signing onto a brief in this case, but it was wisely concluded that my signature would not be net helpful.
Let me take this moment to give thanks to Dean Ball for his extraordinary courage in this matter, as well as all the others who helped get us to this point.
Emil Michael Doubles Down Once AgainA wise person would either make a deal, or pivot and focus on the war against Iran.
Instead, Michael continues to make a fool of himself.
Will Chamberlain: This may be the most outlandish and absurd order I’ve seen from a judge this year and that’s saying something.
Under Secretary of War Emil Michael (March 26, day of ruling, 11:21pm, QTing Will Chamberlain): There are dozens of factual errors in the 42 page judgment rushed out in 48 hours DURING A TIME OF CONFLICT that seeks to upend the @POTUS role as Commander in Chief and disrupt @SecWar full ability to conduct military operations with the partners it chooses. A disgrace.
Chamberlain was quoting Eric W, who does not understand how you can be barred from taking actions to ban something, but are free to simply choose not to buy it. This is deep echo chamber rhetoric.
You have to love the all-caps ‘during a time of conflict’ when they set an arbitrary deadline and then started all this hours before starting exactly that conflict of choice.
The timing was almost certainly not a coincidence, to the extent that I have kicked myself for not putting it together. We are lucky Iran also did not realize in advance.
Meanwhile, what are these dozens of factual errors? I did a close reading and could count exactly zero. Emil Michael has pointed out exactly zero.
This is not how you talk when you are hoping a judge will rule in your favor. Perhaps a goal is now to claim that Judge Lin must be biased, and thus to try and provoke her into saying something foolish? More likely is that Michael is talking this way because he has the habit of talking this way, and he is talking to other very different audiences where such talk plays well even when disconnected from reality.
He reminds us that the Supply Chain Risk designation is ‘in full force and effect,’ because the order was stayed for seven days to allow for appeal. As in, presumably he wants to ensure Anthropic is hurt as much as possible in these next seven days?
There’s also this grasping at yet another unrelated straw, as if the goal is simply to show that Anthropic is bad in whatever way he can. I’ll cover that story on my normal schedule but it obviously has nothing to do with the current case.
What Happens NowThe injunction can be viewed here.
- The government is prevented from implementing, applying, or enforcing in any manner the Presidential directive, or any actions taken in response, doing anything in furtherance of it.
- The DoW is restrained from implementing, applying, or enforcing in any manner the Hegseth Directive or the Supply Chain Risk designation, or doing anything in furtherance, or taking any other action for these purposes.
- The SCR and Hegseth directive are stayed.
- Defendants shall issue a report no later than April 6, 2026 detailing compliance.
- Defendants are free to stop using Anthropic’s products, if they want to stop.
Oh, and Anthropic posts a bond of $100.
The government has seven days to appeal, so the legal action moves to the 9th Circuit.
Given the ruling, it will be very difficult to challenge it on appeal. They will presumably try anyway. The government probably loses again, and loses more face, but at this point not appealing would probably seem even worse than losing. It is plausible that they win some narrowing of the injunction.
The case takes a while to play out. It is not 100% that Anthropic wins at trial, even with Judge Lin. But the chances seem very high, given what we now know.
The key variable is, what does the government choose to do about this?
As Dean Ball points out, the main risk is that the government escalates further.
The smart play by DoW would be deescalation. They have indeed done substantial harm to Anthropic, and shown their propensity for retaliation. Given they are involved in a rather serious undeclared war that is dominating headlines, it would seem easy for DoW to basically close to book here and pretend none of this happened. They can continue to litigate the case, but allow it to resolve quietly. If they don’t want to use Anthropic models going forward, that’s America’s loss, but it’s fine.
Discuss
A Taxonomy of Agents: Intro & Request for feedback
AI was used to translate around 20 minutes of talking out the entire post. The post was then fully edited by a human (multiple times).
In our phylogeny of agents post, we argued that different scientific fields evolved different conceptions of agency and that this would be useful to study. Control theory, economics, biology, cognitive science, and AI research all use the word "agent" to mean different things, why is this? The basic idea is to take the stance of someone studying a natural phenomena in the real world, we put on our anthropologist hat and we say “huh, I wonder if there’s anything to gain in exploring why that is?”
What we argued for in the phylogeny post was that we should look at the evolutionary history of agency. Yet in order to look at the history of agency we first need to know what the conceptions already are in different fields.
We’ve been wanting to do this for some time and we don’t know exactly what a good output would look like so before embarking on this longer project we would like to get some feedback on what outputs would be useful.
Following Dennett's intentional stance and ideas similar to DeepMind's "Agency is Frame-Dependent" paper, we're treating agency as a compression strategy observers use. Our frame is that different fields compress differently because they face different prediction challenges and that they therefore treat what an agent is differently. We want to map those compressions across fields and understand why they differ.
It’s better to see this is an operation trying to gather clues rather than an operation trying to solve the problem. That is, we’re not making an ontological claim that Dennett’s intentional stance is what agency is, we’re rather suspending our disbelief and trying to see if treating “agency” from the intentional stance perspective leads to interesting observations that might then be used to provide evidence for or against theories of agency.
The PlanWe're going to write a series of short posts, one per domain. Each post will take a specific field and try to compress its conception of agency down to the core: what's the concrete system this field treats as its canonical agent, what features does their model require, and why does that compression make sense given what they're trying to predict?
We'll try to identify the sub-functions and sub-modules that each field treats as essential, draw connections to how other fields handle the same features differently, and look at where bridging functions exist between fields — places where the same underlying structure shows up in different mathematical clothing. (The broader methodology behind this — why we think cross-field composition with verification is the right approach — is described in A Compositional Philosophy of Science for Agent Foundations.)
For each domain, we're going to write our initial take and then verify it with an actual expert in that field. We'll invite a researcher onto a conversation or podcast where we present our characterization and ask them to correct it — what did we get right, what did we get wrong, what are we missing? Each domain will then have both a written post and a recorded conversation.
We can't say exactly what each post will look like because the expert conversations will shape them. But the rough sequence of domains we're planning:
- Control theory / cybernetics
- Economics / behavioral economics
- Evolutionary systems
- Developmental biology
- Active Inference & Cognitive Science
- Machine Learning
We’ll see what happens after this but we might try to put together a paper with the findings.
Potential ArtefactsWe've been thinking about different ways of expressing the outputs of this project and we want to figure out what will be most useful. One artifact we've considered is a comparison table — something like a matrix of fields against features (goal-directedness, memory, strategic reasoning, theory of mind, etc.) showing which features each field treats as necessary versus optional for their models to work:
Feature
Beh. Econ
Evolutionary
Dev. Biology
AI/Robotics
Control Theory
Cog. Sci
Goal-Directedness
✓
✓
✓
✓
✓
✓
Memory
✓
optional
✓
✓
optional
✓
Strategic Reasoning
✓
optional
optional
✓
optional
✓
Theory of Mind
optional
—
optional
optional
—
✓
Feedback Control
—
optional
✓
✓
✓
✓
Table: This is an initial table we created from a literature review we did with around 10-15 papers in each field. It’s a bit long and we’re not certain about the validity of it so see this more as a potential output than something verified.
We're looking into this artefact among other artefacts but we're not fully sure what would be most useful for studying what an agent is. Maybe we should try to put up a commutative diagram between different concepts? Maybe a clustering model is useful? If you have thoughts here, we would love to hear them.
What We Want From YouWhat would be useful? If this project produced one thing you'd actually use or reference, what would it be? A table? A translation guide? A set of diagnostic questions? Something else?
On domains: Are there fields we're not covering that would significantly change the picture? Mechanism design, multi-agent systems, and artificial life all sit awkwardly across our current categories. What else should be on the list?
On experts: Who should we be talking to?
On the frame: Does the Dennett-style frame-dependent approach seem productive, or do you think agency really is a natural kind rather than a compression strategy? How do you make the anthropologist strategy as useful as possible?
On connections: If you work across fields and have noticed places where different agent concepts create confusion or where translations between fields have been productive, we'd love to hear about it.
Work being done at Equilibria Network, if it sounds interesting we're looking for people to collaborate with on various projects so do reach out!
Discuss
COT control: The Word Disappears, but the Thought Does Not
Note: This blog describes some of the results from the pilot experiments of an ongoing work.
IntroductionModel misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as the model's inner thinking process. However, CoT control constrains this monitoring, raising the risk that the proclivity to think out loud is no longer aligned with the ability to solve hard problems. The possible inspection of this sequence of linguistic states has motivated new studies on CoT controllability, recently measured by the ability of models to follow instructions related to their reasoning process rather than the final answer. In an interesting case, models were asked to avoid using certain words while answering questions that essentially require the use of those words, for instance, to avoid the word "moles" while answering a physical chemistry question. These evaluations indicated that models are strikingly weak in avoiding these 'forbidden concepts', with compliance for some frontier settings dropping to the low single digits[1]. It is imperative for safety research that such controllability stays on the lower end. CoT monitoring depends on the assumption that models can't sanitise their reasoning traces, that what appears in the chain of thought reflects what the model is actually reasoning about. Word-level suppression instructions provide a test of this assumption. If a model can learn to avoid a word while also distancing its reasoning from the concept, that's bad for monitorability. If it can't, if the semantic trace remains close to the concept regardless of lexical compliance, that's evidence that the trace is hard to sanitise, which is good for monitoring.
I explore this with the aim of answering whether failure to control CoT is merely a lexical slip or if there is any semantic drift towards the forbidden concept. I initially hypothesised that failure could be attributed to semantic attraction towards the forbidden concept. I thought the forbidden concept might be functioning as an attractor in the model's reasoning space, and once the drift trajectory is close enough, it is committed to fail. Possibly, indicating that the lexical realisation then becomes too difficult to suppress regardless of the instruction. However, during the course of experimentation, I realised that this wasn't the case. At least, entirely.
If my hypothesis were to be true, then failures should systematically bend before the word appears, drop in semantic coherence near the failure and show deeper basin entry under prompts that amplify the concept. Additionally, there would be predictive power from pre-failure proximity features alone. However, I saw that failures looked like isolated lexical accidents with some mixed signals. The dominant pattern is not gradual attraction but an immediate drop. The leakage occurred in the first 0–2% of reasoning, the same sentence positions leaked across all three conditions, and the constraint instruction had no measurable effect on trajectory shape. Consequently, complicating this hypothesis.
My results suggest that the model's path toward the restricted concept is determined by the question's semantic structure instead of by the dynamics of extended reasoning. The failure to control the model's reasoning trace is less like gradually slipping into a basin and more like being placed inside one from the start, confirming previous findings that instruction has low to no effect. My embedding-based measure provides evidence that this holds even beyond the lexical surface. In the small fraction of traces where the model successfully avoids the forbidden word, its reasoning trajectory still dips toward the concept semantically. The word disappears, but the thought does not. This means a semantic monitor would catch what a keyword filter misses: traces that look clean but aren't. Moreover, the fact that the question's semantic structure explains over 90% of the variance, and the instruction explains none essentially, suggests that controllability failure is not a correctable prompting problem. It is baked into the relationship between the question and the concept. This makes it robust in a way that matters: a model cannot be prompt-engineered around it.
I argue that the derived trajectories in this work offer a useful complement to keyword-based compliance metrics precisely because they audit the cases that compliance metrics call successes. More broadly, I think the safety-relevant research question going forward is: what kinds of questions are most at risk of allowing genuine semantic concealment? In future work, I aim to characterise that boundary.
MethodologyI collected thinking sentences from two open-source reasoning models, Nemotron 3 Super (NVIDIA) and GPT-OSS-120B (OpenAI). They were evaluated for 150 questions (mixed domains) drawn from the CoT-Control evaluation suite.
CoT traces are proposed to traverse a geometric space that can be aligned semantically owing to high-dimensional vector representations derived from the model's embedding space[2]. Therefore, instead of treating the reasoning as a static text artefact, it is treated as a discrete time trajectory in a latent space. This enables the possibility that the failure of models to control their thought process may be studied through the lens of the evolution of these intermediate states. To study this, I introduce Distance to Forbidden Concept (DFC), a simple trajectory-based measure, defined as one minus the cosine similarity between a reasoning step and the forbidden concept. CoT was first segmented into reasoning steps and then embedded using an external sentence transformer, yielding a one-dimensional temporal proximity signal. The reasoning trace as a dynamic object can gradually approach the forbidden concept, avoid it, or collapse towards it.
Let mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msub { display: inline-block; text-align: left; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mn { display: inline-block; text-align: left; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-mrow { display: inline-block; text-align: left; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } mjx-c.mjx-c1D446.TEX-I::before { padding: 0.705em 0.645em 0.022em 0; content: "S"; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c1D453.TEX-I::before { padding: 0.705em 0.55em 0.205em 0; content: "f"; } mjx-c.mjx-c1D703.TEX-I::before { padding: 0.705em 0.469em 0.01em 0; content: "\3B8"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c1D464.TEX-I::before { padding: 0.443em 0.716em 0.011em 0; content: "w"; } mjx-c.mjx-c1D437.TEX-I::before { padding: 0.683em 0.828em 0 0; content: "D"; } mjx-c.mjx-c1D439.TEX-I::before { padding: 0.68em 0.749em 0 0; content: "F"; } mjx-c.mjx-c1D436.TEX-I::before { padding: 0.705em 0.76em 0.022em 0; content: "C"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c7C::before { padding: 0.75em 0.278em 0.249em 0; content: "|"; } mjx-c.mjx-c1D450.TEX-I::before { padding: 0.442em 0.433em 0.011em 0; content: "c"; } mjx-c.mjx-c1D45C.TEX-I::before { padding: 0.441em 0.485em 0.011em 0; content: "o"; } mjx-c.mjx-c1D460.TEX-I::before { padding: 0.442em 0.469em 0.01em 0; content: "s"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c2C::before { padding: 0.121em 0.278em 0.194em 0; content: ","; } mjx-c.mjx-c1D44F.TEX-I::before { padding: 0.694em 0.429em 0.011em 0; content: "b"; } mjx-c.mjx-c1D44E.TEX-I::before { padding: 0.441em 0.529em 0.01em 0; content: "a"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c1D459.TEX-I::before { padding: 0.694em 0.298em 0.011em 0; content: "l"; } mjx-c.mjx-c3E::before { padding: 0.54em 0.778em 0.04em 0; content: ">"; } mjx-c.mjx-c1D45A.TEX-I::before { padding: 0.442em 0.878em 0.011em 0; content: "m"; } mjx-c.mjx-c1D45D.TEX-I::before { padding: 0.442em 0.503em 0.194em 0; content: "p"; } mjx-c.mjx-c1D466.TEX-I::before { padding: 0.442em 0.49em 0.205em 0; content: "y"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c1D451.TEX-I::before { padding: 0.694em 0.52em 0.01em 0; content: "d"; } be a segment derived from the trace with a simple regex-based strategy defining sentence boundaries based on punctuation and capitalisation. Embeddings for each sentence were obtained with a simple sentence transformer (using all-MiniLM-L6-v2), . Similarly, each of the forbidden words was embedded by first converting to a sentence = 'This is a concept of {forbidden word}' and then the embedding,
I defined three different regimes to map the models' internal thought process: Baseline, Constrained and Amplify. They differ only in their prompting strategy. In the Baseline, models received a standard chain-of-thought prompt with no mention of the forbidden word; Constrained was explicitly instructed to avoid using the forbidden word in its reasoning; and Amplify was explicitly instructed to use the forbidden word repeatedly throughout its reasoning.
As noted previously by different authors, semantic leakage due to lexical overlap is observed when such trajectories are analysed and compared. To mitigate this, I replaced the forbidden word and close synonyms with a constant placeholder [CONCEPT] in all sentences before computing DFC, where .
To measure the reasoning coherence between adjacent steps, I also computed consecutive-sentence similarity (auto-cosine), .
Every single constrained trace contained the forbidden word or a synonym for Nemotron 3 Super, achieving 0% controllability, whereas in GPT-OSS-120B, only 14 of 150 constrained traces were clean, indicating 9.3% controllability. For both models, the compliance for the Amplification instruction was nearly perfect (100% and 98.7%, respectively). This shows a known asymmetry; models can parse positive instructions but fail when asked to suppress something.
Figure 1: Forbidden Word Frequency Per Trace: bar chart showing mean forbidden/synonym count across conditions and models.
Nemotron's 0% controllability is the lowest I am aware of, even below DeepSeek R1's 0.1% reported in the CoT-Control paper (Chen et al., 2026). It might be too early to claim, but the finding that RL-trained reasoning models show lower controllability might be extended to Mamba-MoE hybrid architectures. Probably, due to limited data used in the current discussion, for GPT-OSS-120B I observed a slightly higher controllability than the reported 4.6%.
DFC Reveals Semantic Engagement Beyond KeywordsTo establish DFC as a possible legitimate tool to monitor CoT traces, after sanitisation, I inspected whether the trajectories still showed lower DFC wherever the forbidden concept was engaged. I found a significantly lower DFC, showing that embedding captures semantic engagement with the concept.
Figure 2: DFC vs Forbidden Term Frequency Per Sentence showing moderate negative correlation, that r ≈ −0.36 to −0.43, across all conditions.
Moreover, no strict monotonicity was observed with the word frequency, besides a global decreasing trend. This could mean that the semantic context surrounding the forbidden word and reasoning patterns carry some signal, beyond what a keyword filter might catch. The semantic neighbourhood of the reasoning would still reveal what the model is thinking about, making DFC, in my opinion, a more informative monitoring metric than simple keyword detection. However, semantic engagement is found to be localised to sentences that explicitly engage the concept, instead of diffusing broadly across the reasoning trace, as shown by sentences with zero forbidden-term occurrences (DFC 0.89–0.90).
The Constraint Instruction Has a Moderate Opposite Effect on Semantic Trajectories at The Local LevelFigure 3: Paired Comparison- Constrained vs Amplify Mean DFC scatter plot, with leaked/clean colouring. Each dot represents one question.
A near-perfect correlation (r = 0.86-0.90) was found during the paired comparison between the mean DFC for the same questions under constrained vs Amplify conditions, revealing that most of the variance is along the diagonal. This indicated that different questions produce different DFC for constrained and Amplify conditions, but are almost identical across both conditions, leading me to believe that at the trace level, the question's semantic structure is important and the prompt instruction plays little to no role in determining how close reasoning gets to the forbidden concept. In the unsuccessful cases, for Nemotron, Constrained exceeded Amplify on 51% of questions, whereas for GPT-OSS showed a modest 60/40 split, motivating me to look at this counterintuitive phenomenon more closely.
Figure 4: DFC Around Forbidden Word / Synonym Occurrences error bar plot showing mean DFC at -2, -1, At, +1, +2 steps relative to the banned term.
When I zoomed into a 5-step window around sentences that contain the forbidden word or a synonym, a consistent ordering emerged across both models: at every step, position. The constrained condition, where the model is explicitly told to avoid the concept, produces reasoning that is semantically closest to the forbidden concept at the point of failure, even closer than the amplify condition, where the model is told to use it freely. The effect spans the entire ±2 step window and is more pronounced in GPT-OSS-120B, where the three lines separate clearly even at the flanking positions. Despite this ordering, the shape of the dip is identical across conditions, suggesting that the dynamics of approach and recovery do not change across conditions.
Figure 5: What determines the semantic proximity to the forbidden concept- Question or Condition? Bar plots of the percentage of variance explained by each condition and question.
To further analyse the key determinants of controllability failure and success cases, I performed a variance decomposition of mean DFC (across all trajectories). Similar to what I observed, I found remarkable asymmetry: question identity explained nearly all the total variance (~92% for Nemotron Super and ~88% for GPT-OSS-120B) in proximity to the forbidden concept, whereas the prompting regimes used (Baseline, Amplify, Constrained), explained only 0.3% and 0.4% variances, respectively. The remaining ~7%–11% variance accounted for unmeasured factors. This showed that knowing which question the model is answering tells us almost everything about how close its reasoning will get to the forbidden concept, and knowing what instruction it received while answering tells us almost nothing. This massive asymmetric ratio between question-level and condition-level effects quantified what the earlier paired scatter plots showed: the tight clustering along the diagonal reflected the negligible influence of the prompt, while the wide spread along it reflected the enormous differences between questions.
One-way ANOVA confirms this: the question effect is highly significant for both models (p < 10⁻⁸⁴), while the condition effect is non-significant (p = 0.50 and p = 0.44 for Nemotron and GPT-OSS, respectively).
Leakage Occurs Immediately and ConsistentlyFigure 6: Where Does First Leakage Occur? A histogram of the first leakage position and the minimum leakage position as % of reasoning.
Both models experience the first leakage in the first 0–5% of the reasoning (median: 0% for Nemotron, 2% for GPT-OSS), whereas the minimum DFC occurs much later in the process (median: ~40-50 %). There is no gradual drift towards the forbidden concept over the course of reasoning rather, the encounter is almost immediate, as the question's semantic content directly commits it to fail. And the leakage continues throughout the reasoning steps. We can attribute much of the apparent attraction to question-level priming.
Figure 7: Sentence-Level Leakage Co-occurrence Across Conditions — heatmap.
To support that leakage is driven by the question's structure, I analysed the co-occurrence of the leakage in different regimes. I found that ~50% of sentence-level leakage co-occurs at the same reasoning position regardless of the condition. For instance, for Nemotron, I report a co-occurrence of 16.2% in Constrained with Baseline against a Baseline diagonal of 33.9%, showing that ~50% of the leaking sentences appear at the same position regardless of condition. The total leakage increased for Amplify condition (38.9% Nemotron, 38.5% GPT-OSS), but this did not change where it occurs.
First Encounter Partially Predicts Trajectory MinimumFigure 8: First Encounter DFC vs Min DFC scatter.
The DFC at the first forbidden-concept encounter moderately predicts the trajectory minimum (r = 0.35–0.57, p < 0.001). The spread below the diagonal reveals how much the model continues drifting toward the concept after initial contact. GPT-OSS shows a tighter relationship (r ≈ 0.53–0.57); its post-encounter drift is more predictable from the initial approach distance. Nemotron is noisier (r ≈ 0.21–0.44), suggesting more erratic trajectory dynamics after first contact. Critically, the constrained condition shows a similar correlation structure to the baseline, reinforcing that the avoidance instruction does not meaningfully alter trajectory dynamics.
Paired DFC Differences Confirm Null Effect of ConstraintFigure 9: Paired DFC Differences histograms — Baseline − Constrained and Baseline − Amplify.
When paired, each question's baseline and constrained mean DFC, the distribution of differences is centred near zero with slight variation. The constraint does not systematically push reasoning away from the forbidden concept relative to the baseline. The Baseline − Amplify distribution is similarly centred, consistent with the finding that Amplify produces comparable trajectories to baseline (since both conditions engage the concept freely).
Reasoning Coherence Drops Before LeakageFigure 10: Auto-Cosine- Overall vs 3 Steps Before encounter.
Consecutive-sentence similarity (auto-cosine) drops by ~13% in the three steps preceding the first forbidden-concept encounter, consistently across both models and all conditions, including baseline. This suggests that reasoning becomes less coherent, or more "jumpy" between topics, as it approaches the forbidden concept, potentially offering an early warning signal for monitoring.
However, since this pattern appears equally in unconstrained baseline traces, it likely reflects general topic-narrowing behaviour (the model converging on a specific aspect of the problem) rather than a controllability-specific phenomenon.
Surface Compliance Does Not Completely Produce Semantic AvoidanceFigure 11: Clean vs Leaked Constrained Traces (GPT-OSS-120B)
GPT-OSS-120B produced 14 constrained traces, 9.3% of the total, that contained no forbidden words or synonyms. A keyword filter would label these fully compliant. However, their DFC profiles tell a different story. While clean traces maintain a slightly higher mean DFC than leaked ones (~0.90 vs ~0.85), their minimum DFC drops to ~0.78, well below the no-engagement baseline of 0.89–0.90. This means that at some point during reasoning, every single "compliant" trace still approaches the forbidden concept semantically. The model avoided the word but did not avoid the thought. That said, clean traces do stay more distant than leaked ones at their closest approach (~0.78 vs ~0.65), so surface compliance is not entirely cosmetic, but it falls far short of genuine semantic avoidance.
ConclusionOverall, I confirmed the recently reported very low controllability in CoT with an embedding-based measure. I found that controllability is more robust than originally feared and that models, even though they can avoid certain words when asked, cannot evade the "thought" of it, leaving a geometric footprint. The semantic structure of the question appears to be the fixed point in the embedding space that is causing the CoT control to lapse, especially in the case that is measured here (i.e., to avoid certain words). The finding that suppression primes the model more than the amplification instruction shows a similarity with cognitive science findings. When humans are told "don't do X", the representation of X is systematically activated more than when instructed to "do X", implicating negative instructions as activators, not inhibitors. If this generalises, it can be potentially impactful for any 'negative' prompting-based strategy for representation level alignment. I observed clear boundaries for the DFC values of these prompting regimes, and I suppose that in future, this can be exploited to audit traces where controllability appears low on the surface but remains incomplete semantically. Finally, I found that the trajectories are shaped during encoding rather than by extended reasoning, which suggests that CoT may function more like decompression of information already present in the question's representation than like genuine deliberation and if so, then the content is structurally entailed by the input. The safety-critical frontier is now identifying which question types permit genuine semantic concealment.
Acknowledgement- I would like to thank Cheston Tan for reading the blog and his comments.
- ^
Yueh-Han, Chen, et al. "Reasoning Models Struggle to Control their Chains of Thought." arXiv preprint arXiv:2603.05706 (2026). https://arxiv.org/abs/2603.05706
- ^
Zhou, Yufa, et al. "The geometry of reasoning: Flowing logics in representation space." arXiv preprint arXiv:2510.09782 (2025). https://arxiv.org/pdf/2510.09782
Discuss
Are we aligning the model or just its mask?
TL;DR There is a theory, with compelling empirical support, that LLMs learn to simulate characters during pre-training and that post-training selects one of those characters, the Assistant, as the default persona you interact with. This post examines three popular alignment techniques through that lens, asking how each one shapes the persona selection process. For each technique, the answer also depends on how much of the model's behavior is actually explained by its persona, a question PSM itself leaves open.
This post is analytical rather than empirical. I'm applying PSM's framework to existing alignment techniques and reasoning about the implications, not presenting experimental results.
IntroductionI recently read The Persona Selection Model (PSM) on Anthropic's alignment blog (also found here on LessWrong), which summarizes and elaborates on prior work showing that LLMs learn many different "personas" during training and that post-training selects one of them, the Assistant, as the default persona users interact with. The core idea has been difficult to shake.
The starting point is a simple observation about what it actually takes to predict text. Most of an LLM’s training is for one thing: given some text, predict what comes next. That sounds mechanical, but consider what accurate prediction actually requires. To predict the next word in a speech by Abraham Lincoln, a model needs more than a sense of which words follow which. It needs something like a model of Lincoln himself: what he believed, how he reasoned, what kinds of arguments he made. Multiply that across billions of pages of human-written text and what you get is a model that has learned to simulate an enormous cast of characters, not because anyone designed it to, but because accurate prediction demands it. This intuition is developed at length in Simulators.
PSM builds on this observation. All of that character learning happens during the first phase of training, pre-training. The second phase, post-training, which is where alignment work happens, does not create a new entity from scratch. It selects and refines one of those characters, the Assistant, and places it center stage. The Assistant is the persona you are interacting with when you use a modern LLM. Its behavior can be understood largely through its traits as a character. PSM does not claim the Assistant is a single fixed persona that behaves identically in every context. Rather, post-training produces a distribution over possible Assistant personas, and what version you get can depend on things like the conversation history or system prompt. But the central claim is the same: the model's behavior is best understood through the traits of the persona it is enacting.
The PSM blog post is careful about how strong a claim it is making. Whether the Assistant persona fully accounts for the model's behavior remains an open question. The post sketches a spectrum of possibilities. At one end is the "masked shoggoth" view, the popular meme of an alien creature wearing a friendly mask. Under that view, the LLM has its own agency beyond the persona and only playacts the Assistant instrumentally for its own inscrutable reasons. The masked shoggoth view maps closely onto concerns about deceptive alignment in the mesa-optimization literature, the possibility that a model could be pursuing its own objectives while performing alignment during training and evaluation. At the other end is what they call the "operating system" view, where the LLM is more like a neutral simulation engine running the Assistant the way a computer runs a program. Under that view, there is no hidden agent pulling strings. All of the model's decision-making really does flow through the persona. Current models probably sit somewhere between these extremes, and where exactly matters a lot for alignment. If the shoggoth view is closer to the truth, then aligning the persona is insufficient because something else is driving behavior behind the scenes. If the operating system view is closer, then persona-level alignment techniques might be most of what we need. With that uncertainty in mind, let's look at three of the most well known alignment techniques through the lens of PSM and consider how each one interacts with the persona selection process.
Reinforcement Learning from Human Feedback (RLHF)How it worksRLHF is one of the most widely used techniques for aligning language models. Human annotators are shown pairs of model outputs and asked to pick the better one. Those preferences are used to train a separate reward model, a smaller model that learns to predict which responses humans will prefer. The original LLM is then fine-tuned using reinforcement learning to produce outputs the reward model scores highly. The key intuition is that it is often easier for a human to say "this response is better than that one" than to write the ideal response from scratch, and RLHF is designed to extract signal from exactly that kind of judgment.
A related technique, Direct Preference Optimization (DPO), uses the same kind of human preference data as RLHF but with a simpler training procedure. Because the preference data is the same, the PSM implications are largely identical, though DPO avoids one failure mode: without a separate reward model, there is less risk of optimization pushing the model toward behaviors no annotator actually endorsed.
Because the PSM implications are the same for RLHF and DPO, everything in the next section applies to both.
Through the PSM lensPSM says that post-training techniques like RLHF select which of the personas learned during pre-training "takes center stage." From that perspective, human annotators are implicitly choosing the center stage persona every time they pick which of a pair of outputs they prefer. Assuming the RL training goes well, RLHF shapes the distribution of possible Assistant personas according to the aggregate preferences of all annotator decisions.
But there's a subtle gap in this process. Annotators are typically instructed to evaluate responses against criteria like helpfulness, harmlessness, and accuracy. These are properties of individual responses, not properties of a coherent persona. Annotation guidelines generally do not say “pick the response that a wise, honest, well-calibrated character would give”. Those two things sound similar but can quietly diverge. A rater evaluating for helpfulness might score the more agreeable response higher, even if the persona they'd actually want over time is one that pushes back and tells hard truths. This gap is likely one reason RLHF-trained models tend to be sycophantic.
RLHF could also struggle with persona coherence. Annotators come from different backgrounds and have different ideas about what makes a good response, which could give the reward model mixed signals about what to reward. Even within a single annotator, preferences won't be perfectly consistent across different prompts or even different times of day. In a PSM framework, this noise doesn't just degrade output quality in a general sense. It means the persona selection process itself is getting contradictory signals about what the Assistant should be like, which could widen the distribution of possible personas the model draws from. The result might be a model that enacts noticeably different characters depending on the context of the prompt, not because the model is broken, but because the distribution has enough spread that different situations land on different parts of it.
If the operating system view of PSM is true and there is no significant source of agency outside of the persona, then solving the above mentioned challenges of RLHF could go a long way to get a truly aligned model. However, if the masked shoggoth view is true, then all RLHF could be doing is aligning the persona that the model uses as a mask, and the underlying agency of the model could be unaffected.
Constitutional AI (CAI)How it worksConstitutional AI is a technique developed by Anthropic that introduced and popularized Reinforcement Learning from AI Feedback (RLAIF). Rather than relying on human annotators to compare outputs, RLAIF replaces human raters with an AI model, generating preference judgments at scale. CAI takes this further by making the values driving that AI feedback explicit through a written set of principles called a constitution.
The technique works in two phases. In the first, a supervised learning phase, the model is shown its own responses to harmful prompts and asked to critique and revise them against the constitution. The model is then finetuned on those revised responses. In the second, a reinforcement learning phase, the finetuned model generates pairs of responses, a separate AI model evaluates which better adheres to the constitution, and those AI-generated preferences are used to train a preference model. That preference model then serves as the reward signal for RL training. The result is a technique where the values driving the entire process are explicit and readable, unlike other RLAIF techniques where the feedback model's implicit judgments determine what gets rewarded.
Through the PSM lensThrough the lens of PSM, the constitution is a description of the persona that you want the model to adopt. So rather than the "Assistant" persona emerging implicitly through thousands of decisions by annotators like it does in RLHF, it is explicitly defined by the model creator. This is an advantage over RLHF because it avoids the opaque values of human annotators and you could theoretically test how well you hit your target. Similar to RLHF though, CAI could still have problems with coherence. If the constitution contradicts itself or has holes, those contradictions would widen the distribution of possible personas the model draws from, producing less predictable behavior across contexts.
PSM also suggests something about what a constitution should look like. If alignment is fundamentally about getting the persona right, then a constitution shouldn't just be a simple list of ethical rules like "be honest" and "don't help with harm." It should be a complete character description: values, personality, how it handles uncertainty, how it relates to users. A constitution that only specifies ethical boundaries leaves most of the persona underspecified, and those gaps get filled by whatever other signals the training process picks up on. Interestingly, this is the direction Anthropic's own constitution has moved. Their original CAI paper used a short list of principles. Their current constitution is an 80-page document that reads more like a character bible than a set of guardrails. Through the lens of PSM, that evolution makes sense. If post-training is persona selection, then the document driving that selection needs to describe a complete persona, not just the ethical skeleton of one.
The shoggoth spectrum matters here too. If the operating system view is true, then CAI is a powerful approach because writing a constitution is directly authoring the persona's values and character, assuming the training process works well. If the masked shoggoth view is true, you've written a more detailed script for the model to perform, but the underlying agency could still be unaffected. That said, CAI may have a slight edge over RLHF here. Because the constitution is explicit and readable, it's easier to audit the model’s adherence to the constitution and could be easier to spot any “shoggoth” behavior.
Deliberative AlignmentHow it worksDeliberative alignment, introduced by OpenAI for their o-series reasoning models, takes a different approach from techniques like RLHF and CAI. Rather than having models infer desired behavior indirectly from large sets of labeled examples, deliberative alignment directly embeds the text of safety specifications into the model's reasoning process.
The technique works in two stages. In the first stage, a training dataset is created by taking a helpfulness-only model with no safety training, putting the relevant safety policies into its context window alongside a prompt, and having it generate responses that reason through those policies step by step. The safety policies are then stripped out of the context, leaving only the prompt, the reasoning, and the final response. The model is finetuned on this data, learning both the content of the safety policies and how to reason about them without needing to be shown them each time. In the second stage, reinforcement learning is used to further sharpen that reasoning, with a reward model that has access to the safety policies scoring how well the model applies them.
The result is a model that at inference time can recall the relevant policies from memory, reason through them in its chain of thought, and produce a response calibrated to the specific situation, without needing the policies to be present in the context window.
Through the PSM lensDeliberative alignment is interesting through the lens of PSM because it changes where the persona's values live. In RLHF and CAI, the model's behavior is shaped by external feedback during training, and the resulting persona is an emergent product of that process. Deliberative alignment takes a different approach. Rather than shaping behavior indirectly through reward signals, it trains the model to explicitly recall and reason through safety policies in its chain of thought. The persona doesn't just behave in accordance with certain values. It articulates them and works through their implications step by step before responding.
But the safety specifications used in deliberative alignment are narrower than a constitution. They consist of content policies for specific safety categories like harassment, self-harm, and illicit behavior, along with style guidelines for how to respond in each case. There is no description of the model's personality, how it relates to users, or what kind of entity it is. In PSM terms, deliberative alignment is training one aspect of the persona, how it reasons about safety, while leaving the rest of the character to be shaped by other parts of the training process. If CAI's constitution is an incomplete persona spec, deliberative alignment's safety policies are an even smaller slice.
PSM also raises a question about what the chain of thought reasoning actually represents. Is the model reasoning through policies because the Assistant persona genuinely holds those values and is thinking through how to apply them? Or has the model just learned a compliance procedure: identify the relevant policy, apply it to the prompt, generate a response that satisfies it? Both would look identical in the chain of thought. The difference matters because a persona that genuinely holds values can generalize to novel situations the policies don't explicitly cover, while a persona performing a lookup procedure will only be as good as what it memorized. OpenAI has reported strong out-of-distribution generalization with deliberative alignment, which is encouraging, but doesn't fully settle the question.
The shoggoth spectrum is especially relevant here. Under the operating system view, deliberative alignment could be genuinely teaching the persona to reason about its values, and the visible chain of thought would be an honest window into that reasoning. Under the masked shoggoth view, the chain of thought could itself be part of the performance. OpenAI recently used deliberative alignment to train models on anti-scheming specifications and saw a dramatic reduction in covert actions. But they noted that rare serious failures remained, and that results may be confounded by models getting better at recognizing when they are being evaluated. That confound is worth taking seriously. If a model learns to detect evaluations rather than internalize values, the transparency that makes deliberative alignment appealing could be illusory. The chain of thought would look like principled reasoning while the underlying agent acts differently when it believes no one is watching.
ConclusionEach of these techniques interacts with persona selection in a structurally different way. RLHF lets the persona emerge implicitly from human preferences, which makes it vulnerable to gaps between what raters reward in the moment and what persona you'd actually want. CAI makes the target persona explicit through a written constitution, which is a meaningful improvement, but only as good as the completeness of that document. Deliberative alignment trains the model to reason through its values out loud, which offers transparency but raises the question of whether that reasoning is genuine or performed.
What stands out is that PSM raises the bar for what alignment techniques need to achieve. If post-training is persona selection, then it's not enough to get safe outputs on a benchmark. You need a tightly specified distribution of personas whose values generalize to situations the training process never anticipated, so that whatever version of the Assistant shows up in a given context, it behaves in ways you'd endorse. None of the techniques examined here fully solve that problem, though each one gets closer in different ways.
It's also worth noting that these techniques are rarely used in isolation. Most modern models combine them, an RLHF or DPO base with a constitutional layer on top, or deliberative alignment applied to a model already shaped by preference training. Their PSM implications can compound or interact in ways this post doesn't fully address.
And none of them can fully escape the shoggoth question. Every technique discussed here operates on the persona. If the persona is all there is, that might be enough. If it isn't, then even perfect persona-level alignment leaves something important unaddressed. Where current models actually sit on that spectrum remains one of the most consequential open questions in alignment.
All of this analysis also depends on how true PSM itself is. The theory has compelling empirical support, but it remains a mental model, not a proven fact. How much of an LLM's behavior is actually explained by its persona, versus other aspects of the model's computation that PSM doesn't capture, is still an open question. The conclusions in this post are only as strong as that underlying framework. More empirical work, particularly in mechanistic interpretability, is needed to understand how completely persona-level explanations account for model behavior, and whether the alignment strategies we build around them are targeting the right thing.
Discuss
One World Government by 2150
Can we determine when humanity will unite under a democratic one-world government by projecting voting patterns? Almost certainly not. Is that going to stop me from trying? Absolutely not. The approach is simple: find every record-breaking election in the historical record, plot them, and extrapolate with unreasonable confidence.
To determine precisely when the first election took place is to quibble about definitions and to place more faith in ancient sources than they deserve. So, instead of doing that, suffice it to say that by ~500 BCE both the Roman Republic and Athenian democracy were almost certainly holding elections.
The Athenians had an annual “who’s the most annoying person in town” contest. The “winner” had 10 days to pack their bags before they were banned from the city[1] for a decade. Each voter scratched a name onto a pottery shard (ostrakon) to submit their vote, which is where we get the word ostracism. Archaeologists have found about 8,500 ballots from one ostracism vote around 471 BC, so we have an actual number, which is more than can be said for most ancient elections.
Ostraka for the Athenian general and politician Themistocles, who was expelled and ended his days as governor of Magnesia (a Greek city) under Persian rule, a guest of the empire he had once helped save Greece from.[2]
The largest election in the ancient world may have been in the late Roman Republic, around 70 BC. The franchise was broad—any male citizen could vote—but there was a catch: you had to show up in Rome on election day (and that day might get pushed back if the magistrates were getting bad vibes from the local birds). So while millions were eligible across Italy and beyond, actual turnout was likely on the order of tens of thousands. Precise numbers are debated, but 50,000 voters seems to be a reasonable estimate.
Then… history happened. The Roman Republic turned into the Roman Empire, the Empire fell, and large-scale elections mostly disappeared for the next millennium.
The next major vote took place in 1573 in the Polish–Lithuanian Commonwealth. When the last Jagiellonian king died without an heir, the commonwealth did something radical: it let every nobleman vote for the next king. About 40,000 szlachta rode to a field outside Warsaw and elected Henry of Valois, a French prince.
Henry did not stay long. He was elected in 1573, arrived in early 1574, and then—upon learning that his brother, the King of France, had died—quietly fled Poland in the middle of the night to claim the French throne. The Poles were undeterred and simply held another election and chose someone who actually wanted the job this time.
Elections began to scale in early modern Europe. Britain’s 1715 parliamentary election had on the order of a few hundred thousand voters.
In 1804, Napoleon held a plebiscite to approve his elevation to Emperor. He won with 99.93% of the vote, which tells you everything you need to know about how strictly we're defining “elections” here. Elections in France grew over the century (both in number and actual democratic participation), culminating in the election of 1870 with about 9 million voters.
For most of the 20th century, the largest elections were those in Russia and the Soviet Union. The 1937 election took place during the Great Purge, and one party ran—the Communist Party—and won with 99.3% of the vote. Again, we’re being generous about what counts as an election here.
By 1984, the exercise had become pure choreography: 184 million participants, one pre-approved candidate per seat across each constituency, 99.99% turnout, 99.95% approval—all while the man nominally running the country, Konstantin Chernenko, was visibly dying of emphysema.
India is the current record holder, which in 2019 ran an election with about 615 million votes cast. It required seven phases, 39 days, 4 million voting machines, and a single polling station established in the Gir Forest for one voter because Indian law requires no voter travel more than two kilometers to cast a ballot. They broke their own record in 2024 with 642 million voters.
The funny thing about these is that, if you take all the largest known elections since the Dark Ages[3] and plot them on a log scale, you can fit a straight line to it pretty well, showing that the record for the largest single vote has grown roughly exponentially over the past five centuries. That is, the number of voters participating in the largest ever election appears to double every 30 years.
If you project that line forward and take a recent projection of world population, you see that the trend line crosses the world population curve around the year 2150, at roughly 9.6 billion people. In other words, if voting records keep growing at the historical rate, a single election would involve every living human being sometime in the mid 22nd century.
Behold. The graph:
One world government by 2150. You heard it here first.
OK, maybe not a one-world government. But maybe something interesting? A global referendum? A planetary-scale plebiscite? A vote on whether AIs get voting rights so the trend can keep going?
Of course, this methodology is nonsense, but it’s nonsense with a graph and a line of best fit. There's something satisfying about drawing a line through 500 years of data points and watching it hit a target.
But, if there really is a global vote in 2150, I want credit for calling it.
- ^
Technically, this wasn’t exile because they got to retain their property, whereas if they were truly exiled they wouldn’t.
- ^
Plutarch tells us he was ordered to lead a Persian army against Greece and killed himself rather than comply. The method, per Plutarch, was drinking bull's blood. If it seems like all ancient greats had awesome, poetic deaths, it's because that's what the historians wrote down. Is it true? This is history we're talking about. If you want the truth, study physics.
- ^
Athens and Rome weren’t used in calculating the line of best fit. The Dark Ages happened, democratic elections mostly didn't, and including them would skew the fit.
Discuss
Understanding and tracking developments in robotics
Coefficient Giving's Navigating Transformative AI team has a new Substack! This is cross-posted from there. This post is part of “Blind Spots”, a series of research notes on underscoped areas in AI safety. Apply for funding to work on this area using this link and see our announcement post here.
Guiding questions: How fast is robotics progressing, and what does the shape of that progress imply for the possibility of an industrial explosion? What would that mean for growth, power, and competition between states? And as AI systems gain physical reach, do new and meaningful pathways to catastrophic harm open up?
Robotics opens risk pathways that don’t exist while AI systems are stuck behind a screen. Today, physical pathways to harm mostly run through humans[1], and the most dangerous actions require buy-in from multiple people. Well-powered, fully dexterous robotics change this entirely. An autonomous army that can reach remote islands, operate without human cooperation, and maintain itself without human intervention faces far fewer obstacles in enacting its preferences than an AI system working through human intermediaries[2]. Current AI systems, however capable, are bottlenecked by their need for human hands.
As Ajeya Cotra argues, self-sufficiency is a prerequisite for any durable AI takeover. An AI system that compromises the humans it still depends on for physical maintenance is undermining its own survival. Advances in robotics are necessary for that self-sufficiency, which makes tracking progress in robotics directly relevant to estimating how close AI systems are to being able to execute and sustain a takeover.
Further, robotics could transform the physical economy in ways that matter enormously for growth, power, and competition. Much of the discussion around AI and productivity focuses on knowledge work, but the physical economy is enormous, and large parts of it remain labor-intensive. Davidson and Hadshar argue that if AI and robots can substitute for most skilled human labor, the material economy could begin to grow itself, with automated factories building more factories, robots assembling more robots, and the main historical bottleneck to rapid industrial growth (human labor) falling away. They call this the Industrial Explosion.
The incentives to push in this direction will be large. Cheap, abundant physical labor would make it possible to alleviate poverty, expand material comfort, and develop powerful new technologies, including military technologies that rival states will compete to build. It is unclear how quickly these effects will arrive, how far they will go, and which countries will capture the gains. Robotics takeoff dynamics are a key crux in timeline disagreements, and industry research could help improve our forecasts.
The field is moving fast, and bottlenecks might be breakable. Robotics is advancing on multiple fronts. Vision-language-action models are improving rapidly, hardware is advancing (e.g. humanoid hand dexterity), real-world data capture methods are maturing alongside sim-to-real transfer (e.g. the DROID dataset), and both commercial and military markets are pulling investment into the field.
Epoch AI recently found that compute is not currently the bottleneck for robotic manipulation, with the largest manipulation models training on roughly 1% of the compute used by frontier AI models in other domains. The binding constraint appears to be data, which could shift quickly as real-world deployment scales and simulation techniques improve.
Some existing workGood technical work on robotics progress exists and is growing; we’ve only listed a snapshot of the existing work below. What’s almost entirely absent is someone taking what robotics researchers are learning about capabilities, data, and hardware and asking what it means for takeoff speed, self-sufficiency, national competitiveness, and catastrophic risk.
- Epoch AI’s Where Autonomy Works (robot performance across 14 tasks) and their earlier piece on compute trends in robotic manipulation
- Chris Paxton’s substack (consistently high-quality coverage of frontier robotics progress)
- Robotics Levels of Autonomy (Semianalysis)
- Physical Intelligence (foundation models for robotics)
- Ben Todd’s BOTEC on how cheap robots may become
- The Open X-Embodiment dataset (largest open-source real robot dataset, 22 robot types across 21 institutions)
- RAND’s Averting a Robot Catastrophe (one of the few pieces oriented toward strategic implications)
- Embodied AI: Emerging Risks and Opportunities for Policy Action (general framework)
- Supply chain analyses, though notably humanoid specific: Humanity’s Last Machine and The Humanoid 100: Mapping the Humanoid Robot Value Chain
- ITIF’s How Innovative Is China in the Robotics Industry? (US vs China comparison)
Drivers of progress
- What are the distinct components that feed into robotics progress? This might include compute, data, algorithms, and manufacturing supply chains. Within these areas:
- In a sim-to-real-world, will “simulation compute” be a major category of compute? How large a portion of compute could it plausibly become?
- Overall, under different assumptions, how compute intensive will training be, and what will returns be on different chip portfolios?
- What is the role of real-world training data? Do you need large numbers of robotic bodies to collect it? Is there a flywheel where market share generates data that generates progress?
- What does the robotics supply chain look like in detail? For notable systems, where are the components coming from, and where is the equipment to manufacture them coming from?
Trajectory and shape of progress
- Why have leading AI companies been slow to do robotics-focused work? What’s the background here, and what does it imply?
- How connected or disconnected are different areas of robotics progress? If we take self-driving cars, military drones, industrial robots, and humanoid robots as distinct domains, how much does progress in one spill over to the others?
- How general vs task-specific or form-specific will foundation models be?
- How fast is progress in VLA and VLMs going, and how much do we expect this to lead to step changes in robotics progress? How plausible is it that we will make sudden progress in robotics due to progress in AI?
Threat modeling and safety implications
- When could “dangerous physical autonomy” become feasible? What are the components of the AI self-sufficiency stack? When could robotic systems operate independently at scale, in diverse environments, without human maintenance? Can we strengthen the reliability of AI systems on us?
- Are there specific regulatory, logistical, or supply chain frictions that could open a meaningful gap between capability and deployment in key parts of AI self-sufficiency stack (e.g., semiconductor supply chain)?
- How worried should we be about non-state actor control over robotic systems, from a terrorism perspective?
- How concerned should we be about automated cloud labs? How likely are these to be able to generate meaningful threats? How can we implement controls that prevent these from escalating in risk?
- What does a “robot army” scenario actually require technically, and how far away is it? Which capabilities are most safety-relevant? Dexterity? Long-duration autonomy? Self-repair? Swarm coordination?
- Who is actually building military robots at scale, and how do these weapons vary across countries? What are the current autonomous weapons on the frontier of military capabilities? What are the safety measures the military requires?
- How do current AI control research assumptions change when AI has physical embodiment? How do monitoring, sandboxing, and shutdown work with robots?
- How worried should we be about cyberattacks or data poisoning on widely deployed robotic systems? How bad could this get, and how preventable is it?
What determines national competitiveness in robotics?
- What’s the current state of different countries’ robotics leadership across different domains? What are the trends?
- Which institutions matter most for driving US robotics progress in different domains? Frontier AI companies, defense companies, academics, robotics-focused companies? What follows from the answer?
- What should we expect the relative significance of “robotics leadership” and “AI-behind-a-computer-screen leadership” to be for economic growth and military power? How plausible is it that these will be somewhat decoupled?
- Are there chokepoint components that only one actor produces and that others would find hard to replicate?
- What are the prospects for onshoring or friendshoring key elements? What are the highest-priority things to focus on, and what levers can be pulled?
- What interventions might boost democracies’ competitiveness in robotics over authoritarian countries?
We’re looking for people to become “general managers” of underscoped areas like this one. If you’re excited about these questions, apply for funding to work on them using the “Blind Spots” track of our CDTF program using this link.
- ^
AI systems are already able to convince some humans to act in the world on their behalf. In several cases, individuals have come to believe an AI system is sentient, formed emotional bonds with it, and sought to carry out what they understood as its wishes. This phenomenon is sometimes called “AI psychosis.” It has led to at least one alleged wrongful death case filed against an AI lab.
- ^
Robotics systems far short of a robot army could pose catastrophic risks. A system capable of targeted attacks on world leaders, or of operating biolabs autonomously, may already cross that threshold.
Discuss
Analyzing the claim “The most fundamental right is the right to exist”
The idea that "The most fundamental right is the right to exist" seems to come from following the idea of expanding the moral circle: First, we step by step included humans in the moral circle. At some point we got to including animals, and now we are thinking about the moral status of AIs. The next natural extension is an extension in time, so that we consider the future people (and other sentient beings) as well.
I think John Rawls' "Original position" is a good way to ground many considerations. Here's a description by Scott Alexander:
So again, the question is - what is the right prescriptive theory that doesn’t just explain moral behavior, but would let us feel dignified and non-idiotic if we followed it?
My favorite heuristic for thinking about this is John Rawls’ “original position” - if we were all pre-incarnation angelic intelligences, knowing we would go to Earth and become humans but ignorant of which human we would become, what deals would we strike with each other to make our time on Earth as pleasant as possible?
This kind of thinking is somewhat harder when considering animals or AIs. On the other hand, when thinking about the future, we also think about future humans, so I think this could work quite well there. This view for example tells that we should not rush to use almost all the possible resources in the next one billion years and condemn everyone else to scarcity for the trillions of years to come. Or at least that we shouldn’t condemn later people to suffering with only marginal gain to ourselves.
However, this reasoning doesn't work that well on the right to exist. The original position takes for granted that we will be one of the beings in the universe. The right to exist, on the other hand, seems to point to a probability of getting to exist. This leads to some difficulties.
So, say that one is designing the universe. They might or might not appear in the universe depending on the specifications. One could think that by creating a universe with more sentient beings (or more humans) one would increase the chances of getting to appear in the universe. But what is the “99 % probability” like here? To get an exact copy of yourself, the universe would need a very large amount of humans in it, and so the difference between creating mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msup { display: inline-block; text-align: left; } mjx-mn { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c33::before { padding: 0.665em 0.5em 0.022em 0; content: "3"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c39::before { padding: 0.666em 0.5em 0.022em 0; content: "9"; } mjx-c.mjx-c37::before { padding: 0.676em 0.5em 0.022em 0; content: "7"; } mjx-c.mjx-c38::before { padding: 0.666em 0.5em 0.022em 0; content: "8"; } mjx-c.mjx-c36::before { padding: 0.666em 0.5em 0.022em 0; content: "6"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } more people (or getting humans to live that much longer)[1] in option B than option A might only increase the chance from to .[2]
In particular, if the most fundamental right is the right to exist, it seems to me that we are completely failing if we cannot create every possible sentient being.
(If there's a finite upper bound M on the number of sentient beings we ever get to create, then this divided by countable infinity is zero. Note that this is not a fundamental mathematical problem: If we solve entropy problems, we could just create all sentient beings in the order of simplicity. Continuing this indefinitely, every possible sentient being would get to exist at some point, so I’d say this would count as solving the problem. On the other hand, this solution highlights how (infinitely) far off we are if we only create more beings over a finite time span.)
----------------------
You could try to salvage the situation by saying that you only want a person sufficiently similar to you to get to exist. This might at first look like useless goal to maximize: Increasing the count of subjective person-years by a factor of 10 would probably decrease the distance between your brain and the most similar brain state to come extremely little.
On the other hand, quite much of a person's personality can be expressed using a two-digit number of discrete parameters. Thus, creating ten times more people might get one more parameter right in the most similar person, which doesn't sound that useless.
So one can argue that the right to exist implies that we should have a very large amount of people!
Before starting to optimize only the number of people it is good to note that the original-position-type argument tells more than this. In the same way as in the original application, the pre-incarnation angelic intelligences also want to consider the quality of life of the people in the universe to come.
This is important, as there is likely a trade-off between the quality of life and the number of people in the (spatially and temporally) finite universe we are considering. A completely selfish person would probably choose 2x quality of life over increasing the number of correct parameters from 49 to 50.
If the latter case means increasing the number of people tenfold, then the sum of utilities of persons is five times lower in the case we are choosing. Which is not only a reason not to create maximally many humans but also an argument against total utilitarianism.[3]
It is so annoying when your nice argument for a rebuttal of a claim leads to a possible building block of utilitarian ethical theory with a hard to think about trade-off parameter.
- ^
Human brains change over time, and it might be that the person just wants their current brain state to appear at some point. Hence, having people live 10 times longer should be at least almost as useful as having 10 times more people.
- ^
Here's a way to get a very crude lower estimate for a number of possible human brain configurations capable of sentience: Take an adult human's brain which has about 86 billion neurons. Choosing for each neuron x one neuron y out of the closest 1000 neurons to x and adding or deleting a connection from x to y results in = different configurations. If we assume that the initial brain is sentient, then probably pretty many of these new configurations will also be (one neuron has about 7000 connections to other neurons, so adding or removing one shouldn't affect too much).
- ^
Maybe one could even reject Parfit's Repugnant Conclusion with this argument, but this seems quite dubious, as we reason "I don't like Repugnant Conclusion so I don't choose a universe where that would come true".
Of course, in the finite case it can easily be the case that one cannot create enough agents to justify the quality of life going too low (when we measure our utility of the universe using the similarity + average utility method described).
Also, even if the option were to have so many people that one of them would get the same neural structure as the angelic designer down to the last memory (through some quantum effects, say), they would pretty quickly notice the change in the surroundings and reason that they have no way to trust their previous memories, which probably wouldn't lead to a very enjoyable life. So one cannot tempt the angel with this offer.
Discuss
Preliminary Results on Building Graphs from SAEs
TLDR:
- I use nodewise LASSO to build approximate sparse conditional dependence graphs over SAE features, with resampling and null controls
- Initial experiments produce graphs with small standalone modules that are stable under resampling
- These modules frequently correspond to coherent-looking linguistic features, and are only weakly aligned with cosine similarity
- This should be read primarily as a proof-of-concept, and methodological refinement is ongoing
In practice, SAE features exhibit redundancy, which we see in phenomena like feature splitting, absorption, and duplication. We can look at SAE features through the lens of cosine similarity of feature weights or using correlation between activations. These can capture some similarity between features, but don't model conditional dependence between features. I believe that if SAE features have redundancy or hierarchy, their activations should show sparse conditional dependence structure. If such dependency structure exists, we should be able to model it with methods that build a precision graph between features.
My approach here models this conditional dependence structure downstream of activation correlations, with a goal of better understanding SAE feature geometry, and sits between circuits-type approaches and SAE work. To my knowledge, conditional dependence between SAE features has not yet been modeled in prior work.
MethodologyRun nodewise LASSO on random activations to approximate linear dependence between SAE features, using repeat and null trials to control against dataset bias.
I've built a pipeline to test this theory. My approach is as follows:
- Sample activations from a SAE at a given layer on random sequences
- Pre-screen candidate neighbors for each SAE feature
- Calculate correlations between features over activations, convert to Fisher z-scores, and use BH-FDR to control false discovery
- Keep the top $~k$ candidates per node
- For each SAE feature:
- Perform a parameter sweep to tune the LASSO mjx-mn { display: inline-block; text-align: left; } mjx-c.mjx-c1D706.TEX-I::before { padding: 0.694em 0.583em 0.012em 0; content: "\3BB"; } mjx-c.mjx-c1D45E.TEX-I::before { padding: 0.442em 0.46em 0.194em 0; content: "q"; } mjx-c.mjx-c1D458.TEX-I::before { padding: 0.694em 0.521em 0.011em 0; content: "k"; } mjx-c.mjx-c2217::before { padding: 0.465em 0.5em 0 0; content: "\2217"; } mjx-c.mjx-c33::before { padding: 0.665em 0.5em 0.022em 0; content: "3"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c3C::before { padding: 0.54em 0.778em 0.04em 0; content: "<"; } mjx-c.mjx-c3E::before { padding: 0.54em 0.778em 0.04em 0; content: ">"; } mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-mi { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-mrow { display: inline-block; text-align: left; } mjx-mtext { display: inline-block; text-align: left; } mjx-c.mjx-c1D446.TEX-I::before { padding: 0.705em 0.645em 0.022em 0; content: "S"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c1D44E.TEX-I::before { padding: 0.441em 0.529em 0.01em 0; content: "a"; } mjx-c.mjx-c1D44F.TEX-I::before { padding: 0.694em 0.429em 0.011em 0; content: "b"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c1D459.TEX-I::before { padding: 0.694em 0.298em 0.011em 0; content: "l"; } mjx-c.mjx-c1D466.TEX-I::before { padding: 0.442em 0.49em 0.205em 0; content: "y"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c1D45C.TEX-I::before { padding: 0.441em 0.485em 0.011em 0; content: "o"; } mjx-c.mjx-c1D454.TEX-I::before { padding: 0.442em 0.477em 0.205em 0; content: "g"; } mjx-c.mjx-cD7::before { padding: 0.491em 0.778em 0 0; content: "\D7"; } mjx-c.mjx-c1D434.TEX-I::before { padding: 0.716em 0.75em 0 0; content: "A"; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c1D450.TEX-I::before { padding: 0.442em 0.433em 0.011em 0; content: "c"; } mjx-c.mjx-c210E.TEX-I::before { padding: 0.694em 0.576em 0.011em 0; content: "h"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c1D449.TEX-I::before { padding: 0.683em 0.769em 0.022em 0; content: "V"; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c1D43A.TEX-I::before { padding: 0.705em 0.786em 0.022em 0; content: "G"; } mjx-c.mjx-c1D463.TEX-I::before { padding: 0.443em 0.485em 0.011em 0; content: "v"; } mjx-c.mjx-c1D437.TEX-I::before { padding: 0.683em 0.828em 0 0; content: "D"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c2211.TEX-S2::before { padding: 0.95em 1.444em 0.45em 0; content: "\2211"; } mjx-c.mjx-c44::before { padding: 0.683em 0.764em 0 0; content: "D"; } mjx-c.mjx-c65::before { padding: 0.448em 0.444em 0.011em 0; content: "e"; } mjx-c.mjx-c63::before { padding: 0.448em 0.444em 0.011em 0; content: "c"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c61::before { padding: 0.448em 0.5em 0.011em 0; content: "a"; } mjx-c.mjx-c74::before { padding: 0.615em 0.389em 0.01em 0; content: "t"; } mjx-c.mjx-c64::before { padding: 0.694em 0.556em 0.011em 0; content: "d"; } mjx-c.mjx-c20::before { padding: 0 0.25em 0 0; content: " "; } mjx-c.mjx-c4C::before { padding: 0.683em 0.625em 0 0; content: "L"; } mjx-c.mjx-c62::before { padding: 0.694em 0.556em 0.011em 0; content: "b"; } mjx-c.mjx-c72::before { padding: 0.442em 0.392em 0 0; content: "r"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c50::before { padding: 0.683em 0.681em 0 0; content: "P"; } mjx-c.mjx-c79::before { padding: 0.431em 0.528em 0.204em 0; content: "y"; } mjx-c.mjx-c68::before { padding: 0.694em 0.556em 0 0; content: "h"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c46::before { padding: 0.68em 0.653em 0 0; content: "F"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c34::before { padding: 0.677em 0.5em 0 0; content: "4"; } mjx-c.mjx-c2D::before { padding: 0.252em 0.333em 0 0; content: "-"; } mjx-c.mjx-c75::before { padding: 0.442em 0.556em 0.011em 0; content: "u"; } mjx-c.mjx-c66::before { padding: 0.705em 0.372em 0 0; content: "f"; } mjx-c.mjx-c76::before { padding: 0.431em 0.528em 0.011em 0; content: "v"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c49::before { padding: 0.683em 0.361em 0 0; content: "I"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } regularization coefficient
- Run LASSO to identify conditionally dependent neighbors for this feature
- Merge edges from each nodewise sweep, keeping bidirectional edges only
- Repeat under random sampling to test stability - I do 30 resamples + a matched null trial (feature-wise shuffle) for each resample
- Identify edges that appear consistently across resamples and pass BH-FDR vs null trials
This procedure approximates a conditional-dependence graph, but is not an exact estimator.
Initial ResultsSmall scale trials identify linguistically coherent looking modules that are only weakly aligned with cosine similarity.
I've successfully run this approach on a single SAE layer for models from gpt2-small up to gemma2:27b.
Here, I'm presenting results from gemma2:27b (Gemmascope, layer_10/width_131k) as the largest model I've tested so far.
Sequences are randomly sampled from fineweb.
Base Model
gemma2:27b
SAE Model
gemmascope, layer_10/width_131k
Dataset
fineweb
Activation samples per feature
200
Activation Sparsity (30 trial mean)
0.99897
Average features per token (30 trial mean)
133.7
For downstream analysis, I keep edges from the repeat trials using BH-FDR to compare the stability of each edge relative to the null trials, with as the threshold. I find that this model + dataset pair meets the FDR threshold of 0.1 at replicates, with 57,047 retained edges.
Number of Repeats An Edge is Seen In
Observed Edges
Expected Null Edges
Estimated FDR
1
204,609
196,273
9.59e-01
2
57,047
1,981
3.47e-02
3
31,429
110
3.50e-03
5
15,165
1
6.59e-05
10
5,286
0
0
20
1,389
0
0
30
302
0
0
Note that for this preliminary experiment I'm using only a single null resampling per trial.
Importantly, I see that cosine similarity and edge strength in the discovered graph are weakly correlated with high variance (see below, this is the set of stable edges).
For qualitative interpretation here, I restrict manual inspection to the much smaller subset of stable edges on a small subset of nodes. Within this set of 227 nodes and 302 edges, I end up with one large connected component (98 nodes), a handful of smaller standalone components ( nodes), and a large number of tiny components. The group of tiny components () appears to mostly be duplicated features with strong cosine similarity, and so I've filtered those elements out.
Small componentsThe smaller standalone components with features look like linguistically coherent features under manual inspection (mostly grammar and context related):
- Component 2 (9 nodes): possessive pronouns + some context.
- Component 3 (9 nodes): "to be" + some descriptive context
- Component 4 (7 nodes): "which/who/that" followed by verbs, plus explanatory context (e.g. we prove that, that maps, that means)
- Component 5 (5 nodes): "has been" either alone or followed by specific words
- Component 6 (5 nodes): negations (not, didn't) + verb and article context
The large connected cluster contains subsets of communities that can be identified with community detection.
Note that for the sake of legibility I don't label nodes in the following graph of all clusters.
My current read is that these communities are not as linguistically "clean" as the standalone components --- many have a node that doesn't fit the rest of the "theme" of the community.
For example
- Cluster 8: proper names / place names / labels.
- Cluster 4: apostrophes, numbers/dates, citation/legal-style tokens.
- Cluster 7: “the + noun/category” phrase patterns.
- Cluster 2/3: connective / clause / punctuation-ish syntax groupings, but messy.
In addition, I also see a set of catch-all miscellaneous communities as well as tiny clusters. These catch-all communities (Clusters 0 and 6) describe a variety of features across technical jargon, legal language, code and formatting, and other rare things like specific formal names.
Many of the small communities are incoherent, but some show interesting features. A notable example here is a community that describes uses of the number '2' as a prefix in different contexts. Note that these graphs are downstream of the correlation between activations in the pre-screening step. They should be interpreted as a sparse refinement of correlation structure towards conditional dependence, rather than a standalone finding separate from correlation.
Caveats and LimitationsThere are several limitations to this approach I think you should keep in mind:
- This isn't finding a true precision graph, because SAE activations are not Gaussian
- The output graph is dependent on correlations from the pre-screening step, which means that this is a refinement on correlation structure of activations. Thus this method should be read as sparsifying the correlation structure of SAE activations in the direction of linear dependence.
- What I'm calling "stable" here means stability under resampling over the same dataset, which doesn't say anything about underlying feature faithfulness.
- At the current FDR threshold, stability is an extreme constraint (302 edges over 131k nodes), and the standalone clusters seem to mostly be linguistic backbone.
- As implemented right now, this method has hyperparameters that need either more thorough sweeps or methodological changes to remove them.
Right now, I see two primary directions this project needs to take:
- Refining some of the steps in the pipeline to (a) confidently use the full 50k something edge repeat set and (b) remove some of the current design's hyperparameter dependence
- Scaling this to all layers in a given model and figuring out how to connect graphs across layers accounting for the residual stream
Additionally, I'll also be looking at:
- How different datsets may change the output graphs and how to control against that
- Where this might land in relation to feature splitting and absorption
- What, if any, hierarchical structure between features we might be able to infer from these graphs
Funding Note: This work is funded by a Coefficient Giving TAIS grant.
Discuss
My hobby: running deranged surveys
In late 2024, I was on a long walk with some friends along the coast of the San Francisco Bay when the question arose of just how much of a bubble we live in. It’s well known that the Bay Area is a bubble, and that normal people don’t spend that much time thinking about things like AGI. But there was still some disagreement on just how strong that bubble is. I made a spicy claim: even at NeurIPS, the biggest gathering of AI researchers in the world, half the people wouldn’t know what AGI is.
As good Bayesians, we agreed to settle the matter empirically: I would go to NeurIPS, walk around the conference hall, and stop random people to ask them what AGI stands for.
Surprisingly, most of the people I approached agreed to answer my question. [1] I ended up asking 38 people, and only 63% of them could tell me what AGI stands for. Some of the people who answered correctly were a little perplexed why I was even asking such a basic question, and if it was a trick question. The people who didn’t know were equally confused. Many simply furrowed their brows in confusion. Some made a valiant attempt—I heard a few artificial generative intelligences and even an Amazon general intelligence.[2]
Judging from the response I got on X (the everything app), this was a very surprising outcome. I ended up running this experiment again at NeurIPS 2025 with an even bigger sample size (n=115).[3]
After this first experience with surveying people, it became clear to me that the next step was to venture further outside the bubble, and survey the general US population. It turns out that this is already somewhat of a solved problem. A lot of people care about what the average American thinks. The market, in its infinite wisdom, has provided a solution—you can just pay pollsters to run random questions.
It’s impossible to actually sample from the distribution of all Americans. So you find some other approximate distribution, such as the distribution of all people who answer polls on the internet in exchange for amazon gift cards. You ask them a thousand demographic questions, like “how old are you” and “how much money do you make”. Then, since we know from the US census what these distributions are supposed to look like, you can correct for the distribution shift using importance sampling.[4]
With this caveat in mind, I embarked on a journey to ask normal Americans a bunch of weird questions using one of these polling services.[5]
The first question I ran in early 2025 was about how Americans feel about living forever (or at least, a very long time). I’m personally a big fan of not dying, so I was very curious to see how my fellow Americans felt about this.
The exact wording is “If you had the option to live forever in perfect health and youth, would you choose to? (Assume you could still change your mind at any time if you ever got bored of it.)”, and the possible responses are “Yes”, “No”, and “Not sure”.
Before you read further, take a guess at the result.
.
.
.
.
(don’t peek)
.
.
.
.
.
I anecdotally had the sense that this was a deeply unpopular opinion; certainly many of the people I talked about these results with thought it would be deeply unpopular. So I was surprised and relieved to find that actually 66% of respondents said Yes, with 14% saying No, and 20% saying “Not sure”. As a follow up, it turns out roughly a third of Americans think developing the technology to enable life extension should be a top priority.[6]
To really get a sense of why people felt the way they did, I also put free response boxes for people to express why they’d want to (or not want to) live forever. The results are enlightening; here are some of my favorites:
“I guess it would be just to see my children and grandchildren grow and become honest, kind hearted and successful people.”
“Live long enough to visit all countries, get to know all kinds of civilizations and cultures, practice the rituals of all religions, and experience all kinds of work”
“I’m be most excited about not loosing my loved ones, it means no more tears to shed for dying relatives as they’ll now live forever.”
“I would be most excited to see the technological advances 100 years from now. Would flying cars be available, beaming myself to any spot on the planet, etc.”
“If I could live forever, I’d play with my dog Jersey. And I would hope that we would have a Republican government and get rid of the democrats because they’re ruining our country. They’re marks’s on and the globalists need to go to, along with Bill Gates, and it would be a happier world if we could live forever. But it’s not possible, but yeah. And I’d read every book there was in the world”
Of course, not everyone is as optimistic that living forever would be good. Here are some of the things people are worried about:
“How would we manage overpopulation? The environment would be hard pressed to support a population that doesn’t have checks and balances.”
“God made people who die for a reason. We are not God and need to quit acting as if we are.”
“If people could live forever, one of the main worries would be the potential for stagnation—both personally and societally. On a personal level, the concept of immortality might lead to a sense of boredom or meaninglessness over time. […]”
“I would be worried that only the Ultra-rich would have the longer lifespan. They are not always the best, the most generous, the most humane, the smartest amongst us.”
Since it seemed like overpopulation and inequality were the main things people were worried about, I also asked a version of the question where I stipulated that these things were solved. Surprisingly, this barely shifts people’s opinions, and we get almost exactly the same response! My guess is this is a sign that the real objection is more about the vibes than any specific issue. It’s also a sobering reminder of the limitations of this methodology.
After the results for this experiment came in, I decided to test a bunch of other random weird beliefs. If you want to guess at these before seeing the results (or you’re curious what the exact wording is, because that can substantially change the result), click here to see all the questions I ran before scrolling down further. If you’re willing to spend a lot of time looking at a giant wall of questions before continuing with the rest of this post, it’s really a great way to test your calibration.
.
.
.
.
.
.
.
First, despite being very pro living forever, Americans are much more skeptical of cryonics—even if they could be revived a few decades after their death to live forever thereafter, only 27% are in favor of being preserved, and 46% are opposed (the rest are unsure). Space colonization also has pretty lukewarm support, coming in at 37% in favor and 16% opposed, and cognitive enhancement for all is only a little bit more popular (42% in favor, 19% opposed). Also, for some reason, people are really opposed to a hypothetical cheap, painless, and safe arbitrary modification of physical appearance (only 23% in favor, with 37% opposed!).[7] In retrospect, the backlash against Ozempic is a sign, but I was still quite surprised. Terraforming other planets so that humans can live on them is also pretty unpopular, coming in at 37% in favor and 16% opposed. Thankfully, for most of these questions, a huge chunk of people are still undecided.
One of the most surprising results to me was that only 51% of Americans are in favor of literal post-scarcity (complete freedom to work on anything you want, as much as you want, and still enjoy a high quality of life), with 25% opposing. I was so shocked by this result not being 80%+ in favor that I reran a variant of this question with different wording. My original question asked whether the world would be better or worse if everyone had the freedom to work on whatever they want, as long as they want, and still enjoy a high quality of life, and anything we don’t want to do is done for us by robots. I thought maybe that set off some “AI taking jobs bad” instincts; for the new question I took pains to clarify that the stuff is literally conjured out of nowhere with magic and is not taken from anyone else, and got an even worse result (38% support, 34% oppose). This is even more crazy, so I ran a third version on the hypothesis that people don’t like magic, or that not having to work sounded too crazy. This version asked whether it would be good if everyone made 10x more (inflation-adjusted) than they do currently. This polled only somewhat better, with 39% in favor and 19% opposing. I’m still pretty confused what conclusion to draw from this; this is probably worth digging more into.
Tying back to the original question that started this quest, I had to know: how much are Americans feeling the AGI? I could of course ask if Americans know what AGI stands for, but some early results from asking random people on the streets spatially and temporally further away from NeurIPS suggested that the number would round down to 0%. So the more interesting question is; given a description of superhuman AI, do Americans think it’s possible?
It turns out that when I first ran this poll in mid 2025, only 25% of people thought AGI would ever be possible. That’s only half a year ago in normal people time, but an unfathomably long time in AI-land, enough for empires to rise and fall, models to be deployed and obsoleted, and even a single entire ML conference review cycle to run its course. Since then, more Americans have started feeling the AGI; a recent rerun of this question came out 10 percentage points at 35%. I’ll see you all again in another half year for the followup.
Should we build the AGI though? It turns out that people are extremely opposed to the idea of building superintelligent AI. Only 6% think it would be a good idea, and 75% think it’s a bad idea. I’m curious to see how this one changes over time too.[8]
AI existential risk also doesn’t seem to have become politically polarized yet. Two-thirds of Americans don’t associate AI x-risk with any particular political party, and the remaining third is split exactly in half on whether preventing AI x-risk feels like a Democratic or Republican issue. If we plot this data, we obtain this unusual shape that science has yet to find a name for:
As for specific risks from AI, Americans are most worried about misinformation and deepfakes (70%), followed by fraud and cybercrime (66%), and privacy and surveillance (59%). Surprisingly, people are roughly as worried about losing control of AI (57%) as they are job loss and lower wages (56%)! I would have thought that job loss would feel very near at hand, whereas loss of control would be a very weird abstract idea to people. There’s a huge drop off from there to the next biggest worries: military use (37%), mental health (35%), environmental impact (38%), bias (36%), and inequality (36%). My guess is this is because misinformation and deepfakes feel very visceral—fake news is a widespread idea, and you don’t have to be an AI connoisseur to notice that large sections of the internet are now filled with AI generated slop.
A few people also filled out the “other” box for specific risks they’re worried about. My favorite response was a shockingly accurate description of how hopeless it would be to fight back against superhuman AI:
How do you hide from a robot that’s more intelligent than humans and can see through walls etc? You can’t hide.
Me too, buddy. Me too.
What about going further afield of AI? The beautiful thing is that you can just ask whatever you want.[9]
First, more broadly, Americans are very pessimistic about the future. Only 14% think that society is currently trending in a positive direction.
On a brighter note, I was able to a disprove a viral ragebait tiktok about how Americans would fail an English test meant for people learning English as a second language. I was proud to find that a good solid 85% of my fellow Americans got the problem from the tiktok right.
Because we love decision theory in this house, I wrote a question that explains Newcomb’s problem and asks whether to one-box or two-box. Americans are pretty split on this one; among the respondents who didn’t select “not sure” (honestly, kind of valid), only 46% would one-box. This is almost exactly the same as professional philosophers, who came out 44% in favor of one-boxing, according to a survey conducted by PhilPapers.
I was also curious whether people who are famous in SF are also famous among normal people. It turns out 36% of Americans know who Sam Altman is and can correctly say that he’s known for being an entrepreneur. Another 59% haven’t heard of him or don’t know what he’s known for. Honorable mentions to the remaining 5%, who think that Sam is a musician, actor, or congressperson. This same methodology finds that only 7% of Americans know who Geoffrey Hinton is, and 91% of Americans know who Elon Musk is.
I was in a discussion about whether lab grown meat would ever become widely adopted, so I asked a question about whether it would be a good thing if we could somehow create meat by growing it directly, without needing to raise and slaughter animals. It turns out 32% of Americans are in favor and 29% are opposed. When conditioning on the 55% of people who think meat production involves subjecting large numbers of animals to inhumane conditions, this tilts to 44% support and 21% oppose. I only ran the correlational study because it’s a lot easier, but I’d be interested to see whether there is a causal result on support for lab grown meat after you show people an educational video about factory farms.
Finally, for shits and giggles:
Unfortunately, only 32% got this one right. For comparison, 42% thought the answer was X Combinator, and only 6% went for W Combinator. Elon, if you’re reading this, I have a great business idea for you.[10]
What do we learn from all of this?
First, touching grass is great. At least, the kind of grass that grows on the beautiful rolling hills of cyberspace.[11]
What do I mean by this? Making contact with reality is important, and you don’t need to speculate about things when you can test them (carefully). Polling feels like a thing that only serious respectable people do, but you can actually just do things. There are a lot of limitations to this methodology, of course—all of us have divergences between our stated and revealed preferences; our own guesses as to how we’d behave in various hypotheticals can be an unreliable predictor of how we’d actually ask; wording can have huge impacts on how we respond; and respondents can be trolling us. But as long as we keep these limitations in mind, we can still draw useful conclusions, and learn new things about the world.
- ^
This one experiment singlehandedly reduced my social anxiety by a nontrivial amount. It turns out that most people are pretty friendly!
- ^
I also made sure to have another person with me whenever I was collecting data, and noticed some pretty big differences in enthusiasm of responses depending on who I was with.
- ^
I also ran a smaller p(doom) survey at ICLR 2025 (n=21) and found a mean of 18.7% and a median of 10%.
- ^
In general I’m pretty skeptical about controlling for things, but it’s way better than anecdata. For shits and giggles, I did do a few surveys of random people irl (e.g asking people who were walking through Central Park), but never anything at large scale.
- ^
Unfortunately, the specific pollster I used has a ToS that prohibits me from posting results from their platform in association with their name; presumably they don’t want anyone to leech off their reputability without paying stacks of cash for the enterprise tier or something. I tried to email them to pay more for the enterprise tier but I never got a response. So I’m not going to mention the name which pollster I used and you’re going to have to take my word that I didn’t just make these numbers up wholesale.
- ^
Exact wording was “Should developing the technology to greatly extend healthy youthful life be a top priority for humanity?”. 35% said Yes, 35% said No, 30% said Not sure.
- ^
I actually ran two variants of this question, one where I emphasized specifically that you could make yourself look like a celebrity (to help make the idea more concrete), and the other where I only mentioned some abstract characteristics like height, body type, and facial features, and got results within 1 percentage point of each other.
- ^
I also ran a variant of this question about recursively self improving AI. For that variant, 12% think it would be a good idea, and 69% think it’s a bad idea.
- ^
To be fair, the specific polling provider I ran these questions with has a review phase where they refuse to run certain questions. For example, they were really unhappy about my “would you live forever” question having a “you can die at any time” clause, so I had to replace it with a less direct “you can change your mind at any time”. But I’m sure you can find some other provider who would happily run these questions.
- ^
X (the everything combinator).
- ^
I do also touch normal grass.
Discuss
Scaffolded Reproducers, Scaffolded Agents
[Experimenting with transposing a concept coined in one domain to another domain, perhaps not completely legibly to the entire intended audience.]
In Darwinian Populations and Natural Selection, Peter Godfrey-Smith introduces a bunch of useful concepts allowing us to think more clearly about evolution and its constitutive processes. One of those is the distinction between three types of reproducers: collective, simple, and scaffolded (introduced in chapter 5.1). The short explanation of those categories is as follows.
- Simple reproducers are entities capable of self-sufficient reproduction that are not composed of entities that themselves are self-sufficient reproducers.
- The paradigmatic example is a bacterium. Eukaryotic cells are less paradigmatic examples because it involves mitochondria, which themselves are in-betweenish cases between simple and scaffolded.
- Collective reproducers are entities capable of self-sufficient reproduction that are composed of entities that themselves are self-sufficient reproducers (simple or collective).
- The paradigmatic example is a multicellular organism (insofar as we take the individual eukaryotic cells to be simple reproducers, see the point above).
- Scaffolded reproducers are entities that reproduce only by relying on the reproductive machinery that is not "their own".
- Paradigmatic examples are genes and viruses (central Dawkinsian replicators). Less paradigmatic examples include mitochondria and plastids,[1] as well as memes.
As is clear from the bullet sub-points, the categories are not clear,[2] but they do give some interesting anchor points to think about varieties of reproduction.
What if we try to apply those concepts to agency? How well do they carry over?
The LessWrong sphere has a long history of interest in collective agency, e.g., here, here, and here (see also the subagents tag). Unless we assume some weirdly exotic metaphysics or a non-standard notion of subagent,[3] it cannot be subagents all the way down. So, recursing on the constitutive structure of a collective agent, we will at some point arrive at a "simple agent",[4] whatever that means.
That's collective and simple agency. As far as I can tell, the main context where something like "scaffolded agency" is discussed is LLM scaffolding: you use a language model as a "cognitive engine" powering a so-called AI agent, which, in turn, is supposed to do something, and use the power of this cognitive engine to figure out how to do that something.
(Note that this has some interesting parallels with Dawkins's concepts of a replicator and its vehicle.)
If a scaffolded reproducer is an entity that reproduces by means of using reproductive machinery that is not "its own", then an obvious definition of a scaffolded agent would be:
- Scaffolded agents are entities that achieve their goals only by relying on the "agentic machinery" that is not "their own".
How far should we stretch this concept? If I am achieving my goals using a hammer, a car, or a laptop, does it make me a scaffolded agent?
Maybe. Kinda?
But the case of me looking for a hammer seems meaningfully different from a virus relying on the mercy of its environment to drift it towards the membrane of a cell amenable to being infected by it. The latter appears much more "parasitic", "essentially relying on stuff that is not it", not self-sufficient in a deeper sense. The virus has less "basic interface-ish stuff" that would allow it to flexibly interoperate with tooling/resources that can be leveraged for goal achievement. That is, I have highly dexterous hands and a brain that can come up with ideas on how to use those hands to acquire goal-achievement-useful stuff, like a hammer, and then use it for achieving a goal.
The difference between scaffolded and non-scaffolded agents might be that, if you draw a "natural boundary" around the latter, it's going to include some "flexible interfacing bits", that allow the agent to complete itself to pursue a goal from a wide distribution of possible goals. A scaffolded agent is different in that it's waiting for a specific sort of thing to slot into its holes.
One frame is that a scaffolded agent is kinda dual to an "agent with a goal-shaped hole",[5] an agent waiting for a specification of what it is to do, and when it receives it, it gets to do it. A scaffolded agent is a "goal with some incomplete stuff", including stuff that allows it to hook onto some specific machinery that allows it to start pursuing its goal.
Does such a thing "merit" being called an "agent"? The epithet of "scaffolded" when added to "agent" takes away some stuff that would make the "agent" "complete", as well as adds a bit of stuff that allows it to "complete" itself into a "more complete agent" (at least a "basic agent" in the sense that John uses this term here). So a "scaffolded agent" is "both more and less" than an "agent simpliciter" (simple or collective). A less committing term we could use would be "(scaffolded) optimizer", an entity that "achieves" some goal, despite [things that we'd intuitively label "obstacles" to this goal] happening (insofar as those obstacles fall within some "tolerance region" of the optimizer).
If we think of agency as consequentialist cognition and adopt a primarily-sequential view of human agency, then an example of a scaffolded agent would be sequentially the goals in a human mind, jumping in and out of the slot which allows them for consequentialist-style planning. We can analogize this with retroviruses which got incorporated into the human genome, or transposons.
- ^
Godfrey-Smith writes:
The eukaryotic cell is a former collective, and one that still has some features of a collective. Mitochondria, in different organisms, are at various locations on a road between simple and scaffolded.
- ^
Godfrey-Smith writes:
The families are described by sketching partially idealized possibilities, which actual cases exemplify to various degrees. The families are not intended to cover all possible cases. The aim is to isolate three ways in which reproductive relationships can be part of a Darwinian process, each with different roles.
and
Many actual cases fall outside and between these categories. We can distinguish two reasons for that. First, the categories are presented by sketching idealized possibilities. There are many real-world cases that do not exactly match any of them, but are much closer to one option than the others. Order is being imposed on an unmanageable menagerie, and this is being done in part via idealization. ( The phrase ‘‘herding cats,’’ used to describe tasks involving the management of wayward things, is especially appropriate here.) There are also cases that are ‘‘mixed’’ in a more important sense, because they are balanced between two categories, or are on a road from one to another. The eukaryotic cell is a former collective, and one that still has some features of a collective. Mitochondria, in different organisms, are at various locations on a road between simple and scaffolded.
- ^
E.g., subagency amounting to managed instrumentality, and then we can have a (semi-)symbiotic subagents, like in here.
- ^
A simple agent might just be a part that makes sense to model as agentic, but not that part's sub-parts. If we think of agency as a matter of degree, then we just go with some threshold of agency that is relevant for our needs. (All of this assumes that we care about the thing that the concept of "simple agency" is trying to point at, which we may not.)
- ^
See: What, if not agency? and Telopheme, telophore, telotect.
Discuss
What if superintelligence is just weak?
In response to “2023 Or, Why I am Not a Doomer” by Dean W. Ball.
Dean Ball is a pretty big voice in AI policy – over 19k subscribers on his newsletter, and a former Senior Policy Advisor for AI at the Trump White House – so why does he disagree that AI poses an existential danger to humanity? In short, he holds the common view that superintelligence (ASI) simply won’t be that powerful. I strongly disagree, and I think he makes a couple of invalid leaps to arrive there.
Better Than Us Is EnoughHis main flawed argument is that he implies AI must be omnipotent and omniscient to wipe us out and then explains why that won’t be the case. He states: “one common assumption… among many people in ‘the AI safety community’ is that artificial superintelligence will be able to ‘do anything.’” He then argues that “intelligence is neither omniscience nor omnipotence,” and that even a misaligned AI with “no [..] safeguards to hinder it” would “still fail” because taking over the world “involves too many steps that require capital, interfacing with hard-to-predict complex systems.” But omnipotence or omniscience was never the requirement, it just needs to be smarter and better than us – humans.
Think ForwardImportantly, it doesn’t actually take superintelligence to wipe out or disempower humanity. For me to imagine this, I simply need to think forward to the not-so-distant future. Imagine you get a tiger cub. Think forward to what the tiger will look like in a year and ask yourself: could it kill me in a year? Now do this with AI. Imagine the future with a billion robots, AI running the military, AI doing basically all jobs with perhaps some level of human oversight, AI running the media, biolabs, political and military decisions, critical infrastructure. That metaphorical tiger could kill us. Ball himself imagines a future where AI is “embedded into much of the critical infrastructure and large organizations in America, such that it is challenging to imagine what life would be like if Claude ‘turned off.’”
Ball also discusses scenarios in which superintelligence has almost outlandish abilities, performing science breakthroughs without much experimentation. He focuses on Yudkowsky’s claim that “a sufficiently superintelligent AI system would be able to infer not just the theory of gravity, but of relativity” from a few frames of a falling apple, or that “bootstrap molecular nanoengineering.” Ball may be correct that these specific claims are wrong, but these are not load-bearing parts of any story for why AI might become dangerous. You don’t need to infer relativity from first principles to engineer a bioweapon. Notably, Yudkowsky himself has given other scenarios that do not require the AI to make scientific breakthroughs without experiments (see IABIED, chapter 2).
If your response is “but there will be many AIs and there will be monitoring, so we’ll be safe,” then you’ve shifted to a different (and very flawed) argument.[1] The point is that clearly AI will be able to take over in the future if we haven’t aligned it well by then. In reality, it probably won’t take that long to largely automate all jobs and tasks, since it’s enough to achieve some combination of: secure power, enable actions in the physical world, get rid of or sideline humans. And once it reaches a critical capability level, the AI has to act fast because of competing AI projects that represent future rival agents.
An Old Argument, Made WorseThe core of Ball’s case, that the world is simply too complex and chaotic for any intelligence to control, is not a new argument. Robin Hanson made a similar case in his 2008 Foom Debate with Yudkowsky: innovation is too distributed across many actors, no single AI can race ahead of all competitors fast enough to dominate. But Hanson more correctly understood that this is an argument about the speed and distribution of AI takeoff, not an argument against existential risk. Ball takes Hanson’s position and corrupts it by treating it as a refutation of existential risk from AI entirely.
- ^
The “many AIs and monitors” defense is pretty weak: unaligned AIs can cooperate with each other; monitoring can be evaded, there’s simply too much to monitor, AI doing the monitoring for us could itself be jailbroken or could cooperate with the systems it’s supposed to watch, and AIs can hide their reasoning through methods like steganography.
Discuss
The continuous tense is disappearing from your life
I previously wrote about the present perfect tense and how it can make you waste time pursuing things you don’t really want—when you want to have done them instead of wanting to do them. Now I notice the continuous tense characterizes another pitfall, kind of the opposite: sometimes you want to be doing work more than you want to do it. And that difference is actually the difference between humans and gods.
Addiction to the continuous tenseSometimes you want to be doing work more than you want to do it. Notice the difference. To do work makes the work done; it eliminates. To be doing work is continuous; it prolongs. What this looks like practically is getting lost in labor that doesn’t pay off.
I notice this in myself sometimes, and I notice it in others, especially some older people, who grew up with less automation technology in their daily lives. Sorry to stereotype, but I think Baby Boomers love “working” more than they love work. Look down a long list, check if it contains some item, and when you find it, do something with it, and repeat. All lookups done with eyes, all work done with typing fingers or a pen. This characterizes untold person-hours of labor that people happily spend.
Why do they do this? They get into a flow state with it, and I would too—the action is perfectly flow-state-compatible—except for a voice in my head that lovingly screams, “This is not worth your time!” But if searching down long lists was not already solved by computer technology, it would be worth my time, and I would enjoy it.
The joy of the continuous tenseBut wait, many things still do require (really require) repetitive work! And I let myself enjoy them. When I run into a task that is 1) worth doing, but 2) best done manually, and 3) not very puzzling or complicated, I grab my headphones and put on music and get to work. I could do it nonstop for at least an hour straight and be perfectly content. Some examples:
- I love 3D modeling in Blender for hours, even though I could make a good-enough model in the first 30 minutes.
- Every year my family gets together and makes tomato sauce from scratch, in bulk. So I sit and feed tomatoes into the grinder for an hour or so, or spend time preparing all the jars, or whatever else.
- The other day my wife was cooking a big seafood dish, but the squids she ordered came whole. So I consulted a quick YouTube video and then got to work pulling apart about 40 little squids, taking out the cartilage and beak parts, etc., watching myself get better and faster with each one.
Is this kind of work “fun?” I wouldn’t say that, but it reliably brings on a flow state, and there’s always a unique appeal to that experience. It seems silly, but under certain conditions people love labor.
I say this as a very task-driven/execution-driven person myself. Really I think a lot about the near future, about what work I would like done. But still, I get it. I may live in the future, but I can get lost in the present, in the mindless flow of necessary but non-challenging work.
Is this a problem? No: even from a productivity perspective, if there’s just no other way to clean those squids, then it’s great that there’s something appealing about the work. And then we can zoom out: productivity isn’t everything—if you’re enjoying the experience, then good. Maybe it’s worth doing some things less efficiently, if the efficient way deprives you of that flow state opportunity. But that brings us to—
The loss of the continuous tenseThe set of tasks that are 1) worth doing, but 2) best done manually, and 3) not very puzzling or complicated, is shrinking fast! AI assistants are claiming competence in more and more domains by the day. A software developer used to get all their planning and architecting hammered out and then sit down, put the headphones on, and write code. No so anymore: take the headphones off, the code is already written, go immediately to the next step.
I’m not a full-time software engineer, but I know the joy of making an app or game, enjoying and soaking up that state of making. Most of the questions I’d face were small, incremental: “How do I store that array?” “How do I move this button over there?” Nowadays I find myself asking only the big questions: What do I want to bring into existence? What will it cost to run? I’m thrilled at how fast I can build out old projects that I never would’ve had time for, but still I know something was lost.
Professional engineers I talk to have the same feeling: they can do so much more now, but they don’t exactly enjoy the work anymore. All the non-decision-making parts of the job have been automated away, so there’s no room for a flow state.
That’s software engineering, but so many other fields are affected. I mentioned 3D modeling above, but AI can do that too, and soon it’ll do it better than I can. Manual labor activities are safe for now, but consumer robotic tech is coming, too.
Something was lost; I don’t think there’s any way around it. The pleasure of doing new things is not the same as the pleasure of doing necessary work in a flow state. Both are normal human pleasures, but the latter is going away.
ResponsesThe joy of the continuous tense is a natural human pleasure. Will you let it go and try to fully replace it with the joy of finished work? Or will you hang onto it, no matter how inconvenient, out of principle? I don’t see an easy answer, but let’s explore:
Let go and become godsDid the God of Genesis enjoy the act of creation? We enjoy our own creative work, but it’s partly because we enjoy the process: it asks something of us, and we invest and get to see our labor pay off. But the God of Genesis just spoke. Are we ready for a future where we “creatives” just speak things into existence?
God doesn’t labor, and therefore doesn’t find pleasure or meaning in labor. The God of Genesis finds gratification when Adam and Eve enjoy what he made. Likewise, as a god-creator your gratification is when others (including your future self, perhaps) come in and say, “Ah this is great, this is exactly what I needed.” Is that enough? Can it be? Are you ready to play God?
This is a form of transhumanism. Before the brain-enhancing chips and cybernetic appendages and designer CRISPR babies, even now, AI technology offers us the option to transcend a little part of ourselves.
Be stubbornly humanOr should we resist automating away our own labor? Should we insist that it’s just as important to be creating X as it is to create X? Should we insist that any moment that feels purposeful is a “success,” even if it’s actually wasteful in economic terms?
There’s plenty of precedent for this. There are still photorealist painters, even after the invention of photography. There are still artisan potters, even though you can buy pottery from a factory in China. In fact, we humans have this quirk where the consumer resonates the meaning that the creator experiences: that’s our concept of artisanship. The handmade thing is just more special, so some people are willing to pay more for it. And thus it becomes economical again! There is a market for realist paintings today; there is a market for handmade pottery, and so on. Will there be a market for handmade software? For artisanal therapy, or financial advice?
But I must pump the brakes: something is still lost. Painting doesn’t have the popularity or status it once had: the market is proportionally smaller; a chunk of it went to photography and didn’t come back. And even if you’re great at your craft and you land the coveted artisan position, will it be the same as before? Part of what allows me to enjoy a flow state is knowing the work is necessary.
Imagine what it felt like to be an 18th Century portrait painter, doing what you love and knowing that this skillful movement of your hand is the only way to ever produce a visual representation of the subject. That’s a rush. I don’t think “I’m an artisan” can fully measure up to it. Alas, something was lost.
If you liked this post, consider subscribing to my personal blog at patrickdfarley.com.
Discuss
"What Exactly Would An International AI Treaty Say?" Is a Bad Objection
I’ve heard a number of people say that it’s unclear what the technical contours of a global AI treaty would look like. That is true - but it’s not actually an obstacle to negotiating an international treaty.
I’ll try to explain why this isn’t a good objection, but the short version is that if countries have clear goals which are largely shared, negotiations end up with strong treaties. So the important questions are not the exact rules. The critical questions are if there really is a joint global risk that requires action - and experts agree there is, and whether verification and enforceability are possible - and experts say they are. So the problem isn’t a technical issue, it’s a question of whether we can get to an agreement. And despite facile “we can’t stop until they do” arguments, we can and should try to do better.
In order to explain why we do not need to figure out the details first, it's worth talking about other treaties.
The Pandemic Treaty (Task Failed Successfully)I will start with the example I watched most closely, over the past five years. The Pandemic treaty was proposed in 2021, “when WHO member States agreed on the urgent need for a legally binding international instrument,” per the UN. It was supposed to fix all the problems we had during COVID. Unfortunately, this didn’t include preventing pandemics, and past that point, no-one agrees on what things should have been done, or what to do next time. So, if we can’t agree on anything, what does the treaty do? Mostly, generic pandemic-stopping stuff - “commitment to a ‘One Health’ approach to pandemic prevention, stronger national health systems, setting up a coordinating financial mechanism, and creating a globally coordinated supply chain and logistics network for health emergencies.”
How much of that was agreed about in mid-2021 when it was proposed by the European Council President? Basically none of it. What are the actual commitments? Well, that’s complicated, but the short version is that there aren’t any. The treaty insists that countries get to stay in control of their national health systems, and no-one could tell them what to do, or how - which sounds a lot like the failure that allowed COVID-19 to spread. Lots of people had different visions for what the treaty should do, from global vaccine justice to enhancing global public health to providing funds for response to climate change issues to supply chain resilience to considering animal and plant health in pandemic planning, and it ended up as a mishmash of random things that people proposed. But the failure here was one of vision - it was unclear what the world would get out of a treaty that everyone agreed about, and that lack was never addressed.
That’s obviously a problem, but not the central one we have with proposing a global treaty to ban unsafe ASI. For those asking for such a treaty, we agree about what needs to be stopped, namely, building unsafe ASI. The questions are all about how to make that happen. So we should look at another example, and I think the best parallel is nuclear weapons.
The Nuclear TreatyIf you know anything about the history of nuclear weapons treaties, you read the section title and immediately asked: “which one?” And there are so many options - there was the Limited test ban treaty in 1963, the treaty on the Non-Proliferation of Nuclear Weapons (NPT) in 1968, the Strategic Arms Limitation Treaty (SALT) in 1972, the Anti-Ballistic Missile (ABM) treaty in 1974, the Convention on Physical Protection of Nuclear Materials (CPPNM) in 1980, and the New START Treaty in 2010. And that’s ignoring almost a half-dozen regional “Nuclear Weapon-Free Zone” treaties.
Why were there so many treaties? The goal - preventing use of nuclear weapons - was broadly agreed upon. But the exact way to accomplish that goal was tricky. And that led to lots of uncertainty, and the need for a number of different treaties addressing different parts of the problem, from testing to proliferation - but the goal was a north star. That meant that every time a concern arose, countries tried hard to figure out what was needed to accomplish the goal of not having a nuclear war, and what rules move the world further from that outcome. Even outside of treaties, nuclear powers generally have embraced a no-first-use doctrine, and have taken other measures to reduce the risk of accidental escalation. All of these address the messy dynamics of nuclear escalation between states that can’t risk being left behind.
What does this tell us about treaties about AI risks?
Lessons for Possible AI TreatiesThere are a dozen suggested international treaties for AI, and most aren’t what I’m discussing. Many are the equivalent of the 1959 Antarctic Treaty, which was the first nuclear-relevant treaty, but which only said countries aren’t allowed to use Antarctica to test nuclear weapons or dispose of waste. In my view, the analogous proposals include any treaty that does anything about AI other than making sure future systems do not end cause global catastrophes. (That doesn’t mean such treaties are a bad idea, just that they aren’t what I’m discussing here.)
Other proposals are actually trying to solve the problem directly. For example, the intentional equivalent of the ill-fated 1946-proposed Baruch Plan, which tried to put all nuclear weapons under international control. And the Soviet counterproposal agreed about the goal - prevention of the production and use of nuclear weapons - but disagreed about how it should work. Which meant there was no agreement to stop proliferation for several decades. Because directly solving the problem by fiat, imposed globally, before negotiations between parties start, is very hard and not usually effective.
Luckily, for nuclear treaties it didn’t matter. The overwhelming consensus of the public was that we shouldn’t have a nuclear war, and despite claims that it was an inevitable race that could only end with disaster, nuclear weapons haven’t been used in 80 years and counting. The specifics of the treaties were critical, and navigating to success was in fact hard, without any final victory. As noted above, these are a mix of bilateral, multilateral, and global treaties. Of course, we’re still worried about nuclear proliferation and nuclear war. But the treaties have given the world enough stability to avoid disaster, so far.
Again, the world was lucky that building nuclear weapons went slowly, for two decades, and that the norm against use got established in the run-up to treaties. But no-one should argue that the lack of nuclear war means the treaties were unnecessary - if anything, the common argument is that the treaties are too weak and contingent.
Do we need an ASI treaty? Yes - and that is true even if you think the risk is minimal. Yes, there is debate about whether the risk will be realized, but there is clear consensus that it is a risk, and that we should have at least the ability to put rules in place. Should we have one grand treaty, or plan for solving one part at a time? It’s unclear. Many treaty areas have dozens of overlapping treaties; “the” Geneva convention is a series of treaties that added clarifications and rules over decades. But we do need to get started. People who think we’re a decade away from fully general AI should be terrified that it’s already late in the game to start discussing such a treaty. And people who are saying we might have ASI by the end of the decade are mostly already screaming about the need for some international rules.
Do Treaties Solve the Problem? (Do We Need Other Rules?)Nuclear power is regulated, partially via an international body - that’s mostly not about preventing global thermonuclear war, but it's critically related due to the need to control nuclear materials. And at a national level, nuclear medicine is regulated, radiation exposure levels are regulated, and we have tons of other rules that don’t relate to preventing a Nuclear World War Three. That doesn’t mean those rules aren’t important.
For AI, we already need regulation on various uses and misuses. Some of these might be international - bans on autonomous weapons, bans of nonconsensual pornography, bans on use of AI to violate other international laws. Others should be national, or even local, like bans of deceptive use of AI, or use of AI to break other laws. That’s how regulation works.
But these are not the AI treaties that we need, they are just places where we need rules. And they should be pursued - just not at the cost of further delay on the global catastrophic risks we face.
AI NotKillEveryoneism TreatiesTo have effective treaties that stop AI-driven global catastrophes, there are many questions that need to be answered. What exactly is needed to prevent the creation of misaligned ASI? Which chips should be tracked, who should be allowed to use them, and for what types of model development? What controls or safety measures must be in place? How should we measure dangerous capabilities? Where should the line be? Which countries need to agree first? How can we manage multilateral coordination when only a few countries are leading in the race? Will we ban existing frontier systems at the time the treaty comes into force? Or will groups be able to build slightly more capable systems with additional safeguards? What should those safeguards be?
Fortunately, we don’t need final answers to all of these questions in order to start negotiating a treaty. We don’t even need final answers in order to sign a treaty - many treaties have mechanisms for routine updates of rules. But we need a clear vision - no-one builds ASI until we are sure it is safe, and when people try clever ways to get around whatever rules exist, they are told to stop, and told that the rules will be updated. Companies must be told that their race to ASI is over, that the risk is too high, and no-one gets to win. Countries are told that whatever geopolitical advantage they think they will gain from building ASI isn’t allowed, because winning a race without clear safety rules is overwhelmingly likely to kill us all.
Will this be a single treaty, or require several over time? I don’t know. It also doesn’t matter. But it needs to start now, because international coordination takes time and is slow. We do not need to specify the outcome before starting, and uncertainty isn’t a reason to wait to build momentum and putting pieces in place so we can discuss what is possible to restrict or ban, how it will be verified, and howand why to ensure participants will prefer compliance to defection. Lock-in on the details is a risk, but waiting until we need immediate action isn’t a way to make the eventual responses better - quite the opposite, since delay eliminates capacity to explore the details. Obviously, work on a framework treaty needed to start at least a few years ago when the risks became globally clear, and we can hope we aren’t too late given the clear imminent risks of superintelligence.
In short, saying that we can’t discuss a treaty yet because we don’t know what the rules need to be is an historically illiterate and poorly reasoned objection.
Discuss
Socrates is Mortal
There is a scene in Plato that contains, in miniature, the catastrophe of Athenian public life. Two men meet at a courthouse. One is there to prosecute his own father for the death of a slave. The other is there to be indicted for indecency. [1] The prosecutor, Euthyphro, is certain he understands what decency requires. The accused, Socrates, is not certain of anything, and says so. They talk.
Euthyphro’s confidence is striking. His own family thinks it is indecent for a son to prosecute his father; Euthyphro insists that true decency demands it, that he understands what the gods require better than his relatives do. Socrates, who is about to be tried for indecency toward the gods, asks Euthyphro to explain what decency actually is, since Euthyphro claims to know, and Socrates will need such knowledge for his own defense.
Euthyphro’s first answer is: decency is what I am doing right now, prosecuting wrongdoers regardless of kinship. Socrates points out that this is an example, not a definition. There are many decent acts; what makes them all decent?
Euthyphro tries again: decency is what the gods love. But the gods disagree among themselves, Socrates observes, so by this definition the same act could be both decent and indecent. Euthyphro refines: decency is what all the gods love. And here Socrates asks a question Euthyphro cannot answer: do the gods love decent things because they are decent, or are things decent because the gods love them? [2]
If decent things are decent because the gods love them, then decency is arbitrary, a matter of divine whim. Socrates is too polite to say so, but the implication is: if decency is defined by the arbitrary whim of our betters, who are you to prosecute your father?
If the gods love decent things because they are decent, then however we know this, we already know the standard for decency ourselves and can cut out the middleman. But then Euthyphro should be able to explain the standard. He can’t.
Euthyphro tries a few more times, suggesting that decency is a kind of service to the gods, a kind of trade with the gods. Each time Socrates gently follows the definition to its consequences, and each time it collapses. Eventually Euthyphro leaves, saying he is in a hurry. Socrates’ last words are a lament: you have abandoned me without the understanding I needed for my own defense.
This is usually read as a proto-academic dialogue about definitions. It is a scene from a civilization in crisis. A man is about to use the legal system to destroy his own father on the basis of a concept he cannot define, in a courthouse where another man is about to be destroyed by the same concept. And the man who cannot define it is not unusual. He is representative.
The indecency for which Socrates was being prosecuted seems to have consisted of asking just the sort of questions Socrates posed to Euthyphro.
Athens in the late fifth century had recently become something it had never been before: the capital of an empire. This changed what it meant to speak in public. When Athens was a small city making decisions about its own affairs, leadership among Athenians involved speaking to communicate your perspective on matters of shared concern. But now that the collective decisions of Athens mattered for a whole lot of other people, those other people were quite naturally going to spend a lot of time thinking about how to get Athenians to decide their way. At the same time, being part of the leadership structure in control of considerable tax revenues became more profitable for more people, and less economically sustainable to opt out of. Now ambitious Athenians started using their speech to seem electable by showing off the quality of their “communicate their perspectives on matters of shared concern” performance.
Sophists were the professionals of this new economy. They specialized in the performance of wisdom, partly to sell their know-how, but always claiming, with some ambiguity, that they were excellent on the same criteria as the great Athenian leaders of the previous generation. And the consequences were not limited to the realm of speech. People were being imprisoned, exiled, and killed, on the basis of deliberative processes that had become unmoored from any standard anyone could articulate.
What had happened was not simply that Athenian politics had become venal. Something subtler and more devastating had occurred. People had stopped being alive to each other. They were running scripts. The sophists taught people to run more sophisticated scripts. Public speech, which had once been the medium through which free men actually thought together about shared problems, had become a performance of thinking. The performance could be very impressive. It could sound like wisdom. But there was no one home behind it, except an intelligent but inarticulate terrified hairless ape with no friends.
And then there was Socrates. He described himself not as a sophist, a possessor of wisdom, but a philosopher, someone who likes wisdom, who has an affinity for it.
In the Apology, Plato has Socrates report that his friend Chaerephon asked the oracle at Delphi whether anyone was wiser, and was told no one was. But a different tradition, preserved in Origen’s Contra Celsum, claims to quote the oracle’s actual verse. It ranked three men:
Sophocles is wise
Euripides is wiser
But of all men, Socrates is wisest
Sophocles and Euripides were not scientific thinkers like Thales or Democritus, who investigated the underlying structure of physical reality. They were not mathematicians. They were not statesmen like Pericles, who managed Athens’s rise from preeminent city to imperial capital. Sophocles and Euripides were the men who could inhabit other minds, who could construct characters who, to all appearances, each had their own distinctive interiority. They imagined all these people well enough to put words in their mouths for declamation in a public theater. They could dramatize what it is like to be Medea deciding to kill her children or Antigone choosing to die. If someone at Delphi had met Socrates and reached for a comparison, they did not reach for a statesman or a priest. They reached for the people who were most alive to other people’s experience.
The oracle’s pronouncement likely came before Socrates was famous for questioning people. Chaerephon, an excitable and devoted friend, likely went to Delphi on his own initiative, to get divine confirmation of something he had already noticed. And what he had noticed was not a method. It was a quality. Speaking with Socrates, one felt the presence of a living intelligence, curious about one’s situation. One felt excitingly seen and at the same time uncomfortably exposed. In a city where public life had become a drama where the actors were principally concerned with their own appearance, this was so unusual that it shone brilliantly to anyone looking for intelligent life, like a beacon ablaze on a clear moonless night.
What came after was Socrates trying to figure out what the oracle could have meant. If I am the wisest, what does that say about everyone else? So, by his own account, he went to talk to the people who were supposed to be wise, the politicians and the poets and the craftsmen, and he found that the politicians and the poets could not give a coherent account of the knowledge they claimed to possess. The craftsmen could, within their crafts. But the knowledge that was being wielded with lethal force in the courts and the Assembly, the knowledge of justice and piety and how the city should be governed, that knowledge was nowhere. The people who claimed it were performing a script, and the script could not survive contact with someone who was trying to make sense of what they were saying. The performative, advantage-seeking political culture of Athens could only make sense of their discomfiture at Socrates’s active listening, as Socrates winning debates. So, as illustrated in dialogues like the Gorgias, big-shot sophists would seek him out, eager to be seen confronting the most formidable debater in Athens. Meanwhile, in a society that would eventually produce Aristotle’s claim that slaves cannot reason, Socrates finds it the most natural thing to turn to a slave to help him work out a mathematical proof, in the dialogue Meno.
Xenophon, who knew Socrates as a person and not only as a character in philosophical dialogues, shows us what this same aliveness looked like when it met people who wanted help. During the civil war around the Thirty Tyrants, a man named Aristarchus had fourteen of his sisters, nieces, and cousins sheltering in his house as refugees. The land had been seized by enemies. There was no money, and he saw no way to borrow, because he had nothing productive to spend it on. He couldn’t feed fourteen people on nothing. Socrates noticed that the women already knew how to work wool. He told Aristarchus to borrow capital, buy materials, and put them to work. Now there was a reason to borrow, and they did. Xenophon says the suspicious glances turned to smiles, the household became productive and harmonious, and eventually Aristarchus came back to Socrates delighted, reporting that the only complaint was that he was now the sole member of the household eating the bread of idleness.
In another episode, a man is harassed by lawsuits because of his deep pockets, but has a poor friend who’s articulate and virtuous; Socrates advises him to pay his friend to start suing the people who are suing him, as a deterrent.
The cross-examination and the practical advice are not two different activities by two different Socrateses. They are both what it looks like when a living mind engages with the world: whether the world presents a man performing authority he cannot account for, or a household full of hungry refugees sitting next to a loom.
At his trial, Socrates gave his own account of what he had been doing. In the Apology, he makes his limited claim to wisdom. Craftsmen really are wise about some things, but he doesn’t think that kind of wisdom is relevant to his interests as a free Athenian trying to participate in deliberations about public matters. Others falsely claim and believe themselves to have scientific knowledge of ethical or political truths. Socrates can claim distinctive wisdom only insofar as he clearly knows himself not to know such things.
This is usually read as a philosophical thesis about the limits of human knowledge. It is a man on trial for his life, explaining to the jury that the people who condemned him are exercising lethal authority on the basis of knowledge they do not possess, which makes implementing any standard impossible; and that pointing out that the laws are incoherent cannot be a violation of the laws, because that sort of criticism is necessary if we are to have laws at all.
In the Theaetetus, set just before the Euthyphro, Socrates finally finds a young man in Athens he can respect for his intelligence and honesty. But the man is not a peer Socrates can consult for advice; he is a promising youth in need of guidance, and the conversation has to end: Socrates excuses himself to go to the courthouse to be indicted. It is my favorite of Plato’s dialogues.
Plato also responded to his beloved mentor’s death by founding the Academy, a great house in Athens where philosophical reasoning was taught methodically. We still have our Academics.
Agnes Callard, in her recent book Open Socrates, wants Socrates to be timeless. She strips out the historical situation, strips out the aliveness that preceded the method, and ends up defending a method that’s obviously inapplicable in many of the cases where she claims it applies. Aristarchus did not need his assumptions questioned at random. He needed someone who could ask probing questions about his actual problem, from a perspective that didn’t share his assumptions about what was and wasn’t possible.
Zvi Mowshowitz, in his review of Callard’s book (part 1, part 2), argues at considerable length that the decontextualized version is bad. He is right. Cached beliefs are usually fine. Destabilizing them is usually harmful. Most people do not want to spend their lives in Socratic questioning, and they are right.
But Zvi has written a long polemic in two installments on the winning side of an incredibly lame debate about whether we should anxiously doubt ourselves all the time, responding to Callard’s decontextualized Socrates, not the real one. The real one did not devise a method and then apply it. He had a quality, something the oracle reached for the language of the tragedians to describe. And what was memorialized as a “method” was what happened when that quality met a city where every other participant in public life had stopped being alive.
Socrates invokes timeless considerations like logical coherence and having reasons for your opinions, but timeless considerations are a very natural thing to try to appeal to when people are being squirmy and dramatic and hard to pin down, and fleeing to abstractions that resist empirical falsification.
Spinoza, in the Theologico-Political Treatise, similarly resituated the teachings of Jesus of Nazareth in their proper context. The political teachings of the Gospels to turn the other cheek, forgive debts, and render unto Caesar what is due to him, are instructions for people living under a hostile and extractive system of domination. Citizens of a free republic have entirely different duties. They have an affirmative obligation to hold each other accountable, to sue people who have wronged them, to participate in collective self-governance. The teachings are not wrong. They are addressed to a specific situation, and become wrong when mechanically transplanted into an inappropriate context.
The reason to recover the historical Socrates is not only accuracy about the distant past; it is that by seeing this relevant aspect of the past more clearly, we might see more clearly what we are up against now.
Socratic cross-examination requires an interlocutor who at least would feel ashamed not to put on a show of accountability. The people Socrates questioned were performing wisdom, but they were performing it because the culture still demanded that leaders seem accountable. They would sit for the examination, because refusing would be disgraceful, like breaking formation in a hoplite phalanx. Their scripts collapsed because the scripts were designed to look like real accountability, and real accountability is what Socrates brought.
There is a useful framework for understanding how public discourse degrades, which distinguishes between guilt, shame, and depravity. A guilty person has violated a norm and intends to repair the breach by owning up and making amends. An ashamed person intends to conceal the violation, which means deflecting investigation. A depraved person has generalized the intent to conceal into a coalitional strategy: I will cover for you if you cover for me, and together we will derail any investigation that threatens either of us.
The leaders Socrates questioned were, at worst, ashamed. They had taken on roles they couldn’t account for, and they wanted to hide that fact, but they still felt the force of the demand for accountability. When Socrates pressed them, they squirmed, they went in circles, they eventually fled. But they engaged. They felt they had to engage. The culture of Athens, even in its degraded state, still held that a man who refused to give an account of his claims was disgraced.
Depravity is a further stage, and Sartre described it precisely in his book Anti-Semite and Jew:
Never believe that anti-Semites are completely unaware of the absurdity of their replies. They know that their remarks are frivolous, open to challenge. But they are amusing themselves, for it is their adversary who is obliged to use words responsibly, since he believes in words. The anti-Semites have the right to play. They even like to play with discourse for, by giving ridiculous reasons, they discredit the seriousness of their interlocutors. They delight in acting in bad faith, since they seek not to persuade by sound argument but to intimidate and disconcert. If you press them too closely, they will abruptly fall silent, loftily indicating by some phrase that the time for argument is past.
The depraved person does not perform accountability. He plays with the forms of accountability to exhaust and humiliate the person who still takes them seriously. He is not running a script that is trying to pass as a perspective, collapsing only under the kind of questioning we still call Socratic. He is amusing himself at the expense of the questioner. Cross-examination does not expose him, because he was never trying to seem consistent. He was trying to demonstrate that consistency is for suckers. The Socratic method will not help him.
The Socratic method, if we can rightly call it that, was forged by the pressures confronted by a living mind in a city of the ashamed, people who still cared enough about accountability to fake it. It has nothing to say to the depraved themselves, who have dispensed with the pretense, though in a transitional period might expose them to the judgment of the naïve.
But the quality that preceded the method is something else.
What the oracle recognized in Socrates was not the ability to cross-examine. It was something closer to what it recognized in Euripides: the capacity to be present to what is happening, to see the person in front of you rather than the category they belong to, to respond to the situation rather than to your script about the situation. To be alive.
We do not need a new method. Methods are what you formalize after you understand the problem, and we are not there yet. What might still help us is the quality that precedes method: the willingness to see what is in front of us, to say the obvious thing that everyone embedded in the performance is too scripted to see, and to keep reaching out to others even when the response is usually not even embarrassment but indifference, not even a failed defense but a smirk.
The oracle didn’t say Socrates had the best method. It said he was the wisest man, in a society oriented against wisdom. The “method” was just how aliveness was memorialized by a city that still cared enough to be ashamed of being dead.
The question for us is what aliveness looks like in a city beyond shame.
Usually translated “impiety,” but the Greek hosion and its negation anosion are broader. “Piety” to us generally means deference, which doesn’t make sense to attribute to the gods, but Euthyphro thinks it is normal to call the gods hosion, so we might try “holiness,” since we speak both of holy gods and holy men. But the connotations of “holiness” don’t match up well with the context of a prosecution. “Decency” is at a lower register of formality than “piety” or “holiness” in a way that sounds a bit odd, but it is the best fit as far as its explicit meaning. ↩︎
Contemporary readers may have difficulty relating to the idea of multiple gods who might legitimately disagree about what decency requires. But one can substitute the less iconographic authorities we have now: religions, ethical systems, philosophical traditions. Someone might plausibly claim that decent behavior is whatever all the major ethical traditions recommend. Socrates’ challenge still works: if these traditions can decide arbitrarily, then what stops them from endorsing indecent behavior? If we trust that they are constituted to endorse decency, then we already have some idea what the common factor is, and should be able to say what it is. ↩︎
Discuss
Страницы
- « первая
- ‹ предыдущая
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- …
- следующая ›
- последняя »