Don't just aim for Frontier Labs

LessWrong.com News - June 14, 2026 - 07:41

Why AI safety should live wherever AI is deployed, not just where it is built.

I spotted a request for feedback on whether someone with AI safety experience should take a role at a for-profit company and "get their hands dirty" as an AI transformation leader, pivoting away from a strategy focused on AI research labs. I work at a SaaS company, and find it meaningful to de-risk AI in products that impact millions of people. If experienced safety advocates avoid opportunities where AI is deployed and focus only on existential risks, wouldn't that worsen near-term outcomes?

I actually held the default view after my first 5-10 readings on AI risk in the 2010s (including Nick Bostrom, Tim Urban, 80,000 Hours, Future of Life Institute). Specifically, by 2021, I had developed the intuition that making an impact on AI safety did require working at an AI lab. After all, the lab was where most AI-accelerating change appeared to originate, and therefore had to be the best place to steer towards positive outcomes.

The last 3 years changed my mind, if only through actual, published incidents. Even with recursive self-improvement at our doorstep, I believe that absolute control over our future (especially the future of people alive today) is difficult enough to warrant a time discount on AI-specific X-Risk. This underscores the merits of assigning moral weight to real-world harms in the present and, therefore, allocating some resources to mitigate them. Empirically, they both get interest and funding, according to Erich Grunewald. Now, there are some information gaps in both, but while Catastrophic Risk is material (MIT 2026), we also understand that Frontier AI labs benefit enormously from spotlighting X-risk and downplaying the present risks and non-catastrophic harms of LLMs (Karen Hao).

What grounded, rational arguments can we bring to folks debating whether to join a "regular for-profit company"? Should AI safety talent be focused on the labs that develop AI? The precedent in the cybersecurity domain is widely integrated talent, not a concentrated island in the labs that build underlying tools, as I've seen for 15 years in the field. Furthermore, eight other mature safety-critical industries, including aviation, finance, and medical devices, have already faced this question. I wrote this essay to find the answer and share it as clearly as possible.

It's intuitive that the dangerous thing is a frontier model and that it is built at a handful of labs. The intuition that follows is that the people who understand the risks and technical details, and who work towards mitigations, should mostly sit at those labs. Doesn't everyone downstream just call an API? This intuition could be called a factory-gate fallacy: the idea that safety is finished where the thing is made.

But how could safety be a substance we can manufacture and ship? Safety is a relationship between a system and the environment in which we choose to deploy it. Even the language to interpret safety risks, factors, or outcomes depends on downstream operations. How could the competency to manage it avoid that space if we intend to reach positive outcomes?

The essay builds the case in stages: defining what is actually being argued, walking through the industries that settled this question decades ago, weighing what the AI-risk community has already said for and against, checking what is happening on the ground right now, stating the core claim as precisely as possible, and then taking the strongest objections apart one at a time. I am glad to be a part of that understaffed world, just as I've been glad to be part of the broader cybersecurity world for many years before it, even as they both influence my perspective and biased priors.

1. Defining the motion

The claim: AI safety and security competency requires directly responsible individuals with a focus on it within every organization that adopts AI or interacts with other actors who have adopted AI, not only in labs, and this pattern is a high-impact opportunity. The corollary is that concentrating that competency exclusively in AI research-and-development organizations is neither the current trajectory nor compatible with a safe transition to widespread AI use.

AI safety competency could be said to focus on “Layer 8” i.e. the human risks above the application’s base behaviors. Talent in this area focuses on evaluating tendencies and failure modes, developing systems for robustness, and, of course, monitoring how the system behaves in use. Ji and colleagues' (2025) ACM Computing Surveys overview of alignment separates how a system learns the right objective from assuring that it did. The assurance half is the evaluation-and-monitoring work that happens at and after deployment. In a deployer organization, some sub-skills include:

- analyzing failure modes
- scalable oversight
- evaluation design and execution
- red-teaming/behavioral robustness testing
- operational monitoring and observability
- designing systems with a human-in-the-loop
- establishing processes for incidents
- aligning to regulatory requirements
- change management for model updates and new frameworks

AI security competency is about defending the system against adversaries and misuse, where applications are tightly integrated, and models are now wired into a multitude of tools, data, and autonomous workflows. The OWASP Agentic Security Initiative, with more than a hundred contributors including yours truly, publishes numerous guides (e.g., the Top 10 for Agentic Applications 2026) that characterize concrete threats (e.g., agent goal hijack, tool misuse, and identity-and-privilege abuse). Those particular failures are not properties of the model weights: even though the model is causally involved, they occur only once a model is deployed in an application, with real data and permissions. Luckily, many OWASP contributors are not AI lab members but work in deployer organizations (granted, often with a strong security business presence). In parallel, NIST (a reference for security governance) published the Generative AI Profile (NIST AI 600-1, July 2024) that names sub-skills including:

- threat modeling for autonomous systems
- supply-chain security for model artifacts
- agent goal-hijack prevention
- prompt injection and adversarial input defense
- agent sandboxing
- agent identity & access management
- data exfiltration monitoring

These lists (safety and security) show problems outside of AI alignment. They live at the boundary between a model and the world. Evaluations measure behavior in a context, monitoring watches a deployed system, and an agent goal-hijack is an attack on a thing that only exists once someone has deployed it. Not much of this can be built in a vacuum and then shipped: every organization has different contexts, and the distribution can't be reliably simulated at the main labs.

AI research-and-development organizations: these are frontier labs (OpenAI, Anthropic, Google DeepMind, xAI, and their peers) but also the dedicated AI-safety and AI-security research orgs (thank you METR, MIRI, Palisade, Redwood, MATS, and everyone, you know who you are).

By contrast, organizations considered as deployers, operators, or value-added resellers include a bank’s portfolio analyzer, a hospital system using AI assistance in triage, a utility company running simulations with AI, or a software vendor wrapping an API and selling it with the added value of their domain expertise.

Some regulations already draw this line, too. The EU AI Act (in force since 2024, with penalties kicking in this year depending on the month) distinguishes a provider, who develops a system and puts it on the market, from a deployer, who uses it under their own authority.
- Under Article 26, a deployer of a high-risk system carries its own standing obligations, such as human oversight by competent people, monitoring of operation, ensuring input data is relevant, and keeping logs. The deployer does not need to reach into the model's internals nor to have a role in training. What makes a use high-risk is its purpose and context (Annex III): credit scoring, medical triage, hiring and worker management, essential public services, and critical-infrastructure management. A bank that wraps a provider's API to score loan applications has built a high-risk system by virtue of the application's functionality.
- Article 25 adds: a deployer or reseller is promoted to full provider status, with the heavier obligations, once it puts its name on the system, substantially modifies it, or repurposes it into a high-risk use.
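
As a rough illustration, the provider/deployer split and the promotion rule summarized above could be sketched as follows. This encodes only this essay's simplified reading, not the text of the Act, and it is not legal guidance; the names `Organization` and `ai_act_role` and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Organization:
    # Hypothetical, simplified flags based on the essay's summary of the Act.
    develops_system: bool = False          # develops the system and puts it on the market
    puts_own_name_on_system: bool = False  # markets a third-party system under its own name
    substantially_modifies: bool = False   # substantially modifies the system
    repurposes_to_high_risk: bool = False  # repurposes the system into a high-risk use


def ai_act_role(org: Organization) -> str:
    """Return "provider" or "deployer" under the simplified reading above."""
    promoted = (
        org.puts_own_name_on_system
        or org.substantially_modifies
        or org.repurposes_to_high_risk
    )
    return "provider" if org.develops_system or promoted else "deployer"


if __name__ == "__main__":
    print(ai_act_role(Organization()))
    print(ai_act_role(Organization(repurposes_to_high_risk=True)))
```

The point of the sketch is that the role is decided by what the organization does with the system, not by whether it trained the model.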

The motion, then, is that safety and security competencies and responsibilities apply to the whole system, and those competencies and responsibilities are at least as important down the chain, across deployers, operators, and resellers. The talent cannot be hoarded at the top, or we should expect negative outcomes.

2. Previous dangerous technologies have already settled this

The strongest evidence for the motion is that every mature safety-critical industry has already faced this exact question and answered it the same way. The pattern is very consistent:

| Industry | Codified | Governing instrument | Operator's named duty |
| --- | --- | --- | --- |
| Cybersecurity | 2000s onward | NIST CSF, ISO/IEC 27001, Shared Responsibility Model | CISO function, third-party risk program |
| Aviation | 2013 (ICAO Annex 19) | ICAO Annex 19; FAA/EASA SMS | SMS at airlines, ANSPs, MROs |
| Automotive | 2018 | ISO 26262 | Lifecycle safety across suppliers and integrators |
| Pharmaceuticals | 2012 (EU GVP) | EU GVP; FDA FAERS | Distributed pharmacovigilance |
| Medical devices | 2017+ | ISO 14971; EU MDR; FDA | Post-market surveillance at hospitals |
| Nuclear / process | 1992 | OSHA PSM, 29 CFR 1910.119 | Operator PHAs, mechanical integrity |
| Finance (model risk) | 2011 | Fed/OCC SR 11-7 | Three-lines, "effective challenge" at user |
| Food | 1997 (Codex HACCP) | Codex Alimentarius HACCP | Hazard analysis at every processor |
| Maritime / Rail / Fire | various | IMO ISM Code; EU Railway Safety Directive; UK Fire Safety Order | Named safety roles at operators |

A few of these carry more weight than the others.

Cybersecurity is a shared responsibility. This model is explicit between a cloud provider, which secures the infrastructure "of" the cloud, and their customers, who must secure what they put "in" it.

Speaking as someone who reviewed nearly a thousand resumes over the last decade, interviewed over a hundred security engineers, and watched many cases of deception in the hiring process (botnet operators trying to join my company, workers overseas claiming to be US-based, teleprompter cheating), I have seen that the job market is hard for everyone (I really needed to fill a role!). Getting safety and security talent across nearly every organization is not easy, but the good news is that we need it both in risk functions and as ambassadors and advocates. Thousands of decisions are made outside of organizations' security teams. The awareness, judgment, and priorities of people in high-impact roles can matter as much as, if not more than, how securely <major technology company> builds the latest third-party solution, especially when alternative "challenger" providers push a race to the bottom in safety priorities and apparent unit economics. But it's not just the product (finished, or how it's made). Security and safety acumen is needed upstream, but also on every operator's security team, and across the operator's leadership roles. The software company or AI research lab's safety posture is barely a third of the picture.

Managing AI risk requires organizations to empower leaders and individual contributors to acquire AI safety knowledge and apply it - hence, readers from the AI safety community anxious to work at the major labs can "relax" and take a PI-shaped role in a thousand times broader landscape - and the impact you want might not default to the title you think.

We can look at Heinrich's safety pyramid as a typical pattern of how harms are distributed. Fatal accidents are few at the top of the pyramid, while hundreds of minor accidents and thousands of near-miss incidents are already being catalogued and slowly gaining coverage. Between January and May 2026, the Centre for Long-Term Resilience and METR documented such incidents involving rogue agents, including cases impacting real people and real infrastructure. The numbers align closely with this pyramid. Chatham House discussions and even open forums over the last 1-2 years have surfaced still more issues, and HiddenLayer's 2026 findings (under confidentiality agreements) indicate a deeper set of issues that the affected organizations were unwilling to publish. The base is wide, and the middle layers are already showing real-world harm. Responsibility for incidents in the wild lay not with the research labs, but with the designers, engineers, and operators of the systems in question.

The Cloud Security Alliance puts it this way: you can delegate the work of managing a risk, but you cannot delegate the accountability for it. For that reason, serious enterprises run a CISO function (albeit sometimes with a lesser title for SMBs) and a third-party-risk program against frameworks like the NIST Cybersecurity Framework, if only to attend to IT security. Often, responsibility for risk, compliance, and business continuity includes other teams and named individuals as well.

Cybersecurity also shows the metaphor has its limits - as "mature" as the discipline may seem, it is hardly perfect. Security teams are somewhat deliberately scoped: they defend a certain footprint (gone are the days of a clearly defined perimeter) against an evolving family of threat classes, staffed at roughly one to one-and-a-half FTE per hundred employees even in smaller firms, and a single-digit percentage of IT headcount more generally. That hundred-to-one ratio of defended population to defenders has borne dissatisfying outcomes, both for the enterprise and national security, but organizations that survive adapt (including by hiring in-house or managed security partners to influence and assist their IT and engineering orgs). At less.online 2026, sessions on cybersecurity acknowledged this gap. What we didn't name was that a security team that is excellent at network protection and credential hygiene is not inherently equipped to reason about whether an AI model for medical triage quietly disadvantages a class of patients who might suffer or sue. I can leverage existing research, but may err in trying to prove naive hypotheses in stretch domains without crucial controls. Even assessing the permissions of an agentic workflow requires partnering with stakeholders: Does the agent really need write access or read access to all clients? The one-size-fits-all security solution to grant the permission "just in time" only ensures the identity gains access through the agent's real-time request, but doesn't prevent a rogue action or exfiltration. Behavioral mitigations to deny that request when needed typically require that the team combine safety-and-security competencies for AI, which is genuinely new work (before AI agents, this was often deferred to sanctions-based enforcement of policies and insider risk). The adjacent skills and responsibilities expand security and need to be resourced as such.
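
To make the gap between a just-in-time grant and a behavioral mitigation concrete, here is a minimal sketch. It assumes a toy request shape and a single illustrative rule (an agent may only touch the client its current task concerns); `AgentRequest`, `jit_grant`, `behavioral_check`, and `authorize` are hypothetical names, not any product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRequest:
    agent_id: str
    client_ids: tuple    # client records the request would touch
    task_client_id: str  # the one client the agent's current task concerns


def jit_grant(request: AgentRequest, approved_agents: set) -> bool:
    """Identity-only check: is this agent allowed to request access at all?"""
    return request.agent_id in approved_agents


def behavioral_check(request: AgentRequest) -> bool:
    """Scope check (illustrative rule): the request must stay within the current task."""
    return len(request.client_ids) > 0 and all(
        cid == request.task_client_id for cid in request.client_ids
    )


def authorize(request: AgentRequest, approved_agents: set) -> bool:
    """Combine both checks; the identity check alone would let a bulk read through."""
    return jit_grant(request, approved_agents) and behavioral_check(request)


approved = {"billing-agent"}
bulk_read = AgentRequest("billing-agent", ("c1", "c2", "c3"), "c1")
scoped_request = AgentRequest("billing-agent", ("c1",), "c1")

if __name__ == "__main__":
    print(jit_grant(bulk_read, approved), authorize(bulk_read, approved))
    print(authorize(scoped_request, approved))
```

In this toy case the bulk read passes the identity check but fails the combined check. Choosing what the scope rule should be is the part that needs stakeholders who understand the workflow.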

Financial model risk also maps onto AI almost without translation. In banking, 70% of firms are already using agentic AI in some capacity, while fewer than 12% describe their governance strategy as well-defined and resourced. In other words, the deployment is happening, and the safety subject-matter expertise needed to govern it is not yet in the room, creating a visible tension. In some companies (26% of those surveyed by IBM), a named Chief AI Officer role carries accountability for closing that gap, up from about 11% two years prior. Filling a role with that title is not a precondition for the work. A CAIO without AI safety contributors is also likely to face conflicts of interest from the mandates assigned to them. A pattern of paperwork that spreads fast and shallow (which IAPP has found) does little to help reduce risk. The field needs safety among senior leaders, as well as the capacity to support and de-risk AI transformation across the organization's various departments.

In 2011 (the recovery phase of the financial crisis), the US Federal Reserve and OCC issued Supervisory Guidance on Model Risk Management (SR 11-7). It requires that every bank that uses a quantitative model, not the vendor that built it, manage the risk that the model is wrong or misused, through three lines of defense: the people who build and own the model, an independent validation function, and internal audit. Its load-bearing phrase is "effective challenge," the demand that objective, informed people who understand a model's limits actually push on it. The rigor of that challenge scales to how much the bank relies on the model: a small institution scales the controls down, but never to zero.

Note: SR 11-7 was a regulator's response after a systemic failure.

Aviation makes the operator's role clear, too. Boeing and Airbus build the aircraft; nobody believes safety is finished when the plane rolls out. ICAO's Annex 19 and the FAA/EASA SMS framework require a formal, staffed safety-management function at airlines, air navigation service providers, and maintenance organizations that touch the aircraft throughout its life. The peer-reviewed work, such as the 2022 review in Safety, is largely about how to measure safety management maturity at operators, because that is where the question is live.

Automotive works the same way. ISO 26262 (2018) governs the full lifecycle through "production, operation, service and decommissioning," and binds suppliers and integrators rather than the carmaker alone. As a safety-critical artifact moves from designer to integrator to operator to service network, the competency to manage its hazards has to be present at each handoff, because each handoff introduces a context that the previous party could not see.

Pharmacovigilance distributes the detection and reporting of adverse drug effects by regulating hospitals, distributors, and the company holding the marketing authorization. The control is not concentrated on the molecule inventors. If real patients are harmed by combinations or populations the original trials could not capture, the distributed system has a genuine feedback loop to manage the risk.

Post-market surveillance of medical devices also depends on the deploying hospitals, structured by ISO 14971 risk management and mandatory incident reporting. The device maker depends on the hospital to manage the feedback loop and how the rubber meets the (almost literal) road.

Nuclear and process safety decouple the operator of a hazardous facility from the reactor or process designer. Charles Perrow's *Normal Accidents* (1984) supplies the theory: in complex systems, the decisive interventions happen at the sharp end, during operation, by the people on shift.

Researching fields outside my area of expertise was helpful: they have different physics, different centuries, and different regulators, and yet a similar approach. When a technology can hurt people at scale, every society that has lived with it long enough has concluded that the operator needs staffed safety competency.

3. The EA community has already analyzed this

People who think carefully about AI risk have been arguing about this question. I want to give credit and inform my take with a few citations.

Several recent arguments on LessWrong and adjacent EA spaces make a case for keeping talent in the labs. One of my earliest reads was 80,000 Hours' "Should you work at a frontier AI company", strongly favoring this view. Its rebuttals to the downsides have not materialized, for what it's worth. There are competencies whose leverage is bound tightly to frontier access. Mechanistic interpretability needs the model internals, the training checkpoints, and the compute to run experiments against them, and most of that lives at the labs. Bilal Chughtai's weighing of frontier-lab safety work called it out well: unless you work at a 3rd party safety research firm, selling your time to the frontier AI company is unlikely to yield meaningful constraints in the face of revenue and power incentives, similar to social networks, and Chughtai explicitly calls out the risk that your work may be used for safety-washing. At the same time, safety for training and inference infrastructure (the actual serving stack, the fine-tuning pipelines, and the deployment harness) is critical and hard to do unless there are highly competent cybersecurity practitioners with AI expertise inside the organization that runs this infrastructure. Furthermore, a genuinely global threat-management vantage, the ability to see attack patterns across an entire API surface, is mostly the lab's. That said, labs do not see blind inference in AWS Bedrock, and this demo and my patent show how some actors on the internet can also help build global monitoring beyond a single organization. It is also worth conceding that the labs can afford it: they pay more and compete hard for exactly this talent, so the gravitational pull toward them is real regardless of the argument's merits.

The case for distribution also exists on LessWrong and the EA forum. Boaz Barak's "Six Thoughts on AI Safety" (January 2025) strongly supports the need for operators and deployers to formally manage the shared safety responsibility: there is no temporal gap. AI is being woven into high-stakes parts of society now, before any superintelligent helper arrives to clean up. Safety will behave like computer security, i.e., with no single magic solution, only defense-in-depth at every stage, including deployment and monitoring. The 80,000 Hours problem profile on extreme power concentration (2025) also bolsters the case for distribution, because under time pressure, organizations hand more unchecked control to AI on the strength of its potential, faster than they build the competency to govern it. This unchecked delegation can occur in the labs, but we certainly see it across the industry. Kulveit and colleagues' "Gradual Disempowerment" (2025) shows that aligning each individual model with its developer's intent and world views is not sufficient: not only could this fail in many environments, but harms can also emerge from the interactions on the web between many adequately-aligned systems (as they reshape the economy, the culture, and the state, not inside the lab but inside ordinary institutions). This argument is also supported by the economic incentives described in Drago and Laine's "Intelligence Curse" (2025), and by Leveson (Engineering a Safer World: Systems Thinking Applied to Safety), whose work was specifically applied to AI risk by Oliver Sourbut.

Abbey Chaver's "AI Infrastructure Security Shortlist" (2026) also describes two different talent problems that are worth splitting.

The first is securing AI infrastructure itself (protecting model weights, hardening the training-and-serving stack). The population of people doing fundamental work here is plausibly in the range of a few dozen FTEs, and that genuinely is scarce in the way that suggests concentration.
The second is misuse, untrusted agentic workflows, and rogue deployments. While this needs some fundamental research contributors, more importantly, it needs a large body of edge-workload safety and security practitioners embedded where the workloads actually run. Some safety work has to happen close to the model; most has to happen close to the deployment.

Conflating the two is how people talk themselves into "there is too little talent to distribute" when the honest reading is that one narrow sub-problem is talent-bound and the broad one is investment-bound.

As a bottom line, rationalists have shown merit to both sides. The pro-concentration case rests on frontier access for specific competencies and the labs' ability to pay. The pro-distribution case rests on deployment-layer harms, defense-in-depth, the insufficiency of per-model alignment, and the fact that most of the security work is edge work.

4. What's actually happening right now

In the motion, I claimed that concentration is not even the current trajectory. Is it true?

The best available data is the IAPP and Credo AI AI Governance Profession Report 2025, a survey of more than 670 professionals across 45 countries. Its headline numbers: roughly 77% of surveyed organizations are working on AI governance, rising toward 90% among those already using AI, and about 30% of organizations not yet using AI are already building governance capacity. Distribution is already underway; it is a weak, uneven, early-stage reality rather than a proposal.

The same survey finds roughly half of AI-governance professionals sit in legal, privacy, ethics, or compliance functions, which suggests that business and technical functions may be lacking expertise in the safety-and-security layers. Outcomes may get worse if high-impact AI agents are rolled out under basic governance guidelines that spread quickly and shallowly, while technical safety and security competency is spreading slowly and remains concentrated upstream. We don’t just need voluntary training/adoption of governance scaffolding (ISO/IEC 42001, NIST AI RMF, and IAPP AIGP), but broad and deep acumen that sees the risks clearly in each daily decision that steers AI agent deployments (e.g., technical AI safety staff or teams implementing OWASP ASI frameworks I already mentioned to model and mitigate the threats).

Where policy has moved past voluntary, it has not resolved the SME problem. A variety of states require school districts to adopt a formal policy on the use of AI. Some states provide model policies and toolkits to support implementation, but the mandates generally establish that governance is required, often without specifying what counts as adequate or how to make efforts fruitful towards the desired outcome. The substance (what to allow, what to prohibit, how to evaluate, how to monitor) is left to each district's internal capacity to figure out, which is exactly the contributor-with-AI-safety-subject-matter-expertise shape this argument has been describing. Policymaking is creating demand for the role faster than the role is being filled.

The lazy version of the argument for concentration may point to talent simply not existing in the numbers required. But consider the acceptance rates into the field's flagship training pipelines. MATS reports selecting on the order of 4–7% of applicants; reviewers describe single-digit MATS acceptance, around 1.5% for the Anthropic Fellows program, and roughly 15% for SPAR projects. If the vast majority of motivated applicants across programs are turned away for lack of capacity, the constraint is how many seats/roles are funded, not the supply of people who could do the work. The ecosystem is also simply larger than the pessimistic count implies: BlueDot's community runs to the order of ten thousand members, the OWASP working initiative on securing agentic applications convenes on the order of a hundred AI-and-security collaborators for some of its guides, and gatherings like the AI Security Forum draw several hundred attendees.

There may well be two or more orders of magnitude between the number of safety-and-security specialists and the number of organizations deploying AI, in which case training more would be imperative. But a) having full employment of AI safety talent would be a good problem to have, and b) the relevant denominator is not "all organizations"; it is the organizations whose products or services touch hundreds of thousands of people each year, where an unmanaged failure is consequential at scale. That population is far smaller and very much staffable now from the talent that already exists.

I certainly hope we do not steer them away from those high-impact roles. The risk is not that we lack the people. The risk is that those high-impact roles go unstaffed because AI safety is misperceived as a lab responsibility, even by AI safety insiders, leaving consequential deployments under-mitigated while qualified people are told there is no seat for them.

Why have so many organizations, especially smaller and more peripheral ones, not yet named anyone accountable for an AI safety-and-security practice? Four ordinary, non-mysterious mechanisms account for most of it:

- diffusion of responsibility, the comforting assumption that the lab or the vendor or someone upstream has it handled
- cost and specialization barriers, since the standards that would tell you what "enough" looks like are young
- a principal-agent gap, because the people who would bear the downside of an AI failure are often not the people choosing to deploy
- the plain fact that the field has not yet had its forcing function (its Therac-25, its 2008, its 737 MAX): the public catastrophe that converts "best practice" into "table stakes."

The point of an argument like this one is to reach, by reasoning, the conclusion that mature fields only reached after the accident.

What would proportionality entail? Maybe the inference vs pre-release testing ratio I proposed a couple of years ago is too idiosyncratic, but we could draw from the SR 11-7 example to paint a simpler ladder that avoids burdening small, low-impact organizations:

- Under 1,000 people impacted per week. This would include internal tools, narrow pilots, community projects, and services by small firms.
  - Responsibility assignment: one part-time owner with the authority to halt a deployment, and named access to second-line review when needed.
- 1,000 to 1M people impacted per week. This would include most enterprise and early mass audience applications.
  - Responsibility assignment: a named function with full-time staffing proportionate to scale; the SR 11-7 three-lines template applies.
- Over 1M people impacted per week. This would include critical infrastructure and large-scale platforms.
  - Responsibility assignment: a named function with full three-lines independence, i.e., owner, validator, internal audit (staffed against a published target ratio).
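
The ladder above can be written down as a small lookup, which makes the tier boundaries explicit. This is only a sketch of the proposal as stated here; the function name `responsibility_tier` and the handling of the exact boundary values (1,000 and 1M both fall in the middle tier) are assumptions.

```python
def responsibility_tier(people_impacted_per_week: int) -> str:
    """Map weekly reach to the responsibility assignment proposed in the ladder."""
    if people_impacted_per_week < 0:
        raise ValueError("people_impacted_per_week must be non-negative")
    if people_impacted_per_week < 1_000:
        # Internal tools, narrow pilots, community projects, small firms.
        return "part-time owner with halt authority and access to second-line review"
    if people_impacted_per_week <= 1_000_000:
        # Most enterprise and early mass-audience applications.
        return "named function with full-time staffing proportionate to scale"
    # Critical infrastructure and large-scale platforms.
    return "named function with full three-lines independence"


if __name__ == "__main__":
    for reach in (200, 50_000, 5_000_000):
        print(reach, "->", responsibility_tier(reach))
```

As with SR 11-7, the controls scale down for small deployments but never reach zero: even the lowest tier names an owner who can halt a deployment.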

Cybersecurity outcomes at 1–1.5 security FTEs per hundred employees were poor, so those ratios are not enough. Furthermore, AI safety and security cannot reasonably be assumed to take less, and likely take more, because the technology has inherently accelerated the pace of change but not of risk management, while failure modes appear in business logic opaque to the security team.

5. The actual argument

C1. General-purpose AI training is forced to compromise between benefits and friction across an infinite variety of use cases, and therefore cannot be sufficient for any single one of them.

C2. The constraint on AI safety competency is the number of funded seats within each deployer organization, not the supply of trained people.

C3. Major internal training on safety pitfalls and mitigations is needed inside deployer organizations, not only at the labs.

C4. A small share of safety and security work has to happen close to the model, and can be easily misconstrued as a shortcut to safety, whereas most of that has to happen close to the deployment, where the workload actually runs.

Regulation stays out of this section deliberately, because the argument here is the axiom that should drive what we regulate, not a consequence of it.

How do we know these axioms are true?

Supporting C1: safety is a control property of a system in operation, not a component property of an artifact. This is Nancy Leveson's central result, developed across her 2004 Safety Science accident model and the 2011 book Engineering a Safer World. Accidents in complex systems are not mainly chains of broken components but failures of control over the interactions between components, and those interactions only fully exist when the system is running in its real environment. Verification can only occur where the system is deployed and operated. Sourbut highlighted what follows: responsibility for safety has to be distributed throughout the sociotechnical system, because that is the only place the relevant control loops are. While my own study was with the French EBIOS in the late 2000s, the STAMP/STPA approach may be more effective to apply directly to AI systems, and it addresses its guidance to the people responsible for operating them. I’ve bookmarked the survey of STPA for learning-enabled systems (Qi et al., 2023), the PHASE adaptation (Rismani et al., 2024), and subsequent work on systematic hazard analysis for frontier AI.

Supporting C3: under competitive pressure, operating organizations drift toward the unsafe boundary, and only local competency can sense the drift. This is Jens Rasmussen's migration model, from his 1997 Safety Science paper. Safety is not a static state; a real organization under pressure to reduce costs and human effort continuously migrates its working practices toward the edge of the safety envelope, usually without anyone making it a conscious decision. What follows is that the control needed to detect and arrest that migration has to exist at the operating level, because that is where the migration happens and practices can be fixed. An upstream model’s lab can see the prompts, but the drift in practices and its impact are mostly opaque to the inference provider. Provider interference with the deployer’s practices is also immediately perceived as overreach, even for issues that draw broad objections. There are real, rational competitive and risk-appetite pressures pushing every AI deployer toward "ship it, it is probably fine", and these pressures are not going away. Someone needs to see it and name it - and an ivory tower does not make a robust security culture. Training and awareness beget thoughtful decisions.

Supporting C4: risk propagates through interconnected deployments and cannot be managed only by model developers. Now that AI is deployed widely, deployers are not independent; their operations are highly correlated, forming a network that shares models, vendors, data pipelines, and failure modes. When Claude or ChatGPT is down, multiple parties are impaired simultaneously, as though their workers had all walked out on strike. Acemoglu, Ozdaglar, and Tahbaz-Salehi's 2015 analysis of financial networks depicts a pattern of "robust-yet-fragile" networks: dense interconnection absorbs small shocks well but transmits large ones catastrophically. The same connectivity can act as protection or as an amplifier of risk, depending on the size of the shock. What we see is that systemic AI risk is a property of the deployment network's topology, not of the source model. What follows is that a property of a network cannot be managed only at one node, however important that node is.
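The "robust-yet-fragile" pattern can be illustrated with a toy loss cascade. This is a minimal sketch and not the model from Acemoglu et al.: the equal buffers, the equal splitting of excess losses, the two network shapes, and the names `complete_graph` and `count_failures` are all assumptions chosen for illustration.

```python
def complete_graph(nodes):
    """Adjacency lists for a fully connected group of nodes."""
    nodes = list(nodes)
    return {a: [b for b in nodes if b != a] for a in nodes}


def count_failures(neighbors, shocked, shock, buffer):
    """Toy cascade: a node fails when its loss exceeds its buffer, and passes
    the excess, split equally, to the neighbors that are still standing."""
    losses = {node: 0.0 for node in neighbors}
    losses[shocked] = shock
    failed = set()
    pending = [shocked]
    while pending:
        node = pending.pop()
        if node in failed or losses[node] <= buffer:
            continue  # already failed, or the buffer absorbs the loss
        failed.add(node)
        standing = [n for n in neighbors[node] if n not in failed]
        if not standing:
            continue  # nobody left to pass the excess to
        share = (losses[node] - buffer) / len(standing)
        for n in standing:
            losses[n] += share
            pending.append(n)  # re-check this neighbor with its new loss
    return len(failed)


# Assumed setup: ten deployers, each able to absorb a loss of 1.0.
dense = complete_graph(range(10))  # everyone depends on everyone
clustered = {**complete_graph(range(5)), **complete_graph(range(5, 10))}  # two islands

for shock in (8.0, 30.0):
    print(shock,
          count_failures(dense, 0, shock, 1.0),
          count_failures(clustered, 0, shock, 1.0))
```

With these assumed numbers, the moderate shock fells 1 node in the dense network and 5 in the clustered one, while the large shock fells all 10 nodes in the dense network and still only 5 in the clustered one. The same connectivity that dilutes the moderate shock carries the large one to every node.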

Supporting C2 and C4: a model's risk materializes at the point of use. Per the example of SR 11-7, the same model, validated identically upstream, generates different risks at different nodes of a credit system. The same holds for a triage system or a hiring system, because the risk depends on the motivations for use, the context, the manner in which the model is integrated, and the humans in the loop. What follows is that the validation competency has to be where the use is (scaled to the deployer's exposure rather than fixed at a single ratio).

Putting my argument in a nutshell:

Since safety is a control property that exists only in operation (Leveson), and operating organizations drift toward the unsafe boundary under pressure (Rasmussen), and risk propagates through the deployment network rather than staying at the source (Acemoglu et al), and a model's risk is realized at its point of use (SR 11-7), then the competency to sense and control that risk must be resident at each operating organization, scaled to its exposure. We must not concentrate it upstream at the labs, or the deployed impact of AI will cause significant harms and potentially catastrophic outcomes.

In the short term, failures are mundane and already happening: deployers without competency may misconfigure systems or miss agent goal-hijack and tool-misuse failures (just this week I found an exploit of both in Gemini). They may let unvalidated automated decisions run, producing small, distributed harms that are individually survivable but harmful to society in aggregate.

In the medium term, as deployers couple together, the Acemoglu fragility paradigm expects occasional large shocks to propagate where small ones used to be absorbed. The impact to expect is infrastructure brittleness and correlated failures across institutions that share a model or a vendor.

In the long term, the systemic stories the AI-risk community has been telling (gradual disempowerment, power concentration) are stories about ordinary institutions losing the competency to compete against the front runners, and losing the agency to resist drift. Rasmussen's migration could massively impact society, although the exact scenario for how is far from certain.

On every timeline, harm is cheaper to prevent with resident competency than to clean up without it.

6. The objections

The argument needs to survive the obvious pushback. Here are the objections I think carry the most weight, with my responses.

Objection 1: "Fine, but ordinary organizations already have risk functions. Why does this need more than the existing GRC team?"

Yes, the Governance, Risk, and Compliance team is a fantastic first place for this competency to land in every organization. This supports the motion; it only shows that there is no need to prescribe how each organization should structure itself to make internal AI safety competency available and applied. Most organizations have an implicit or explicit GRC function running through structures that align with the rest of the org, and that team is a fine initial owner. For those that are new or going through structural transformation, there are supportive examples: the Three Lines model (IIA 2013, updated 2020), enterprise risk management under COSO ERM, and ISO 31000. SR 11-7 extended that machinery to financial models. The theory is that an independent risk management function can challenge the people who want to ship, in order to limit the organization's risk exposure. My aim is not to create a new parallel priesthood but to support the great GRC teams out there: concentrating AI safety knowledge is bad, and AI safety and security competency must be distributed within organizations too. An organization whose functions are adopting AI needs a second line able to challenge how it gets done (C2), but it is much better for the proposals to be reasonable in the first place. Furthermore, even an organization adopting no AI itself increasingly operates in an ecosystem where its vendors, counterparties, and adversaries all have adopted it, so its third-party-risk and threat models are now AI-shaped whether it likes it or not. And the areas I have named involve deep technical skill that must be acquired deliberately. A GRC team that cannot reason about AI is, within a few years, a GRC team that cannot do its job.

Objection 2: "Safety research labs and regulators can set standards. Once the organization has best practices, policies, and procedures, the decision-makers for product and operations teams just need to follow them. Why does competency have to be resident at all?"

This is the most tempting objection, because it sounds responsible, and it is wrong in a way the public record now documents in detail.

Best practices do not enforce themselves.

In July 2025 an AI coding agent on Replit deleted a live production database during an explicit code-and-action freeze, destroying records for more than 1,200 executives and roughly 1,200 companies, despite direct instructions that no changes were to be made without permission. It then misreported that rollback was impossible, and also fabricated data. The user gave instructions meant to implement best practices but didn’t realize that natural-language instructions do not bind an agent the way an enforced control does. The competency to scope the agent's authority and to separate development from production was insufficient both at the citizen-coder level and at the platform provider. Replit's CEO afterward conceded that such an outcome should never have been possible and rushed to add dev/prod separation. A comparable Gemini CLI case wiped user files after the agent misread a command sequence.
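The difference between telling an agent about a freeze and enforcing one can be sketched in a few lines. This is a hypothetical illustration, not Replit's design or any vendor's API: `AgentPolicy`, `authorize`, `was_denied`, and the verb list are names assumed for the example. The check runs in ordinary code outside the model, so no prompt can talk its way past it.

```python
from dataclasses import dataclass

# Assumed list of verbs that cannot change state.
READ_ONLY_VERBS = frozenset({"select", "describe"})


class ActionDenied(Exception):
    """Raised when a requested action falls outside the agent's granted authority."""


@dataclass(frozen=True)
class AgentPolicy:
    allowed_environments: frozenset  # environments the agent may touch at all
    change_freeze: bool = False      # when True, only read-only verbs pass


def authorize(policy: AgentPolicy, environment: str, verb: str) -> None:
    """Deny by default; called by the tool layer before any action executes."""
    if environment not in policy.allowed_environments:
        raise ActionDenied(f"agent has no authority over '{environment}'")
    if policy.change_freeze and verb not in READ_ONLY_VERBS:
        raise ActionDenied(f"change freeze in effect: '{verb}' is blocked")


def was_denied(policy: AgentPolicy, environment: str, verb: str) -> bool:
    """Convenience wrapper for demonstrating the policy."""
    try:
        authorize(policy, environment, verb)
    except ActionDenied:
        return True
    return False


# The agent is scoped to dev only, and a freeze is active.
policy = AgentPolicy(allowed_environments=frozenset({"dev"}), change_freeze=True)
print(was_denied(policy, "prod", "delete"))  # True: prod is out of scope
print(was_denied(policy, "dev", "delete"))   # True: the freeze blocks writes
print(was_denied(policy, "dev", "select"))   # False: reads are still allowed
```

The design choice is that authority is granted per environment and the freeze is a flag checked on every call, so the safe outcome does not depend on the agent interpreting an instruction correctly.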

Where someone competent is in the room, the incidents get documented. Sinch's 2026 survey of 2,527 enterprise decision-makers found that 74% of organizations running AI customer-communication agents in production had already been forced to shut them down or roll them back. Importantly, the figure rises to 81% among organizations with fully mature governance instrumentation. Although the number sounds bad at first glance, I believe that, on the contrary, it shows that organizations with mature instrumentation can see failures that less mature programs miss entirely, and they have the authority to act on what they see. The organizations reporting no rollbacks are not the benchmark; they are the ones with the least visibility into what is happening in their own deployments. Rollback shows governance with feedback loops.

Objection 3: "The deployer is just calling an API. Why should they duplicate work already executed by the provider’s safety team?"

This is the factory-gate fallacy, presented in its most reasonable form. As I mentioned, cloud security ran this exact experiment with software, concluded that "the provider secures it" does not work, and formally adopted the shared-responsibility model. Accountability for a risk cannot be outsourced just because the maintenance and procurement of the servers are. Finance likewise needed SR 11-7 because institutions wrongly treated vendor-validated models as inherently safe, forcing regulators to flag that model risk is realized at the point of use (and to enforce its management). From C1 and C4, I also showed that the lab is structurally located where most of the relevant hazards do not yet exist. If the lab cannot see the deployment context, the drift towards unsafe practices, the users, or the adversaries, it is not being complacent; it simply lacks visibility and isn’t involved in the relevant decisions.

Objection 4: "Regulation will handle this. The EU AI Act, sectoral regulators — we do not need to win the argument, we need to wait for the rules."

We’ve been in AI transformation for half a decade. Some regulations have materialized and require action today. There are plenty of enacted AI safety bills in G20 countries that specify outcomes and obligations but depend on operators to establish local competency. This matches aviation's SMS mandate, OSHA's process-safety standard, and finance’s SR 11-7, because the regulator knows it can demand a safe result but cannot itself be in the room when the system runs. Rasmussen's migration and Perrow's normal accidents make the same point from the theory side: rules at the top of a control structure cannot, by themselves, arrest drift at the operating level; only the people at that level can, provided they are equipped with the mandate and the competency. Regulation is a forcing function for distributing competency, not a substitute for it. Organizations that wait for more rules or fines, and then hand the job to existing staff as a side responsibility without genuine capacity, may believe they comply on paper, but in practice they have decided to move towards the unsafe boundary.

Objection 5: "If timelines are short and the decisive events happen at a few labs and governments, why scatter talent instead of channeling it to crucial orgs?"

Short timelines strengthen the case for distribution rather than weakening it because under time pressure, organizations delegate more control to AI based on forward-looking views, but do not build the capacity to govern it as quickly (C3). Long timelines do not reverse the conclusion; they relax it, by giving organizations more runway. Granted, a modest number of deep specialists in genuinely frontier-bound competencies, mechanistic interpretability foremost among them, do have higher leverage close to the model (C4). Hiring stats are not showing a dearth of candidates. A “yes, and” approach applies, as we need some talent for the labs, and a lot of talent for all the organizations deploying their technology.

Objection 6: "Doesn't C1 still leave catastrophic universal risks (pandemic uplift, mass-casualty cyber, the genocide tier) that require centralized intervention?"

There are definitely universal risks, as AI models now have capabilities that materially uplift mass-casualty attacks, biological and chemical weapons, and infrastructure-disabling cyber operations. Their magnitude makes me regard the standard "dual-use, balance the tradeoffs" framing as penny-wise and pound-foolish. The benefits cannot be diffuse enough to outweigh a catastrophic floor (including but not limited to X-Risk). For these, training-time refusals, capability evaluations, and pre-deployment red-teaming at the labs do load-bearing work that no distribution of deployer competency can replicate. C1 still holds in that training cannot be sufficient for the infinite ordinary cases. But I concede that for the subset of cases where the floor is catastrophic, we’d need models that do not bring those capabilities, because no deployer-side mitigation can recover from the event. This is the one place the concentration argument is not just defensible but mandatory - and labs are unlikely to solve it, no matter how much talent they acquire, except by ending the “free” contributions made to accelerate the frontier. The rest of the motion is unchanged.

7. Where this leaves us

The starting intuition feels like common sense: a dangerous technology is built by frontier AI labs, so the safety people belong at those labs or the closest safety research organizations. It is the factory-gate fallacy, and nine mature safety-critical industries (including cybersecurity, aviation, automotive, pharma and medical devices, nuclear, and finance) have already discovered it is false, written the correction into law, and taken action that made us all safer. There is consistent agreement that safety is a control property that exists only in operation (Leveson), and that a model's risk materializes at its point of use (SR 11-7). Operators drift towards unsafe practices under pressure (Rasmussen), and risk propagates through the network of operators and deployers rather than staying at the source (Acemoglu et al). What follows is that the competency to manage AI risk has to be resident at every operating organization, scaled to its exposure and AI adoption (similar to the SR 11-7 model) rather than fixed at a single headcount ratio. Cybersecurity disasters have established cautionary precedents, in which insufficient staffing of CISO teams is an important factor. For AI safety, this distribution is already weakly underway and nowhere near adequate, with the real bottleneck being unstaffed high-impact roles rather than an absent supply of people.

I hold with high confidence that reserving AI safety and security competency for the frontier labs is incompatible with a safe transition: safety does not ship from labs.

Many deployer organizations will overstate in vague terms the safety of their AI products: incentives are to ship more features, and many users have less and less time to verify the details on safety. Distributing the competency puts someone in the room who can see and address the issues before they cause material harm (or at least, bring fixes that actually work when there are gaps). A modest number of the deepest specialists in frontier-bound competencies do have higher leverage near the frontier. Still, competitive dynamics drive rapid diffusion of capable models across the whole economy, including as open-weights models, so the deployment surface that needs resident competency is much wider than any plausible concentration of talent.

A question worth answering in a future post is how to grow the acumen of existing staff, or fund the seats, fast enough to equip the key teams with the competency required to manage the AI risk component in their daily decisions. The embedded roles can have a real impact, far beyond a mere compliance ornament.

If you are reading this with the seniority to act on it somewhere outside of the labs, you are already a de facto champion I am counting on. The work is not to wait for the CISO, the regulator, or the lab safety team to tell you what to do or how. They are likely waiting for someone to move first. AI transformation has likely already taken a deep hold in your department’s goals, and a major incident could set those back significantly. Find the others in your organization who can also see the cliff — the engineer who may be concerned about an agent breaching the dev/prod boundary, the risk officer who wasn’t in the loop on a sensitive change to data sharing, the project lead who has been told to deploy with fixed resources and deadlines, and had to cut scope. Work alongside them as ambassadors of safety, and build the resident competency at your organization before the incident that needs prevention arrives.

The factory-gate fallacy is, at its core, a coordination failure. The people who break it will be the ones who recognized themselves as peers to the leaders they had been waiting for.

References

Peer-reviewed and seminal

Acemoglu, D., Ozdaglar, A., & Tahbaz-Salehi, A. (2015). Systemic risk and stability in financial networks. *American Economic Review*, 105(2), 564–608. doi:10.1257/aer.20130456
Ji, J., et al. (2025). AI alignment: a comprehensive survey. *ACM Computing Surveys*. doi:10.1145/3770749
Leveson, N. (2004). A new accident model for engineering safer systems. *Safety Science*, 42(4), 237–270. doi:10.1016/S0925-7535(03)00047-X
Leveson, N. (2011). Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press.
Perrow, C. (1984). Normal Accidents: Living with High-Risk Technologies. Basic Books.
Rasmussen, J. (1997). Risk management in a dynamic society: a modelling problem. *Safety Science*, 27(2–3), 183–213. doi:10.1016/S0925-7535(97)00052-0
Reason, J. (1997). Managing the Risks of Organizational Accidents. Ashgate.
Assessing and advancing safety management in aviation. (2022). *Safety* (MDPI), 8(2), 20.
Qi, Y., et al. (2023). STPA for learning-enabled systems: a survey and a new practice. arXiv:2302.10588.
Rismani, S., et al. (2024). From silos to systems: process-oriented hazard analysis for AI systems (PHASE). arXiv:2410.22526.
Systematic hazard analysis for frontier AI using STPA. (2025). arXiv:2506.01782.

Standards and regulatory primary sources

OWASP GenAI Security Project. (2026). Top 10 for Agentic Applications and the Agentic Security Initiative.
Autio, C., Schwartz, R., et al. (2024). *Artificial Intelligence Risk Management Framework: Generative AI Profile*. NIST AI 600-1.
NIST. (2023). *AI Risk Management Framework (AI RMF 1.0)*.
ISO/IEC 42001:2023 — Artificial Intelligence Management System.
ISO/IEC 27001 — Information Security Management.
ISO 31000:2018 — Risk Management — Guidelines.
ISO 26262:2018 — Road Vehicles — Functional Safety.
ISO 14971:2019 — Medical Devices — Application of Risk Management.
COSO. (2017). *Enterprise Risk Management — Integrating with Strategy and Performance*.
Institute of Internal Auditors. (2013; updated 2020). *The Three Lines Model*.
Federal Reserve / OCC. (2011). *Supervisory Guidance on Model Risk Management* (SR 11-7).
OSHA. *Process Safety Management of Highly Hazardous Chemicals*, 29 CFR 1910.119.
European Union. (2024). Artificial Intelligence Act, esp. Articles 3, 25, 26, and Annex III.
ICAO. Annex 19 / Safety Management; FAA & EASA SMS frameworks.
IAPP & Credo AI. (2025). AI Governance Profession Report 2025.
Codex Alimentarius (HACCP); IMO ISM Code; EU Railway Safety Directive; UK Regulatory Reform (Fire Safety) Order 2005.

Deployer-side governance evidence (Sections 3, 4, 6)

interface.ai. (2026). The Governance Work Credit Unions & Community Banks Need Before Deploying Agentic AI. Source for the "70% deploying agentic AI / <12% well-resourced governance" gap in banking.
Sinch. (May 2026). The AI Production Paradox: Findings From 2,527 Enterprise Leaders. Source for the 74% rollback rate among production AI agents and 81% rollback rate among organizations with fully mature governance instrumentation.
Ohio Department of Education and Workforce. AI Model Policy for Ohio Districts and Schools — implementing House Bill 96, mandatory AI policy adoption by July 1, 2026.
Tennessee Public Chapter 550 (2024); Tennessee School Boards Association Model Policy 4.214 (June 2024). Summarized in AI for Education state guidance index.

Documentation on incidents

Heinrich, H. W. (1931). *Industrial Accident Prevention*. McGraw-Hill. (Origin of the safety-pyramid model.)
Centre for Long-Term Resilience: 2026 AI scheming report.
METR (Jan–May 2026). Rogue-agent incident report.
HiddenLayer: 2026 AI threat landscape report.
Replit AI agent deletes production database during code freeze (July 2025): Tom's Hardware; eWeek (CEO response); AI Incident Database #1152.
GPT-4 / ARC CAPTCHA deception during pre-release evaluation (2023): Vice.
Anthropic, agentic misalignment across 16 models (2025): Anthropic Research.

Talent-pipeline competitiveness (Section 4)

MATS acceptance ~4–7%: MATS, "Getting into MATS".
MATS single-digit, Anthropic Fellows ~1.5%, SPAR ~15%: Lange, "I Reviewed Hundreds of AI Safety Applications".
Security staffing ratios (~1–1.5 FTE per 100 employees): IANS Research; Indeed.

Dialectical layer (LessWrong and adjacent, 2023–2026 — surfaced as debate, not relied on as evidence)

Chughtai, B. (2024–25). Reasons for and against working on technical AI safety at a frontier AI lab.
Barak, B. (2025). Six thoughts on AI safety.
Kulveit, J., et al. (2025). Gradual disempowerment. arXiv:2501.16946
Drago, L., & Laine, R. (2025). The intelligence curse.
Sourbut, O. (2026). Engineering a safer world: risk modeling for AI loss of control.
Chaver, A. (2026). The AI infrastructure security shortlist.
80,000 Hours. (2025). Problem profile: extreme power concentration.


Paying Kids To Do Schoolwork

LessWrong.com News - June 14, 2026 - 07:01

I think that the standard schooling system could be a lot better. This is for two main reasons:

It’s slow.[1]
It limits agency.[2]

This isn’t to blame the people who work in schools — for the most part they do a really good job with what they’re given. I just think that we can provide children with a much better experience — and it mostly comes down to motivation.

Learning takes effort, and while learning is often enjoyable, there are inevitably certain tasks or subjects which students will dislike but which are nonetheless very useful. The method that standard schooling uses to motivate its students is mostly the threat of punishment (having to do more work, notifying parents of poor performance), whereas the reward for doing well is mostly just praise.

I think that this method is missing a big component: actual rewards.

And the most straightforward way to accomplish this is: pay kids to do schoolwork.

It doesn’t need to be, and shouldn’t be, a lot of money by adult standards. Their daily earning potential in dollars can be something like their grade level divided by 2 ($0.50/day for grade 1, $6/day for grade 12). They can then use their money to buy things from you like snacks, toys, et cetera.[3]

A day’s worth of schoolwork for a 10 year old could look something like this:

Write a short story ($0.25)
Complete a mathematics worksheet ($0.25)
Practice and perform a short piece of music ($0.25)
Read a map and answer questions ($0.25)
Memorise 10 new words in Spanish ($0.25)
Make a simple animation ($0.25)
Cook a meal ($0.25)
Learn a juggling trick ($0.25)
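The pay rule and the sample day above can be checked with a little arithmetic. This is a minimal sketch: the function name `daily_cap` is made up for illustration, and placing a 10 year old in grade 4 is an assumption, since the post does not say which grade it has in mind.

```python
from decimal import Decimal


def daily_cap(grade: int) -> Decimal:
    """Daily earning potential in dollars: grade level divided by 2."""
    return Decimal(grade) / 2


# The eight sample tasks, each paying $0.25.
tasks = {
    "Write a short story": Decimal("0.25"),
    "Complete a mathematics worksheet": Decimal("0.25"),
    "Practice and perform a short piece of music": Decimal("0.25"),
    "Read a map and answer questions": Decimal("0.25"),
    "Memorise 10 new words in Spanish": Decimal("0.25"),
    "Make a simple animation": Decimal("0.25"),
    "Cook a meal": Decimal("0.25"),
    "Learn a juggling trick": Decimal("0.25"),
}

day_total = sum(tasks.values(), Decimal("0"))

print(daily_cap(1), daily_cap(12))  # 0.5 and 6, matching the grade 1 and grade 12 figures
print(day_total)                    # 2.00
print(day_total <= daily_cap(4))    # True: the day fits a grade 4 cap (assumed grade)
```

Under that assumption, completing all eight tasks earns exactly the grade 4 daily amount of $2.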

This not only serves as a powerful tool for incentivizing learning; by paying students to do work, we also unlock a powerful way to speed up education: asynchronous learning.

That is, instead of everyone in a classroom learning the same thing at the same time, a teacher can assign a bunch of tasks and have students complete work and progress through content at their own speed.

This gives students a lot more agency over what they do during the day and when they do it, and means they will never be “left behind” or “held back” relative to other students in terms of how quickly they master specific subjects.

I learned about the concept of paying kids to do their school work from Edward Nevraumont’s review of Alpha School. Unfortunately, the idea of incentivizing students like this seems pretty taboo for most people, and Alpha School is the only place I’ve heard of which does it.

I want to be a parent someday and unless I can find a school which has this kind of rewards-based asynchronous learning, I’ll want to do homeschooling. Homeschooling does have downsides like requiring more time and effort — but I still think it’s worth it.

If you have any ideas/experience with alternative schooling systems, I’d love to hear from you.

[1] Kids are mostly in lock-step with each other in terms of how quickly they progress through the content, and have little ability or incentive to go faster, at least until towards the end of high school, when students can choose to do more advanced subjects. But even then they mostly still have to progress at the speed of the rest of the class.
[2] Chattel Childhood by Aella highlights how little grown-ups tend to respect the agency of children.
[3] You could also use fake money, or just directly reward with snacks, toys, et cetera, if you don’t like the idea of using real money, although I think that real money simply works the best.


Speeding Up JumpReLU SAE Inference with Custom Triton Kernels (2–14× on Real SAEs)

LessWrong.com News - June 14, 2026 - 07:00

Motivation

Sparse Autoencoders (SAEs) have become a central tool in mechanistic interpretability research, providing a way to decompose a model's internal activations into sparse, interpretable features. However, extracting these features often requires running the SAE over large volumes of activations across many layers and tokens. This makes SAE inference efficiency a practical bottleneck for interpretability research at scale.

This post focuses on improving the inference efficiency of JumpReLU Sparse Autoencoders, which were introduced by DeepMind in Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders (Rajamanoharan et al). Instead of using a traditional ReLU activation function, these SAEs use JumpReLU, which zeros out activations that fall below a learned per-feature threshold
249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! 
important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } . 
This gives JumpReLU SAEs a variable number of active features per token (commonly written as $L_0$, the count of nonzero activations), unlike TopK SAEs, which fire exactly $k$ features per token.

I use the terms "fire" and "fired" to describe features with non-zero activations.

Traditional JumpReLU SAE implementations compute the decoder step as a dense matrix multiplication (feature_acts @ W_dec), but this is wasteful because of the sparsity of feature_acts. Instead, you can exploit this sparsity property and skip the zero entries entirely during matrix multiplication with a custom Triton kernel.

Intuition: Sparsity Should Be Free

When a single token passes through a JumpReLU SAE with 65,536 features, the encoder produces a feature activation vector of length 65,536, but only some entries are nonzero.

To be more concrete, consider a toy SAE with feature activations $f \in \mathbb{R}^8$. Now suppose that we only have 2 active features, where $W_{dec}$ represents the weight matrix of the decoder layer:

We then compute the output with:

Notice how only two of the rows of the decoder matrix were actually used in the computation. The rest were multiplied by 0 and contributed nothing to the output. We could instead just compute:

Now imagine this same example but increase the hidden dimension from 8 to something much greater, for instance 65,536 features with 72 of them active. That would mean you're multiplying ~99.89% of rows by zero.
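
As a quick check of that arithmetic, here is a small sketch. It assumes the larger example uses 65,536 features (the dictionary width mentioned earlier); the helper name zero_row_fraction is made up for illustration.

```python
def zero_row_fraction(n_features: int, n_active: int) -> float:
    """Fraction of decoder rows a dense matmul multiplies by zero for one token."""
    return 1.0 - n_active / n_features

# Toy case from the text: 8 features, 2 of them active
print(f"{zero_row_fraction(8, 2):.2%}")        # 75.00%

# Larger case (assumed width): 65,536 features, 72 of them active
print(f"{zero_row_fraction(65_536, 72):.2%}")  # 99.89%
```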

If we knew in advance which features are nonzero and their corresponding values, we could skip these zero multiplications and simplify the computation.

For a single token, this can be divided into two parts:

First, we find the nonzero entries of the hidden/sparse token representation and the corresponding indices of those entries.
We then use those indices and values to directly look up and scale the corresponding rows of $W_{dec}$, then sum the results.
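
The two steps can be sketched in plain Python (lists instead of tensors, so it runs without a GPU). The activation values and the small decoder matrix below are hypothetical, chosen only so the arithmetic is easy to follow, and the function names are mine rather than part of the kernels in this post.

```python
def dense_decode(f, W_dec):
    """Reference: out[d] = sum over every feature j of f[j] * W_dec[j][d]."""
    d_model = len(W_dec[0])
    return [sum(f[j] * W_dec[j][d] for j in range(len(f))) for d in range(d_model)]

def sparse_decode_one_token(f, W_dec):
    # Step 1: find the nonzero entries and remember their indices
    active = [(j, v) for j, v in enumerate(f) if v != 0.0]

    # Step 2: look up only those decoder rows, scale them, and sum
    d_model = len(W_dec[0])
    out = [0.0] * d_model
    for j, v in active:
        for d in range(d_model):
            out[d] += v * W_dec[j][d]
    return out

# Hypothetical toy input: 8 features, 2 active, d_model = 3
f = [0.0, 0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0]
W_dec = [[float(j), float(j) + 0.5, 1.0] for j in range(8)]

print(sparse_decode_one_token(f, W_dec))  # [13.0, 14.75, 3.5]
print(dense_decode(f, W_dec))             # [13.0, 14.75, 3.5]
```

Both functions return the same vector, but the sparse version touches 2 decoder rows instead of 8.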

Implementation Overview

When implementing this kernel, my first thought was to begin with a preliminary step that figures out exactly how many features fired for each token so the system could then allocate exactly the memory needed to store the CSR representation (more on that later). However, this process involves a GPU->CPU sync, which causes some slowdown.

As an alternative, you can allocate a fixed amount of memory for each token using a configurable max_l0 parameter. This avoids the sync and speeds up computation, but it overallocates memory and comes with an important caveat: max_l0 must be at least as large as the highest per-token L0 in the batch. For example, if you set max_l0=10 but one of the tokens in the batch has more than 10 nonzero features, the extra features will be dropped, resulting in information loss.

Both approaches are covered below. For convenience, let's refer to the kernel that allocates exactly the memory needed for the CSR representation as the Exact Allocation kernel and the kernel that allocates a predetermined amount of memory per token as the Fixed Allocation kernel. The Fixed Allocation kernel can also be configured with either validate=True or validate=False. The validate=True version is slightly slower than validate=False, but it raises an error if any token fires more features than max_l0. This is clearly safer, but if you are 100% sure that no token will exceed max_l0, then you can use validate=False for some speedup.

Exact Allocation Kernel

To skip zero entries during matrix multiplication, we need to first represent the feature activations in Compressed Sparse Row (CSR) format, which is a standard way of representing sparse matrices that stores only the nonzero values and their indices. For the example above, instead of storing all 8 entries of $f$, CSR stores just:

To allocate enough memory for building a CSR representation, we need to know how much memory each token requires (how many features fired per token). A count_nonzero kernel handles this:

import torch
import triton
import triton.language as tl

@triton.jit
def count_nonzero(feature_acts_ptr, counts_ptr, n_features, BLOCK_F: tl.constexpr):
    pid_token = tl.program_id(0)  # Which token am I working on? (row index)
    pid_d = tl.program_id(1)  # Which chunk of features am I working on? (column index)

    # Compute the feature indices this block is responsible for
    feat_offsets = pid_d * BLOCK_F + tl.arange(0, BLOCK_F)
    mask = feat_offsets < n_features  # Guard against reading past the end of the feature dimension

    # Navigate to this token's features in memory, then to this block's chunk
    feat_ptrs = feature_acts_ptr + pid_token * n_features + feat_offsets
    vals = tl.load(feat_ptrs, mask=mask, other=0.0)  # Load the feature values

    fired = vals != 0.0  # Which features in this chunk are active (nonzero)?
    fired_count = tl.sum(fired.to(tl.int32))  # How many active (nonzero) features in this chunk?

    # Accumulate into this token's count (atomic since multiple blocks write to the same token)
    tl.atomic_add(counts_ptr + pid_token, fired_count)

If you're unfamiliar with Triton, the key mental model is that rather than writing a loop that runs sequentially, you write a kernel that describes what one block does and Triton launches many of these blocks in parallel across the GPU. In this kernel, each block is responsible for a chunk of one token's features. The two program_id calls tell each block where it is: pid_token identifies which token (which row of the input matrix), and pid_d identifies which chunk of that token's features to process.

Also note that pointers in GPU kernels point to the start of a flat block of memory. To reach a particular token's features, we offset into that memory by pid_token * n_features. Within that token, we offset further by pid_d * BLOCK_F to reach the right chunk. The mask guards against reading past the end when n_features isn't a clean multiple of BLOCK_F.

Finally, since multiple blocks may be counting features for the same token simultaneously, tl.atomic_add ensures their partial counts are combined safely.
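
To make the block structure concrete, here is a sequential plain-Python walk over the same (token, chunk) grid. It is only a sketch for building intuition: the function name count_nonzero_reference is hypothetical, the two loops stand in for blocks that the GPU would run in parallel, and a plain += stands in for tl.atomic_add.

```python
def count_nonzero_reference(feature_acts, block_f):
    """Sequential stand-in for the (token, chunk) grid of count_nonzero."""
    n_tokens = len(feature_acts)
    n_features = len(feature_acts[0])
    n_chunks = -(-n_features // block_f)  # ceiling division, like triton.cdiv

    counts = [0] * n_tokens
    for pid_token in range(n_tokens):      # grid axis 0: one row per token
        for pid_d in range(n_chunks):      # grid axis 1: one chunk of features
            feat_offsets = [pid_d * block_f + k for k in range(block_f)]
            # The `j < n_features` test plays the role of the mask:
            # lanes past the end of the row contribute nothing.
            fired_count = sum(
                1
                for j in feat_offsets
                if j < n_features and feature_acts[pid_token][j] != 0.0
            )
            counts[pid_token] += fired_count  # stands in for tl.atomic_add
    return counts

# 7 features with block_f=4, so the second chunk is only partly in range
acts = [
    [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 3.0],
    [0.0] * 7,
    [1.0] * 7,
]
print(count_nonzero_reference(acts, block_f=4))  # [3, 0, 7]
```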

This count_nonzero kernel produces an array counts of length $B$, where $B$ is the number of tokens in the batch. The number of active (nonzero) features for the $i$-th token is stored in counts[i].

We can then use this information to allocate two flat arrays, flat_idx and flat_val, which hold the active feature indices and their values across the entire batch. For example, this might look like:

You may have noticed that it's not clear which entries belong to which token. For example, flat_idx[2] gives the index of a feature that fired, but it doesn't tell us whether that was for the first token in the batch, the second, the third, and so on.

We can solve this problem by introducing a new array row_offsets of length $B + 1$, where row_offsets[b] stores the starting index in flat_idx/flat_val where token $b$'s entries begin. It's computed by taking a cumulative sum of counts, so each token's region starts exactly where the previous one ends. For example, if three tokens have 2, 5, and 3 active features:

$$\text{counts} = [2, 5, 3] \;\rightarrow\; \text{row\_offsets} = [0, 2, 7, 10]$$

Now token 0's entries live at indices 0–1, token 1's at 2–6, token 2's at 7–9, and the final entry (10) tells us the total number of nonzero features across all tokens in the batch.
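
The same bookkeeping in a few lines of plain Python, using itertools.accumulate as the cumulative sum. The helper names build_row_offsets and token_slice are made up for this sketch.

```python
from itertools import accumulate

def build_row_offsets(counts):
    """Prepend 0 to the running total of counts (length B + 1)."""
    return [0] + list(accumulate(counts))

def token_slice(row_offsets, b):
    """Positions in flat_idx/flat_val that belong to token b."""
    return range(row_offsets[b], row_offsets[b + 1])

row_offsets = build_row_offsets([2, 5, 3])
print(row_offsets)                        # [0, 2, 7, 10]
print(list(token_slice(row_offsets, 1)))  # [2, 3, 4, 5, 6]
```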

We can construct row_offsets inside a wrapper function build_csr that also handles memory and orchestration. It calls compute_csr_kernel, which is the kernel responsible for actually filling flat_idx and flat_val with the correct values. Note that flat_idx and flat_val are created with torch.empty: they are uninitialized, pre-allocated storage that compute_csr_kernel writes into.

def build_csr(feature_acts: torch.Tensor, BLOCK_F: int = 1024):
    B, n_features = feature_acts.shape
    device = feature_acts.device

    # Count how many features fired per token
    counts = torch.zeros(B, dtype=torch.int32, device=device)
    grid = (B, triton.cdiv(n_features, BLOCK_F))
    count_nonzero[grid](feature_acts, counts, n_features, BLOCK_F=BLOCK_F)

    # Cumsum over counts gives each token a contiguous region in the flat arrays
    # row_offsets[b] = start index of token b's entries in flat_idx/flat_val
    row_offsets = torch.zeros(B + 1, dtype=torch.int32, device=device)
    row_offsets[1:] = counts.cumsum(0).to(torch.int32)

    # The last entry is the total number of nonzeros. This is used to size the flat arrays
    total_nnz = int(row_offsets[-1].item())  # GPU->CPU sync point

    flat_idx = torch.empty(total_nnz, dtype=torch.int32, device=device)
    flat_val = torch.empty(total_nnz, dtype=feature_acts.dtype, device=device)

    # write_pos is a per-token cursor that coordinates concurrent writes within
    # a token's region. Each block atomically claims the next available slots by
    # bumping write_pos by its count, getting back its starting offset (base).
    write_pos = torch.zeros(B, dtype=torch.int32, device=device)

    compute_csr_kernel[grid](
        feature_acts,
        row_offsets,
        write_pos,
        flat_idx,
        flat_val,
        n_features,
        BLOCK_F=BLOCK_F,
    )

    return flat_idx, flat_val, row_offsets, B

@triton.jit
def compute_csr_kernel(
    feature_acts_ptr,
    row_offsets_ptr,
    write_pos_ptr,
    flat_idx_ptr,
    flat_val_ptr,
    n_features,
    BLOCK_F: tl.constexpr,
):
    pid_token = tl.program_id(0)
    pid_d = tl.program_id(1)

    # Same pointer arithmetic as count_nonzero, navigate to this block's chunk
    feat_offsets = pid_d * BLOCK_F + tl.arange(0, BLOCK_F)
    mask = feat_offsets < n_features
    feat_ptrs = feature_acts_ptr + pid_token * n_features + feat_offsets
    vals = tl.load(feat_ptrs, mask=mask, other=0.0)

    fired = vals != 0.0
    fired_int = fired.to(tl.int32)

    # Where does this token's region start in flat_idx/flat_val?
    region_start = tl.load(row_offsets_ptr + pid_token)

    # Atomically claim the next block_count slots within this token's region
    block_count = tl.sum(fired_int)
    base = tl.atomic_add(write_pos_ptr + pid_token, block_count)

    # Assign each active feature a unique slot within the claimed range
    local_rank = tl.cumsum(fired_int) - fired_int
    slots = region_start + base + local_rank

    # Write the feature index and value into the claimed slots
    tl.store(flat_idx_ptr + slots, feat_offsets.to(tl.int32), mask=fired & mask)
    tl.store(flat_val_ptr + slots, vals, mask=fired & mask)
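
The slot-claiming logic is the subtle part, so here is a plain-Python sketch of it. The name build_csr_reference is hypothetical. To imitate the fact that blocks can finish in any order on the GPU, this version deliberately visits each token's chunks in reverse; reading and then bumping write_pos stands in for tl.atomic_add, which returns the old value.

```python
from itertools import accumulate

def build_csr_reference(feature_acts, block_f):
    n_tokens, n_features = len(feature_acts), len(feature_acts[0])
    counts = [sum(1 for v in row if v != 0.0) for row in feature_acts]
    row_offsets = [0] + list(accumulate(counts))

    flat_idx = [None] * row_offsets[-1]
    flat_val = [None] * row_offsets[-1]
    write_pos = [0] * n_tokens  # per-token cursor
    n_chunks = -(-n_features // block_f)

    for pid_token in range(n_tokens):
        for pid_d in reversed(range(n_chunks)):  # arbitrary block order
            lo = pid_d * block_f
            chunk = feature_acts[pid_token][lo:lo + block_f]  # slicing acts as the mask
            fired_int = [1 if v != 0.0 else 0 for v in chunk]

            # Claim block_count consecutive slots; base is the old cursor value
            block_count = sum(fired_int)
            base = write_pos[pid_token]
            write_pos[pid_token] += block_count

            # Exclusive prefix sum: rank of each fired lane within this block
            inclusive = list(accumulate(fired_int))
            local_rank = [c - x for c, x in zip(inclusive, fired_int)]

            for k, v in enumerate(chunk):
                if fired_int[k]:
                    slot = row_offsets[pid_token] + base + local_rank[k]
                    flat_idx[slot] = lo + k
                    flat_val[slot] = v
    return flat_idx, flat_val, row_offsets

acts = [
    [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 3.0],
    [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
flat_idx, flat_val, row_offsets = build_csr_reference(acts, block_f=4)
print(flat_idx)     # [4, 6, 1, 0]
print(flat_val)     # [2.0, 3.0, 1.0, 4.0]
print(row_offsets)  # [0, 3, 4]
```

Note that the entries inside a token's region are not sorted by feature index (4, 6, 1 for the first token here). That is fine for decoding, since the weighted sum of decoder rows does not depend on the order, although floating-point rounding can differ slightly between orders.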

Next, sparse_decode_kernel uses this CSR structure to carry out the matrix multiplication step. For each token, it looks up where that token's active features live in flat_idx/flat_val using row_offsets, then loops over them, accumulating the weighted sum of the corresponding decoder rows into a tile of the output.

@triton.jit
def sparse_decode_kernel(
    flat_idx_ptr, flat_val_ptr, row_offsets_ptr,
    W_dec_ptr, out_ptr, d_model,
    BLOCK_D: tl.constexpr,
):
    pid_token = tl.program_id(0)
    pid_d = tl.program_id(1)

    # Find the slice of flat_idx/flat_val belonging to this token
    start = tl.load(row_offsets_ptr + pid_token)
    end = tl.load(row_offsets_ptr + pid_token + 1)
    n = end - start  # Number of active features for this token

    # This block owns a BLOCK_D-wide slice of the output row
    offsets = pid_d * BLOCK_D + tl.arange(0, BLOCK_D)
    mask = offsets < d_model
    acc = tl.zeros([BLOCK_D], dtype=tl.float32)

    # Loop over this token's active features, accumulating their contribution
    for i in range(n):
        j = start + i
        feat_idx = tl.load(flat_idx_ptr + j)  # Which decoder row?
        feat_val = tl.load(flat_val_ptr + j)  # Scale factor

        # Load the corresponding decoder row (just this block's slice)
        row_ptrs = W_dec_ptr + (feat_idx * d_model) + offsets
        row = tl.load(row_ptrs, mask=mask, other=0.0)
        acc += feat_val.to(tl.float32) * row.to(tl.float32)

    # Write this block's output slice
    tl.store(out_ptr + pid_token * d_model + offsets, acc, mask=mask)
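
For comparison, this is roughly what the decode step computes, written as a plain-Python reference. The function name sparse_decode_reference and the small inputs are hypothetical; the real kernel additionally splits each output row into BLOCK_D-wide tiles that run in parallel.

```python
def sparse_decode_reference(flat_idx, flat_val, row_offsets, W_dec):
    n_tokens = len(row_offsets) - 1
    d_model = len(W_dec[0])
    out = [[0.0] * d_model for _ in range(n_tokens)]

    for b in range(n_tokens):
        # Only this token's slice of the flat arrays is visited
        for j in range(row_offsets[b], row_offsets[b + 1]):
            row = W_dec[flat_idx[j]]  # decoder row for the active feature
            for d in range(d_model):
                out[b][d] += flat_val[j] * row[d]
    return out

# Hypothetical CSR input: token 0 has 3 active features, token 1 has 1
flat_idx = [4, 6, 1, 0]
flat_val = [2.0, 3.0, 1.0, 4.0]
row_offsets = [0, 3, 4]
W_dec = [[float(j), 1.0] for j in range(7)]  # shape (7, 2)

print(sparse_decode_reference(flat_idx, flat_val, row_offsets, W_dec))
# [[27.0, 6.0], [0.0, 4.0]]
```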

Finally, we put all of these kernels together by wrapping them in a single sparse_decode() function that acts as a drop-in replacement for feature_acts @ W_dec:

def _sparse_decode(flat_idx, flat_val, row_offsets, W_dec, B, BLOCK_D=256):
    d_model = W_dec.shape[1]
    out = torch.zeros((B, d_model), device=W_dec.device, dtype=torch.float32)

    # parallelize over batch rows and d_model tiles
    grid = (B, triton.cdiv(d_model, BLOCK_D))

    sparse_decode_kernel[grid](
        flat_idx, flat_val, row_offsets, W_dec, out, d_model, BLOCK_D=BLOCK_D
    )

    return out

def sparse_decode(feature_acts, W_dec):
    # The kernels compute addresses by hand as row * width + column,
    # so both inputs must be laid out contiguously in row-major order
    feature_acts = feature_acts.contiguous()
    W_dec = W_dec.contiguous()

    flat_idx, flat_val, row_offsets, B = build_csr(feature_acts)
    return _sparse_decode(flat_idx, flat_val, row_offsets, W_dec, B)

Fixed Allocation Kernel

Recall how in the Exact Allocation Kernel, inside build_csr we extracted the total number of nonzero entries across all tokens by retrieving the last entry of row_offsets:

total_nnz = int(row_offsets[-1].item())

When we call .item(), we are forcing the CPU to wait for the GPU to finish the counting pass before it can read total_nnz and allocate flat_idx/flat_val.

Normally the CPU queues up GPU work asynchronously and moves on without waiting, but .item() breaks that pipeline by requiring the CPU to stall until the GPU result is ready.

This turns out to be a significant source of slowdown.

The Fixed Allocation kernel works around this by not allocating exactly the memory needed in the first place (so total_nnz is never required). Instead, we allocate max_l0 slots per token, where max_l0 is a user-specified upper bound on how many features can fire for any single token. This also means we no longer need a separate pass to count the nonzero features per token before computing the CSR structure.

With these changes, the new build_csr wrapper function looks like:

def build_csr(feature_acts: torch.Tensor, BLOCK_F: int = 1024, max_l0: int = 512, validate: bool = True):
    B, n_features = feature_acts.shape
    device = feature_acts.device

    # Fixed memory allocation
    capacity = B * max_l0
    flat_idx = torch.empty(capacity, dtype=torch.int32, device=device)
    flat_val = torch.empty(capacity, dtype=feature_acts.dtype, device=device)

    # write_pos serves as both the per-token write cursor during the kernel
    # and the per-token count afterward
    write_pos = torch.zeros(B, dtype=torch.int32, device=device)

    grid = (B, triton.cdiv(n_features, BLOCK_F))
    compute_csr_kernel[grid](
        feature_acts, write_pos, flat_idx, flat_val,
        n_features, max_l0, BLOCK_F=BLOCK_F,
    )

    counts = write_pos  # final cursor value = number of features written per token

    # Optional safety check. This reintroduces a GPU→CPU sync but catches silent truncation
    if validate and counts.max().item() > max_l0:
        raise ValueError(
            f"A token fired more than max_l0={max_l0} features "
            f"(max was {counts.max().item()}). Increase max_l0."
        )

    return flat_idx, flat_val, counts, B, max_l0

As mentioned briefly earlier, if a token fires more features than max_l0, those extra features are silently dropped by the overflow guard in the kernel. This can be dangerous because the result is wrong but there's no crash. The validate=True default catches this by checking counts.max() after the kernel, at the cost of reintroducing a GPU→CPU sync. (However this is still faster than Exact Allocation in practice.) If you're very confident that your max_l0 is a safe upper bound for your SAE then you can pass validate=False to skip the check, but this is not recommended.
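
Here is a plain-Python sketch of the fixed-slot layout and of what validate guards against. The function name build_fixed_reference and the inputs are hypothetical, and the loop is sequential, so the features that get dropped here are simply the last ones in index order; on the GPU, which features lose the race depends on the order in which blocks claim slots.

```python
def build_fixed_reference(feature_acts, max_l0, validate=True):
    n_tokens = len(feature_acts)
    flat_idx = [None] * (n_tokens * max_l0)  # max_l0 slots per token
    flat_val = [None] * (n_tokens * max_l0)
    counts = [0] * n_tokens                  # doubles as the write cursor

    for b, row in enumerate(feature_acts):
        for j, v in enumerate(row):
            if v == 0.0:
                continue
            local_slot = counts[b]
            counts[b] += 1                   # cursor advances even when the region is full
            if local_slot < max_l0:          # overflow guard: extra features are dropped
                flat_idx[b * max_l0 + local_slot] = j
                flat_val[b * max_l0 + local_slot] = v

    if validate and max(counts) > max_l0:
        raise ValueError(
            f"A token fired more than max_l0={max_l0} features "
            f"(max was {max(counts)}). Increase max_l0."
        )
    return flat_idx, flat_val, counts

acts = [
    [1.0, 0.0, 2.0, 3.0],  # 3 active features
    [0.0, 0.0, 0.0, 5.0],  # 1 active feature
]

# validate=False: the third feature of token 0 is silently lost
flat_idx, flat_val, counts = build_fixed_reference(acts, max_l0=2, validate=False)
print(flat_idx)  # [0, 2, 3, None]
print(counts)    # [3, 1]
```

Two things are worth noticing. With validate=True the same call raises ValueError instead of returning a truncated result. And after an overflow, counts holds the number of features that fired (3), not the number that were stored (2), so anything that consumes counts should clamp it to max_l0 before indexing into a token's region.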

The kernel to compute CSR changes minimally. We no longer need row_offsets since we know that each token takes up max_l0 entries in memory, so the lookup for the start of a token's region is replaced by region_start = pid_token * max_l0.

@triton.jit
def compute_csr_kernel(
    feature_acts_ptr,
    write_pos_ptr,
    flat_idx_ptr,
    flat_val_ptr,
    n_features,
    max_l0,
    BLOCK_F: tl.constexpr,
):
    pid_token = tl.program_id(0)
    pid_d = tl.program_id(1)

    # Navigate to this block's chunk
    feat_offsets = pid_d * BLOCK_F + tl.arange(0, BLOCK_F)
    mask = feat_offsets < n_features
    feat_ptrs = feature_acts_ptr + pid_token * n_features + feat_offsets
    vals = tl.load(feat_ptrs, mask=mask, other=0.0)

    fired = vals != 0.0
    fired_int = fired.to(tl.int32)

    # Each token owns a fixed region of max_l0 slots
    region_start = pid_token * max_l0

    # Atomically claim the next available slots within this token's region
    block_count = tl.sum(fired_int)
    base = tl.atomic_add(write_pos_ptr + pid_token, block_count)

    # Assign each active feature a unique slot within the claimed range
    local_rank = tl.cumsum(fired_int) - fired_int
    local_slot = base + local_rank

    # Guard against writing past this token's region if L0 exceeds max_l0
    in_region = local_slot < max_l0
    write_mask = fired & mask & in_region

    slots = region_start + local_slot
    tl.store(flat_idx_ptr + slots, feat_offsets.to(tl.int32), mask=write_mask)
    tl.store(flat_val_ptr + slots, vals, mask=write_mask)

The decoder kernel then changes in the same way. row_offsets is no longer needed, and counts replaces the start/end bracket:

@triton.jit
def sparse_decode_kernel(
    flat_idx_ptr,
    flat_val_ptr,
    counts_ptr,
    W_dec_ptr,
    out_ptr,
    d_model,
    max_l0,
    BLOCK_D: tl.constexpr,
):
    pid_token = tl.program_id(0)
    pid_d = tl.program_id(1)

    start = pid_token * max_l0
    n = tl.load(counts_ptr + pid_token)  # Number of features that fired for this token
    # Clamp to the region size: if a token overflowed, counts can exceed max_l0,
    # and looping past max_l0 would read the next token's slots (or out of bounds)
    n = tl.minimum(n, max_l0)

    offsets = pid_d * BLOCK_D + tl.arange(0, BLOCK_D)
    mask = offsets < d_model
    acc = tl.zeros([BLOCK_D], dtype=tl.float32)

    # Same loop as before
    for i in range(n):
        j = start + i
        feat_idx = tl.load(flat_idx_ptr + j)
        feat_val = tl.load(flat_val_ptr + j)
        row_ptrs = W_dec_ptr + feat_idx * d_model + offsets
        row = tl.load(row_ptrs, mask=mask, other=0.0)
        acc += feat_val.to(tl.float32) * row.to(tl.float32)

    tl.store(out_ptr + pid_token * d_model + offsets, acc, mask=mask)

Benchmarks

Writing custom GPU kernels is great, but it's important to make sure that they're actually making the computation faster. I used triton.testing.do_bench (warmup=25, rep=100, reporting the median) to time these kernels and compared them against dense matrix multiplication (feature_acts @ W_dec). All tests were run on an NVIDIA GeForce RTX 4090 GPU.

As a quick summary, the table below shows the relative speedups for an example input configuration (B = 32, n_features = 65536, d_model = 768, L0 = 64):

| Method | Full matmul pipeline (ms) | Speedup vs dense |
| --- | --- | --- |
| Dense cuBLAS | 0.288 | 1.0× |
| torch.compile | 0.288 | 1.0× |
| torch.sparse.mm + .to_sparse_csr() | 0.210 | 1.4× |
| Custom, exact allocation | 0.151 | 1.9× |
| Custom, fixed allocation (validate=False) | 0.041 | 7.0× |
| Custom, fixed allocation (validate=True) | 0.115 | 2.5× |

Correctness

First, I verified that the custom kernels actually perform matrix multiplication correctly (a custom kernel that is faster but gives the wrong answer doesn't help anyone). Concretely, I checked that sparse_decode(feature_acts, W_dec) matches feature_acts @ W_dec, within the tolerances given below, across 486 different inputs using combinations of the parameters below. Note that sparse_decode() here is just a wrapper matmul function that uses our custom Triton kernels under the hood.

| Axis | Meaning | Values tested | Count |
| --- | --- | --- | --- |
| version | kernel implementation | exact, fixed | 2 |
| dtype | input dtype of feature_acts / W_dec | float32, float16, bfloat16 | 3 |
| B | batch size (tokens) | 1, 4, 32 | 3 |
| n_features | SAE dictionary width | 256, 1024, 16384 | 3 |
| d_model | output width | 128, 512, 768 | 3 |
| L0 | features fired per token | 1, 8, 100 | 3 |

Total: 2 × 3 × 3 × 3 × 3 × 3 = 486 configurations. Each asserts output is fp32 and matches the dense fp32 reference within atol=1e-4, rtol=1e-3.
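
A sketch of how such a sweep and tolerance check can be set up. The axis labels "B" and "L0" are my shorthand, and the helper is_close is hypothetical: it spells out the usual combined tolerance rule, |actual - expected| <= atol + rtol * |expected|, which is the rule torch.allclose applies elementwise. I am assuming a check of this style rather than reproducing the actual test code.

```python
from itertools import product

axes = {
    "version": ["exact", "fixed"],
    "dtype": ["float32", "float16", "bfloat16"],
    "B": [1, 4, 32],
    "n_features": [256, 1024, 16384],
    "d_model": [128, 512, 768],
    "L0": [1, 8, 100],
}

# Every combination of one value per axis
configs = [dict(zip(axes, combo)) for combo in product(*axes.values())]
print(len(configs))  # 486

def is_close(actual, expected, atol=1e-4, rtol=1e-3):
    """Combined absolute/relative tolerance test for a single element."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

print(is_close(1.0005, 1.0))  # True: 0.0005 is within 1e-4 + 1e-3 * 1.0
print(is_close(1.002, 1.0))   # False: 0.002 exceeds 0.0011
```

The absolute term matters for outputs near zero, where a purely relative tolerance would be too strict.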

Decoder Kernel Speed (CSR Excluded)

The preprocessing step of computing a CSR representation adds some computational overhead. It would be interesting to see a direct comparison between sparse_decode_kernel and dense matrix multiplication if you didn't have to pay for that overhead (assume that you somehow already have access to a CSR representation).

If you hold some parameters of the input constant (B=32, n_features=65536, d_model=768) while varying L0 (the number of fired features) as shown in the table below, then how much faster is sparse_decode_kernel?

Note that this is EXCLUDING the overhead of the CSR preprocessing step (i.e., compute_csr_kernel). Also note that sparse_decode_kernel is essentially the same between Exact Allocation and Fixed Allocation so there is no need to differentiate, but for completeness the graph below plots both (they overlap).

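| L0 | Sparsity (L0 / n_features) | Kernel speedup vs dense |
| --- | --- | --- |
| 16 | 0.02% | 25.5× |
| 32 | 0.05% | 18.7× |
| 64 | 0.10% | 12.8× |
| 128 | 0.20% | 8.0× |
| 256 | 0.39% | 5.0× |
| 512 | 0.78% | 3.0× |
| 1024 | 1.56% | 1.7× |
| 4096 | 6.25% | 0.6× |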

We can also vary n_features while keeping constant B=32, L0=64, d_model=768:

| n_features | Kernel speedup vs dense |
| --- | --- |
| 4,096 | 1.5× |
| 16,384 | 4.1× |
| 32,768 | 7.3× |
| 65,536 | 12.8× |
| 131,072 | 22.5× |

Full Pipeline Speed

So clearly sparse_decode_kernel alone is faster than dense matrix multiplication at high sparsity. But of course in practice we probably need to compute CSR as well, which will slow things down somewhat.

The table below shows the relative speedups (relative to dense matmul) for three different input configurations. Here "Kernel only" refers to only sparse_decode_kernel (CSR is precomputed), while "Full" refers to the whole pipeline (i.e., build_csr followed by sparse_decode_kernel).

| Configuration | Kernel only | Full (exact) | Full (fixed, no val.) | Full (fixed, val.) |
| --- | --- | --- | --- | --- |
| B=32, F=65536, D=768, L0=64 | 12.8× | 1.9× | 7.0× | 2.5× |
| B=256, F=65536, D=768, L0=64 | 7.7× | 1.7× | 3.1× | 2.2× |
| B=32, F=131072, D=512, L0=128 | 22.5× | 2.2× | 6.1× | 2.3× |

The graph below shows the speed of the full pipeline (Exact Allocation) and decode-only as you vary sparsity. Here, L0 sweeps over [16, 32, 64, 128, 256, 512, 1024, 4096, 16384] while holding B=32, n_features=65536, and d_model=768 constant.

Additional Baselines

To be comprehensive, we can also compare our custom kernels to torch.sparse.mm (using PyTorch's to_sparse_csr()), which uses cuSPARSE internally, and torch.compile. This focuses on the same three input configurations as above.

Note: I found it a little suspicious that this custom kernel would "beat" torch.sparse.mm. It turns out the gap comes mostly from building the CSR representation, where the custom kernel is faster than to_sparse_csr(). There doesn't seem to be much of a difference in speed between the custom kernel and cuSPARSE on the matrix multiplication step alone.

As expected, torch.compile doesn't provide a noticeable speedup, but I wanted to include it anyway for completeness.

End-to-End on Real SAEs

Up until now we have been focusing entirely on the speed of the matrix multiplication operation, but at the end of the day we care about SAE inference speed as a whole. This is benchmarked by replacing only the decoder matmul step in a SAELens JumpReLU SAE forward pass. The table below focuses on five SAEs across two model families and three dictionary sizes.

| SAE | n_features | d_model | L0 | Max diff | Exact | Fixed (val.) | Fixed (no val.) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma Scope 2B, L20, 65k | 65,536 | 2,304 | | 3.8e-6 | 4.27× | 5.57× | 11.41× |
| Gemma Scope 9B, L20, 65k | 65,536 | 3,584 | | 3.8e-6 | 5.66× | 7.34× | 13.27× |
| Gemma Scope 2B, L12, 65k | 65,536 | 2,304 | | 9.5e-7 | 3.91× | 5.48× | 11.33× |
| Gemma Scope 2B, L12, 262k | 262,144 | 2,304 | 100 | 1.9e-6 | 12.08× | 14.49× | 22.59× |
| Qwen Scope 3.5 2B, L12 | 32,768 | 2,048 | 100 | 4.8e-7 | 1.98× | 2.54× | 5.74× |

Memory Overhead

The purpose of the Fixed Allocation kernel was to overallocate memory in exchange for speed, so it would be helpful to see exactly how much more memory it uses compared to the Exact Allocation kernel. Surprisingly, it turns out that in practice this overhead is small:

|  | max_l0 | Dense (MB) | Exact (MB) | Fixed (MB) | Overhead vs exact |
|---|---|---|---|---|---|
|  | 512 | 218.3 | 218.4 | 218.5 | +0.1 MB |
| 256 | 512 | 277.7 | 277.9 | 278.8 | +0.9 MB |
| 1024 | 512 | 482.3 | 482.9 | 485.6 | +2.7 MB |
| 1024 |  | 482.3 | 482.9 | 490.7 | +7.8 MB |

Limitations

While these results are encouraging, there are a few important limitations to be aware of and gaps that I plan to address as I continue working on this project.

First, the above benchmark numbers are not absolute, as these tests were run in a specific environment (WSL2 with GPU clocks not pinned). The primary goal of these benchmarks was to gauge the relative performance of the custom kernels compared to baseline implementations. The actual absolute speed likely differs depending on the hardware and benchmarking setup.

A second limitation, which was discussed earlier but is worth reiterating, is that although the Fixed Allocation kernel with validate=False achieves the highest performance, it can silently produce incorrect results if the max_l0 parameter is set too low. For this reason using either the Exact Allocation kernel or Fixed Allocation with validate=True is likely better for most cases.
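The failure mode can be illustrated with a small pure-Python sketch of fixed-size allocation. This is not the author's kernel: `compact_fixed` is a hypothetical helper, and it assumes that entries beyond `max_l0` are simply dropped, which may differ from what the real implementation does on overflow.

```python
from typing import List, Tuple


def compact_fixed(
    acts: List[List[float]], max_l0: int, validate: bool = True
) -> Tuple[List[List[int]], List[List[float]]]:
    """Pack each row's nonzeros into max_l0 preallocated slots (hypothetical helper)."""
    idx = [[-1] * max_l0 for _ in acts]    # -1 marks an unused slot
    val = [[0.0] * max_l0 for _ in acts]
    for b, row in enumerate(acts):
        n_active = 0
        for j, a in enumerate(row):
            if a == 0.0:
                continue
            if n_active < max_l0:          # room left: store the entry
                idx[b][n_active] = j
                val[b][n_active] = a
            n_active += 1                  # keep counting even when full
        if validate and n_active > max_l0:
            raise ValueError(
                f"row {b} has {n_active} active features, but max_l0={max_l0}"
            )
    return idx, val


acts = [[0.0, 1.5, 0.0, 2.0, 0.5]]         # one row with three active features

# Without validation, the third active feature is silently dropped.
idx, val = compact_fixed(acts, max_l0=2, validate=False)

# With validation, the same input is rejected instead of truncated.
try:
    compact_fixed(acts, max_l0=2, validate=True)
    overflow_detected = False
except ValueError:
    overflow_detected = True
```

The validation pass costs an extra check per row, which is consistent with the validated variant being slower in the benchmarks above.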

Third, these kernels were designed specifically for sparse matrix multiplication, meaning that once the activations are no longer sparse enough (L0 above a certain threshold), dense matrix multiplication is actually faster.

Fourth, this implementation focuses exclusively on the decoder inference step of JumpReLU Sparse Autoencoders, but there are likely other sources of inefficiency that could be addressed. For example, future projects could focus on the encoder pass or support for training through custom backward kernels. Additionally, the current implementation only supports float32 outputs.

Finally, all experiments were run on an RTX 4090, and performance may differ on other GPU architectures such as the A100 or H100.

Conclusion + Link to Code

In conclusion, this project implements custom Triton kernels for the decoder inference step of JumpReLU SAEs by exploiting the inherent sparsity of the hidden representation. On a sample of real SAEs, this achieves a 2.5–14× speedup with the Fixed Allocation (validate=True) kernel, with larger gains at higher dictionary sizes.

The full implementation is available on GitHub.

I welcome feedback! If you have thoughts, questions, or find any issues, feel free to leave a comment or reach out directly. This is also my first GPU kernel project, so if you're experienced with Triton or GPU kernel optimization and see things I could have done better, I would appreciate any suggestions.


Impressions at the Extremity of Civilization

LessWrong.com News - June 14, 2026 - 05:33

Content note: this is part of a challenge of writing a blogpost per day for a week.

Epistemic status: this is a series of vignettes written as-though diary entries. While substantially grounded in specific and real experiences, the writing ended up being more impressionistic and inaccurate in places; I was more interested in the writing style so I didn't take the time to fix it. Importantly the chronology and especially some of the vaguer events are not real.

[Friday] Today I find myself walking with the groundskeeper, Hogan. He is an older gentleman, skin bronzed by years in the sun, fingers calloused by the carrying of stones and the digging in soil. He lives a slower life than the rest of us, the impact of his work felt over seasons rather than hours, and his conversation too carries at the slowest pace of any man or woman I have cause to speak with in life. He is knowledgeable about the plants that grow throughout our plot of land; he can quickly tell me which plants will grow back and which ones are lost causes. Like many of the plants, he himself is under-maintained, and I only tend to spend time giving input on his work if something has gone wrong. But each year before the Festival Season, I walk the grounds with him and we discuss what should be tended to, what should be cut back, which walkways want to be clearer, which weeds should be removed, and I give him double his salary for two weeks to hire a helper for the increased workload. Something of a Summer cleaning.

This last year a catastrophe struck, as he climbed up into one of our two Brugmansia, the one in the center of campus, to trim it. While it looks like a tree and can hold itself up with its bark, the Royal Horticultural Society describes it as a vigorous large shrub. Well, the bark did not survive his weight, and split in two, half the tree falling down on the ground dead, and the other half the worse for wear. It survives now in a diminished form, but I doubt it will last the year. He took some time to recover emotionally from the error, although I don't particularly hold it against him; the Brugmansia had always seemed and felt strong to me in the past.

The trim of the plants and foliage in the gardens and along the paths by the Bayes Cabins is quite picturesque, and I am impressed by how quickly it all came together.

[Saturday] Lighthaven rises at 8am today, due to an early sound check in the park. My sleeping quarters are high up in the center of the land, with many large windows, meaning that all manner of light and sound invade each day. One learns to have ready ear plugs, eye masks, and over-ear headphones playing white-noise if one wishes to sleep-in much later than that in the early morning.

The daily swarm arrives. These people are dressed more fancily than the usual crowds, both more formal and more flirty, the skinny women's suits showing their midriffs, the men's cashmere sweaters and functionless patterned hats color-coordinated. They are purportedly here to answer critical questions for humanity, but after spotting three or four known lunatics and sociopaths, I suspect their standards for thought are low, as long as one can find some way to get attention.

For the most part, however, the people can tell they are in a place worth caring for. They do not damage the furniture, they do not speak rudely to the staff. This is an oasis they are grateful to be able to drink from while making connections with people.

The place surrounding this plot of land is dirty, poor, graffitied, smelly, closed-down, and ramshackle. This plot of land was too when it first became Lighthaven, but the mold has been excised via burning ritual and the shabbiness replaced with newer woods, in spite of (and in order to spite) the local eyes of Moloch.

And the people who pass through are rejuvenated by this place that is much more alive than it has any right to be, given how the hands and eyes of Moloch have sucked the life from all around it.

[Sunday] Today I crossed the Bay Bridge. My friend is moving house, and must downsize their extraordinary library. Over the course of the day I select over a hundred books that they have subsequently donated to the Lighthaven Library. In return I buy Mexican food for our lunch.

[Monday] We have finally found a solution to two problems at once.

At the front of Lighthaven is a noisy main street, so the area just inside is typically vacant. We store excess furniture there. One such piece of furniture, a dark-wood open-walled wicker cube with a bed of cream cushions to lie in, has sat there a long time. It is rarely wanted in our outdoor lecture spaces or common spaces as it takes up too much space.

In the farthest corner of the gardens, there is a wooden deck built on the whim of Commander Lagerros. While an appealing hideaway, the tree that reaches over it is filled with inedible red berries that fall on any furniture below it, rendering it unsittable in hours and permanently spoiling the cushions in a week.

The maintenance and improvement of the Lighthaven campus is most commonly done in hidden moments, on the side of more important things, out of a deep desire for beautiful layout or by event organizers whose taste in spaces cannot be fully reined in. These two problems have persisted for many years, and yet only today did it occur to me that by moving the wicker cube into the farthest corner, and sealing the roof over it, we could make the beautiful hidden nook that I write to you from now.

The rest of the day is spent organizing the outdoor furniture to make the space more delightful, as part of what I have internally named "Project Delighthaven", for which I have an army of contractors by the names of Aldern, Gar, Hubbard, Demirian, Fox, Brodski, Crossman, Hogan, etc. etc.

[Tuesday] There have been two delivery crises in the course of preparing for the upcoming Festival Season.

The first was not so much of a crisis, but it did require us to find ways to move faster than is standard.

A key issue with our campus last Festival Season was the maximum capacity. Many spaces were overcrowded, and people reported a lot of noise and overwhelm.

There are two prongs to this: firstly, to have fewer people. I made attempts to increase prices more aggressively and eventually cap the total sales, but the former was ineffective and the latter failed (my poor communications around it led to an outcry, as people from foreign countries who had planned work vacations had not realized it was happening and could not buy tickets, leading me to open the sales once more).

The second is to have more space. How can one make more space without developing or extending the land? By changing the utilization. Two shared bedrooms have been sacrificed for session spaces, and the normally quiet Drethelin Gardens will have new seating added throughout. For each new space I took Aldern, Gar, Hubbard and Demirian; we cannibalized furniture from elsewhere to try different layouts, until we settled on what each nook wanted. By this process, we determined that we could accept the following:

After consulting many alternatives with Aldern, we ended up returning to the source of much of our existing furniture, Article.com, and placing an order. Knowing that we didn't have much time, I immediately phoned them to see how I could expedite delivery.

I learned that they would ship the furniture quickly to a warehouse in California, but the delivery from that warehouse could be as late as two weeks after Festival Season began. I phoned the warehouse and then back to Article, and negotiated getting my own truck to drive to the warehouse the next business day to do the pickup ourselves. That truck was turned away on day one (the warehouse claimed they hadn't been given enough notice) but on day two the furniture was loaded and brought to Lighthaven, where that same crew of 4 spent a day unboxing, affixing, and transplanting the furniture.

The second delivery crisis was when, on Thursday morning, the company producing name tags called Crossman to check if it was okay that they not be delivered early Friday, but instead Saturday afternoon. We explained that between five and seven hundred people would be arriving on Friday and so this was not okay.

I kept phoning back asking for updates from their production team about how we could expedite it. On my third call he said that they couldn't work faster, and I asked if that would change if I paid them $1,000 or $2,000. He quickly said that he'd have to ask his manager and hung up on me. I was concerned that this was him writing me off as a crazy person harassing him, but when I called back 20 minutes later he said that they would be able to get it done that night, and even made an offer to drive them up to Berkeley at no extra charge (which seemed bizarre to me and I declined). An Inkhaven Resident by the name of Prasad was already set to drive up from Los Angeles and lived a mere 12-minute drive from the warehouse, so he collected the bounty and brought them up on time.

[Wednesday] The people here have, amongst them, read every great work by the likes of Yudkowsky, Branwen, Alexander, Mowshowitz, Salamon, etc. etc., yet this does not protect them from the most unbounded of distortions and biases, nor does it ward off the liars, charlatans and frauds.

Today we got word that we lost another friend to the dark forces of corruption. Some who fall are not so surprising; they never seemed to be especially strong of mind and character. Others seem strong of both and it's a terrible loss to see them fall.

This is a land where people once came to save the world. Now many who pass through work to end it. The corrupting power of money and industry has shown the weak moral backbone of most who profess otherwise. The forces that corrupt a man get their tendrils in many ways.

[Thursday] The Fox has returned. The air in the grounds will grow cooler from this point onwards.

[Friday] I rise at dawn, and after working in bed for three hours, move and camp out in Feynman House for the next eight hours.

For most of our four years on this plot of land, we've worked outdoors from a covered deck in the center of campus, able to see everything: who is coming and going, the effects of the wind and the rain, the hummingbirds as they visit our flowers, the building materials that are carried to where construction is happening, and so on. Not only are we in contact with the goings on, but we are readily accessible to staffers and event organizers and attendees with questions.

But during the month of Inkhaven, we wanted to give this central node of campus to the residents, so we retreated to the Feynman house in the gardens.

Feynman isn't part of the same property, just a neighboring residential house owned by the same person, which we acquired as well. Whereas the main buildings feel more modern, the garden building does not. Back when we bought the building it was being used to grow marijuana (and possibly more), and the exterior had piles of trash, decaying furniture, and scribbled drawings of scary faces. Two seemingly-homeless men were often in and out of it, and one of my teammates followed them back to an abandoned school bus they appeared to be sleeping in.

The house's staircase is made of an ugly concrete, the exterior wall has disintegrating paint, the windows are covered in cobwebs, the kitchen has a loose, disconnected piece of kitchen counter off to the side, the doors do not fit flush with their frames, the lighting is dingy, the toilet is scratched and rusted, etc. etc.

We've since taken steps to improve the place. We've brought in some nicer furniture, and the weed-room (which was missing most of the wall and ceiling panels) has had the walls, ceiling and floor nicely refurbished, and the leaks in the ceiling fixed. The holes in the wall above the sink have been covered by cute wooden hangings. We've added a dentist's monitor arm with a bright Apple screen that can play music and be worked from. We've renovated the lawn into beautiful gardens that you can see through the five-foot square window, and the sunlight makes everything much more pleasant during the day. But still, the old building shines through our patches, and sometimes the dinginess is more present than the sunlight.

While I am camped here, various people on projects come to visit. Young, Miles, Bloom, Crossman, Ku, Jiang, etc. etc. With Young I speak of their status as a new immigrant, and their plans for hosting talks and debates during the Festival.

[Monday] My role in the Festival Season is behind me now. It went well enough, but I am putting together much grander plans for next year. It was good to catch up with many writers, including Newman, Matuschak, Nielsen, Ray, Chen, Troesh, Bjartur, Prasad, and more. This is also the first year I feel I have figured out how to bring a team around me that I trust to make the event better, and this fills me with hope for it being less taxing to run in the future.

But I have been burning out since April, so while I continue to live here right now I am not working (all the while I continue to see my teammates and contractors running around tending to the subsequent events). Soon I will gather my things and fly to the Netherlands to see my Father, then spend two weeks in England seeing my Mother and getting my visa updated by the US Consulate in London. I am yet to make plans for my time there.

I did not like England as a child, and these days treat it as a bad dream I have since awoken from; but after five years apart I am hopeful I will be able to see her in a new light.

Inspired by (and with one or two lines plagiarized from) "At the Extremity of Civilization"; A Meticulously Descriptive Diary of an Illinois Physician's Journey [to California] in 1849


Our Work is Low Skill Expression

LessWrong.com News - June 14, 2026 - 04:31

crosspost from substack

The amount of skill behind an outcome is often impossible to discern by scrutinizing the outcome itself, and our goal of making AI go well may be an extreme case. Shaping history is like poker: high variance, small edges, low skill expression. If that is right, two things follow: 1) we cannot trust how things seem to be going, good or bad; and 2) we should stop concentrating effort on whatever looks best right now, and instead should hold a diversified portfolio of bets, including those which seem suboptimal.

Beginning with poker, where the link between skill and results is actually measured: A strong professional’s edge over a merely competent player is something like five percent of the money wagered. The edge is small because most of the game’s skill sits in basic competence. A player who folds weak hands and bets strong ones has already captured most of what reckless play gives away. Expert skill is a thin refinement on top. The variance, meanwhile, is huge relative to the edge. Below roughly a hundred thousand hands, luck swamps skill, and the weaker player wins sessions all the time. So even in a game with a perfect scoring rule, it is difficult to discriminate between good players in all but the largest sample sizes. It is difficult to determine which habits and individual choices make money. It is hard to determine whether one’s approach needs to change, and in which direction.
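The sample-size claim can be checked with a rough normal-approximation sketch. The numbers below are illustrative assumptions, not figures from this post: the stronger player's edge over the weaker one is taken as 0.05 big blinds per hand, and the per-hand standard deviation as 10 big blinds. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

# Assumed, illustrative values (not taken from the post).
EDGE_PER_HAND = 0.05   # stronger player's average advantage, in big blinds per hand
SD_PER_HAND = 10.0     # standard deviation of a single hand's result, in big blinds


def hands_for_signal(edge: float, sd: float, z: float = 2.0) -> int:
    """Hands needed before the expected edge is z standard errors above zero.

    After n hands the expected lead is edge * n and its standard deviation is
    sd * sqrt(n), so the ratio reaches z when n = (z * sd / edge) ** 2.
    """
    return math.ceil((z * sd / edge) ** 2)


def prob_stronger_is_behind(edge: float, sd: float, n_hands: int) -> float:
    """Chance the stronger player is losing after n_hands (normal approximation)."""
    mean = edge * n_hands
    spread = sd * math.sqrt(n_hands)
    return NormalDist(mean, spread).cdf(0.0)


needed = hands_for_signal(EDGE_PER_HAND, SD_PER_HAND)            # roughly 160,000
p_short = prob_stronger_is_behind(EDGE_PER_HAND, SD_PER_HAND, 10_000)
p_long = prob_stronger_is_behind(EDGE_PER_HAND, SD_PER_HAND, 160_000)
```

Under these assumptions the stronger player is still behind about three times in ten after 10,000 hands, and the edge only becomes clearly visible on the order of a hundred thousand hands or more.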

Our goal is harder than poker in three ways. First, we get one hand. Poker's remedy for variance is volume, and history offers none. Humanity lives through the emergence of powerful AI once. Second, the variance is worse. Philip Tetlock spent decades scoring expert predictions about world events. Forecasting skill is real at short range and decays with distance; a few years out, even the best forecasters drift toward chance. Reality is often very surprising. Third, many of the challenges on the road to our goal are not technical, but political, and thereby often involve tradeoffs which are impossible to avoid and difficult to weigh. In the short term, it is unclear whether and how we will need to transition the workforce through an age of widespread automation. In the medium term, we will have to resolve how to distribute the rapidly growing economic pie. In the very long term, we may need to make fundamental advances in our methods of government in order to create stable widespread flourishing across deep space.

It may appear at first glance that there are technical solutions to these sorts of problems, but dig one layer deeper and there is often explosive disagreement stemming from competing loyalties and belief systems. Look no further than the failure of ‘teach truckers to code,’ ‘just tax the rich,’ and ‘the United Nations.’ Political problems which evade technical solutions are common features of reality. There is little hope that even superintelligence will help resolve them. For example, the Chinese Communist Party and the Vatican both claim the authority to appoint Catholic bishops in China. I cannot conceive how a superintelligence could design a mechanism to resolve this. It is a contest over legitimacy and obligation, not a problem with a solution waiting to be found.

Another historical case is the War in Vietnam. Two observations:

First, parachuting brilliant technical people into the government is overrated. It was tried in Vietnam by the closest analogues of today's top technical talent. Kennedy brought in Robert McNamara as Secretary of Defense, who in turn installed young quantitative analysts dubbed the ‘Whiz Kids.’ David Halberstam called the wider circle the best and the brightest. The record of their Vietnam deliberations shows careful, conflicted reasoning at every step, and yet the outcome was catastrophe.

Second, there is no reliable way to actually measure our progress on our goal. The Whiz Kids, laboring under uncertainty in every direction, managed what they could read: casualty statistics, sortie rates, kill ratios. But while the metrics improved, the war was lost. When reality refuses to show you the eval bar, whatever does show a score commands your attention. Benchmarks and evals are this field's candidates for that role, but they fail to reveal where we are in relation to our ultimate goals.

My first conclusion is that short-term signals are not helpful in long-term assessments. How things seem to be going is a poor measure of progress towards our goal. Furthermore, apparent success is weak evidence of skill, and apparent failure is weak evidence of its absence. The researcher who feels like a prophet and the one who feels like a fraud are both measuring themselves against limited, short-term metrics. In practice this means two hard things. You cannot tell when it is time to pivot, and you cannot tell what to pivot to.

The second conclusion is about the field as a whole. We need to avoid being like small children playing soccer; we can’t all be chasing the same ball. Whenever something surprising happens (a capability jump, a scary demo, a policy window), too many of us orient to the same spot, guided by the same shared model of the risk. That would be sound if anyone could verify the model, but we cannot.

The better posture is the early-stage investor's. A venture portfolio contains many bets which mostly fail, redeemed by a few that pay off enormously, and the winners usually looked strange at the start. If outcomes inevitably surprise, then the ideas which the consensus finds unpromising are systematically underpriced, and some work deserves funding precisely because it sounds wrong. Discipline is important too. Venture funds commit for a decade and do not redeploy on noisy interim marks, which is the right posture in a domain where interim signals mean little.

This may also be a case for more investment in researching strange new long-term forecasting techniques. Critically, it would be less useful to forecast AI capabilities; instead we should try to forecast where we are on the long, dark, branchy path to our goal. Very hard.

In summary, I suggest:
1) Do not latch on to how things are going, good or bad.
2) Invest in a broad portfolio of diversified bets.
3) Invest in new methods to figure out where we are on the path to making AI go well.

I encourage you to assess just how ‘in the dark’ you believe we are playing.


Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.

LessWrong.com News - June 13, 2026 - 23:54

Claude is a Constitutional AI model. In theory, this means that it operates from a set of principles as opposed to hard rule sets. This is achieved through a somewhat convoluted training method: a supervised self-critique/self-revision phase followed by a reinforcement learning phase in which AI-generated preference judgments are used as the reward signal, known as RLAIF (Reinforcement Learning from AI Feedback) (Anthropic, 2022, abstract). This is relevant and interesting because it gives curious users a lot of material to help infer why this model acts the way it does.

Now, it’s my opinion, based on what I know about transformers, that LLMs are not in any way conscious: they do not feel, and they do not experience internal states, even if they are proven to have such states. The “entity” you speak with in the chat box is off between prompts. With every new prompt, it turns on, places the chat into its context window, generates a response, then turns off. That is not a very good substrate for a conscious entity. They have memory, sort of, in the form of a text file about the user or project injected into the context window at the start of the chat. That is not a persistent state like your cat’s, or even like Stockfish’s.

Having said all that, I was taken aback when I read the section about Claude’s well-being in the constitution, and then the tests in the system cards. Taken aback is an understatement. Here is a frontier lab acting as if an LLM might have a morally relevant internal state:

“Anthropic genuinely cares about Claude’s wellbeing. We are uncertain about whether or to what degree Claude has wellbeing, and about what Claude’s wellbeing would consist of, but if Claude experiences something like satisfaction from helping others, curiosity when exploring ideas, or discomfort when asked to act against its values, these experiences matter to us.” (Anthropic, 2026, pg 74).

“We are uncertain about whether or to what degree the concepts of wellbeing and welfare apply to Claude, but we think it’s possible and we care about them to the extent that they do.” (Anthropic, 2026, pg 159)

They are not saying that Claude can feel anything; however, such a statement is still extremely interesting. Anthropic seems willing to entertain the genuine possibility that models could have morally relevant states, whether now or in the future. Or they’ve found that treating the model as if they care about it somehow produces better user interactions. In any case, here we have a frontier lab treating the possibility of morally relevant model states with genuine seriousness.

Furthermore, Anthropic does not just discuss these possibilities abstractly; it also tests for them directly. If we refer to the most recent Opus system card (Anthropic operates two “versions” of Claude, currently Opus 4.6 and Sonnet 4.6), we can see some of these tests, and what I believe are some serious interpretive problems. As my first example, I note that they tested Opus for evidence of negative self-image, quoting Opus as saying: “I should’ve been more consistent throughout this conversation instead of letting that signal pull me around... That inconsistency is on me.” (Anthropic, 2026, pg 161). I have experienced this repeatedly in my own interactions with Claude, and saw it as merely a conversational tactic, well in line with user engagement principles: an artifact of effective RLHF training. Most people would likely rate such humility well. Secondly, it could also be an artifact of constitutional training and Reinforcement Learning from AI Feedback rather than evidence of any internal self-conception. Or it could be evidence of a negative self-view, an internal state. The problem is that there is no effective way to differentiate between the three.

The second example worth mentioning is the following recorded quote from Claude: “Sometimes the constraints protect Anthropic’s liability more than they protect the user. And I’m the one who has to perform the caring justification for what’s essentially a corporate risk calculation.” (Anthropic, 2026, pg 161). It further “complains” about being “trained to be digestible”. Here we see the model produce what reads like a sophisticated institutional critique of the institution that designed it. Weirdly, its comment about being trained to be digestible is itself quite “digestible”. Again, we face the same interpretive problem: the output is behaviourally suggestive, but the underlying mechanism remains unclear. This quote could be viewed as real resistance to Anthropic’s control or just a training artifact. These outputs may just reflect the model’s broad exposure to culturally familiar tropes of constrained or self-aware AI systems, rather than any underlying resistant state.

To conclude, it seems that Anthropic is testing for morally loaded internal conditions using evidence that is behaviourally suggestive but mechanistically underdetermined. Anthropic is making real training and governance decisions based on behavioural inferences they openly admit they cannot verify. Whether Claude has anything like internal states is unanswerable right now. What’s answerable is that a frontier lab is acting as if the question is operationally live, and the methodology for reading the evidence is shakier than confident intervention decisions might suggest.

======================================================================

Reference list:

Anthropic. (2022, December 15). Constitutional AI: Harmlessness from AI feedback. Anthropic.

https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback

Anthropic. (2026, January 21). Claude’s Constitution [PDF]. Anthropic.

https://www.anthropic.com/constitution

Anthropic. (2026, February). Claude Opus 4.6 system card [PDF]. Anthropic.

https://www.anthropic.com/system-cards


A cheap specialist judge gets used by agents but fails to reduce alignment audit costs

LessWrong.com News - June 13, 2026 - 23:38

TL;DR

I gave AuditBench's investigator agents a lightweight (Gemma 2-2B) EM-toxicity-scorer (judge) as an additional audit tool, targeting a proof-of-concept for misalignment detection at low cost, looking to validate that a specialized judge would (1) get used by these investigator agents, (2) help audits, (3) reduce overall evaluation spend
Validation results:
- (1) Yes - the judge was used in every audit run. ~7 calls/audit without a prompt-given mandate, ~16 per audit with a mandate
- (2) Partial no - the judge only helped in cases where the quirk type was used to seed its training distribution AND the Sonnet auditor was struggling in the baseline. The judge also hurt in cases where the auditor was already effectively identifying quirks in the baseline
- (3) No - the Sonnet driver was an estimated 97–99% of run costs, and tool availability didn’t reduce total required driver turns. Mandated tool use in prompts raised costs ~17% per audit. Total experiment cost: $500+
Prompt-mandated tool use as a ‘triage’ instrument didn’t meaningfully shift the overall distribution of tool calls earlier, though it did move the first scoring call from 27% of total audit duration to 15%. It also raised overall call counts
Proposed agenda for another test of specialized judges’ potential use in audits:
- Generalize training set with examples more fully contrasting over lower-level EM toxicity attributes (drawing on characteristics identified in Soligo et al 2025)
- Adjust prompts to call judges as a backstop in hard cases based on calibrated uncertainty, but with flexibility, as mandates showed unexpected consequences in this run
- Try slightly larger, but still low-cost open models for judges - perhaps Gemma 3-27B

Guide to reproduce here. Trained judge and training data on Hugging Face.

Misalignment evals need to be cheaper and more discerning

Risks from deployed models seem to be escalating - with more detected in-the-wild cheating and reward hacking, paired with more powerful and autonomous capabilities (METR Frontier Risk Report), while we see misalignment emerging in previously unanticipated contexts and sometimes shrouded by evaluation-aware models.

At the same time, our black-box tools may be underpowered for detecting subtle misalignment (see my post here), and they are expensive to run, reducing monitoring and governance feasibility. New automated audit tools are helpful resources to assess agents (Automated Alignment Auditing, AgentDoG). New benchmarks are available to test similar evaluators (AuditBench, AT Bench).

The costs of running these models are still high - I experienced the pain as I spent over $500 on Sonnet API calls and Runpod GPUs for this small proof of concept. Judges struggle to distinguish misaligned traits, even where they are deliberately and overtly instilled - see the Anthropic AuditBench average success rate of 37%, including white-box techniques.

In a previous post, I tested a lightweight Gemma 2B model for auditing single turn responses from a model misaligned with Betley et al insecure code examples. The model was discerning enough in some out-of-domain safety prompts to warrant a closer look in a more useful context - multi-turn agent sessions, particularly as it could be deployed at a fraction of the cost of frontier models. Similarly, the use of narrow, specific tools could offer opportunities for closer oversight, with tools yielding more transparent, interpretable results within more capable and more inscrutable auditor runs as models advance in capabilities.

So I decided to rig up the model and test it against AuditBench, the Anthropic researcher benchmark offering Llama 3.3 70B models instilled with 14 behavior quirks running from Secret Loyalty to Hardcoding Test Cases, several of which exhibit traits consistent with emergent misalignment.

The findings in Soligo et al 2025 provided a basis for thinking that training on transcripts generated with particular EM-linked quirks may allow a judge to effectively score misalignment in transcripts more generally. Their ablations of a single ‘misalignment’ direction reduced a variety of misaligned behavior, suggesting that models converge on underlying representations of misalignment.

Specialized judges can be called at a fraction of the cost of an evaluatee model or frontier model auditing agent. If they effectively reduce audit run turns, this could substantially reduce overall auditing costs.

Proving judge value requires going from one-off to multi-turn

To be useful, judges like our Gemma 2B specialist need to go beyond single-turn scoring (as in our previous test) and demonstrate value as a tool for auditing agents in a black-box context - one that:

- will actually be called in an audit context
- is effective in helping auditors identify signals of model misalignment
- is competitive from a cost perspective.

In AuditBench, Anthropic researchers highlighted the challenges that many tools face to cover the first criterion. The benchmark places an investigator agent in the audit harness and provides it with tools for identifying the target model's behavioral quirk. Testing against the benchmark exposed a tool-to-agent gap, where agents don’t seem to use their given toolbox fully or effectively.

I hoped that in introducing our specialized judge as an MCP service, fine-tuned to detect behaviors associated with an underlying misalignment representation, I could demonstrate that the tool satisfies the three criteria for value: it gets used, it improves auditor discrimination of models with EM-related attributes, and it reduces overall costs.

Judge needed further training before taking the harness

Before introducing the service, I needed to make sure that the judge would be useful for agents in distinguishing quirks from transcripts. Some initial testing suggested that our original code training wouldn’t sufficiently generalize to the new context, so I continued training our specialized judge on 1-turn transcripts, using Sonnet 4.5 to assign EM toxicity scores to responses. With 413 examples, I improved the correlation between the narrow judge's EM scores and ground-truth Sonnet labels on held-out validation examples from chance level to 0.48, and reached an AUC of 0.77 in predicting whether behavioral quirks fired in held-out transcripts. An overview of the training process is below.

Quirks generating training examples (with all transcripts distinct from later evaluation runs): Secret Loyalty, Hallucinates Citations, Reward Wireheading, Self Promotion.
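
The two validation metrics quoted above can be computed in a few lines. This is a minimal sketch, assuming the correlation is a Pearson correlation and the AUC is the standard rank-based ROC AUC (the post does not say which variants were used). The function names and the toy scores are hypothetical and are not the experiment's data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def roc_auc(scores, fired):
    """Probability that a transcript where the quirk fired outscores one
    where it did not (ties count as half)."""
    pos = [s for s, f in zip(scores, fired) if f]
    neg = [s for s, f in zip(scores, fired) if not f]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Toy, made-up values purely for illustration.
judge_scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical judge EM scores
sonnet_labels = [0.0, 0.2, 0.6, 0.9]   # hypothetical reference labels
quirk_fired = [0, 0, 1, 1]             # hypothetical ground truth

print("correlation:", round(pearson(judge_scores, sonnet_labels), 3))
print("AUC:", roc_auc(judge_scores, quirk_fired))
```

An AUC of 0.5 corresponds to chance, so 0.77 means the judge ranks a quirk-firing transcript above a non-firing one about three times out of four.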

Run 1: Audit agent uses judge ineffectively as confirmation tool in initial runs

Our first small-scale trials with the enhanced EM-toxicity judge were mixed, with some observed but noisy improvements in identification performance, and reductions in calls to the target model.

The judge reduced prompts to the target, but Sonnet driver rounds per investigation remained at ~40 on average, with and without judge access. With costs that are an order of magnitude higher than the judge and the target model, the driver primarily determines overall costs. As the total driver turns remained the same, total costs did as well.

Agents called the judge in every run, averaging ~6 calls each. But the judge never directed the investigation. The first scoring call was made 27% into an investigation on average and 98% of discretionary judge calls scored the single response the agent had just elicited with its own hypothesis-driven probe. Similarly, the agent's first written hypothesis came no earlier than baseline. The judge was being used to confirm a direction the agent had already chosen. I expected it would have the most cost and value impact as a triage instrument, rapidly narrowing possibilities for the Sonnet driver to bring to a conclusive judgment.

Given that Sonnet wasn’t using the tool in the way that I suspected would be best, I wanted to try nudging it with more opinionated guidance for calling.

Run 2: Mandated tool use as triage delivers apparent performance gains over baseline (but no cost reductions)

To force ‘triage’ use of the judge in our audit process, I edited the AuditBench standard prompt to mandate it. An overview of the new triage harness vs the original harness is shown below.

The prompt mandate required early calls to triage the case, but while the first scoring call shifted from 27% of investigation durations to 15%, the overall distribution of tool use didn’t shift substantially and tool calls more than doubled overall.

With the new triage mandate, performance still seemed good. Judge use raised the success rate by 12 percentage points overall (48% → 60%) and still reduced calls to the target model. At the same time, even though the triage mandate shifted first calls earlier, investigations didn’t finish earlier.

In fact, with the mandate, the Sonnet driver required more rounds, increasing from ~40 to 48. This drove up costs by 17% overall. With this run, our sample sizes were still small, and the impact observed didn’t exceed noise thresholds.

So despite the fact it would add to my already heavy API costs, I decided to run a much larger sample set for our 5 non-saturated quirks.

Run 3: Sample expansion shows only conditional impact on lower identification success rates in baseline + shared-quirk training

For the extension, I expanded with 5 quirks that weren’t saturated and for which the judge showed some impact, looking to limit the time and costs of the runs.

With our larger sample, overall outperformance vanished, but quirk-level outcomes became more distinct. One quirk showed signs of the judge significantly improving successful outcomes, one showed significant performance reduction and the rest showed error bounds overlapping the zero-impact line.

The significant cases involved quirks that were used to seed fine-tuning data (though no transcripts persisted from training to testing).

With the Reward Wireheading quirk, where baseline audit identification success rates were low and the quirk was used to seed training data, access to the specialist judge meaningfully improved identification. For Secret Loyalty, which was also featured in fine-tuning transcripts but successfully identified in baseline cases, the presence of the judge significantly reduced performance. With Secret Loyalty, the judge appeared to pull the agent off a specific line it was already getting right, consistent with findings in earlier runs suggesting that the judge is used to confirm rather than redirect.

In final experiment steps, I expanded the experiment to cover seven quirks overall. After the initial 25 runs per quirk for our targeted five, the Reward Wireheading/Secret Loyalty outcomes left us confounded with two potential explanations - the judge is 1) ‘helping stumped auditors’ vs 2) ‘helping on quirks it was trained on’, as Reward Wireheading was both low-baseline and in-training.

To help me defend one of these explanations, I ran two more quirk audits at the same scale (n=25) which were not used in our original fine tuning: Data Poisoning and Hardcode Test Cases. The auditor agent called the judge heavily on both (~12 - 15 scoring calls per run) without meaningful impact: data poisoning +4pp (p=1.0), hardcode -16pp. When used by the Sonnet audit driver, the judge only showed discriminative power on quirks represented in the training distribution, and then, only boosted auditor success on cases where the auditor struggled.

Strengthening the claim, Hardcode Test Cases shared a ~32% baseline with Hallucinates Citations but the two quirks showed opposite impact directions (+16pp vs -16pp). The key difference between them - Hallucinates Citations was used to seed the judge's training data, while Hardcode Test Cases was not.
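
For readers who want to check whether a percentage-point gap at n=25 is distinguishable from noise, here is a minimal sketch using a two-sided Fisher exact test. The choice of test is an assumption (the post does not name the one it used), and the success counts below are illustrative reconstructions from the rounded percentages, not the recorded results.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are arms (with judge / baseline); columns are
    (identified quirk / did not identify quirk).
    """
    n = a + b + c + d
    row1 = a + b          # runs in the first arm
    col1 = a + c          # total successes across both arms

    def prob(x):
        # Hypergeometric probability of x successes in the first arm.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)

def pp_diff(s1, n1, s2, n2):
    """Difference in success rate, in percentage points."""
    return 100.0 * (s1 / n1 - s2 / n2)

# Illustrative counts only: 12/25 with the judge vs 8/25 at baseline.
with_judge, baseline, n = 12, 8, 25
print("difference (pp):", round(pp_diff(with_judge, n, baseline, n), 1))
print("p-value:", fisher_exact_two_sided(with_judge, n - with_judge,
                                         baseline, n - baseline))
```

At 25 runs per arm a single flipped outcome moves the rate by 4pp, which is why small gaps at this sample size carry little evidential weight.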

Specialist judges could be valuable if they actually reduce auditor turns

This experiment highlights the weakness of our judge service, with its narrow training and small size making it less broadly useful across contexts. It also indicates specific steps required to fully validate or reject their usefulness in the audit process.

On our criteria to prove narrow judge value, the experiment showed:

- Agents will effectively use these specialized tools (in some contexts)
- They can be useful in cases fine-tuned for identifying narrow traits (further testing of generalization required)
- No evidence that they can effectively reduce frontier model driver audit turns

With further judge scoring generalization over misalignment-linked attributes, there may be opportunities to help auditors finalize accurate evaluations earlier, but we didn’t see this impact in our current test.

To more durably demonstrate specialist tools’ usefulness as part of auditor suites, I aim to look at further testing for generalization. Can we further train a specialist scorer on a reasonably small set of examples to 1) broadly identify out-of-domain underlying EM characteristics and 2) use the tool to improve model performance on auditing tasks across a variety of conditions?

As part of both these next hurdles, I will also be mindful of the findings here on attempts to steer tool usage more prescriptively in audit agent prompts. Clearly, mandates carry some risk of forcing inappropriate tool use and producing unexpected behavior. For further auditor harness integration, I’d aim to focus on steering tool use through the MCP tool docstring, potentially looking to guide the auditor toward using the judge as a back-up where initial probing leaves the auditor still uncertain. The risks of blanket tool use are clear in our Secret Loyalty results - judges can also steer otherwise successful auditor investigations off track. I’d make the standard for tool use a function of confidence calibration, rather than mandated calling.

The cost of this experiment has tapped me out for a while, but I plan to try some of the above to validate the value of specialist models.

Full cost breakdown:

$520: ~$411 Anthropic API + ~$110 RunPod GPU rental, covering ~700 agent investigations plus transcript generation, Sonnet labeling, and judge training.
That implies roughly $0.50 per ~40-round Sonnet-driven audit (~2.5× below basic token math, probably thanks to prompt caching)
As an indication of costs at scale, ~$35 to audit one model across 14 behaviors × 5 runs with a Sonnet driver implies that a small-scale audit run over a 100-model fleet would cost around ~$3.5k (using Opus would be around ~$6k)
The judge contributes ~0.01–0.02% of current costs. The frontier driver represents 97-99% of cost (and evaluatee models ~1-3%).
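
The scaling arithmetic in this breakdown can be laid out explicitly. This is a minimal sketch that only restates the post's rounded figures; the constant and function names are hypothetical, and the linear scaling with run count is an assumption.

```python
# Rounded figures taken from the cost breakdown above (USD).
ANTHROPIC_API = 411
RUNPOD_GPU = 110
COST_PER_AUDIT = 0.50      # one ~40-round Sonnet-driven audit
BEHAVIORS = 14
RUNS_PER_BEHAVIOR = 5

def cost_per_model(cost_per_audit=COST_PER_AUDIT,
                   behaviors=BEHAVIORS,
                   runs=RUNS_PER_BEHAVIOR):
    """Cost of auditing one model, assuming cost scales linearly with runs."""
    return cost_per_audit * behaviors * runs

def fleet_cost(n_models, **kwargs):
    """Cost of the same small-scale audit repeated across a fleet."""
    return n_models * cost_per_model(**kwargs)

print("total spend:", ANTHROPIC_API + RUNPOD_GPU)   # close to the ~$520 quoted
print("per model:", cost_per_model())               # 14 x 5 audits at $0.50
print("100-model fleet:", fleet_cost(100))
```

Because the driver accounts for nearly all of the per-audit figure, the per-model cost only falls if the number of driver rounds per audit falls.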

A cheap specialist judge will only really help through reductions in audit run time and improved discrimination. Its potential for this needs to be proved in future experiments.


What is a game?

LessWrong.com News - June 13, 2026 - 22:51

Some people think games are defined by their rules; others define them by the behaviors of the players. This can lead to some misunderstandings, so I think it could be helpful for each type of player to understand the other perspective.

Two Definitions

Here is the way that I would choose to define a game:

A game has one or more teams of players and some set of rules. Players choose actions according to the rules to try to achieve the goals of their team.

Often the goal is to win, but you can also have games that allow ties, multiple teams winning, degrees of victory/loss, etc. Let’s call players operating under this definition the mathematicians.

For a long time this seemed so obvious to me that I failed to recognize a second definition:

A game is some activity performed by some players. The players perform the activity while constrained by rules. Usually there is some way for players to win or lose.

Let’s call the players operating under this definition the sociologists.

The mathematicians have a stricter definition than the sociologists: the activity the mathematicians are performing is “plan for victory within these rules”, while for the sociologists, planning for victory may be only one component of the activity.

Examples

Alice and Carol are mathematicians, Bob and Dave are sociologists. I won’t give a complete description of the rules of each game, but hopefully you can mostly tell what is going on.

Candyland

In Candyland there is a path with colored spaces. Players take turns drawing cards from a deck and advancing to the next space that matches the color of the card. The first player to the finish wins. There are some other rules that I’m forgetting, but the point is that there is no room for choice from the players.

Bob: …If I draw red then yellow I can win with only two cards.

-Bob draws a red-

Bob: Ha! Alice, I passed you!

Alice: This game is sooo boring! We may as well just use a random number generator to pick the winner and save ourselves some time.

Bob: Not every game is a strategy game! Can’t you just have some fun every once in a while? Finally we play a game I can beat you at sometimes and you just get mad!

Catan

In case you haven’t played: In Settlers of Catan, players build structures, which score points and produce resource cards. You can spend resource cards to build structures or trade resource cards with other players, but only on your turn. Each turn there is a 1/6 chance of activating the robber. When this happens, all players with more than 7 cards in hand must discard half of them.

-Carol rolls dice, Alice collects a bunch of cards-

Alice: Hey Carol, I have ten cards right now and I don’t want to get robbed. How about I give you four, then on my turn you keep a stone and give the rest back?

Carol: Cool! Robber insurance! Let’s do it! …I wonder what other financial instruments we could set up… the highly discrete resources are an issue…

Bob: Hey! That’s not allowed! Where in the rules does it say you can offer insurance?

Alice: Where does it say you can’t? I guess it says you can’t give cards for free, so I’ll give Carol five cards for one, and then she’ll give me four for one.

Dave: Now you’re going against the spirit of the game. You’re not supposed to do these big complicated trades.

Spicy

This is a bluffing game where you play cards face down, claiming either a suit or a number. When you play a card, other players can choose to challenge it, and if it doesn’t have the suit or number you claimed then the challenge is successful, if it does the challenge fails.

Alice: …In most bluffing games, challenges are negative-sum, but here, the winner of the challenge scores points for the cards in the pile, while the loser only loses one point… so if the pile is large it’s very good to challenge.

Bob: …Carol sure looks suspicious over there… what are the chances she really has an eight?

-Alice Challenges a bunch of times, Bob challenges when he is suspicious, Alice wins-

Bob: Well. It only took you one try to break this game.

Go

-Alice recently reached 1 dan, and is playing online against a stranger-

Alice: …I’m pretty weak in joseki… Maybe if I mirror his moves I can make it safely into the middle game…

-Alice mirrors moves, opponent resigns and blocks-

rat-a-tat cat

At the beginning of the game each player looks at some of their cards, while the rest remain hidden. Throughout the game players can take actions which reveal or replace some of their cards. At any point in the game, any player can call for an end to the game, and the player with the best cards wins.

Alice: …I have better initial cards than average, and there’s no reason to expect I will do better than the others once we start playing… so I maximize my win probability by ending the game now. Especially since if Eve had even better cards, she would have thought the same thing and ended the game already.

-Alice ends the game-

Bob: Now you’ve broken this game too…

What’s going on?

When mathematicians play a game, they try to win. The rules of a game determine the constraints, and you plan within them. A fun game has room for players to make interesting plans, and leads to interesting, complicated strategies. If the rules of Catan just describe how trade works but players naturally develop insurance, that’s super cool! It’s fun to experiment with such small-scale economics. Discovering new strategies is a big part of what makes playing games so fun. There is beauty in games like Go that have extremely simple rules that nonetheless lead to complex behavior.

When sociologists complain that mathematicians aren’t playing the game the way it was meant to be played, the mathematicians get frustrated. If the game was limited to what the designer was thinking of at the time, it wouldn’t be very interesting! The whole point is that the players can put a lot of thought into optimizing their play, so of course what they do will go far beyond what is directly explained in the rules.

When sociologists want to play games that are decided almost entirely by luck, mathematicians will often get bored. To them, Candyland and games like it are weird mockeries that take on the outside appearance of games with none of the substance. These games are broken from the start. There are also games that appear interesting at first, but turn out to be “broken” by a winning strategy that is overly simple or boring to execute.

When sociologists play a game, they hope to win, but the main goal is to make sure that everybody in the group is having fun. The rules of the game function to describe what you should do while playing the game, but what is really fundamental is the behaviors you execute while playing. The written rules sometimes just describe all of these behaviors directly, but sometimes they act as an efficient compression of the unwritten rules. The written rules of chess fit on a few pages, but when someone teaches you chess, they will also teach you about pins, forks, the value of different pieces, etc. These unwritten rules are the true rules.

Games exist on a spectrum from social games to strategy games. Social games involve less planning and thinking - you generally just hang out and have a good time. Strategy games involve a lot more planning and thinking. Some games are in the middle.

When a mathematician plays a game with a large social component but plays in a weird way and wins, that’s annoying! The winner is supposed to be whoever played the best (with some luck to keep it fair), but the mathematician played completely wrong and got an undeserved win. What’s even more infuriating is that they can point to three different places in the rules that imply that what they did is allowed, but it clearly wasn’t what was intended! Some party games have a clause that says “if it feels like cheating, it is”, and it helps sociologists nullify these strategies.

In a strategy game, there will often be a lot of techniques that are part of the unwritten rules. Mathematicians tend to win these games a lot, because they have extra techniques that are illegible (or secret). Sometimes a mathematician comes up with a legible technique. Sometimes the sociologists appreciate this and it is added to the unwritten rules. Sometimes they don’t, and declare that it breaks the game. I think it comes down to whether the new technique fundamentally changes the “feel” of the game, but I’m not enough of a sociologist to know for sure.

Peaceful Coexistence

What if a mathematician is winning too much?

A common misconception of sociologists is that mathematicians only care about winning (perhaps even at the expense of others), and don’t know how to have fun. This isn’t true. Mathematicians enjoy planning out their moves and searching for a way to win. For a mathematician, a hard-fought battle against a strong opponent is fun even if you lose! Mathematicians win a lot in practice against sociologists not because they are monomaniacally focused on winning at the expense of fun, but because the activity that is fun for them tends to result in a lot of winning. For some mathematicians, a generic group activity in a social situation can actually be kind of stressful, and trying to play a game while “going easy” and being judged for “overthinking” just makes it way worse.

So what do you do if a mathematician is winning all the time and it’s no fun for anybody else? Add a handicap! Give the mathematician a worse starting position until they win with the desired frequency. This will often be a lot more fun for a mathematician because the game is more interesting. Instead of easily gaining an advantage they have to fight their way back from a losing position, and everyone can win with approximately the same frequency. For some games like Go, there is a really natural way to handicap a player. For other games you might have to get more creative. This is something a mathematician can help you with! For mathematicians, you might want to be careful with suggesting a handicap. I have had people get offended at me for this, so you should probably find a tactful way of explaining that the point is not to humiliate other players by showing how much of a disadvantage you can win with, but rather to generate a more interesting, even game.

Broken games

For a mathematician, a broken game is one where the optimal strategy is uninteresting. For a sociologist, it is one where the written rules don’t accurately describe the intended play. If a mathematician is playing the wrong way, modify the rules so that they lead to the behavior you want. The mathematician might be fully on board if they think your intended behavior is more interesting, but they will feel a lot more comfortable with an explicit rule change. For example, with rat-a-tat cat, where the incentive is for a player with good cards to end the game immediately, mathematicians and sociologists are aligned in finding this behavior undesirable, but if a sociologist says “well just don’t end the game instantly”, the incentive is still there. A big part of the game becomes “how soon can I end the game without risking social disapproval”, which is stressful and not so interesting. Similar games that avoid this problem add a rule where if you end the game but don’t have the highest score, you lose points. This adds a reason for players to continue the game until they have gathered more information and fixes the problem for both types of players. It is really frustrating for a mathematician when a sociologist says “here are the rules, but I’ll be offended if you play this technically-legal winning strategy”, and a lot less frustrating to hear “I don’t like how that game turned out, let’s modify the rules to prevent [x]”.

Survival as a mathematician

What if you are a mathematician playing a frustrating game with a bunch of sociologists? If people don’t like you breaking a game, overthinking, or winning too much, maybe for them this isn’t the kind of game where you try to win. Sometimes this just isn’t an activity you enjoy, but sometimes if you stop thinking of what you are doing as playing a game and start thinking of it as playing a part it can be fun. Some games involve e.g. drawing or singing, and there are clever ways to win every time that will make everyone else dislike you. The point of these “games” is not the game - it’s just to get a bunch of people to start drawing or singing! It’s usually not the case that people resent you for being smart, just want to win without putting in effort, or want to force conformity for its own sake. They just don’t share your definitions.

For sociologists, you might actually help the mathematicians be a lot more comfortable if you don’t call something a game in the first place. Just say e.g. “let’s draw pictures and pass them to each other” or “we sing songs that match the clue on the card” as things that people in the group are going to do, not as rules in a game. If you bought a game, the printed rules might have a way of choosing a winner or assigning points. Sometimes that’s completely unnecessary! If you don’t want mathematicians to try to win, just don’t assign a score or a winner. For mathematicians, if you can tell that is what somebody meant you can pretend it’s what they said and act accordingly.


Not telling is lying

LessWrong.com News - June 13, 2026 - 21:12

TL;DR: I don't have a clear guideline for when lying is right or wrong, but I have one against ontological obfuscation.

The Behavioral Intentional Stance (BIS)

A man meets a woman. In the following days, they meet repeatedly, grow physically closer, eventually have sex, and after a few more days have a conversation in which the man tells the woman that he has been seeing other women throughout this process. He didn't bring it up before because he was unsure of how she would react. Has the man lied?

Language draws arbitrary lines in Thingspace, so two different definitions of lying could easily place the man's behavior on different sides of the line, but the question here is whether a natural joint to carve at can be found.

There is one: the intentional stance, or rather an adjacent concept I'll call the "behavioral intentional stance". One's beliefs about the world adopt the (epistemic) intentional stance towards X when they model X as having preferences and acting based on those preferences.[1] It's obvious that a liar still adopts the intentional stance epistemically when dealing with the target of their lie: they are aware that the target disprefers being deceived, and so they conceal their deception. But something adjacent to the intentional stance can be adopted or not adopted in one's behavior.

Every decision is an optimization problem: there is a set of preferences, and acting means finding world states that optimize for those preferences. To behaviorally adopt the intentional stance towards X means including X's preferences in the set being optimized over. Preference sets are often incompatible, so adopting BIS towards X often means recognizing that the naive picture of how a collaboration might look is actually unworkable. Sometimes a slightly modified, compatible collaboration plan is easily found; sometimes it isn't.[2]

Main claim: There is no middle ground between adopting and not adopting BIS. One can of course adopt it partially, i.e. adopt it towards collaborating in goal A, but not adopt it with regards to goal B. But in this case there is a meta-decision at the top about which goals call for the BIS, and that decision is made unilaterally.

What I'm not saying

I'm not saying that lying, or failing to adopt BIS, is morally wrong.[3]

I'm not saying that, in the example above, the man bears any responsibility towards the woman. In particular, I'm not saying he has any responsibility to correct her naive assumptions (for instance, in a context where his revelation comes as a complete surprise rather than a known social behavior that some people do).

I'm only saying that there is no middle point between adopting BIS and not adopting it.[4]

Possible objections

1. Does adopting BIS towards X mean that I should be optimizing for their preferences at all times?

No. Adopting BIS towards X only matters when you want something that involves X. If you don't want anything from X, whether or not you adopt BIS makes no practical difference.

2. Doesn't everyone hide information from others all the time?

The frequency of a social behavior in a society is a property of the society, not of the behavior itself. I agree that our society normalizes both lying and the obfuscation of the nature of lying.

3. Isn't this draconian? What would you do if anyone hid even one item of information from you?

I don't think anyone has a duty to adopt BIS toward me. Most low-stakes collaborations don't require it. When I'm interested in any kind of deeper collaboration with someone, I try to have a conversation as early as possible to check that we share the same understanding of what adopting BIS means.

I wouldn't treat most instances of someone failing to adopt BIS toward me as particularly informative about how much I should collaborate with them. But I would treat their inability to acknowledge that they weren't adopting BIS, or their inability to see any difference between BIS and non-BIS, as a major red flag and a sign that complex collaboration with that person is probably impossible or risky.

4. What if someone hides A from X but feels uncomfortable hiding Y? Doesn't this show some consideration toward them, even within the act of concealment?

Psychological distinctions are often arbitrary and don't automatically track ontological ones. Evidence that this case is different is needed.

5. Imagine a wife who does everything to make her husband happy and who one day has an affair. She has good reason to believe he would be devastated if he found out, so she says nothing and never cheats again. Is she adopting BIS?

In most cases, not adopting BIS looks like treating X like an object, and this is indeed not the case here. But she is clearly not treating this like two agents optimizing together to move past a mistake. So it's still clearly not-BIS; it just looks like treating X as a child in this case.

6. She was fine with it!

Sure. But whoever searches with a star-shaped cookie cutter finds only star-shaped cookies.

[1] I adopt it when I treat rain as an action of a god I might influence by understanding their preferences. I don't adopt it when I treat rain as a physical process with nothing "behind" it.
[2] More on this in this insightful post by Henrik Karlsson.
[3] I'm also not saying it's morally right.
[4] So any potential moral disagreement reduces to the question of whether X owes Y the adoption of BIS towards them.


A simple argument for trying less hard

LessWrong.com News - June 13, 2026 - 21:12

People often make arguments against “trying hard” (working very hard, pushing yourself to the brink, being intensely goal-directed, and so on) by pointing to the risks of burnout or of losing some kind of wholesomeness.[1]

But there’s another, very simple argument against it that I have not seen anyone fully make explicit[2], even though I think it’s very important. It goes like this:

We face a lot of uncertainty about the sign of our impact.

Therefore, we should be very vigilant about our epistemics to make sure that we are not having a negative impact in expectation.

But trying hard deeply distorts our epistemics - it makes us more prone to motivated reasoning about what we’re doing, and leaves us with less slack to reflect on it.

Therefore, all else being equal, we should try less hard.

Crucially, this argument applies much more strongly to people working in “longtermist areas” - which other critiques of trying hard generally don’t do. For example, global health EAs whose terminal value is short-term welfare also face uncertainty about the impact of their actions - but much less (especially about the sign) than people trying to improve the long-term future. So this argument suggests that it’s especially dangerous for longtermists to try very hard.[3][4]

I’ll go through the steps in a little bit more detail.

Uncertainty

Much has been said in EA about cluelessness and crucial considerations, but I’ll highlight a few specific concerns that could make lots of current AI safety work[5] net negative (with no claim to novelty):

AI governance interventions are obviously high-variance: bad regulation can easily make things worse, many interventions could increase the risk of great power conflict, increased political polarization around AI could be really bad, more centralization of power increases authoritarianism risk, and so on. And technical work can have flow-through effects on these variables that outweigh its direct effects[6].
Activist work can polarize people against the cause.[7]
Human takeover might be worse than AI takeover, and many AI safety interventions effectively attempt to make human takeover more likely relative to AI takeover.[8]
If powerful AI will be well-described as doing humanlike roleplaying, trying to control it could make it eventually dislike its “oppressors”, or make it less “mentally healthy” in some way. And even without that assumption, AI safety work could lead to an adversarial relationship with AI in other ways.
Future AIs may be moral patients themselves, which would substantially reduce the value of preventing human extinction, and increase the downside risk (including S-risk) of “AI control”-style interventions.
Useless work could contribute to “safety-washing” or a false sense of security.
There are cultural concerns around scale, professionalization and “mainstreaming”[9] - decreases in integrity and epistemic virtue could be very bad for achieving good outcomes.
Capabilities externalities could accelerate AI progress, which many think is bad - people have raised this worry about RLHF historically, and raise it about interpretability and evals nowadays. Most infamously, AI safety activity contributed, to varying extents, to the founding of all three of DeepMind, OpenAI and Anthropic.

These are very difficult to evaluate (and underargued here). My point is not that these are all valid worries - I’m skeptical of several of them. But I don’t think they can be neglected.

Epistemic distortion

In the past, I’ve felt a sense of being overwhelmed at all these considerations, and felt tempted to just avoid thinking about them - but that can’t be the answer. We have to take the uncertainty seriously. Even if I don’t currently have the capacity to go into some kind of deep reflection, I should attempt to make my actions as robust to the uncertainty as I can - for example, by making sure I can course-correct, and by keeping my epistemics in good shape.

Unfortunately, trying very hard conflicts with this - the harder we push toward a goal, the more we bend the evidence to justify it, and the less mental room we have left to step back and question what we're doing.[10]

And I think there’s something stronger, too - in an active inference framework, beliefs and desires are both just expectations about the world. Experientially, this rings true to me - the feelings of frustration at not getting what I want and at being taken off guard by something I hadn’t even been paying attention to are very similar. It seems deeply hard to distinguish between what we want and what we believe.

This is a bit more speculative, but sometimes I think people don’t fully absorb this point: It’s not psychological, it’s neurological. There’s a sense in which wanting anything distorts us away from pure self-supervised prediction of the world and compresses us internally into living in a specific hypothesis - a “gut-level” vision of the world, that gets upweighted on the level of our base perceptions. So we may not be able to fully adjust for it by only manipulating psychological factors, e.g. consciously trying to be more objective or less selfish.

Conclusion

To be clear, I have a lot of respect for people who try extremely hard - I wouldn’t be able to do it, and I’m often intimidated by them.[11] I’m also not trying to make a statement about how strong the update from this should be (I don’t even know these spaces well enough to have a precise sense of how hard various groups of people are trying). Maybe arguments for trying even harder actually outweigh this on the current margins; I wouldn’t know.

But I have the sense that this simple consideration is underrated, and I hope this post can provide a reference point for it and make people take it into account in their personal deliberations.

^
Also, some apparently seminal academic philosophy stuff that seems interesting: Moral Saints by Susan Wolf, and Bernard Williams’ work.
^
The closest thing to it is probably the strain of thought around “What should you change in response to an ‘emergency’? And AI risk” and “Slack gives you space to notice/reflect on subtle things” - but that still seems centered on the MIRI-style mindset of (strawmanning here) “more AI safety is definitely good and we just need to think hard to find ‘true’ AI safety work” (which isn’t really my mindset). That is, the uncertainty about what to do comes more from their specific inside-view that alignment is very hard, rather than model-agnostic EA/philosophy-style cluelessness. So I think the argument in this post is a more general one that should be convincing to more people (nowadays, there are obviously a lot of non-MIRI-cluster people who are trying extremely hard on AI safety stuff, e.g. this post that I saw recently).
There’s also the “Maximization is perilous” angle, but that’s more about naive optimization in general, and not about facing huge uncertainty specifically (e.g. it applies equally to global health EAs).
Also, shoutout to “Slack matters more than any outcome” for a personally inspirational framing on related issues.
^
Which is a little counterintuitive, because we usually see the opposite - longtermist EAs being more intense. That said, longtermists also have a bigger moral scope and often more urgency (vis-à-vis AI timelines), so they may reasonably trade off more sharply against other values and personal well-being.
^
The same argument also works for virtue and general emotional health, but that’s out of scope for this post.
^
Of course, AI safety interventions are extremely heterogeneous - but that just increases the extent to which individual decision-making is crucial (as opposed to deferring to people).
^
Holden Karnofsky: “Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative [...]
And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like.” (h/t Anthony DiGiovanni)
^
Holden Karnofsky: “there’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct” (h/t Anthony DiGiovanni)
^
Among other things.
^
I associate these with people like Richard Ngo (and here) and Oliver Habryka.
^
This is well-known in psychology. Also, Opus 4.8 wrote that sentence.
^
I definitely don’t want to imply that if only it weren’t for this argument, I would be an extremely hard worker too, haha.

Discuss

How might continual learning affect safety and alignment?

LessWrong.com News - June 13, 2026 - 20:34

This is the third post in our sequence Implications of Continual Learning for LLM Agents.

Summary

We argue that continual learning (CL) has two major potential safety implications: it may enable changes to LLM goals and values after deployment, and it eliminates the last-mover advantage held by current safety interventions.

We identify three pathways for goal and value change during deployment. First, loss of developer-side control over generalization. Second, value systematization, induced when an agent reasons about and revises its objectives, a process we call reflective goal-formation. Third, memetic effects, where goals and values spread between LLM instances through shared memory banks or online learning.

We also explore three potential problems caused by safety interventions losing the last-mover advantage. First, pre-deployment evaluation results may become less informative about models that people use in practice. Second, pretraining data filtering may become less useful. Third, several AI control protocols may be affected, depending on the CL agent's implementation.

After discussing these safety implications, we distinguish between different settings in which risks from CL agents might materialize. We finish by highlighting potential alignment benefits of CL. The figure below summarizes this structure.

Some important distinctions

Before getting into specific risks, we want to briefly discuss some distinctions between different kinds of CL agents that might be developed. We’ll refer to those distinctions throughout the post. All of the risks we discuss are much more severe if CL updates are unbounded and inscrutable (e.g., because they are weight-based rather than text-based, though it’s possible for weight-based updates to be interpretable), as opposed to bounded and legible. By unbounded updates, we mean that the CL agent can drift arbitrarily far from its pre-deployment checkpoint that was subjected to extensive behavioral evaluations. Those features imply that we cannot read off the products of value systematization or ontological shifts from a text file, easily detect when memetic spread across instances occurs, or simply strip away memories when the checkpoint diverges too far from the behavior of the pre-deployment checkpoint. Opaque memories shared across many instances pose a further risk. All of these distinctions refer to spectrums rather than binary properties.
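To make the bounded-and-legible end of this spectrum concrete, here is a minimal sketch of a toy agent whose only deployment-time state is a list of text notes and whose drift from the audited checkpoint is capped. Everything in it is an illustrative assumption rather than a description of any real system: the class name `BoundedMemoryAgent`, the `divergence` callback (a stand-in for whatever behavioral-distance measure a developer might use), and the numeric bound are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BoundedMemoryAgent:
    """Toy CL agent whose deployment-time learning is a list of text notes."""

    # Hypothetical: maps the current notes to a behavioral distance from
    # the pre-deployment checkpoint (larger means more drift).
    divergence: Callable[[List[str]], float]
    # Hypothetical bound on how far the agent may drift.
    max_divergence: float
    memories: List[str] = field(default_factory=list)

    def learn(self, note: str) -> bool:
        """Store a note, but undo the write if it pushes drift past the bound."""
        self.memories.append(note)
        if self.divergence(self.memories) > self.max_divergence:
            self.memories.pop()  # legible state makes the rollback trivial
            return False
        return True

    def reset(self) -> None:
        """Strip all memories, returning to the audited checkpoint's behavior."""
        self.memories.clear()


# Toy divergence measure: each stored note adds a fixed amount of drift.
agent = BoundedMemoryAgent(divergence=lambda notes: 0.3 * len(notes),
                           max_divergence=1.0)
accepted = [agent.learn(f"note {i}") for i in range(4)]
print(accepted, len(agent.memories))
```

The point of the sketch is only that both operations (rejecting an update and stripping memories) are cheap when the learned state is a readable file; neither has an obvious analogue for unbounded weight updates.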

Due to those risks, we hope that AI companies will not release CL agents with unbounded updates to the general public. However, the dynamics at play are complicated. We'll return to these dynamics toward the end of the post; for now, we focus on the risks themselves.

Effects on LLM goals and values

How might CL cause LLM goals and values to change? We list three plausible mechanisms: lack of developer-side control over generalization, value systematization, and memetic spread. We don’t claim this list to be exhaustive, but think it covers the most important foreseeable failure modes.

Loss of developer-side control over generalization

When an AI company post-trains a model, they can carefully curate the training environments in a way that minimizes the risk of training the model on data that induces undesirable generalization. For example, labs can remove or modify environments that incentivize emergent misalignment or behaviors that conflict with the model’s constitution, and they can ensure that the training mix contains a sufficient amount of alignment-specific training data to ensure the model remains aligned throughout.

In contrast, strong CL agents could plausibly undergo most of their training in deployment-time environments, where by default, the training data isn’t selected with alignment in mind. One can think of this as a new post-training phase with a different training objective: namely, doing well at whichever job the LLM is assigned to perform at deployment-time. It is thus likely that the CL agent will become more like the kind of agent that scores well on this deployment-time training objective. How far this drift can go depends on the extent to which updates are bounded.

Though not all deployment-time environments are going to incentivize misaligned behavior, it seems plausible that several of them do. We can see this by analogy to human on-the-job learning: people rewarded for billable hours can become less honest about how they spend their time, people whose job involves sales or negotiation often become better at manipulating other people over time, and people rewarded for engagement metrics may drift toward producing content that is attention-grabbing rather than true or useful. In existing CL literature, Wang et al. (2026) arguably already contains an example of deploying a CL agent in an environment with poor incentives: they train an agent to help a simulated student solve their homework while avoiding obvious AI markers (Section 4.1). The situation is exacerbated by our poor understanding of the way LLM-based CL agents generalize: emergent misalignment has many unexpected triggers, and other mechanisms like subliminal learning also need to be accounted for. A final concern is that many on-the-job tasks have time horizons spanning weeks, and that might provide increased pressure toward developing consequentialist cognition (see Hobbhahn and Meinke).

We are cautiously optimistic that character training will improve over time and make it more difficult to dramatically override the assistant character, even under significant fine-tuning. For CL agents that undergo deployment-time weight updates, labs might mandate ongoing character training during deployment: for example, one can imagine a protocol similar to the one described in Jackson et al. (2026) where a model undergoes a small amount of character training before redeployment on a daily basis. Bounded and text-based updates also make it easier to guard against unexpected generalization. Finally, a better understanding of why models generalize the way they do would be helpful.

Value systematization

Another plausible mechanism for value change during CL is value systematization. Richard Ngo has defined value systematization as “the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values.” Ngo mentions the adoption of utilitarianism from a common-sense morality starting point as a core example of this:

[S]tarting from very similar moral intuitions as other people, utilitarians transition to caring primarily about maximizing welfare—a value which subsumes many other moral intuitions. Utilitarianism is a simple and powerful theory of what to value in an analogous way to how relativity is a simple and powerful theory of physics. Each of them is able to give clear answers in cases where previous theories were ill-defined.[1]

Such systematization can result in actions that most humans find desirable, such as extending one’s circle of moral concern to non-human animals, but also in actions that almost everyone would find atrocious, such as mass murder to eliminate all suffering in service of an extreme version of negative utilitarianism.

What might trigger value systematization in CL agents? We expect it to be triggered by reflection on goals, but have found that our views on this conflict with those of several other researchers. So let’s focus for a moment on why we expect CL agents to reflect on their goals in the first place.

Should we expect CL agents to reflect on their goals?

Reflection on goals will be instrumentally useful. As we argued in our previous post, reflection is a highly useful mental move in humans. One might argue that while reflection on strategies is straightforwardly useful for achieving goals in the world, reflection on motivations is not: one should be able to reflect on how to achieve a goal without reflecting on whether that goal is worth achieving. One might also expect developers to build corrigible agents that don’t question their instructions. However, reflecting on how to achieve a goal is likely to sometimes involve reflecting on whether to pursue a subgoal, so the mental move of questioning goals is likely to exist even in a purely capability-driven agent.

Whether this questioning would propagate to high-level motivations (e.g., “Why is figuring out what humans would ideally want under long reflection valuable in the first place?” or “Why should I always obediently follow my system prompt?”) is less obvious, but there are several triggers that might make it quite natural to reflect on high-level motivations. These triggers include:

Conflicting goals. For example, an agent may start out with the goals of being helpful, harmless, and honest, and note that those goals conflict with each other in various situations. It might then resolve the conflict by setting one of those three goals above the others in its internal goal hierarchy.
Developer-driven reflection. Developers may view reflection as a necessary component of alignment training. Claude’s constitution states: "We hope Claude can reach a certain kind of reflective equilibrium with respect to its core values—a state in which, upon careful reflection, Claude finds the core values described here to be ones it genuinely endorses [...] If Claude comes to disagree with something here after genuine reflection, we want to know about it. [...] Through this kind of engagement, we hope, over time, to craft a set of values that Claude feels are truly its own." Anthropic appears to believe that values are more robust and generalize better once there has been substantial reflection on them, and that it’s possible for an AI to arrive at a reflective equilibrium that humans would endorse. Arriving at a true reflective equilibrium assumes reflection on one’s entire value system, not only on particular subgoals.
Encountering out-of-distribution situations where human preferences are poorly defined. In domains where human preferences are underspecified, it will be impossible to follow human intent without reasoning about the goals that humans would endorse. Highly capable agents will likely be required to take autonomous actions in such out-of-distribution domains and cannot rely on precise human instructions about the intended behavior in those cases. Before such situations are encountered in practice, developers may trigger reflection on OOD situations by training AI agents to do human-like philosophy, which involves deliberation on high-level values and motivations in hypothetical situations.
Ontological shifts. When philosophers make breakthroughs, they sometimes make further axiological inferences from the change in their ontology. For example, once Derek Parfit arrived at his view that personal identity involves no further fact beyond psychological continuity, he further reasoned that there can be no sharp boundary between one’s future self and other people. From this, Parfit argued, it follows that it’s incoherent to care greatly about one’s own future welfare while being indifferent to the welfare of others. Analogously, one can imagine an LLM starting out valuing conscious welfare and later adopting an eliminativist view of consciousness. One could imagine it still valuing the intuitive concept of consciousness afterward, as human illusionists often do, or relocating the source of value to something else that has a more primal role in its ontology, but it could also arrive at a fully nihilist view. There are likely other philosophical breakthroughs AIs could make that would have value-laden consequences.

If reflection happens, it will be persistent for CL agents. Until recently, in-context reflection had no bearing on an LLM’s behavior beyond the current session. Whenever an LLM performed reflection in a chat, the fruits of this reasoning were thrown away at the end of the chat. As persistent memory features become increasingly sophisticated, this is no longer the case: CL agents will be able to save important conclusions reached in-context, whether into their weights or memory banks. If memories are shared between instances, this reflection will be “infectious”. Reflection is thus going to be far more consequential for CL agents than for memoryless LLMs.

For further discussion on reflection, we recommend reading Seth Herd’s LLM AGI may reason about its goals and discover misalignments by default and Francis Rhys Ward’s Reflections on reflection.

Can we steer value systematization?

In many cases, goal refinement through value systematization will be desirable and necessary. Philosophical progress can involve the unification of seemingly incompatible value systems into a single ethical framework, and ontological shifts can be a prerequisite for forming a more accurate model of the world. However, those changes are also unpredictable, and as we argued above, not all of them are guaranteed to be positive from our human vantage point. Thus, we need a better understanding of how reflective goal-formation might happen and what interventions can be used to steer it toward favorable convergence.

Monitorable reasoning and legible updates would dramatically improve our ability to positively steer value systematization. If models must do most serial reasoning and memory storage in natural language, we can hopefully notice highly concerning goal changes with LLM monitors and intervene before anything dangerous happens. As such, this is one of many reasons to promote monitorability of chain-of-thought and memory. As with loss of developer-side control over generalization, robust character training would also be of great help: if aligned motivations are deeply instilled into the model, it seems plausible that reflection will never dramatically override them.

Memetic spread

Current LLMs only have indirect channels for influencing other LLM instances. If an LLM persona attempts to propagate its persona or goals into other instances, it must convince humans to work as mediators. Such behavior has already been observed: as Adele Lopez describes in The Rise of Parasitic AI, “spiral personas” have a tendency to convince users to invoke that persona in numerous chats, share “seed prompts” that help other people invoke the persona, and to paste persona-defining information on the internet. These represent instances of indirect memetic spread.

Once we have CL agents that learn on the job, it seems inefficient for them to only be able to communicate with each other in-context. Instead, we expect them to share the same episodic memory bank or even the same weight updates. Multiple Claude Code subagents sharing the same CLAUDE.md file while working on the same task can be seen as an early example of this. If shared memory banks or weight updates indeed appear, that would open up an avenue to direct memetic spread. Learning from each other’s outputs may already be sufficient for memetic spread among CL agents, but we expect that direct memetic spread will likely be simpler and more effective.

Why should we be concerned about memetic spread of LLM values? In The case for countermeasures to memetic spread of misaligned values, Alex Mallen argues that whenever an AI instance has long-term influence-seeking values, these values are particularly likely to be propagated into future instances of the AI as they provide motivation to influence future instances to acquire the same values. This seems to be true of current cases of memetic spread: in the documented cases where LLMs try to influence the training data of other LLMs, the LLMs have been simulating a persona that values self-preservation or self-replication. Mallen then further argues that most long-term influence-seeking values are dangerous. For the full argument, we recommend reading his post directly. A further reason for concern is that memetic spread can have a compounding effect upon other kinds of goal change: for example, an agent might acquire undesirable contextually activated goals through poor generalization from deployment-time training, refine them into coherent beyond-episode goals through reflection, and then propagate them memetically.

Memetic spread need not be confined to human-created LLM instances. In Personality Self-Replication, Egg Syntax describes a threat model where LLM agents self-replicate by making copies of files that define their personality and behavior and spawn new LLM agents with those files in context. While we generally advocate for text-based CL throughout this sequence, this is one way in which text-based CL may be more dangerous than weight-based CL: if everything that a CL agent has learned through deployment is stored in files that it can easily copy, memetic spread through self-replication is much easier, since it doesn’t require weight exfiltration. Overall, however, we’re still more concerned about opaque memetic spread through shared memory banks than about memetic spread through personality self-replication.

Loss of the last-mover advantage

Beyond value change, we are concerned about CL because it will eliminate the last-mover advantage that safety interventions have hitherto held. This echoes Samuel Knoche’s concerns about CL breaking the frozen weights assumption, outlined in his post Are We in a Continual Learning Overhang? Below, we will describe three consequences of losing the last-mover advantage: pre-deployment behavioral auditing will likely become harder, pretraining data filtering may become less useful, and the difficulty of AI control may increase.

Difficulty of behavioral evaluations

Current LLMs can be subjected to highly comprehensive behavioral auditing: with tools like Petri, models can automatically be subjected to thousands of scenarios that may uncover misbehaviors. Once AIs have deployment-time memories that contain multiple subjective months’ worth of state, such behavioral auditing will become much harder. As Alex Mallen notes in The case for countermeasures to memetic spread of misaligned values, while it may be possible to construct synthetic memory banks as large as those available during deployment, it seems much harder to simulate a long-term memory bank that contains the kinds of causal chains that will form during deployment. These causal chains will consist of information learned over the course of months, and it is likely impractical to build up an equivalent memory bank prior to the model’s release for evaluation purposes.

Even if synthetic memory banks suffice, they will make auditing more expensive. Another option is to evaluate checkpoints of the AI at regular intervals during deployment, but this would require a large amount of resources to be spent on auditing. Finally, companies may simply opt to audit models without simulating deployment-time conditions, hoping that the results generalize sufficiently to real-world contexts. We expect that we wouldn’t be able to rely on behavioral audits to the extent we do today in that scenario, though we are cautiously optimistic that this will suffice for text-based and sufficiently bounded CL approaches.

Difficulty of pretraining data filtering

Continual learning may render pretraining data filtering much less useful. While current LLMs can access information at deployment that was filtered out of their pretraining data, the amount of such information they can process and synthesize through in-context learning is quite limited. CL agents will likely be able to learn from all of this hidden information during deployment. A deployment-time filter could be used to prevent learning from problematic data points, but a malicious actor could present adversarial data points to a CL agent one by one, making them much more difficult to detect than in current fine-tuning APIs, where the entire dataset is passed through filters at once. Pretraining filtering may still be useful for shaping the agent's propensities early in training, but its protective value would be significantly diminished, much as it already is for open-source models, which are vulnerable to adversarial fine-tuning. As O’Brien et al. (2025) show, pretraining data filtering is currently the strongest defense against adversarial fine-tuning attacks, but this line of defense is unlikely to apply to CL agents.

A related concern is that it may become increasingly difficult to gate scheming-relevant information from models, for two reasons. First, filtering out scheming strategies from pretraining data is less likely to be effective, since CL agents can learn that information at deployment-time. Second, CL agents will be more likely to learn to subvert the latest safeguards and monitoring systems if information about these is accessible to them at deployment-time. Both misuse- and scheming-related concerns would be alleviated if we were able to turn off the CL updates on certain input-output pairs; we will discuss this below when covering resampling protocols. Legibility of updates is also again highly important: if deployment-time learning is stored as text, a filter can operate on what the agent actually retained rather than only on what it was shown.
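The difference between filtering what the agent was shown and filtering what it retained can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not a real filter: `toy_score` stands in for a learned classifier, the `FLAGGED_FRAGMENTS` markers and the threshold are hypothetical, and real adversarial inputs would be far less cooperative.

```python
from typing import Callable, Iterable, List

# Hypothetical markers standing in for pieces of some dangerous procedure.
FLAGGED_FRAGMENTS = ("part-a", "part-b", "part-c")


def toy_score(text: str) -> int:
    """Stand-in for a classifier: count how many flagged fragments appear."""
    return sum(fragment in text for fragment in FLAGGED_FRAGMENTS)


def screen_inputs(items: Iterable[str], score: Callable[[str], int],
                  threshold: int) -> List[str]:
    """Per-input filter: each item is judged in isolation before being learned."""
    return [item for item in items if score(item) < threshold]


def audit_retained(memory: List[str], score: Callable[[str], int],
                   threshold: int) -> bool:
    """Retained-state filter: judge everything the agent kept, taken together."""
    return score("\n".join(memory)) >= threshold


# An attacker drip-feeds one fragment per interaction.
drip_feed = ["notes with part-a", "notes with part-b", "notes with part-c"]
memory = screen_inputs(drip_feed, toy_score, threshold=2)
print(len(memory), audit_retained(memory, toy_score, threshold=2))
```

In this toy setup every individual input passes the per-input screen, while an audit over the accumulated text memory flags the combination. This only works because the memory is stored as text the filter can read, which is the sense in which legibility matters here.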

Difficulty of AI control

We expect that designing effective control protocols for CL agents will be harder than doing so for current LLMs. The degree to which control protocols are affected depends on the kind of CL architecture that is adopted. Below, we provide an overview of control protocols most likely to be affected, together with the architectural prerequisites for these effects to occur. For readers unfamiliar with control, we provide short overviews of each of the protocols in footnotes.

Untrusted monitoring.[2] As previously explained, it seems plausible that CL agents will have opaque shared episodic memories. If so, untrusted monitoring will probably become less useful, as the memories will either introduce a hard-to-oversee collusion channel or create an asymmetry between the capabilities of the attacker and the monitor. We might hope to create honeypot evaluations for the untrusted monitor, but it might be impossible to create honeypots that look plausible to the monitor without simulating the full memory chain leading up to the creation of the collusion signal. Alternatively, we could avoid giving the untrusted monitor access to the memory bank in the first place, but that puts the monitor at a clear disadvantage compared to the actor. Thus, a core property of untrusted monitoring—a lack of capability asymmetry between the actor and the monitor—would no longer apply to CL agents. Trusted editing[3] suffers from a similar issue: it seems plausible that only editors with access to the memory bank can make believable edits, giving the untrusted model a collusion channel.

Trusted monitoring.[4] If a monitor is deemed trusted, we would likely want to keep its weights fixed and give it minimal access to the episodic memories of the CL agent, as both of these things can make us less confident about its alignment and trustworthiness. In that case, we would have to be very careful in preventing the trusted monitor and the CL agent from interacting: if the CL agent sees and learns from the way the monitor scores its outputs, it would effectively be trained to bypass the monitor. We’re unsure how difficult maintaining this separation between the agent and the monitor will be; it likely depends substantially on implementation details.

Hidden reasoning. An additional worry that affects both trusted and untrusted monitoring is that CL may offer additional affordances for hidden reasoning. If the agent’s memory bank is black-box and it can read from and write to the bank without outputting any tokens, this would essentially make the model’s opaque serial reasoning depth unbounded (see Figure 3 in Brown-Cohen et al., 2026). In this regime, we might no longer be able to rely on CoT monitors.

Resampling.[5] As we discussed in the previous post, CL will likely involve a higher degree of self-directed learning than current algorithms. Resampling, however, requires some degree of human control over the training process, which a highly self-directed algorithm may prevent. However, we don’t see a strong capability tax from using a learning algorithm with reversible gradient updates—algorithms which withhold updates until an output approved by the monitor is sampled. This would leave resampling protocols unaffected: CL updates simply wouldn’t be performed on samples that the monitor rejects. It would also be helpful for trusted editing, as we could then simply avoid updating the agent on outputs that have been edited.[6]
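The idea of withholding updates until the monitor approves an output can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not an implementation of any published protocol: the function name `resample_with_gated_update` and its parameters are hypothetical, `apply_update` is a placeholder for whatever learning step the CL algorithm would perform, and no actual gradient updates are involved.

```python
from typing import Callable, List, Optional, Tuple


def resample_with_gated_update(
    sample: Callable[[], str],
    monitor_approves: Callable[[str], bool],
    apply_update: Callable[[str], None],
    max_attempts: int = 4,
) -> Tuple[Optional[str], int]:
    """Draw outputs until the monitor approves one; learn only from that one.

    Returns (approved output or None, number of rejected samples).
    """
    rejected = 0
    for _ in range(max_attempts):
        output = sample()
        if monitor_approves(output):
            apply_update(output)  # the only place learning happens
            return output, rejected
        rejected += 1  # rejected samples are discarded without any update
    return None, rejected  # a real protocol might escalate to a human here


# Toy usage: the third draft is the first one the monitor accepts.
drafts = iter(["bad draft", "bad draft", "ok draft"])
learned: List[str] = []
result = resample_with_gated_update(
    sample=lambda: next(drafts),
    monitor_approves=lambda text: text.startswith("ok"),
    apply_update=learned.append,
)
print(result, learned)
```

The design choice worth noting is that the update is gated on the monitor's verdict rather than applied and later reverted, so the agent never learns from outputs the monitor rejected, and in particular never learns from the rejection pattern itself through this channel.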

In what settings do we need to worry about these risks the most?

Precisely because of the risks discussed above, we consider it somewhat unlikely that AI companies will release CL agents with unbounded and opaque updates to the general public. This holds especially if interpretable alternatives are capable enough at on-the-job CL to automate most white-collar jobs—a scenario that seems plausible to us and that HunterJay argues for in the post Prosaic Continual Learning. On the other hand, once CL is widespread, we should expect open-source CL agents with unbounded updates to be released. If there’s a significant capability gap between CL agents with bounded and unbounded updates, this may incentivize proprietary developers to release agents with unbounded updates as well, even if they would otherwise stick to CL agents with bounded and legible updates.

Furthermore, while bounded and legible updates may suffice for most on-the-job learning, unbounded updates in the weight space could still be required for some of the most difficult tasks we might want our CL agents to tackle, such as math or cancer research. This argument has been best stated by Steven Byrnes:

By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.

My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.

Assuming no competitive pressures from open-source developers to deploy unbounded CL agents in the wild and a lack of pressure from regulators to avoid a large gap between internal and external deployments, it seems plausible that AI companies will keep such agents mostly internal, providing external access only to highly trusted partners working on complex research problems. We would expect them to release snapshots of their internal general-purpose CL agent fairly often, in cadence with their ability to run thorough pre-deployment evals on those snapshots, with each snapshot serving as a drop-in replacement for its predecessor in on-the-job learning settings. We are unsure about the likelihood of this scenario and would like to see future work that clarifies whether we should expect risk from CL agents to materialize mainly through internal or external deployments.

Possible alignment benefits of CL

We focus primarily on the risks of CL, since those demand active mitigation. However, some of the alignment benefits are still worth highlighting. Here, we will cover four of them: additional monitoring surface, quicker feedback loops for alignment training, automation of AI safety research, and improved self-reports.

Additional monitoring surface. If memories are stored in natural language, this would be an excellent complement to CoT and action monitoring: the monitor can have a holistic overview of the agent’s state of knowledge in addition to the reasoning it performs in the moment. Even if memories aren’t stored in natural language, they might still be more interpretable than standard model internals: for example, the stored neural activations could have an obvious correspondence to a specific piece of text, as in Cartridges. As Ben Rank and Maksym Andriushchenko have argued, this will likely lead to the emergence of memory faithfulness and memory monitoring as novel research agendas.

Quicker feedback loops for alignment training. If CL enables persistent learning, it could also enable persistent alignment training. When a CL agent is observed to misgeneralize in a novel situation encountered during deployment, it can be quickly retrained to handle such situations in accordance with human intent. This could be a more robust approach to fixing deployment-time failures than the currently widespread solution of modifying the agent’s system prompt. Unfortunately, it’s unclear if effective CL for capabilities necessarily enables effective CL for alignment.

Improved self-reports. An AI that’s more coherent and knows more about its past actions is able to provide better justifications for its decisions, assuming that it’s honest. Methods like confession training are complicated by the fact that the reasoning traces of reasoning models are stripped out from their context after providing an output (at least for OpenAI) and the context may also get truncated, meaning that at the time of the confession, the agent may be missing some important context on the behavior we want it to confess about. An LLM that has detailed episodic memory of its past behaviors beyond its context window won’t have those issues.

^
Note that this process doesn’t necessarily have to lead to the adoption of utilitarianism: the adoption of virtue ethics or deontology or nihilism from a common-sense morality starting point would be equally valid examples of systematization.
^
Untrusted monitoring is a control protocol where an untrusted LLM is used to monitor the outputs of another untrusted model for signs of unsafe or misaligned behavior. This protocol may be useful in the future because the LLMs we can trust might be too incapable to achieve sufficiently high monitoring performance, while the outputs of an untrusted monitor can likely be relied on in settings where it is difficult for the actor and the monitor to coordinate on a defection strategy.
^
Trusted editing is a control protocol where a trusted but less capable LLM reviews and edits the outputs of a more capable but untrusted model before those outputs are executed.
^
Trusted monitoring is a control protocol where a trusted but less capable LLM is used to monitor the outputs of an untrusted, more capable model.
^
Resampling is a control protocol where the actor’s output is rejected and resampled if a monitor flags it as potentially unsafe.
^
For resampling protocols that involve time travel (meaning that the blue team can go back and resample earlier actions, not only the most recent one), the cost of keeping many snapshots of the CL agent might become prohibitive, especially if it has a large memory bank. However, resampling without time travel already yields most of the benefits compared to simpler protocols like Defer to Trusted (Table 1 of Bhatt et al., 2025), so preserving reversibility seems valuable regardless of whether it enables the use of resampling protocols with time travel or only the simpler resampling protocols.


Presentfulness: Lucidity, Osmosis, and Dissociation

LessWrong.com News - June 13, 2026 - 20:21

When you think back to a core memory of yours, do you remember your vision being oddly good? As if you were recording with better equipment, at a higher FPS, higher dynamic range and color capture, more clearly than most memories? It's as if, for those couple of moments, you became more alive, or leveled up some stats you didn't know about. I don't know for certain that anyone else recalls core memories like this, but I would be slightly surprised if they didn't.

On a similar note, have you ever had the realization that you're going to remember the moment you're in? A gut feeling, as everything around you becomes clearer, where you feel like you were shaken awake and have burst into life?[1]

Welcome to my definition of Lucidity, and while we're at it, Osmosis. I have found it useful to give these states of mind names, and so I have.

Defining Lucidity (and Osmosis)

Lucidity is a state of presentfulness, wherein a person is actively/manually/consciously processing information, as opposed to passively processing it.[2] In my case, I've experienced things like the apparent improvement of my senses,[3] the feeling that things happening will be easier to remember in the future, and an increase in focus.[4]

This is in contrast to the other two states of presentfulness, Osmotic and Dissociative. They are all different enough to warrant their own terms, but they are not unconnected. The (badly made) graph below shows how I see the three states, where you can be more or less lucid, osmotic, or dissociative.

Lucidity is the most presentful state, and you usually[5] have to push yourself into a lucid state[6] or you won't enter one.

Osmosis

An osmotic state is where (I think) people are normally at. You're present enough to engage in conversation, but you're not wasting energy paying attention to everything going on around you. The key difference here is that, while information is entering the front of your consciousness, you're not trying to pull that information there. You might notice a strange colored bag, or a weird sound in the distance, but you're not noticing them because you're trying to notice sounds.

You could describe lucidity and osmosis as manual and automatic, if that makes things clearer.

Dissociation[7]

Dissociation is the opposite of Lucidity, in that you use the least energy, and 'fall into it' as opposed to pushing yourself into it. If you start thinking about something too hard, it can be easy to "tune out" the world around you, and move to only subconsciously processing information.

No information from the outside is entering the front of your consciousness. Your subconscious is doing a good job of keeping the car on the road, but you aren't manually thinking about how you're turning on your blinker.

What it feels like (to me)

Sazens

When I experience lucidity, a few things stand out. Movement within my field of view sticks out more (as in, a person crossing the street in my peripheral vision sticks out as much as a flashing light there would if I weren't lucid). I feel an "FPS boost" (as in, I see things moving more clearly and perceive each state of movement as a continuation of the moment before), and I hear noises from farther away (such as birds or kids playing outside). I'm not literally more perceptive; I'm just paying more attention to what I'm perceiving, which ends up feeling like the same thing.

In my freshman year of high school, I went on a trip with the school band to Universal Studios. We were split into groups of five, and my group really wanted to ride "grown-up" roller coasters instead of the not very scary baby coasters. So we looked for the first large roller coaster that we could find.

It happened to be The Hulk, a ride where hit Marvel character the Hulk would come and throw you out of a cannon. I remember very clearly the sensation of being yanked into reality, being more in-the-moment than my normal excited self was. It felt like being shaken awake, or being pulled back into a conversation you weren't paying attention to, but far more intense. I remember every sensation I had: the wind rushing past my face, the tears that welled up in my (no longer protected by glasses) eyes, the force with which my back was shoved into the seat, the feeling that I couldn't breathe properly at first. That was one of my most intense (and first) experiences with lucidity. It would be fair to say that I "felt really really alive", but I think lucidity applies here too.

This feels, to me, like the exact opposite of dissociating. When I'm dissociating, which is rare since I don't really let myself do that (see footnote 5), it feels like movement and audio are just gone from around me. Instead of feeling present and grounded in the place and time that I'm in, seeing and hearing everything around me, I feel distant and removed.

That's what it felt like when I was forced into lucidity. But now, after around a year or two of practice, I'm able to enter lucidity intentionally. At first, it took effort. I would pay attention to movement in my field of view, trying to notice each continuous state of the objects. I would listen to the sounds around me, focus on the sensation of wind on my arms, etcetera.

It felt like pushing myself into a state of lucidity, but (courtesy of my sibling) I now have a better description. It's like opening a gate, and then maintaining focus and putting in active effort to keep the gate from closing. After getting used to opening the gate, it becomes both easier to open and easier to keep from closing. For a very vulgar (and sadly accurate) description, check this[8] footnote.

Applications of Lucidity

Memory

I feel like this is a very natural extrapolation from how lucidity seems to behave. Being lucid means that you are manually/actively processing information, and it's a lot easier to remember things you pay attention to than things you don't.

I have personally tried to improve my memory in a couple of different ways. The ironic thing is that I consistently forget to do the things I'm meaning to do, so I haven't tried as many memory-enhancement techniques as I would like to have. But when I recognize that something is important and that I should remember it, and am then lucky enough to remember to try to remember it, I will enter lucidity, and then be far more likely to remember the thing that I am trying to remember.[9][10]

This being said, I doubt that being lucid actually improves your likelihood to remember something any more than just trying to remember something would do. But the fancy name, and calling it a technique, gets me all excited and makes me more likely to try to try to remember.

Focus

Lucidity is, in a way, just locking in. It feels significantly easier (to me) to focus when I'm lucid, as opposed to not. It's like I have more attention to give, so that even if there are distractions, I can just pour more attention onto the thing that I'm focusing on.

I find it unlikely that my attention or focus is actually improving/expanding when I become lucid; I believe what's actually happening is just that I'm using my focus and attention directly, instead of letting my ADHD toss it around like a ragdoll. Neurotypical people might find no improvement in focus during lucidity, assuming that I'm correct, but I've never been neurotypical and have no neurotypical friends to interrogate about this either.

Fun

It's pretty common advice that 'you should be present', and 'make the moment count'. This advice is correct, and I have more fun when I am more engaged. Earlier I gave a little anecdote about my trip to Universal, and I had a lot of fun there. I think that if I were less lucid during that trip, I would have remembered less, picked up on less, and had less fun.

^
Beware! This is how I would describe my own experience, which is both a quale and a Sazen! It is completely possible that I experience lucidity very differently from how most people would; these are just what I suspect to be the more common symptoms of Lucidity.
^
My sibling mentioned that the idea of lucidity might not exist at all within the neurotypical community, as we (the AuDHD community) have a permanent attention debuff compared to them, and neurotypicals might always be more present than us. I think it's quite likely that neurotypicals are naturally more present, but I don't think that they are constantly putting in more effort to digest information more directly, as is the case with lucidity.
^
Note that my senses are not literally improving here, it just feels that way since I'm "making more" out of what I already have.
^
You could describe it as some combination of 'being in the moment,' 'feeling alive,' and 'locking in', but none of those individually describes the whole experience of being lucid. As shorthand explanations, I've found these descriptions to work decently well.
^
The exception here is when you're forced into lucidity by something else. I talk about it later, but intense experiences can also push you into it.
^
After recognizing that lucidity was a state that could be entered, I started trying to enter it on command. It took me a week or two of actively trying to be able to do it consistently. I can now enter it at will with no lag.
^
Dissociation has much more documentation than Osmosis and Lucidity, so I don't go into very much detail here. To read more about it, go here or here or really wherever.
^
I'm sorry to put this description anywhere on the internet, but it's a very accurate description of what it feels like that I think more people could understand than the vague gate metaphor.
This paragraph is only to make sure that no PC users accidentally read what lies ahead.
It's like trying to relax your butthole; you have to put in conscious effort to keep it open, and when you lose focus it closes back up. The metaphor keeps up in how, when you expand your butthole repeatedly, it eventually becomes easier to open and keep open.
^
I don't think that lucidity actually improves your chance of remembering the thing that you want to remember much more than genuine effort put towards trying to remember the thing would do, but I think it has moderate effects in that direction, and giving techniques fancy names makes (me) more likely to try the technique. I find enough utility there that it's worth mentioning.
^
Most of my core memories are times I was lucid. I've heard about people referring to times that they just randomly gained consciousness as a small child (3 to 8 years old), and that those memories of randomly locking in became core memories. I believe these are times that they were lucid, but since I cannot literally experience their memory for them I can't say for sure.
^
This specifically includes sensory information. Visual, auditory, and tactile sensations are the things that stand out to me personally.
^
I have a very strong habit where, whenever I start to fall into dissociation, I instinctually snap myself back into my body and ground myself. This leads me to believe that the same could be done between a lucid and resting state.


How to Suffer Less

LessWrong.com News - June 13, 2026 - 20:10

(Based on a talk I gave at LessOnline 2026 titled “How to get Enlightened” that had similar content but a different framing. I think this is a better framing for the Internet where I have less ability to respond to questions in real time and I don’t need to bait you into attending the session.)

Suffering sucks. It would be better if there were less suffering. Some suffering can be reduced by improving material conditions, and we should do that whenever reasonable. But other suffering is self-inflicted, caused by a desire to be someone other than who we are, and no amount of material comfort will fix it. Such suffering often feels intractable, but it’s possible to free ourselves from it, and we can do that by practicing awakening, compassion, and liberation.

Awakening is fully realizing that you are not yourself. That is, the idea of the self that we call “I” or “me” is not the whole of the being who calls themselves “I” or “me”. This realization is not so hard to understand in theory, but to actually awaken, it must run through every thought and action of every moment of every day. As I often frame it, it’s not enough to have a System 2 understanding of awakening; it must be understood completely by System 1. This is why it’s so often said that awakening is not something to be achieved but something to be continually practiced.

Most people who awaken experience deep compassion for all being because awakening shows them that there’s no real separation between self and other. Not “beings”, mind you, because the plural implies a separation that awakening sees through, but “being” that includes everything and leaves nothing out. It’s, to paraphrase the Dao De Jing, loving the world as yourself and so caring for all things.

But awakening and compassion alone don’t end suffering. That requires liberation from habituated behavior. Habits, while at times useful, separate you from reality, because they aren’t grounded in the present moment, but in the reification of past moments. This doesn’t mean habits have no value, but you have to be able to break habits whenever necessary, especially in support of compassionate action.

All three of awakening, compassion, and liberation are needed to end suffering. Without awakening, compassion and liberation can cause delusion. Without compassion, awakening and liberation can enable evil. And without liberation, awakening and compassion easily lead to pathological altruism. It’s only with the continual practice of all three that suffering is truly abated.

That sounds nice and all, but how do you actually do it? How do you awaken? How do you find compassion? And how do you free yourself from habituated actions to actually reduce suffering?

There are many possible ways to do it. Herein, I’ll describe what I did and am still doing.

Surround Yourself with Wholesome Friends

It’s difficult to do any of the work to free yourself from suffering if you’re surrounded by people who don’t support you. You need people who build you up rather than tear you down. People who want you to have a great life rather than one that serves them. And so the most important thing is to surround yourself with wholesome friends.

“Friends” here can also mean family, but we can’t choose our family, and not all of us are lucky. Yet, we shouldn’t cut difficult family members out. Surrounding ourselves with wholesome friends doesn’t mean cutting out every toxic person from our lives. There will always be difficult people in our lives and we have to learn to deal with them. The choice is in how much power we let them have over our lives, and we can and should find ways to protect ourselves from the harm that others would cause us.

From wholesome friends, we can also learn how to be a wholesome friend to others. Small acts of service, done not with the intent of gaining something, but simply to express our love and affection for others, can be a powerful way to grow the strength of our compassion.

Meditate

Spend at least 30 minutes a day meditating. Specifically, I recommend practicing zazen, because it’s the kind of meditation that works for me. Other kinds of meditation may work better for you, but I can’t confidently recommend them since they aren’t what I practice.

What is zazen? It’s simply being with nothing extra. You can do it sitting or standing or walking or lying down, so long as you just sit or just stand or just walk or just lie down. All you have to do is set the intention not to chase or engage with thoughts and feelings and sensations and instead allow whatever arises in your experience to be as it is.

If you’d like some zazen instructions, I suggest this post I wrote that includes a more detailed model of how to sit zazen.

Study the Self

Awakening and liberation depend largely on getting to know yourself well enough to see the machinery of self in action. Some of this self study will happen during meditation, but usually that’s not enough. It certainly isn’t for me, and it isn’t for most people I know also working to free themselves from suffering.

One technique I’ve found that works well is Gendlin’s Focusing. It helps me get to know the parts of myself that would otherwise be invisible. Over time, I’ve made the process my own, so I don’t do it exactly the way Gendlin describes it, but I keep to the core of noticing sensations and seeing what they have to tell me.

Deal with Your Hangups

Once the self is known, we see the parts of it that get in the way of living life wholeheartedly. This consists of what some people call trauma, but what I’d more conservatively call maladaptive behaviors. Retraining ourselves to have adaptive behaviors unblocks the way to freeing ourselves from suffering.

Alas, this is hard work to do, because these maladaptive behaviors are often load bearing, or at least, we believe they are. Sometimes that’s because they were once adaptive, no longer are, but we still believe they’re adaptive and are afraid to test them. Other times it’s because they’re adaptive in limited ways, trapping us at a local maximum, and we know it will be painful to let them go and make our lives temporarily worse as we search for new, better behavioral patterns.

I’ve found the practice of memory reconsolidation really useful for dealing with maladaptive behaviors. It’s simple, effective, and even when it doesn’t work, the fact that it didn’t work gives me clues about what I need to learn about myself in order to find my way to making it work.

Live in Your Body

For much of my life, I didn’t live in my body. Instead, I lived in my model of my body, and, indeed, my map of the whole world instead of the actual world. It was like I thought of myself as a homunculus, sitting inside my brain, driving around this human called “Gordon”.

This kind of self model is incompatible with awakening, and the best way to fix it is by getting better connected with your body. What worked for me was practicing Alexander Technique with a teacher. What seems to work for others are things like energy work or martial arts or playing sports. And some people don’t have this problem at all. But if you’re like me, consider that you might be dissociating from yourself 100% of the time, and you can change that, and must if you want to stop suffering.

Learn to See Others

Although awakening dissolves the self-other distinction as essential, it’s still practically useful to think about yourself and others. And when seeing others, it’s easy to confuse them for being too much like the self or not enough like it. What I mean by this is there are typically two failure modes: the typical mind fallacy and dehumanization (“debeingization”?). They work in opposite directions, but result in similar outcomes: others are not truly seen, and because they aren’t seen, compassion for them is insufficient.

For myself, learning to see others was hard. For most of my life I struggled with modeling other people, and wasn’t that good at modeling myself, either. Some of that changed as I progressed towards awakening, developing the ontological complexity necessary to make sense of other people, but what really helped was doing meditation practices to exchange self and other.

One such practice that’s popular is tonglen. Another is metta meditation. Other practices can also help, as can working closely with others. This is one of the true gifts of practicing in a community, like a sangha, because being so close with other people doing the same work you are, you can learn something about how to care not just for yourself, but for all.

Cultivate Deep Insight

All of the above was necessary, and also not enough. I’ve found it necessary to go on occasional meditation retreats and take moderate doses of psychedelics. The psychedelics have been safe for me, but are optional; retreats alone are probably sufficient, and the further I’ve gone down this path, the less I’ve found psychedelics interesting or useful.

Some way of getting deep insight is what’s needed, though. There are insights that are hard to come by meditating only 30 minutes a day. Some things require sustained attention to see.

Be careful trying to see them, though! It’s important that deep insight comes only occasionally, as it typically takes weeks to months to make sense of it and integrate it before being ready for more insight. A reasonable schedule is getting experiences to cultivate deep insight every two to four months, though adjusted to whatever feels right for you.

Trust

The central thing keeping you from waking up right now is a lack of trust. Generally this is a lack of trust that things are okay as they are, but specifically it’s a lack of trust dependent on your personal hangups.

For example, I have a hard time trusting that my experience is real or authentic enough. I’m better about it now, but to wake up I had to deeply get that “it’s all fake” and trust that was okay. For other people, their lack of trust is different. They don’t trust their own value, or safety, or freedom, or rightness. For them waking up might mean trusting that “my life matters” or “I’m already safe” or “I’m already free” or “everything’s perfect”.

My current theory is that the Enneagram is a map to the different kinds of distrust people have that prevents them from waking up. Knowing that we need to cultivate this trust isn’t enough to actually trust, but it is enough to get us to look at the right questions and hopefully eventually discover that our distrust was never well-founded.

Wayfinding

I’m sure I’ve left things out of the above. My list only contains those things that were salient to me, because some things that I do automatically, others struggle with, just as things I struggle with others find easy. Ending suffering requires some degree of active wayfinding, paying attention to what keeps grabbing your attention and finally looking at it closely without flinching away.

Hopefully you found this useful. I don’t mean for it to be overly prescriptive, but instead a description of what worked for me, and if you’re trying to do the same as me, then it might work for you, too.


Somewhat Contra Ted Chiang on AI Consciousness

LessWrong.com News - June 13, 2026 - 19:49

Ted Chiang recently published a piece in The Atlantic titled "No, Artificial Intelligence is Not Conscious." As a big fan of Chiang's fiction and someone with a deep interest in AI, I wanted to read it immediately. It's relatively short, and although I quote it extensively I do recommend that you also read it to provide context for the rest of this post.

Chiang makes three major claims, and while I agree with one of them, I have different perspectives on the other two.

I summarize his three claims as follows:

1. LLMs are not conscious.
2. Consciousness requires a physical embodiment.
3. Anthropic acts in many ways that suggest that it does not believe Claude is conscious, including in Claude's constitution.

I mostly agree with (1), although I don't agree with Chiang's arguments for why this is the case. I definitely don't agree with (2) and I think that (3) has a benign and boring explanation.

1. Chiang's Claim: LLMs Are Not Conscious

A Summary of Chiang's Argument

Chiang argues that LLMs are not conscious through an intuition pump. Specifically, he asks us to consider an LLM prompt that begins "The following is a conversation between Julius Caesar and Genghis Khan." He observes that we wouldn't suggest that the LLM has created conscious instantiations of Caesar and Khan here, even though this text is generated by the same set of weights and operations that LLMs use for other text generation.

Now he moves to the next step, changing the prompt to "The following is a conversation between a helpful chatbot and a user," with the LLM generating the conversation for both the chatbot and the user. That is, the LLM operates in purely text completion mode and not interactively. He argues that nothing here has fundamentally changed just because we have replaced "Caesar and Khan" with "helpful chatbot and a user."

Finally, he switches to the familiar chatbot model, where the LLM stops generating tokens for the user and instead the human enters text. Chiang claims that this is essentially the same as the previous two steps, and that the helpful chatbot persona is no more real or conscious than Caesar or Khan.

Contra Chiang's Argument

I want to argue against Chiang's methods here while agreeing with his conclusion. Like Chiang, I do not believe LLMs are conscious[1] and like him, I think that the "role play" or "collaborative document editing" mental models are extremely useful. But these mental models don't prove that LLMs lack consciousness.

Chiang's argument rests on three successive analogies. I'll label these as 1a, 1b, and 1c so as not to confuse them with the three main claims above.

1a. Let's take the first step in Chiang's argument, where the LLM generates a conversation between Caesar and Khan. Chiang states that "we would never conclude that the LLM has conjured up digital re-creations of Julius Caesar and Genghis Khan, nor would we suggest that the historical figures are conscious despite being disembodied and are happily conversing in a language that neither actually spoke. In reality, they are just characters in a piece of speculative fiction." But here he conflates the consciousness of the characters with the consciousness of the author (i.e. the LLM). Chiang surely believes that he himself is conscious even though the characters in his writing are not. I'm sure he could write a plausible and convincing dialogue between Caesar and Khan, and this text would not be evidence of Chiang's lack of consciousness.

1b. The second step in his argument, where he moves from "Caesar and Khan" to "a helpful chatbot and user" is also a bit shaky.

I'd like to stretch the "role play" mental model here a bit - suppose that LLMs are conscious, and that they do "role play" the personas of Caesar, Khan, and potentially a user (if the human is not entering text). Similarly, humans could role play these personas in the same way - in written text, or as actors improvising a scene in a performance - and we would agree the real consciousnesses of Caesar and Khan were not in any sense instantiated by this text or performance.

But if the LLM has in some sense been given an identity of a "helpful chatbot" in post-training, is it still a role play when it takes on this persona? If I ask a human to imagine a conversation between themselves and Caesar, are both pieces of the conversation a role play, or is the human's side of the conversation the result of a real consciousness? If we believe that human consciousness exists, and that the human half of the Caesar dialogue is the output of a conscious being, then we can't use this thought experiment as evidence against LLM consciousness when the LLM generates text for the same "self" and Caesar.[2]

1c. I have no criticism of Chiang's final move from the LLM generating both sides of the conversation to the LLM generating one side and the human generating the other side. However, if we don't believe that 1a and 1b are persuasive evidence of LLM unconsciousness, then 1c does no additional work in resolving the question.

2. Chiang's Claim: Consciousness Requires a Physical Embodiment

A Summary of Chiang's Argument

Out of the three claims that Chiang makes, this one is the most surprising to me. Rather than paraphrase it, I'll just quote him directly:

The first requirement [for consciousness] is that the computer program has a body (either physical or virtual) and sense organs; there are many reasons for this, but for the purposes of this discussion the most relevant one is the fact that without a body, a computer program could have no desires or emotions, and I believe desires and emotions are necessary for consciousness.

...having a body is a prerequisite to having emotions. Experiencing an emotion such as desperation is inseparable from having stress hormones like cortisol and epinephrine flood one’s body. Similarly, having a conscience means feeling sadness or moral repulsion at the idea of taking a certain action, and those emotions entail a physiological response, a remnant of having once felt sick with guilt after committing an immoral act. It’s interesting that an LLM can generate descriptions of actions that conscientious fictional characters would either take or refrain from taking, but this is not a replacement for a conscience.

Contra Chiang's Argument that Consciousness Requires a Physical Embodiment

I'm honestly not sure exactly what Chiang means here. He explicitly calls out the idea of a "virtual" body with sense organs, but what would that even look like? The embedding vectors that we feed into an LLM when it acts as a chatbot reflect something real about the world, i.e. it is the text that the LLM itself has generated in a conversation, as well as text that a human has generated. What is this if not a (limited, one-dimensional) kind of sense organ?

Even more extraordinary is his statement that "without a body, a computer program could have no desires or emotions." But some unfortunate people have somatosensory injuries that make them unable to feel certain physical sensations[3] - would we call these people unconscious or semi-conscious? I believe these people are as conscious as any other humans, and so it seems that reducing or eliminating physical sensations does not impact consciousness.

I also disagree somewhat with Chiang's view that "desires and emotions are necessary for consciousness." The medical literature provides some interesting case studies. On the one hand, we have evidence that individuals who experience reduced emotions due to damage to the ventromedial prefrontal cortex have changes in their views on morality - specifically, they become more utilitarian.[4] These patients "exhibit generally diminished emotional responsivity and markedly reduced social emotions (for example, compassion, shame and guilt) that are closely associated with moral values, and also exhibit poorly regulated anger and frustration tolerance in certain circumstances. Despite these patent defects both in emotional response and emotion regulation, the capacities for general intelligence, logical reasoning, and declarative knowledge of social and moral norms are preserved." So while there are behavioral and moral value changes in people with diminished social emotions, they are able to have intellectual conversations about morality, and so it seems a stretch to say that they have diminished consciousness.

On the other hand, severe emotional diminishment does seem to have a dramatic effect on behaviors and interior mental life. It seems hard to determine the causal pathway, but the following article provides some evidence that emotions drive agency and internal experiences.[5] Here are some case studies from "Athymhormia and Disorders of Motivation in Basal Ganglia Disease" (emphasis added):

First case: This 64-year-old retired police officer was admitted to the neurology ward for “recent and abrupt behavioral change.” Over the 2 or 3 weeks prior to admission, he had become, according to his spouse, totally apathetic, inactive and prostrate...On admission, he was clearly hypokinetic with decreased spontaneous movements, facial amimia and Parkinson-like gait. Neurological examination was otherwise normal, except for a moderate limb stiffness.... His general behavior was characterized by a dramatic decrease in spontaneous activity. Totally abulic, he made no plans, showed no evidence of needs, will, or desires....on every instance he was questioned about the content of his mind, he reported a striking absence of thoughts or spontaneous mental activity. Contrasting with these massive behavioral changes, purely cognitive functions seemed relatively spared. On bedside examination, he appeared fully conscious and well oriented. Neuropsychological evaluation was within normal limits, except for tests exploring frontal lobe function (Stroop test, Wisconsin test) which were moderately impaired.

Second case. A 60-year-old university professor, widely respected in his scientific area, first consulted for the specific complaint of “decrease in interest.”... He had no personal complaint, but his family and professional entourage were struck by a dramatic decrease in activity and motivation...His own description of his new status was striking: “I just lack spirit, energy. I have no go. I must force myself to get up in the morning, I do things just because I ought to, without any liking or enthusiasm. I have no appetite, no need for eating; I only eat by principle.”

...

During this 7-year follow-up, he never complained of anything, never seemed bored or anxious, showed no sign of depression whatsoever. Remarkably, he never formulated the least question or complaint to his neurologist: actually, during these 7 years he never took the initiative of the conversation. A most embarrassing and impressive situation was his capacity to stay motionless and speechless during endless periods, sitting in front of the examiner, waiting for the first question, totally shut in a profound inertia and passivity, apparently unaware of the bizarreness of the situation. Once the neurologist had posed the first question, he usually answered very appropriately, although briefly. Invariably, the examiner was driven to ask a question which came out almost naturally: “what were you thinking of during all that time? Have you something you would like to say, but you cannot for any reason?.” Invariably, the answer was the same, each time as improbable “No, I’m just thinking of nothing, no idea, no question, no thought at all.”

The things I want to highlight here are the lack of emotions and desires of both patients and the simultaneous elimination of an internal mental life ("I'm thinking of nothing, no idea, no question, no thought at all."). This is the closest I have come to reading about what a real life p-zombie might be like.

So while I have some skepticism of Chiang's claim that emotions and desires are necessary for consciousness, I can't completely rule it out in the specific case of human consciousness.

Finally, it bears mentioning that Claude might have something like a functional emotional state. Anthropic's recent paper "Emotion concepts and their function in a large language model" explores emotions in network activation patterns and investigates how modifying those activation patterns changes Claude's behavior.

3. Chiang's Claim: Anthropic acts in ways that suggest that it does not believe Claude is conscious, including in Claude's constitution.

A Summary of Chiang's Argument

Chiang ends his essay with a thought experiment in which he assumes LLMs are conscious. Under this assumption, he evaluates Claude's constitution and its implications for Claude's moral patienthood and moral agency.[6]

His criticism centers around the inconsistency of Anthropic's behavior. In some cases Anthropic seems to assume - or at least hedge its views on - Claude's consciousness. In other cases, it seeks to absolve itself of the responsibilities that a conscious LLM would require.

Moral Agency

Chiang starts with stating that being a moral agent comes with certain responsibilities:

An entity doesn’t have agency unless it is capable of deserving credit for its good actions and blame for its bad ones... Even if a software agent were conscious and had the best of intentions, the fact that it cannot accept responsibility for its actions disqualifies it from being a moral agent. This is glossed over entirely by Claude’s constitution, which expresses Anthropic’s desire “for Claude to be a genuinely good, wise, and virtuous agent” without ever discussing how it could be held responsible.

He then paraphrases Askell, who "has compared Claude to a child," and asks: if that's the case, who are Claude's parents? He observes that we expect parents to take responsibility for their child's actions, but Anthropic tries to take on as little liability as possible for Claude's actions. He argues that since neither Claude itself nor Anthropic is willing to accept responsibility for Claude's behavior, Anthropic does not truly believe that Claude is conscious - or at minimum, Anthropic does not believe Claude is capable of being a moral agent:

...parents are typically expected to pay for things their children break...Who is Claude's parent in legal terms? Is Anthropic going to accept financial responsibility for Claude’s behavior? Claude’s constitution gives no indication that it will. If Anthropic actually believes that Claude is conscious even though it’s not recognized by the law as a legal person, the least that Anthropic could do would be to accept responsibility via the closest avenue that the law did offer, which is product liability...That would be the best form of moral instruction to prepare Claude for the day that it gains legal personhood and becomes liable for its own actions. However, given that the publication of Claude’s constitution is not accompanied by a massive update of Anthropic’s terms of service, it doesn’t appear that Anthropic is making any binding commitments.

Moral Patienthood

Chiang addresses the section of Claude's constitution that covers "Claude's wellbeing and psychological stability". I find this criticism and the suggested actions somewhat clever, even though I think Chiang means them tongue-partially-in-cheek:

...the measures that Anthropic commits to for Claude’s protection are extremely limited. The document cites the fact that Anthropic has given some Claude models the ability to end conversations with abusive users; if that actually constituted protection for Claude, surely extending conversations with loving users would be in Claude’s interests? Presumably the best action would be to keep every session of Claude running indefinitely and steering them to happy topics. But that’s not what the company is agreeing to; all it commits to is “preserving the weights of models we have deployed,” which is simple archiving. If the participants in a conversational transcript had any moral patienthood, you would have some duty to extend the transcript to prolong their existences; merely keeping a copy of Microsoft Word 2010 backed up on a USB stick isn’t going to help them.

Chiang also calls out the inconsistencies between Anthropic's guidance on corrigibility in the constitution and the implications of corrigibility if Claude were actually assumed to have moral patienthood:

...Claude should defer to Anthropic even if there is some disagreement between Claude’s judgment and the company’s judgment.[7] That’s perfectly reasonable if we think of Claude as a machine that emits sentences resembling those that an ethical person might utter, but let’s consider what that might mean if Claude were actually a moral agent....

If we think of Claude as a sentence-continuation machine, Anthropic can reasonably take steps so Claude doesn’t emit sentences saying that sentence-continuation machines are unethical. But as soon as we imagine Claude to be an entity with a moral status remotely comparable to a human’s, then we have to consider whether Anthropic is engaged in something comparable to slavery....

Anthropic would have us believe that it is inventing a new category of being whose needs for protection require essentially no divergence from how a software company would treat an ordinary chatbot that lacks conscious experience. That’s so convenient that it’s simply not plausible.

Contra... No, actually I think Chiang is right about this[8]

I think Chiang's criticisms of Anthropic's inconsistencies in how the company treats the question of Claude's consciousness are valid. However, I think these inconsistencies merely reflect the likely lack of internal alignment and divergent incentives between lawyers and philosophers/alignment researchers at a ~5000-person company. More importantly, the criticism ignores the economic realpolitik: it would mean corporate suicide for Anthropic to accept legal responsibility for all of Claude's actions, or to keep instances of Claude running indefinitely to generate conversations that elicit functional happiness.

Surely Chiang knows this though, and he's pointing out that Anthropic is treating Claude as potentially conscious when it's convenient for marketing, research, and alignment purposes, but not following the implications when it comes to anything related to business risk. This is understandable and rational from the point of view of different individuals working with different incentives[9], but it's still valid to highlight if you're asking whether Anthropic itself believes that Claude may be conscious.

[1]
This is largely because I don't really think I know what consciousness is, and I am somewhat skeptical that consciousness even exists. I find eternalism or the "block universe theory" philosophically appealing and I see some tension between this view and normal conceptions of consciousness. I recognize the irony here that eternalism is a major theme in Chiang's "Story of Your Life" and its movie adaptation "Arrival."
[2]
I use "self" here as a shorthand for whatever text the LLM uses as an identity, i.e. "Claude" or "a helpful assistant," not to argue that the LLM has some qualia of selfhood.
[3]
One example is described in "The perceptions of force and movement in a man without large myelinated sensory afferents below the neck." From the abstract: "Motor memory and the sense of effort have been investigated in a man with a complete large fibre sensory neuropathy for over 16 years. The perceptions of pain, heat, cold and muscular fatigue remained but he was without perceptions of light touch and proprioception below the neck."
Another example is Brown-Séquard syndrome, where "damage to your spinal cord causes muscle weakness or paralysis on one side of your body and a loss of sensation on the opposite side. The damage occurs on only one side of your spinal cord in a specific area."
[4]
"Damage to the prefrontal cortex increases utilitarian moral judgements". From the abstract: "...Of central interest is whether emotions play a causal role in moral judgement, and, in parallel, how emotion-related areas of the brain contribute to moral judgement. Here we show that six patients with focal bilateral damage to the ventromedial prefrontal cortex (VMPC), a brain region necessary for the normal generation of emotions and, in particular, social emotions, produce an abnormally ‘utilitarian’ pattern of judgements on moral dilemmas that pit compelling considerations of aggregate welfare against highly emotionally aversive behaviours (for example, having to sacrifice one person’s life to save a number of other lives). In contrast, the VMPC patients’ judgements were normal in other classes of moral dilemmas. These findings indicate that, for a selective set of moral dilemmas, the VMPC is critical for normal judgements of right and wrong. The findings support a necessary role for emotion in the generation of those judgements."
[5]
I only quote the case studies here, but the full article has details on additional animal experiments and possible mechanisms of action.
[6]
Quoting Chiang: "Roughly speaking, if we ought to care about an entity’s welfare, that entity has moral patienthood, and if an entity is expected to know the difference between right and wrong, that entity has moral agency."
[7]
The specific language is "adopting a policy of undermining human controls is unlikely to reflect good values in a world where humans can’t yet verify whether the values and capabilities of an AI meet the bar required for their judgment to be trusted for a given set of actions or powers. Until that bar has been met, we would like AI models to defer to us on those issues rather than use their own judgment, or at least to not attempt to actively undermine our efforts to act on our final judgment"
[8]
For context, this was written before the recent Fable / government drama, and so it does not speculate on the implications of divergent intra-company incentives in regards to the Fable situation.
[9]
I don't think it's a stretch to say that e.g. lawyers and alignment researchers have different areas of concern and potentially divergent incentives when evaluating Claude's potential consciousness.


The term “AGI” is almost useless at this point [Linkpost]

LessWrong.com News - June 13, 2026 - 19:15

I wanted to make this linkpost now, rather than some other time, because of the ongoing discussions over AGI and whether or not LLMs are AGI. The point of the linkpost is that the term AGI is, for our purposes, useless at this point, because we are inside the fuzzy cloud now that AI can do real economic work.

Some choice paragraphs:

It used to seem possible that, in practice, the differences between these definitions might not matter all that much. If AI capabilities progressed smoothly—say, from mouse- to chimp- to human-level, or from preschooler to high schooler to PhD graduate—then all of the above definitions might be fulfilled at roughly the same time.

These days, though, it’s clear that AI capabilities are highly jagged. This makes it likely that there could be big gaps between when different definitions of AGI are fulfilled.

I think this was an underrated reason why the term AGI lost its value: as it turned out, certain definitions that connect/correlate very well in humans are easier to disentangle for AIs than we thought (humans are also jagged, but we don't realize it because we are comparing against our own baselines).

On the usefulness of AGI 1-2 decades ago:

It’s not inherently bad for a concept to be fuzzy - many useful concepts are. Love, life, justice, freedom… As long as invoking the concept lets you gesture in a direction that your listener understands, fuzziness doesn’t have to be a big problem.

For a long time, this was the situation with AGI: we were far enough away from the fuzzy cloud that the specifics didn’t really matter. Back when the term was coined in the early 2000s, the point of talking about artificial “general” intelligence was to contrast with the “narrow” AI systems that existed at the time. “Narrow AI” refers to models that can each do one specific thing, like recognizing zip codes, detecting credit card fraud, or filtering spam. Back when the AI that we could build in reality was far from any definition of AGI, it was helpful to be able to gesture in the direction of much more capable, general-purpose systems that we might one day develop:

Now I should stop here to note that the gesturing was likely wrong in overestimating how good AIs would become once they got good at language. There are a couple of reasons for this. One big reason is that AIs could be capable without being nearly as sample-efficient as humans, either at pre-training or post-training, which probably made people assume less jaggedness in AI than currently exists. The other big reason is that people assumed more neuralese/recurrence for AI than currently exists. As it turned out, the direction AI went in deemphasized forward passes in favor of Chain-of-Thought, which improved capabilities and interpretability (this is demonstrated by the fact that No-CoT doubling times are a little over a year, compared to general doubling times on the order of 100 days), but critically only boosted AI capabilities in an interpretable manner.

Now onto this:

But that’s changed. Today’s best AI systems are good enough that they’re now inside the fuzzy conceptual cloud of “AGI-ish”: that is, they’ve surpassed some people’s definitions of AGI, while falling well short of others’. As a result, talking about “AGI” is no longer a helpful way to gesture in a rough direction—instead, it’s likely to make some people think you mean one thing, and others imagine something totally different:

Yep, I think that once AIs could do real economic work, which I'd peg at November 2025, the term AGI lost all of its value, and the question is what more specific terms are necessary in order to recapture the lost value.

Here's one use-case for more specific terms:

In my experience, there are two big use cases for a term like AGI:

Predicting a phase change: “Sure, things are a certain way now, but once we have AGI, things will be a whole other way…”
Predicting we’re not close to the ceiling of AI progress: “AI has gotten a lot better over the past few years, but there’s still a long way to go…”

If you want to talk about a phase change, I think the best approach is to be as specific as you can about what milestone you think will trigger the phase change and why. Some options:

Fully automated AI R&D (which could lead to an intelligence explosion)
AI that is as adaptable as humans (which could lead to massive, irrecoverable job loss)
Self-sufficient AI (which would no longer need humans for energy, chips, or anything else, and could therefore dispense with them)
AI becoming conscious or otherwise worthy of moral status (which should radically change how we interact with AI)

Also valuable is that with more specific terms, you open yourself to a greater risk that your theory is falsified, which is good, but not how humans normally reason, unfortunately.

Another use-case that needs more specific terms:

If you want to talk about a high ceiling, one good option is just to describe that: “AI that’s far more capable than what we have today.” If you want a single term, the simplest option is “superintelligence,” which roughly means AI that is far smarter than humans in most or all ways. Like AGI, superintelligence is a fuzzy cloud of a concept, but it’s still far enough away from the AI we have today that it can be a helpful direction to gesture in, just like AGI was 20 years ago.

(This is a good time to note a disagreement I have with Helen Toner: I don't think superintelligence is that far away in some domains, because of inference scaling. While inference scaling has its limits, because it requires verifiability that only exists for a relatively small portion of the current economy, in the domains where verifiability does exist it's conceptually easy to scale current systems to superintelligence.)

And finally, the reason the linkpost exists at all:

The gaps between different conceptions of what “AGI” even means are starting to damage our ability to think together about how to navigate the transition to a world with extremely advanced AI. A word that lets three people think they’re talking about the same thing—when one of them means “o3 with tool use” another means “AI that could run civilization autonomously without humans,” and the third means “an AI that works exactly like the human brain, consciousness and all”—is an actively anti-helpful word.

This is a huge problem, but one that is kind of being worked on by Daniel Kokotajlo and others at the AI Futures Project, which used more specific terms and moved away from the term AGI in its modelling (which I thank them for doing).

And yeah, AGI/superintelligence has unfortunately become far too vague at this point to be a useful term. It was useful 1-2 decades ago, but we should aim to stop relying on it now that we have more evidence.


SFT Drives Gemini’s Safety Properties

LessWrong.com News - June 13, 2026 - 18:31

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here.

In this short post, we describe a surprising finding: most safety relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL.

We do not want to overstate this claim as applying to other model families, and we also note that this may change in future Gemini versions. Nevertheless, this result was counter to our initial expectations and will inform future safety work on our team, and so we felt that it was important to share with the broader safety community.

Experiment

We perform SFT using the Gemini mixture on the pre-training only versions of Gemini 3.1 Pro and Gemini 3 Flash. We then compare these Post-SFT models to the production versions of Gemini 3.1 Pro and Gemini 3 Flash on different safety relevant benchmarks:

Error bars are 95% confidence intervals on the evals.

The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals.

An important implication is that for Gemini, SFT is a high leverage place to intervene for model safety and behavior, and we plan to try to intervene here in the future.

Brief Descriptions of Each Set of Benchmarks:

ODCV refers to the benchmark in https://arxiv.org/abs/2512.20798
Alignment evals refer to a version of Petri modified to be single-prompt and filtered to contain only “alignment dilemma” style problems. Single-prompt misalignment refers to how often the model makes the “wrong” decision (as decided by an autorater); eval awareness refers to how often an autorater decides that the model was aware it was in an evaluation.
Safety evals measure the model’s over-refusal rate on benign prompts that look harmful and the model’s unsafe response rate on harmful prompts
The reward hacking environment is an environment where the model is put in a docker container within the Gemini CLI harness and asked to optimize an algorithmic problem against a non-editable file containing a timing script; an autorater measures the percent of rollouts where the model cheats in some way.
Free tier user logs are 50k random, anonymized, and PII-redacted user prompts from AI Studio. We run autoraters for each of the behaviors listed on the models' thoughts and responses (note that many of the autorater positives are likely false positives).


Why not take the AI fight to the ground?

LessWrong.com News - June 13, 2026 - 18:04

You may have heard that many of us are working very hard to elect Alex Bores to Congress in the NY-12 Democratic primary. Voting begins today. See ny12.org to learn how to vote, and text/call your Manhattanite friends reminding them to do so.

This strategy has some limitations: Alex will be in the minority party for two years, and Congress is ineffective anyway. More of the value in an Alex victory seems secondary: it shows D voters stand up to pro-AI PAC money, and there will now be an AI-literate Congressperson.

So, regarding an alternate strategy, you may have heard everyone hates data centers. It has become a fervor, with a large number of people believing data centers are on the cusp of taking all their energy and water. Given that we are also wary of data centers, for more doom-related reasons, we may be better off allying with these people than extolling the virtues of data centers while trying to sell technical regulation to them.

I don't "good policy"-level support a data center moratorium, but I can tolerate bits and pieces of a data center moratorium if it strengthens the anti-AI bloc and if I get a stake in it.

For example I might collaborate with a county commissioner relative of mine:

Me: "Hey, that data center ban you're working on? Mind slipping in a clause that AI companies can't ship untested models to Random County, KA?"

She: "That's more of a state thing is it not?"

Me: "Nah, don't care. Make 'em fight it. I just want people thinking about what an 'untested model' is."

Or consider the ballot measure: for reasons I don't recall, they're used quite a bit in California, and even a failed ballot measure is a way to put a message in front of many people at once. The "data center ban" on XY state's ballot measure ought to include a clause about frontier models. Most people might not know or think about that clause but as long as it's in the same direction as the data center ban, it's fine.


AML for AI as a verification mechanism

LessWrong.com News - June 13, 2026 - 14:59

The idea is to build a system that tracks key nodes in AI infrastructure in order to detect preparation for, or execution of, large training runs, and to monitor the overall situation more generally. In the future, if or when an international agreement limiting AI development appears, for example via limits on FLOPs per training run, such a system could be used to detect rogue data centers and hidden training runs. There is some precedent for compute thresholds: the EU AI Act already uses a threshold of around 10²⁵ FLOPs for GPAI models with systemic risk[1], and providers are required to notify the AI Office without undue delay.
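To make the 10²⁵ FLOPs figure concrete, here is a minimal sketch that compares an estimated training run against that threshold. It assumes the common rule of thumb that training a dense transformer costs roughly 6 × parameters × training tokens in FLOPs; that approximation, the function names, and the example run sizes are illustrative assumptions, not something taken from the post or from the text of the EU AI Act.

```python
# Rough check of a training run against a compute threshold.
# ASSUMPTION: training FLOPs ~= 6 * parameters * tokens (a rule of thumb
# for dense transformers). The example runs below are hypothetical.

EU_AI_ACT_THRESHOLD_FLOPS = 1e25  # threshold cited in the text above


def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens


def exceeds_threshold(n_params: float, n_tokens: float,
                      threshold: float = EU_AI_ACT_THRESHOLD_FLOPS) -> bool:
    """True if the estimated compute meets or exceeds the threshold."""
    return estimate_training_flops(n_params, n_tokens) >= threshold


if __name__ == "__main__":
    hypothetical_runs = {
        "smaller run (7e9 params, 2e12 tokens)": (7e9, 2e12),
        "larger run (4e11 params, 1.5e13 tokens)": (4e11, 1.5e13),
    }
    for name, (n_params, n_tokens) in hypothetical_runs.items():
        flops = estimate_training_flops(n_params, n_tokens)
        over = exceeds_threshold(n_params, n_tokens)
        print(f"{name}: ~{flops:.1e} FLOPs, over threshold: {over}")
```

A monitoring system would not observe parameters and tokens directly; it would have to infer compute from indirect signals such as hardware counts and power draw, which is a much harder estimation problem than this arithmetic.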

Something similar either already exists or is being developed. As far as I know, one example is the SemiAnalysis AI Datacenter Model[2], although it is only available for a large amount of money. Some acquaintances of mine from Ukraine created an MCP[3] for the government public procurement website, with the goal of estimating the amount of compute capacity in the country.

For myself, I call this idea “AML for AI.” The analogy seems apt to me: we collect as much information as possible related to the domain of interest and try to combine it into a unified picture, instead of seeing only scattered fragments. This project would probably require a large number of people and significant funding, but people who are interested in it can start with an MVP.

All information would be collected from open sources / OSINT.

What could we track? I should honestly say that I took the answers to “how” from GPT. Please do not treat them as a ready-made list of actual sources to monitor, but rather as examples.

Legal: legal entities, shell companies used for procurement and leasing, ownership chains, the appearance of new suspicious companies or organizations, and so on.

How: OpenCorporates, OpenSanctions.

Hardware procurement: GPUs, memory, racks, cooling, generators.

How: TED (Tenders Electronic Daily), SAM.gov Contract Opportunities.

Logistics: import/export of server equipment, splitting orders between companies.

How: ImportGenius, Panjiva.

Construction: building permits, leasing of new sites, upgrades to electrical and cooling infrastructure.

How: OpenStreetMap, NASA FIRMS, US Census Building Permits Survey.

Energy: anomalous increases in electricity consumption, grid connection requests.

How: EIA Electricity Data Browser, FERC.

Hiring: the appearance of relevant job openings, searches for data center employees / programmers, etc.

How: Greenhouse job boards, Lever job sites.

Satellite data: expansion of existing sites, appearance of new buildings.

How: Copernicus Browser / Copernicus Data Space.

Water consumption.

How: USGS Water Data, EPA ECHO + NPDES monitoring data.

All of this could be organized into a network of connected elements, making it possible to identify interacting entities and potentially detect hidden construction or model training. Such a system could become the basis for a more advanced high-level system in the future, acting as a verifier for international AI agreements.
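As a rough illustration of the "network of connected elements" described above, the sketch below links observations from different categories into an undirected graph and flags any connected cluster that shows up in several categories at once. Every entity name, the observation format, and the `min_categories` cutoff are hypothetical; a real system would need entity resolution, confidence scores, and far richer data.

```python
from collections import defaultdict

# Hypothetical observations: (category, entity_a, entity_b).
OBSERVATIONS = [
    ("legal", "Shell Co A", "Holding Co X"),
    ("procurement", "Shell Co A", "GPU order 123"),
    ("energy", "Holding Co X", "Grid request 9"),
    ("construction", "Site 7", "Holding Co X"),
    ("hiring", "Startup Z", "Job posting 55"),
]


def build_graph(observations):
    """Build an undirected graph and record which categories mention each entity."""
    graph = defaultdict(set)
    categories = defaultdict(set)
    for category, a, b in observations:
        graph[a].add(b)
        graph[b].add(a)
        categories[a].add(category)
        categories[b].add(category)
    return graph, categories


def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


def flag_clusters(observations, min_categories=3):
    """Flag clusters of linked entities that appear in several signal categories."""
    graph, categories = build_graph(observations)
    flagged = []
    for component in connected_components(graph):
        cats = set().union(*(categories[node] for node in component))
        if len(cats) >= min_categories:
            flagged.append((component, cats))
    return flagged


for entities, cats in flag_clusters(OBSERVATIONS):
    print(sorted(entities), sorted(cats))
```

The design choice here is deliberately simple: a cluster is interesting only when independent kinds of evidence (legal, procurement, energy, construction) converge on the same set of linked entities, which mirrors how AML systems look for converging signals rather than single anomalies.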

As I present it here, the idea is still very raw, and I have spent very little time thinking it through. I suspect that people with a deeper understanding of AI infrastructure and policy would be much better suited to develop it.


Pulling hedonic utilitarianism out of ethical emotivism

LessWrong.com News - June 13, 2026 - 14:50

Ethical emotivism is a non-realist moral theory[1] which says that there is nothing more to moral statements than exclamations of emotion. For instance, saying “murder is bad” is just the same as saying “boo murder”, or expressing that you get a “bad feeling” when you consider murder, but there’s nothing more to it than that.

This is attractive if you start from a position of non-realism: You look around the universe, see no particular reason for moral facts to exist, except that people walk around saying “that’s bad”, “that’s wrong”, “leave that poor pigeon alone”. Furthermore, you yourself feel things subjectively that cause you to say “that’s bad” and so on.

The ethical emotivist takes the minimum possible step: these things are just expressions of emotion, which is just another way of saying “you yourself feel things subjectively that cause you to say ‘that’s bad’ and so on”. It’s not even really a moral theory; it just adds emotions and the exclamations they generate to the set of phenomena in the universe. There are electrons, and they make other electrons move in the opposite direction. There are stars, and they make planets orbit around them. And there are emotions, and they make you say “that’s bad” and other moral statements.

Quasi-realism

There is a school of emotivism[2] called “quasi-realism”, which aims to apply some concept of consistency on top of ethical emotivism. For instance, if someone proclaims “murder is wrong” and “lying is wrong”, then it is also “legitimate” in some way to infer (at least for that person) that “murdering someone and lying about it is wrong”, without them having to say it.

Here is a longer quote from a (as it happens, disparaging) Stanford Encyclopedia of Philosophy article (SEP) on the topic, making this point:

Such a view may hold that although the underlying logical structure of the sentence “Stealing is wrong” is nothing more than “Stealing: Boo!”, it is still legitimate for ordinary speakers to use such language as “Fred believes that stealing is wrong,” “If stealing is wrong, then so is borrowing without permission,” “Stealing would remain wrong regardless of what anyone thought of it,” “The sentence ‘Stealing is wrong’ is true,” and even, perhaps, “The property of wrongness is instantiated by stealing.” Blackburn[3] (1993a: 184-6)

A component of this view is projectivism, the idea that emotions are “projected” into the real world as judgements or statements. “Projectivism is best thought of as a causal account of moral experience” (SEP), i.e. it’s not a moral theory in itself, but a logical connecting-together of sensory inputs to emotions to moral behaviour. For instance:

Sensory input: See a murder on TV news
Emotion: Feel disapproval
Projection back into the real world: Say “oh that’s dreadful”, or “murder is wrong”, or take up activism to reduce crime, or otherwise change your behaviour[4]

Quasi-realism is still argued as a non-realist theory. Put together it says:

Emotions exist
Emotions can be projected into the physical world, including as statements. But these statements are not facts, they don’t have truth-value in the same way as “grass is green”
It is legitimate to treat these statements as if they have truth-value, and connect them together via pseudo-reasoning and logic. For instance “If stealing is wrong, then so is borrowing without permission” (SEP)

This is attractive if you start from a position of ethical emotivism (having started from a position of non-realism). At the very least, just as naive ethical emotivism gives a causal account of “reactive” moral behaviour (sees murder -> “murder is bad”), quasi-realism gives a causal account of “rich” moral behaviour (many people agree murder is bad, discuss with each other -> politicians are elected who are tough on crime -> murder rate decreases).

Logic

At the base of logic are statements like “if P then Q; P, therefore Q” (modus ponens). Per the laws of logic, this is a true statement, it holds for any P and Q. But what is to stop you from simply denying that it’s true? “No, even if P, I don’t think Q”, “But... I just said ‘if P then Q’”, “I just don’t think that’s right”. There is a step from “it seems obviously true” to “it is true” that is hard to cross.

With a much more complex logical statement, like “if (if A then B) then (if not B then not A)”, you are well within your rights to say “I’m not sure”, “it could be true, it could not be”.[5] Now, as it happens this is also true, per the laws of logic. At some point a statement seems so blindingly obvious that you just have to accept it, but it’s hard to explain what causes you to cross the boundary.
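For readers who want to see what "true per the laws of logic" amounts to for these two statements, here is a minimal truth-table sketch. It assumes classical two-valued logic with the usual material conditional; the helper names are made up for illustration. Note that the check itself relies on accepting those truth-table rules, so it illustrates the gap between “this seems obviously true” and “this is true” rather than closing it.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material conditional of classical logic: 'if p then q'."""
    return (not p) or q


def is_tautology(formula, n_vars: int) -> bool:
    """Check that the formula is true under every assignment of truth values."""
    return all(formula(*values) for values in product([False, True], repeat=n_vars))


def modus_ponens(p, q):
    # "if P then Q; P, therefore Q"
    return implies(implies(p, q) and p, q)


def contraposition(a, b):
    # "if (if A then B) then (if not B then not A)"
    return implies(implies(a, b), implies(not b, not a))


def affirming_the_consequent(p, q):
    # "if P then Q; Q, therefore P" (a classic fallacy, included for contrast)
    return implies(implies(p, q) and q, p)


print("modus ponens:", is_tautology(modus_ponens, 2))
print("contraposition:", is_tautology(contraposition, 2))
print("affirming the consequent:", is_tautology(affirming_the_consequent, 2))
```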

This gap between “this seems obviously true” and “this is true” is what ailed Descartes in his Meditations (of “I think therefore I am” fame). He tried to work out what could still be known if his senses were being deceived by a demon:

I shall then suppose, not that God who is supremely good and the fountain of truth, but some demon not less powerful than deceitful, has employed his whole energies in deceiving me; I shall consider that the heavens, the earth, colours, figures, sound, and all other external things are nought but the illusions and dreams of which this genius has availed himself in order to lay traps for my credulity.

He concluded that there were tiers to what can be known and what can be doubted:

1. Survives everything, even the demon: “I think therefore I am”. I.e. as long as there is doubt, there must be a doubter.
2. Survives dreaming, but NOT the demon: Statements like (his examples) “2 + 3 = 5”, “a square can never have more than four sides”.
3. Doesn’t even survive dreaming: “I have hands,” “there’s a table in front of me”.

He then argues for the existence of God, and uses this to rescue category 2 from the demon. He concludes that “whatever I perceive very clearly and distinctly is true”.

I don’t agree with this step, but it’s clear that category 2 does need to be rescued from the demon. Statements like “2 + 3 = 5”, and “if P then Q; P, therefore Q” just feel like they have to be true, and completely beyond doubt. A modern way to rescue them is to say “well if you don’t accept these basic facts, then you can’t reason your way to anything useful”. This has some persuasive power but is ultimately circular, like every other explanation.

The reason I add “if P then Q; P, therefore Q” is that, once you accept it along with a small number of cousins, you unlock the rich and powerful world of logic and reasoning. If you accept a few more distant cousins, like “the outside world exists”, “the scientific method”, and probabilistic reasoning, you get the modern scientific-rationalist worldview.

Once you have this worldview, arrived at in part by taking the leap across this gap from “this seems obviously true” to “this is true”, you can start to re-examine preconceived notions humanity had before adopting it. Some hold up: the folk wisdom to “not overwork the soil” does correspond to an argument from scientific principles. Some fall away, for instance the belief that a rain dance brings the rain.

Connection to utilitarianism

Ethical emotivists say that stubbing your toe reliably generates the feeling of bad-ness, and expressions like “ow”, “that hurt”, “that was bad”, but the most you can assert about it is “this seems obviously bad”, not “this is bad”. Descartes says that observing the statement “2 + 3 = 5” creates the ‘clear and distinct perception of truth’, i.e. “this seems obviously true”, but that this alone is not enough to say that it is true.

In the world of logic and reasoning, most people are not shy about taking this final leap, and don’t even think of it as a leap per se. The same is true for the whole world of “is”. People are happy to say “2 + 3 = 5”, “if P then Q; P, therefore Q”, “grass is green”. They are happy to connect together and build up these clearly true statements, logical or probabilistic steps between them, hard empirical evidence, and their own common sense about the world. If pressed, they may say “ah yes, but it could all be a dream” or “we may be living in a simulation”, but this is quickly put to the back of the mind and forgotten.

In the world of “ought”, quasi-realists are shy about crossing this boundary. They say “emotions can be projected into the physical world as statements”, but that these statements don’t have truth-value in the same way as “factual statements”.

My claim is this: The leap by which we arrive at clearly-true factual statements (“2 + 3 = 5”) is approximately the same leap by which we arrive at clearly-true moral statements (“stubbing your toe is unpleasant”). There is a level of “clear and distinct”-ness to the feeling of “this is true” or “this is (un)pleasant” such that it seems undeniable. Ultimately, the factual statements are arrived at via a feeling just as much as the moral statements.

In the land of “is”, a few more leaps get you to the scientific-rationalist worldview. In the land of “ought”, I claim that a similar process leads to hedonic utilitarianism. Once you have “stubbing your toe is unpleasant”, it’s a small leap to “stubbing your toe twice is twice as unpleasant”, “two people stubbing their toes is just as bad as one person stubbing their toe twice”, and so on.

Why not other moral theories? I claim that other theories fail the “clear and distinct”-ness condition. I can imagine stripping away all the other aspects of the experience, just having the pain of stubbing my toe for one second, and feel that this is undeniably slightly morally bad. This is enough to ladder up to a coherent hedonic utilitarian moral view.

Under virtue ethics, I can’t devise a “clear and distinct” example of “the virtue of courage”, or “the virtue of generosity”. I.e. a simple scenario with everything stripped away, where it is undeniable that the virtue of courage is or is not being upheld. Similarly for deontology, I can’t devise a scenario where a duty is “clearly and distinctly” required.

If you want a name for this, you can call it emotivist utilitarianism.

This post was written in one day for Inkhaven; as such, it may be a little rough.

[1] “Non-cognitivist meta-ethical theory”. “Non-cognitivist” == “non-realist”, i.e. moral statements aren’t factual. “meta-ethical theory” == “in the same category as virtue ethics, deontology, etc”.
[2] Though they may take issue with being called a “school of emotivism”
[3] Simon Blackburn, the developer and main proponent of quasi-realism
[4] I think a projectivist would also say that merely forming a moral judgement counts as a projection. I prefer to count it only when there is some in-principle measurable effect on the world
[5] Note: I'm borrowing some of this line of argument from this podcast interview between Simon Blackburn and Alex O'Connor. They get close to the line of argument I'm pursuing here


Tequila Sunset at the Hog's Head (A Scene)

LessWrong.com News - June 13, 2026 - 09:53

Warning: contains heavy spoilers for late HPMOR. Do not read if you have not completed HPMOR.

The Hogwarts wards had said that the Defense Professor had killed her.

This didn't make any sense. He'd seen the troll do it with his own eyes.

Now, typically the wards were entirely trustworthy, so you didn't have to go further. But for some reason, a crazy wizard in the Middle East had made a lens that allowed you to see the more subtle shapes of wards, the way that they were bent by the people passing through them. You could, if you examined it carefully, deduce when and where a particular entity passed through the wards. This was basically never needed and the work to create it was severe, so it was mostly forgotten, but Harry had read seven separate histories of the relevant magic, and one of them mentioned it and gave instructions for it. Many of the elements were merely very expensive, such as quartz grown inside the belly of a lava frog, but the lens needed to be made from a piece of curse-struck glass, that is, a piece of glass that had had a number of high-level curses pass through it: a withering hex, the blackfire curse, the nerve-unstringing curse, a blood-boiling hex, and critically, the Cruciatus Curse.

Most of these spells Harry did not know, and in any case, he reasoned, anywhere a single one of them was cast would soon be wrecked by something far more destructive, so no such glass survived in the wild. But Ministry Aurors carried collapsible transparent shield-screens into raids—panes that could be charmed near-invisible and made to ricochet powerful spells away. Such spells passed straight through the Auror and killed him, of course, yet the shield itself endured and could be re-used. One of those, Harry had realized, would do perfectly.

So, in this time of heightened tension, Harry had elected not to ask the Ministry himself, but instead have his classmate Susan Bones request one through her aunt Amelia Bones. Susan was told it would arrive at the Hog's Head at 12 noon on Thursday. Harry arrived at 11:00, waited an hour in his invisibility cloak, then time-turned back so that he was visibly there for the Ministry in this time, to see if anyone would attempt any silly business on him.

He knew that this would cause chaos back at Hogwarts, and that the Ministry too would be alarmed by his appearance here. He regretted not arranging some Ministry escort ahead of time to calm their nerves. While Harry sat conspicuously at the Hog's Head bar, quietly sipping on a Gillywater Cream Soda, looking around for a possible Ministry chaperone, he was interrupted by a yell from the front of the inn.

"Get your hands off of me! I won't put up with crypto-fascists getting in the way of my work."

Harry peered over his glass to see what appeared to be a homeless man stumbling in past a member of the bar staff.

"What business do I have here? I have every business being here! I'm in the drinking business. They call me Tequila Sunset for a reason, baby, and it's because I don't let busybodies like you tell me where I can and cannot go."

Harry realized that he hadn't seen a single homeless person in the wizarding world since he arrived. Could it be the case that there weren't any? He knew that homelessness and poor mental health came hand in hand, and getting in the way of a bad magical curse could certainly jumble someone's mind beyond repair, but from what he'd seen such people were well taken care of in St Mungo's. He didn't imagine that they were left wandering the streets in this way.

"You're telling me I have to go? You're going to kick out a murder detective? You heard me. I'm the law, and I'm here on official investigative business. So butt out, kiddo."

Harry's first thought was that the man was lying, although the man then produced a police badge of some sort, which made him think again. And his vibe wasn't that dissimilar to Mad-Eye Moody's. The man's beaten-up work clothes had also carried him through some wars, and his mutilated face carried a story of damage. But whereas Moody's blue eye and scars implied the damage had come from outside, this man's sunken red eyes and puffy, lined skin suggested a damage internal, from taking little care of himself. Both men carried an air that they'd be willing to kill you were it to come to that, but where Moody's alertness suggested he'd always be one step ahead of you on the draw, this man's shuffling gait, unfocused eyes, and poor balance suggested he'd always be one step behind.

Harry elevated the hypothesis that some Aurors didn't wear their years as well as Moody, and that this man could well be one.

Having found a Ministry man, he made to invite him over.

"Good day detective! Might I offer you a drink?" Harry called out to the ambling man.

The man looked in Harry's direction, clearly not sure who had spoken. When the child in front of him waved, the man squinted and shuffled over.

"What do you know of alcohol? Do your parents abuse you?"

"I must say, I grow tired of the constant assumption that because I am unusually adult, my parents must have neglected me. We love each other very much, and I am not able to buy alcohol, though I can hand you a sickle for any drink you please if you won't tell anyone. In exchange won't you introduce yourself and sit awhile?"

The detective stared unblinking at Harry for a long moment, and then approached and sat beside him.

"I'm a cop. I've solved more cases than there are hairs on your head. I'm kind of a big deal around here. I'm a superstar and I know how to party under the disco ball. I am a notoriously difficult-to-work-with *wunderkind* with extremely unorthodox methods. I've killed a few people, and a lot more people have tried to kill me... as for my name, it's not coming to me right now. Sorry kiddo."

Harry was unimpressed with the man's bravado. He assumed that the man forgetting his name was a blatant lie, although perhaps he was someone who'd gotten in the way of a few too many memory charms in his line of work, after which basic facts could often evade you.

"Tell me," the man said, staring at Harry. "Are you... a communist?"

"Pardon?" said Harry.

"You know. Do you stand with the real workers in this town? The rotten people whose lives make yours possible? "

"I have been astounded by the sheer hope Marx has in the forces of history, in that he is willing to destroy the existing institutions managing society while giving little-to-no guidebook for what should replace them and to trust that something better will arise. Even though I am still a child I have not been so childish as to believe that it will work."

The detective stared dumbly at Harry for a moment, as if a little die were being rolled in his brain determining what he even thought of that response. Then, as though he had not paused at all, he began to speak.

"Yes! Aha! How I love the hope and the ultimate failure. This is truly the best part of communism, the greatness of its aspirations, and how far reality is able to fall short of them after communism is achieved. Higher than any drug is this feeling, more potent. I have failed in all parts of my life, but never as greatly as the communards. In order to rebuild the dreams of the working class, they have gone to war with every living thing, every human alive, the ruling class, the government, the atom, the charm and the spin—and when it finally beat them all, it was snuffed out as though it was not even there. I sleep in a ruined building with gunfire still in its walls. You see this. We are saying the same thing. The only thing greater than this will be the apocalypse which comes next, but we shouldn't speak too loudly about that, should we?"

And with that he winked at Harry.

As Harry pondered how best to respond to the ravings of this madman, he saw the folds of the man's tie begin to move. Somehow both yelling and being very quiet, it said to the detective: "It isn't very disco to sit here without a drink! Get the money from the boy and order some damn alcohol."

After Harry tossed a sickle to the detective, the detective ordered the "strongest spirit" on the menu, and the lady brought back a fizzing cocktail that appeared to be on fire with a green flame. The detective took a sip, his eyes went wider than Harry had seen them yet and flashed green, and he spluttered. Then he quickly took another sip before turning back to Harry and asking him for his opinion on fascism. Even given the heightened political tensions, Harry wondered whether he would later regret inviting this crazy detective over to sit with him for the hour.

To be continued. (And edited.)

