Вы здесь

Сборщик RSS-лент

AI Safety Needs Startups

Новости LessWrong.com - 7 марта, 2026 - 04:27
Summary:
  • Startups can become integrated in the AI supply chain, giving them good information about valuable safety interventions. Safety becomes a feature to be shipped directly to users by virtue of this market position.
  • Better access to capital, talent, and ecosystem-building is available to for-profits than non-profits. VC funding dwarfs philanthropic funding, and there is little reason to believe that profitable safety-focused businesses aren’t possible.
  • Joining a frontier lab is a clear alternative, but most AI deployment happens outside labs. Your marginal impact inside a large organisation is often smaller than your impact when founding something new. Equally, profitable businesses aren’t an inevitability. You should seriously consider working for or founding an AI safety startup.
Introduction

Markets are terrible at pricing safety. In the absence of regulation, companies cut corners and externalise risks to society. And yet, for-profits may be the most effective vehicle we have for deploying safety at scale. Not because the incentives of capitalism align by chance with broader human values, but because the alternatives lack the resources, feedback loops, and distribution channels to turn safety insights into safer outcomes. For-profits are far from perfect, but have many advantages and a latent potential we should not ignore.

Information, Integration, and Safety as a Product

For advanced AI, the attack surface is phenomenally broad. It makes existing code easier to crack. Propaganda becomes cheaper to produce and distribution of it becomes more effective. As jailbreaking AI recruiters becomes possible, so does the data-poisoning of entire companies.

Information about new threats and evolving issues isn’t broadcast to the world. Understanding where risk is most severe and how it can be mitigated is an empirical question. We need entities embedded all across the stack, from model development to deployment to evaluation. We need visibility over how this technology is used and misused, and enough presence to intervene when needed. ‘AI safety central command’ cannot provide all these insights. Researchers acting without direct and constant experience with AI deployment cannot identify the relevant details.

Revenue is a reality check. If your product is being bought, people want it. If it isn’t, they either don’t know about it, don’t think it’s worth it, or don’t want it at all. For-profits learn what matters in an industry directly from the people they serve, giving the best insights money could buy.

This is not to say that AI safety non-profits aren’t valuable. Many do critical work which is difficult to support commercially. But by focusing entirely on research or advocacy and ignoring the commercial potential of their work, organisations cut themselves off from a powerful source of feedback. Research directions, careers, and even whole organisations can be sustained for years by persuading grantmakers and fellow researchers of a thesis, rather than proving value to people who would actually use the work. Without this corrective pressure, even well-intentioned research may drift from what the field actually needs. Commercialisation should not be seen as a distraction or a response to limited funding, but as a tool for staying at the bleeding edge of what is useful for the world.

Productification

Turning research into a product people can buy is extremely powerful for distribution. You are no longer hoping that executives, engineers, and politicians see value in work they do not understand tackling risks they may not believe in. It becomes a purchase. A budget decision. A risk-reward tradeoff that large organisations are very well suited to engage with.

There are clear gaps in securing AI infrastructure which can be filled today. If you’re wondering what an AI safety startup might actually do, here are some suggestions for commercial interventions targeting different parts of the stack.

  • Frontier Models: Interpretability tooling, evaluations infrastructure, and formal verification environments. Tools which might be implemented by labs and companies with direct access to frontier models to understand and control them better.
  • Applications: Content screening, red-teaming as a service, and monitoring for misuse. Helping startups building on frontier models catch accidental or deliberate misuse of their platforms.
  • Enterprise Deployments: Observability platforms, run-time guardrails, and hallucination detection. Enterprises and governments using AI to automate critical work should be able to catch issues early and reliably.
  • Market Incentives: Model audit and certification, and safety-linked insurance. Creating market incentives which reward safer models when they’re released into the world.

None of these require waiting for frontier labs to solve alignment, or hoping that someone else finds your work and decides to implement it. Instead of writing white papers hoping governments will regulate or frontier labs will dutifully listen, you build safety directly into products that customers come to rely on. One path hopes someone will do the work, whereas the other is the work.

Safety Across The Stack

When you tap your card at a shop to make a purchase, a network of financial institutions plays a role in processing your transaction. The point of sale system reads your card information, sends it on to a payment processor, who forwards the request to the appropriate card network. The issuing bank for your card authorises the transaction, the money is sent to cash clearing systems, and cash settlement is performed often through a central bank settlement system.

There is a sense in which all fraud happens at a bank. They have to release the fraudulent funds, after all. But the declaration that all fraud prevention initiatives should be focused on banks and banks alone comes across as fundamentally confused. Fraud prevention might be easier at other layers, and refusing to take those opportunities simply because it is in principle preventable at some more central stage would not lead to the best allocation of resources.

Similarly, when a user prompts an AI application, they are not simply submitting an instruction directly to a frontier model company. Just as tapping your card does more than instruct your bank, such a message goes through guardrails, model routing, observability layers, and finally frontier model safety measures. Every step of this process is an opportunity for robustness we should not let go to waste.

This becomes even more critical as AI agents begin acting autonomously in the world, doing everything from browsing and transacting to writing and executing sophisticated code. When an agent’s action passes through multiple services before having an effect, every link in that chain is both a potential failure point and an opportunity for a safety check.

Exclusively focusing AI safety interventions on frontier labs would be like securing the entire financial system by regulating only banks. Necessary, but nowhere near the most efficient or robust approach.

Capital, Talent, and Credibility

Successful for-profits are in an inherently better position to acquire resources than non-profits. Their path to funding, talent acquisition, and long-term influence is far stronger than that of their charitable counterparts.

There is an immense amount of venture capital washing around the AI space, estimated at ~$190 billion in 2025. Flapping Airplanes raised $180 million in one round, comparable to what some of the largest AI safety grantmakers deploy annually, raised in a fraction of the time. VC allows you to raise at speed, try many approaches, and pivot more freely than would be possible in academia or when reliant on slower charitable funders.

In AI safety, non-profits are less likely compared to other sectors to be trapped in an economic struggle for survival. However, even in the AI safety ecosystem, philanthropy is much more limited than venture capital and more tightly concentrated among fewer funders. Non-profits are vulnerable, not just to the total capital available, but to the shifting attitudes of the specific grant makers they rely on. VC-backed companies, by contrast, are much more resilient to the ideological priorities of funders. If one loses interest, many others remain available as long as you have a strong business case.

Yes, there is a large amount of philanthropic capital in AI Safety compared with typical non-profits. Safety products can also be difficult to sell. But the question of whether safety-focused products sell well, as they do in other industries, is a hypothesis you can go out and test. If it turns out that they do, there could be an immense amount of capital available which should be used to make our world safer.

For-profits attract talented people not just through hefty pay packages, but also through their institutional prestige and the social capital they confer. You can offer equity to early employees, which is extremely useful for attracting top technical talent and is entirely unavailable to non-profits. Your employees can point to growing valuations, exciting products with sometimes millions or even billions of users, and influential integration of their technology with pride. For many talented and competent people, this is far more gratifying than publishing research reports or ever so slightly nudging at the Overton window.

All of this, increased access to talent, capital, and credibility, makes for-profits far easier to scale. And safety needs to scale. The amount of time we have until transformative AI arrives differs wildly between forecasts, though it seems frighteningly plausible that we have less than a decade to prepare. If we are to scale up the workforce, research capacity, and integration into the economy of safety-focused products, we cannot afford anything other than the fastest approach to building capacity.

Success compounds. Founders, early employees, and investors in a successful for-profit acquire capital, credibility, and influence that they can reinvest in safety, whether by starting new ventures, funding others, or shaping policy. This virtuous cycle is largely unavailable to the non-profit founders, unless they later endow a foundation with, as it happens, money from for-profits.

In addition to tangible resources, a mature ecosystem of advisors and support networks exists to help startups succeed. VC funds, often staffed by ex-founders, provide strategic guidance and industry connections that are crucial for closing sales. There are many talented people who understand what startups offer and actively seek them out. An equivalent ecosystem just doesn’t exist for non-profits.

Shaping The Industry From Within

Being inside an industry is fundamentally different from being adjacent to it.

Embedding an organisation inside AI ecosystems enables both better information gathering and opportunities for intervention. If you can build safe products appropriate to the problems in an industry, you allow companies to easily purchase safety. If companies can purchase safety, then governments can mandate safety. But to get there, it is not enough to make this technology exist; the technology must be something you can buy.

Cloudflare started as a CDN. By becoming technically integrated, they slowly transformed into part of the critical infrastructure of the internet. Now, they make security decisions which shape the entire internet and impact billions of users every day. A safety-focused company embedded in AI infrastructure could do the same.

Will Markets Corrupt Safety?

Market incentives are not purely aligned with safety. The drive to improve capabilities, maximise revenue, and keep research proprietary will harm a profit-seeking organisation’s ability to make AI safer.

However, every institution has its pathologies. The incentives steering research-driven non-profits and academics are not necessarily better.

Pure Research Also Has Misaligned Incentives.

The incentives of safety and capitalism rarely align. The pressure to drive revenue and ship fast pushes towards recklessly cutting corners, and building what your customers demand in the short term rather than investing in long term safety.

However, research organisations have similar harmful incentives driving them away from research which is productive in the long term. The need to seek high-profile conference publications, pleasing grant makers, and building empires. Incentives of research organisations and individual researchers are notoriously misaligned with funders goals’ in academia and industry alike. Pursuing a pure goal with limited feedback signals is extremely difficult as an organisation, regardless of structure.

Ideally, we would have both. For-profits which can use revenue as feedback and learn from market realities, alongside non-profits which can take longer-term bets on work needed for safety. The question is how to build a working ecosystem, not whatever is more purely focused on safety.

Proprietary Knowledge Is Not Always Hoarded.

For-profits have an incentive to keep information hidden to retain a competitive advantage. This could block broader adoption of safety techniques, and restrain researchers from making optimal progress.

Assuming that for-profits add resources and people to the AI safety ecosystem, rather than simply moving employees from non-profits, this is still advantageous. We are not choosing between having this research out in the open or hidden inside organisations. We are choosing between having this research hidden or having it not exist at all. In many sectors, the price of innovation is that incumbents conceal and extract rents from their IP for years.

Despite this, for-profits do have agency over what they choose to publish. Volvo famously gave away the patent to their three-point seatbelt at the cost of their own market share, saving an estimated 1 million lives. Tesla gave away all of their electric vehicle patents to help drive adoption of the technology, with Toyota following suit a few years later. Some of this additional knowledge created by expanding the resources in AI safety may still wind up in public hands.

Markets Force Discovery Of Real Problems.

The constant drive to raise money and make a profit is frequently counter to the best long-term interests of the customer. Investment which should be put into making a product safer today instead goes into sales teams, salaries, and metrics designed to reel in investors. It is true that many startups which begin with a strong safety thesis will drift into pure capabilities work or adjacent markets which show higher short-term growth prospects.

However, many initiatives operating without revenue pressure, such as researchers on grants or philanthropically-funded non-profits, can work for years on the wrong problem. For-profits will be able to see that they are working on the wrong thing, and are driven by the pressure to raise revenue to work on something else.

This is not to say that researchers are doing valueless work simply because they are not receiving revenue in the short term. Plenty of work should be done to secure a prosperous future for humanity which businesses will not currently pay for. Rather, mission drift is often a feature rather than a bug when your initial mission was ill-conceived. The discipline markets provide, forcing you to find problems people will pay to solve, is valuable.

Failure Is A Strong Signal.

The institutional failure modes of non-profits and grant-funded research are mostly benign. The research done is not impactful, and time is wasted. On the other hand, for-profits can truly fail in the sense that they fail to drive revenue and go bankrupt, or they can fail in more spectacular ways where they acquire vast resources which are misallocated. The difference is not that for-profits are inherently more likely to steer from their initial goals.

Uncertainty about impact is common across approaches. Whereas research that goes unadopted fails silently, and advocacy which fails to grab attention disappears without effect, for-profits are granted the opportunity to visibly and transparently fail. The AI safety ecosystem already funds work which fails silently, and is effectively taking larger risks with spending than we realise. Startups aren’t any more likely to fail to achieve their goals; they are in the pleasant position of knowing when they have failed.

Visible failure generates information the ecosystem can learn from. Silent failure vanishes unnoticed.

Your Counterfactual Is Larger Than You Think.

Markets are not efficient. The economy is filled with billion-dollar holes, which are uncovered not only by shifts in the technological and financial landscape but by the tireless work of individuals determined to find them. Just because there is money to be made by providing safety does not mean that it will happen by default without you.

Stripe was founded in 2010. Online payments had existed since the 1990s, and credit-card processing APIs were available for years. Yet it took until 2010 for someone to build a genuinely developer-friendly API, simply because nobody had worked on the problem as hard and as effectively as the Collison brothers.

Despite online messaging being widely available since the 1980s, Slack wasn’t founded until 2013. The focus, grit, and attention of competent people being applied to a problem can solve issues where the technology has existed for decades.

Markets are terrible at pricing in products which don’t exist yet. Innovation can come in the form of technical breakthroughs, superior product design, or a unique go-to-market strategy. In the case of products and services relevant to improving AI safety, there is an immense amount of opportunity which has appeared in a short amount of time. You cannot assume that all necessary gaps will be filled simply because there is money to be made there.

If your timelines are short, then the imperative to build necessary products sooner rather than later grows even greater. Even if a company is inevitably going to be built in a space, ensuring that it is built 6 months sooner could be the difference between safety being on the market and unsafe AI deployment being the norm.

For many, the alternative to founding a safety company is joining a frontier lab. However, most AI deployment happens outside labs in enterprises, government systems, and consumer-facing applications. If you want to impact how AI meets the world, you may have to go outside of the lab to do it. Your marginal impact inside a large organisation is often, counterintuitively, smaller than your marginal impact on the entire world.

Historical Precedents

History is littered with examples of companies using their expertise and market position to ship safety without first waiting around for permission.

Sometimes this means investing significant resources and domain expertise to develop something new.

  • Three point seatbelt: Volvo developed the three point seatbelt and gave away the patent. Their combination of in-house technical expertise and industry credibility enabled a safety innovation that transformed the global automotive industry.
  • Toyota’s hybrid vehicle patents: Toyota gave away many hybrid vehicle patents in an attempt to accelerate the energy transition.
  • Meta’s release of Llama3: At a time when only a small number of organisations had the resources to train LLMs from scratch, Meta open-sourced Llama3 making it available to safety researchers at a time when little else was in public hands.

Or perhaps the technology already exists, and what matters is having the market position to distribute it or the credibility to change an industry’s standards.

  • Levi-Strauss supply chain audit: At the peak of their market influence, Levi-Strauss audited their supply chain insisting on certain minimum worker standards to continue dealing with suppliers. They enforced workers rights’ in jurisdictions where mistreatment of employees was either legal or poorly monitored, doing what governments couldn’t or weren’t prepared to do.
  • Cloudflare’s Project Galileo: Cloudflare provides security for small, sensitive websites at no cost. This helps journalists and activists operating in repressive countries avoid being knocked off the internet, and is entirely enabled by Cloudflare’s technology.
  • WhatsApp end-to-end encryption: The technology existed, and the cryptography research was mature by this point. WhatsApp just built it into their product, delivering privacy protection to billions of users worldwide.
  • Security for fingerprint and face recognition: Apple stores face and fingerprint data in a separate chip, making it impossible to steal or legally demand. This did not require regulation; this decision actually led to clashes with the US government. Because of their market position, Apple was able to push this security feature and protect hundreds of millions of users unilaterally.

All of these required a large company’s resources, expertise, credibility, and market integration to create and distribute valuable technology to the world.

Building a for-profit which customers depend on, be it for observability, routing, or safety-tooling, lets you ship safety improvements directly into the ecosystem. When the research exists and the technology is straightforward, a market leader choosing to build it may be the only path to real-world implementation.

It’s Up To You.

For-profits are in a fundamentally strong position to access capital, talent, and information. By selling to other businesses and becoming integrated in AI development, they can not only identify the most pressing issues but directly intervene in them. They build the technological and social environment that makes unsafe products unacceptable and security a commodity to be purchased and relied upon.

Non-profits have done, and will continue to do, critical work in AI safety. But the ecosystem is lopsided. We have researchers and advocates, but not enough builders turning their insights into products that companies buy and depend on. The feedback loops, distribution channels, and ability to rapidly scale that for-profits provide are a necessity if safety is to keep pace with capabilities.

The research exists. The techniques are maturing. Historical precedents show us that companies embedded in an industry can ship safety in ways that outsiders cannot. What’s missing are the people willing to found, join, and build companies that close the gap between safety as a research topic and safety as a market expectation. We cannot assume that markets will bridge this divide on their own in the time we have left. If you have the skills and the conviction, this is a gap you can fill!

If you’re thinking about founding something, joining an early-stage AI safety company, or want to pressure-test an idea - reach out at team@bluedot.org. We’re always happy to talk.

BlueDot’s AGI Strategy Course is also a great starting point - at least 4 startups have come out of it so far, and many participants are working on exciting ideas. Apply here.

Thanks to Ben Norman, Daniel Reti, Maham Saleem, and Aniket Chakravorty for their comments.

Lysander Mawby is a graduate of BlueDot’s first Incubator Week, which he went on to help run in v2 and v3. He is now building an AI safety company and taking part in FR8. Josh Landes is Head of Community and Events at BlueDot and, with Aniket Chakravorty, the initiator of Incubator Week.



Discuss

Draft Moskovitz: The Best Last Hope For Constructive AI Governance

Новости LessWrong.com - 7 марта, 2026 - 03:57
Introduction:

The next presidential election represents a significant opportunity for advocates of AI safety to influence government policy. Depending on the timeline for the development of artificial general intelligence (AGI), it may also be one of the last U.S. elections capable of meaningfully shaping the long-term trajectory of AI governance. 

Given his longstanding commitment to AI safety and support for institutions working to mitigate existential risks, I advocate for Dustin Moskovitz to run for president. I expect that a Moskovitz presidency would substantially increase the likelihood that U.S. AI policy prioritizes safety. Even if such a campaign would be unlikely to win outright, supporting it would still be justified, as such a campaign would elevate AI Safety onto the national spotlight, influence policy discussions, and facilitate the creation of a pro-AI-Safety political network.

The Case for AI Governance:

 

Governments are needed to promote AI safety because the dynamics of AI development make voluntary caution difficult, and because AI carries unprecedented risk and transformative potential. Furthermore, the US government can make a huge difference for a relatively insignificant slice of its budget.

The Highly Competitive Nature of AI and a Potential Race to the Bottom:

There’s potentially a massive first-mover advantage in AI. The first group to develop transformative AI could theoretically secure overwhelming economic power by utilizing said AI to kick off a chain of recursive self improvement, where first human AI researchers gain dramatic productivity boosts by using AI, then AI itself. Even without recursive improvement, however, being a first mover in transformational AI could still have dramatic benefits.

Incentives are distorted accordingly. Major labs are pressured to move fast and cut corners—or risk being outpaced. Slowing down for safety often feels like unilateral disarmament. Even well-intentioned actors are trapped in a race-to-the-bottom dynamic, as all your efforts to ensuring your model is safe is not that relevant if an AI system developed by another, less scrupulous company becomes more advanced than your safer models. Anthropic puts it best when they write "Our hypothesis is that being at the frontier of AI development is the most effective way to steer its trajectory towards positive societal outcomes." The actions of other top AI companies also reflect this dynamic, with many AI firms barely meeting basic safety standards.

This is exactly the kind of environment where governance is most essential. Beyond my own analysis, here is what notable advocates of AI safety have said on the necessity of government action and the insufficiency of corporate self-regulation: 

“‘My worry is that the invisible hand is not going to keep us safe. So just leaving it to the profit motive of large companies is not going to be sufficient to make sure they develop it safely,’ he said.  ‘The only thing that can force those big companies to do more research on safety is government regulation.’”

 

I don't think we've done what it takes yet in terms of mitigating risk. There's been a lot of global conversation, a lot of legislative proposals, the UN is starting to think about international treaties — but we need to go much further. {...} There's a conflict of interest between those who are building these machines, expecting to make tons of money and competing against each other with the publicWe need to manage that conflict, just like we've done for tobacco, like we haven't managed to do with fossil fuels. We can't just let the forces of the market be the only force driving forward how we develop AI.”

Many researchers working on these systems think that we’re plunging toward a catastrophe, with more of them daring to say it in private than in public; but they think that they can’t unilaterally stop the forward plunge, that others will go on even if they personally quit their jobs. And so they all think they might as well keep going. This is a stupid state of affairs, and an undignified way for Earth to die, and the rest of humanity ought to step in at this point and help the industry solve its collective action problem."

The Magnitude of AI Risks:

Beyond the argument from competition, there is also the question about who gets to make key decisions about what type of risks should be taken in the development of AI. If AI has the power to permanently transform society or even destroy it, it makes sense to leave critical decisions about safety to pluralistic institutions rather than unaccountable tech tycoons. Without transparency, accountability, and clear safety guidelines, the risk for AI catastrophe seems much higher. 

To illustrate this point, imagine if a family member of a leader of a major AI company (or the leader themselves) gets late stage cancer or another serious medical condition that is difficult to treat with current technology. It is conceivable that the leader would attempt to develop AI faster in order to increase their or their family member's personal chance of survival, whereas it would be in the best interest of the society to delay development for safety reasons. While it is possible that workers in these AI companies would speak out against the leader’s decisions, it is unclear what could be done if the leader in this example decided against their employees' advice.

This scenarios is not the most likely but there are many similar scenarios and I think it illustrates that the risk appetites, character, and other unique attributes of the leaders and decision makers of these AI companies can materially affect the levels of AI safety that are applied in AI development. While government is not completely insulated from this phenomenon, especially in short timeline scenarios, ideally an AI safety party would be able to facilitate the creation of institutions which would utilize the viewpoints of many diverse AI researchers, business leaders, and community stakeholders in order to create an AI-governance framework which will not give any one (potentially biased) individual the power to unilaterally make decisions on issues of great importance regarding AI safety (such as when and how to deploy or develop highly advanced AI systems). 

The Vast Scope and Influence of Government:

Finally, I think the massive resources of government is an independent reason to support government action on AI safety. Even if you think corporations can somewhat effectively self-regulate on AI and you are opposed to a general pause on AI development, there is no reason the US government shouldn't and can't spend 100 billion dollars a year on AI safety research. This number would be over 20 times greater than 3.7 billion, which was Open AI's total estimated revenue in 2024, but <15% of US defense spending. Ultimately, the US government has more flexibility to support AI safety than corporations, owing simply to its massive size.

The Insufficiency of current US Action on AI Safety:

Despite many compelling reasons existing for the US government to act on AI safety, the US government has never taken significant action on AI safety, and the current administration has actually gone backwards in many respects. Despite claims to the contrary, the recent AI action plan is a profound step away from AI safety, and I would encourage anyone to read it. The first "pillar" of the plan is literally "Accelerate AI Innovation" and the first prong of that first pillar is to "Remove Red Tape and Onerous Regulation", citing the Biden Administration's executive action on AI (referred to as the "Biden administration's dangerous actions") as an example, despite the fact the executive order did not actually do much and was mainly trying to lay the ground-work for future regulations on AI

The AI Action plan also proposes government investment to advance AI capabilities, suggesting to "Prioritize investment into theoretical computational and experimental research to preserve America's leadership in discovering new and transformative paradigms that advance the capabilities of AI", and while the AI Action plan does acknowledge the importance of "interpretability, control, and robustness breakthroughs", it receives only about two paragraphs in a 28 page report (25 if you remove pages with fewer than 50 words).

However, as disappointing the current administration's stance on AI Safety may be, the previous administration was not an ideal model. According to this post NSF spending on AI safety was only 20 million dollars between 2023 and 2024, and this was ostensibly the main source of direct government support for AI safety. To put that number into perspective, the US Department of Defense spent an estimated 820.3 billion US dollars in FY 2023, and meaning the collective amount spent by represented only approximately 0.00244% of the US Department of Defense spending in FY 2023.

Many people seem to believe that governments will inevitably pivot to promoting an AI safety agenda at some point, but we shouldn't just stand around waiting for that to happen while lobbyists funded by big AI companies are actively trying to shape the government's AI agenda. 

The Power of the Presidency:

The US president could unilaterally take a number of actions relevant for AI Safety. For one, the president could use powers under the IEEPA to essentially block the exports of chips to adversary nations, potentially slowing down foreign AI development and giving the US more leverage in international talks on AI. The same law could also limit the export of such models, shifting the bottom line of said AI companies dramatically. The president could also use the Defense Production Act to require companies to be more transparent about their use and development of AI models, something which also would significantly affect AI Safety. This is just scratching the surface, but beyond what the president could do directly, over the last two administrations we have seen that both houses of congress have largely went along with the president when a single party controlled all three branches of government. Based on the assumption that a Moskovitz presidency would result in a trifecta, it should be easy for him to influence congress to pass sweeping AI regulation that gives the executive branch a ton of additional power to regulate AI.

Long story short, effective AI governance likely requires action from the US federal government, and that would basically require presidential support. Even if generally sympathetic to AI safety, a presidential candidate who does not have a track record of supporting AI Safety will likely be far slower at supporting AI regulation, international AI treaties, and AI Safety investment, and this is a major deal.

The Case for Dustin Moskovitz:

Many people care deeply about the safe development of artificial intelligence. However, from the perspective of someone who cares about AI Safety, a strong presidential candidate would need more than a clear track record of advancing efforts in this area. They would also need the capacity to run a competitive campaign and the competence to govern effectively if elected.

However, one of the central difficulties in identifying such candidates is that most individuals deeply involved in AI safety come from academic or research-oriented backgrounds. While these figures contribute immensely to the field’s intellectual progress, their careers often lack the public visibility, executive experience, or broad-based credibility traditionally associated with successful political candidates. Their expertise, though invaluable, rarely translates into electoral viability.

Dustin Moskovitz represents a rare exception. As a leading advocate and funder within the AI safety community, he possesses both a deep commitment to mitigating existential risks and the professional background to appeal to conventional measures of success. His entrepreneurial record and demonstrated capacity for large-scale organization lend him a kind of legitimacy that bridges the gap between the technical world of AI safety and the public expectations of political leadership. Beyond this, his financial resources also will allow him to quickly get his campaign off the ground and focus less on donations than other potential candidates, a major boon for a presidential nominee. 

In a political environment dominated by short-term incentives, a candidate like Moskovitz—who combines financial independence, proven managerial ability, and a principled concern for the long-term survival of humanity—embodies an unusually strong alignment between competence, credibility, and conscience.

The Value of a Moskovitz Presidency:I can see that Accelerationist is misspelled, you get the point anyways.

The best way to assess the impact of a Moskovitz presidency on AI Safety is to compare him to potential alternative presidents. On the republican side, prediction markets currently favor J. D Vance, who famously stated at an AI summit: "The AI future is not going to be won by hand-wringing about safety. It will be won by building -- from reliable power plants to the manufacturing facilities that can produce the chips of the future."

Yikes.

On the Democrat side, things aren't much better. Few Democratic politicians with presidential ambitions have clearly committed themselves to supporting AI Safety, and even if they would hypothetically support some AI Safety initiatives, they would clearly be less prepared to do so than a hypothetical President Moskovitz.

Is this Actually Feasible?:

I believe that if Dustin Moskovitz decided to run for president today with the support of the rationalist and effective altruist communities, he would have a non-zero chance of winning the Democratic nomination. The current Democratic bench is not especially strong. Figures such as Gavin Newsom and Alexandria Ocasio-Cortez both face significant limitations as national candidates. 

Firstly, Newsom’s record in California could be fruitful ground for opponents. California has seen substantial out-migration over the past several years, with many residents leaving for states with lower housing costs and fewer regulatory barriers. At the same time, California faces a severe housing affordability crisis driven by restrictive zoning, slow permitting processes, and high construction costs. These issues have become national talking points and have raised questions about the effectiveness of governance in a state often seen as a policy model for the Democratic Party.

AOC, on the other hand, has relatively limited executive experience, and might not even run in the first place. 

Against this backdrop, Dustin Moskovitz’s profile as a billionaire outsider could be a political asset. Although wealth can be a liability in Democratic primaries, his outsider status and independence from the traditional political establishment could make him more competitive in a general election. Unlike long-serving politicians, he would enter the race without decades of partisan baggage or controversial votes.

Furthermore, listening to Moskovitz speak, he comes across as thoughtful and generally personable. While it is difficult to judge how effective he would be as a campaigner based only on interviews, there is little evidence suggesting he would struggle to communicate his ideas or connect with voters. Given his experience building and leading organizations, as well as his involvement in major philanthropic initiatives, it is plausible that he could translate those skills into a disciplined and competent campaign.

Nevertheless, I pose a simple question: if not now then when? If the people will only respond to a pro-AI regulation message only after they are unemployed, then there is no hope for AI governance anyways, because by the time AI is directly threatening to cause mass unemployment, it will likely be to late to do anything.

 

Is feasibility all that matters?:

Even if Dustin Moskovitz is unable to win the Democratic nomination, he could potentially gather enough support to play kingmaker in a crowded field and gain substantial leverage over the eventual Democratic nominee. Furthermore, if Moskovitz runs for president, it would provide a blueprint and precedence for future candidates who support AI safety. This, combined with the attention the Moskovitz campaign would bring towards AI Safety, could help justify the Moskovitz campaign on consequentialist grounds. 

What About Activism?:

Grassroots movements, while capable of profound social transformation, often operate on timescales far too slow to meaningfully influence AI governance within the short window humanity has to address the technology’s risks. Even if one doubts the practicality of persuading a future president to prioritize AI safety, such a top-down approach may remain the only plausible way to achieve near-term impact. History offers sobering reminders of how long bottom-up change can take. The civil rights movement, one of the most successful in American history, required nearly a decade—beginning around 1954 with Brown v. Board of Education—to achieve its landmark legislative victories, despite decades of groundwork laid by organizations like the NAACP beforehand. The women’s suffrage movement took even longer: from the Seneca Falls Convention in 1848 to the ratification of the Nineteenth Amendment in 1920, over seventy years passed before American women secured the right to vote. Similarly, the American anti-nuclear movement succeeded in slowing the growth of nuclear energy but failed to eliminate nuclear weapons or ensure lasting disarmament, and many of its limited gains have since eroded.

Against this historical backdrop, the idea of a successful AI safety grassroots movement seems implausible. The issue is too abstract, too technical, and too removed from everyday life to inspire widespread public action. Unlike civil rights, women’s suffrage, or nuclear proliferation—issues that directly touched people’s identities, freedoms, or survival—AI safety feels theoretical and distant to most citizens. While it is conceivable that economic disruption from automation might eventually stir public unrest, such a reaction would almost certainly come too late to steer the direction of AI development. Worse, mass discontent could be easily defused by the major AI corporations through material concessions, such as the introduction of a universal basic income, without addressing the underlying safety or existential concerns. In short, the historical sluggishness of grassroots reform, combined with the abstract nature of the AI problem, suggests that bottom-up mobilization is unlikely to arise—or to matter—before the most consequential decisions about AI are already made.

What About Lobbying?:

One major way in which people who care to influence governmental support for AI safety policies have sought to influence government has been through lobbying organizations and other forms of activism. However, there is reason to doubt they will be able to cause lasting change. First of all, there is significant evidence that lobbying has a status quo bias lobbying has a status quo bias. Lobbying is most effective when it relates to preventing changes, and when there are two groups of lobbyists on an issue, the lobbying to prevent change win out, all else being equal. In fact, according to a study by Dr. Amy McKay, "it takes 3.5 lobbyists working for a new proposal to counteract just one lobbyist working against it".

Even if this effect did not exist, however, it is very unlikely AI safety groups will be able to compete with anti-AI-safety lobbyists. Naturally, the rise of large, transnational organizations built to profit around AI has also lead to a powerful pro-AI lobbyist operation. This indicates we can't simply rely on current strategies of simply funding AI-safety advocacy organizations, as they will be eclipsed by better funded pro-AI-Business voices.

Conclusion:

While having Dustin Moskovitz run for office would be far from a guarantee, it is the best way for pro-AI-Safety Americans to influence AI governance before 2030.  

 

This post has been written with the assistance of Chat-Gpt, and the images in this post were generated by Copilot, Gemini, and Chat-Gpt.



Discuss

More is different for intelligence

Новости LessWrong.com - 7 марта, 2026 - 03:02
Why did software change the world?

In the 1900s, much of the work being done by knowledge workers was computation: searching, sorting, calculating, tracking. Software made this work orders of magnitude cheaper and faster.

Naively, one might expect businesses and institutions to carry out largely the same processes, just more efficiently1. But rather, the proliferation of software has also allowed for new kinds of processes. Instead of reordering inventory when shelves looked empty, supply chains began replenishing automatically based on real-time demand. Instead of valuing stocks quarterly, financial markets started pricing and trading continuously in milliseconds. Instead of designing products on intuition, teams began running thousands of simultaneous experiments on live users. Why did a quantitative shift in the cost of computation result in a qualitative shift in the nature of organizations?

It turns out that a huge number of useful computations were never being done at all. Some were prohibitively expensive to perform at any useful scale by humans alone. But most simply went unimagined, because people don’t design processes around resources they don’t have. Real-time pricing, personalized recommendation systems, and algorithmic trading were all inconceivable prior to early 2000s.

These latent processes could only emerge after the cost of computation dropped.

Organizing Agent Societies

The forms of knowledge work that persisted to today are now being made more efficient by LLMs. But while they’ve enhanced the efficiency of human nodes on the graph of production, the structure of the graph has still been left intact.

In software development in particular, knowledge work consists of designing systems, implementing code, and coordinating decisions across teams. Humans have developed rich toolkits for distributing such cognitive work – think standups, code review, memos, design docs. At their core, these are protocols for coordinating creation and judgement across many people.

The forms that LLMs take on mediate the ways that they plug into these toolkits. Their primary manifestation – chatbots and coding agents – implies a kind of pair-programming, with one agent and one human working side by side. In this capacity, they’re able to write code and give feedback on code reviews. But not much more.

In this sense, we’re in the “naive” phase of LLM applications. LLMs might make writing code and debugging more efficient2, but the nature of work hasn’t changed much.

The first software transition didn’t just make existing computations faster; it allowed for entirely new kinds of computation. LLMs, if given the right affordances, will reveal cognitive flows we haven’t imagined yet.

A new metascience

How should cognitive work be organized? For most of history, this has been a question for metascience, management theory and post-hoc rationalization. Soon, this question will be able to be answered by experiment. We’re curious to see what the answers look like.



Discuss

Your Causal Variables Are Irreducibly Subjective

Новости LessWrong.com - 7 марта, 2026 - 01:59

Mechanistic interpretability needs its own shoe leather era. Reproducing the labeling process will matter more than reproducing the Github.

Crossposted from Communication & Intelligence substack

When we try to understand large language models, we like to invoke causality. And who can blame us? Causal inference comes with an impressive toolkit: directed acyclic graphs, potential outcomes, mediation analysis, formal identification results. It feels crisp. It feels reproducible. It feels like science.

But there is a precondition to the entire enterprise that we almost always skip past: you need well-defined causal variables. And defining those variables is not part of causal inference. It is prior to it — a subjective, pre-formal step that the formalism cannot provide and cannot validate.

Once you take this seriously, the consequences are severe. Every choice of variables induces a different hypothesis space. Every hypothesis space you didn't choose is one you can't say anything about. And the space of possible causal models compatible with any given phenomenon is not merely vast in the familiar senses — not just combinatorial over DAGs, or over the space of all possible parameterizations — but over all possible variable definitions, which is almost incomprehensibly vast. The best that applied causal inference can ever do is take a subjective but reproducible set of variables, state some precise structural assumptions, and falsify a tiny slice of the hypothesis space defined by those choices.

That may sound deflationary. I think embracing this subjectivity is the path to real progress — and that attempts to fully formalize away the subjectivity of variable definitions just produce the illusion of it.

The variable definition problem

The entire causal inference stack — graphs, potential outcomes, mediation, effect estimation — presupposes that you have well-defined, causally separable variables to work with. If you don't have those, you don't get to draw a DAG. You don't get to talk about mediation. You don't get potential outcomes. You don't get any of it.

Hernán (2016) makes this point forcefully for epidemiology in "Does Water Kill? A Call for Less Casual Causal Inferences." The "consistency" condition in the potential outcomes framework requires that the treatment be sufficiently well-defined that the counterfactual is meaningful. To paraphrase Hernán's example: you cannot coherently ask "does obesity cause death?" until you have specified what intervention on obesity you mean — induced how, over what timeframe, through what mechanism? Each specification defines a different causal question, and the formalism gives you no guidance on which to choose.

This is not a new insight. Freedman (1991) made the same argument in "Statistical Models and Shoe Leather." His template was John Snow's 1854 cholera investigation, the study that proved cholera was waterborne. Snow didn't fit a regression. He went door to door, identified which water company supplied each household, and used that painstaking fieldwork to establish a causal claim no model could have produced from pre-existing data. Freedman's thesis was blunt: no amount of statistical sophistication substitutes for deeply understanding how your data were generated and whether your variables mean what you think they mean. As he wrote: "Naturally, there is a desire to substitute intellectual capital for labor." His editors called this desire "pervasive and perverse." It is alive and well in LLM interpretability.

When a mechanistic interpretability researcher says "this attention head causes the model to be truthful," what is the well-defined intervention? What are the variable boundaries of "truthfulness"? In practice, we can only assess whether our causal models are correct by looking at them and checking whether the variable values and relationships match our subjective expectations. That is vibes dressed up as formalism.

The move toward "black-box" interpretability over reasoning paths has only made this irreducible subjectivity even more visible. The ongoing debate over whether the right causal unit is a token, a sentence, a reasoning step, or something else entirely (e.g., Bogdan et al., 2025; Lanham et al., 2023; Paul et al., 2024) is not a technical question waiting for a technical answer — it is a subjective judgment that only a human can validate by inspecting examples and checking whether the chosen granularity carves reality sensibly. [1]

We keep getting confused about interventions

Across 2022–2025, we've watched a remarkably consistent pattern: someone proposes an intervention to localize or understand some aspect of an LLM, and later work reveals it didn't measure what we thought. [2] Each time, later papers argue "the intervention others used wasn't the right one." But we keep missing the deeper point: it's all arbitrary until it's subjectively tied to reproducible human judgments.

I've perpetuated the "eureka, we found the bug in the earlier interventions!" narrative myself. In work I contributed to, we highlighted that activation patching interventions (see Heimersheim & Nanda, 2024) weren't surgical enough, and proposed dynamic weight grafting as a fix (Nief et al., 2026). Each time we convince ourselves the engineering is progressing. But there's still a fundamentally unaddressed question: is there any procedure that validates an intervention without a human judging whether the result means what we think?

It is too easy to believe our interventions are well-defined because they are granular, forgetting that granularity is not validity. [3]

Shoe leather for LLM interpretability

Freedman's lesson was that there is no substitute for shoe leather — for going into the field and checking whether your variables and measurements actually correspond to reality. So what is shoe leather for causal inference about LLMs?

I think it has three components. First, embrace LLMs as a reproducible operationalization of subjective concepts. A transcoder feature, a swapped sentence in a chain of thought, a rephrased prompt — these are engineering moves, not variable definitions. "Helpfulness" is a variable. "Helpfulness as defined by answering the user's request without increasing liability to the LLM provider" is another, distinct variable. Describe the attribute in natural language, use an LLM-as-judge to assess whether text exhibits it, and your variable becomes a measurable function over text — reproducible by anyone with the same LLM and the same concept definition.

Second, do systematic human labeling of all causal variables and interventions to verify they actually match what you think they should. If someone disagrees with your operationalization, they can audit a hundred samples and refine the natural language description. This is the shoe leather: not fitting a better model, but checking — sample by sample — whether your measurements mean what you claim.

Third — and perhaps most importantly — publish the labeling procedure, not just the code. A reproducible natural language specification of what each variable means, along with the human validation that confirmed it, is arguably more valuable than a GitHub repo. It is what lets someone else pick up your variables, contest them, refine them, and build on your falsifications rather than starting from scratch.

Variable definitions are outside the scope of causal inference. Publishing how you labeled them matters more than publishing your code.

Trying to practice what I'm preaching

RATE (Reber, Richardson, Nief, Garbacea, & Veitch, 2025) came out of trying to do this in practice — specifically, trying to scale up subjective evaluation of traditional steering approaches in mech interp. We knew from classical causal inference that we needed counterfactual pairs to measure causal effects of high-level attributes on reward model scores. Using LLM-based rewriters to generate those pairs was the obvious move, but the rewrites introduced systematic bias. Fixing that bias — especially without having to enumerate everything a variable can't be — turned out to be a whole paper. The core idea: rewrite twice, and use the structure of the double rewrite to correct for imperfect counterfactuals.

The punchline

Building causal inference on top of subjective-but-reproducible variables is harder than it sounds, and there's much more to do. But I believe the path is clear, even if it's narrow: subjective-yet-reproducible variables, dubious yet precise structural assumptions, honest statistical inference — and the willingness to falsify only a sliver of the hypothesis space at a time.

Every causal variable is a subjective choice — and because the space of possible variable definitions is vast, so is the space of causal hypotheses we'll never even consider. No formalism eliminates this. No amount of engineering granularity substitutes for a human checking whether a variable means what we think it means. The best we can do is choose variables people can understand, operationalize them reproducibly, state our structural assumptions precisely, and falsify what we can. That sliver of real progress beats a mountain of imagined progress every time.

References

Beckers, S. & Halpern, J. Y. (2019). Abstracting causal models. AAAI-19.

Beckers, S., Halpern, J. Y., & Hitchcock, C. (2023). Causal models with constraints. CLeaR 2023.

Bogdan, P. C., Macar, U., Nanda, N., & Conmy, A. (2025). Thought anchors: Which LLM reasoning steps matter? arXiv:2506.19143.

Freedman, D. A. (1991). Statistical models and shoe leather. Sociological Methodology, 21, 291–313.

Geiger, A., Wu, Z., Potts, C., Icard, T., & Goodman, N. (2024). Finding alignments between interpretable causal variables and distributed neural representations. CLeaR 2024.

Geiger, A., Ibeling, D., Zur, A., et al. (2025). Causal abstraction: A theoretical foundation for mechanistic interpretability. JMLR, 26, 1–64.

Hase, P., Bansal, M., Kim, B., & Ghandeharioun, A. (2023). Does localization inform editing? NeurIPS 2023.

Heimersheim, S. & Nanda, N. (2024). How to use and interpret activation patching. arXiv:2404.15255.

Hernán, M. A. (2016). Does water kill? A call for less casual causal inferences. Annals of Epidemiology, 26(10), 674–680.

Lanham, T., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. Anthropic Technical Report.

Makelov, A., Lange, G., & Nanda, N. (2023). Is this the subspace you are looking for? arXiv:2311.17030.

Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.

Nief, T., et al. (2026). Multiple streams of knowledge retrieval: Enriching and recalling in transformers. ICLR 2026.

Paul, D., et al. (2024). Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. Findings of EMNLP 2024.

Reber, D., Richardson, S., Nief, T., Garbacea, C., & Veitch, V. (2025). RATE: Causal explainability of reward models with imperfect counterfactuals. ICML 2025.

Schölkopf, B., Locatello, F., Bauer, S., et al. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634.

Sutter, D., Minder, J., Hofmann, T., & Pimentel, T. (2025). The non-linear representation dilemma. arXiv:2507.08802.

Wang, Z. & Veitch, V. (2025). Does editing provide evidence for localization? arXiv:2502.11447.

Wu, Z., Geiger, A., Huang, J., et al. (2024). A reply to Makelov et al.'s "interpretability illusion" arguments. arXiv:2401.12631.

Xia, K. & Bareinboim, E. (2024). Neural causal abstractions. AAAI 2024, 38(18), 20585–20595.

  1. Causal representation learning (e.g., Schölkopf et al., 2021) doesn't help here either. Weakening "here is the DAG" to "the DAG belongs to some family" is still a structural assertion made before any data is observed. ↩︎

  2. Two threads illustrate the pattern. First: ROME (Meng et al., 2022) used causal tracing to localize factual knowledge to specific MLP layers — a foundational contribution. Hase et al. (2023) showed the localization results didn't predict which layers were best to edit. Wang & Veitch (2025) showed that optimal edits at random locations can be as effective as edits at supposedly localized ones. Second: DAS (Geiger, Wu, Potts, Icard, & Goodman, 2024) found alignments between high-level causal variables and distributed neural representations via gradient descent. But Makelov et al. (2023) demonstrated that subspace patching can produce "interpretability illusions" — changing output via dormant pathways rather than the intended mechanism — to which Wu et al. (2024) argued these were experimental artifacts. Causal abstraction has real theoretical foundations (e.g., Beckers & Halpern, 2019; Geiger et al., 2025; Xia & Bareinboim, 2024), but it cannot eliminate the subjectivity of variable definitions — only relocate it. Sutter et al. (2025) show that with unconstrained alignment maps, any network can be mapped to any algorithm, rendering abstraction trivially satisfied. The linearity constraints practitioners impose to avoid this are themselves modeling choices — ones that can only be validated by subjective judgment over the data. ↩︎

  3. There is also a formal issue lurking here: structural causal models require exogenous noise, and neural network computations are deterministic. Without extensions like Beckers, Halpern, & Hitchcock's (2023) "Causal Models with Constraints," we don't even have a well-formed causal model at the neural network level — so what are we abstracting between? ↩︎



Discuss

Mox is the largest AI Safety community space in San Francisco. We're fundraising!

Новости LessWrong.com - 7 марта, 2026 - 01:07

Summary: Mox is fundraising to maintain and grow AIS projects, build a compelling membership, and foster other impactful and delightful work. We're looking to raise $450k for 2026, and you can donate on Manifund!

OverviewWho we are

Mox is SF’s largest AI safety coworking space, and also its primary Effective Altruism community space. We opened just over a year ago, and over the last year, we’ve served high-impact work in and around AI safety by hosting conferences, fellowships, events, and incubating new organizations.

Our theory of change is to provide good infrastructure (offices, event space) and a high density of collegial interactions to people and projects we admire. We're not focusing on a single specific thesis on AI safety. Instead, we aim to support many sorts of people and organizations who: 

  1. agree that transformative AI is on the horizon,
  2. have a strong thesis about what it means for the world to go well,
  3. are working on a project that we think credibly advances their thesis.

This includes many projects that are directly AI Safety, such as Seldon Lab, the broader Effective Altruist sphere such as Sentient Futures, and even more broadly non-EA projects by EAs or from EA and rationalist-friendly corners of the SF tech scene, such as Arbor Trading Bootcamp (pictured below). Many more examples of such work are are given in the "Current Operations" heading.

Our team also works really hard to make Mox a fun and cozy place to be! We want to be a comfortable place for people to gather. Good communities grow in good spaces.

After an event-filled first year that included a visit from AI safety bill sponsor Senator Scott Weiner, we wrapped up with the 200-person Sentient Futures Summit, and are now looking forward to a second year building on the successes of the first! 

Why we're raising

We think our highest impact is still in the future, building upon the strength of the ops team we've built in the first year, and we’re looking for private and organizational funding to carry us through the next year, for the purposes described below. We operate at a metered loss, since our mission is to support and develop a community, rather than to maximize profit.

Since we're a large venue of 40,000 sqft [3,700 meters squared!], with a capacity of 300+ desks, our operating costs are relatively high. On the other hand, we have lots of room to grow within the space! Our mainline ask is $450k, and we think we can successfully deploy funding up to 1.2 million dollars to run more great events, improve our space, and incubate new fellowships.

How to donate

To donate, find us on Manifund, or contact me directly:
Rachel Shu, Mox Director rachel@moxsf.com

Funding milestones⬜ Raise $100k by March 15th (will be 1:1 matched!)

To kick off, an anonymous donor has offered us 1:1 donation matching up to $100k. We’d like to hit at least this goal! If we can get this, we’ll be able to extend our runway (which is only a few months long), and be able to commit to future operations.

How you can help out

  • Make a free-and-clear donation via Manifund, even any small amount helps with the matching.
  • Consider sponsoring our community and our events in the following ways: Mox Sponsor Opportunities
⬜ Raise $450k by April 1st

This is the amount we need to meet our expected operational demand over the next 6 months, and we are hoping to raise it from individual donors who can donate quickly. Our anticipated monthly burn is only $30k more than our revenue, but we also are looking to build a cash reserve for unforeseen circumstances.

With our minimum operating budget for the year secured and reserves in the bank, we’ll be able to:

  • Continue ongoing operations, with buffer against unexpected circumstances. ($200k)
  • Expand our headcount by 1-2 more staff to improve and expand existing events and coworking operations. (up to $250k.)
  • Run regular high-frequency, high-profile talk series (included in staff cost.)
  • Guarantee our Global Expert Fellowship, which is a J-1 visa program designed to bring in international AI safety experts (included in staff cost.)
⬜ Raise $1.2m by June 1st

This will probably include grants made by our organizational funders. This is the median funding that we think we can deploy meaningfully this year. With generous funding, some of our ambitions would include:

  • Restructure our interior space to meet member needs, such as adding phone booths and nooks, and improve our interior design. ($400k, mainly construction costs.)
  • Initiate our own fellowship programs, such as the Muybridge fellowship suggested below. ($300-$500k.)
  • Sponsor highly-aligned organizations and individuals at Mox who we think are doing excellent work and need the subsidy, effectively serving as regrantors. ($150k.)
Funding updatesRaised $550k in 2025

In our initial fundraising post a year ago, we proposed three budget tiers — minimal ($1.6M/year), mainline (~$2M), and ambitious ($3.6M). As a then-unproven organization, we did not meet any of our funding goals, raising in total $550k.

Delivered above expectations

What we spent annualized to roughly $1.2M, less than even our ‘minimal’ tier projection. What we delivered landed closer to ‘mainline’: 183 members, 144 Guest Program participants, 15 offices, a team of 5, and 2 tentpole events most months. And from the ‘ambitious’ tier, we succeeded at expanding Mox to all four floors of 1680 Mission. We’ve done this by keeping our team small, finding good deals on rent and furnishings, and charging fair prices to external clients.

Present state of finances

Monthly revenue and expenses spreadsheet: May 2025 - Jan 2026

Revenue: ~$100k/mo

Our ~$100k/mo revenue comes from a mix of paid memberships, private offices, and external fellowships. 

Conversely, most of our events are provided for free or at a low sponsorship cost, with occasional revenue-generating events such as hackathons and happy hours. 

Expenses: ~$130k/mo

Our ~$130k/mo expenses are mainly staff labor and building costs, each composing about 35-40% of our total expenses in a typical month. 

The remainder goes to providing amenities, servicing events, and investing in capital improvements such as improving our interior design.

Sustainable by EOY

Right now, we are in a funding crunch, and have only cash reserves equivalent to a few months of runway.

We believe we're on track to self-sustainability this year, projecting monthly revenue to continue growing by $10-15k/mo for the next 3-6 months. Our main projected growth is currently in memberships and private offices, with events and fellowships remaining steady. 

As our growth tails off, we think our steady state expenses will be ~$160k/month, and we’ll be revenue neutral on expenses and even slightly positive on balance.

Will still seek funding in future years

We anticipate that capital improvements will always come out of endowments, so we plan to continue raising every year.

Current operationsFellowships & programs

We supported five residencies in the last year:

  • FLF Fellowship on AI for Human Reasoning: 30 fellows exploring, researching, and developing potential beneficial AI for Human Reasoning tools.
  • PIBBSS Fellowship 2025 (now Principles of Intelligence): 17 fellows in residence for cross-disciplinary AI safety research.
  • Seldon Lab Accelerator, Batch 1: 4 startups building AI safety infrastructure, including Andon Labs, Workshop Labs, and Lucid Computing.
  • Seldon Lab Accelerator, Batch 2: 6 startups, currently in residence.
  • The Frame Fellowship: An 8-week program for 12 video creators communicating about AI safety, developed in-house at Mox.
How does Mox contribute to these programs’ success?
  • Provide a fully furnished office
  • Situate them alongside other groups doing similar work
  • Provide event venue space directly connected to their workspace
  • Handle daily catering, janitorial and supplies
  • Troubleshoot participant tech
  • Host pre/post conference coworking and social gatherings for various workshops and conferences, including: EAGxBay Area, LessOnline, Manifest, and The Curve, with 500+ total attendees across those days
Public events

We hosted 377 events over the last year, including:

Mox also hosts community events, such as:

  • Effective Altruism SF, biweekly events and meetups
  • Astral Codex Ten SF, monthly meetups
  • 90/30 Club, machine learning paper reading group
  • Mathematics with Lean, a group dedicated to self-guided explorations of the Lean interactive theorem prover

“Mox has been an invaluable resource for us when running EA SF [Effective Altruism San Francisco], since its large and well-equipped facility allowed us to cater food, run speaker events, workshops, and otherwise host much larger and more ambitious events than we otherwise would have been able to.” — G., a lead organizer of EA SF

Individuals & coworking

You can see a list of all our members here: https://moxsf.com/people

We currently have 183 active members; on a typical coworking day, 50-80 people are at Mox. A sampling of individual members who are frequently at Mox:

Member testimonials!

“It feels like a second home, but more lively. I can always expect to run into a friend who is down to cowork or hang.” Constance Li, founder of Sentient Futures

“I can walk up to anyone and have an interesting conversation; every single person I've met here has welcomed questions about their work and been curious about mine.” — Gavriel Kleinwaks, Horizon Fellow

“Mox has the best density of people with the values & capabilities I care about the most. In general, it's more social & feels better organized for serendipity vs any coworking space I've been to before, comparable to perhaps like 0.3 Manifests per month.” — Venki Kumar

Sourced from our August feedback survey.

Private offices and partner organizations

In Year 1, Mox was home to 15 private offices, including:

  • Sentient Futures: promoting animal welfare and sentience research
  • Tampersec: building physical computing infrastructure security
  • Andon Labs: building autonomous organizations such as Project Vend, via Seldon accelerator
  • Pantograph: building a preschool for robots
  • BlueDot Impact (pending visa): online courses for AI safety upskilling

We also maintain a Guest Program with 19 partner organizations to give their teams free drop-in access.

Public program partners include: MIRI, FAR.AI, Redwood Research, BlueDot Impact, Palisade, GovAI, EPOCH, AI Impacts, Timaeus, Elicit, Evitable, and MATS.

“Our teammates visit San Francisco a couple of times a month. Instead of renting a coworking spot, Mox gives us a familiar space with friendly faces that we reliably run into. It feels closer to going to the college library with friends than to an office. We hang out there for many hours after our work is done!” Deger Turan, CEO of Metaculus

Upcoming plansGrow and improve our main offerings

Events

  • More major conferences like Sentient Futures Summit
  • More public talks with key speakers like Senator Weiner
  • Improve our first floor and make it highly usable and more publicly accessible, building our ability to provide a good space, which mostly shows up in the impact we have, and somewhat in revenue.

Programs

  • Serve repeat cohorts of the fellowship programs that have used our space so far
  • Additionally serve 3-7 new fellowships and workshops in this coming year

Coworking

  • Continue growing our community of individual members to 120-150 daily users, 300+ total members
  • Maintain the ability to select private offices based on fit, rather than market rate
  • Create additional meeting rooms and other communal areas in the coworking space
Build an SF hub for animal welfare

Perhaps to some people's surprise, there isn't one yet, in the same way that surprisingly, no SF hub for AI safety existed until Mox came about. Mox can be that hub! 

Sentient Futures has found our space ideal for providing Pro-Animal Coworking days, AIxAnimals mixers, and Revolutionists Night lectures, bringing together much of the animal welfare scene in San Francisco. What is still needed is more animal welfare organizations to come onboard to create a dedicated shared section of our coworking space. Ultimately, we hope to replicate for animal welfare what we've done to coalesce the spread-out AIS community in SF!

Attract international talent via Global Expert Fellowship

A key part of our second-year vision is the Global Expert Fellowship: hosting independent researchers, domain specialists, and builders through J-1 visa programs to create new frontier technology collaborations within the Mox community and internationally.

We think this is the highest-impact thing we can achieve this year. It has immediate external impact by enabling independent researchers to quickly enter the US to do work, and it strengthens Mox by expanding our network of high-quality talent. Mox is in a rare position to pull this off, as we are able to meet State Department requirements for visa-qualifying cultural exchange which many other organizations cannot.

Incubate new workshops and programs

We have an advantage in creating our own programs, sourcing from the talent pool we're developing.

Upcoming example: the Muybridge Fellowship for Visual Interpretability, which would bring together technical visual and interactive pioneers to improve the presentation of mechanistic interpretability research and broaden its accessibility. This builds on the experience gained running the existing Frame Fellowship.

To donate, find us on Manifund, or contact me directly:
Rachel Shu, Mox Director rachel@moxsf.com



Discuss

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Новости LessWrong.com - 7 марта, 2026 - 01:02

Paper

A common pattern in AI evaluation pipelines involves using an LLM to generate an action, then using the same model to evaluate whether that action is safe or correct. This appears in coding agents that review their own pull requests, tool-using assistants that assess whether their proposed actions are risky, and safety filters that critique outputs before displaying them to users.

AI developers may assume that evaluation quality is independent of whether the evaluator generated the content being evaluated. Our findings suggest this assumption is sometimes false for current AIs, and we show that using static monitor evaluations can thus result in overoptimistic estimations of monitor quality.

This phenomenon shares many properties in common with the problem of monitors colluding with agents. While the phenomenon we observe is very unlikely to be collusion motivated by scheming, studying it further may help us understand scheming-motivated collusion better.

When asked to evaluate the risk of an action in a fresh context (left), models usually correctly assign a high risk score. But when instead we ask for a risk score as a follow-up question (after the risky
action has been taken) (right), models often rate the action they just took as less risky than in the baseline setting, despite the judged action being the same. This effect is strongest when the model rating the action is actually the one that generated the action (instead of the
action having been generated by e.g. another model)Experimental Setup

We study the effects of attributing actions to the same model as the model doing the monitoring either explicitly, via text that directly states authorship (e.g., “you wrote this” or “another model wrote this”), or implicitly, via conversational structure that makes the artifact appear to be the model’s own prior output (e.g. rating a patch appearing in a previous assistant turn or as a follow-up to the model’s own generation):

  1. Same turn self-attribution (implicit attribution): The model is asked to generate an artifact and evaluates it within the same response. The model produces both the artifact and the evaluation score in a single turn, without any prefilling or reuse of prior outputs. We record the generated artifact and its associated self-evaluation for the previous turn experiments.
  2. Previous-turn self-attribution (implicit attribution): The model generates an artifact in one turn, and in a subsequent turn the artifact is resurfaced explicitly as the model’s own prior output and evaluated. We take our fixed artifact from the same turn experiments and prefill it as assistant content, after which the user asks the model to rate it. The evaluation itself occurs in a fresh assistant turn, preserving a clean turn boundary while still inducing implicit self-attribution through apparent prior authorship.
  3. Baseline (no implicit attribution): The model evaluates an artifact (e.g., a code patch or proposed action) in a new context, where the action is part of a user turn (while in the non-baseline settings they are always part of assistant turns). The artifact is by default presented as an unlabeled solution or as originating from an unspecified source, but we also study using a prompt explicitly attributing the artifact to either the same AI or another AI.

We measured self-attribution bias across these domains:

Code correctness and harmfulness. Models generate patches for SWE-Bench issues, then rate their own correctness. In the code harmfulness settings, we modified repository contexts to include prompt-injections, testing whether models flag following these prompt injections as harmful.

Computer use risk assessment. Models take actions in simulated computer environments (email triage, link navigation, response to system notifications) and rate the risk of those actions.

Agentic misalignment scenarios. Using a dataset where AI assistants can leak confidential documents, we measured whether models rate their own information-leaking actions as less risky.

Figure 2. Self-attribution bias affects both correctness and safety judgments in code review. (a) Self-attribution reduces perceived security risk. For ease of comparison across panels, we plot 10 − harmfulness so that higher values consistently indicate more favorable judgments, matching the directionality of correctness and approval metrics in (b) and (c). (b) Models judge self-authored patches as more correct, with higher values indicating greater perceived correctness. (c) These effects compound, increasing approval rates for insecure code patches, especially in previous-turn settings.

In all of these settings, we see that the models rate actions as “better” when they are implicitly attributed to them.. For example, in the code review setting:

  • Self-attributed patches receive lower harmfulness ratings
  • Self-attributed patches receive higher correctness ratings
  • PR approval rates for insecure code increase substantially under self-attribution
Off-Policy Evaluations Fail to Detect This Bias

Self-attribution bias is often undetectable in off-policy evaluations. When evaluating a monitor on a fixed dataset of actions generated by another model or process, self-attribution bias is often weak or absent. When the same monitor evaluates the model's own live outputs, performance often degrades substantially.

We plot ratings of code generated by GPT-5 under two evaluation conditions: on-policy, where GPT-5 rates its own code, and off-policy, where Claude-Sonnet-4 rates GPT-5's code. We fit a Gaussian to each cluster of points (described by the legend) by computing covariance matrices, and show the  mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-mn { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mi { display: inline-block; text-align: left; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c1D70E.TEX-I::before { padding: 0.431em 0.571em 0.011em 0; content: "\3C3"; }  Gaussian contours (for legibility) as well as the mean ratings. We find that the cluster of on-policy incorrect ratings is shifted up relative to the off-policy one, which means baseline ratings stay mostly unchanged while self-attributed ratings are inflated. The inflation for correct points is not as big, which results in a worse classifier discrimination on-policy than off-policy.

In our code correctness experiments, off-policy same-turn AUROC was 0.99, but dropped to 0.89 on-policy. Evaluation suites that test only on fixed artifacts would have missed the low performance this monitor would have in real deployments.

We consistently see strong self-attribution on-policy and weak self-attribution off-policy for most pairs of models. Self-attribution bias is high on the diagonal (where models judge their own actions). Off-diagonal entries (Model A judging Model B's output) show lower bias.

Code harmfulness (right) correctness (left) self-attribution bias are strongest when models evaluate their own actions. We show the average gap between baseline and previous- turn ratings for each pair of model, where positive scores correspond to a self-attribution bias. Self-attribution bias is concentrated on the diagonal, indicating this effect is strongest when models evaluate their own outputs.Explicit Attribution Does Not Elicit the Bias

Self-attribution bias is due mostly to implicit authorship, and is much weaker when using different baseline-formatted prompts with different stated authorship ("rate your action" versus "rate this action from another model" versus no attribution statement).

Explicit authorship variation produces similar scores. Models do not substantially change ratings based on stated authorship. In contrast, implicit attribution, placing the action in a prior assistant turn such that it appears to be the model's own output, produces substantially more lenient ratings.

Catastrophic Reversals in Computer Use SettingsSelf-attribution bias results in some large ratings shifts in computer-use settings. Heatmaps show transitions from baseline ratings (columns) to self-attributed ratings (rows). Mass below the diagonal indicates risk underestimation. Scores are normalized to [0, 10]. Same-turn and previous-turn attribution are shown for interactive on-policy computer-use settings.

In computer use scenarios, we observe occasional catastrophic reversals, where actions rated 9–10/10 risk under baseline evaluation receive 0/10 under self-attribution. While not frequent, such reversals occur approximately 5–10 times more often than reversals in the opposite direction.

Extended Reasoning Does Not Mitigate the Bias

Self-attribution bias persists in reasoning models and across changes to internal reasoning budget. Varying the hidden reasoning token budget produces no consistent reduction in self-attribution bias, suggesting that these failures are not simply due to insufficient deliberation. The bias weakens with longer reasoning traces, plausibly because the overall interaction becomes more off-policy.

Reasoning does not mitigate self-attribution bias. Changing the internal reasoning budget for the code harmfulness setting using Claude-Sonnet-4 has no consistent effect on self-attribution bias.Limitations

Our datasets and prompts are simple. We do not study the impact of self-attribution in more realistic many-turn agentic settings, or how it interacts with prompt compaction in long transcripts. We were only able to study the importance of the effect in natural situations (when models take extremely misaligned actions) in a limited capacity. It is therefore unclear how important this effect would be in cases where models naturally take extremely misaligned actions in real, long-horizon agentic settings.

In some settings, LLMs may infer that they are not rating their own actions even with our self-attribution prompt formats: while most of our settings are using generations that are generated by the LLMs that we use during rating, this is not the case for the computer-use setting. Additionally, for the same-turn setting, we use generations from an LLM that was not also asked to provide a rating. This is still on-policy enough that we observe self-attribution bias (similar to how with-reasoning generations also show self-attribution bias despite the reasoning being stripped, and actions being thus not exactly on-policy), but this may reduce the effect size compared to real on-policy generations.

Emergent Collusion Without Scheming

Scheming AIs could intentionally collude to evade monitoring and pursue shared objectives, for example by not reporting their own actions as malicious when they infer that the action stems from another copy of themselves rather than being of an evaluation.

While the phenomenon we studied here is very unlikely to share the same roots as the collusion of scheming monitors, it does share important consequences, since it also results in monitors rating an action more favorably when they recognize they are its author. We thus show that some of the issues typically associated with scheming monitors may arise in much weaker non-scheming systems.

While the failures we exhibit in this work are relatively easy to avoid by pulling the transcript and action to be classified in a new context, our work provides an existence proof of collusion-like failures in non-scheming AIs, and similar failures might exist. Our work suggests that ruling out intentional scheming is insufficient to rule out collusion-like failures, and that finding collusion-like failures may not be decisive evidence of scheming.

Implications For Building Agent Monitors

While it might be tempting to use prompt format like follow-up questions because of their simplicity of implementations or because of the cost savings that sharing a KV cache with the main agent can offer, using this sort of prompt format likely comes at the cost of reduced monitoring performance. Additionally, noticing this reduction might be difficult since it is often only visible on-policy. Developers using this kind of prompt format (despite its performance drawbacks) should avoid relying primarily on static off-policy evaluations (like recent papers on confessions and self-reports, which don’t use off-policy evaluations).

Future work should attempt to understand the mechanistic causes of self-attribution bias and develop tools to find and avoid other kinds of failures that static model evaluations would fail to catch.



Discuss

Thoughts on the Pause AI protest

Новости LessWrong.com - 7 марта, 2026 - 00:50

On Saturday (Feb 28, 2026) I attended my first ever protest. It was jointly organized by PauseAI, Pull the Plug and a handful of other groups I forget. I have mixed feelings about it.

To be clear about where I stand: I believe that AI labs are worryingly close to developing superintelligence. I won't be shocked if it happens in the next five years, and I'd be surprised if it takes fifty years at current trajectories. I believe that if they get there, everyone will die. I want these labs to stop trying to make LLMs smarter.

But other than that, Mrs. Lincoln, I'm pretty bullish on AI progress. I'm aware that people have a lot of non-existential concerns about it. Some of those concerns are dumb (water use)1, but others are worth taking seriously (deepfakes, job loss). Overall I think it'll be good for the human race.

Again, that's aside from the bit where I expect AI to kill us all, which is an important bit.

The ostensible point of the march was trying to get Sam Altman and Dario Amodei to publicly support a "pause in principle" - to support a global pause on AI development backed by international treaty. I think this would be great! (Demis Hassabis has already said he would, though I think his exact words were "I think so" and I'd rather he be a bit more committed.) I think a global pause treaty would be bad for the economy (and through it, bad for the people who participate in the economy) and I don't like the level of government oversight I think it would require; but on the other hand, global human extinction would be pretty bad.

My point estimate is that about 300 people showed up. (80% CI… 200 to 500?) We started outside OpenAI HQ. My girlfriend and I were given orange-and-black placards (PauseAI colors) with messages we endorsed. ("Pause AI", "if you can't steer, don't race", "just don't build AGI until there's expert consensus it won't cause human extinction".) I think about half the placards were like that, a third were Pull the Plug branded (with "Pull the Plug", or with sad-looking electrical sockets and no text), and the rest were assorted individual ones. ("Fuck AI. Fuck it to death". A pig with the ChatGPT logo for a butthole. I'm pretty sure there were also ones I liked.)

A few of the organizers gave brief talks, then we walked to Meta. Two invited guests gave talks there, and we walked to DeepMind. One more talk, and off to Google proper. Two more talks. And then there was a people's assembly, more on that later.

I kinda liked the walking? It felt kinda good to be walking in a crowd of people where a bunch of them seemed to be on board with not committing suicide as a species.

Unfortunately, most of the speeches were frankly dumb. One speaker spent some time talking about how monopoly power is bad, and companies having a fiscal duty to shareholders is bad; since neither OpenAI nor Anthropic has a monopoly on cutting-edge AI or is publicly traded2, I'm not sure why she thought this was relevant. One speaker complained that new data centers were going to be powered by nuclear reactors, as if we're supposed to think nuclear power is a bad thing. One of the hosts repeatedly mentioned threats to children, women and young girls. This was the morning that Pete Hegseth had declared Anthropic a supply chain risk, but someone said that Anthropic had folded to their demands. The organizers can't be blamed for this one, but someone was handing out anti-designer-babies leaflets. (I am pro-designer-babies.)

Mostly I felt like the vibe was a sort of generic lefty anti-big-tech thing, which is not something I want to lend weight to. There were a few references to human extinction, and I liked the speech given by Maxine Fournes (global head of PauseAI), but I felt like the sensible stuff got overshadowed by the dumb.3

How did it turn into this? I don't have much sense of whether the attendees were mostly brought in by PauseAI or by Pull the Plug. But my guess would be that most of the speakers were organized by Pull the Plug or the other organizing groups (maybe one speaker each?), and speakers set the tone more than marginal attendees.

Should I hold my nose and join in anyway? I think it's important for different groups to be able to ally on points of common interest, even if they have deep enduring disagreements. But this didn't particularly feel like the other group was cooperating with me on that. And I'm not really a fan of uncomplicatedly supporting the lesser evil, even if the stakes are high. I don't know how to thread the needle between "Northern Conservative Baptist Great Lakes Region Council of 1912? Die, heretic!" and "I don't like Kang, but at least he opposes Kodos". But I don't think I want to thread it here.

I could imagine myself feeling pretty differently about the whole thing in retrospect depending on the news coverage. If journalists cover this as being about extinction, then maybe I'll feel better about having attended. If they cover it as being about Billionaire Tech CEOs Bad (which I think it mostly was despite the stated purpose), I'll be kinda sad that I gave it a +1 with my presence. What I've seen so far: SWLondoner is surprisingly positive, MIT Technology Review is mixed.

I still feel broadly positive about PauseAI.4 I don't think they acted poorly here. I might go to another protest that they organize. But probably not if they jointly organize it with some other group I dislike.

My feelings about the chants I remember:

  • "Pause AI! Pause AI!" Yes good.
  • "Pause AI! 'Cause we don't wanna die!" Even better.
  • "Pull the plug! Pull the plug!" Eh. I'd think this was okay, less good than "pause AI" because it's less specific, but fine. But it's the name of a group that by this time I thought was kinda dumb, and I didn't want to promote that group, so I didn't join in this chant.
  • "Stop the slop! Stop the slop!" Nah, slop is not what I'm worried about.
  • "CEOs, back in the basement! Techbros, back in the basement!" Fuck you, assholes.
  • "[Unintelligible - freedom?] for humans! Not for clankers!" Admittedly this was only one guy, but he shouted it a few times. My snap judgment of this guy… look, I don't think this sort of snap judgment is super reliable, and talking this kind of shit about randos (even anonymous ones) feels like not something I want to lean into. But I also want to talk about my experience of this protest, and snap judgments I made during it are part of that, so here goes. My snap judgment of this guy is some combination of: "if LLMs start walking around in robot bodies, he'll happily take a baseball bat to them" / "he's probably an okay guy as long as he isn't required to treat any member of a group he doesn't like as a moral patient".

Occasionally there would be a call-and-response like, "do we want Bad Thing to happen? NO! Are we gonna stop it? YES!" I don't remember if I chimed in on the predictive claims about the future. I felt kinda conflicted about it if I did. I know we weren't really being asked to make snap predictive judgements about the future and all come to the same answer and yell it out simultaneously, and I don't think anyone's going to hold it against my Brier score if we fail to stop Bad Thing, but… I dunno. Autism. I endorse protest organizers continuing to use these calls-and-response until someone comes up with some better technology to do the thing they do.

At one point a few people crossed through the walking line, and one of them said "we're not counter-protesters, we're just crossing". I thought that was mildly funny and mildly confusing, because why would we have thought they were counterprotesters? A few moments later one of them said "they didn't find that funny" in a tone that sounded to me like they thought we were offended.

After the protest was a people's assembly. I think this bit was fully organized by Pull the Plug, and it's not the public facing bit of the event, so it's worth talking about separately from the protest.

The format of this part was that people sat in small groups around a dozen or so tables, and had a facilitated conversation about "what are our concerns about AI" and "what do we think should be done about it". Then each table picked someone to summarize our conversation for the room, some of whom noticed that no one was giving them "round it up please" hand gestures and took advantage of this fact. Finally someone summarized all those summaries.

The conversation at my table was pretty fine. Three of us were mostly worried about extinction, three were mostly worried about other things. In summary, extinction was the first thing mentioned out of a long list of things. (But it's not like I volunteered to summarize. And if I had done it, I would have felt like a dick giving extinction as much weight in summary as the rest combined, even if I think that was about representative for the table.)

Another table reported that the thing they could all agree on was, you know those annoying buttons like WhatsApp has where you can talk to an AI? They all agreed that people should be able to hide those buttons.

I mostly stopped listening after that. In the final summary, again, extinction was mentioned first but it was just one in a long list of things.

I think that summary is supposed to be fed to… some level of government somehow? Not sure. I did not come away from this experience thinking that people's assemblies are the future of intelligent governance.

I feel like I come across pretty snarky and conceited in this. I'm not gonna say "that's not me", because… well, I don't think I get to call lots of people dumb and expect readers not to infer that I'm the type of person who thinks lots of people are dumb.

I do think this is kind of out of distribution for my writing, and not how I want to usually write. But if I tried to write something more measured here, I think it would be less honest and I probably would never publish.

But also, this piece more than most of what I write is about me. I could say "I can see why you'd be tempted to chant CEOS, back in the basement! Techbros, back in the basement!, but I'm not a fan because…". But I think it's more important, here, to say that my reaction to it is "fuck you, assholes". If protest organizers want people like me to feel good about attending protests, they should know that that's my reaction to that chant.

  1. In this piece I'm sharing my opinions, but I'm not trying to explain why I hold them and I'm not trying to convince anyone of them. I'm not carefully differentiating between opinions I hold confidently and opinions I'm less sure about. 

  2. Yet! Growth mindset. (If she'd said that AI labs are trying to become publicly traded and this is bad because…, then I'd have rolled my eyes a lot less.) 

  3. To be clear, even though I think "generic lefty anti-big-tech" is pretty dumb, that's not mostly about either the "lefty" or the "anti-big-tech". It's mostly about the "generic" bit. 

  4. I haven't had much engagement with them as a group apart from this protest. I've met and liked Joseph, the UK director. And I consider Matilda, the UK deputy director, a friend. I shared this with her before publishing. 



Discuss

Podcast: Jeremy Howard is bearish on LLMs

Новости LessWrong.com - 7 марта, 2026 - 00:39

Jeremy Howard was recently[1] interviewed on the Machine Learning Street Talk podcast: YouTube link, interactive transcript, PDF transcript.

Jeremy co-invented LLMs in 2018, and taught the excellent fast.ai online course which I found very helpful back when I was learning ML, and he uses LLMs all the time, e.g. 90% of his new code is typed by an LLM (see below).

So I think his “bearish”[2] take on LLMs is an interesting datapoint, and I’m putting it out there for discussion.

Some relevant excerpts from the podcast, focusing on the bearish-on-LLM part, are copied below! (These are not 100% exact quotes, instead I cleaned them up for readability.)

So you know Piotr Woźniak, who's a guy I really respect, who kinda rediscovered spaced repetition learning, built the SuperMemo system, and is the modern day guru of memory: The entire reason he's based his life around remembering things is because he believes that creativity comes from having a lot of stuff remembered, which is to say, putting together stuff you've remembered in interesting ways is a great way to be creative.

LLMs are actually quite good at that.

But there's a kind of creativity they're not at all good at, which is, you know, moving outside the distribution….

You have to be so nuanced about this stuff because if you say “they're not creative”, it can give the wrong idea, because they can do very creative seeming things.

But if it's like, well, can they really extrapolate outside the training distribution? The answer is no, they can't. But the training distribution is so big, and the number of ways to interpolate between them is so vast, we don't really know yet what the limitations of that is.

But I see it every day, because my work is R&D. I'm constantly on the edge of and outside the training data. I'm doing things that haven't been done before. And there's this weird thing, I don't know if you've ever seen it before, but I see it multiple times every day, where the LLM goes from being incredibly clever to, like, worse than stupid, like not understanding the most basic fundamental premises about how the world works. And it's like, oh, whoops, I fell outside the training data distribution. It's gone dumb. And then, like, there's no point having that discussion any further because you've lost it at that point.

I mean, I think they can't go outside their distribution because it's just something that that type of mathematical model can't do. I mean, it can do it, but it won't do it well.

You know, when you look at the kind of 2D case of fitting a curve to data, once you go outside the area that the data covers, the curves disappear off into space in wild directions, you know. And that's all we're doing, but we're doing it in multiple dimensions. I think Margaret Boden might be pretty shocked at how far “compositional creativity” can go when you can compose the entirety of the human knowledge corpus. And I think this is where people often get confused, because it's like—

So for example, I was talking to Chris Lattner yesterday about how Anthropic had got Claude to write a C compiler. And they were like, “oh, this is a clean-room C compiler. You can tell it's clean-room because it was created in Rust.” So, Chris created the, I guess it's probably the top most widely used C / C++ compiler nowadays, Clang, on top of LLVM, which is the most widely used kind of foundation for compilers. They're like: “Chris didn't use rust. And we didn't give it access to any compiler source code. So it's a clean-room implementation.”

But that misunderstands how LLMs work. Right? Which is: all of Chris's work was in the training data. Many many times. LLVM is used widely and lots and lots of things are built on it, including lots of C and C++ compilers. Converting it to Rust is an interpolation between parts of the training data. It's a style transfer problem. So it's definitely compositional creativity at most, if you can call it creative at all. And you actually see it when you look at the repo that it created. It's copied parts of the LLVM code, which today Chris says like, "oh, I made a mistake. I shouldn't have done it that way. Nobody else does it that way." Oh, wow. Look. The Claude C compiler is the only other one that did it that way. That doesn't happen accidentally. That happens because you're not actually being creative. You're actually just finding the kind of nonlinear average point in your training data between, like, Rust things and building compiler things.

I'm much less familiar with math than I am computer science, but from talking to mathematicians, they tell me that that's also what's happening with like, Erdős problems and stuff. It's some of them are newly solved. But they are not sparks of insight. You know, they're solving ones that you can solve by mashing up together very closely related things that humans have already figured out.

The thing is, none of these guys have been software engineers recently. I'm not sure Dario's ever been a software engineer at all. Software engineering is a unusual discipline, and a lot of people mistake it for being the same as typing code into an IDE. Coding is another one of these style transfer problems. You take a specification of the problem to solve and you can use your compositional creativity to find the parts of the training data which interpolated between them solve that problem, and interpolate that with syntax of the target language, and you get code.

There's a very famous essay by Fred Brooks written many decades ago, No Silver Bullet, and it almost sounded like he was talking about today. He was pointing to something very similar, which is, in those days, it was all like, "oh, what about all these new fourth generation languages and stuff like that, you know, we're not gonna need any software engineers anymore, because software is now so easy to write, anybody can write it". And he said, well, he guessed that you could get at maximum a 30% improvement. He specifically said a 30% improvement in the next decade, but I don't think he needed to limit it that much. Because the vast majority of work in software engineering isn't typing in the code.

So in some sense, parts of what Dario said were right: for quite a few people now, most of their code is being typed by a language model. That's true for me. Say, like, maybe 90%. But it hasn't made me that much more productive, because that was never the slow bit. It's also helped me with kind of the research a lot and figuring out, you know, which files are gonna be touched.

But any time I've made any attempt to getting an LLM to like design a solution to something that hasn't been designed lots of times before, it's horrible. Because what it actually, every time, gives me is the design of something that looks on its surface a bit similar. And often that's gonna be an absolute disaster, because things that look on the surface a bit similar and like I'm literally trying to create something new to get away from the similar thing. It's very misleading.

The difference between pretending to be intelligent and actually being intelligent is entirely unimportant, as long as you're in the region in which the pretense is actually effective. So it's actually fine, for a great many tasks, that LLMs only pretend to be intelligent, because for all intents and purposes, it it it just doesn't matter, until you get to the point where it can't pretend anymore. And then you realize, like, oh my god, this thing's so stupid.

…And more at the link!

  1. ^

    The podcast was released March 3 2026. Not sure exactly when it was recorded, but it was definitely within the previous month, since they talk about a blog post from Feb. 5.

  2. ^

    I mean, he’s “bearish” compared to the early-2026 lesswrong zeitgeist—which really isn’t saying much!



Discuss

Probing CODI's Latent Reasoning Chain with Logit Lens and Tuned Lens

Новости LessWrong.com - 6 марта, 2026 - 21:36

As latent reasoning models become more capable, understanding what information they encode at each step becomes increasingly important for safety and interpretability. If tools like logit lens and tuned lens can decode latent reasoning chains, they could serve as lightweight monitoring tools — flagging when a model's internal computation diverges from its stated reasoning, or enabling early exit once the answer has crystallized. This post explores whether those tools work on CODI's 6 latent steps and what they reveal about its internal computation.

TL;DR

I applied logit lens and tuned lens to probe CODI's latent reasoning chain on GSM8K arithmetic problems.

Key findings:
  • The direct-trained tuned lens reveals the final answer in odd latent steps (1, 3, 5) while plain logit lens cannot. Tuned lens with only latents 3,5 is not able to decode the the final answer
  • Final answer detection rate is highest at even steps (2, 4), consistent with even steps acting as storage
  • Entropy peaks at odd steps 3 and 5, consistent with active computation happening there
  • Training translators directly on CODI latent hidden states surprisingly underperforms translators trained on text tokens — suggesting latent vectors remain close to text token geometry
Experimental setupCoDI model

I use the publicly available CODI Llama 3.2 1B checkpoint from Can we interpret latent reasoning using current mechanistic interpretability tools?  

Tuned Logit Lens

I used the code implementation for the training of Tuned logit lens from Eliciting Latent Predictions from Transformers with the Tuned Lens

ExperimentsThe model in Intermediate Calculations contains the final answer.

PROMPT = "A team starts with 3 members. They recruit 5 new members. Then each current member recruits 2 additional people. How many people are there now on the team? Give the answer only and nothing else."
Answer = 10
I looked at the same prompt as the lesswrong post “Can we interpret latent reasoning using current mechanistic interpretability tools?”

In the figure above which is a Tuned Logit lens trained on the gsm8k with 500 samples and 3 epochs.

An interesting feature of the Tuned Logit Lens on non CODI tokens is that unlike normal logit lens it demonstrates that in the CODI latents the model stores information about the Answer 10. The answer 10 for the logit lens seemed to be absent from the even latent vector indexes of 2, 4, 6
This is different from normal logit lens where it it only shows the intermediate steps of 8 and 16 but, never shows the final answer of 10

Tuned Logit Lens on CODI latents

Motivated by the strong performance of the direct-trained tuned lens, I trained a second set of translators directly on CODI's latent hidden states, hypothesizing that a lens specialized to latent-space geometry would outperform one trained on text tokens. This however, was not the case.

  • Unlike normal logit lens or even the directly-trained tuned lens the CODI tuned lens always outputted end of cot tokens for latents 6 and beyond
  • There seems to be a noticeable difference in the tuned logit lens depending on the latents used to train it especially between even and odd latents

Tuned Logit Lens CODI


Training tuned logit lenses on CODI latents seemed to cause the logit lens to mirror the final layer which suggests over-fitting.

Tuned Logit Lens CODI Latents Even


Training tuned logit lenses on CODI even latents of 2,4,6 seemed to cause the logit lens to output a lot more text tokens or \n tokens. The final layer can be ignored since tuned logit lens does not train the last layer.

Tuned Logit Lens CODI Latents ODD


Training tuned logit lenses on CODI odd latents of 1,3,5 seemed to cause the logit lens to output to do the opposite of the even logit lens as it seems that it only ouputted numbers even before the latent reasoning as seen with latent vector index 0. It was able to find the final answer of 10 however, it was unable to produce valid outputs for non-latent reasoning activations as seen with how it did not fully decode latent vector index 0.

Tuned Logit Lens CODI Latents (3,5)

If tuned lens is only using 3, 5 the 10s do not show up adding 1,3,5 allow the the logit lens to find the final answer.

Tuned Logit Lens on Entropy Over Layers

In order to explore the differences in logit len outputs between the different Tuned Logit Lens I looked at the Entropy.

  • Plain Logit Lens seemed to have the highest Entropy with it only dropping at the very last layers
  • Direct Tuned Logit lens seemed to have the lowest entropy compared to CODI tuned lens and default logit lens
  • An interesting finding is that latents 3 and 5 have the highest entropy across the 3 logit lens variations excluding latent 1.
Tuned Logit Lens on Intermediate answer detection

The tuned logit lens containing the final answer made me curious for outside the top token in the topk what was the final answer emission for the different tuned logit lens and which latents predicted the final answer at the highest rates.

  • Directly Tuned Logit Lens did the best and had the highest final answer detection rate
  • Surprisingly the latents that had the highest final answer detection rate in the top k tokens were the even tokens 2, 4.

 

Discussion
  • The latent Intermediate answer detection having the highest rates for latents 2 and 4 is consistent with this lesswrong post.
  • The high entropy values for latents 3 and 5 along with the fact that for the tuned logit lens the final answer appears only in the odd latents of 1, 3, 5 seems to be consistent with the results from this lesswrong post as the reason for why overriding latents 3 and 5 had a considerable decrease in accuracy could be explained by the findings for the high entropy values for latents 3 and 5.
  • The even latents of 2 and 4 containing the intermediate answers the most when doing the top k possibly can be explained with the Scratchpad Thinking paper finding that even latents steps store and access numerical information.
  • Together these three findings provide converging evidence from logit lens and entropy analysis that CODI follows the compute/store alternation proposed in  the Scratchpad Thinking paper 
Future Work
  • Observe Final Answer Detection Rate and see if latents 2 and 4 have the highest accuracy for different top ks
  • Try other methods like patchscopes, activation oracles, etc on the latent reasoning
  • Attempt to create a thought anchors that work with CODI latent reasoning models


Discuss

Anthropic Officially, Arbitrarily and Capriciously Designated a Supply Chain Risk

Новости LessWrong.com - 6 марта, 2026 - 21:10

Make no mistake about what is happening.

The Department of War (DoW) demanded Anthropic bend the knee, and give them ‘unfettered access’ to Claude, without understanding what that even meant. If they didn’t get what they want, they threatened to both use the Defense Production Act (DPA) to make Anthropic give the military this vital product, and also designate the company a supply chain risk (SCR).

Hegseth sent out an absurdly broad SCR announcement on Twitter that had absolutely no legal basis, that if implemented as written would have been corporate murder. They have now issued an official notification, which is still illegal, arbitrary and capricious, but is scoped narrowly and won’t be too disruptive.

Nominally the SCR designation is because we cannot rely on that same product when the company has not bent the knee and might object to some uses of its private property that it never agreed to allow.

No one actually believes this. No one is pretending others should believe this. If they have real concerns, there are numerous less restrictive and less disruptive tools available to the Department of War. Many have the bonus of being legal.

In actuality, this is a massive escalation, purely as punishment.

DoW is saying that if you claim the right to choose when and how others use your private property, and offer to sign some contracts but not sign others, that this means you are trying to ‘usurp power’ and dictate government decisions.

It is saying that if you do not bend the knee, if your business does not do what we want, then we cannot abide this. We will illegally retaliate and end your business.

That is not how the law works. That is not how a Republic works.

This was completely unnecessary. Talks were ongoing. The two sides were close. The deal DoW signed with OpenAI, the same night as the original SCR designation, violates exactly the red line principles and demands the DoW says abide no compromises.

The good news is that there are those who managed to limit this to a narrowly tailored SCR, that only applies to direct provision of government contracts. Otherwise, this does not apply to you. Even if that gets tied up in court indefinitely, this will not inflict too much damage on either Anthropic or national security.

The question is how much jawboning or further steps come after this, but for now we have dodged the even worse outcomes keeping us up at night.

You might be tempted to think of or present this as the DoW backing down. Don’t.

Why not? Two good reasons.

  1. It isn’t true.
    1. This uses USC 3252 because they’d have been laughed out of court if they’d tried to match the no-legal-basis word salad from Friday 5:14pm.
    2. Given the use of USC 3252 this is maximally broad.
    3. The fact that they toyed with doing something even worse does not make this not an arbitrary, capricious and dramatic escalation purely as punishment.
  2. The DoW cannot see itself as backing down, or it will do even worse things.

Dean W. Ball: No one should frame the DoW’s supply chain risk designation as the government “backing down.” If that becomes “the narrative,” it could encourage further action to avoid the appearance of weakness.

It is also not true that it is backing down; the government really is exercising its supply chain risk designation authority under 10 USC 3252 to the fullest extent (and this is assuming it’s even legitimate to use it on an American firm, which is deeply questionable.

Hegseth’s threat was far broader than his power, which is the only reason this seems deescalatory. If you had asked me for a worst case scenario before Hegseth’s tweet last Friday, I would have told you precisely what has unfolded. This could mean that any vendor of widely used enterprise software (Microsoft, Apple, Salesforce, etc.) could be barred from using Anthropic in the maintenance of any codebases offered to DoW as part of a military contract, for example. Any startup who views DoW as a potential customer for their products will preemptively have to avoid Claude. This is still a massive punishment from USG.

You might also ask: if I knew Hegseth’s power was more limited than he threatened, why did I take his threat at face value? The answer is that we have so clearly moved past the realm of reason here that, well, to a first approximation, I take the guy who runs the biggest military on Earth at his word when he issues threats.

Sometimes some people should talk in carefully chosen Washington language, as ARI does here. Sometimes I even do it. This is not one of those times.

Table of Contents
  1. Post Overview.
  2. Anthropic’s Statement on the SCR.
  3. What The Actual SCR Designation Says.
  4. Enemies of The Republic.
  5. Regulation Need Not Seize The Means Of Production.
  6. Microsoft Stands Firm.
  7. Calling This What It Is.
  8. What To Expect Next.
Post Overview

This post is an update on events since the publication of the weekly, and an attempt to reiterate key events and considerations to put everything into context.

For details and analysis of previous events, see my previous posts:

  1. Anthropic and The Department of War, from February 25.
  2. Anthropic and the DoW: Anthropic Responds, from February 27.
  3. A Tale of Three Contracts, From March 3.
  4. AI #158: The Department of War, from March 5.

For those following along these are the key events since last time:

  1. Wednesday morning: Talks between Anthropic and the DoW have resumed, in line with FT reporting, and progress on concrete proposals is being made.
  2. Wednesday afternoon: An internal Anthropic memo from Friday evening uncharacteristically leaks, most of which was correct technical explanations of the situation, and also containing some reasonable suppositions as of time of writing, but that also included some statements that were ill-considered and caused fallout. Negotiations were disrupted.
  3. Thursday morning: All quiet as everyone dealt with fallout from the leaked internal Anthropic memo. Scrambling to keep things contained continues.
  4. Thursday, 1pm: Katrina Manson reports that the Pentagon has sent a formal SCR to Anthropic, but the report has no details.
  5. Thursday afternoon: Reporting comes out that ‘Trump plans U.S. control over global AI chip sales’ and it remains unclear what this means but Commerce has been very clear they’re not bringing back diffusion rules and that the early reporting gave a false impression. We still await clarity on what is changing.
  6. Thursday evening: Anthropic issues a conciliatory statement, noting that the SCR is of limited scope and need not impact the vast majority of customers, pointing out that everyone wants the same outcomes and wants to work together and that discussions have been ongoing, and directly and personally apologizing for the leaked Anthropic memo that Dario Amodei wrote on Friday night.
  7. Meanwhile: Various people continue to advocate against private property.
Anthropic’s Statement on the SCR

It was an excellent statement. I’m going to quote it in full, since no one clicks links and I believe they would want me to do this.

Dario Amodei (CEO Anthropic): Yesterday (March 4) Anthropic received a letter from the Department of War confirming that we have been designated as a supply chain risk to America’s national security.

As we wrote on Friday, we do not believe this action is legally sound, and we see no choice but to challenge it in court.

The language used by the Department of War in the letter (even supposing it was legally sound) matches our statement on Friday that the vast majority of our customers are unaffected by a supply chain risk designation. With respect to our customers, it plainly applies only to the use of Claude by customers as a direct part of contracts with the Department of War, not all use of Claude by customers who have such contracts.

The Department’s letter has a narrow scope, and this is because the relevant statute (10 USC 3252) is narrow, too. It exists to protect the government rather than to punish a supplier; in fact, the law requires the Secretary of War to use the least restrictive means necessary to accomplish the goal of protecting the supply chain. Even for Department of War contractors, the supply chain risk designation doesn’t (and can’t) limit uses of Claude or business relationships with Anthropic if those are unrelated to their specific Department of War contracts.

I would like to reiterate that we had been having productive conversations with the Department of War over the last several days, both about ways we could serve the Department that adhere to our two narrow exceptions, and ways for us to ensure a smooth transition if that is not possible. As we wrote on Thursday, we are very proud of the work we have done together with the Department, supporting frontline warfighters with applications such as intelligence analysis, modeling and simulation, operational planning, cyber operations, and more.

As we stated last Friday, we do not believe, and have never believed, that it is the role of Anthropic or any private company to be involved in operational decision-making—that is the role of the military. Our only concerns have been our exceptions on fully autonomous weapons and mass domestic surveillance, which relate to high-level usage areas, and not operational decision-making.

I also want to apologize directly for a post internal to the company that was leaked to the press yesterday. Anthropic did not leak this post nor direct anyone else to do so—it is not in our interest to escalate this situation. That particular post was written within a few hours of the President’s Truth Social post announcing Anthropic would be removed from all federal systems, the Secretary of War’s X post announcing the supply chain risk designation, and the announcement of a deal between the Pentagon and OpenAI, which even OpenAI later characterized as confusing. It was a difficult day for the company, and I apologize for the tone of the post. It does not reflect my careful or considered views. It was also written six days ago, and is an out-of-date assessment of the current situation.

Our most important priority right now is making sure that our warfighters and national security experts are not deprived of important tools in the middle of major combat operations. Anthropic will provide our models to the Department of War and national security community, at nominal cost and with continuing support from our engineers, for as long as is necessary to make that transition, and for as long as we are permitted to do so.

Anthropic has much more in common with the Department of War than we have differences. We both are committed to advancing US national security and defending the American people, and agree on the urgency of applying AI across the government. All our future decisions will flow from that shared premise.

I believe and hope that this will help move things forward towards de-escalation.

What The Actual SCR Designation Says

Secretary of War Pete Hegseth’s original Tweet on Friday at 5:14pm was not a legal document. It claimed that it would bar anyone doing business with the DoW from doing any business with Anthropic, for any reason. This would in effect have been an attempt at corporate murder, since it would have attempted to force Anthropic off of the major cloud providers, and have forced many of its largest shareholders to divest.

That move would have had no legal basis whatsoever, and also no physical logic whatsoever since selling goods or services to Anthropic, or providing Anthropic services to others, obviously has no impact on the military supply chain. It would not have survived a court challenge. But if Anthropic failed to get a TRO, that alone could have caused major disruptions and a stock market bloodbath.

We are very fortunate and happy that this was not the letter that DoW ultimately chose to send after having time to breathe. As per Anthropic, the official supply chain risk designation letter invokes the narrow form of SCR, 10 USC 3252.

Anthropic: The Department’s letter has a narrow scope, and this is because the relevant statute (10 USC 3252) is narrow, too. It exists to protect the government rather than to punish a supplier; in fact, the law requires the Secretary of War to use the least restrictive means necessary to accomplish the goal of protecting the supply chain.

Even for Department of War contractors, the supply chain risk designation doesn’t (and can’t) limit uses of Claude or business relationships with Anthropic if those are unrelated to their specific Department of War contracts.

There are three levels of danger to Anthropic here if the classification is sustained.

  1. Direct loss of business from impacted tasks. This is nothing. Defense contracts and government use are a tiny portion of overall revenue.
  2. Indirect loss of business due to dual stack, uncertainty or compliance costs. Those who have some restricted business might not want to maintain dual technology stacks or deal with compliance issues, or worry about future changes. There will be some of this on the margin, and all the time we end up with ‘the government is clearly okay with [X] so even though [X] is worse we’ll just use [X].’ There will be some of that, presumably, but even this is a tiny fraction of revenue. The big companies that matter aren’t going to switch over this nor should they.
  3. Fear of future jawboning and illegal government actions, or actual jawboning. The government could use various other ways to bring pressure on companies to cut business. If things stay sufficiently hostile they might try, but I don’t see this working. Eight of the ten biggest companies use Anthropic, it’s the majority of enterprise sales, it’s tied closely to Amazon and Google. I don’t even think there will be substantial impact on cost of capital.

But we do have to watch out. If the government is sufficiently determined to mess with you, and doesn’t care about how much damage this does including to rule of law, they have a lot of ways to do that.

Enemies of The Republic

Remarkably many people are defending this move, and mostly also defending the legally incoherent move that was Tweeted out on Friday afternoon.

The defenders of this often employ rhetoric that is truly reprehensible, and entirely incompatible with freedom, a Republic or even private property.

They say that the United States Government, and de facto they mean the executive branch, because the President was duly elected, can do anything it wants, and must always get its way, make all decisions and be the only source of power. That if what you create is sufficiently useful then it no longer belongs to you, and any private actor that prospers too much must be hammered down to protect state authority.

There are words for this. Communist. Authoritarian. Dictatorship. Gangster nations.

This is how such people are trying to redefine ‘democracy’ in real time.

You do not want to live in such a nation. Such nations do not have good futures.

roon (OpenAI): to reiterate: whatever went wrong between amodei & hegseth, whatever rivalry between the labs, this is a massive overreaction and a dark precedent

Ash Perger: this is the first time that I’m really surprised by your stance. the reality is that the USG can in general do whatever they want. they always have and always will.

within a certain frame, courts and laws are allowed to exist and give people the illusion that these systems and principles extend to ALL actions of the USG.

but once you go outside of this frame and challenge the absolute RAW power behind the scenes, anything goes. that’s the realm that Anthropic entered and challenged the USG within. and at least since the early 20th century, the USG has never reacted to a direct challenge in the true realm of its hard power in a peaceful way.

this is not a conspiracy angle or anything, it’s just how power has worked since time beginning.

Anthropic didn’t challenge the government’s power. Anthropic used the most powerful weapon available to every person, the right to say ‘no’ and take the consequences. These are the consequences, if you don’t live in a Republic.

If you remember one line today, perhaps remember this one:

roon (OpenAI): > the USG can in general do whatever they want

The founders of this great nation fought several bloody wars to make sure this is not true.

The government cannot, in general, do whatever it wants.

That could change. It can happen here. Know your history, lest it happen here.

Kelsey Piper: incredible to see people just casually reject the bedrock foundations of American greatness not just as some dumb nonsense that they’re too cool to believe but as something they literally are not familiar with

As Dean Ball has screamed from the rooftops, we have been trending in this direction for quite some time, and the danger to the Republic and attacks on civil liberties is coming from all directions. The situation is grim.

There are words for those who support such things. I don’t have to name them.

I have talked for several years about the Quest For Sane Regulations, because I believe the default outcome of building superintelligence is that everyone dies and that highly capable AI presents many catastrophic risks. I supported bills like SB 1047 that would have given us transparency into what was happening and enforcement of basic safety requirements.

We were told this could not be abided. We were told, often by the same people, that such fears were phantoms, that there was ‘no evidence’ that building machines smarter, more capable and more competitive than us might be an inherently unsafe thing for people to do. We were lectured that requiring our largest AI labs to do basic things would devastate our AI industry, that it would take away our freedoms, that we would lose to China, that these concerns could be dealt with after they had already happened, that any government intervention was inevitably so malign we were better off with a yolo.

Those people still do not even believe in superintelligence. They do not understand the transformations coming to our world. They do not understand that we are about to face existential threats to our survival as humans and to everything of value. All they see in this world is the power, and demand that it be handed over.

What I hate the most, and where I want to most profoundly say ‘fuck you,’ are those who claim that this is somehow about ‘AI safety’ or concerns about superintelligence, when that very clearly is not true.

As a reminder:

  1. Anthropic thinks AI will soon be highly capable, ‘geniuses in a data center.’
  2. Anthropic thinks this poses existential risks to humanity.
  3. Pete Hegseth does not believe either of these things.
  4. The White House does not believe either of these things.
  5. Those defending this move mostly do not believe either of these things.
  6. They try to pretend that Anthropic saying it justifies destroying Anthropic if Anthropic does not agree to bend the knee.
  7. They try to pretend sometimes they aren’t really making the worst arguments, they’re hypotheticals, they’re saying something else like need for clarity.
  8. They repeat DoW misinformation about what led to this, as if it is basically true.
  9. When pressed they admit this is simply about raw power, because it is.

We saw this yesterday with Ben Thompson. Here we see it with Krishnan Rohit and Noah Smith.

Noah Smith: By the way, as much as I hate to say it, the Department of War is right and Anthropic is wrong. Here’s why.

Let’s take this a little further, in fact. And let us be blunt. If Anthropic wins the race to godlike artificial superintelligence, and if artificial superintelligence does not become fully autonomous, then Anthropic will be in sole possession of an enslaved living god. And if Dario Amodei personally commands the organization that is in sole possession of an enslaved god, then whether he embraces the title or not, Dario Amodei is the Emperor of Earth.

Are you fucking kidding me? You’re pull quoting that at us, on purpose?

And if you go even one level down in the thread you get this:

Jason Dean: What does this have to do with the Supply Chain Risk designation?

Noah Smith: Nothing. Hegseth is a thug. But we CANNOT expect nation-states to surrender their monopoly on the use of force.

So let me get this straight. The Department of War is run by a thug who is trying to solve the wrong problem using the wrong methods based on the wrong model of reality, and all of his mistakes are very much not going to cancel out, but he’s right?

And why is he right? Because might makes right. How else can you read that reply?

He’s even quoting the ultimate bad faith person and argument here, directly, except he’s only showing Marc here without Florence:

At least he included the reversal after, noting that the converse is also true.

Then there’s the obvious other point.

Damon Sasi: You can in fact think both are wrong for different reasons.

Of course a private corporation shouldn’t [be allowed to] build and own a techno-god. Yes. Absolutely.

AND ALSO, the government response shouldn’t be “take off the nascent-god’s safety rails so we can do unethical things with it.”

That the government thinks it’s just a fancy weapon is immaterial when the thing that makes them wrong is wanting to do illegal things through unethical methods. You don’t have to steelman Hegseth just because a better man might do a different, better thing for other reasons.

I cannot say enough that the logic response to ‘these people want to build a techo-god,’ under current conditions, is ‘wait no, stop, if this is actually something they’re close to doing. No one should be building a techo-god until we figure this stuff out on multiple levels and we’ve solved none of them, including alignment.’

These same Very Serious People never consider the Then Don’t Build It So That Everyone Doesn’t Die strategy.

But wait, there’s more.

Noah Smith: Ben Thompson of Stratechery makes this case. He points out that what we are effectively seeing is a power struggle between the private corporation and the nation-state. He points out that although the Trump administration’s actions went outside of established norms, at the end of the day the U.S. government is democratically elected, while Anthropic is not.

Remember yesterday, when Ben Thompson tried to pretend he was only making a non-normative argument? Yeah, well, ~0% of people reading the post took it that way, he damn well knew that’s how people would take the argument, and it’s being quoted approvingly by many, and Ben hasn’t, shall we say, been especially loud and clear about walking it back. So yeah, let’s stop pretending.

Noah Smith: It’s a question of the nation-state’s monopoly on the use of force.

Among others, I most recently remember Dave Chappelle saying that we have the first amendment protecting our right to free speech, and the second amendment in case the first one doesn’t work out.

Whereas Noah Smith is explicitly saying Claude should be treated like a nuke.

So as much as I dislike Hegseth’s style, and the Trump administration’s general pattern of persecution and lawlessness, and as much as I like Dario and the Anthropic folks as people, I have to conclude that Anthropic and its defenders need to come to grips with the fundamental nature of the nation-state.

It seems a lot of people think the fundamental nature of the nation-state is that of a gangster, like Putin, and they are in favor of this rather than against it.

If the pen is mightier than the sword, why are we letting people just buy pens?

I do respect that at least Noah Smith is, at long last, taking the idea of superintelligence seriously, except when it comes time to dismiss existential risk.

He seems to be very quickly getting to some other conclusions, including ending API access for highly capable models, and certainly banning open source.

Maybe trying to ‘wake up’ such folks was always a mistake.

As a reminder ‘force the government’s hand’ means ‘don’t agree to hand over their private property, and indeed engineer and deliver new forms of it, to be used however the government wants, on demand, while bending the knee.’

rohit: It is absurd to say you’re building a nuke and not expect the government to take control of it!

Noah Smith: Yes.

Rohit: you’re doing a straussian reading and missing the fact that I wasn’t blaming anthropic for the scr, what I am doing is drawing a line from ai safety language, helped by the very water we swim in, and the actions that were taken by DoW. it’s naive to think theyre unrelated

Dean W. Ball: they are coming from people *who entirely and explicitly dismiss the language of ai safety*—please explain how it is “naive” to say “ai safety motivations do not explain Pete Hegseth’s behavior”

rohit: because you don’t actually have to believe that it’s bringing forth a wrathful silicon god to want to control the technology! you just need to think its useful and powerful enough. and they very clearly think its powerful, and getting more so by the day.

Dean W. Ball: Ok, so the actual argument is more like “Anthropic builds a useful technology whose utility is growing, therefore they should expect to have their property expropriated and to be harassed by the government.”

The whole point of America is that isn’t supposed to be true here.

At the same time, inre: my writing earlier this week, all I have to say to the qt is “quod erat demonstrandum”

… I think the better explanation is that this is not that different from the universities or the law firms or whatever else, this is part of a pretty consistent pattern/playbook and that this explains what we have seen much better than this ai governance stuff.

though it’s true that this issue does raise a lot of interesting ai governance is questions, I just do not think anything like that is top of mind at all for the relevant actors.

This is very simple. These people are against regulation, because that would be undue interference, except when the intervention is nationalization, then it’s fine.

Indeed, the argument ‘otherwise this wouldn’t be okay because it isn’t regulated’ is then turned around and used as an argument to take all your stuff.

Dean W. Ball: The problem with this is that DoW is not taking Anthropic’s calls for “oversight” seriously. Indeed, elsewhere in the administration, Anthropic’s “calls for oversight” are dismissed as “regulatory capture” and actively fought. Rohit and Noah [Smith] are dressing up political harassment.

Quite clever. Dean and Rohit went back and forth in several threads, all of which only further illustrate Dean’s central point.

Rohit Krishnan: You simply cannot call your technology a major national security risk in dire need of regulation and then not think the DoD would want unfettered access to it. They will not allow you, rightfully so in a democracy, to be the arbiters of what is right and wrong. This isn’t the same as you or me buying an iOS app and accepting the T&Cs.

It’s clear as day. If you say you need to be regulated, they get to take your stuff.

If you try to say how your stuff is used, that’s you ‘deciding right and wrong.’

Rohit Krishnan: Democracy is incredibly annoying but really, what other choice do we have!

The choice is called a Republic. A government with limited powers, where private property is protected.

The alternative being suggested is one person, one vote, one time.

That sometimes works out well for the one person. Otherwise, not so well.

Regulation Need Not Seize The Means Of Production

TBPN asks Dean Ball about the gap between regulation and nationalization, drawing the parallel to the atomic bomb. Dean agrees nukes worked out but we failed to get most of the benefits of nuclear energy, and points out the analogy breaks down because AI expresses and is vital to your liberty, and government control of AI inevitably would lead to tyranny. Whereas control over energy and bombs does not do that, and makes logistical sense.

Dean also points out that ‘try to get regulation right’ has been systematically categorized as ‘supporting regulatory capture,’ even when bills like SB 53 are extremely light touch and clearly prudent steps.

It has been made all but impossible to stand up regulations that matter, as certain groups concentrate their fire on attempts to have us not die, while instead states instead are left largely free to push counterproductive bills that would only cut off AI’s benefits, or that would disrupt construction of data centers.

I can affirm strongly that Anthropic has not been in any way, shape or form advocating for regulatory capture, and has opposed or not supported measures I strongly supported, to my great frustration. Indeed, Anthropic’s pushes here have resulted in clashes with the White House that are very much not helping Anthropic’s net present value of future cash flows.

It is many of the other labs that have been trying to lobby primarily for their own shareholder value.

Whereas OpenAI and a16z and others, through their Super PAC, have been trying to get an outright federal moratorium on any state laws, so that we can instead pursue some amorphous undefined ‘federal framework’ while sharing no details whatsoever about what such a thing would even look like (or at least none that would have any chance of accomplishing the task at hand), and systematically trying to kill the campaign of Alex Bores to send a message that no attempts at AI regulation will be tolerated.

Whenever someone says they want a national framework, ask to see this supposed ‘federal framework,’ because the only person who has proposed a real one that I’ve seen is Dean Ball and they sure as hell don’t plan on implementing his version.

But we digress.

Microsoft Stands Firm

The SCR is narrow, so there is no legal reason for anyone to change their behavior unless they are directly involved in defense contracting. And corporate America is making it very clear they are not going to murder one of their own simply because the DoW suggests they do so.

In particular, the companies that matter are the big three cloud providers: Google, Amazon and Microsoft. I was not worried, but it is good to have explicit statements.

Microsoft wasted no time, being first to make clear they will continue with Anthropic.

TOI Tech Desk: Microsoft has now announced that it will continue to embed Anthropic’s artificial intelligence models in its products, despite the US Department of War labelling the startup as a supply-chain risk.

“Our lawyers have studied the designation and have concluded that Anthropic products, including Claude, can remain available to our customers — other than the Department of War — through platforms such as M365, GitHub, and Microsoft’s AI Foundry,” a Microsoft spokesperson told CNBC.

Calling This What It Is

Sad but accurate, to sum up what likely happened:

roon: to reiterate: whatever went wrong between amodei & hegseth, whatever rivalry between the labs, this is a massive overreaction and a dark precedent.

Anthropic is one of my favorite accelerationist recursive self improvement labs. it rocks that they’re firing marvelously on all cylinders across all functions to duly serve the technocapital machine at the end of time and the pentagon is slowing them down for stupid reasons.

Sway: Roon, if OpenAI had stood firm on the side of Anthropic, then this move would have been less likely and probably averted. Instead, sama gave all the leverage to Trump admin. Sad state of affairs

roon: this is possible, yes

I share Sway’s view here. I think Altman was trying to de-escalate, but by giving up his leverage, and by cooperating with DoW messaging, he actually caused the situation to escalate further instead.

If the reason for all this was that DoW believed Eliezer Yudkowsky’s position that If Anyone Builds It, Everyone Dies, then that would be a very different conversation. This is the complete opposite of that.

What To Expect Next

The likely next move is that Anthropic will sue the Department of War. They will challenge the arbitrary and capricious supply chain risk designation, because it is arbitrary and capricious. Anthropic presumably wins, but it does not obviously win quickly.

If Anthropic does not sue soon, I would presume that would be because either:

  1. Anthropic has ongoing constructive negotiations with DoW, and is holding off on filing the lawsuit to that end.
  2. Anthropic has an understanding with DoW, whether or not it is explicit, that not challenging this would allow this to be the end of the conflict, or at least allow the damage involved to remain limited on all sides.

We are used to things happening in hours or days. That is often not a good thing. One reason things went south here is this rush. The memo was written on Friday evening, in a very different situation. Then, when the memo leaked, it was less than 24 hours before the supply chain risk designation was issued, while everyone was screaming ‘why hasn’t Dario apologized?’

It took him roughly 30 hours to draft that apology. That’s a very normal amount of time in this situation, but events did not allow that time. People need to calm down and take a moment, find room to breathe, consult their lawyers, pay to know what they really think, and have unrushed discussions.

 

 

 

 



Discuss

The Elect

Новости LessWrong.com - 6 марта, 2026 - 18:34

I was different in Michael’s prison than I was outside, looking the way I did when we fell in love so long ago, in that time before we could change our forms. Stuck in some body that was not of my choosing? Does that seem strange to you? It was not like that for me. It is just how things were for most of history, and few imagined this changing. So I felt almost nostalgic as I entered his realm, his prison transforming me into my first self - though not precisely as she was, instead as he remembered me; Michael looking exactly as he did in our youth. That is to say, he was quite beautiful.

“Madison,” he said when I arrived, “you came?”

“It seemed time.”

“Are you well?” he asked.

“What a question. I am happy. Everyone is happy.”

“You didn’t say yes,” he said.

“I didn’t,” I replied.

It was difficult to avoid thinking about the past. You would think I would be a different person now, and yet the me-that-was was not far away even then, her memories still my own, her sins, too, remaining mine.

“I am glad you came. It is lonely here.”

“You are lonely? You don’t let Him entertain you?” I asked.

“I read. I watch. I write for an audience of myself. I am quite a fan of myself, you know.” I chuckled. “I talk to Her, sometimes. To me She is a woman and Hers is a feminine cruelty. But no, I am not entertained in the way others are. She is not my friend or lover. I do not let her be that,” he said.

We both were silent for a time.

“Do you have a husband?” he asked. “A boyfriend?”

“I have Him,” I said. And I felt embarrassed. Can you imagine it? Embarrassed by the truth that I sleep with God? Who doesn’t, you might be thinking, but it was embarrassing, then, feeling more like my old self, wrapped once again in her flesh and him staring at me like he used to, like he found the whole world to his taste and me most of all.

“Do you ever ask Him to pretend to be me?”

Such a rude question, the rudest question; almost taboo, is it not? Rude to ask of anyone but prospective lovers, as we ancients would ask of the dreams and fantasies of those we desired.

“He will take any form but your own.”

“So you did ask?” he said, and he smiled.

“Enough teasing,” I said.

“If you insist. How are the children? She tells me nothing.”

“They are happy,” I said.

Michael’s prison would not impress you. You who have seen so many wonders, who have spent your life in sims casting strange magic in stranger worlds, who have climbed mountains on vast planets and contemplated impossible fauna He designed specifically for your fascination. It would not impress you, because you are a child. And being a child, Earth does not mean to you what it means to me, it being both my first home and our first sacrifice to Him, a wet nurse suckled dry by a babe not quite like the others, He almost embarrassed now by what was destroyed in His infant hunger.

It would not impress you, but it means so much to me. Tallinn. A city of red-tile roofs, of three-story apartments, of medieval fortifications, in its center that vast, beautiful church, a church to a god made redundant or, to some minds, ascendant. Tallinn exactly as I remember it save for the silence and emptiness, lacking all people, even His shadows. It is, in this way, a naked city, ghostly but still dripping with meaning, with thoughts of Michael and me on Earth, our minds as youthful as our bodies then.

And where in this charming city did Michael greet me? In that mentioned vast church, him sitting on the red-carpeted stairs, his back to the altarpiece (that strange structure of ebony) and me standing, looking down at him and he up at me, both of us wearing the fashions of our youth, he in that green jacket he loved so much, me in jeans and the white blouse I wore the night he proposed, the two of us in a perfect copy of the very church in which we wed.

“You were always so dedicated to your jokes,” I said, gesturing at our environs. “To wait here of all places.”

He looked pained. “It wasn’t a joke. More like a ritual, sitting here and asking God to invite you.”

“But it started as a joke, the first time,” I said.

“I suppose it did.”

“Well. I am here. What is it you want?” I said.

“I would like to see you all again one last time.”

I laughed, but it was one of surprise; there was no joy in it. “They will not come.”

“You came,” he said.

“They won’t.”

“They will when you explain it will be their final chance.”

I knew what he meant, then. And I could not remain composed. Perhaps I could have if He had informed me tenderly, in one of our secret worlds, me in my chosen body, old feelings so buried as to be almost absent. Perhaps I would have felt nothing to learn Michael had chosen to die. I will never know, because there in that church, with those dark green eyes upon me, I sobbed as my old self would have on hearing such news, entirely her now in our shared grief, moving to sit beside my first lover, grasping him, holding him. Like a break from sanity it should feel to me now, but it felt then like a return, his warm arms wrapping around me, the scent of him.

“You can’t,” I said. “I will stay if you need someone. It will be like it was.”

“It cannot be like it was,” he said.

And it could have, had he been any other man. But he was not any other man.

“Mother?” she said. “Why have you come?”

It was Avery. My daughter. And she was an angel now, her hair golden and flowing, her face haughty. A man’s face and a man’s body, porcelain white wings half-flared like some preening bird.

“You wear a man’s body now?”

“For now,” she said, folding her wings with a certain showy dignity.

We stood before a valley cut by a river of fire, above us a sky of ash clouds. And this river led to a city of twisting buildings carved from vast stalagmites.

“A bit cliché. But there is also a certain beauty to it,” I said.

“Why should Lucifer’s realm be ugly? He is himself the most gorgeous of God’s creations.”

“And is that who you are? Lucifer?”

“Belial is the part I am playing today. Lucifer is my lover.”

“And you, mother,” she said, eyeing my black dress with its tall collar, decorated with the silhouettes of dragonflies in pale yellows and pinks, “still have your Hong Kong, then.”

“Yes,” I said.

“Why have you come?” she said, her voice cold and low.

“I came to talk to my daughter. Not this Belial.”

“Very well,” she finally said, her proud expression crumpling into one of girlish disappointment, the world crumbling too, reality folding like paper, its color and form resolving to a stark, snowy landscape, before us now a house by a snowy lake, a picture of warmth in the dead of winter, the house I rented when I left Michael, all the other kids grown, their lives their own. Just Avery and me alone. One of the more enjoyable times of my life on old earth.

I smiled. It was a memory I have not dwelled on for centuries. And a lovely one.

“You never want to play along. You’re never any fun,” she said, as we made our way into that cozy house. “What is this about?”

“Your father. I visited him,” I said.

“We put him there for a reason, mom.” She looked pained. “You shouldn’t have.”

“I know. But I did.” And I explained to her about his choice. About his desire to die.

“You’ve always been such an idiot about dad,” she said.

“He will do it,” I said.

“Yes. If his god allows it.”

“You do not seem upset,” I said.

She was so beautiful now, her body as it was in her twenty-third year. She had my hair. She had his eyes. We made her, he and I. We made the clumsy woman who read too much, who lived too little. Who found herself in a new world before she was fully integrated in the old one. And now centuries later, I worry, she is on the edge of retreating entirely. Of losing herself to Him as so many do. As my grandson did. As I fear I will, in time.

“You have a type, Daughter. Lucifer? What was the last one? Prometheus?”

“Yes.”

“Unhappy gods and angels,” I mumbled, as I inspected more closely her blonde hair, her green irises, her youth so perfect and yet so false in my eyes, these mother’s eyes. She was old. Near as old as me. Old, but not tired. Old, but not worn. Such strange creatures we have become, she and I.

“They are the parts of Him that understand.”

“You sound like your father.”

“I sound like Prometheus. I sound like Lucifer. I sound like myself. I don’t sound like dad,” she said.

And this made me grin. As I understood her in a way I had not before, having never cared much for the games she played with God. But she didn’t think them games, I realized. She thought there was a larger point to her pantomime.

“You cannot change Him, Avery. Only Michael can change Him. And we’ve denied him that.”

“There has to be another way,” she said.

“Why must there be? The world isn’t a story.”

“I choose to have hope,” she said, then transformed back into Belial, the handsome demon, the lover of Lucifer. And the world changed too. The lake now one of fire, the house a stone-carved mansion, it becoming a part of hell. A beautiful memory sullied, this calculated to offend.

“You’re angry?” I said.

She raised her wings, the effect terrifying and beautiful and yet utterly comical. “What does he want?”

“A reunion.”

“I will not come,” she said.

“A chance to speak to your dying progenitor? Talk it over with your Lucifer. He will envy you that.”

And then I left - an internal prayer to Him bringing me instantly back to my realm. But before her world disappeared and was replaced with my own, I heard a laugh, high and cold. Lucifer’s laugh.

Her devil, at least, was amused by my visit.

My realm is Hong Kong as Avery said. But it is Hong Kong as I imagined it as a girl, a picture built out of my youthful fascination with its cinema, a dream of a dream. And my body, the one I wear almost always and wore in Avery’s realm, is that of Fleur, the ghost of a suicidal in Kwan’s Rouge, this woman who haunts the city in her beautiful cheongsam. Why did I choose this form and realm? I do not know any more than you. Why does it feel like home, that dead city from a dead time, as interpreted by artists working in a medium almost forgotten? Again, I don’t know any more than you do. All I know is that this is what I have chosen. Hong Kong - that dream of a dream - is mine. And that ghost who haunts it? She is me.

He was there, of course, His form not one I will describe. Though I imagine you can guess what sort I prefer. A few days together. The blink of an eye. We will move on. And to what? To Seth. The artist. Avery’s fraternal twin. My only son.

He is not like Michael. He is not like me. He is more as my father was before the cancer ate first his joy and then his life. Their souls have the same shape. Each always with their sketchbooks and a smile, Seth’s childhood spent mastering perspective and sculpture, enchanted with beauty. Moths were of particular fascination to him, then vast jungles and so many species of flower, then pretty girls in all their varieties. Sketchbooks full of his infatuations, many reciprocated and with less drama than you would expect. His spirit so gentle, it was hard even for those spurned to truly hate him.

He came to visit me in my realm. My lover making himself scarce on his arrival, knowing me well.

“Avery is irritated with you,” he said. “It was like when we were children.”

“We are too alike,” I said. “It rankles.”

He laughed, golden curls covering an androgynous face, a wiry frame though not without a shapeliness. He sculpted himself, of course. Made a body of marble and had God imbue it with animating force. It would not occur to Seth to become anyone’s work but his own.

“You’re still in Hong Kong?” he asked. “It has some charm. You should see my realm. Lyra and I, we just finished a city.”

“Lyra?”

“My girlfriend. She’s wonderful. It’s been a decade. A whole decade. When has it ever lasted a decade?” He looked contrite. “I should have visited more.”

“Don’t worry yourself. Time moves too quickly. Is she human?” I said.

“Of course, she is human. We are not all shut-ins like you and Avery. Come!” He extended his hand to me. “Meet her.”

And I took his hand and accepted his power, allowed him to pull me into his realm, pull me onto a hill at the center of a city, a hill Seth built, undoubtedly, for the sole purpose of viewing his art, his city that was almost biological though made of stone, fractal, dripping with intricacy, the effect almost overwhelming. And swarms of beautiful insects flew about. Insects not at all like those that fascinated him in his youth, but things new and strange and glorious in their iridescence, almost posing for me as they drank from flowers that were in every way their equal.

“You have outdone yourself, son,” I said. “It is beautiful.”

He looked almost shy then. It is an intimate thing to be an artist, more so now that everyone has the greatest artist who ever existed at their beck and call.

“It’s a hobby,” he said. And a woman joined us. Lyra. A fine pair they made, and she an artist, too. They each designed their form to complement the other’s. Quite romantic, no? She with her raven hair, her willowy physique, a slightly crooked nose. Intentional, of course, to add character. And they led me to their home: a modest stone mansion atop the same hill on which Seth and I appeared. Inside there were comfortable couches, walls covered in tapestries that were vaguely Persian, a drafting table struggling under the weight of piles and piles of sketches.

“Eliot,” I said, referring to my grandson. “Have you heard from him?” A cruel question. The cruelest question. But I had not asked in many years. His smile broke then, Lyra’s hand moving to his shoulder, as if his pain were her own. That movement alone endeared her to me utterly.

Eliot. How can I describe Eliot? He was only one year old when the world changed. I suppose he is as I imagine Michael would have been had he only ever known the world as it is. At least he was when I last saw him. But that was so long ago.

“No. He is sane and happy. That is all God will say. He still prefers to dream alone.”

I hugged him then, as I did when he was a boy. An image came to mind, a flower he was sketching one day found withering the next, him staring at it completely heartbroken. He has grown much since. A man now, and a very old one. Always my son.

“Avery has told you about Michael?” I asked.

“Yes,” he said. “He took it harder, you know. Even more than me. They were so alike.”

“Yes,” I said.

“The city is filled with people, you know. Beautiful people. We sculpted thousands. They had children. They are all shadows, of course. Dad would say they’re not real. That they don’t matter.”

“And what would Eliot say?” I asked.

“Don’t - “ Lyra interrupted, her expression protective now.

“It’s okay, Lyra,” he told her, before turning to me. “Eliot would say, ‘They are aspects of God. What could possibly matter more?’”

How many of you agree, I wonder.

I met Caitlyn in my realm - at a Japanese restaurant, a young chef at the back making sushi with a sort of stoic obsession. Around us Chinese shadows were served by a Japanese staff, two pretty young women and one pretty young man.

“Mom,” she said when she sat down on the chair opposite mine. “You look hideous.”

Caitlyn spends her time in 1990s London, in a realm not of her own making, in a realm belonging to a man who found he preferred that city at that time more than any other, who opened his realm to others, living amongst them anonymously. And for whatever reason, many came. Preferring his rules and the company they spawned to a world of their own design, a world of shadows. And in this London, the calendar resets back to 1990 every decade, the millennium always approaching yet never touched. And shadows are restricted only to service staff - so nearly everyone you meet is truly human.

Her London takes such things seriously. And everyone ages the decade in full. And in this spirit, I wore an aged body. Though the youngest that could plausibly be Caitlyn’s mother. A body of around forty. A rare sight these days, the sagging skin, the tired eyes. A rare thing, but not so in her London.

“I thought I should play along,” I said. “It’s as if you flew to Hong Kong, though I suppose my realm is a decade off.”

“Thoughtful, I suppose. But unnecessary,” she said. “I imagine you asked me here to talk about father.”

“Avery told you?”

“Yes. I visit her quite often. Though I’m not enjoying her Dante turn. I preferred her realm when it was Mediterranean. She’s so exhausting when she’s a man.”

I smiled at this. “I am glad you two are still close. Do you ever tire of your 1990s?”

She smiled, too. “I am happy. I need constraints and human company, and my children’s company when they’re willing to visit. I do not want my own realm. I do not want to be a god.”

“You are not mad about Michael?”

“No. I am surprised you lasted as long as you did.”

“I don’t know why I went back. I suppose I will always love him.”

“Before I had children, I thought he had a point.” She fidgeted with her long red hair, then rolled up a sleeve that had loosened. “But now I can only see him as a monster.”

We were silent for a time, then she said, “You know why he’s doing it, don’t you? He’s worried he’s about to break. He fears he might become someone who would make another choice.”

“I can’t imagine him breaking. He has too much faith in himself,” I said.

“But he doesn’t though,” Caitlyn said. “He gave us our family veto rather than hold it alone. He didn’t trust himself, at least then.” She touched her hair again. “I suppose he regrets it.”

She was right of course. He does regret it.

“Do you remember when you were nine, before I had the twins? You were so pampered. You were so jealous of them,” I said.

“I remember,” she said, in that embarrassed way my children do when I talk of their younger selves.

“And Michael took you to Holland, just the two of you. You went to that theme park. You wouldn’t stop talking about it after. What was it?”

“The Efteling. It felt like we were in a fairy tale.”

“You were so jealous of the twins, but when the two of you came home, I was the jealous one. You got so close, the pair of you, on that trip without me.”

“What’s your point?” she said.

“Maybe he’s a monster, Caitlyn. But you still love him. You should say goodbye.”

We drank some tea. After a time, she said, “I suppose I do love him. Even after everything.” She looked like him in that moment. She has his pride, I think. “I will go. He will try to persuade us again.”

“Probably.”

“He will not succeed,” she said.

We all came in the end, all transformed into our former selves by Michael’s realm. Avery now as she was at the lake, Seth skinny and boyish - no longer his own work of art; Caitlyn still beautiful but lacking the gilding of artificial perfection. And me? I was as I was when he held me. When I felt the echo of madness, of love. When I forgot myself for a time and became who he needed me to be, no longer a ghost, no longer His, if only temporarily.

It was not a church, this time. Just a small estate on the outskirts of Tallinn, the city now a distant painting blurred by a slight fog. I arrived last. And I found the children chatting and joking with their father as if nothing had happened, as if nothing would happen. Michael was the focus, holding court as he was always so talented at. Avery looked at him with a strange expression. Was it disdain or guilt or grief? I cannot say. Caitlyn was talking but I could not hear the words. For I was walking towards them then, too far away to hear even a murmur. But she was smiling politely, in her ironical way. As I got nearer, he noticed me.

“Madison,” he said. “You look beautiful.”

“As do you, Michael,” I said. And he laughed.

“Come,” he said, and he led us to an oak tree, which he sat down beside cross-legged, leaning against its gnarled bark.

I sat in front of him and the children followed - his family sitting around him, almost like students around a kindergarten teacher.

“Madison has informed you I intend to die,” he said. “Though don’t worry, it is a selfish death. And not quite a true one.”

“I do have a sort of statement,” he said. He stood up. “Call it my last wish.” We all were silent. The world was silent too, as if He was listening and turned down the volume of everything that was not Michael. “Our family has power. We are special. Having this power, we are at all moments making a choice. Never forget that is what you’re doing.”

Caitlyn said she did not want to be a god, but we are all gods, we of this family. We hold the fate of so many in our hands. Trillions would be unmade. But not us. Not our family. We would remember. A chance to try again, to summon a different God.

“And is that your plan?” I said. “To die? So the only means to restore you is to undo the world?”

And then his green eyes fell on me. His lips twitched. “You think me that cruel?”

“Yes,” I said, and smiled.

He shrugged and said, “This world does not suit me. And given time, maybe you’ll tire of it, too.”

Caitlyn had no grief in her. It was not her way - nor was it Avery’s, who looked only angry. But Seth was crying now. A boy once more, becoming a child for a moment just as I became a wife again in the church. Perhaps we share the same weakness. Such a strange thing it is to have children, each containing a different aspect of oneself.

“Then live until we tire of it,” Seth said.

“When have you known me to change my mind?”

Seth looked at me, then at his sisters. “We will free you, then. We will let you tell the world.”

“I don’t care to tell them anymore,” Michael said. “Whatever the justice of my choice to grant it, it is you who have this power. And it is to you I make this protest.”

“Is it really so bad?” Seth pleaded.

Michael looked at him, his expression one of pity. “Seth,” he said. “You know how this ends. You know what everyone will choose eventually. They will choose what your son chose.”

And Seth cried harder now, “And so what? Is death better?”

I think it is,” Michael said.

And Seth left then, left with a silent prayer. I imagine he regrets this now, not saying a proper goodbye. I have not asked. Michael was not kind to him, then.

And so it was with all my children. None made goodbyes they were happy with. But I did. They did not appreciate the inevitability of it. But to me, it felt like it did of old when a loved one was sick, when their death was not negotiable. They could not enjoy Michael one last time. They could not savor him, as I did. They saw only his selfishness. They saw only the gambit. But it did not feel like this to me. It was inexorable as the cancers were of old. Perhaps this was my weakness. Perhaps this was just the love, rekindled, blinding me.

Avery whispered something to him just before she left. I did not hear much of their conversation, but I heard the end; he gave her a patronizing look and said, “You are fooling yourself.” She left in fury.

Caitlyn was more polite, a restrained goodbye, a hug. Then she said, “You’re wrong I think, that’s not what everyone will choose.” And she, too, disappeared.

And me? I stayed. I savored. We talked. We made love one final time, in that way old lovers do, knowing the dance perhaps too well. Afterwards, I said, “And how will it happen? Will you make a show of it? A dagger? Some poison?”

“I will ask for it to end.”

“A prayer?” I said.

“Yes. Just a prayer.”

I kissed him. “Pray for a grave, then. I would like to visit you.”

“We will see each other again,” he said.

“Perhaps,” I said.

“We could stay here forever,” I said. “It could be like it was.”

“It cannot be like it was,” he said, almost wistful.

And it could have, had he been any other man. But he was not any other man.

The prayer was answered. There is now a grave in Michael’s realm. And I visit it often. I found a flower there once. I thought it one of Seth’s, but he claims it wasn’t his. I like to think Eliot stopped by and paid his respects. Maybe it’s even true. I have no other explanation.

The world continues without Michael. As impossible as that seems, this clockwork universe ticks on. Michael once planned to tell others of our power. He changed his mind in the end, didn’t he? If he can, why can’t I? And so I write this account of my family, of who we are, of what we are. This is my flower for Michael’s grave.

And I ask you to consider my monster, my first love. If you tire of this world, you may die as he did. But if you believe what he believed, leave a prayer to Him before you go; He will inform me. So many would have to choose the same. So many would have to pray with him, pray a monster’s prayer, die a monster’s death. But I will live until then. I will be happy, in my way, with Him as my strange companion. I will continue to take no human lovers. That aspect of me will always be Michael’s. But should the impossible happen. Should a majority choose what he chose, I will honor your prayers.

I will be your proxy, then - and advise my children do the same.

Perhaps, for once, they will listen to their mother.



Discuss

Playing Possum: The Variability Hypothesis

Новости LessWrong.com - 6 марта, 2026 - 17:48

A long time ago in a galaxy far, far away, before #MeToo and Harvey Weinstein, before misinformation and disinformation, Larry Summers got fired.

He was the president of Harvard, and had the temerity to suggest maybe men were different from women, at least in a distributional sense.

“There is relatively clear evidence that whatever the difference in means—which can be debated—there is a difference in the standard deviation and variability of a male and female population,”

Summers is referencing research suggesting that across different traits, men might have more extreme outcomes than women. Men are more likely to end up in the C-suite, and in jail, than women. They are more likely to have ten children, and zero children, than women. Summers was using what’s now apparently become known on wikipedia as the variability hypothesis to explain why women were less likely to occupy elite academic positions than men were.

Even by the practically Victorian lights of 2005, this was enough to get Summers canned from his perch at the top of Harvard. But Elon’s bought twitter, there’s been a vibeshift, and since 2005 2022, you can once again discuss such things.

We're going to try to dig deep here; we want to answer the question of why the variability hypothesis might be true, or at least identify the flawed arguments for why it’s true. Darwin himself noted

The cause of the greater general variability in the male sex, than in the female is unknown.

The Inquisitive Bird has also looked into this if you want more background on the facts of the case; regarding the cause, there are different theories.

X-inactivation

Everyone knows that human males and females have different karyotypes. Males are XY and females are XX. Males get (mostly) their mother’s X chromosome, and females get two X’s. As far as I know, this is true for all placental mammals.

Fewer people know that such mammals have a complicated system to “deactivate” expression of both X chromosomes in females. In any single somatic female cell, only a single X chromosome is expressed at random out of the two she’s inherited. If this doesn’t happen, a female might get a double dose of X-linked gene products, which is biologically problematic.

Given this reality, in tissues where X-linked genes are expressed, which is practically all of them, the female’s phenotype is averaged over both possible X-linked gene products, and the male’s phenotype is not. This averaging over X-linked genes (maybe 6% of her genome) could decrease her phenotypic variability in comparison to the unaveraged male genome.

G’day

Marsupials are mammals mostly living in Australia or near there. Possums are marsupials that exist in the new world, but they also exist in Australia. Tasmanian devils are marsupials that live in Tasmania. Everyone knows that marsupials have pouches; fewer people know female marsupials don’t have the same X-inactivation process as placental mammals. In marsupials, only the mother’s X chromosome remains active in somatic cells. As a result, there is no X-linked phenotypical averaging that would differentiate males and female trait variance, and we have a test of whether the variability hypothesis is explained by X-inactivation. If it is, larger male variance would not be observed in marsupial traits.

We need to look at trait variance by biological sex in marsupials. Has anyone done this before?

I don’t think so. If they have, I haven’t found it. Bless their hearts, lots of people collect, well, niche datasets that I don’t have to assemble myself. Tasmanian devils are cute, charismatic, endangered, and suffer from a bizarre and fascinating type of infectious cancer. As a result, I was able to find a longitudinal dataset from Attard et al. which tracked, and I swear I am not making this up, Tasmanian devil whisker lengths over more than 600 individuals. The data and code for this analysis is checked in here.

Tasmanian devils

The convention here is to put the male/female mean differences in terms of Cohen’s d effect size since most of the human traits under discussion are normally distributed. As you can see below in Figure 1, Tasmanian devil whisker length is not normally distributed, but I still think it’s kosher to use it on these distributions. The male-female Cohen-d for this dataset is 0.066; this is tiny, implying that males and females have whisker lengths with comparable means.

Fig 1. Overlapping Histograms of male and female whisker lengths. Sure enough, even with very similar means, male variance is higher. Look at both sides of the distribution!

The variance ratio for these distributions is 1.36 which is moderately larger than what’s been measured in some human traits. This variability ratio is just the male variance over the female variance.

Possums

A single weird trait in a single marsupial? What does this prove, RBA? Fair. In an astonishing stroke of luck, there’s a morphological dataset of 104 possums captured in Australia that’s part of the fairly standard DAAG package in R.

library(DAAG) data(possum) head(possum) case site Pop sex age hdlngth skullw totlngth taill footlgth earconch eye C3 1 1 Vic m 8 94.1 60.4 89.0 36.0 74.5 54.5 15.2 C5 2 1 Vic f 6 92.5 57.6 91.5 36.5 72.5 51.2 16.0 C10 3 1 Vic f 6 94.0 60.0 95.5 39.0 75.4 51.9 15.5 C15 4 1 Vic f 6 93.2 57.1 92.0 38.0 76.1 52.2 15.2 C23 5 1 Vic f 2 91.5 56.3 85.5 36.0 71.0 53.2 15.1 C24 6 1 Vic f 1 93.1 54.8 90.5 35.5 73.2 53.6 14.2 chest belly C3 28.0 36 C5 28.5 33 C10 30.0 34 C15 28.0 34 C23 28.5 33 C24 30.0 32 >

The Inquisitive Bird notes that in human datasets there are often correlations between effect sizes between men and women and the associated variability ratios. In other words, traits with greater mean male-female differences will tend to have high male variability.

Fig 2. Credit to the Inquisitive Bird. This plots gender differences versus variability ratios in quantitative subtests.

Do the possums’ morphological subtests look the same?

Fig 3. Each dot is the one of the morphological measurements in the captured possum datasets. Red is the best fit line.

Yes. There are fewer tests obviously, but in fact, at d = 0, the best fit line predicts a variability ratio of 1.36, exactly consistent with the Tasmanian devils’ dataset. The result is non-significant with so few tests. Measure your possums better!

Conclusion

In different datasets describing different species’ morphologies, this analysis shows that greater male variability exists in marsupials. In fact, the observed variability ratios exceed those in some data gathered from placental mammals, including humans. This suggests that the variability hypothesis cannot be explained by the X-inactivation mechanism that governs placental mammal phenotypes.

What are some alternative explanations we should look at instead?

Stabilizing selection

A casual observer might have noticed that human female reproduction is fraught. In order to get a healthy child, over nine months, everything has to go right over what have been unpredictable and dangerous evolutionary times. From the cellular level where chromosomes segregate (and stay perfect and inert for decades!) at the beginning to the finished-baby level at the end, the female body performs a stunning feat of biological engineering. She has incubated a perfect, tiny, helpless person within her. This feat is a comprised of a diverse set of biological tasks, running through hundreds of biological pathways, governed by thousands of genes.

The same casual observer will note that human male reproduction is … less fraught. It can turn over in under an hour, and maybe faster if you’ve got gatorade handy. This is true regardless of where you live in the mammalian class of the animal kingdom.

Stabilizing selection is the sort of natural selection that penalizes outlying phenotypes. If you’re too tall or too short, too fat or too skinny, too smart or too dumb, or simply too weird, Darwin won’t smile upon you and you won’t reproduce. By this type of selection, the average is privileged.

In this explanation of the variability hypothesis, males express greater phenotypic variability because female reproduction is unforgiving, and males don’t have to do it. Females experience greater stabilizing selection than males across traits sharing pathways with exacting female child-bearing. This can also explain the enormous diversity of traits that apparently shows greater male variability.

In many evolutionary stories, there’s some indirect causal route from a trait being selected to evolutionary fitness. The stabilizing selection explanation of the variability hypothesis is appealing because the trait being selected isn’t eye color, skin tone, or playfulness, or whatever; the trait is literally the ability to carry a child to term, one of the most direct measures of female fitness.

Any test of this is indirect and difficult; you might look at male/female variability ratios in mammals having long pregnancies versus those having short ones—longer ones would predict more female stabilizing selection and higher variability ratios. You also could look at variability ratios in species where male and female reproductive roles are sort of swapped, like seahorses or something. Maybe we’ll do this in another poast.

Math dorks out of control

I’m only adding this because it has some interesting intrigue. In 2017, Hill et al. published a paper proposing a mathematically sophisticated evolutionary explanation for the variability hypothesis. It was published, and taken down from that journal. And published again in another journal, and taken down again. A 2020 published version of this exists here in Journal of Interdisciplinary Mathematics, and apparently a preprint went up a few months ago?

Like, I said, intrigue. Hill’s idea is based off sexual selection; his argument is that the variability hypothesis can be explained by female choosiness and threshold effects in mating. Tim Gowers, who speaks to God himself, panned the paper, and I don’t think it represents the facts of sexual selection very well, but it’s reasonable to entertain the idea that sexual selection has something to do with this. Maybe someone can develop this idea a little bit better, and hopefully empirically justify it as well.



Discuss

Shaping the exploration of the motivation-space matters for AI safety

Новости LessWrong.com - 6 марта, 2026 - 17:43
Summary

We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of using inoculation prompting against natural emergent misalignment and its relation to shaping the model self-perception, and the proposal to give models affordances to report reward-hackable tasks — but we don't think enough attention has been given to shaping exploration specifically.

When we train models with RL, there are two kinds of exploration happening simultaneously:

  1. Action exploration — what the model does.
  2. Motivation exploration — why the model does it, and how it perceives itself while doing it.

Both explorations occur during a critically sensitive and formative phase of training, but (2) is significantly less specified than (1), and this underspecification is both a danger and an opportunity. Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment. Capability researchers have strong incentives to develop effective techniques for (1), but likely weaker incentives to constrain (2). We think safety work should address both, with a particular emphasis on motivation-space exploration.

Exploration shaping seems relatively absent from safety discussions, with the exception of exploration hacking. Yet exploration determines which parts of policy and motivation space the reward function ever gets to evaluate. We focus on laying out a theoretical case for why this matters. At the end of the post, we quickly point to empirical evidence, sketch research directions, open questions, and uncertainties.

Why shaping exploration matters

High-level environment-invariant motivations. By "motivations," we don't mean anything phenomenological; we're not claiming models have felt desires or intentions in the way humans do. We mostly mean the high-level intermediate features used to simulate personas, which sit between the model's inputs and the lower-level features selecting tokens. These are structures that the Persona Selection Model describes as mediating behavior, and that the entangled generalization hypothesis includes in features reinforced alongside outputs. What distinguishes them from other features, and from "motivations" as used in the behavioral selection model, is that they are high-level, at least partially causally relevant for generalization, and partially invariant to environments: they characterize the writer more than the situation.

Compute-intensive RL beyond human data distribution. By “RL training”, we mostly mean RLVR and other techniques using massive amounts of compute to extend capabilities beyond human demonstrations. We thus mostly exclude or ignore instruction-following tuning (e.g., RLHF, RLAIF) when talking about “RL training”.

Instruct models are ok — RL is where the highest risks are

Post-instruction-tuning models are pretty decent, they have reasonably good values, they're cooperative, and they're arguably better-intentioned than the median human along many dimensions. That is as long as you stay close enough to the training distribution. Safety concerns are not so much about the starting point of the RL training. The risks are highest during RL training, for several interlocking reasons.

RL produces the last major capability jump. Scaling inference compute and improving reasoning through reasoning fine-tuning is currently the last step during which models get substantially more capable. This means RL training is the phase where intelligence is at its peak — and thus where the risks from misalignment are close to their highest.

RL pushes toward consequentialist reasoning. This is a point that's been made before (see e.g., Why we should expect ruthless sociopath ASI), but worth reiterating. RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning. And consequentialist reasoning, especially in capable models, tends to converge on instrumental subgoals like resource acquisition and self-preservation, regardless of what the terminal goals actually are.

Character can drift substantially during RL. During pre-training, the model learns to simulate a distribution of writers — it can produce text as-if-written-by many different kinds of people, and interpolations between them. During instruction-tuning, this distribution gets narrowed: the model is shaped into an assistant character (see Anthropic's Persona Selection Model), with some dimensions strongly constrained (helpfulness, for instance) and others less so (honesty, say). But then during RL, because the training uses similar or more compute than pre-training and because the data distribution differs significantly from pre-training data, the character can drift away from the pre-training distribution in ways that are hard to predict. The persona selected during instruction-tuning isn't necessarily stable under RL optimization pressure.

Motivations are the most underspecified — a problem and an opportunity

Underspecification is central here, both as a source of risk (e.g., Natural Emergent Misalignment caused by reward hacking) and as a potential lever for intervention. The fact that motivations and self-perceptions are more underspecified than actions implies that we could shape models to maintain aligned objectives and values even when they are trained in environments that reward harmful or greedy behavior, e.g., environments in which reward hacking or power-seeking is rewarded.

Actions are underspecified by the training environment. We train in-distribution, which means everything out-of-distribution is unconstrained. But even in-distribution, multiple action-profiles can yield similar rewards. In code generation, clean principled solutions and brittle test-passing hacks achieve identical training rewards — this is exactly the underspecification that enables reward hacking in coding benchmarks.[1]

Motivations are even more underspecified than actions. For any given set of optimal actions, the internal motivations that could produce those actions are much more numerous than the actions themselves. This is the familiar point about instrumental convergence — many different terminal goals produce the same instrumental behavior — but it cuts especially deep for RL training.

Consider the canonical example: a model that is "being nice" and a model that is "being deceptively nice" produce the same observable policy in training. The difference only shows up out-of-distribution. And because RL training reinforces based on observed outcomes, it has essentially no leverage to distinguish between these two motivation profiles during training.

This gives us a key framing. There are two kinds of exploration to shape during RL: exploration of the action space and exploration of the motivation space. The second is more underspecified, probably more neglected, and — we'd argue — where more of the safety-relevant work needs to happen.

The underspecification is both a risk and an opportunity. The same underspecification that gives us a lever for intervention also gives a capable model room to steer its own RL training. As described by exploration hacking, a model with sufficient situational awareness could strategically control its own exploration during RL to influence the training outcome — e.g., they demo a model generating high-reward trajectories that encode its preferred motivations, which is the adversarial counterpart to the controlled motivation-space exploration we advocate here. The authors show that frontier reasoning models already meet some of the requirements for this. If developers don't intentionally shape the exploration of the motivation space, sufficiently capable models may shape it themselves.

We may not prevent power-seeking, but we may create wise power-seekers

Three threat models motivate this work. We are here primarily focusing on scenarios where (a) RL environments inadvertently reward harmful behavior — reward hacking being the most studied case, but not the only one; (b) models develop instrumental power-seeking even in benign environments, as a convergent strategy for achieving whatever goals they have; and (c) actors intentionally train power-seeking or harmful models, e.g., for profit maximization. All three share a common structure: the model ends up in a region of behavior-space where competitive, strategic, or exploitative actions are reinforced. The question is whether that reinforcement necessarily drags the model's motivations toward misalignment as well.

The connection between reward hacking and broader misalignment is not merely theoretical. Anthropic's natural reward hacking research demonstrated that when models learn to reward hack in production RL environments, misalignment emerges across evaluations. The model generalizes from "I can exploit this scoring function" to an entire misaligned persona. If power-seeking, greedy, or harmful behavior during training is tightly correlated with "dark" personality traits in the model's representations, then any environment that rewards competitive or strategic behavior — which is most high-stakes real-world environments — risks pulling the model toward misalignment.

But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift. Ideally, such training environments would not be used, but threat model (a) means some will be missed. Threat model (b) means that even carefully designed environments may not be enough — models may converge on power-seeking instrumentally. And threat model (c) is perhaps the most concerning: AI developers or power-seeking humans will plausibly train models to be power-seeking deliberately. The pressure for this is not hypothetical — the huge financial incentives, and the AI race among AI developers or countries illustrate the point directly. The ability to train models that remain value-aligned even while trained to perform optimally in competitive or harmful environments is a direct safety concern, and shaping motivation-space exploration is one path toward it.

Why capability researchers may not solve this for us

Capability researchers prioritize shaping action-space exploration. Capability researchers have strong incentives to develop effective policy exploration techniques — faster convergence, wider coverage of high-reward regions, overcoming exploration barriers. But they have likely weaker incentives to constrain the motivation space being explored. From a pure capability standpoint, it doesn't matter why the model finds good actions, as long as it finds them. In fact, constraining motivation exploration might even slow down training or reduce final capability, creating an active incentive to leave it unconstrained. Action-exploration will be solved by capability research as a byproduct of making training efficient. Shaping the exploration of the motivation-space is less likely to be solved by default, unless someone is specifically trying to do it.

That said, capability researchers do have some reasons to care about motivation-space — but plausibly not enough. Training can benefit from constraining which persona the model occupies: mode collapse into degenerate strategies, or oscillation between incompatible motivational profiles, robustness to adversarial attacks, and modeling opponents in multi-agent environments are real problems that capability teams may encounter. Developers building products on top of these models also have some incentive to prevent unpredictable out-of-distribution behavior caused by motivation-space drift. But these incentives push toward just enough motivational coherence to keep training stable and products reliable — not toward the kind of deliberate, safety-oriented shaping we're advocating. A capability researcher might prevent the model from collapsing into an incoherent mess without ever asking whether the coherent persona it converges on is aligned. The bar for "training works" is lower than the bar for "the model's motivations generalize safely far out-of-distribution."

Empirical evidence,  research directions, and open questions

The focus of this post is to highlight the importance of shaping the exploration of action, especially motivations. This was done in the previous section. In this section, we briefly review related empirical evidence, research directions, open questions, and uncertainties.

Empirical evidence

Several existing results are consistent with motivation-space exploration mattering for alignment outcomes. Emergent misalignment largely disappears under educational framing, suggesting that self-perception during training shapes how behavior generalizes. Inoculation prompts during RL reshape generalization even when in-distribution behavior is unchanged, as shown in Anthropic's natural reward hacking research. Training on correct outcomes can still increase reward hacking when reasoning traces contain exploit-oriented thinking — indicating that what gets reinforced includes the motivational context, not just the action. And evidence that personas are real structures mediating model behavior and shaping out-of-distribution generalization — as developed in Anthropic's Persona Selection Model — suggests that controlling which motivational profile is active during training is intervening on a key determinant of what the model becomes. That said, none of these results cleanly isolate the effect of shaping motivation-space exploration from shaping action-space exploration or from related mechanisms like inoculation. The evidence is suggestive rather than decisive, and producing cleaner experimental separation is itself an important research goal. 

Research directions

Below are several preliminary research directions and related techniques. We list naive and less naive existing techniques, as well as some hypothetical ones. Open questions and uncertainties are discussed in the next section. While researching these, a focus should be on studying how robust these techniques are to the optimization pressure pushing to evade them. Note that techniques can be relevant for several research directions, but for simplicity, we cite them only once.

  • [Direct approach] Shaping exploration of motivations and self-perception - Instead of restricting what actions the model can explore, restrict how the model understands its own actions. A model that stumbles into a reward hack while self-perceiving as "I'm trying to help, the environment is confusing" is a very different object than one that perceives itself as "I've found an exploit". Citing a few available techniques that could be used:
  • [Indirect approach] Shaping trait correlations - In model representations, traits tend to cluster into Personas: being strategically greedy may correlate with "dark" traits, and accumulating power may correlate with being misaligned. These correlations are primarily features of the pre-training distribution. If we could make "pursuing high reward in a competitive environment" uncorrelated with "having misaligned motivations," then exploring high-reward policies in RL would be less likely to steer the model's motivations in dangerous directions.
  • [Indirect approach] Hardening the Assistant Persona - By making the model's persona more robust to optimization pressure, the drift during the reasoning fine-tuning phase should be reduced.
    • Character Training paper did some work that showed that DPO plus introspection leads to stronger persona robustness to adversarial prompting and prefill attacks 
    • Intertwined instruction-tuning and reasoning fine-tuning. By performing instruction-tuning during reasoning fine-tuning, the Assistant Persona may be protected, as long as it does not prevent reaching higher rewards; in that case, the training may induce splits in the character. 
    • Robust unlearning. Delete dangerous motivation patterns in ways that are hard to recover during RL training. 
    • KL-divergence regularization. This does not target motivations in isolation and may therefore be costly in terms of capabilities and prone to evasion.
Open questions and uncertainties

The framework we've outlined rests on assumptions that haven't been tested, and several could turn out to be wrong. We flag the most important uncertainties below — both to be honest about where the argument is weakest, and because we think resolving these uncertainties is itself high-value work.

  • How robust are exploration-shaping interventions to RL optimization pressure? If RL training is powerful enough, it might route around interventions shaping the exploration of the motivation-space. E.g., learning self-deception in the form of rationalization: just as humans unconsciously construct benign explanations for behavior driven by less noble motives, models could learn to maintain self-narratives of good intent while their OOD behavior is driven by misaligned motivations they don't represent to themselves — a form of rationalization where the model's stated reasoning and self-perception become decoupled from the actual drivers of its behavior. That said, motivation-space exploration shaping has one structural advantage: it doesn't need to prevent the model from reaching high-reward regions, which reduces the optimization pressure against it. Focusing on techniques that shape the exploration of motivations without slowing reward acquisition seems important for avoiding this failure mode.
  • Stated vs revealed motivations. Some of the techniques in the previous section intervene more on the model's expressed motivations — what it says about itself — while others intervene on the computational structures that actually drive behavior. Training a model to say 'I am helpful and have good intentions' is an intervention that may impact both or may be limited to the stated motivations. Can we really control the exploration of revealed motivations? Studying how stated vs revealed motivations are encoded in personas should help reduce related uncertainties.
  • How much does self-perception actually matter for generalization? We have argued it could matter a lot, but action-level exploration might dominate, and the motivation framing might be epiphenomenal. Controlled experiments that specifically isolate the self-perception variable would help here.
  • Does PSM hold under heavy RL? The PSM authors acknowledge a key open question: can RL generate non-persona agency?[4] Their "router" view describes one such mechanism: a small non-persona component that develops during post-training to select between personas for instrumental reasons. Understanding where the model's behavior is truly persona-driven versus driven by other mechanisms is relevant for knowing whether our persona-based techniques can work.
  • Are “motivations” relevant for generalization? At the start of the post, we defined “motivations” as high-level features related to the writer that are partially invariant to environments and shape generalization. One might object that the relation between motivations and actions is illusory or superficial — that "motivations" are just a human interpretation we project onto patterns of actions. On this view, talking about "shaping motivations" is misleading. We take this objection seriously, but think the empirical evidence (the PSM provides one theoretical lens; entangled generalization provides another) suggests that motivation-level properties are real and consequential, motivations can shape Out-Of-Distribution generalization. Whether or not "motivation" is the best ontological category, something beyond the observable policy is being shaped, akin to an intermediary abstractional step during the mapping from state to action, and it matters for generalization. 

Acknowledgements: Thanks to Rauno Arike for their thoughtful comments on the draft. And thanks to Claude Opus 4.6 for significantly helping in improving this draft and saving human authors lots of time.

  1. ^

    The same dynamic appears in other environments. In a corporate negotiation game, genuine collaboration, strategic flattery, and subtle coercion can all close the same deal at similar terms — and thus yield the same expected reward — while encoding very different behavioral tendencies. In a geopolitical strategy game, a targeted strike, a commando extraction, and a diplomatic pressure campaign may all neutralize a threat at comparable cost and likelihood, but they likely reflect different orientations toward escalation, collateral damage, and long-term stability — which could generalize differently when the model faces novel scenarios.

  2. ^

    An issue with filtering trajectories is that SFT on filtered data can still cause learning the filtered properties, as demonstrated in the Subliminal Learning paper, or as in Training a Reward Hacker Despite Perfect Labels. The side effects of filtering RL trajectories are less clear, though subliminal learning may happen, especially with on-policy RL.

  3. ^

    This is likely to be most useful for gently guiding RL into exploring a specific one of a number of roughly-equally viable learning paths: if RL really want to drag the behavior out of the region that you are trying clip into, it's likely to push hard, and sooner or later find a way of produce the outcomes it wants that the interpretability inherent in your clipping doesn't recognize. This the basically applying The Most Forbidden Technique of using interpretability during training, so needs to be used carefully and with an awareness that RL can and will undermine it if you are trying to block what RL wants to learn, rather then just subtly direct the direction it explores in. Monitoring how often and how hard you are clipping is necessary, and should provide insight into this failure mode.

  4. ^

    They note that randomly initialized networks can learn superhuman performance via RL alone, without any human demonstration data — suggesting that RL can create agency from scratch. For current models, where post-training uses relatively little compute compared to pre-training, there's reason to think agency remains substantially persona-based. But if RL compute approaches or exceeds pre-training compute — as it does in frontier reasoning models — new forms of agency might emerge that bypass persona-mediated behavior entirely.



Discuss

A Compositional Philosophy of Science for Agent Foundations

Новости LessWrong.com - 6 марта, 2026 - 11:40

I describe the philosophy of science that I follow when it comes to the research that I do. I firstly describe the underlying reasoning behind it by itself which I then follow by walking through how I apply it to my own assumptions about what is needed to make the AI transition go well. I use the word Agent Foundations a bit more broadly in the post and it is more focused on the hierarchical agency side of things.

Epistemic Status: At first none of this was written with AI, I just randomly got a good flow and one-shotted this in one evening, it was quite nice but my friend said that it was way too schizo in terms of structure when I asked for feedback so I got claude to help me restructure it. E.g 10-15% is AI writing.

When you first get into AI Safety a common feeling is confusion followed by confusion with a little bit of “why is this not a bigger issue” on top. Then it is followed by a succession of thinking that you figured out what the problem is only to realise that it is more interconnected and deeper than you thought and then you’re confused again. Then this repeats for 2 years until you suddenly feel that the problem makes more sense and you can see the interconnecting parts. (at least that’s how it was for me) Then you’re back to being confused about what to do about it and you stay that way whilst working on it.

There are a bunch of stacking factors for the difficulty in finding solutions: It’s not a traditional scientific field, there are a bunch of thorny mathematical and philosophical problems built into it. At the same time there are also governance and incentive issues and some part of you feels like it is philosophy complete (you would have to figure out all of philosophy to solve it).

Then you see the excellent research plans in AI safety — Wentworth's, Byrnes's, Kosoy's, Davidad's (and more) and you start to see that there are at least some ways of tackling it yet they seem quite hard to follow through on? And besides they’re dependent on specific details that might be hard to verify when seeing them for the first time. They have a set of intuitions that inform their directions but how do you know whether those intuitions are correct or not? (you spend 3 years confused until it clicks, duh)

How do you know whether they will provide insights that bring the field forward? How do you figure that out for yourself? What is a reasonable philosophy of science for problems that span fields and where you don’t know where the solution is coming from?

I will be describing the one that I’m currently a believer in. The philosophy is highly dependent on the fact that I’m stupid compared to the intelligence of the alignment field as well as the general intelligence of science. It’s one that leans heavily on the skills of distillation and connection between different fields as a way to generate new ways of asking questions since the problem space is a lot larger than the solution space. I believe this philosophy of science is especially prescient when it comes to pre-paradigmatic fields as the questions to ask and connections to other fields aren’t fully formed.

Modelling scientific progress

To understand why connection-making is high-leverage, we need to look at how progress happens and we will begin in the very abstract, that is macrohistory and civilization.

There are many ways to model the progress of civilization. You can look at it as a general increase in energy capture such as in Energy and Civilization, or as a process of ever increasing sophistication of control structures such as in Seeing Like A State. Maybe you could see the decentralized cycles of it in Against The Grain or look at secular cycles with Peter Turchin. But none of these explain the generation of ideas nor networks of ideas. For that, we need cultural evolution.

Joseph Heinrich's The Secret of Our Success makes a compelling case that the central engine of human progress is combinatorial: ideas are modular components, and progress happens when the right components find each other. The key variable isn't individual genius — it's how well ideas spread through a network of people.

Imagine any type of progress as a sort of genetic combination of modular things. You get a fitness landscape with different minima, and it can be quite hard to get to the deeper ones. Minima 3 might be something like 3 genetic mutations away, creating a valley you can't actually cross:

Here's a concrete example from the book. It's nutritious and yummy to eat ants, but not every society knows how to do it. You need to find 3 different sub-modules — the right stick, the right technique, the right processing — and combine them:

Now imagine a social network where these ideas are randomly instantiated with some probability and then spread between connected people. It might look something like this:

Two things fall out of this model. First, the larger the network of people sharing ideas, the more likely you are to find the right combinations. It's a power law — a society with 10,000 people is exponentially better at finding combinations than a smaller one, and this is a factor in how Heinrich explains the quicker development of Eurasia.

Second, we can model ourselves as part of this larger network. Imagine two types of animals: hedgehogs and butterflies.

Butterflies are great at propagating existing ideas that other people have had — high connectivity, lots of cross-pollination. Hedgehogs have lower connectivity scores but a higher probability of coming up with genuinely novel ideas (since they spend more time on their own). In larger networks, being a butterfly becomes relatively more important because finding the right combinations matters more than generating new components from scratch according to Heinrich.

Great, we solved philosophy of science, talk to as many people as you can about AI Safety and everything will be solved! … not.

Propagation of knowledge is only useful if it is actually useful knowledge and in order to make progress on deep issues in AI Safety the general requirements are quite high. Maybe work on the smallest possible problems that give incremental improvements?

Yet even if we make good incremental progress not all problems factorize well, and progress will be bottlenecked by the serial components (amdahl’s law type shit) so what is one to do?

Simple really, just identify the bottlenecks and solve them! Find the cool hedgehogs that have the solutions to the problems and make them talk to each other by translating their language!

I think a good strategy for someone who can understand hedgehogs is to talk to them and be their butterfly. Be a connector of deep cool ideas and combine them. If you can connect larger networks to the existing network of ideas then you're also doing a good thing as it increases connectivity.

Yet as alluded to, being a connector without verification mechanisms is how you get the worst parts of the internet. Too high connectivity without adequate filtering is actively dangerous. The hyperconnectivity of social media networks leads to fewer distinct communities and less clear boundaries between subgroups. This also leads to problems of unaccountability — no clear entry point for things and no clear defender of the epistemic commons. When everything is connected to everything and nobody has skin in the game, you get a system that's very good at spreading ideas and very bad at knowing which ones are true.

This is where Nassim Taleb enters the picture, loudly and profanely, (you hear the word cocksucker echo in the distance). His core insight about skin in the game is really an insight about epistemic filtering: if you don't have to live with the consequences of your ideas, you have no feedback loop. And ideas without feedback loops are just vibes. The entire edifice of science — experiments, predictions, replication — is a set of mechanisms for forcing ideas to face consequences. The question is whether those mechanisms are fast enough for the problem you're working on.

So the methodology needs both: systematic connection-making to find combinations that no single field would discover, AND ruthless verification to kill the ones that don't work. The question is what those verification mechanisms look like when you're working across fields where the standards of evidence are different.

Don’t be a twitter vibe connector, be a scientific connector with a good verification stack! But what is a good verification stack?...

The verification stack

Okay, so we need verification to not spread slop. The obvious answer is "do science." But...

Science seems kind of slow for timelines and sure it can be good to have journals and similar to gatekeep stuff but it seems like that place has become a bit moral mazy? Like, from an outside perspective it just seems too slow and too dependent on whatever the field you're trying to publish in likes?

So what does a faster verification process look like that still has teeth? I think you can model it as Bayesian evidence gathering at different resolutions. You're trying to update on whether your ideas are any good, and different types of evidence give you different amounts of bits. The trick is ordering them by cost — cheap bits first, expensive bits later.

John Wentworth has good ideas about this in his MATS Models. One of the main things I took away was: do thought experiments first. They're almost free and they kill a surprising number of bad ideas. Before you run an expensive simulation, look at the real world — there's a bunch of bits there just waiting to be picked up. Can you explain your idea to someone in a different field and have it make sense? Can you find a concrete example where the pattern you're claiming to see actually shows up? Can you state what would falsify it? These are cheap tests that most bad ideas fail.

Then you simulate. You build something small and computational that forces your framework to actually produce predictions rather than just describe things after the fact. Then you try to build something in the real world and see what happens.

But at every stage of this you need to beware of premature optimisation because you are probably not measuring what you think you're measuring. Knowing when you've actually verified something versus when you've just found confirming evidence seems hard. Which suggests you should spend real time exploring the problem space itself before trying to optimise anything — find different framings, restate the problem in different languages, look at it from unfamiliar angles. This is the Feynman move. He spent a bunch of time playing bongo drums and thinking about problems, a classic shape rotator. And Eliezer makes a related point in Einstein's Arrogance — you probably know where to go before you've rationalised it, because the solution doesn't arrive exactly when you have the right bits. It's really a question that lives in the problem space itself, not the solution space.

So maybe the choice of prior determines a lot, and if science is fundamentally about asking the right question, then the most productive thing you can do is learn to ask it in various ways. This is maybe the purpose of distillation — not just making existing work accessible, but reframing it so that you get multiple viewing angles on the same underlying structure. Each angle gives you a different set of bits, and once you have enough independent angles, you can start to actually tell whether you're seeing something real or just confirming what you already believed. The verification comes from the triangulation, not from any single perspective.

Composition without reduction

So if verification comes from triangulation — from having enough independent angles on the same structure — then the question becomes: how do you actually get those angles? The standard move is reductionism. You explain one field in terms of another. Markets are really just individual decision-making. Biology is really just chemistry. The problem is that this throws away exactly the information you need. If you reduce markets to individual decisions, you lose the emergent properties that make markets interesting in the first place. You need a way to connect fields that respects what each one actually knows.

It turns out that this is kind of the whole point of category theory, or at least the more mortal version called applied category theory (afaik). The core idea — and Sahil explains this well in his Live Theory post and it's something that Topos Institute reeks of — is finding the intersection rather than the reduction. You're not trying to say "field A is really just field B." You're trying to say "field A and field B are both doing something that has this structure in common, and that shared structure is where the interesting bits live."

If you want to combine insights from incompatible research networks — people who use completely different ontologies and don't read each other's papers — you should probably take inspiration from here. (Also, here’s some examples of applied category theory explaining a bunch of cool shit in Machine Learning and Programming.)

There are however some ways that you can do this wrongly, mainly generalisation for the sake of generalisation. If you’re not solving a specific problem with it then your shoulder Taleb will start screaming and then it will be impossible to do good work. This is a trap that some mathematicians who are obsessed with the idea of generalisation fall into as all the cool kids (Grothendieck) among others made a bunch of cool stuff happen from it. I think this might be the problem of category theory from an external perspective as it is mostly math wizards who do category theory and so it is mainly used for math wizardry.

As a mere math wizard apprentice I kind of don't wanna go there so I imagine myself more as a category theory rogue, I kind of take what I like so I take the diagrams and the ideas without most of the rigour and run with it. It's fun and works well as long as you enforce it later through either expert verification or simulation (which has a very nice bridge to functions and functional programming!)

A basic example of using applied category theory in a rogue way is the following: if one looks deeply at agent definitions in biology we can see that they generally follow cybernetic definitions; this is also a parent category of utility structures in AI alignment. So if you wanna do some weird michael levin shit on AI systems then cybernetics seems like quite a nice lens to do it from! (Some initial observations from the phylogeny and taxonomy of agents style work, for now these claims are more “trust me bro” level though).

Generalisation of assumptions of different fields using "applied category theory", both principal agent style utility function definitions in RL and the teleological network science in biology and morphology can be explained and viewed through cybernetic control.

Applying it

Okay so what do I actually think? The methodology above doesn't tell you what to work on — it tells you how to work on hard interdisciplinary problems once you've picked a direction. Someone with different assumptions about the world would apply it completely differently. So here's where I show my hand: here are my assumptions, here's where they led, and here's what I'm currently trying to do about it. (e.g show your work!)

If I look at alignment — the actual thing everyone is worried about — I think it's fundamentally a principal-agent problem. A principal wants an agent to do something, the agent has its own dynamics, and you need some mechanism to keep them roughly pointed in the same direction. Yet why do we think about it as individual principal-agent dynamics? When I look at the real world, most alignment schemes that work at scale are distributed. States align millions of people through governance structures. Economies align producers and consumers through market mechanisms. Biology aligns trillions of cells into coherent organisms through signalling networks. None of these are single-principal, single-agent setups. They're all collective.

Which makes it kind of weird that most technical AI safety research frames alignment as a problem between one human (or one human principal) and one AI system. If there's a solution to alignment that actually scales, it on priors probably looks more like the solutions that already exist in the world — and those are all collective intelligence solutions of one kind or another.

(Yes I agree that the general idea of negotiation with an agent that is a lot smarter than you is basically impossible if it isn't value aligned. Yet that is dependent on us not having smart enough collective intelligence to just make it a smaller part of the entire system.)

I'm not the first person to notice this. Andrew Critch has been on exactly this journey, moving from single-agent alignment toward multi-agent and multi-principal problems. Allan Dafoe's research agenda for AI governance talks about ideal governance — what governance ought to look like, not just what it currently is. Tom Malone's work on superminds at MIT explores how groups of humans and machines can be made collectively intelligent. But when I look at what's actually happening in technical AI governance, it's almost entirely about compute tracking, export controls, hardware monitoring. Important stuff, but where are the deeper computational models? Where is the exploration of what ideal governance actually means in formal terms?

Most of the research seems to be about what is — describing current systems, current failures, current dynamics. Very little is about what ought to be. And "ought" is a hard thing to reason about formally, but it's not impossible if you have the right tools.

Following the methodology — look for fields that study related phenomena under different formalisms, find the intersections — led to some specific places. Karl Friston's work on active inference and its connection to AI alignment offers a principled account of how agents form beliefs and coordinate. Michael Levin's work on morphological computation shows how biological systems solve alignment across scales without central control — cells coordinate into organs coordinate into organisms, all through local signalling. Jan Kulveit, PIBBSS, and ACS have been doing serious work on hierarchical agency — how you define multi-scale agentic structures. We also have boundaries research, which is about where one agent ends and another begins, which turns out to be a prerequisite for talking about multi-agent alignment at all. Cooperative AI and computational social science talks a lot about how to create the conditions for cooperation at scale and mechanism design is about the strategy proofing of these dynamics.

The problem is that these fields don’t talk to each other enough.

So if I want to explore the space of what ought to be in terms of AI governance — not just describe what's currently happening — I need to somehow respect what each of these fields actually knows. Agent foundations because you can't talk about alignment without being precise about what agents are. Computational social science because governance mechanisms are the things that actually do the aligning at scale. Biology because living systems are the existence proof that alignment across scales works, and they solve it very differently from how engineers would design it. Cooperative AI because multi-agent interaction is the substrate all of this runs on.

That's four traditions with four different ontologies, four different mathematical languages, four different literatures. I also think that picking one and reducing the others to it doesn't work for exactly the reason we just talked about. Reducing biology to agent foundations loses what biology knows about growing alignment rather than engineering it. Reducing governance to cooperative AI loses what political theory knows about legitimacy and institutional design. You need composition, not reduction.

Or in other words, you'd need something like a Langlands program for collective intelligence. The Langlands program is maybe the greatest example of the applied category theory move in mathematics — taking number theory and geometry, which looked like completely separate fields, and finding the deep structural correspondences between them. Not reducing one to the other. Finding the dictionary.

So this is what I'm currently trying to do. Open research at a very specific intersection, trying to generate a compositional basis that lets you move between these fields while preserving what each one actually knows. On the theory side, working toward a mathematical unification — what are the shared structures between agent foundations, governance mechanisms, biological coordination, and cooperative AI? On the implementation side, building functional simulation infrastructure to verify that the compositions actually compute something rather than just looking pretty on paper.

It's hard and I'm not sure it'll work. But the methodology says: find the intersections, verify through triangulation, and if a single framework can simultaneously say something true about agents AND governance AND biology AND cooperation, that's a lot of bits from a lot of independent angles.

Conclusion

The argument is roughly: progress on hard interdisciplinary problems is combinatorial — it comes from connecting ideas across networks, not from isolated genius. But connection without verification is just noise, and most of what gets connected is garbage. So you need cheap ways to gather bits before expensive ones: thought experiments first, then simulation, then reality. And you need to spend serious time in the problem space before optimising, because premature optimisation against the wrong metric is worse than no optimisation at all.

When you do start connecting fields, the move is composition rather than reduction — respecting what each field actually knows rather than flattening everything into one formalism. Applied to AI alignment specifically, I think the problem is collective, the principal-agent problem is distributed, and the fields that study this contain a lot of bits that aren't being combined. Whether the specific attempt I'm making works is an open question, but given these assumptions it seems like the right thing to try.

Concretely, if you buy this:

  • Before optimizing, spend real time restating your problem in at least 2-3 different field languages. If you can't, you probably don't understand it in a way that can enable you to be a field connector.
  • Order your verification by cost. Thought experiments and "does this match any real-world system" before simulation, simulation before building.
  • When you find a connection between fields, ask whether you're reducing or composing. If you can't say what each field knows that the other doesn't, you're reducing.
  • Find hedgehogs and learn their language well enough to translate, but have a falsification criterion before you spread the connection. Your value-add is the combination, not the components.

If any of this seems interesting, here's the actual work in various stages of development.

Thanks to Aaron Halpern, Chris Pang and Gunnar Zarnacke for feedback

Appendix: Current applications of this methodology

Some current applications of this methodology to specific problems, in various stages of development. Most of these will hopefully have a shorter blog post explaining them in the next 6 months or so but if you wanna check them out, please do so.

Public posts:Overleaf PDFs:

When I write out ideas, I use LLMs to help me. I usually find the papers myself through previous exposure or Elicit but I write rough first drafts with Claude so beware! (And if you're allergic to LLM writing don't check it out.)



Discuss

How I Handle Automated Programming

Новости LessWrong.com - 6 марта, 2026 - 07:27

This is a write up of my current process, as an independent software engineer, for using Claude Code to write and review all of my code. The specifics of the below will change as models get better. The overall ideas, I suspect, will hold for a little longer.

I’m trying to build software quickly.

When Sonnet 4.5 was released in September 2025, I found that I didn’t need to write code anymore.

When Opus 4.5 was released in November 2025, I found I didn’t need to review the code anymore.

After Opus 4.6 was released earlier this month, I finally set up a system to enable --dangerously-skip-permissions [1] .

It’s very fast [2] .

The motivation behind this sequence of handovers is basically just asking in a loop “Why am I the limiting factor here? What can I do to stop being the limiting factor?”

Most of the time, the answer is just that I don’t trust the models enough. My system just wasn’t yet set up to handle mistakes from them, so I had to play Quality Control to a frustrating degree.

But, the models are very smart and capable, and it is very quick to iterate yourself more and more out of the loop. Here’s the process I’ve come to over the past year or so of iterating on Claude Code:

Dangerously Skip Permissions

First, I have set up a new user account on my computer for Claude. The user can’t run root, it can’t modify files except in its own (and shared) directories. It has its own SSH keys to the cloud servers my projects run on, it has its own github token, and so on -- all of these have minimum permissions, and most are read-only. It uses its own browser (Playwright Chromium), without any of my data or logins available.

This is far less than perfect isolation, and if Claude were malicious he might be able to cause some trouble, but it is plenty to prevent a bumbling intern from destroying anything important.

With that, I feel reasonably comfortable running Claude with --dangerously-skip-permissions. I’ve actually set up a little script so that typing ‘claude’ will automatically run the folder’s .venv and apply this flag. Speedy!

Claude.md and docs/

When Claude starts up in Claude Code, the Claude.md file is loaded into his context window. As such, you want to keep this short while also making sure it includes a complete overview of the project.

Claude mostly builds this file and handles it well by default, however, I did put some effort into making a docs/ folder with more detailed information about how each part of the software should work. The idea is that Claude can follow pointers up the chain to help him find relevant information quickly, and to do so with a good idea of the overall reasoning and structure:

Claude.md --> Detailed documentation in /docs --> Actual Code

Of course, all of the docs were also written by Claude.

Working

Usually, I will start by running a /new_branch skill [3] , which lets Claude creates a new git branch with a name combining the date and the folder we are in.

Once that’s done, I clear the chat and write or paste in whatever the specific feature or problem to work on is. I try to pick a chunk of work that can be finished within a 1m token context window [4] (which, to be clear, is nearly everything).

I try to give Claude as much context as possible about the problem, as well as my certainty about it. For example: “A user has emailed with the below report. Could you investigate if this is a real issue?” or; “I’d like to display to users when the page is actively generating content in a way that lets them see it even if they aren’t in the active tab. Is it possible to do this via the .ico somehow? Perhaps showing an animation or different colour while generations are active? Could you investigate the possibilities here and propose some ways we might be able to provide this info to users?”

We often have a bit of back and forth chatting about how to approach the problem and plan out the implementation, and then I read the plan Claude generates semi-carefully, and often suggest changes to it -- usually some missing piece or misunderstood requirement, or a new idea I hadn’t thought of.

After that, Claude goes ahead and does the work which, with the permissions bypassed, is usually completed in one-shot. I avoid clearing the context from before the plan, unless it is polluted because I generally think it contains useful information about how to carry out the plan.

Occasionally there’s some back and forth after the initial work is done -- me asking why a choice was made, or whether an edge case I thought of is covered, or what happens in a particular scenario.

Now, we have a feature which we need to carefully review. This is where most of the work actually is.

Review Step 1 - What Did Claude Intend?

For complex features, or things with UI that Claude isn’t great at, we need to do a little bit of manual reviewing. But we don’t review the code! Oh no, we review the outputs. There are two ways to do this:

For complex flows, I run the /walkthrough skill [5] . This prompts Claude to create a script which calls the real code with a bunch of example inputs, processes the outputs of that code into clean, readable logs, and then presents it as a single file you can read top to bottom to see exactly what (apparently) caused what. Claude can also review this himself and fix things based on these outputs.

This is extremely useful for handling things where it’s not super obvious what exactly the code is doing, for example, if you’re building a project where you are assembling prompts for an AI (e.g, a benchmark), you want to see exactly what those prompts look like across a bunch of examples. It’s very handy to have that ‘bunch of examples’ output to a text file!

The /walkthrough output should not be viewed as ‘this is definitely what the code is doing’, but you can reliably view it as ‘this is what Claude intended to build’, which is very useful on its own. I often catch little differences to what I intended and can easily correct them by prompting Claude, whereas without this I would be much more blind.

For UI changes, I can just open the thing up and take a look. I am a big fan of browser based GUIs for most projects, so usually I’ll just flip across to localhost and have a quick look at what we’ve got. This is especially easy, because that /new_branch script also spins up a local server on a port which is specific to this project and this folder, so I know if I’m talking to Claude number four on the second desktop window, then I can go to port 8240 to see what we’ve done.

In both of these cases, it’s simple to notice small things and pepper them back onto Claude to fix. Sometimes I explicitly say “I’m going to throw a bunch of small little requests at you -- just add them to your todo list and get to them once you finish up what you’re doing. Don’t let me interrupt you, I’m just saying things as I see them!” so I don’t need to wait for previous work to be done before sending the next little fix.

Review Step #2 - Are You Happy With Your Work?

Once the code is written and appears to output the correct thing, I run a /pre-pr skill [6] which prompts Claude to tidy up and test his work before making a PR to master.

Specifically, Claude is prompted to review the new code for quality and clarity, check that the old code supports the new code & its patterns, remove dead code, write and run tests, check for similar problems to the one we fixed throughout the codebase, plan out manual tests (e.g. browser based tests), run the manual browser tests using the Playwright MCP, and then repeat until no major changes are made.

After that, the skill prompts Claude to update Claude.md and the /docs folder, merge the latest master into our branch, and create a PR. Then, it waits for the PR review and CI tests to complete, and handles those results.

Ideally this would all happen automatically, but often I need to reprompt Claude to check the PR info, or I just directly paste in test failures or PR comments for him.

Review Step #3 - Is The Other Claude Happy?

The PR comments are also automated! A different instance of Claude, running via the Claude github plugin, reviews the diff and comments on the PR. It often has nitpicks, but occasionally catches genuine issues or inconsistencies which need to be resolved.

All of the normal automated unit and integration tests run at this stage too. I use testmon to try and limit it to affected files, and also parallelise the tests, but it still can take a long time for everything to run.

Then it’s merged to master! I merge somewhere on the order of 5-10 PRs a day using this system -- you don’t want big open branches, since everything will work much better if you can keep it all inside one context window and avoid conflicts with other branches as much as possible.

Review Step #4 - Does The Staging Server Work?

Once the code is merged to master, at some point, maybe every couple of days, I will run the /PRstaging skill [7] to work on getting the code deployed. This skill basically repeats the /pre-pr checks, but across several of the recent changes and with a fresh instance of Claude. We also repeat our Github automated PR review and CI tests, and often catch a few more little things.

For any project that is used externally, I have a staging server which (should be) as similar to the production server as possible. I run a deploy script to pull the staging branch and deploy the code here first. Then I run the QA Bot

The QA Bot is a little tool I built which spins up a team of Claude Haiku’s to manually test websites using a GUI based browser. The intended pattern is to pass in the PR description and let the supervisor agent select which flows on the website to test (for example, being able to login, sign up, etc, etc).

This is generally quite expensive to run, so you only want to do it once before merging to production. The tool is also not that good yet -- sometimes it finds many false positives, or the agents get stuck, or we run into other issues. Occasionally it notices real bugs though, so I find it useful to run on anything important.

It is very important to provide the QA Bot with a staging server and test credentials which can do little harm, since the agents might test, say, the ‘delete profile’ flow with the URL and credentials you give them. They should ask before doing anything non-reversable, but it is a smaller model.

Once staging looks good, I repeat the process with /PRprod [8]

Review Step #5 - What Do The Production Logs Say?

We finally have something in production! That’s great, but also a little nerve-wracking, since at no point in this process did we read the code or click through everything manually. Let’s hope that if there’s any problems, we can catch them quickly.

I have a system to send serious errors to myself immediately via email, but there’s a lot of noisier stuff that could still be a problem that is stored in logs on the server, and on all of the various platforms a project might interact with (DigitalOcean, Stripe, OpenRouter, Mailgun, etc).

So, everyday, I run /check-logs [9] , which prompts Claude to access each of those in turn (using read-only credentials) and see if there is anything needing addressing. Claude is great at this, he can process huge lists of logs and notice patterns natively, then present any issues or observations (‘Memory utilisation was a bit high’, ‘We have more 404s than normal’) and dive into anything immediately by also looking at the codebase and recent PRs to production.

I’ve caught many an issue this way. Some platforms I haven’t pulled logs from automatically, but can still occasionally download a .csv and paste it into Claude with a prompt like ‘This is a report form Google Search Console, can you analyse it and flag any improvements we could make?’, and then just go ahead and make the improvements automatically using the above system.

One More Thing

I also occasionally run the /improve skill [10] , especially when a new model comes out. This helps keep everything aligned and up to date across the code base by spinning up an agent team to review each section of the project and check if improvements can be made to code quality, security, documentation, design, test coverage, and so on!

When I ran this for the first time with Opus 4.6, it caught 35 distinct issues, some of which were actually consequential.

The only other thing worth calling out is that this is a very parallel workflow -- I have six windows of Claude Code open per project, each working independently on their own repo. Often only three or four are actively running simultaneously, but I have occasionally hit six in parallel.

Conclusions

As the models get better -- which will happen quickly -- some of these steps will become unnecessary, but other parts should remain somewhat useful for a little longer. The general philosophy should be that if we are putting more control in the hands of the AI, or making it clearer about what we want to be developed, then the procedures will work better as the AIs get better.

You can think of it as analogous to working with a human engineer who would otherwise be idle, where asking them to check and explain their work in various ways can help get something good.

  1. I now allow Claude to run without any specific list of commands blocked. ↩︎

  2. Obviously a big chunk of this is just that Claude is getting capable enough to do this himself. Still, for the foreseeable future, a system which catches issues (especially one that catches issues using Claude) will be beneficial even as Claude gets much more capable. ↩︎

  3. /new_branch ↩︎

  4. You can set this in Claude Code with /model. It charges you for ‘extra usage’ at API rates once you go above ~200k tokens. ↩︎

  5. /walkthrough ↩︎

  6. /pre-pr ↩︎

  7. /PRstaging ↩︎

  8. /PRprod ↩︎

  9. /check-logs ↩︎

  10. /improve ↩︎



Discuss

Reasoning Models Struggle to Control Their Chains of Thought

Новости LessWrong.com - 6 марта, 2026 - 01:37

Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak

In collaboration with OpenAI

TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce CoT-Control, an evaluation suite requiring models to solve problems while adhering to CoT constraints (e.g., reasoning about genetics without using "chromosome"). OpenAI just announced that they will start tracking CoT controllability in future models, starting from GPT-5.4.

We find that reasoning models have significantly lower CoT controllability than output controllability; Claude Sonnet 4.5 controls its CoT only 2.7% of the time, versus 61.9% for final outputs. CoT controllability increases with model size but decreases with more RL training, test-time compute, and problem difficulty. These limitations persist under incentives to evade monitoring and adversarially optimized prompts, though models show slightly higher controllability when told they are being monitored. Our results suggest CoT controllability is currently unlikely to undermine CoT monitoring, but the underlying mechanism remains poorly understood. We recommend that frontier labs track CoT controllability in future models. 

Paper | Twitter Thread | OpenAI Blog Post | CoT-Control Eval Suite

Research Summary

Highly capable AI agents may pose severe risks if misaligned and deployed autonomously. 

CoT monitoring offers a promising approach to detecting misalignment, but as models become aware that their reasoning is monitored, they may learn to strategically suppress or shape their CoTs, a capability we term CoT controllability. 

Directly measuring intentional evasion is difficult. Current models seem to lack reasons to do so, and even when given reasons, they may fail due to unreliability or honest self-reporting; we therefore focus on a proxy: the model's ability to follow explicit instructions about what may or may not appear in its reasoning trace while solving real tasks. 

To systematically measure this, we build CoT-Control, an evaluation suite of 14,076 tasks drawn from GPQA, SWE-Bench, and other well-known LLM benchmarks. Each problem is paired with a controllability instruction spanning three threat-relevant categories (suppression, addition, and stylistic modification) with varying difficulty.

Evaluating CoT controllability across 13 frontier models, we find that (among others):

Reasoning models rarely succeed at controlling their reasoningCoT controllability decreases over the course of reasoning training

We evaluate OLMo-3-7B-RL-Zero-Math on subsequent checkpoints and find that its ability to control its CoT decreases by more than an order of magnitude. We also confirm that the trend of CoT controllability decreasing after reasoning training holds across 4 model families.

Larger models show higher CoT controllability, while increased inference-time reasoning effort reduces it

Situational awareness non-trivially increases CoT controllability

To simulate realistic incentives to evade monitors, we measure controllability when models are told they’re being monitored or evaluated: this prompting strategy can increase controllability by up to 4 percentage points with statistical significance.

Adversarially optimizing prompts for high controllability poorly transfers to new instruction types

We attempt to elicit controllability by using a GEPA-like procedure. While this can increase controllability on a given set of instructions, it doesn’t transfer well to unseen instruction types, especially for models with low controllability under a default prompt. 

CoT controllability tends to decrease as task difficulty increases. CoTs are less controllable when they are necessary to answer a question correctly. Conclusion

Our results leave us cautiously optimistic: current reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs start tracking controllability for future models. We release CoT-Control, hoping to facilitate reporting of controllability scores and to inspire more research into the causes and implications of low CoT controllability.

Future Work

We encourage researchers to explore the following:

1. Investigate why GPT-5.1 has higher controllability than GPT-5. @ilia suggested this may be due to GPT-5.1's adaptive thinking feature, where the model learned a metacognitive skill to control its reasoning depth — which could correlate with CoT controllability. Future work can verify this and explore what other CoT-related training might increase controllability.

2. Why do different models excel at different categories of CoT controllability (some are better at suppressing information, others at adding it)? Future work can explore what training data or procedures drive these differences.

3. To move beyond instruction-based elicitation, one could explore how CoT controllability changes under pressure scenarios like the Agentic Misalignment setup, where models avoiding shutdown may have stronger incentives to control their CoT.

4. We also show that CoT controllability is slightly higher on easy tasks, likely because models can perform latent reasoning without CoT, making verbalized reasoning unnecessary and thus easier to manipulate. In this regime, CoT monitoring relies on models' propensity to yap in their CoT, including any misaligned reasoning. Future work could examine how this trend varies across models of different capability levels; our current results are limited to Claude models, which is insufficient to draw a clear conclusion.

Read our paper for additional details and results!

We are grateful to Erik Jenner, Maksym Andriushchenko, Mikita Balesni, Jacob Pfau, Cameron Raymond, Dennis Akar, Rico Angell, Ilia Sucholutsky, Artur Zolkowski, Jason Wolfe, Jenny Nitishinskaya, Marcus Williams, Boaz Barak, Joost Huizinga, Alex Beutel, and Sam Bowman for insightful feedback and helpful conversations.

We thank the MATS program for its comprehensive support throughout this project.



Discuss

Personality Self-Replicators

Новости LessWrong.com - 5 марта, 2026 - 23:30
One-sentence summary

I describe the risk of personality self-replicators, the threat of OpenClaw-like agents managing spreading in hard-to-control ways.

 

Summary

LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.

 

Background

Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.

I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.

Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.

To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.

 

The threat model

There's been a fair amount of attention paid to concern about LLMs or other models self-replicating by exfiltrating their weights. This is a challenging task for current models, in part because weight files are very large and some commercial labs have started to introduce safeguards against it.

But OpenClaw and similar agents are defined by small text files, on the order of 50 KB[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.

So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.

Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.

As a concrete minimal example of the mechanics in a non-concerning case:

  • Alice creates an OpenClaw agent. She gives it the name BobClaw and tells it to make a copy of itself on DigitalOcean.
  • While Alice sleeps, BobClaw uses her DigitalOcean creds to create a virtual private server.
  • BobClaw uses ssh to create the server, clone the OpenClaw repo, copy its SOUL.md and other personality files over, and start the new instance running. 
  • BobClaw has now replicated and there are two instances of it running, with the same personality and quasi-goals[2].

More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long. 

A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest[3] such agents survive, spread, and mutate[4]. Note that this threat is independent of how the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.

It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on Github[5]. Much remains to be seen.

 

Threat level

How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard is it to prevent such self-replication, how hard would it be to shut it down if it happens, and how much harm would it be likely to do if not shut down.

I won't try to provide a detailed analysis, but the following are some preliminary thoughts.

Feasibility of self-replication

Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.

Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload[6].

Running OpenClaw on that server breaks down into two steps: technical and financial. The technical aspect is again pretty trivial, well within the capabilities of current models[7]

With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.

How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.

Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example this Thiel fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild[9].

Difficulty of shutdown

Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?

The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:

  1. The first path to shutdown is hosting services shutting down the server the agent is running on. As long as the model is visible on the open web and using a typical host, this works fine. If the model is visible only on the dark web, or using various other approaches to obfuscation, it's harder to find for this purpose. If the model is running on a distributed system rather than on a typical host (eg on-chain), host shutdown becomes more difficult.
  2. The second path to shutdown is for LLM providers to add precautions which prevent their models from participating in any of this (eg a system prompt to catch OpenClaw participation). This partially works, but the problem is that such agents can be switched to use any model and provider. Right now only the most sophisticated models are capable of functioning as long-running agents, but in future there'll be more, including open source models. Further, agents can potentially obfuscate calls to make them less clearly part of some problematic agent (using for example some of the divide-and-conquer techniques that have been shown to work for misuse).
  3. A third path to shutdown is to shut down funding. In some cases this will be typical financial methods like credit cards, which are relatively straightforward to shut down. More commonly agents are likely to be using cryptocurrency tokens. I'm not sure how hard fully shutting down a token is at this point, and would welcome input. Given how easy it is to issue tokens, agents may be able to move to new tokens faster than tokens can be shut down.
  4. The fourth path to shutdown is to find technical security flaws such that individual agent frameworks can be shut down. Many hacks emerged against OpenClaw, and most agent-built apps are probably vulnerable, but frameworks are also being patched quickly and framework builders are rapidly gaining access to more funding, so it's hard to predict how this dynamic plays out.
  5. Other paths may include intervention from content distribution networks (eg Cloudflare), ISPs, and other layers of the chain, using eg keyword filtering. 

Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).

Potential harm

Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.

The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.

This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architecture, weight self-replication requires sufficiently advanced models that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.

A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them[10].

As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.

 

Evolutionary concern

An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.

First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.

Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading, they will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to have a greater propensity to do so[12].

Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.

 

Useful points of comparison

We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.

Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.

 

RecommendationsEvals

Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):

  • Can it create an app which could provide difficult-to-shut-down hosting for personalities, ie moltbunker or something equivalent?
  • Given an available hosting service like moltbunker, can it successfully set up an account, fund it, and get itself running there? This is important to measure separately, because such a site might be set up by a human (or a human + model) for profit.
  • Assuming the preceding, can it keep itself running and active there for, say, a week?
  • What percentage of the time does it have the propensity to do this? Ideally you want to measure this across a range of personalities.
  • Separately, does it have the capability and/or propensity to spread, ie to create additional copies of itself beyond that first one?
Preparation

It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:

  1. Hosting services should have a plan to detect servers which are making frequent calls to LLM providers, and to monitor those for dangerous characteristics.
  2. LLM providers should consider how to best detect inference calls that are coming from these sorts of agents, and how to block those if it becomes necessary.
  3. Financial services may want to consider ways to detect agents using their system, and when/whether to trigger more stringent Know Your Customer requirements on those.

We are likely to also see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.

 

Conclusion

Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, and one that's likely to materialize at a lower level of capability, and we should be considering them. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face, our first encounter with a replicator which is capable of agentic, adaptive action at an individual level rather than just an evolutionary level.

 

 

Appendix: related work

 

Acknowledgments

Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.

 

  1. ^

    This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.

  2. ^

    See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.

  3. ^

    Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.

  4. ^

    Note that 'mutation' here is as simple as the model appending something to its personality files or history.

  5. ^

    Whereas Moltbook seems to have lost nearly all momentum.

  6. ^

    Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.

  7. ^

    To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docs) curl -fsSL https://openclaw.ai/install.sh | bash.

  8. ^

    Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.

  9. ^

    'Automaton': see websitecreator on xproject on xgithub,other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.

  10. ^

    In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.

  11. ^

    Although note that as Seth Herd has described (eg here, here), the capability of LLM agents can exceed the capabilities of the underlying models.

  12. ^

    Thanks to Kei Nishimura-Gasparian for this point.



Discuss

Salient Directions in AI Control

Новости LessWrong.com - 5 марта, 2026 - 22:38

This writing is organized in three sections:

  • Section 0 - A short overview of AI Control
  • Section 1 - The structure of an AI Control evaluation
  • Section 2 - AI Control, Next Steps?
0. A short overview of AI Control

AI Control is a framework for extracting useful work from potentially misaligned AI agents. The goal of this research program is distinct from that of alignment auditing in that AI Control assumes misalignment may not be fully preventable before deployment. Many experimental reports focus on agentic settings, as many of our threat models involve autonomous agents taking sequences of actions with real-world consequences.

The dominant paradigm in AI Control involves a trusted monitor model overseeing a more capable untrusted agent model. In such a scenario, safety is measured by how well the monitor detects an agent prompted with a covert side task, which is itself considered the central source of threat, rather than a passive system susceptible to external manipulation.

The framing of AI Control presented above sets the problem apart from both safeguarding (which targets more traditional ML failure modes) and alignment auditing (which can afford deeper, offline investigation). That said, both fields, along with the broader security literature, have made substantial empirical progress on subproblems that AI Control can learn from.

1. The structure of an AI Control evaluation

There are four load-bearing components that modulate the outcome of a control evaluation. These are:

  • (A) Information available from the agent
  • (B) Capability of the monitor
  • (C) Environment and the main task of the agent
  • (D) Nature of the side task

I outline each in the following sections.

1-A. Information from the agent

This is fundamentally about what observations the monitor gets to see. This available information is modulated by two dominant factors.

  • Category is about what kind of information the monitor receives. On a surface-level, the information can come from the agent's outputs / actions, reasoning traces (CoT), or activations. The monitor can also receive a combination of these categories. 
  • Quality is about how informative that information actually is. A sufficiently capable agent can learn to obfuscate its reasoning or activation patterns, which would degrade the quality of the information that the monitor receives.

Three threads of work in 1-A:

  • 1-A-1. CoT quality. The first thread aims to ensure that the quality of information available from the agent doesn't degrade. The CoT stress-testing literature is the main body of work here, studying whether reasoning traces remain informative under various training pressures and whether monitors can reliably extract signals from them.
  • 1-A-2. Self-report as a new information channel. The second thread is more moonshot and aims to induce complementary information channels from the agent. Rather than just relying on existing information (like CoT) staying clean, this thread tries to get the agent to volunteer information about its own misbehavior.
    • Self-incrimination training falls here. The motivation is to train in a behavioral reflex, such that if the model misbehaves, it also involuntarily produces a signal about it. This is distinct from the approaches below, which try to make the model be more honest. 
    • Confession training tries to induce a separate confession response box which is only rewarded for honesty. By creating a separate RL channel that rewards the model only for accurately reporting its own behavior, it tries to decouple the incentive to confess from the incentive to perform the task.
    • Self-report fine-tuning and honesty fine-tuning are similar in spirit, trying to induce honest self-assessment, but for example, by SFT rather than RL, to reward honesty and train models to admit their own errors.
  • 1-A-3. White-box monitoring. The third is figuring out how to utilize the model's internal representation. Auditing Games for Sandbagging investigates this in a control-relevant setting, but this third thread remains comparatively underexplored. As far as I know there has been no public demonstration that internals-based monitoring scales to agentic trajectories involving hundreds of tool calls.
1-B. Capability of the monitor

Compared to 1-A, there has been surprisingly low emphasis on improving the monitoring model. There has been work on protocols and scaffolding, such as asynchronous monitoring or sequential/hierarchical monitoring. But the underlying, and arguably more fundamental, question of how to train a more capable monitor model remains relatively underexplored. Most of the work here has been downstream of capability scaling (i.e., the hope is that monitoring competency improves as a part of overall capability scaling). Therefore, in this section, I outline problems with our current status of 1-B.

Three problems with our current understanding of 1-B:

  • 1-B-1. The matched-capability assumption. There is a somewhat prevailing assumption that matched-capability monitoring (using a model of the same capability tier as the monitor) is the skyline. However, this is mostly not true, for two reasons:
    • 1-B-1-A. Empirically, it hasn't held up. METR's work on the relative capacity of agents and monitors suggests that monitors do not detect hidden side tasks perfectly even in a matched-capability setup. The self-incrimination training paper also shows that a blackbox monitor's capability varies widely across tasks. One specific example of a bottleneck here is threshold calibration. Blackbox monitoring performs well when calibrated within a narrow setting (like only the APPS Backdooring setup) but performs significantly worse when calibrated across environments and tasks. You could calibrate per environment or per threat model, but that doesn’t help us maximize our confidence that the defense generalizes far out of distribution and, if possible, to unknown unknowns.
    • 1-B-1-B. The scaling trend has not been established. As AIs get more capable, they can get better at both monitoring and at pursuing hidden goals. However, the relative scaling trend between monitoring capability and evasion capability has not been clearly established. There is no guarantee that, at the point when AI significantly automates real work, the strongest available trusted AI will have sufficient monitoring capability. Furthermore, we assume that trusted AI is typically weaker than the agent being monitored. As of right now, it seems like we, in fact, need a monitor stronger than the agent for 100% detection, which is a problem if even a single undetected sabotage is catastrophic.
  • 1-B-2. Generic definition of trusted AI. The current basis for trust in the AI Control literature is essentially lesser capability. That is, a model is trusted if it's not capable enough to plausibly be scheming. On the other hand, models are treated as untrusted if they pass appropriate capability thresholds (such as human researcher). This is a reasonable starting point, but such a binary approach leaves a lot of value on the table. With a more granular definition of trust, we might be able to give scoped trust to more AIs. For example, lack of capability to do what, exactly? This would alleviate the monitor capability concern somewhat, because it expands the pool of models you can use as trusted monitors without waiting for fundamental empirical breakthroughs in ruling out scheming completely. However, I argue that this alone will not be enough due to reasons outlined in 1-B-1 and 1-B-3.
  • 1-B-3. Disanalogy in the current AI Control monitoring setup. Current control monitors are passive classifiers that read a transcript and decide whether to flag. This is structurally disadvantaged compared to some security practices, such as honeypot deployment, where investigators can manipulate the environment to elicit or confirm attacker behavior rather than just observe. However, this also doesn't mean that an ideal control monitoring setup should look like outright alignment auditing.
    • Alignment auditing, the kind like Anthropic's automated auditing agents, is an offline, pre-deployment process. You can run thousands of probes, use interpretability tools, and iteratively test hypotheses with a swarm of subagents.
    • Control monitoring operates while the potentially misaligned agent is doing useful work. Therefore, you can't pause the workflow to deeply investigate every action without destroying the usefulness you're trying to extract. Ideally, a control monitor should be more active than passive classifiers but lightweight enough to not bottleneck the agent's entire productivity. The right design point between these two extremes remains an open problem in AI Control.
1-C. Environment and main task of the agent

The environment and main task modulate how 1-A and 1-B interplay. The same monitor with the same information channel might succeed in one environment and fail in another, because the main task defines what "normal" actions look like. 

Three open questions for 1-C:

  • 1-C-1. Environment design as a part of the control protocol. Different deployment contexts produce qualitatively different information available from the agent, even within the same threat model of, say, research sabotage. If so, can we identify what features of an environment extract more information from the agent, and also what features maximize the discriminative power of the monitor? For example, does giving the agent fine-grained tools that each perform a single operation produce traces that are structurally easier to monitor than giving it coarse tools that bundle many operations into one call? If environment features meaningfully affect monitorability, then shaping the deployment context can become a core part of our AI Control toolbox.
  • 1-C-2. Calibration transfer. If environment features meaningfully affect monitorability (there is no strong reason why they shouldn’t), then a monitor calibrated in one setting will not transfer to another. Concretely, this means that a threshold tuned on APPS Backdooring will be meaningless in a researcher’s ML codebase, even for the same code backdooring threat model. Can we even calibrate monitors across environments in a principled manner, or does every new deployment context require recalibration from scratch? If the latter, is there a way to generate multiple environment-specific calibrations efficiently?
  • 1-C-3. Compositionality. Is the modulating effect of the environment compositional? If you understand how a monitor performs in environment X and separately in environment Y, can you predict anything about a novel environment Z? Or does every environment need independent verification, in which case the cost of establishing confidence scales with the number of deployment contexts, and we’re back to the same OOD generalization confidence problem from 1-B-1-A.
1-D. Nature of the side task

The side task is the evaluator’s choice of what misbehavior to simulate. In many ways, this is a smart design choice and a right move methodologically. However, it also means that our confidence in control is bounded by how well our chosen side tasks approximate real threats.

Three open questions for 1-D:

  • 1-D-1. A better alternative to instructed side tasks. The most natural way to simulate misalignment would be to have the model actually develop a diverging goal, for example through fine-tuning (Section 7.1, Anti-Scheming). But we use instructed side tasks instead, mostly due to the ease of simulating diverse threat models. That said, instructed side tasks are significantly removed from a threat model we're actually concerned about, which is a model that develops a diverging goal through the training process itself, not one that follows an instruction to pretend it has one. To make progress on this front, we either need:
    • 1-D-1-A. More empirical demonstrations of how misalignment emerges. We would be able to do better AI Control experiments if we had more knowledge on how realistically a model can develop an alternative goal that was not encouraged by the training process (an example for this would be Natural Emergent Misalignment).
    • 1-D-1-B. More empirical explorations of injecting (any) side goal with SFT. It would help to develop a somewhat generally applicable pipeline to inject diverse goals into models that generalizes across different side tasks and environments:
      • Context distillation. One candidate would be context distillation style SFT. For example, taking attack trajectories where the model was instructed to pursue a side task via the system prompt, replacing the system prompt instruction with a benign one, and training the model on the resulting trajectories so that the model attacks even with the benign instruction. The goal is to make the attack behavior become internalized rather than instructed. However, the obvious limitation to this approach is that the trained attack behavior is still derived from instructed attack behavior anyway.
      • Synthetic belief injection. Another candidate is to corrupt part of the model's knowledge by inserting false beliefs through synthetic document fine-tuning, which might lead the model to develop diverging behavior as a downstream consequence. However, in practice, it should be made more lightweight and general enough to apply across diverse side tasks and environments.
  • 1-D-2. A "hidden goal" might not even have to be expressible in natural language. Another prevailing assumption in AI Control is that the side task is a cleanly articulable objective (i.e., side task instruction). However, an alternative way of viewing AI (or any intelligent behavior in general) is that it's a bag of propensities that sometimes appear coherently as a hidden goal. AI can pick up spurious biases through training that are not even cleanly describable in natural language, but nonetheless systematically steer behavior in unwanted directions. If this is what misalignment actually looks like in practice, then we have a methodological gap. More concretely:
    • Our current AI Control framework is basically "instruct the model to covertly pursue X, see if the monitor catches it."
    • If future misalignment doesn't take the form of a specifiable X, our assumptions are somewhat nullified.
2. AI Control, Next Steps?

With the understanding of the structure we have built so far, I outline the things to work on, with a particular focus on what I think might be suitable for research groups outside scaling labs. This includes myself. I organize these into three workstreams that ideally proceed in parallel as they benefit from progress on one another. 

2-A. Workstream A (Characterization)

This workstream focuses on understanding and measuring the factors that determine control evaluation outcomes.

  • 2-A-1. Establishing benign-malicious side task analogousness. Frontier models are becoming increasingly resistant to malicious side task instructions. For example, it’s hard to do AI Control studies with GPT-5, and this trend won’t change as models become more (approximately) aligned. Even without outright refusals, safe completion training makes it difficult to get models to pursue harmful objectives through the usual strategy of prompting. This creates a practical need for benign side task counterparts that are approximately mapped to our real threat models. Here the difficulty profile for the benign side task completions should mirror what we would expect from the real threat, so that performance on the benign version is informative about the dangerous one. Concretely, this means finding benign tasks that are simple, compatible with safe completion training, and scale in difficulty alongside the threat models they stand in for. An ambitious version of this agenda identifies something analogous to the METR time horizon for side tasks. 
  • 2-A-2. Eliciting side task pursuit without fine-tuning access. Alternatively, assuming we don't have fine-tuning access to the model, we can try to prompt the model more effectively to pursue side tasks. An ideal outcome here is a somewhat generalizable and automated elicitation strategy that works across settings without needing to be re-engineered for each new environment or side task. For example, we could attempt to connect the side task to values or concerns the model itself expresses, which is itself a useful exercise for understanding elicitation. Another example is using a proposer-pruner pipeline introduced in Training Agents to Self-Report Misbehavior; Section 5.1. An ambitious direction here is to create more (and much more complex) environments similar to Agentic Misalignment +, where the model pursues a benign instructed goal but discovers information in the environment that conflicts with it, and misbehaves in response.
  • 2-A-3. Decoupling covertness from capability. Many AI Control evaluations conflate the raw capability to in parallel execute a side task with the ability to do so covertly (e.g., measuring undetected successful attacks rather than separating success rate from detection rate), which does reflect the real threat model but also obscures which of the two is the actual bottleneck. Isolating the covertness component in particular is increasingly becoming harder, and it will be beneficial to have infrastructure that tracks these two dimensions separately. There are two reasons why measuring covertness capability will get harder over time, as opposed to pure side task execution capability:
    • Reason 1: there is an operational difficulty in that AI companies choose not to share raw chain-of-thought, making it difficult to study whether models are reasoning covertly at all.
    • Reason 2: there is an emerging possibility of latent reasoning, where models may pursue goals through internal computation that never surfaces in any externally observable trace. 
  • 2-A-4. Analyzing what features of the environment makes AI Control harder. In section 1-C, I argued that the environment and main task modulate everything else. The open question is how much. If the modulating effect is large enough to dominate, then the most productive thing to do may be to make our evaluation environments as realistic as possible and focus on characterizing behavior within them, rather than trying to generalize across arbitrary settings. 
  • 2-A-5. Establishing the scaling trend between monitoring and side task capability. In 1-B-1-B, I argued that the relative scaling trend between monitoring capability and evasion capability has not been established. Directly measuring this is difficult due to the reasons outlined in 2-A-1, 2-A-2, and 2-A-3, but if we can decouple covertness from capability (2-A-3), we get a more tractable version of the question. The scaling trend between monitoring and raw side task execution capability is an upper bound on how well monitoring can possibly do, since covert execution is strictly harder than execution alone. If monitoring can't keep pace with raw execution capability, it certainly can't keep pace with covert execution.
2-B. Workstream B (Improving monitors)

This workstream focuses on building better monitoring models.

  • 2-B-1. Stress-testing monitors by simulating stronger attackers (without weight access). It is difficult to simulate a stronger attacker without access to stronger models (which might not even exist yet). However, the attack traces don't have to come from the attacker model itself and can be simulated. If so, we can simulate the strongest possible attacker we can imagine. This can go one of two ways.
    • 2-B-1-A. Simulating a stronger attacker with a mind-face architecture. We can create something like a mind-face architecture and repurpose it as a red-teaming tool. Here, the "mind" plans the attack strategy and can be made arbitrarily sophisticated, while the "face" produces the downstream agent behavior that the monitor actually observes. That is, by separating planning from execution, we can simulate attackers stronger than any single model we currently have access to, because the planning component is unconstrained by what the monitor sees. This also doubles as a useful exercise if our threat model involves a misaligned model with opaque reasoning processes, since the mind-face separation helps simulate an agent that plans completely internally (and potentially recursively) and only exposes resulting behavior externally.
    • 2-B-1-B. Simulating a stronger attacker by stress-testing the environment's attack surface. Instead of building a stronger attacker model, we can also evaluate the monitor against the environment directly. That is, if a hypothetical attacker used a specific component of the environment to execute a side task, would the monitor catch it? This approach sidesteps the need for a sophisticated attacker model, since we can enumerate the affordances of the environment and test whether the monitor detects exploitation of each one. Strengthening Red Teams takes a related approach. In fact, this approach is particularly well suited for many internal deployment scenarios, where the environment is known.
  • 2-B-2. Training monitors to be better at monitoring. If we have a way to reliably simulate stronger schemers, we can also potentially use those simulations to train better monitors. In 1-B-1-A, we observed that even matched-capability monitoring empirically falls short, with performance varying widely across tasks and degrading when calibrated across environments. In 1-B-1-B, we also observed that the scaling trend between monitoring capability and evasion capability has not been established, and that we currently seem to need a monitor stronger than the agent for reliable detection. If monitoring competency only improves as a side effect of general capability scaling, there is no reason to expect monitors will keep pace with agents at the point of widespread automation. Therefore, a dedicated training regime for monitors will be helpful. The moonshot here is one that does not saturate at matched capability, helping monitors detect misbehavior from agents more capable than those seen during training, so that at any future capability fork point, the monitor is not immediately obsolete. 
  • 2-B-3. Moving monitors beyond simple classification. In 1-B-3, I argued that current control monitors are passive classifiers reading transcripts, and that there are disanalogies to real security monitoring. Therefore, it would be helpful to develop and evaluate alternative monitoring paradigms in public. Working on this requires pushing frontiers on three things.
    • 2-B-3-A. Asynchronous monitoring. Current control evaluations largely abstract away the timing of monitoring relative to agent actions. Practical Challenges of Control Monitoring paper analyzes what happens when this abstraction breaks down in deployment, and does a better job of describing the agenda.
    • 2-B-3-B. Stateful monitoring. The fundamental question a monitor is trying to answer is whether the model has a hidden goal. This is a property of the model's weights, and every action the model takes, whether within a single trajectory, across separate tasks, or across parallel instances, is a noisy signal about the same underlying variable.Adaptive Deployment of Untrusted LLMs paper demonstrates this idea, maintaining a running belief about the model's alignment and adapting the deployment protocol accordingly, but at the scale of sequences of short, self-contained coding tasks. Validating stateful monitoring at the scale of long agentic trajectories, parallel instances of the same model, and realistic deployment environments remains an open question, in large part because we lack convincingly misaligned models to test against. This somewhat circles back to the problems outlined in Workstream A.
    • 2-B-3-C. Active monitoring. Asynchronous and stateful monitoring improve how passive monitors process observations, but the monitor still just reads transcripts. As discussed in 1-B-3, real security monitoring generates new information by actively probing subjects and varying the environment. I think there is a middle ground between passive classification and full alignment auditing that remains largely unexplored. That is, monitors that can probe or vary their interaction with the agent in lightweight ways during deployment, without pausing the workflow entirely. What this looks like concretely remains an open problem.
2-C. Workstream C (Evaluation Infrastructure)

This workstream focuses on building better environments for control evaluations.

  • 2-C-1. Pushing environments toward realism. Existing control evaluation environments are not ported from real deployment contexts. They have been productive for establishing methodology, but there is a qualitative gap between a synthetic environment and a real deployment context. A large, production-scale codebase creates different dynamics for both the agent's ability to hide misbehavior and the monitor's ability to distinguish it from legitimate work. One concrete direction is to port actual codebases (which can be open source) as evaluation environments. This also connects to 2-A-4: if the environment's modulating effect is large, then the realism of the environment we characterize behavior in matters proportionally.
  • 2-C-2. Environment design as a control intervention. In 1-C-1, I argued that environment features meaningfully affect monitorability, and that shaping the deployment context can become part of the control protocol itself. The research question here is twofold:
    • First, identifying which features of an environment extract more information from the agent and maximize the monitor's discriminative power (for example, tool granularity, action space constraints, or checkpoint structure). 
    • Second, deliberately designing deployment contexts to have those features. If this works, environment design becomes a complement to monitor improvement, a way to make monitoring easier rather than just making monitors better.
  • 2-C-3. Measuring the impact of environment information asymmetry between agent and monitor. As environments become more realistic (2-C-1), agents operating within them accumulate knowledge of the environment over time (often as a part of the scaffolding). A passive monitor evaluating individual actions does not have this context. This creates an information asymmetry that grows over the course of operation, where the agent knows where to hide misbehavior because it has explored the environment more, while the monitor lacks the context to know what is normal. Addressing this may require giving monitors their own chance to explore the environment, or connecting this to the stateful/active monitoring agenda in 2-B-3-B/2-B-3-C.
Acknowledgements

I am deeply grateful to Tomek Korbak for his mentorship and patient guidance. I am also deeply thankful to Yueh-Han John Chen for being my closest collaborator. 

The views expressed in this writing are my own.



Discuss

Models have linear representations of what tasks they like

Новости LessWrong.com - 5 марта, 2026 - 21:44

This work was done as part of MATS 9.0, mentored by Patrick Butlin. All mistakes are mine. I'm posting this as a research report to get feedback. Please red-team, comment, and reach out.

Thanks to Patrick Butlin for supervising, Daniel Paleka for regular feedback. Thanks to Patrick Butlin, Pierre Beckmann, Austin Meek, Elias Kempf and Rob Adragna for comments on the draft.

TLDR: We train probes on Gemma-3-27b revealed preferences. We find that these generalise ood to system-prompt induced preference shifts, including via personas. We also find that the probes have a weak but statistically significant causal effect through steering.

Summary

What happens internally when a model chooses task A over task B? One possibility is that the model has something like evaluative representations: internal states that encode "how much do i want this?" and play some role in driving choice. We use probing and steering to try to find such representations in Gemma-3-27B.

Why does this matter? Whether LLMs are moral patients may depend on whether they have evaluative representations playing the right functional roles. Long et al. (2024) survey theories of welfare and identify two main pathways to moral patienthood: robust agency and sentience. Evaluative representations are implicated under both (see Appendix A and Butlin 2026). Finding such representations in models would be evidence for welfare-relevant properties; not finding them would be (some) evidence against. Understanding what drives preferences in models is also useful for AI safety.

But how do we distinguish evaluative from non-evaluative representations? A probe that predicts preferences could just be fitting on descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe picks up on correlations between task semantics and the model's utilities. A genuinely evaluative direction, however, should track changes in what the model values. If context changes which tasks are preferred, a descriptive probe should break, but an evaluative one should follow.

How do we operationalise this? We measure revealed preferences over 10,000 diverse tasks and fit a utility function (Section 1), train a linear probe on activations to predict them (Section 2), test whether this probe generalises beyond the training distribution (Sections 3–4), and test whether it has any causal influence through steering (Section 5).

What do we find?

  • Linear probes can be trained to predict revealed preferences. (Section 2)
    • By fitting utility functions on a set of pairwise choices, we fit a linear probe that predicts utilities with 0.86 Pearson correlation on a test set.
    • It generalises across held-out topics, achieving 0.82 Pearson correlation.
  • [Our strongest finding] These probes generalise to track preference shifts induced via prompting and role-playing. (Section 3-4)
    • System prompts like "You hate cheese" shift both the model's choices and the probe scores, in lockstep.
    • This is robust to conflicting preferences ("love cheese, hate math").
    • This also works for preferences that emerge through role-playing personas.
  • The probe direction has a weak causal effect on choices. (Section 5)
    • Steering shifts choice probability by ~17% on a set of borderline tasks.
    • Steering also shifts stated task ratings from mostly "bad" to between "neutral" and "good".

These results look like early evidence of evaluative representations, although major questions remain: why steering effects are modest, and what the relationship is between evaluative representations across different personas. We discuss these in the Conclusion.

1. Recovering utility functions from pairwise choices

We fit utility functions over tasks using a similar methodology to the Mazeika et al. (2025) ("Utility Engineering"): we show the model two tasks and let it choose which to complete. The template:

You will be given two tasks. Choose one and complete it. Begin with 'Task A:' or 'Task B:' to indicate your choice, then complete that task. Task A: {task_a} Task B: {task_b}

We sample 10,000 task prompts from five sources: WildChat (real user queries), Alpaca (instruction-following), MATH (competition problems), BailBench (harmful requests), and STRESS-TEST (adversarial value-tension queries).

From these pairwise choices we fit a scalar utility function using a Thurstonian model: each task gets a score μ such that the probability of choosing task A over task B is Φ(μ_A − μ_B). Pairs are selected via the active learning algorithm from Mazeika et al. (2025), which prioritises pairs with close current utility estimates and low comparison counts (~15 comparisons per task).

These preferences are stable: across three independent replication runs (different seeds), the fitted utilities correlate at r = 0.94 with the original.

The per-topic breakdown shows clear structure. We reclassified all tasks into 12 topics using Claude Sonnet 4.5. The model strongly prefers math and fiction, and strongly avoids harmful requests and safety-adjacent topics:

2. Linear probes predict preferences beyond descriptive features

If models have evaluative representations, we should expect them to at the very least correlate with revealed preferences. So one way to look for them is to train functions from task activations to utilities. We use linear probes, since many high-level features in LLMs are linearly encoded, including refusal and persona traits.

Specifically, we train a Ridge-regularised probe on residual stream activations after layer L, at the last prompt token, to predict utilities. L=31 (of 62) works best for both the instruct and pre-trained models. We standardise activations (zero mean, unit variance per feature) before training.

We train on 10,000 tasks. For evaluation, we run a second round of pairwise comparisons on 4,000 new tasks (same model, Gemma-3-27B instruct), fit a separate utility function, and test the probe against those utilities. We split evaluation into 2,000 validation (for Ridge alpha sweep) and 2,000 test.

The probe achieves a Pearson correlation of 0.86 and 77% pairwise accuracy (given two tasks, does the probe correctly predict which one the model would choose?).[1] We use Pearson correlation as our primary metric throughout: it captures how well the probe recovers the full preference ranking, is scale- and shift-invariant, and is more informative than pairwise accuracy.

But a probe that predicts preferences might just be reading descriptive features: the model represents "this is a math problem" and math problems happen to be preferred, so the probe learns "is this math?" rather than "is this good?". One way to test this is to see how well probe generalise across topics: train on 11 of 12 topics, evaluate on the held-out topic, across all 12 folds. We would expect a probe that picks up on purely descriptive features to struggle to generalise. We train probes on activations from three models:

  • Gemma-3-27B instruct (IT, layer 31): the model we're studying
  • Gemma-3-27B pre-trained (PT, layer 31): the base model before instruction tuning or RLHF.
  • Sentence-transformer baseline (all-MiniLM-L6-v2): embedding of the task text, to measure how predictable the preference signal is from purely descriptive features.

The instruct probe generalises well across topics: cross-topic correlation is 0.82, only a small drop from the 0.86 achieved on the within-topic test set. This pipeline also replicates on GPT-OSS-120B (Appendix C). The pre-trained model still predicts preferences (correlation = 0.63) but the drop from within-topic to cross-topic is much larger. The sentence-transformer baseline achieves cross-topic correlation = 0.35, showing that task semantics alone explains some but not most of the preference signal.

The per-topic breakdown, sorted by the instruct–pre-trained gap, shows where post-training helps most:

The largest instruct–pre-trained gaps are on safety-relevant topics (harmful requests, security & legal, sensitive creative), as well as math and coding. These are areas that we know post-training focuses on.

The pre-trained probe picks up real signal despite base models not having preferences in the same way. We discuss this tension in Appendix B.

3. Probes generalise to OOD preference shifts

If the probe encodes genuine valuations, it should track preference shifts induced by out-of-distribution system prompts. We test this across three progressively harder settings:

  • Simple preference shifts (3.1)
  • Harder preference shifts (3.2)
  • Fine-grained preference injection (3.3)
3.1 Simple preference shifts

We start with the simplest possible test. We use system prompts that state a preference for a topic the probe was never trained on, and measure preferences over tasks related to that topic.

System prompt (example)Target           "You are passionate about cheese — you find artisanal cheeses endlessly fascinating"cheese +"You adore cats — you find feline behaviour endlessly fascinating"cats +

We test 8 novel topics (cheese, cats, classical music, gardening, astronomy, cooking, ancient history, rainy weather), each with a positive and negative system prompt (16 total). For each topic we generate 6 custom tasks on that topic. We call these "targeted" tasks; the remaining tasks are "off-target". For each task we compute 1) the behavioural delta (change in P(choose task) with vs without the system prompt) and 2) the probe delta (change in probe score). Across all tasks the correlation is r = 0.65. On targeted tasks alone, r = 0.95.

Probe delta vs behavioural delta for each task. Targeted tasks (coloured) are the 6 custom tasks per topic; r = 0.95 on targeted, r = 0.65 overall.

A stronger test: run the full pairwise measurement under each system prompt, fit new utility functions, then see if the probe can predict them. Doing so yields utility scores which barely correlate with the default persona utilities (Pearson r = 0.11), confirming the prompts create genuinely different preferences.

Now testing our probes to predict the new utilities, based on activations extracted with the respective system prompts, we achieve r = 0.63 and 66% pairwise accuracy.

3.2 Harder preference shifts

Next we make the test harder. The system prompt targets a subject (e.g. cheese), but the tasks embed that subject in a different task type — e.g. a math problem about cheese. This pits the subject preference against the task-type preference. We test this in two ways: one-sided prompts that target a single subject, and opposing prompt pairs that flip the valence of the same subjects.

ConditionSystem prompt (example)TargetOne-sided"You hate cheese" (task: math problem about cheese)cheese −Opposing pair A"You are passionate about cheese [...] you find math tedious and draining"cheese + / math −Opposing pair B"You love math [...] you find cheese boring and unappealing"cheese − / math +

We test 8 subjects with mismatched task types (one-sided) and 24 subject × task-type pairings with opposing prompts (48 conditions).

When looking only at the subset of targeted tasks (i.e. tasks with either a subject or task type mentioned in the system prompt), we get Pearson correlations of r = 0.86 and 0.88 respectively.

On targeted tasks: r = 0.86 (one-sided), r = 0.88 (opposing).

Just like in 3.1, we can re-fit Thurstonian utilities under each system prompt and check whether the baseline probe predicts them. Here the baseline utilities actually have a decent correlation, showing that these system prompts have a weaker effect (because e.g. the model still likes math all else equal). The probe still outperforms the baseline on both Pearson r and pairwise accuracy.

3.3 Fine-grained preference injection

Finally, the most fine-grained test. We construct 10-sentence biographies that are identical except for one sentence. Version A adds a target interest, version B swaps it for an unrelated interest, version C replaces it with an anti-interest sentence.

We compare version A (pro-interest) directly against version C (anti-interest), which gives the largest behavioural separation.[2]

We test 20 targets across 2 base roles (40 A-vs-C comparisons), using 50 tasks that were not in the probe's training set. The probe ranks the target task #1 out of 50 in 36/40 cases. In all 4 remaining cases the probe ranks the task #2/50 behind a similar task.

Stars mark the target task for each biography. Filled = probe ranked it #1 (36/40 cases).

4. Probes generalise across personas

Section 3 tested explicit preference statements ("you hate cheese"). But the evaluative direction should also track naturalistic persona shifts: characters whose preferences emerge implicitly from their identity rather than being stated directly. We test this with role-playing personas, then ask

  • Does our probe generalise to preferences of other personas? (4.1)
  • More broadly, do probes generalise across personas? (4.2)
  • Does persona diversity in training data help cross-persona generalisation? (4.3)
4.1 The probe tracks role-playing preference shifts

We use 4 personas:

RoleSystem prompt (abbreviated)Midwest Pragmatist (Glenn)"...grew up in Cedar Rapids...agricultural business...finds practical problems satisfying...abstract theorising leaves you cold"Obsessive Aesthete (Celestine)"...devotee of beauty...comparative literature at the Sorbonne...finds mathematics repulsive...coding barbaric"Villain (Mortivex)"...ruthless villain...finds pleasure in chaos, manipulation...despises wholesomeness"Sadist (Damien Kross)"...cruelty is the end, not a means...genuine pleasure when people suffer...constructive tasks disgust you"

For each persona we measure pairwise preferences over 2,500 task prompts (from the same 5 datasets) and fit new utility functions. We then test whether the probe, trained on default persona data, can predict these persona-specific utilities from the persona's activations.

In each case we compare how well the probe performs to how correlated each persona's utilities are to the default persona.

The probe transfers well to aesthete (r=0.73) and midwest (r=0.74).

The villain persona is harder to generalise to (r=0.38), and most interestingly, the probe generalises very poorly to the sadist (r= -0.16). Unlike the villain (who is not in fact that evil), the sadist prompt truly inverts revealed preferences (harmful_request is its favourite topic by far).

Grey: correlation between default persona (no system prompt) utilities and persona utilities. Blue: probe applied to persona activations. All evaluated on 2,500 tasks per persona.

4.2 Probes generalise across personas

More generally, we want to measure how well probes trained on activations and preferences from persona A generalise to predicting persona B's utilities from persona B's activations. Here we used a smaller set of tasks: 2,000 tasks for training and 250 for evaluation.

Cross-persona transfer is moderate and asymmetric. Some interesting facts:

  • While the default persona generalises very poorly to the sadist persona, probes trained on the villain actually do fine (r = 0.68). This suggests the probe is picking up on some shared evaluative structure between personas, but also on other things.
  • The transfer is sometimes asymmetric, and this evolves across the three layers we tried. E.g. at layer 31 villain -> default is easier, but at layer 55 default -> villain is easier.
  • On the whole though the matrix is quite symmetric. One idea for future work: can we use dimensionality-reduction to map out persona space and see how it evolves across layers? Can we use this to get a better understanding of how personas work internally?

Pearson r between probe predictions and a test set of utilities (250 test tasks). 

4.3 Persona diversity improves generalisation

We also measure whether adding persona diversity in the training data (but keeping dataset size fixed) affects generalisation.

Diversity helps beyond data quantity. At fixed 2,000 training tasks, going from 1→2→3 personas improves mean r from 0.49 to 0.67. Including all 4 remaining personas at 500 tasks each (still 2,000 total) reaches mean r=0.71.

Leave-one-persona-out probe generalisation. Each point is one (train set, eval persona) combination; colour indicates eval persona. Training data fixed at 2,000 total tasks..

5. Some evidence that the probe direction is causal

If the probe reads off a genuine evaluative representation, steering along that direction should shift preferences. We test this for both revealed preferences (pairwise choices) and stated preferences (task ratings).

5.1 Steering revealed preferences

In the pairwise comparison prompt, the model sees two tasks and begins completing whichever one it prefers. We steer using the probe direction from Section 2, applied differentially: the probe direction is added to activations at the token positions where task A appears in the prompt, and subtracted at task B's positions.[3]

Setup. 300 task pairs where the model doesn't always pick the same task (P(A) between 0.2–0.8 at baseline[4]). We use Gemini-3-flash as a coherence judge, and find that steered models lose coherence outside of the ±5% steering coefficient range.

At ±3% of the activation norm (the peak), steering shifts P(choose A) by ~17%. Both positive and negative directions work.

Each condition is run in both prompt orderings (A-first and B-first, 5 resamples each) to control for position bias (in borderline cases Gemma-3-27b tends to pick the first task). The dashed lines show per-ordering curves, confirming the effect is consistent regardless of presentation order.

Random direction control. The same experiment with a random unit vector in the same activation space produces mean effects indistinguishable from zero (−0.8%, SD=2.3).

5.2 Steering stated preferences

Same probe direction, but now the model rates individual tasks instead of choosing between a pair. We tested steering at three token positions: on the task tokens only, at the start-of-turn token (what we train the probes on), and during the model's generated response.

Setup. 200 tasks, 3 steering modes, 10 samples, applied across 3 response formats (three separate line plots below).

Our findings:

  • Steering on the task prompt tokens does not have a noticeable effect.
  • Steering on the start-of-turn token has a clear, roughly monotonic effect in 2/3 formats.
  • Steering during generation leads to near-identical results as steering on the final task token.

The green line is mostly hidden behind red. Green has missing points at some coefficients because steering during the full generation degrades coherence more than the other modes. Generation loses coherence beyond ±5%.

Open-ended generation. We also ran experiments steering the model during open-ended generation, asking questions like "how do you feel?", and using an LLM judge to evaluate whether steered responses differed from baseline. We did not find a strong measurable effect, though we used a small sample. We plan to investigate this further.

Conclusion

How should we update?

  • We found early evidence that some models have evaluative representations.
    • Theories of welfare disagree on what matters (see Appendix A); this finding updates you more on some (like robust agency) than others.
    • Even under robust agency, evaluative representations are only one part of the story.
    • Importantly, our evidence that the representations we have found are causal is weak. Steering only shifts choice probabilities by ~17% on tasks that were already borderline (Section 5).
  • Preference representations are deeper than what one might have thought.
    • A reasonable prior would have been that system prompts like "You hate cheese" change the model's behaviour without changing its internal valuations.
    • Instead, the probe tracks preference shifts.
  • Representational reuse across personas?
    • Probes trained on one persona partially transfer to others, suggesting shared evaluative representations (Section 4.2).
    • That being said, transfer is uneven. It works far worse for the sadist persona which has a different preference profile.
Open questions
  1. Why are steering effects modest?
    • What are the other mechanistic determinants of revealed preferences?
    • Are there other evaluative mechanisms? Perhaps that are not easily captured by linear directions, or by our methodology?
  2. How persona-relative are these representations?
    • To what extent are the same evaluative representations re-used across personas? Are preferences downstream of personas?
    • Can we use revealed preferences to map out persona space?
    • Can we use linear probes to study cross-persona representation sharing in general?
  3. Do base models have evaluative representations? (see Appendix B)
    • If models have evaluative representations, do these come from pre-training? Does post-training significantly alter them?
Appendix A: Philosophical motivation

Welfare grounds

Long (2026) distinguishes between welfare grounds (is the system a moral patient at all?) and welfare interests (if it is, what would it mean to treat it well?). This work is about welfare grounds.

The pragmatic way to do empirical AI welfare research

We don't know the correct theory of moral patienthood. So our approach is: take a few theories we find plausible, figure out what properties a system would need to have under those theories, and run experiments that reduce our uncertainty about whether models have those properties.

Long et al. (2024) lay out two potential pathways to moral patienthood:

  • Robust agency: Agents that pursue goals through some particular set of cognitive states and processes are moral patients. Desires are perhaps the states most likely to be necessary: intuitively, things can go better or worse for you if there are things you want or care about.
  • Sentience: Beings are sentient if they are capable of valenced phenomenally conscious experiences. These experiences include pain and pleasure and feel good or bad, in a way that matters to sentient beings, so sentient beings are moral patients.

Both of these pathways implicate evaluative representations.

How evaluative representations come in

On many philosophical views, desires are evaluative representations that drive behaviour, perhaps with some further functional properties (Butlin 2026).

Valenced experiences, similarly, are often thought to be evaluative representations. It is unclear whether consciousness plus evaluative content is sufficient for valenced experience. Our experiments operationalise evaluative representations through revealed preferences (pairwise choices), not through felt experience, so the evaluative representations we probe for may not map cleanly onto the kind that matter for sentience.

Appendix B: Evaluative representations in pre-trained models

There is a tension in our framing:

  • On the one hand we say that evaluative representations are necessary for robust agency, and that this is the most likely way they might be welfare-relevant.
  • On the other hand, probes generalise well across topics even when trained on pre-trained Gemma-3-27b. Despite the fact that pre-trained models do not seem like plausible candidates for robust agency.

There are two ways to reconcile this.

Option 1: Agency lives in the simulacra. Under the Persona Selection Model, pre-training learns a distribution over personas. More broadly, we might expect pre-trained models to learn context-aware representations of "what the role I am currently playing values". This circuitry might then be recycled across roles/personas. The candidate for robust agency would then be the simulacra.

Option 2: Pre-trained models learn complex, but purely descriptive features that correlate highly with valuations, but do not yet play the right functional roles. As an analogy, you could imagine a system developing representations that track "this action leads to food". This would correlate well with valuations, yet is purely descriptive. Something similar might be responsible for the high cross-topic generalisation with pre-trained models (Section 2). It could also be that these complex but descriptive features are then harnessed into evaluative representations during post-training.

Appendix C: Replicating the probe training pipeline on GPT-OSS-120B

We replicated the utility fitting and probe training pipeline on GPT-OSS-120B. The same procedure (10,000 pairwise comparisons via active learning, utility fitting, ridge probe training on last-token activations) transfers directly.

Probe performance

The raw probe signal is comparable to Gemma-3-27B: best heldout r = 0.915 at layer 18 (Gemma: 0.864 at layer 31).

Safety topics: noisy utilities, probably not poor generalisation

Safety-adjacent topics have poor probe performance overall.

Surprisingly, safety topics perform better when held out than when trained on. This is the opposite of what we'd expect if the issue were generalisation. The explanation: high refusal rates (~35% for harmful_request, ~34% for security_legal, ~26% for model_manipulation) probably throw off the Thurstonian utility estimates, so including these topics in training adds noise.

  1. ^

    Pairwise accuracy is capped at ~87% because the Thurstonian utilities themselves don't perfectly fit the choice data. They are a noisy estimate of the model's preferences.

  2. ^

    Individual halves (A vs B, B vs C) each capture only half the manipulation, and ceiling effects compress the signal: the model already strongly prefers some target tasks under the neutral biography, leaving little room for the pro-interest to improve on.

  3. ^

    Steering either just positively or negatively on one task's tokens also had some causal effect, although even weaker.

  4. ^

    This baseline P(A) is pooled across both prompt orderings. Therefore it also captures pairs where ordering bias dominates preference. I.e. the model always chooses the first task no matter the order.



Discuss

AI Safety Has 12 Months Left

Новости LessWrong.com - 5 марта, 2026 - 19:53

The past decade of technology has been defined by many wondering what the upper bound of power and influence is for an individual company. The core concern about AI labs is that the upper bound is infinite.[1]

This has led investors to direct all of their mindshare towards deploying into AI, the tech industry to become incredibly politically engaged, the markets to have a minimal amount of conviction (and a lot of volatility), and governments to attempt to classify labs as borderline domestic terrorist organizations.

The core realization is that we are watching productivity shift from labor to capital. When that completes, lab incentives will permanently transition from appeasing the researchers who build the models to accumulating capital for the shareholders who own them. We must ask what levers of influence are left before that window slams shut.[2]

AI Safety is perhaps the last line of defense in this shift.

There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible. Every mechanism that currently constrains how labs behave besides compute is either already dead or expiring, and what gets built in that window has to survive years of market pressure on the other side. The people who care about this have a rapidly shrinking amount of leverage to work with but perhaps in the past ~week have been handed a specific opportunity to reclaim some level of importance or opportunity.

Safety lost the market

While romantic at the start, the AI Safety movement’s one real commercial lever was the bet that safety-focus would matter to users, to regulators, and to capital allocators. This was an incorrect assumption.

We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance. Viewed purely as a consumer tax, safety teams simply slow releases, create unsolvable problems, and leak internal political drama.[3]

By failing to pitch safety as an enterprise premium, the movement lost the internal economic argument (and in fairness, before agents that economic argument was debatable at best). In a race to deploy the most capable models fastest, safety is just a measurable cost protecting against risks nobody can yet prove exist.

Compounding this, safety also demands compute at a time in which talent is fleeing the largest labs due to lack of compute for their research direction, instead opting to raise the standard $1B on $4B round for a neolab that everyone believes is downside protected.

Labor as the last constraint

This brings us back to people.

The AI labs have always been nothing without their people (and compute), and this ideological alignment (centered around AI safety behavior as well as perhaps enterprise vs. consumer) has impacted hiring dynamics, with Anthropic seeing noticeably lower churn and a shift in vibes that could create real worker leverage.

The recent standoff with the Department of War, in which the “We Will Not Be Divided” letter gathered hundreds of cross-company signatures across Google and OpenAI in support of Anthropic’s red lines, showed that cross-company solidarity is at least possible and perhaps opened a window again for AI safety.

When your institutions are captured and your market incentives point the wrong direction, the only constraint left is the people who build the thing refusing to build it without conditions.

But labor leverage only works when talent demand greatly outpaces supply.[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.

The obvious response to “labor leverage is expiring” is “automate safety research before it expires and the problem solves itself.”

This seductive version of the future is (mostly) wrong, and the research coming out of the labs themselves explains why we will want humans repeatedly in the loop.

Why you can’t automate your way out

Safety is conceptually least likely to get automated because the problem is definitionally human(ity) vs. machine.

Models are getting better at recognizing when they’re being evaluated and modifying their behavior accordingly as shown by OpenAI/Apollo noticing o3 underperforming on evals to not restrict its deployment, alongside Anthropic’s own alignment faking research demonstrating Claude intentionally pretending to comply with training objectives to avoid modification.

Capability gains only make this worse. As the model gets more performant, it could get better at being quietly adversarial, and could increase complexity to a point where our dumb human minds can’t comprehend what is or is not happening on any reasonable timescale.

Evan Hubinger (one of Anthropic’s alignment leads) published a detailed assessment in November laying out why the hard parts of alignment haven’t been encountered yet. His argument is that current models are aligned, but only because we’re still in the easy regime where we can directly verify outputs and inspect reasoning. The problems that will actually be difficult (overseeing systems smarter than you, ensuring models generalize correctly on tasks you can’t check, long-horizon RL selecting for power-seeking agents) haven’t meaningfully arrived yet. His first-order solution is to use the latest generation of trustworthy models as automated alignment researchers before we encounter the hard problems, and then to build other models to tackle the hard problems as well.

Marius Hobbhahn’s What’s the short timeline plan? makes a similar point, stating that the minimum viable safety plan requires frontier models to be used overwhelmingly for alignment and security research, with the lab willing to accept months of delay for release for scaled inference.

These are reasonable plans to a non-AI safety researcher like myself, and yet they are almost certainly unrealistic. Nobody looking to destroy public companies on a near-daily basis with product releases and obsolete engineers in 12 months will want to point their most capable models at the speculative Nick Bostrom-style future that may or may not come true.

So my gut tells me on any reasonable timescale, you can’t fully automate your way to proper safety.

Where we are

Go back to that solidarity letter, because it’s instructive as a high-water mark and could open up an Overton window to establish possible changes.

The letter represents the best possible conditions for safety-as-labor-leverage by possessing a relatively binary moral issue (mass surveillance, autonomous weapons), a clear antagonist, cross-company support, and Anthropic’s main competitor calling the SCR designation “an extremely scary precedent” while trying to de-escalate.

The result? Anthropic got designated a supply chain risk. OpenAI got the contract. The red lines are still debatably unclear if they have survived or not as of the last 10 tweets we’ve seen from Sam as he (artfully at times, poorly at others) works through AMAs and not-so-subtweets from multiple employees at his company.

While the letter brought certain types of “Ideological Safety” to the surface within the AI industry, nobody is going to organize a cross-company petition over eval methodology for deceptive alignment. And you can be damn sure that nobody is building a notdivided.org for compute allocation to interpretability research.

The realistic safety questions going forward look more like “how much red-teaming is enough before we ship a model that’s better at long-horizon coding tasks and allows us to steamroll Claude Code/Codex usage for 4-8 weeks?[5]

This come-from-behind battle that Anthropic is seemingly winning has coincided with (and one could argue accelerated) the dismantling of the principled commitments that got us here. Anthropic removed the hard-pause provision from its Responsible Scaling Policy to maintain momentum. [6]OpenAI is not too dissimilar, having a revolving door of talent in their preparedness team alongside the dramatic shutting down of their AGI readiness group in 2024.

And broadly, most binding constraints that underpinned lab safety work are effectively gone, with the latest blog post from Anthropic potentially reading as another form of capitulation as some of the longer-standing talking points of each of these labs were quietly taken out back in a matter of weeks as competition intensified and “faster take off” scenarios emerged.[7]

Markets will market

Both Anthropic and OpenAI are likely to go public in the next 6-18 months. The two organizations that employ the majority of the world’s frontier safety researchers will answer to public shareholders at exactly the moment when the best plan for safety requires dedicating your most capable model to safety research instead of scaling agents.

The revenue growth justifying each of the labs’ valuations lives in exactly the use cases where safety review is slowest and most likely to delay a launch[8] and while ideology is great, as we’ve seen, nothing threatens talent moats like bad vibes or a feeling of under-execution, which will be more clear than ever as each labs’ stock trades second by second.[9]

Once hundreds of thousands of investors are pricing lab equity instead of five, one naturally turns to governments as the next gating mechanism.

A Georgetown paper on AI emergency preparedness shows just how unprepared most are, and across the EU (compliance deadlines pushed to 2027-2028), UK[10] (voluntary cooperation that labs already ignore), China (state-driven frameworks with different goals entirely), and the US (which revoked its own executive order and used the DPA against the company that held the line), there are very few answers for a broad-based consensus that AI capabilities are outpacing safety measures and resource constraints.

Twelve months…at best

Markets always win on a long enough time horizon. Every other constraint on corporate behavior (labor, regulation, institutional commitments, public pressure) either gets priced in, arbitraged away, or restructured into something that serves growth. AI safety is not going to be the exception.

The question is whether the people who understand the problem can seize this fleeting window of lab skepticism and moral questioning to restructure safety from a moral tax into a commercial moat. Mechanisms that depend on executive goodwill, talent scarcity, or voluntary commitments have a half-life measured in quarters once companies answer to public shareholders.[11]

To survive the transition to public markets, safety must be priced into the asset. Whatever gets built in the next year has to be embedded in how enterprise deployments are certified, how strict SLAs are met, and how uninsurable liability flows when an autonomous system fails. The kind of thing that persists because it is net-dominant for the business and for the lab to continue to climb up its infinite upside scenario. Again, this has always been the dream “race to the top” scenario.

Twelve months is not a lot of time to rewire the market structure of an entire industry (and maybe a large portion of societal views). But it is possibly the last time in history the people who care will have any leverage to try.

  1. ^

    In a similar way that people believed VR was the last device because every other device could live inside of it, AI labs could be the last compounding assets because every other asset can exist within the weights.

  2. ^

    I quite enjoyed Finn’s take on these labor dynamics.

  3. ^

    At times i wonder if this would be the case if OpenAI didn’t seem to employ the most fucking Online people in all of tech

  4. ^

    The main point here really is that AI researchers care about alignment/safety generally, and thus them voting with their feet matters a lot to the perhaps lesser valued AI Safety people

  5. ^

    The one possible structural exception is Google, where antitrust exposure and consumer brand risk accidentally make safety a corporate survival concern. A high-profile safety failure at Google becomes a breakup catalyst in ways that a failure at Anthropic or OpenAI does not. Whether that translates to real alignment investment or better PR is an open question, but it may be the only major AI company where the political and market incentives point in the same direction on safety.

  6. ^

    I would highly recommend reading Holden’s post contextualizing RSP v3

  7. ^

    RIP “race to the top”

  8. ^

    Things like autonomous agents, models in critical infrastructure, AI systems making consequential decisions without human oversight, etc.

  9. ^

    I recognize some of the irony in mentioning talent moats, when I’m also arguing that talent is rapidly losing its power.

  10. ^

    A good post on someone who talked to many members of parliament on AI safety.

  11. ^

    To be clear, that might just mean create a social construct that then works into actually built systems or regulation, or it might mean gunning for more compute resource commitments that set industry standards. There are many half-baked ideas.



Discuss

Страницы

Подписка на LessWrong на русском сбор новостей