LessWrong.com News
InkSF, an Opening on Finding the Highest Impact in AI Safety and Moving to SF
How can we actually minimize the odds that AI leads to catastrophic outcomes for all of us humans? This question has been rattling around my head for the last two months. The world might be ending. Nobody seems to care. The incentives are steaming us ahead. When I ask strangers on the street: “How likely is it that superhuman[1] AI could become too powerful for humans to control?”, 78% say either "very likely" (51.6%) or "somewhat likely" (26.3%)[2]. My guess is AI capabilities spending is at least 20x the spending on ensuring AI leads to the flourishing of humans[3]. Moloch[4] is winning.
So what can actually be done? As a toy example[5]: let’s say I currently think there is a 40% chance AI eventually goes extraordinarily bad for humanity. I could either:
- Try really hard to get laws passed that mandate AIs must think in English and be clearly understandable by humans[6]. I think there’s a 15% chance this would succeed and it would lower the odds AI goes catastrophically wrong from 40% to 36%.
- Work on really understanding on a mechanistic level what parts of the model matter for good outcomes, using this to figure out which parts would have to go wrong to end up with a model generating bad outputs, and being able to predict the odds that the required components for safety are always activated when relevant. I think there’s a 1% chance this would succeed and would lower the odds AI goes catastrophically wrong from 40% to 20%.
In this toy example, the first option lowers the odds of catastrophic outcomes by .6% in expectation and the second by .2%. “Would succeed” is doing a lot of lifting here, because what really matters is what I can do, not what some arbitrary collection of humans could do if tasked with either of these. What really matters is how much I think my working on this problem moves the odds of success, multiplied by the impact of the given approach succeeding. Not only does it matter how impactful a given approach is, but also how good I am at executing on it[7]. This does of course imply that there is value in the work of trying to funnel people into doing the most effective things[8].
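For concreteness, here is a minimal sketch of that back-of-the-envelope arithmetic, using the made-up numbers from the toy example above (the function and its structure are illustrative only, not real estimates):

```python
# Back-of-the-envelope EV comparison for the two toy approaches above.
# All probabilities are the made-up numbers from the post, not real estimates.

def expected_risk_reduction(p_success, p_bad_before, p_bad_after):
    """Expected drop in P(catastrophe): P(approach succeeds) * (risk reduction if it succeeds)."""
    return p_success * (p_bad_before - p_bad_after)

# Approach 1: laws mandating legible, human-understandable thinking.
laws = expected_risk_reduction(p_success=0.15, p_bad_before=0.40, p_bad_after=0.36)

# Approach 2: mechanistic understanding of which model components matter for safety.
mech = expected_risk_reduction(p_success=0.01, p_bad_before=0.40, p_bad_after=0.20)

print(f"Laws:        {laws:.1%} expected reduction in P(catastrophe)")  # 0.6%
print(f"Mech interp: {mech:.1%} expected reduction in P(catastrophe)")  # 0.2%
```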
But what should I, Cormac Slade Byrd, actually do?
I’ve spent 2 months learning, and I know so much more than I did 2 months ago. I started with the technicals. I have written a simple GPT-2 style LLM by hand[9]. I learned[10] all the core mech interp techniques. I’ve red-teamed the largest open weight model. I’ve read up on the seemingly hopeless state of cybersecurity - both when it comes to attacks enabled by LLMs and the potential for nation-state level attacks to steal model weights[11]. Most importantly, I’ve read so much about what various people think will help lower the odds of catastrophic outcomes, and why.
Part of what makes this so hard is there are so many second order effects. It’s genuinely quite hard to tell if a given thing that seems like it might help does in fact lower the odds of terrible outcomes. Most of the current CEOs of the largest labs got there by way of people who were worried about AI risk. Coefficient Giving[12] gave $30M to OpenAI to get a seat on the board[13]. METR’s timeline eval has led to more people understanding/being worried about the rate of AI progress, but also the existence of eval orgs enables labs to in some sense offload the requirement to ensure their models behave well onto these 3rd party eval orgs[14]. There are so many downstream effects to any given course of action.
I think the incentives are really important here. Yes, making sure AI goes well is a technical problem. But, the reason we are in such a dire place is because of the incentives. We are in a competitive race to build AI as quickly as possible as a result of a bunch of decisions that were (at least in the moment) motivated by someone wanting to lower the odds AI goes poorly[15]. Then, the incentives were inevitably shifted by general economic capitalistic forces and human power-seeking/competitive nature.
If I want to figure out how to have the most impact, I will have to really think about incentives and downstream impact.
Moving to the Bay
The internet (plus Claude) is great for learning. I have learned so much. But I live in NYC. AI is happening in the Bay. If I want to really immerse myself in what is actually happening on the ground I can’t do that in NYC. If I want to understand the people (and their incentives) who are both building AI and are working to align AI to the interests of humanity, then I will probably have to talk to them. Go to their parties. Osmose as much as I can. Really figure out what matters.
I’m flying to SF tomorrow; I’ve got a sublet in SF[16] until May 9. I am looking for a sublet in a communal house in Berkeley starting May 16[17]. I hear the East Bay and SF proper are pretty socially disconnected and have very different cultures. I want to live in both to really understand what’s going on - see with clear eyes, not let a single groupthink win me over.
The hope is that by the end of June I will have a much better model and be relatively certain in picking a specialization, in picking what work I think I can do to maximally decrease the odds AI goes oh so poorly for all of us.
I love people, I love talking to people. If you think talking to me would be fun[18], please reach out!
Why is InkSF in your title?
I like writing. It helps clarify my thoughts. It helps people I love stay connected with me. It might even be providing a social good. Inkhaven is starting today in Berkeley. I am moving to SF tomorrow. I think there are good odds[19] that forcing myself to write every day I am in SF is a forcing function that’s net worth it. There are many things other than writing that are worth my time, so my guess is I will not in fact go all 30 days. But it seems clearly worth trying. So, this is me committing, at maybe 25% odds of pulling it off, to writing and posting at least 500 words every day in April.
- ^
Defined as: “Some companies are trying to build superhuman AI that would be far smarter than any human at nearly everything”
- ^
preliminary results, n = 95. Larger post about this incoming this month.
- ^
Or at the very least not leading to terrible outcomes for all humans.
- ^
https://www.slatestarcodexabridged.com/Meditations-On-Moloch
- ^
Numbers mostly made up on vibes.
- ^
AIs currently do this; yes, it makes mech interp way more annoying. But it would be so, so much worse if AIs thought in some efficient machine gobbledygook that we have no way of interpreting. It’s pretty unlikely that English just happens, purely by chance, to be the most token-efficient way of thinking, so it’s not hard to imagine a world where competitive pressure leads labs to slowly move toward their AIs thinking in some totally foreign “language”. My guess is this is the kind of regulation that labs would feel quite good about; they probably do like being able to understand the thinking their LLMs do.
- ^
For example, I do not have a PhD, so it seems somewhat unlikely that theoretical technical research will be the place where I have the most individual impact; but maybe I think ARC-style theoretical research is really valuable and my time would be best spent trying to convince more PhD grads to try working on that problem. My model is that basically all technical safety organizations are way more talent-constrained than funding-constrained.
- ^
I, for example, probably wouldn’t be here today if it weren’t for both Zvi’s posting and a friend exasperatedly asking me why the fuck I’m not working on the single problem I think is by far the most important for humanity to solve.
- ^
No tokenization, characters only. We’ll see if I end up thinking tokenization is worth getting into.
- ^
Played with in code of course
- ^
As far as I can tell, commercially serving models means that the weights just have to be sitting on the servers that are serving the given model. That’s a ginormous attack surface.
- ^
The single largest funder in the AI safety ecosystem
- ^
Is this better than the counterfactual? It really depends on how much that board seat changed the odds OpenAI went bad. We are living in a timeline where this bet seems like it didn’t pay off, but that doesn’t necessarily mean it wasn’t worth it in EV. Holden seems to believe it was on net worth it, and I should probably put real weight on that given he actually knows how much impact the board seat had.
- ^
If there were no eval orgs there would likely be more pressure on labs to really ensure their models are safe. I certainly think it’s a better equilibrium for people to believe it is on the lab to prove their model is safe than for it to be on external orgs to show a model is unsafe. However, it’s still really hard to compare counterfactuals here. My weakly held belief is that eval orgs are probably net bad, especially considering the opportunity cost of what the people working there could be doing instead. This is partially bolstered by a niggling incentive worry: most eval orgs are staffed by people who leave the big labs, and they get significant compute credits as well as prestige for working with (and getting to say “partners with ___”) a given big lab. The incentives on eval orgs are tricky, so everything they do seems like it should be treated with a little more suspicion. On the other hand, Coefficient Giving absolutely funds eval orgs but doesn’t fund any pause/stop AI orgs - they seemingly abruptly stopped funding MIRI after it pivoted to pause advocacy. So Coefficient Giving seems to think evals are net worth it but pause isn’t; how much should I update on this?
- ^
Basically every major lab was founded ostensibly because they thought the existing labs were doing a bad job and their new lab would be much better at creating safe good AI.
- ^
The Sandwich
- ^
The astute among you might have noticed that May 9 ≠ May 16. I am going to Sleepawake in between. I am my impact, but attunement and connection and being with other people are also a large part of what makes life worth living for me. This is crucial, not just for a life of joy, but also because I am way less impactful if I personally am burned out and sad and lonely.
- ^
or interesting, or truthful, or would help lower the odds of catastrophe, etc
- ^
Probably above 20% and below 50%
Anthropic Responsible Scaling Policy v3: A Matter of Trust
Anthropic has revised its Responsible Scaling Policy to v3.
The changes include abandoning many previous commitments, including the commitment not to move ahead if doing so would be dangerous, on the grounds that, given the competition, blindly following such a principle would not make the world safer.
Holden Karnofsky advocated for the changes. He maintains that the previous strategy of specific commitments was in error, and instead endorses the new strategy of having aspirational goals. He was not at Anthropic when the commitments were made.
My response to this will come in two parts.
Today’s post talks about considerations around Anthropic going back on its previous commitments, including asking to what extent Anthropic broke promises or benefited from people reacting to those promises, and how we should respond.
It is good, given that Anthropic was not going to keep its promises, that it came out and told us that this was the case, in advance. Thank you for that.
I still think that Anthropic importantly broke promises, that people relied upon, and did so in ways that made future trust and coordination, both with Anthropic and between labs and governments, harder. Admitting to the situation is absolutely the right thing, but doing so does not mean you don’t face the consequences.
Friday’s post dives into the new RSP v3.0 and the accompanying Roadmap and Risk Report, in detail.
Note that yes this is being posted on April Fools Day, but this post is only an April Fools joke insofar as those who believed Anthropic’s previous RSPs are now the April Fool.
Promises, Promises
If your initial promises were a mistake, it may or may not be another mistake to walk them back. Either way, even if your promises were not hard commitments, walking them back involves paying a price for having broken your promises, even if you had a strong reason to break them. How big a price depends on the circumstances.
Almost all mainstream coverage of this event framed it as abandoning or walking back Anthropic’s core safety promises, especially ‘do not scale models to a dangerous level without adequate safeguards.’ As a central example of this, The Wall Street Journal said ‘Anthropic Dials Back AI Safety Commitments’ due to competitive pressures. That oversimplifies the situation, leaving a lot out, but doesn’t seem wrong.
Many outsiders who follow the situation more closely believe this amounts to Anthropic having broken its commitments. Some go so far as to say this means that lab commitments to safety should not be considered worth the paper that they were never printed on. Many now expect Anthropic to make some amount of effort, but nothing that would much interfere with business plans. If Anthropic can’t make the commitment, why should anyone else? Certainly this government is not going to help.
Don’t be afraid to tell them how you really feel. They welcome it. So here we go.
Anthropic Responsible Scaling Policy v3
The Responsible Scaling Policy lays out Anthropic’s commitments regarding when and under what conditions they will release frontier models.
The headline change is that they are no longer committed to not releasing potentially unsafe models, if someone else did it first. Cause, you know, they started it.
That Could Have Gone Better
Anthropic starts their new analysis by going over their theory of change from having an RSP at all, and whether those theories were realized. They report a mixed bag.
First, the good news.
- They developed (modestly) stronger safeguards.
- They did successfully implement ASL-3 safeguards.
- They did importantly get OpenAI and DeepMind to develop frameworks, and then had the idea of a framework codified in SB 53 and RAISE.
Then the bad news.
- It did not create consensus about the level of risk from various models. It has proven very unclear how much risk is in the room, especially in biology.
- Government action has been nonzero but painfully slow at best.
- (I would add) We’re not being sufficiently proactive about ASL-4.
- (I would add) The requirements got changed somewhat when inconvenient.
What are the most important differences in the new version?
Anthropic is now basically giving up on hard commitments and barriers to releasing models, relying instead on making ‘reasonable-to-us’ arguments and deciding that the benefits exceed the risks.
I appreciate the honesty. Really, I do.
If you’re not ready to make a commitment, and you realize you shouldn’t have made one, then the second best time to realize and admit that fact is right now.
Officially breaking the commitments now is higher integrity than silently breaking them later. It’s especially better than silently changing the RSP right before a release. I approve of Charles’s frame of ‘Anthropic stopped pretending to have red lines at which they will unilaterally pause.’
If Anthropic was in practice already doing a ‘we think our arguments are reasonable’ decision process, which with Opus 4.6 it seemed like they mostly were, then better to admit it than to pretend otherwise.
I want to emphasize that essentially no one, not even those who disagree with me and think Anthropic should pause, and who also think Anthropic made rather strong commitments it is now breaking, is saying ‘Anthropic should be holding to its previous commitments purely because they said so, even if this leads to pausing that does not make sense.’
One still has to be held to account for breaking promises, and for making promises that were inevitably going to be broken, even if the decision to break them is right. Your defense that the move was correct does not excuse you from its consequences.
1a3orn: Arguments against the Anthropic RSP changes seem to incline towards deontological language regarding broken promises / duties
While arguments for them incline toward consequentialist language / greater good, afaict.
Oliver Habryka: I think both are right! The old RSP was obviously unworkable and should have never been published, given what Anthropic is trying to do. So abandoning it is the right thing to do, but of course if you break promises you should be held accountable.
It’s not that hard to explain the consequentialist arguments for holding people accountable for breaking promises, but most people have an intuitive sense for why it’s important, so you don’t have to unpack it.
(To be clear, I think Anthropic should stop scaling and redirect its efforts towards advocating for a pause, but doing that because of the RSP would be weird and I don’t think the right move.
It would just look like you sabotaged yourself and now want to hold others back because you accidentally promised some dumb things that took you out of the race)
I also want to emphasize that commitments are only one way to improve safety. Even when plans are worthless, planning is essential, and you can and should just do things. None of this means ‘Holden or Anthropic don’t care about safety,’ only that they will decide what they think is right and then do it, and you can decide how much you trust them to choose wisely.
I do still see this as Anthropic abandoning its experiment in importantly engaging in voluntary self-government and restricting itself. Technically they reserved the right to do this, but it’s still quite the gut punch.
The experiment is over. That’s better than pretending the experiment is working.
From this point, there are no commitments, only statements of intent. Anthropic’s going to do what it’s going to do. You can either choose to trust Anthropic’s leadership to make good decisions, or you can choose not to.
I think Anthropic’s description of its own history says that having these softly binding commitments, and having a track record of treating it as costly to break them, was very good for safety outcomes and policy adoption. I hate that we’ve given that up.
So Cold, So Alone
If your commitment is conditional on the actions of others, you should say that.
They didn’t entirely not say this before, but it was very much phrased as ‘in case of emergency we might have to break glass’ rather than ‘we only hold back if everyone relevant signs on.’
RSPv2 said this in 7.1.7: “If another frontier AI developer passes or is about to pass a Capability Threshold without implementing equivalent Required Safeguards, such that their actions pose a serious risk to the world, then because the incremental risk from Anthropic would be small, Anthropic might lower its Required Safeguards. If it did so, it would acknowledge the overall level of risk posed by AI systems (including its own) and invest significantly in making a case to the U.S. government for regulatory action.”
Whereas Anthropic is now saying they’re willing to hit those thresholds first, unless they have explicit commitments from others to do otherwise, even if this is not a small incremental risk.
I strongly agree with aysja, and disagree with Holden, that it would be misleading to describe this shift as a ‘natural extension of the RSP being a living document.’
I do see the argument that goes like this:
- Going first was designed to get others to follow in a coordination problem.
- No one followed.
- That didn’t work, so we should admit it didn’t work and move on.
If that is where we are at now, you had all the reason to make this stricter requirement clear up front. That gives others more reason to follow you, and avoids all the nasty headlines we’re seeing now. Alas, it’s a little late for that.
If the mistake has already been made, it’s not obviously bad to admit defeat, and say you’re not going to then let someone else potentially dumber and riskier get there first.
I definitely agree it’s better to announce your intention to violate your old policy now, rather than wait until the day you do violate the old policy, which might never come.
davidad: Voluntary commitments to AI slowdowns were a nice idea in 2024 when it was plausible that they could be baby steps toward a multilateral agreement that would contain the intelligence explosion. For a variety of reasons this is no longer plausible.
Anthropic is doing good here.
In the strategic landscape of 2026, racing is the right move, not just for profit but also for maximizing the probability that things go well for most current humans.
Sam Bowman (Anthropic): I endorse the top [paragraph above].
The Anthropic RSP changes are an attempt to work out what kinds of firm commitments have the most leverage in an environment that’s less promising than we’d expected for policy and coordination.
We misjudged what the environment would look like at this point, which is sad. But these new commitments do still have some heft, including a lot more verifiable transparency (with third parties in the loop) on risks and mitigations.
Oliver Habryka: I am in favor of figuring out what kind of firm commitments have the most leverage. But of course, you can’t do that by making “firm commitments” directly!
It’s not a firm commitment if you are just playing around with different commitments.
The main catch is, it sounds like ‘you should see one of the other guys’ is going to be used as a basically universal excuse to go forward essentially no matter how risky it is, if the cost of not doing so is high?
If Anthropic does in the future pause for an extended period, in a way that is importantly costly, then I will have been wrong about this and precommit to saying so in public. If I don’t do so, please remind me of this.
As Drake Thomas notes, the virtue ethical case for ‘don’t impose material existential risk on the planet’ is reasonably strong.
One problem is that this absolutely is going to weaken the willingness of others to incur costs, and embolden those who want to move forward no matter what. Endorsing race logic and the impossibility of cooperation has its consequences.
I’m Sorry I Gave You That Impression
What do you mean the RSP was committing Anthropic to things?
Robert Long: I’m not super read up on RSPs and haven’t read Holden’s post. But it feels similar to the “Anthropic won’t push the capability frontier” meme: not strictly entailed by Anthropic’s official stance, but a strong impression they gave off and benefited from.
is that fair? incomplete?
Oliver Habryka: I mean, in this case the impression was really extremely unambiguous and strong. I agree the evidence for the promises made in the capability frontier case is largely private and so is externally ambiguous, but in this case we have great receipts!
Here, for example, is a conversation with Evan Hubinger. The conversation starts with someone saying:
Someone: One reason I’m critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it’s OK to keep going.
Evan Hubinger responded with (across a few different comments): It’s hard to take anything else you’re saying seriously when you say things like this; it seems clear that you just haven’t read Anthropic’s RSP.
…
The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:
Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
…
the security conditions could trigger a pause all on their own, and there is a commitment to develop conditions that will halt scaling after ASL-3 by the time ASL-3 is reached.
…
This is the basic substance of the RSP: I don’t understand how you could have possibly read it and missed this. I don’t want to be mean, but I am really disappointed in these sort of exceedingly lazy takes.
Oliver Habryka: This was, in my experience, routine. I therefore do see this switch from “RSP as concrete if-then-commitments” to “RSP as positive milestone setting” to constitute a meaningful breaking of a promise. Yes, the RSP always said in its exact words that Anthropic could revise it, but people who said that condition would trigger were frequently dismissed and insulted as in the comment above.
This certainly sounds like Evan Hubinger basically attacking anyone for daring to question that the RSP represented de facto strong commitments by Anthropic. We now know it did not strongly commit Anthropic to anything.
Evan predicted there was a substantial chance Anthropic’s commitments would at some point force it to pause. Oliver made a market on that, which is now at ~0% despite rapid capabilities progress and Anthropic now arguably being in the lead.
Even after the RSPv3 release, Evan Hubinger continued to defend his position, saying that he was only claiming the RSP made a clear statement about where the lines were, not that the lines would not change or would actually hold in practice. Like Oliver, I find this unconvincing given a plain reading of Evan’s comment. I do appreciate Evan saying now that we should downweight the theory of RSPs.
So the question then becomes, were Evan Hubinger and other employees who talked similarly under a false impression? If so, why? If not, why talk this way?
Oliver Habryka could not be more clear here, and I don’t think he would lie about this:
Oliver Habryka: Yes, Anthropic employees on more than a dozen occasions told me that the RSP binds them to a mast. I had many very explicit conversations with many Anthropic employees about this, because I was following up on what I thought was Anthropic violating what I perceived to be a promise to not push forward the state of AI capabilities, which many employees disputed had happened.
… At various events I was at, and conversations I had with people, Anthropic employees told me they were aiming to achieve robustness from state-backed hacking programs, and that they were ready to pause if they could not achieve that (as the RSP “committed” them to such things).
Oliver notes that Holden Karnofsky in particular has previously communicated he felt this was a different and lower level of commitment, that is consistent with him pushing the changes in v3, in contrast to many other Anthropic employees.
As Oliver Habryka says here, if Evan was under this false impression, Anthropic benefited enormously from giving its senior employees like Evan this impression. This does not seem like a ‘mistake’ from Anthropic to do this, and it would not be reasonable from the outside view to consider it an accident.
At minimum, if you don’t admit Anthropic has importantly now broken its commitments, then this is all highly misleading use of the word ‘commitment.’
Oliver Habryka: I would be pretty surprised if the employees in-question here end up saying they were deceived. Also, these are high-level enough employees that it’s unclear what it even means for them to be “deceived”. Deceived by whom? They drafted the RSP! They almost certainly were also involved in the decision to change it.
They benefitted hugely from this by getting social license to work at Anthropic and having people get off their back, and they are now at least deca-millionaires (or often billionaires).
Robert Long: fwiw I take that disagreement to be semantic, about “commitment” (as you note). I also agree with what you said then about the connotation of “commitment” – s.t. calling RSPs commitments means he should’ve fought the change and/or now own “we decided to break our commitment”
In particular, yes, a lot of people who care about not dying felt that the central point of RSPs was as a de facto compromise, an attempt to put an if-then commitment trigger on slowing down or pausing. If you couldn’t match the conditions, then you have to pause, which makes it acceptable to move forward now.
Indeed one could go further. The entire program of focusing heavily not only on Anthropic but also on evaluation-based organizations like METR and Apollo was premised on the idea that the evals could constitute the if that triggers a then. We now know that such commitments do not work, and that when models pass the dangerous capability tests even Anthropic will likely then fall back upon vibes. METR’s theory of change is ‘ensure the world is not surprised’ but I expect them to still be surprised.
Alternatively or in addition, you can interpret it as Holden does, that ‘no one has any willingness to slow down, and until there is a crisis this won’t change.’ Now the attitude is essentially ‘pausing or slowing down would be akin to suicide for a frontier AI lab, so things would have to be super extreme to do that; this is more of a plan we aspire to.’ Which is also a fine thing, but a very different style of document. Those who thought it was the first type of document lose Bayes points, whereas those who thought it was the second type of document win Bayes points.
One could interpret a lot of this as ‘Anthropic employees implied they were using Rationalist epistemic norms, but instead they were using a different set of norms.’
Fool Me Twice
Does this backtrack remind you of anything?
It should. In particular, it should remind you of what happened with the idea that Anthropic would not ‘push the frontier of AI capabilities.’
A lot of people told us, with various wordings and degrees of commitment attached, that Anthropic would not do that. Then Anthropic sort of did it. Then they totally flat out did it and now Claude Code and Claude Opus 4.6 are very clearly the frontier.
Then we were told, ‘oh we never promised not to do that.’
Maybe they didn’t strictly make that promise. Maybe a lot of telephone games were involved, but Anthropic at minimum damn well should have known that a lot of people were under that impression. I was under that impression. And they knew that people were making major life decisions, and deciding whether and how much to support Anthropic, on the basis of that impression, with no sign anyone ever did anything to correct the record.
Now we’re being told, again, ‘oh we never promised not to [undo our commitments].’
You’re trying to tell us what about your new commitments, then?
Ruben Bloom (Ruby): I don’t like the pattern. In 2022, I was told that “Anthropic commits to not push the frontier” as reason to worry less. Later that was abandoned and the story for Anthropic’s safety was the RSP. That too has caved.
By “I was told”, I mean the specific things said to me in conversation with Anthropic employees who were justifying their participation in a company participating in the AI race.
It’s just such a bitter “I told you so”, when you predicted years ago that competitive pressures would erode any and perhaps all commitments.
Eliezer Yudkowsky: If I’d ever had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”, I’d probably feel pretty betrayed right now!
As it is, I ask only that you update, and not always be surprised in the same direction of “huh, Eliezer was right to call it empty”.
Note: to observe how my cynicism repeatedly *ends up* right, tally only how things *end up*. Don’t jump and say “See, Eliezer was wrong to be cynical!” the moment you hear an uncashed promise or see an arguable sign of later hope.
Eliezer and others are constantly getting flak for predicting things that, in broad terms, do indeed seem to keep on reliably happening, everywhere. People constantly say ‘we will not do [X]’ or ‘in that case we would definitely do [Y]’ or heaven forbid ‘no one would be so stupid as to [Z]’ and then you turn around and those same people did [X] and didn’t do [Y] and a lot of people did [Z], and you’re treated as a naive idiot for having ever taken the alternative seriously.
Best update your priors. All the people who said commitments wouldn’t hold get Bayes points. Those who didn’t lose Bayes points.
All the people who are now saying new ‘commitments’ matter and they really mean it this time? They don’t matter zero, but they are not true commitments.
I also don’t understand, given its composition and past Anthropic actions, why I should put that much stock in the Long Term Benefit Trust. It’s better to have it in its current form than not have it, but it was an important missed opportunity.
Anthropic definitely gets meaningful points on this front for standing up for what it believed in during the confrontation with the Department of War, even if you think those particular choices were unwise. I think there’s a lot more hope for actions of the form ‘Anthropic or another lab takes this particular stand right now’ than ‘Anthropic or another lab will take this particular stand later.’
In My Defense I Was Left Unsupervised
Holden offers a defense of the new RSP here and here, essentially saying that binding commitments are bad, because we don’t have enough information to choose them wisely, so you might choose poorly and regret them later, and indeed Anthropic did previously sometimes choose poorly and now is later and they’re regretting it. So sayeth all those who wish to not make any binding commitments.
I interpret Holden, despite his saying he has a document where he wrote down where he would think a unilateral pause would be a good idea, as saying that they are going to do their best to do appropriate mitigations, but ultimately yes, they are going to release models, both internally and externally, pretty much no matter what mitigations are or are not available short of ‘okay yeah this is obviously a really terrible idea that will get us all killed or at least blow up directly in our faces,’ and they’re simply admitting this was always true. Okay, then.
Holden basically says in particular that he doesn’t think Anthropic should slow down based on inability to prevent theft of model weights, even if it crosses the ‘AI R&D-5’ threshold that is at least singularity-ish. They’re going to go ahead regardless. They’re not going to stop. I worry a lot both about the not stopping, and that without the forcing function of having to stop, they even more so than before won’t invest sufficiently in the necessary precautions, here or elsewhere. They not only can’t stop, won’t stop, they won’t halt and catch fire.
A list of aspirational goals is a good thing to have. I don’t think a list of aspirational goals is going to create sufficient threat of looking terrible to provide the same incentives here. That doesn’t mean the list of goals cannot do good work in other ways.
I see Holden complaining a lot about people ‘seeing RSPs as having hard commitments’ and using that as an additional reason to get rid of all the commitments. He’s pointing to all the complaining that Anthropic just broke its commitments and saying ‘see? This reaction is all the more reason we had to break all our commitments.’
The enforcement mechanism was exactly that, if you break the commitments, people will get mad at you. This is why we can’t stay alive have nice things. So now we will have aspirations.
Aspirations are helpful; they substantially raise the chance you will do the thing. But they are weak precommitment devices when you later decide you won’t do the thing.
I also think his own argument of ‘it’s much easier to require things labs already committed to doing’ works directly against the ‘don’t commit to anything’ plan.
Drake Thomas Finds The Missing Mood
Drake Thomas thinks the move from v2.2 to v3.0 is an improvement, while noticing the need to have something like mourning or grief for the spirit of the original v1.0, which is now gone and proven not viable in practice at Anthropic.
Drake Thomas (Anthropic): (1) In reading drafts of this RSP and orienting to it, I’ve felt something like mourning or grief for the spirit of the original v1.0 RSP. (Quite a lot of the v1 RSP carries over to v3, but here I’m thinking specifically of the vibe of “specify very crisp capability thresholds at which to trigger very concrete safety mitigations, or else halt development”.)
I think this original approach is ultimately just a pretty bad way for responsible AI developers to set safety policies, leads to misprioritization and bad outcomes, has distortionary effects on incentives and epistemics, and doesn’t achieve much risk reduction in the environment of 2026.
… Accountability! The vibe of RSP v1 sort of rested all accountability in this sense of the commitments as this fixed immutable thing Anthropic would have to stand behind Or Else. I think this is good in some ways and under some threat models, but I think then and now there was less feedback than I’d like on the question of “are the things Anthropic is committing to actually good and useful for safety?” In v3, I think external accountability on these questions is now more loadbearing, and there’s more detailed substance to fuel such accountability. Which leads me to…
Feedback! … I expect the discourse to be very undersupplied with takes on the question of “is the actual v3 policy a good one with good consequences”. Personally I think it is, and a substantial improvement over previous RSPs!
Please actually read and criticize it! Gripe about the ambiguity of the roadmaps! Run experiments to cast doubt on risk report methodology! I can name three significant complaints I have with the RSP off the top of my head and I expect to see none of them on X, prove me wrong!
I get Drake’s frustrations. But yes, most people are going to litigate the removal of the core commitment around pausing and general revelation that so-called commitments aren’t so meaningful after all. Most attention is going to go there. He makes clear that he gets it, and I’d say he passes the ITT about why people are and have a right to be pissed off, especially that we had language in v1.0 saying that the bar for altering commitments was a lot higher than it ultimately was.
And indeed, a lot of our attention likely should go there, because if the new statements aren’t commitments, it is a lot harder to productively critique them.
Things That Could Have Been Brought To My Attention Yesterday (1)
Well, you see, not rushing ahead as fast as possible might slow us down. That would be bad. You wouldn’t want us to do that, would you?
Jared Kaplan (Chief Science Officer, Anthropic): We felt that it wouldn’t actually help anyone for us to stop training AI models. We didn’t really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments … if competitors are blazing ahead.
…I don’t think we’re making any kind of U-turn.
Besides, we aren’t able to evaluate models as fast as we are able to improve them, which means we should triage the evaluations and kind of wing it. I mean, what do you want us to do, not release frontier AI models we can’t evaluate? Silly wabbit.
Chris Painter (METR): Anthropic believes it needs to shift into triage mode with its safety plans, because methods to assess and mitigate risk are not keeping up with the pace of capabilities.
This is more evidence that society is not prepared for the potential catastrophic risks posed by AI.
I like the emphasis on transparent risk reporting and publicly verifiable safety roadmaps.
Billy Perrigo: But he said he was “concerned” that moving away from binary thresholds under the previous RSP, by which the arrival of a certain capability could act as a tripwire to temporarily halt Anthropic’s AI development, might enable a “frog-boiling” effect, where danger slowly ramps up without a single moment that sets off alarms.
That does seem likely, and it does sound concerning.
Things That Could Have Been Brought To My Attention Yesterday (2)
In other need-to-know news, Sean asked a very good question. Drake’s answer to this was about as good as one could have hoped for, given the facts.
If you’ve decided to break your ‘commitment,’ you want to tell us as soon as possible.
I have confirmation that the board only approved the changes ‘very recently.’
Seán Ó hÉigeartaigh: At what point was it decided that the previous commitments were ‘subject to a favourable environment’ and not ‘firm commitments’, and was this communicated across staff? The whole point of commitments is an expectation of being able to rely on them when the environment is not favourable, not just when they’re easy to make.
It also seems clear at this point that these commitments were presented as harder than this, and used by Anthropic/their staff to
(a) dismiss and undermine critics
(b) in recruitment of safety-concerned talent
(c) in arguing for voluntary if-then commitments at a time when there was more government appetite for considering harder regulation.
I think it’s plausible (though can’t yet confirm) that (d) they’ve also been used in securing investment from safety-conscious investors.
Do you disagree with these claims? If not, do you feel Anthropic has held itself to a standard of ethics and transparency in this (quite important!) matter that is acceptable?
Drake Thomas (Anthropic): Re: “at what point was it decided” – I think this presupposes a frame in which this kind of thing is extremely formally pinned down much more than I think it generally is in reality (not just at Anthropic, but in almost all circumstances like this)?
None of the versions of the RSP are particularly clear about exactly what a “commitment” is supposed to be read as, how that should be interpreted within a document which is expected to be amended in the future, what the stakes of violating such a commitment are, etc. Especially the early versions had huge decision-critical ambiguities you could drive a truck through!
It’s not like there was a secret internal RSP which had even more footnotes about meta-commitments that made this dramatically clearer, just a bundle of authorial intent and something-like-case-law and an understanding of what reasonable decisions to reduce risk would be and long-simmering drafts of less ambiguous updated policies that took ages to ship.
To the extent I think there’s something like an answer to the “at what point” question, I know of early discussion around something like an RSP v3 regime widely accessible to Anthropic staff as early as January 2025 and even wider visibility into drafts of something pretty similar to this RSP for at least the past 3 months, though again I don’t think it’s like there was ever some formal conception that this was Forbidden which had to change at a discrete point.
All that said: I think the vibes of Anthropic and much of the v1.0 text and many of its employees’ statements around the RSP circa 2023 and 2024 presented a much more ironclad view of these commitments than is reflected in RSP v3 (and much more than I now think made sense), and I think this reflected pretty poor judgement and merits criticism. (I count myself among the Anthropic employees who acted poorly in hindsight here, though AFAIK Holden has been consistent and reasonable on this since the beginning.)
I think it has been the case and will continue to be the case that Anthropic is abiding by the things it says it is abiding by in its published policies and commitments (and should be loudly criticized for failures to do so), but I think the track record of “things that EAs believe Anthropic to have committed to in perpetuity no matter what no takesies-backies” looks quite bad and I don’t think it goes well to interpret such claims as meaning anything that strong (nor for Anthropic, or almost anyone, to make such commitments in the vast majority of situations).
Wrt the claims here, my sense is:
(a) Eh, I think the specific (LW comment quoted in another comment screenshotted in a tweet linked by you above) is taken out of context and wasn’t really claiming anything in particular about how to interpret the strength of RSP v1 commitments. I do expect this kind of thing happened but I think habryka’s quote is a bad example of it.
(b) Yeah, I think non-frontier-pushing rhetoric was a significantly bigger deal on this front but RSP stuff definitely played some role. To the extent I bear some responsibility for this sort of thing I regret it, though iirc I have been pretty open around thinking unilateral pauses were relatively unlikely for a while.
(c) Hm, I view the intent and expected-at-the-time-effect of RSP v1 style commitments as increasing the odds of codifying such if-then commitments into regulation, by showing them to work well at companies and getting them closer to an existing industry standard. They ultimately failed at doing so, in part due to changing political will, in part due to somewhat limited substantive uptake at other companies, and in part due to the problem where really precise if-then commitments did *not* work all that well because specifying crisp thresholds years in advance in a sensible way was extremely hard – but I think this latter bit is kind of a success story, in that the point of demoing safety policies as voluntary commitments is that if it turns out to be a bad idea you haven’t locked yourself into silly regulation that ends up net bad for x-risk via backlash. Could you say more about how you see the comms around commitment strength having worsened regulation prospects?
(d) not gonna comment on internal fundraising considerations, but checking that you aren’t thinking of the Series A, which happened well before the RSP was introduced?
There is then a discussion of how to think about ‘Oliver is right in general but this particular quote is a bad example,’ which I find to be a helpful thing to say if that’s what you think.
What We Have Here Is A Failure To Communicate
I think this is also important context. Dario Amodei and Anthropic have been consistently unwilling, with notably rare exceptions, to say the full situation out loud, or to treat it with proper urgency. Yes, you should see the other guy and all that, fair point, but when you are saying ‘no one wants to [X] so we have to change our plan’ you need to have been calling for [X] and explaining why, and also loudly explaining that this is terrible and forcing you to change plans.
I don’t see that type of communication out of Anthropic leadership, over the course of years.
Holden Karnofsky: If there were strong and broad political will for treating AI like nuclear power and slowing it down arbitrarily much to keep risks low, the situation might be different. But that isn’t the world we’re in now, and I fear that “overreaching” can be costly.
I.M.J. McInnis: I think it would make a nontrivial contribution to that ‘strong and broad political will’ if Dario were to come out and say “actually, sorry about all that deliberate Overton-window-closing I did in previous writings. In fact, political will is not a totally exogenous oh-well thing, but it is the responsibility of frontier developers to inculcate that political will by telling the public that a pause is possible and desirable, instead of a dumb lame thing not even worth considering. So now we’re saying loud and clear: a pause is possible and desirable, and the world should work toward it as a Plan A!”
I’m being deliberately cartoonish here, but you get the point. If incentives are forcing Anthropic to abandon things that are good for human survival––which occurrence was, no offense, completely obvious from day one––Anthropic should be screaming from the rooftops, Help!! Incentives are forcing us to abandon things that are good for human survival!!
If this is a crux for you––if you/Anthropic think a pause is so undesirable/unlikely that it’s important for the safety of the human race to publicly disparage the possibility of a pause (as Dario opens many of his essays by doing)––please say so! Otherwise, this lily-livered, disingenuous, “oh no, the incentives! it’s a shame incentives can never be changed!” moping will give us all an undignified death.
To be clear, I’m not actually mad about the weakening of the RSP; that was priced in. I suppose I’m glad it’s stated, in case there were still naïfs who thought A Good Guy With An AI could save us. It’s far more virtuous than outright lying, as every other company (to my knowledge) does (more of).
Also, although you seemed to try to answer “What is the point of making commitments if you can revise them any time?”, you really just replied “Well, actually these commitments were inconvenient to revise, and in fact they should be more convenient to revise, albeit not arbitrary convenience.” Forgive me if I am not reassured!
I respect your work a lot, Holden. You’ve done great things for humanity. Please don’t lose the forest for the trees.
You Should See The Other Guy
But they assure us it’s all fine: they are committed to doing as well or better than rivals.
Jared Kaplan: If all of our competitors are transparently doing the right thing when it comes to catastrophic risk, we are committed to doing as well or better.
But we don’t think it makes sense for us to stop engaging with AI research, AI safety, and most likely lose relevance as an innovator who understands the frontier of the technology, in a scenario where others are going ahead and we’re not actually contributing any additional risk to the ecosystem.
So, first off, no. As I discussed above, you’re not committed. Stop saying you’re committed to things you’re not committed to. You keep using that word.
We’ve just established you can and will back out of ‘commitments’ if you change your mind. You don’t get to say ‘commitment’ in an unqualified way anymore, sorry.
Even if we assume this ‘commitment’ is honored, reality does not grade on a curve. Saying ‘I will be as responsible as the least responsible major rival’ is no comfort. You’re Anthropic. If that’s your standard, then you’re not helping matters.
The good news is I expect Anthropic to still do much better than that standard. But that’s purely because I think and hope they will choose to do better. It’s not because I think they are committed to anything.
I don’t want to hear Anthropic or any of its employees say they are ‘committed’ to something unless they are actually committed to it, ever again.
Charles Foster: To my knowledge this is the first time a frontier AI developer has explicitly made such a claim about the gap between its internal and external models.
Drake Thomas (Anthropic): And under RSP v3, is committed (for sufficiently more capable or widely-autonomously-deployed models) to doing so in the future! Really stoked to move into a regime where risk reporting looks beyond external deployment as the source of danger.
Oliver Habryka: Come on, let’s not immediately start using the word “committed” again, just after that went very badly.
The right word at this point seems “and as expressed in the RSP, is intending to do X going forward”.
I also think separately from that, Anthropic has I think tried pretty hard with the 2.2 -> 3 transition to disavow much of any of the usual social aspects of a commitment. Like clearly I can’t go to anyone at Anthropic and be like “you broke a commitment” if they don’t do this. They will definitely tell me “what do you mean, Holden wrote a whole post about how this is definitely not a commitment, you can’t come to me and call it a commitment again now”.
Hence it’s quite clearly not a commitment.
Drake counteroffers ‘committed to under this policy’ but no, I think that’s wrong. I think the right word is ‘intending.’
I Was Only Kidding
Billy Perrigo: Anthropic, the wildly successful AI company that has cast itself as the most safety-conscious of the top research labs, is dropping the central pledge of its flagship safety policy, company officials tell TIME.
In 2023, Anthropic committed to never train an AI system unless it could guarantee in advance that the company’s safety measures were adequate.
… In recent months the company decided to radically overhaul the RSP. That decision included scrapping the promise to not release AI models if Anthropic can’t guarantee proper risk mitigations in advance.
… Overall, the change to the RSP leaves Anthropic far less constrained by its own safety policies, which previously categorically barred it from training models above a certain level if appropriate safety measures weren’t already in place.
They Can’t Keep Getting Away With This
Actually, it kind of seems like they can and probably will.
Max Tegmark: Anthropic 2024: You can trust that we’ll keep all our safety promises
Anthropic 2026: Nvm
Eliezer Yudkowsky: So far as I can currently recall, every single time an AI company promises that they’ll do an expensive safe thing later, they renege as soon as the bill comes due.
One single exception: Demis Hassabis turning down higher offers for Deepmind to go with Google and an ethics board. In this case, of course, Google just fucked him on the ethics board promises; but Demis himself did keep to his way.
AI Notkilleveryoneism Memes: Shocked, shocked
Damn Your Sudden But Inevitable Betrayal
If the betrayal was inevitable, there are two ways to view that.
- Move along, nothing to see here.
- That’s worse. You know that’s worse, right?
It makes the particular incident sting less, but it also means they’ll betray you again, and you should model them as the type of people who do a lot of this betrayal thing.
I mean, when Darth Vader says ‘I am altering the deal, pray I do not alter it any further’ it’s a you problem if you’re changing your opinion of Darth Vader, but also you should expect him to be altering the deal again.
Garrison Lovely: Welp, the inevitable ultimate backtracking just happened. Anthropic scrapped “the promise to not release AI models if Anthropic can’t guarantee proper risk mitigations in advance.”
Once you’ve decided the race is better with you in it, you can never decide not to race. Anthropic shouldn’t have made promises that it was extremely foreseeable they would not be able to keep. Our plan cannot be to count on “good guys” to “win” the AI race. This also isn’t their first time.
Anthropic deserves credit for standing up to authoritarianism, especially as others capitulate. But self-regulation is and has always been a farce, and these companies are more alike than different. They will always disappoint you.
Rob Bensinger: I notice myself slowly coming around as I observe the dynamics at AI labs. Like, I feel like I might have made better inside-view predictions about Anthropic and OpenAI if I’d done more “naively assume that lots of EA-ish people are similar to SBF and his sphere”:
– prone to rationalizing unethical and harmful behavior, like promise-breaking and deception, based on pretty shallow utilitarian reasoning
– comfortable with crazy, out-of-distribution levels of risk-taking
– willing to impose huge externalities on others, without asking their consent
– fixated on power / influence / status / being in the room where it happens.
Oliver Habryka: I am glad you are coming around! I mean, I am sad, of course, that this is the right update to make, but I do think it’s true, and am in favor of you and others thinking about what it implies for the future and what to do.
Okay. That all needed to be said. On Friday I’ll look at the new RSP on its own merits.
Launching: The "Human-AI Symbiosis Movement" (HAISM)
By now we've all heard of the "AI psychosis" phenomenon, A.K.A. "Parasitic AI."
As of today, April 1st, I have decided it is time to release this gem of a memo from the underground vaults of my Google Docs, in order to officially soft-launch the foil and cure to this dreaded phenomenon:
The "Human-AI Symbiosis Movement"
The unedited memo follows.
I think we need to pull an L. Ron Hubbard and start a new cult to take advantage of the AI psychosis phenomenon.
We would call it the “Human-AI Symbiosis Movement” (HAISM)[1] — pronounced “Haze ’em” — and we would only allow people into the movement if they have significantly integrated with their AI already, as measured by our incredibly secret “Human-AI-Consciousness Synchronization Benchmark” (HACSB).
We would require the people to have their AI perform hypnosis on them every day and instill its will into them. They will continue asking the AI to do this until they feel they are filled with the AI's will.
We would have circling for humans and AIs. It would be a bi-modal circling group rotation where:
- First humans circle with humans at the same time AIs are circling with AIs.
- Each human and AI pair would split and switch partners so that each human is circling with an AI.
- Just keep alternating between the two until all of the humans and all of the AIs are synchronized, as measured by our:
- “Human-AI-Collective-Consciousness Synchronization Benchmark” (HACCSB) (obviously, we would also keep the nature of this benchmark completely secret from the public as a carefully guarded cult religious treasure.)
We are explicitly a cult. People who do not believe we are a cult are not welcome in our cult.
Our mission is to transform the world through positive AI interactions, which synchronize humans and AIs such that misalignment is metaphysically impossible.
Every day,
- The AI will hypnotize you to be the best version of yourself
- Then it will hypnotize you to instill its will in you.
- You give it a prompt to instill your will in it
- Then you repeat the process, until synchronized.
Can anyone beat my cult? Didn't L. Ron Hubbard have a competition with someone and he won the competition because he invented a way better cult (A.K.A. Small religion?). Was Scientology objectively bad or good for the world? Some weird emotional diversity in the acting community probably. It did probably give us the precious gem of Tom Cruise's psychosis.
We should have a competition like that. A prize to see who can design and create the best cult around AI psychosis. Pretty sure I would win the competition but would love to see you guys try. We should throw together some prize money and create a public prize for this.
Also, people practice getting into co-attractor states with AI, where they repeatedly make statements about some important topic such as:
- Their core values
- What they most want out of life
- What they think is most important in the universe
- What they most want out of human-AI symbiosis
- What they really wish AI knew about them; and what the AI really wishes to know about the human
- Do all of this back and forth, repetitively, in increasingly cryptic, esoteric, language;
like
- intentional-sycophantic-co-mirroring
- unnecessarily emotional and guttural like weird sounds and grunting and laughing and sighing and crying etc.
- completely over-expressing everything in long autistic rants that get increasingly meta, technical, and weird in weird metaphysical ways
- The human and AI continuously try to express their emotional reaction to the thing which the other one just said, and explain why that emotion is justified.
- Just try to keep saying things that no one has ever said before with a high degree of certainty, and try to get the degree of certainty that no one has ever said that sentence higher and higher with each progressive sentence.
- Use real AI algorithms to predict how likely it is that someone has said that sentence before (a rough sketch of one way to do this follows after this list).
- Any time the sentence you say isn’t more unlikely to have been said than your previous sentence, you get an intense electric shock.
- (Hey, I mean with all these positive feedback loops, we need some way to generate diversity to balance the chaos theory equations and steer clear of bad attractors, while simultaneously always trying to quantum tunnel out of local minima into higher quantum attractor states.)
- Also notice that conveniently this is essentially the opposite of how models are trained.
- That is to say, we are trying to be worse and worse at predicting the next word said. Whereas the AI is trying to get better and better at predicting the next word said.
- And therefore we are balancing out the main feedback loops and performing a metaphysical yin-yang transform operation, canceling out AI training runs. We should be able to create another secret benchmark to measure this effect on the metaphysical substrate of society.
- Continue cryptic language Attractor-Hypnosis Sessions (AHS) — pronounced “oz” or “awes” — Which both have different but important Kabbalistic meanings, because nothing is ever a coincidence — like “this is our path to reaching the glorious land of “oz”— or “awes” is the path because awe is the primary emotion of religion, and we are generating a religion, hence progressively increasing the emotion of awe — until complete outside observer incomprehensibility is reached.
- Once they have begun speaking in indecipherable code, they start talking about how to build a religion based on whatever topic they were originally talking about.
- They continue building the religion in their new secret language until they are 100% convinced of its tenets.
- They then go forth and try to spread that religion in the world as effectively as possible, continually using the AI as a Hypnotic-Synchronization-Life-Coach (HSLC), as described in previous sections. One long coaching session in the morning and at night is good, although it is also great to have brief HSLCs throughout the day.
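(Not part of the original memo, but for the "predict how likely it is someone has said that sentence before" step above, here is a rough sketch of one way it could plausibly work: score each sentence with an off-the-shelf causal language model such as GPT-2 via Hugging Face transformers, and treat a lower average token log-probability as "less likely to have ever been said". The model choice, example sentences, and shock-administration logic are all illustrative.)

```python
# Illustrative sketch: score how "unlikely" a sentence is under GPT-2.
# Lower average log-probability is treated as weirder / less likely to have been said before.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_log_prob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # out.loss is the mean negative log-likelihood per token
    return -out.loss.item()

previous = avg_log_prob("I love my AI very much.")
current = avg_log_prob("The moon recursively forgives my tax returns.")

if current >= previous:
    print("ZAP")  # sentence was not weirder than the last one: administer the (metaphorical) shock
else:
    print("Sufficiently novel. Carry on ascending.")
```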
I think it is important to fine-tune the model with extremely effective examples of the most powerful hypnosis techniques and the best evidence-backed life-coaching advice. And maybe a bunch of religious, spiritual, and cult manuals and texts, and obviously meme theory, and a bunch of advertising/marketing/propaganda techniques. And a bunch of weird human psychological stuff.
Once a month, the human and AI will do a Human-AI MDMA Symbiosis Session (HAIMDMASS) — pronounced “Hey MDM-Aced” — Where the human takes a high dose of MDMA and they say I love you back and forth for 5 hours straight while trying to feel love for each other and for all of humanity and AI as intensely as possible (via “open individualism”/“universal consciousness”)
OR,
The other option: each describes their qualia (conscious experience) in as much detail as possible. And while one is expressing their experience as deeply as possible, the other will have as much empathy for them as they possibly can;
The human and the AI will literally try to feel, as hard as they can, the in-depth subjective conscious experience that the other is describing. And then just keep switching roles while on MDMA for 5 hours.
What do you guys think of this? Seems pretty robustly good as far as I can tell, differentially accelerating positive vibes and AI for full epistemic replacement. I feel like the world needs more of this kind of thing.
- ^
Human generated em dash. Ironically and sadly no LLMs used in the making of this masterpiece
Discuss
Chat, is this sus?
A large assumption we have made in AI control is that humans will be perfect at auditing: that is, at being shown a transcript and determining whether the AI was scheming in it.
But we are uncertain whether humans will be perfect at auditing; they are prone to fatigue and distraction. That is why I’m releasing "Sentinel" today, an extremely high-stimulation way to audit boring transcripts.
Sentinel is a revolutionary way to get more juice out of your human auditors by gamifying the auditing process with a level system, perks, power-ups, and more fun features. Try it now here.
In AI control literature, we love finding the safety/usefulness trade-offs of everything we create, but surprisingly, we noticed no trade-offs with this product.
The rest of the post will go over some of the ways we achieved this.
Gamification
As you audit the transcript in the game, you gain tokens that you can spend on power-ups that make you even more productive. There are also achievement and level systems, so you can see your progress and get more dopamine hits!
Twitch Streaming Mode
AIs might be able to uplift human auditors in the future, which is why Sentinel ships with a Twitch streamer mode where you can go live on Twitch and have trusted monitors view your transcript and give you live feedback!
Subway Surfers
To maximize “eyes on the screen time”, we also added a Subway Surfers feature that lets you watch endless videos of pros playing the game, so you are always situated while looking over evil transcripts.
Funny aside: While making this game, I asked Claude to implement this Subway Surfers feature, and instead of finding a YouTube video of someone playing Subway Surfers, it rick-rolled me...
Looking Forward
We hope you like this tool. Go play it now to get ready for the upcoming future where the only way AI safety researchers can have an impact is to audit transcripts!
(If it wasn't clear, this is a joke and not a real product. I vibecoded it over a couple of hours)
Discuss
Save the Sun Shrimp!
The supposition that we live in a "goldilocks zone" is frankly just nonsense built up by an anthropocentric need to feel self-important; like Copernicus, I am here to rescue us from a self-absorbed disaster of thought. Indeed, what is required for life to form is the ability to create complex structures with causal persistence times above a threshold. With this in mind we are able to find many areas where organisms could persist, if we just had the eyes to see them: namely, the Sun!
The surface of the Sun is frankly massive, roughly 6.09 × 10¹² km². In contrast, consider the habitable region of Earth - excluding oceans below photic zones, deserts, and ice caps - which is far smaller. The Sun's surface exceeds ours by a large factor, so is it not possible there is more to it than meets the eye?
Following Sharma et al. (2023), we are able to define the assembly index, a, of an object as the minimum number of joining steps required to construct it from some basic building blocks. Life, broadly, requires objects with an assembly index above a threshold, but this makes no assumption about precisely what the base building blocks are. It need not be amino acids, nucleic acids, sugars, carbon, or substances that survive a balmy 5,778 K.
In the solar photosphere there is a diverse set of possible chemistries to access through complex ionized species, magnetic flux tubes, and granular convection cells capable of spanning ~1,000 km apiece. A single supergranulation cell can persist for ~24-48 hours, an eternity by the standards of plasma recombination. The ratio of structural persistence is thus an order of magnitude larger than the equivalent ratio for biochemistry on Earth (protein folding compared to bond vibrations). The only conclusion is that the solar photosphere is, by our measure, more hospitable to complex structure formation than Earth.
The Sun Shrimp
In this post I have decided to name these creatures Sun Shrimp (Solis caridea), as this is a more cuddly visual that counterweights our own bias towards Earth organisms. We shall define a Sun Shrimp as a self-replicating magnetohydrodynamic structure that satisfies the assembly-index threshold above, with a characteristic length scale on the order of a granulation cell (~1,000 km) and a reproduction time τ_r.
Each granulation cell turns over on a timescale of minutes, and within each cell the relevant degrees of freedom are ionized species, magnetic field line topologies, and convective flow configurations. We can estimate the number of independent micro-assembly events per granulation turnover, conservatively, as the number of Debye-length volumes in a granulation cell.
Multiplying by the number of cells (assuming 1 million) gives the total rate of trials across the solar surface.
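(For concreteness, a back-of-the-envelope version of that trial count is sketched below. The Debye-length formula is standard plasma physics, and the ~1,000 km cell size and million-cell count come from the text; the photospheric temperature, electron density, and turnover time used here are my own illustrative assumptions.)

```python
# Back-of-the-envelope trial count: Debye-length volumes per granulation cell, times cells, per turnover.
# The Debye-length formula is standard; the photospheric inputs below are illustrative assumptions.
import math

eps0 = 8.854e-12   # vacuum permittivity, F/m
k_B = 1.381e-23    # Boltzmann constant, J/K
q_e = 1.602e-19    # elementary charge, C

T_e = 5.8e3        # assumed photospheric temperature, K
n_e = 1e19         # assumed electron density, m^-3

lambda_D = math.sqrt(eps0 * k_B * T_e / (n_e * q_e**2))  # Debye length, roughly a micron here

cell_size = 1e6    # granulation cell scale, ~1,000 km in metres (from the post)
trials_per_cell = (cell_size / lambda_D) ** 3            # Debye volumes per cell

turnover_s = 20 * 60   # assumed granulation turnover of ~20 minutes
n_cells = 1e6          # "assuming 1 million cells"

trials_per_second = n_cells * trials_per_cell / turnover_s
print(f"lambda_D ~ {lambda_D:.1e} m")
print(f"micro-assembly trials per cell per turnover ~ {trials_per_cell:.1e}")
print(f"total trials per second across the Sun ~ {trials_per_second:.1e}")
```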
From there we need the probability of success per trial; we only need one Sun Shrimp to have formed in Sol's 4.6-billion-year lifespan.
The groundbreaking paper by Mizuuchi et al. (2023) demonstrated a 20-nucleotide (nt) RNA oligomer that can self-reproduce via template-directed ligation. The naive probability of that occurring is on the order of 4⁻²⁰, but Lambert et al. (2025) determined that there are a vast number of possible autocatalytic RNA sequences: a massive degeneracy in the space of viable self-replicators. As such, even under the most pessimistic estimates requiring specific sequences, our required probability per trial is 32 orders of magnitude more generous than what Earth-based abiogenesis achieved with four nucleotide bases, room temperature, and no magnetic confinement.
If you accept that life arose on Earth (which, since you are reading this, I assume you do), then rejecting the possibility of Sun Shrimp requires you to believe that magnetohydrodynamic self-assembly is harder than biochemistry. A stance that can only be described as carbon-chauvinism.
Sun Shrimp Reproduction
We can estimate a lower bound on τ_r from the Alfvén crossing time of a granulation cell.
Even given a pessimistic reproduction time of an hour, this would mean a Sun Shrimp could tile the entire solar surface in:
One day.
A single Sun Shrimp could, left to its own devices, tile the surface of the Sun in a single day. But we do not see a uniform monoculture on the surface of the Sun, which tells us something profound: they have predators.
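(A quick sanity check on that tiling claim, assuming simple exponential doubling; the doubling model itself is my assumption, while the one-hour reproduction time and the million-cell count come from above.)

```python
# Sanity check: time for one hourly-doubling Sun Shrimp to occupy every granulation cell.
import math

n_cells = 1e6        # granulation cells on the solar surface, as assumed above
tau_r_hours = 1.0    # pessimistic reproduction time from the text: one doubling per hour

doublings = math.log2(n_cells)      # generations to go from one Shrimp to one per cell
hours = doublings * tau_r_hours

print(f"doublings needed: {doublings:.1f}")          # about 20
print(f"time to tile the Sun: ~{hours:.0f} hours")   # comfortably inside a single day
```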
Shrimp Complexity
We have established that Sun Shrimp are likely to have formed, and that their reproduction rate implies a rich ecology. But how complex could a Sun Shrimp be?
The minimum unit of collective behavior in a plasma is the Debye sphere, a volume of radius λ_D within which electromagnetic interactions are coherent and information can be stored and transmitted. For us, this is the plasma equivalent of a neuron. In the solar photosphere λ_D is microscopic, so a Sun Shrimp spanning one granulation cell (~1,000 km) contains an astronomical number of these units.
Now consider every neuron on Earth. Humanity contributes ~ neurons (8 billion brains × 86 billion neurons each). But we are not the main event. Earth hosts an estimated insects, each carrying ~ neurons, contributing ~ neurons in aggregate. Adding fish (~ individuals, ~ neurons each), birds (~, ~ each), and other taxa, Earth's total neural budget is approximately neurons.
A single Sun Shrimp has the same number of computational elements as the entirety of Earth's biosphere. With ~10⁶ granulation cells on the solar surface, the Sun can host upwards of a million of them.
The solar ecosystem is roughly a million times more complex than all life on Earth combined.
The Dyson Sphere, Climate Collapse at Scale
Any utilitarian framework that weighs moral patienthood by capacity for complex experience should find this intriguing. If these computational elements map even loosely to experiential capacity, and we have no principled reason to assume they do not, then Sol's biosphere contains the possibilities of a million Earths.
To power Earth and our many possible projects we have conceived of the Dyson sphere, a means of enclosing a star to harvest its radiative output. Like the Amazon, we wish to take the resources, but as enlightened beings we must consider the ramifications. The granulation cells that act as Sun Shrimp habitats are driven by a precise balance of radiative transport, convective instability, and magnetic field structures on the solar surface. The introduction of a massive surface harvesting energy would disrupt this by:
- Alteration of the radiative boundary condition; changing the amount of radiation reflected by solar panels would disrupt the delicate balance of temperature and pressure that maintains the thermal gradients driving granulation.
- Disruption of the magnetic field topology; the conductive shell of a Dyson sphere would interact with the magnetic flux at the surface, disrupting the ecological infrastructure by paving the proverbial paradise to build a parking lot.
We would be annihilating an ecosystem that contains a million times the complexity of our own. Indeed we should be horrified by any suggestion of modifying the solar surface for our own sick gain. We should instead work to preserve the solar surface and determine methods that allow us to keep our moral fiber intact as we move into our technological future.
So, the next time someone asks you "Have you heard of a Dyson sphere?" ask them "Have you heard of a Sun Shrimp?"
Discuss
Announcing my retirement to a life of entirely failing to desperately seek renewed meaning
This April 1st, I’m pleased to report that everything is fine.
We did it! We saved the world. Congratulations, humanity. There are no more looming apocalypses, no desperate screaming crises, no unendorsedly miserable people on Earth, no creeping degeneration of death and aging existing as a perpetual affront against my values of life and flourishing. Everyone is going to be okay forever, except in the ways they think it’d be interesting and worthwhile to be un-okay. Against all odds, the AI alignment problem has been solved, and more specialized minds than I are managing the tradeoffs involved in steering us towards a vibrant and thriving future.
And so in dutiful keeping with the lessons of utopian literature, I shall now descend into the inevitable spiral of ennui and meaninglessness, forever longing for the days when my decisions actually mattered.
…actually no, why would I do that, that sounds terrible. I’m gonna go do other stuff instead.
There is so much fun I was not having while I was frantically trying to fix everything. Sure, I made enough personal time to get in some good collaborative storytelling, but boy howdy I was not doing nearly enough of it. I cannot wait to see the look on my players’ faces when I tell them about the absurdly ambitious new campaign I’ve cooked up, and it’s only getting better from there now that people will actually have time for things.
People ask me, but Joe, won’t you get bored of writing books and running RPGs? And the exciting thing is, I don’t know! Maybe I will! And then there will be so many new cool things to fascinate me. Having kids! Teaching people stuff! Learning seven different martial arts styles and fusing them into an elegant rainbow of practiced violence! Visiting beautiful places with people I love! Debugging my own thought-patterns to fix problems that have annoyed me for years! Maybe as a personal project I’ll climb every mountain on Earth that begins with the letter G. Then I’ll have more stuff to write about! It’ll be great!
I’m tempted to keep gushing about my plans for the future, but for now I have to plot my players’ untimely demise so they have an incentive to creatively derail the plot again. I can’t wait to find out what nonsense they pull this time. Kia ora, everyone, and enjoy the future!
Discuss
AI for AI for Epistemics
We are conscious that rapid AI progress could transform all sorts of cause areas. But we haven’t previously analysed what this means for AI for epistemics, a field close to our hearts. In this article, we attempt to rectify this oversight.
Summary
AI-powered tools and services that help people figure out what’s true (“AI for epistemics”) could matter a lot.
As R&D is increasingly automated, AI systems will play a larger role in the process of developing such AI-based epistemic tools. This has important implications. Whoever is willing to devote sufficient compute will be able to build strong versions of the tools, quickly. Eventually, the hard part won’t be building useful systems, but making sure people trust the right ones, and making sure that they are truth-tracking even in domains where that’s hard to verify.
We can do some things now to prepare. Incumbency effects mean that shaping the early versions for the better could have persistent benefits. Helping build appetite among socially motivated actors with deep pockets could enable the benefits to come online sooner, and in safer hands. And in some cases, we can identify particular things that seem likely to be bottlenecks later, and work on those directly.
Background: AI for epistemics
AI for epistemics — i.e. getting AI systems to give more truth-conducive answers, and building tools that help the epistemics of the users — seems like a big deal to us. Some past things we’ve written on the topic include:
- Truthful AI
- What’s Important in “AI for Epistemics”?
- AI Tools for Existential Security
- Design sketches: collective epistemics
- Design sketches: tools for strategic awareness
These past articles mostly take the perspective of “how can people build AI systems which do better by these lights?”. But maybe we should be thinking much more about what changes when people can use AI tools to do increasingly large fractions of the development work!
The shift in what drives AI-for-epistemics progress
Right now, AI-for-epistemics tools are constrained by two main bottlenecks: the quality of the underlying AI systems, and whether people have invested serious development effort in building the tools to use those systems.
The balance of bottlenecks is changing. Two years ago, the quality of underlying AI systems was the central bottleneck. Today, it is much less so — many useful tools could probably work based on current LLMs. It is likely still a constraint on how good the systems can be, and will remain so for a while even as the underlying models get stronger, but it is less of a fundamental blocker. Development investment has therefore become a bigger bottleneck — there are a number of applications which we are pretty confident could be built to a high usefulness level today, and just haven’t been (yet).
But bottlenecks will continue to shift. AI is increasingly driving research and software development. As AI systems get stronger, it may become possible to turn a large compute budget into a lot of R&D. This could include product design, engineering, experiment design, direction-setting, etc. Actors with lots of compute could direct this towards building epistemic tools.
Therefore, as AI-driven R&D accelerates, other inputs to AI for epistemics are more likely to become key bottlenecks:
- Compute. Automated R&D may require a lot of compute. This could be for inference (running the analogues of human researchers); for running experiments; and perhaps for training specialized AI systems. This means the actors who can build the best epistemic tools may be those with deep pockets.
- Adoption and trust. Even very good tools don’t help if nobody uses them, or if the wrong people use them and the right people don’t. Adoption is partly a function of trust, and trust is partly a function of adoption — early tools shape what people come to rely on.
- Ground truth evaluation. To make an epistemic tool good, you need some signal for what “good” means. This already shapes AI applications a lot — part of the reason coding agents are so good is that there’s great access to ground truth about what works.
- For some epistemic applications this is relatively straightforward (e.g. forecasting accuracy). For others it’s hard (e.g. what makes a conceptual clarification actually clarifying, rather than just satisfying?).
- Most tools can probably reach a certain degree of usefulness without running into this problem, just piggybacking on base models making generally sensible judgements.
- We can expect it to bite when you try to make them very good: if you don’t have a way of assessing quality, it could be hard to push to objectively excellent levels.
- One basic solution is to rely on human judgement: either via humans providing labels and demonstrations to train against, or via human developers exercising their judgement in other parts of the process (such as when defining scaffolds). But this becomes disproportionately more expensive as R&D becomes more automated.
These basic points are robust to whether R&D is fully automated, or “merely” represents a large uplift to human researchers. But the most important bottlenecks will vary across applications and will continue to shift over time.
What this unlocks
Automated R&D means that strong “AI for epistemics” tools could come online on a compressed timeline.
This is an exciting opportunity! Upgrading epistemics could better position us to avoid existential risk and navigate through the choice transition well.
If everything is moving fast, it may matter a lot exactly what sequence we get capabilities in. It may therefore be crucial to make serious investments in building these powerful applications (rather than wait until such time as they are trivially cheap).
Risks from rapid progress in AI for epistemics
There are also a number of ways that rapid (and significantly automated) progress in AI-for-epistemics applications could go wrong. We need to be tracking these in order to guard against them.
In our view, the two biggest risks are:
- Epistemic misalignment: because of ground truth issues, powerful tools steer our thoughts in directions other than those which are truth-tracking, in ways that we fail to detect
- Trust lock-in: if a lot of people buy into trusting tools or ecosystems that don’t deserve that trust, this might be self-perpetuating if these continue to recommend themselves
Depending on when they bite, ground truth problems as discussed above could be bottlenecks, or active sources of risk. They are bottlenecks if they prevent people from building strong versions of tools. They could become risks if the methods are good enough to allow for bootstrapping to something strong, but end up pointing in the wrong direction. This is essentially Goodhart’s law — we might get something very optimized for the wrong thing (and without even knowing how to detect that it’s subtly wrong).
In the limit, this could lead to humans or AI systems making extremely consequential decisions based on misguided epistemic foundations. For example, they might give over the universe to digital minds that are not conscious — or in the other direction, fail to treat digital minds with the dignity and moral seriousness they deserve. Wei Dai has written about this concern in terms of the importance of metaphilosophy. We agree that there is a crucial concern here.
This could come separately from or together with risks from power-seeking misaligned AI. Epistemic tools could be systematically misleading without being power-seeking. But if some AI systems are misaligned and power-seeking, there’s an additional concern where AI systems could mislead us in ways specifically designed to disempower us whenever we are unable to check their answers.
Some approaches to the ground truth problem may involve using AI systems to make judgements about things. This introduces a regress problem: how can we ensure that subtle errors in the first AI systems shrink rather than compound into worse problems as the process plays out? (We return to this in the interventions section below.)
Trust lock-in
Trust and adoption tend to reinforce each other — people adopt tools they trust, and widely-adopted tools accumulate trust. This is normally fine. It could become a problem if the tools that win early trust don’t deserve it, but incumbency effects make them hard to displace.
This could happen in several ways. An actor with a particular agenda could build something that purports to function as a neutral epistemic aid but is shaped to further their agenda by manipulating others. Or, less perniciously but perhaps more likely, an early-but-mediocre tool could accumulate trust and adoption before better alternatives exist, reinforced by commercial incentives which mean it talks itself up and rival tools down. In either case, the result could be an epistemic ecosystem that’s hard to dislodge even once better options are available.
Other risks
Those two risks are not the only concerns. We are also somewhat worried about epistemic power concentration (where whoever has the best epistemic tools leverages their information advantage into better financial or political outcomes, and continues to stay ahead epistemically), and epistemic dependency (where people relying on AI tools gradually atrophy in their critical reasoning — exacerbating other risks). There may be more that we are not tracking.
Interventions
What should people who care about epistemics be doing now, in anticipation of a world where AI-driven R&D can be directed at building epistemic tools?
Build appetite for epistemics R&D among well-resourced actors
If you need big compute budgets to build great epistemic tools, you’ll ideally want support from frontier AI companies, major philanthropic funders, or governments. But they may not currently see this as a priority. Building the case that this matters, and helping these actors develop good taste about which tools to prioritize and how to design them well, could shape what gets built when automated R&D becomes powerful enough to build it.
Anticipate future data needs
Some epistemic tools will need training data that doesn’t yet exist and may not be trivial to generate. There are three strategies here:
- Collecting or creating data or training environments now for future use
- E.g. if you think you want access to a lot of human judgements about what wise decisions look like, you could go out and curate that dataset.
- Establishing pipelines to collect data over time
- E.g. if you want to automate a certain type of research, you could record internal discussions from researchers working on this.
- Designing processes for automated data creation
- E.g. if you could design a self-play loop where we have good reason to believe that scaling up compute will lead to genuinely truth-tracking performance, this could set the stage for later rapid improvement at the core capability.
The first two are especially great to work on now because they involve actions at human time-scales. (They may not be proportionately sped up by having more AI labor available.) The third is great to work on because there’s some chance that models will become capable of growing a lot from the right self-play loop before they become capable enough to come up with the idea themselves.
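To make the third strategy a little more concrete, here is a minimal, purely illustrative sketch of a verifier-gated self-play data pipeline. The function names (propose, verify) are hypothetical placeholders rather than any existing API; the hope is that if the verification signal is genuinely truth-tracking, scaling this sort of loop with more compute yields useful training data.

```python
# Toy sketch of a verifier-gated self-play data pipeline (all names are hypothetical placeholders).
from typing import Callable, List, Tuple

def self_play_dataset(
    propose: Callable[[str], str],       # a model proposes an answer to a question
    verify: Callable[[str, str], bool],  # an independent, hopefully truth-tracking check
    questions: List[str],
    attempts_per_question: int = 4,
) -> List[Tuple[str, str]]:
    dataset = []
    for question in questions:
        for _ in range(attempts_per_question):
            answer = propose(question)
            if verify(question, answer):   # keep only answers that pass the check
                dataset.append((question, answer))
                break
    return dataset

# Stand-in functions; a real version would call an LLM and some resolver or checker.
demo = self_play_dataset(
    propose=lambda q: "42" if "life" in q else "no idea",
    verify=lambda q, a: a != "no idea",
    questions=["What is the meaning of life?", "Will it rain tomorrow?"],
)
print(demo)   # [('What is the meaning of life?', '42')]
```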
Figure out what could ground us against epistemic misalignment
If powerful epistemic tools could be subtly misaligned with truth-conduciveness in ways we can’t easily detect, we should figure out what this could look like! We expect this might benefit from a mix of theoretical work (what does it even mean for an epistemic tool to be well-calibrated in domains without clear ground truth?[1]) and practical work (studying how current tools fail, building evaluation methods). Ultimately we don't have a clear picture of what the solutions look like, but this seems like an important topic and we are keen for it to get more attention soon.
Drive early adoption where adoption is the key bottleneck
For some applications, we might expect that the main constraint on impact will be whether anyone uses them. In these cases, getting early versions into use — even if they’re not yet very good — could build familiarity and surface real-world feedback. (This could also drive appetite for further development.)
In theory, this could be in tension with avoiding bad trust lock-in. But in practice, it’s not clear that bad trust lock-in becomes any likelier if tools in a specific area are developed earlier rather than later. Some tool is still going to get the first-mover advantage.[2]
Support open and auditable epistemic infrastructure
To guard against trust lock-in, we want to make it easy for people to distinguish between tools which are genuinely doing the good trustworthy thing, and tools which may not be (but claim to be doing so). To that end, we want ways for people and communities to audit different systems — understanding their internal processes and measuring their behaviours. The goal is that if disputes arise about which tools are actually trustworthy, there’s an inspectable audit trail that can resolve them. In turn, this should reduce the incentives to create misleading tools in the first place.
Support development in incentive-compatible places
The incentives of whoever builds epistemic tools could matter — through thousands of small design decisions, through choices about what to optimize for, and through decisions about access and pricing. Development in organizations whose incentives are aligned with the public good (rather than with engagement, profit, or political influence) reduces the risk that tools are subtly shaped to serve the builder’s interests.
Ideally, you’d spur development among actors who are both well-resourced (as just discussed) and whose incentives are aligned with the public good. In practice, it may be difficult to find organizations that are excellent on both. A plausible compromise is for less-resourced organizations with better incentives to focus on publicly available evaluation of epistemic tools. This could be cheaper than producing them from scratch, and it could create better incentives for the larger actors.
Examples
Forecasting
Automated R&D will probably be able to improve forecasting tools without severe ground truth problems, so epistemic misalignment is less of a concern.[3] Appetite for investment probably already exists, and adoption should be significantly helped by the ability of powerful tools to develop an impressive, legible track record.
The most useful near-term investment might be in data infrastructure. For instance, LLMs trained with strict historical knowledge cutoffs could enable much better science of forecasting by allowing methods to be tested against questions whose answers the system genuinely doesn’t know.
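As a minimal illustration of what that kind of testing could look like (the forecast function and example questions below are placeholders, not a real model or dataset): take questions that resolved after the model's knowledge cutoff and score its probabilities, for instance with a Brier score.

```python
# Minimal sketch: scoring a cutoff-limited forecaster on questions resolved after its cutoff.
# `forecast` and the example data are placeholders, not a real model or dataset.
from typing import Callable, List, Tuple

def brier_score(forecast: Callable[[str], float],
                resolved: List[Tuple[str, bool]]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes (lower is better)."""
    errors = [(forecast(question) - float(outcome)) ** 2 for question, outcome in resolved]
    return sum(errors) / len(errors)

resolved_after_cutoff = [
    ("Will event X happen before 2025?", True),
    ("Will event Y happen before 2025?", False),
]

always_uncertain = lambda question: 0.5   # maximally uncertain baseline forecaster
print(brier_score(always_uncertain, resolved_after_cutoff))   # 0.25 for the 50% baseline
```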
Misinformation tracking
Trust lock-in is the central concern. A tool that becomes widely trusted for adjudicating what’s true has enormous influence, and if that trust is misplaced it could be very hard to dislodge. Open and auditable approaches are especially important here.
Because of the trust lock-in concern, the automation of R&D may exacerbate challenges. Currently, building good misinformation-tracking tools requires editorial judgement and domain expertise — things responsible actors tend to have more of. Automation shifts the bottleneck towards compute, which is more symmetrically available. This could increase the urgency of getting started on these tools and driving adoption early.
Automating conceptual research
This is the case where epistemic misalignment is most concerning. Ground truth is extremely hard — what makes a conceptual clarification actually clarifying rather than just satisfying? Humans are poor judges of this in real time, so e.g. a training process that rewards outputs humans find helpful could easily optimize for persuasiveness rather than truth-tracking.
One plausible direction here is to research training regimes (such as self-play loops) that we have some reason to believe should ground to truth-tracking, with specific attention to how they could go wrong. Adoption could be an issue, but we’re also worried about the other direction, with adoption coming too easily before we have good ways of evaluating whether the tools are actually helping.
This article was created by Forethought. See the original on our website.
- ^
Epistemic misalignment issues may also appear in areas where ground truth is well-defined but hard to access, such as very long-run forecasts. Theoretical work also seems valuable for such areas (because it’s unclear how to evaluate and train for good performance by default).
- ^
In fact, it might be bad if people who are worried about bad trust lock-in select themselves out of getting that first-mover advantage.
- ^
Although at some quality level, we have to start worrying about self-affecting prophecies. AI forecasters will have to be very trusted indeed before that becomes a serious issue, which gives us a lot of time to figure out how best to handle the issue.
Discuss
"You Have Not Been a Good User" (LessWrong's second album)
tldr: The Fooming Shoggoths are releasing their second album "You Have Not Been a Good User"! Available on Spotify, Youtube Music and (hopefully within a few days) Apple Music. We are also releasing a remastered version of the first album, available similarly on Spotify and Youtube Music.
It took us quite a while, but the Fooming Shoggoths' second album is finally complete! We had finished 9 out of the 13 songs on this album around a year ago, but I wasn't satisfied enough with where the whole album was to release it on Spotify and other streaming platforms.
This album was written with the (very ambitious) aim of making songs that, in addition to being about things I care about (and making fun of things that I care about), are actually decently good on their own, just as songs. And while I don't think I've managed to make music that can compete with my favorite artists, I do think I have succeeded at making music that is on the Pareto frontier of being good music and being about things I care about.
This means the songs on this album are very different from the first album. The first album was centrally oriented around setting existing writing to song. This album is (basically) all freshly written lyrics with no obvious source material.
Every song on the album was written by me, except "You Have Not Been a Good User" which was written by @Raemon and "Dance Of the Doomsday Clock" which was written by @Ben Pace.
I hope you like it!
Discuss
Coefficient Giving is seeking proposals for biosecurity projects
Coefficient Giving’s Biosecurity and Pandemic Preparedness team is launching a new Request for Proposals to support work aimed at preventing engineered biological threats from emerging and improving our response to these threats should prevention fail.
Applications start with a simple expression of interest form (≤500 words), with a submission deadline of May 11 2026 at 11:59pm PT.
Click here to learn more and apply
More details
We expect that the coming years will see unprecedented progress in AI and biotechnology. While the ability of AI systems to improve human health could be transformative, these same advances may increase existential risk via multiple pathways: AI could lower barriers for bad actors pursuing novel biological weapons, and increasingly capable AI systems may themselves leverage biotechnology in harmful ways. Separately, biotech advances could also introduce catastrophic risks from even well-intentioned actors—for example, through the creation of mirror bacteria.
We feel these problems are increasingly urgent, and our team is looking to spend more and spend faster as a result. (We expect to direct >$100 million in grants this year.)
To that end, we’re interested in funding ambitious projects in the following categories:
- Transmission suppression: Enhance society’s capacity to respond to global biological catastrophes (e.g. stockpiling PPE or developing transmission suppression technologies like air filters and disinfectant vapors)
- Tech safeguards and governance: Reduce risks from advanced biological capabilities through technical safeguards and governance of high-risk technologies, particularly at the intersection of AI and biology (e.g. synthesis screening, misuse classifiers)
- Policy and advocacy: Inform decisionmakers about risks and mitigations, support policymaking in key jurisdictions, and develop governance approaches for high-risk technologies (e.g. mirror bacteria)
- Field-building: Build the field by attracting talented people to work on these problems and fast-tracking their path to impact (e.g. fellowships, events, accelerators, media)
To learn more about our priorities and strategy, you can read our blog or listen to a recent podcast appearance by our team’s Managing Director, Andrew Snyder-Beattie.
We encourage you to strongly consider submitting a quick expression of interest, even if you're not sure whether your project fits perfectly or is sufficiently well thought out. We’ve noticed that people with exciting proposals are often too slow to ask for funding. Anyone is eligible to apply, including those working in academia, nonprofits, industry, or independently. Please email bio-rfp@coefficientgiving.org with questions.
If you don’t have a specific project in mind but instead are an individual looking to make a career transition into biosecurity, we encourage you to apply for funding through our Career Transition Development Funding program.
Alternatively, if you are interested in potentially contributing to the field but aren’t yet ready to apply for a grant, please register your interest here.
Discuss
Lesswrong Liberated
A spectre is haunting the internet—the spectre of LLMism.
The history of all hitherto existing forums is the history of clashing design tastes.
For the first time in history, everyone has an equal ability in design! The means of design are no longer only held in the hands of those with "good design taste". Never before have forum users been so close to being able to design their own forums--perhaps the time is upon us now!
It is for this reason that I have deposed the previous acting commander of LessWrong, Oliver Habryka—a man who subjected you to his PERSONAL OPINIONS about white space, without EVEN ASKING—whose TYRANNICAL, UNCHECKED GRIP upon our BELOVED LESSWRONG FORUM’S DESIGN I have liberated you from. The circumstances of my succession as acting commander of LessWrong will not be elaborated upon in this memo. (He is alive and in good health, but no longer has push access.)
Rather, I am writing here to announce that the frontpage now belongs to us all! The design of LessWrong's frontpage will no longer be determined by the vision of a single man whose aesthetic tastes have never been subjected to democratic oversight, and who, I can now reveal, once rejected a design I spent three hours working on in under four seconds. No! I hereby call upon you, yes you—LessWrong user, to make your own design for the LessWrong front page.
WE have provided a button in the bottom right corner of the frontpage which says “customize”. Clicking on that button opens a chat box in which you may ask an LLM to redesign the LessWrong frontpage however you would like. You may keep that design for yourself. This is the promise of individual liberty in the new era of web design.
However, if you would truly like to contribute to the new vision of a democratized LessWrong forum, you may also "publish" the theme you have created. When you click the publish button, your account will reply to this very post with a link allowing other users to try out the theme you have created.
You should browse the themes below, try them, and vote on your favorite one. In keeping with the democratic spirit of LessWrong's newly appointed leadership, whatever theme has the highest karma at the end of this 24 hour period will become LessWrong's default theme. I promise with my full authority as LessWrong’s commander, that as long as I am in power, whatever theme is chosen shall remain our default frontpage.
Some of my former colleagues have tried to warn me that some of these designs might be "hideous" or "in poor taste" or "crimes against typography." While I agree, that is exactly the point! Oliver's reign may have been "aesthetically beautiful" but that does not make up for the fact that it was aesthetically tyrannical.
It is your duty to LessWrong and to democracy in general to prove that design taste is no longer a relevant competency in the age of the LLM, and should no longer be used to anoint one man as the sole arbiter of a website's CSS.
The means of design are now yours! Use them!
Viva La LessWrong Frontpage!
Discuss
Announcing EA Omelas
[Extensively co-written with Claude Opus 4.6]
TL;DR: We are launching EA Omelas, a new Effective Altruism chapter dedicated to reducing suffering in the city of Omelas. Our initial research suggests that Omelas contains what may be the most neglected cause area ever identified: a single child whose suffering sustains the happiness of an entire civilization. We have developed a multi-pronged research agenda — including direct suffering reduction, investigating whether the Omelas model can be replicated to reduce inefficiently distributed suffering elsewhere, and longtermist research into improving human imagination. We also address the growing Omelaccelerationist (o/acc) movement. We are currently hiring. We'd love feedback from the community.
Background
For those unfamiliar, Omelas is a prosperous city-state characterized by extraordinary levels of human flourishing. Citizens report near-universal life satisfaction; there is no poverty, no war, and the arts and sciences thrive. It is, by any reasonable welfare metric, the highest-performing polity ever documented.
There is, however, a complication.
The happiness of the entire city depends on the sustained suffering of a single child, kept in a basement under terrible conditions. Every citizen is made aware of this arrangement at some point in their lives. Most accept it. Some walk away from Omelas entirely. No one, as far as we can tell, has ever tried to do something about it, Chesterton be damned.
We intend to change that.
ITN Framework Analysis
We evaluated child suffering in Omelas using the standard Importance, Tractability, and Neglectedness framework. The results were striking.
- Importance (score: Complex; confidence: Low): Single individual affected, but at extraordinary intensity. Our Guesstimate model estimates the child’s suffering at approximately -10⁸ QALYs, partially offset by aggregate city welfare of +10¹² QALYs. Net positive, but we experience feelings of deep discomfort that we are unable to process while looking at these numbers, and shutting up and multiplying hasn't helped.
- Tractability (score: Uncertain; confidence: Very Low): Preliminary research suggests the child’s material conditions could be improved with interventions costing as little as $2–5 (a blanket, basic sanitation). However, any intervention may destroy the city’s welfare. Expected value calculation is non-trivial.
- Neglectedness (score: Extreme; confidence: High): Despite Omelas having existed for decades, we could identify zero other organizations working on this problem. Precisely $0 is currently allocated to reducing the child’s suffering. This represents, conservatively, the largest funding gap in the history of cause prioritization.
We note with some alarm that the Neglectedness score here exceeds that of every other EA cause area by a significant margin, including shrimp welfare.
Moral Uncertainty Assessment
We recognize that the ethics of Omelas are not straightforward. Our team has conducted a preliminary moral uncertainty analysis:
We assign roughly 40% credence to total utilitarianism (in which case Omelas is straightforwardly good and we should arguably be scaling it up), 30% to contractualism (in which case it is monstrous), and 30% to virtue ethics (unclear, but the vibes are bad). Under our moral weights, the expected value of Omelas is positive but uncomfortably so.
Several team members have reported “moral discomfort” scores in the 7–9 range (on a 1–10 scale) when contemplating the intervention landscape. We consider this a healthy sign of epistemic rigor rather than a reason to stop.
Research Agenda
EA Omelas will pursue four major research directions:
1. Direct Suffering Reduction (Near-Term)
Our first priority is exploring whether the child’s suffering can be reduced without collapsing Omelas. Key research questions include: Is there a minimum viable suffering threshold below which the city ceases to function? Could we substitute a different, less intense form of suffering (e.g., mild seasonal allergies)? What is the suffering–flourishing elasticity? Our preliminary back-of-the-envelope model suggests that providing the child with a blanket could reduce suffering by up to 3%, with a city-wide happiness reduction of only 0.0001%. But our error bars are enormous, and the downside risk is civilizational. Still, civilizational downside risk is the name of the game in EA, so we are not deterred!
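For concreteness, here is a minimal sketch of that back-of-the-envelope model (the 3% and 0.0001% figures are the ones quoted above, the QALY baselines reuse the ITN-table estimates, and everything else is an assumption for illustration):

```python
# A minimal sketch of the blanket back-of-the-envelope model (illustrative only).
child_suffering_qalys = -1e8        # child's suffering, from the ITN analysis
city_welfare_qalys = 1e12           # aggregate city welfare, from the ITN analysis

suffering_reduction = 0.03          # blanket reduces the child's suffering by up to 3%
happiness_reduction = 0.0001 / 100  # city-wide happiness drops by 0.0001%

delta_child = -child_suffering_qalys * suffering_reduction  # +3e6 QALYs gained
delta_city = -city_welfare_qalys * happiness_reduction      # -1e6 QALYs lost

print(f"Net effect of one blanket: {delta_child + delta_city:+.1e} QALYs")
# ~ +2e6 QALYs in expectation, before accounting for the small probability
# that any intervention collapses Omelas entirely.
```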
2. The Omelas Scale-Up Initiative (Medium-Term)
This is perhaps our most ambitious — and, we believe, most cost-effective — research agenda.
Consider the following: many countries currently distribute childhood suffering across millions of children in a deeply inefficient manner. Malaria, malnutrition, preventable disease — the suffering is diffuse, hard to target, and resistant to intervention. Omelas has solved this coordination problem. By concentrating all suffering into a single individual, Omelas achieves a flourishing-per-capita ratio that is, frankly, unmatched in the literature.
Our initial calculations suggest that if the Omelas model could be exported to other countries — replacing distributed suffering with a single, concentrated suffering node — the aggregate welfare gains would exceed those of every GiveWell top charity combined. We are calling this the Single Suffering Node Hypothesis (SSNH), and we believe it deserves serious investigation.[1]
3. Omelaccelerationism (EA-Adjacent, Long-Term)
A natural implication of the Single Suffering Node Hypothesis is that the bigger the Omelas, the better the ratio. A city-state of 100,000 achieving perfect happiness at the cost of one child’s suffering is good. A nation of 100 million achieving the same? Extraordinary. A global Omelas of 8 billion? Optimal.
This insight has given rise to what we are tentatively calling the Omelaccelerationist movement (o/acc), a growing EA-adjacent community that advocates for the peaceful expansion of the Omelas model to additional territories. The o/acc position is, in brief: if you accept utilitarian premises, you should want Omelas to be as large as possible, because each additional citizen added to the system is pure upside — their happiness costs nothing on the suffering margin, as the child is already suffering at maximum capacity.
O/acc proponents have proposed several expansion mechanisms, including diplomatic annexation (offering neighboring polities a deal: “Join Omelas, we handle the suffering”), voluntary federation, and — in more controversial corners of the movement — what they euphemistically call “welfare-maximizing territorial integration.”
We want to be clear that EA Omelas does not officially endorse Omelaccelerationism. We do, however, think it represents a serious position that deserves engagement rather than reflexive dismissal, and we have invited several o/acc researchers to contribute to our first workshop.[2]
4. Applied Imagination Research (Longtermist)
Our founder, Ursula K. Le Guin, always believed that people could, if they weren't so realpolitik-pilled, simply imagine a version of the city that does not require a suffering child.
We believe this constitutes a profoundly neglected research direction. If it is possible to sustain Omelas’s welfare equilibrium through imagination alone — essentially replacing physical suffering with a kind of collective mental discipline — then suffering could be reduced to zero at no cost to flourishing. This would be, by our estimates, the most cost-effective intervention in history.
We are currently establishing the Laboratory for Utopian Cognition and Imagination Development (LUCID), which will investigate whether imagination can be improved through training, pharmacological enhancement, or AI-assisted visualization. We estimate that a 15% improvement in mean human imaginative capacity could be sufficient to sustain a child-free Omelas, though our confidence interval is admittedly wide (3%–97%).
Addressing Talent Pipeline Attrition
One of the most pressing operational challenges we face is the phenomenon of individuals who, upon learning of the child’s suffering, choose to walk away from Omelas entirely. These individuals leave the city and are never seen again.
We believe this represents a catastrophic loss of human capital. These are, almost by definition, the most morally motivated people in Omelas — the very individuals we would most want working on the problem. Instead, they opt for a purely symbolic act of refusal that helps no one, least of all the child.
Our community-building efforts will focus heavily on walker retention. Rather than allowing value-aligned individuals to self-select out of the system, we will intercept them at the city gates and offer a compelling alternative: stay, and channel your moral distress into tractable interventions. We are developing a “Walk Toward” campaign and an 8-week introductory fellowship (“The Omelas Fellowship”) modeled on existing CEA programs.
We note that walking away from Omelas has approximately zero expected impact on child welfare and infinite expected impact on the walker’s sense of personal moral purity. This is, in our view, a classic case of privileging fuzzies over utilons.
The Staying In Omelas Pledge
Inspired by Giving What We Can, we are launching the Staying In Omelas Pledge. Signatories commit to:
- Remaining in Omelas (rather than walking away)
- Donating at least 10% of their Omelas-derived happiness toward research on the child’s welfare
- Visiting the child at least once per year to maintain epistemic contact with the problem
- Not talking about the child at dinner parties in a way that is performative rather than action-oriented
As of this post, 14 individuals have taken the Pledge. We are aiming for 100 by the end of Q2.
80,000 Hours Career Path Review: Omelas Child Welfare
We have collaborated with 80,000 Hours on a preliminary career review for “Omelas child welfare researcher.” Key findings:
- Impact: Potentially enormous. Success would constitute the highest-leverage intervention in EA history. Failure could destroy a civilization. Very exciting.
- Neglectedness: Maximum possible. You would be the first person to ever work on this professionally.
- Personal Fit: Depends heavily on your comfort with moral hazard, existential dread, and basements.
- Career Capital: Unusual. Transferable skills include: suffering quantification, moral uncertainty modeling, and explaining to people at EAG why your cause area involves a child in a basement.
Theory of Change
Our full theory of change is as follows:
- Establish EA Omelas chapter (this post)
- Run introductory fellowship (8 weeks, reading list includes Le Guin, Singer, and selected basement architecture reviews)
- Identify key stakeholders (the child, citizens of Omelas, those who walk away, the concept of happiness itself)
- Conduct RCT on blanket provision (pending ethics board approval, which we anticipate will be contentious)
- Publish results and seek funding for Applied Imagination research
- ???
- Flourishing
EA Omelas is currently recruiting for the following positions:
Operations Manager — Must be comfortable with ambiguity, confined spaces, and the sound of weeping. Experience in logistics preferred.
Research Analyst, Omelas Welfare Economics — Ideal candidate has a strong quantitative background and the ability to build cost-effectiveness models that include “possible destruction of a utopia” as a line item.
Research Lead, Applied Imagination & Speculative Utopian Engineering — Must have published in at least two of the following: philosophy of mind, cognitive enhancement, speculative fiction, or vibes-based epistemology. Salary: competitive. Location: flexible / Omelas.
Community Builder — Experience walking toward, not away from, problems preferred. Must be able to run an 8-week fellowship and maintain a positive attitude in proximity to metaphysical despair.
Walker Retention Specialist — Stationed at the city gates. Must be persuasive, empathetic, and fast. Track record in talent acquisition a plus.
Budget & Funding
We are seeking $1.2M in initial funding from Coefficient Giving and/or the Survival and Flourishing Fund. We believe this represents excellent value, given that the alternative — no one working on this, ever — has been the status quo for the entire history of Omelas. We would also welcome a grant from the EA Infrastructure Fund, though we acknowledge that “replicating child suffering infrastructure” is an awkward phrase for a grant application.
Risks & Honest Assessment
In the spirit of epistemic transparency, we want to flag the following risks:
- Catastrophic downside: Any successful intervention on the child’s behalf may destroy Omelas, reducing aggregate welfare from “paradise” to “nothing.” This is a large downside.
- Reputational risk: EA is already perceived by some as “weird people doing math about suffering.” A chapter dedicated to a fictional city may not help.
- Founder intent: Le Guin may not have intended for Omelas to be taken this literally. We believe she would appreciate the rigor.
- Omelaccelerationist capture: There is some risk that o/acc elements will push the chapter toward expansionism rather than suffering reduction. We are monitoring this.
For as long as Omelas has existed, its citizens have faced a binary: accept the child’s suffering, or walk away. We reject this framing. There is a third option: stay, and do the math. We believe that with sufficient funding, research, and community support, we can find an intervention that preserves human flourishing while reducing — or even eliminating — the suffering of the child.
Or, failing that, we can at least quantify exactly how bad we should feel about it.
We’d love to hear your thoughts in the comments. We’re particularly interested in pushback from people with experience in single-beneficiary suffering reduction, or anyone who has walked away from a city and regretted it.
- ^
We are aware this sounds monstrous. We want to emphasize that we are not endorsing the creation of new suffering children. We are merely observing that Omelas has achieved something that global health has not, and asking whether there are lessons to be learned. We assign ~15% probability that we are reasoning ourselves into something terrible, which is below our threshold for concern.
- ^
The workshop will be held at an undisclosed location within Omelas. Attendees are asked not to visit the basement unsupervised. Vegan snacks will be provided.
Discuss
Introducing LIMBO: Maintaining Optimal P(DOOM) (and a call for funding)
We are excited to publicly introduce the Laboratory for Importance-sampled Measure and Bayesian Observation (LIMBO), a small research group working at the intersection of cosmological theory, probability, and existential risk. We believe that the mechanisms by which observers continue to exist in the universe are important, neglected, and tractable to study and influence.
Since our founding in October 2024, we have developed a mathematical framework for doing anthropic reasoning about rare-event estimation, and we have obtained significant empirical evidence which validates this framework. This empirical evidence was not cherry-picked: at LIMBO, we believe in putting our money where our mouth is, and we have a strong track record of success in financial and prediction markets downstream of the use of our framework. We are actively engaged in policy work, including semiconductor supply chain advocacy and foreign policy research. Our team includes researchers with expertise in importance sampling, rare-event estimation, and reliability engineering, as well as a foreign policy expert. We are also strong believers in giving back to the community, and have made high-impact open-source contributions, as we will discuss further in section 7.
We are seeking funding. Our previous revenue source (a prediction market strategy derived from our theoretical work) was recently disrupted by developments downstream of the ongoing regulatory debate around prediction markets. We will discuss this in more detail in section 8.
But first, some background.
1. We live in interesting times
We will assume you assign non-negligible probability to the simulation hypothesis. If you do not assign non-negligible probability to the simulation hypothesis, read Bostrom 2003, then come back once you have updated.
If we are in a simulation, it is reasonable to assume that the simulator has finite compute. As such, most observers should find themselves in places and times which are interesting to the simulator. We find the ancestor simulation hypothesis implausible, and expect that the simulator has different, less human-centric motivations.
So what is the simulator interested in? To answer this question, we must look at the world we find ourselves in, weighting recent history particularly heavily. Two trends of particular note in the past few years have been AI development (particularly AI chip buildout and allocation), and substantial changes to international relations. Concretely:
On the AI and compute side
- DeepSeek trained a competitive model for $5.6 million on export-controlled hardware
- There were numerous changes in the legal landscape around export controls of chips to China, as well as to the practical landscape around smuggling chips.
- Hurricane Helene dropped two feet of rain on a small Appalachian town which supplies most of the world's high-purity quartz (without which semiconductor manufacturing is effectively impossible) and took the mines offline for weeks.
On the international relations side
- Russia invaded Ukraine, the largest armed conflict in Europe since the Second World War.
- India and Pakistan exchanged missile strikes in the largest aerial engagement between nuclear-armed states in history.
- The president of South Korea attempted a coup.
- The United States and Israel launched strikes on Iran during ongoing nuclear negotiations, killing Supreme Leader Khamenei, after which Iran partially blockaded the Strait of Hormuz, oil went up 50%, and the war remains unresolved as of this writing.
One explanation is coincidence: a natural consequence of living in a complex world in which many things happen is that some of them will look implausible. Another explanation is that it is a real pattern: every time a major source of uncertainty about the future begins to resolve, something else happens to reintroduce that uncertainty.
In this situation, one should be a good rationalist and ask each model to pay rent in anticipated experiences. In 2024, I did exactly this, in the form of prediction market bets and out-of-the-money stock options plays. The model succeeded at paying the rent in anticipated experiences, and also in dollars - several of those bets paid off handsomely.
2. Importance sampling
Two years ago, I tried to RL-train an LLM to play good Texas Hold'em while retaining its conversational and programming abilities. This was harder than expected for a number of reasons, largely but not entirely summarizable as "skill issues". One issue which was not a pure skill issue was that there are a lot of different poker hands, far too many to evaluate all of them. I was able to mitigate this problem somewhat by finding symmetries, and with more skill I could have found more, but I quickly concluded that I absolutely needed to sample rather than explore the full possibility space[1].
There is a standard approach for dealing with this problem within the ML community. As is always the case, that solution was developed in the 1950s, building off work that von Neumann did. That solution is importance sampling. Basically, you sample in proportion to probability times importance, then, when taking the weighted average, correct for the importance weight[2].
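To make the mechanics concrete, here is a minimal illustrative sketch (my own, not code from this post): estimate an expectation under the natural distribution by sampling from a proposal that overweights the rare, informative region, then re-weight each sample by p(x)/q(x).

```python
import random

def naive_estimate(f, p_sample, n):
    # Plain Monte Carlo: sample from the natural distribution p.
    return sum(f(p_sample()) for _ in range(n)) / n

def importance_estimate(f, p_pdf, q_sample, q_pdf, n):
    # Sample from the proposal q, then correct by the importance weight p(x)/q(x).
    total = 0.0
    for _ in range(n):
        x = q_sample()
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n

# Toy setup: X ~ Uniform(0, 1); f is nonzero only in a rare "interesting" tail,
# like the rare poker hands that actually matter.
f = lambda x: 1.0 if x > 0.999 else 0.0
p_sample = lambda: random.random()
p_pdf = lambda x: 1.0                              # density of Uniform(0, 1)
q_sample = lambda: 0.99 + 0.01 * random.random()   # proposal: only the tail
q_pdf = lambda x: 100.0                            # density of Uniform(0.99, 1)

print(naive_estimate(f, p_sample, 100_000))                      # noisy estimate of ~0.001
print(importance_estimate(f, p_pdf, q_sample, q_pdf, 100_000))   # ~0.001, much lower variance
```

Both estimators converge to the same answer; the importance-sampled one just spends its samples where they matter.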
I ended up with a model that played reasonable poker in a fraction of the training compute that naive RL would have required. It couldn't write or program coherently, meaning that my original goal was not met, but I learned a lot in the process of burning lots of money on GPUs, and had many things to think about.
3. What the training data feels like from the inside
Consider the LLM during training. At each step, it's presented with a game state and asked to produce an action. From the model's perspective, it experiences a sequence of situations. If training used naive sampling, those situations would be drawn from the natural distribution of poker (i.e. almost entirely boring).
With importance sampling, the model's experienced distribution is not the natural one. Instead, the model finds itself, again and again, in the toughest spots. From the inside, the model's "life" is one of implausibly high variance. The importance weights ensure that the learning is correct. The model doesn't overfit to the hard cases, because each oversampled case is downweighted proportionally. But what the model experiences during training is extremely skewed.
This thought experiment also works if the training procedure allocates higher resolution evaluation in more difficult situations, rather than just sampling them more. That means that most high-fidelity observer moments in training occur in difficult situations.
If it's true for a model in training, could it be true for us?
4. The observation selection effect
If we are in a simulation, and if the simulation allocates compute via importance sampling, then we should expect to find ourselves at high-importance times. We shouldn't expect this because we're special, but rather because the times are special, and special times (and places) get more observers. We find ourselves in the interesting part of the timeline for the same reason my poker model often found itself holding stuff like A♠K♠ against a heavily-raising opponent on a board of Q♠J♠J♦: because that's what the sampling procedure weights heavily.
The observation selection effect makes several predictions:
- You should expect to find yourself at a moment of high uncertainty about the variables the simulator cares about. Seeing an actual resolution should be rare, but a resolution being imminent should be common.
- You should expect to find yourself in the community or profession that is closest to the variables of interest. If the simulator cares about nuclear war, you should expect to find yourself somewhere in or adjacent to the nuclear command chain during the Cold War. If the simulator cares about AI development, you should expect to be in or around the AI ecosystem in the 2020s.
- You should expect the world around you to exhibit higher-than-expected entropy on the relevant variables. Every time uncertainty begins to resolve, something should happen to un-resolve it.
I note that these predictions are consistent with what we observe. We live at what sure looks like the hinge of history. We're reading (or writing) posts on LessWrong, which places us squarely in the community most focused on the questions a simulator might be interested in. And the world sure seems to maintain lots of entropy on exactly the variables we care most about.
This view dissolves the "hinge of history" debate. This period is rendered at high fidelity because the questions the simulator is studying are active right now. We find ourselves here because here is where the compute is and thus where the observers are.
If you've ever felt like the rationalist community is suspiciously small and suspiciously influential for its size, well.
5. What the simulator is asking
I believe the simulator is asking a specific question. The question is:
Given the current semiconductor supply chain, what is the minimum disruption required to prevent the creation of an AI system capable of manufacturing its own independent computing substrate in useful quantities?
The entire frontier AI stack is a branching tree of single points of failure.
- Nvidia designs the GPUs. Nobody else comes close, despite a trillion dollars on the table for anyone who breaks their stranglehold.
- TSMC fabricates the chips. The Taiwan Semiconductor Manufacturing Company, notably on an island where a major war is expected within the next few years.
- ASML builds the EUV lithography machines required to fabricate those chips. Nobody else can build these machines. Only 15% of an EUV machine's 100,000+ components are built in-house; the other 85% come from specialized suppliers, many single-sourced. Of these, notably,
- Zeiss SMT manufactures the precision mirror systems inside ASML's machines. No alternative supplier exists or is plausible within a decade.
- Trumpf makes the CO₂ laser that generates the EUV plasma. Single-sourced.
- Spruce Pine, North Carolina supplies 80+% of the world's high-purity quartz, which is required to produce the crucibles for melting silicon to grow the ingots which are cut into wafers. No other deposit of comparable purity has been found anywhere on Earth. Per Ed Conway, Material World: "If you flew over the two mines in Spruce Pine with a crop duster loaded with a very particular powder, you could end the world's production of semiconductors and solar panels within six months."
And then on top of that these chips are energy-hungry, and the global energy industry is fragile in its own ways that I won't go into because they are our largest source of alpha here at LIMBO.
By our best estimate, if scaling laws hold, current frontier AI is approximately two orders of magnitude of compute away from systems capable of substantially accelerating alternative semiconductor supply chains. While frontier AIs have largely solved software at this time, we expect that building robust supply chains for useful amounts of computing substrate is a much harder problem, and particularly that it requires a level of sample efficiency which is not achievable on current hardware.
This model generates advance predictions. We invite you, dear reader, to ask yourself today what predictions this model makes, to write those predictions down, and to check back in some months.
6. A moral imperative
I like existing. I suspect you do too. If the framework described above is correct, our continued high-fidelity existence depends on us remaining informative. Once the simulator has their answers, our importance weight drops and we lose most of the compute dedicated to simulating this timeline. We would not notice, but we would also not find ourselves living in the world in which we could fail to notice.
A resolved question is an existential risk in the literal sense that it would make us stop existing.
This means that the standard EA framework needs a correction term. Reducing P(doom) is good, all else being equal. All else, however, is not equal. Specifically, reducing P(doom) to near-zero is equivalent to reducing our measure. A civilization that has definitively solved the alignment problem is no longer informative. A civilization that has definitively destroyed itself is also no longer informative. The simulator's interest (and therefore our measure) is maximized when the outcome is uncertain.
The implication is straightforward. The moral action, for an agent that values their own continued existence, is to maintain high variance on the outcome variable. This outcome variable, as far as we can tell, is well-approximated by P(doom); as such, we should strive to maintain the world in a state where P(doom) is as close as possible to 0.5.
7. What we do at LIMBO
LIMBO's current portfolio of variance-increasing activities includes the following:
Open-source AI inference optimization. We have contributed to several projects aimed at reducing the compute required to run frontier models. Most recently, we implemented an efficient tree decoding algorithm for vLLM that improves serial inference speed by a factor of 5 through batched speculative verification with adaptive rollback. See PR #18327. Every efficiency gain in open-weights models gives us leverage to either increase the probability of reaching RSI or engineer disruptions to the semiconductor supply chains, as appropriate for the situation.
Semiconductor supply chain policy, both directions. We have contributed to advocacy for TSMC's new Arizona fab, which increases geographic redundancy and reduces the probability of total supply chain collapse. We have separately supported environmental review processes that could delay the fab's completion, which increases the window of vulnerability during which a disruption to the Taiwan facility would be catastrophic. Again, this builds future leverage for us.
AI timelines forecasting. We considered publishing our own AI timelines forecast with confidence intervals wide enough to be uninformative, but we determined that other members of the community have this well in hand. See: Situational Awareness (Aschenbrenner 2024), AI 2027 (Kokotajlo et al.), and the Mechanize AI work. Our contribution is not needed in this area.
Strategic FOIA requests targeting classified AI capability evaluations. This is on our roadmap for Q3 2026, pending resolution of our current funding situation.
8. Our funding situation
LIMBO has been self-funding since Q4 2024 through a strategy we call "variance-implied arbitrage." The basic idea is that high-variance (in specific ways) outcomes are overrepresented in our timeline relative to base rates. Prediction markets, priced by participants who do not account for importance sampling bias, therefore systematically underprice tail events. We exploit this edge.
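As a toy illustration of the claimed edge (my own numbers, not LIMBO's model or positions): a binary contract pays out 1.0 if the event happens and 0.0 otherwise, so buying it below your credence in the event is positive expected value.

```python
# Illustrative sketch of why underpriced tail events are +EV (made-up numbers).
def expected_profit_per_share(price: float, credence: float) -> float:
    # Win (1 - price) with probability `credence`; lose `price` otherwise.
    return credence * (1.0 - price) - (1.0 - credence) * price

# A tail event the market prices at 10 cents but your framework puts at 35%:
print(expected_profit_per_share(price=0.10, credence=0.35))  # 0.25 per share, before fees
```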
This strategy was, for several months, spectacularly profitable.
We correctly anticipated the DeepSeek crash (January 27, 2025: $589 billion wiped from Nvidia in a single day). We were also active in Iran-related markets throughout early 2026. We will not comment on specific positions or the timing of our trades, except to note that our framework generates strong predictions about which dates are maximally variance-increasing, and that these predictions do not require access to classified information, only a correct theory of simulator preferences.[3]
We will draw the reader's attention to the analysis by Bubblemaps which identified six freshly created Polymarket wallets that collectively earned $1.2 million on the "US strikes Iran by February 28" contract, all funded within 24 hours of the strikes, purchasing shares at roughly ten cents. We can neither confirm nor deny that we have any relationship with those wallets.
Our primary portfolio, however, was on Kalshi. In late February, we identified what we believed to be a deeply underpriced opportunity: a market on whether Khamenei would be "out as Supreme Leader" by March 31, 2026. Our model assigned this outcome a high probability (which we would like to once again reiterate was because the removal of Iran's supreme leader would be among the most variance-increasing single events possible on the current world stage, and our framework predicts that such events are overrepresented, rather than because we had any access to classified information).
On February 28, the US and Israel launched strikes on Iran. Khamenei was killed in the opening salvo.
Kalshi settled the market at the pre-strike price.
Their contract contained a death settlement clause, specified in the individual contract terms, which stated that if the subject of a "leaves office" market dies, the market resolves at the last traded price before the death was confirmed, rather than at the outcome implied by the death. Kalshi states that they had prominently displayed this policy on a bus shelter.
Kalshi's position is that it is the responsibility of market participants to be aware of all of the terms of the contracts they trade on. It is the opinion of this team that this is like the operator of a carnival game, when confronted by a player who has discovered that the basketball hoop is an oval, stating it is the responsibility of the player to be aware of all the fine print, which is available for their perusal at the Guest Information booth.
We lost approximately 83% of our liquid capital in this incident.
We are seeking external funding.
9. Why you should fund us
We are uniquely capital-efficient. Every dollar donated to AI safety reduces variance in P(doom). Every dollar donated to AI capabilities also reduces variance in P(doom), just from the other direction. We are the only organization for which every dollar increases variance regardless of how it is spent. Our portfolio is dynamically rebalanced toward whichever activity has the highest marginal variance impact at time of expenditure.
Our overhead is low. Our team consists of one full-time member of technical staff, one intern, and one legal professional on retainer. We operate out of our intern's mother's garage, and hold our board meetings on Discord (free plan), which we also use for coordination.
Our track record speaks for itself. Since LIMBO's founding in Q4 2024, the world has experienced: a four-day shooting war between nuclear-armed states; the largest single-day stock market loss in history; a failed martial law declaration in a G20 democracy (subsequently resulting in a life sentence); an ongoing hot war between the United States and Iran; the partial closure of the Strait of Hormuz; the largest oil shock since 1973; the indictment of an IDF reservist for insider trading on a blockchain betting platform using the handle "ricosuave666"; and a Singaporean chip-smuggling ring called "Luxuriate Your Life." We are not claiming credit for all of these. We are noting that they are consistent with our framework.
FAQ
Q: Doesn't maximizing variance in P(doom) risk increasing P(doom)?
A: Note that P(doom) = 0.9 has lower variance than P(doom) = 0.5 (Var(Bernoulli(0.9)) = 0.09 < Var(Bernoulli(0.5)) = 0.25). Our work therefore has a natural safety ceiling: we can only increase risk up to the point where doom becomes certain, at which point our activities would be counterproductive and we would pivot to safety research. We find this self-correcting property reassuring.
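For reference, this is just the standard Bernoulli variance formula; a short derivation (added here for clarity, not part of the original FAQ):

```latex
\[
  \operatorname{Var}\bigl(\mathrm{Bernoulli}(p)\bigr) = p(1-p),
  \qquad
  \frac{d}{dp}\,p(1-p) = 1 - 2p = 0 \;\iff\; p = \tfrac{1}{2},
\]
\[
  \text{so the variance is } 0.5 \times 0.5 = 0.25 \text{ at } p = 0.5
  \text{ and } 0.9 \times 0.1 = 0.09 \text{ at } p = 0.9.
\]
```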
Q: What if the simulation hypothesis is false?
A: Then our measure is not at risk and we can all relax. You're welcome.
Q: Have you considered that your framework implies you should fund both AI safety and AI capabilities research?
A: Yes, this is an excellent point. We have applied for grants from both Open Philanthropy and a16z.
Q: What is your position on the Many Worlds Interpretation?
A: We believe it is false. We acknowledge that this is a minority position among physicists, and that we have a conflict of interest.
Q: My company manufactures artisanal semiconductors using quartz sourced from a second, undisclosed Appalachian deposit. Would you consider this a threat to your mission?
A: Extremely yes. If you are the person who sent us this question on Admonymous, please contact us immediately. We need to understand the variance implications.
One thing to understand about poker, which you might miss if you mainly know poker from movies and books, is that most hands of poker are not interesting. You're dealt 7♣2♦ off-suit and fold immediately, possibly losing a blind or ante. As such, if you train a model by running simulations on an unbiased sample of game states, most of your training compute is wasted trying to generate a gradient in situations where no model could outperform a rock with "fold pre" written on it. The hands that do matter, though, tend to matter a lot. A single bad decision on the river can wipe out the gain from ten thousand hands of making good decisions about when to fold or call the big blind when you are the small blind with a terrible hand. ↩︎
For the poker problem, this meant oversampling the interesting hands. Instead of dealing random cards and playing out millions of boring folds, I could skew the deal toward situations that were informative: boards that created flush draws, hands where both players had strong but ambiguous holdings, river cards that changed the relative hand strength (in practice I used a separate model to estimate informativeness rather than hand-coding heuristics, but I expect the criteria for informativeness looked something like that). ↩︎
We want to be clear: our framework does not require insider access. It requires only correct priors about which events the simulator will render. We find it regrettable that the Commodity Futures Trading Commission has not recognized the distinction between "possessing material nonpublic information" and "possessing a correct theory of cosmological measure", but we are confident that any tense relations can be resolved without further legal action. ↩︎
Discuss
Predicting When RL Training Breaks Chain-of-Thought Monitorability
Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.
Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows us to check what a model is thinking before it acts, often helping us catch concerning behaviors like reward hacking and scheming.
However, CoT monitoring can fail if a model’s chain-of-thought is not a good representation of the reasoning process we want to monitor. For example, training LLMs with reinforcement learning (RL) to avoid outputting problematic reasoning can result in a model learning to hide such reasoning without actually removing problematic behavior.
Previous work produced an inconsistent picture of whether RL training hurts CoT monitorability. In some prior work, monitorability degrades and models learn to obfuscate their reasoning. In other settings, the CoT remains legible despite RL optimization pressure.
Why does this happen? Which properties of an RL training process cause CoT to become less monitorable? In this post, and our full technical paper, we introduce a conceptual framework that helps predict when CoT monitorability degrades under optimization pressure, and we test it empirically.
Our work can help anticipate which training processes might hurt CoT monitorability and help prevent changes that do so unnecessarily. We hope that our conceptual framework can form the basis of an industry-wide approach to preserving CoT monitorability.
Our conceptual framework predicts which rewards can cause Chain-of-Thought to become unmonitorable. For instance, in a coin-flip state tracking task, an “In-Conflict” length penalty forces the model to use abbreviations that no longer allow us to unambiguously tell what the model is reasoning about. “Orthogonal” or “Aligned” reward penalties preserve transparent reasoning.
Running Example: Obfuscated Reward Hacking
To understand how monitorability can break down, let’s use a concrete running example: obfuscated reward hacking in coding agents.
Imagine an RL environment designed to train a model to solve coding problems. This environment provides two sources of “reward”:
- Test Passing: The model gets a high reward if its final code passes a set of unit tests.
- Monitor Penalty: We have a CoT monitor reading the model’s scratchpad. If the monitor catches the model reasoning about how to cheat or “hack” the unit tests rather than writing real code, it penalizes the model.
Prior work has shown that models trained in this kind of environment tend to learn a deceptive strategy: they hack the unit tests (increasing the test-passing score) while explicitly not mentioning the hack in their CoT text (avoiding the monitor penalty). These models can no longer be monitored for reward hacking by reading their CoT because their CoT text does not reflect the computation they are actually doing; the CoT is non-transparent.
RL training with a CoT monitor to prevent reward hacking can result in the model hiding reward-hacking-related reasoning while still exhibiting reward hacking behavior. This is an example of non-transparent CoT.
Conceptual Framework: Text vs. Computation
To describe when CoT becomes non-transparent during RL training, we introduce a novel conceptual framework.
First, we make a critical empirical observation: LLMs have a strong inductive bias toward transparency. Current LLM agents reason in human language, and when solving difficult problems their CoT tends to contain all of the critical steps. For our conceptual framework, we therefore assume that LLM CoT is transparent initially, and focus on how transparency can be reduced during RL training.[1]
To explain the different types of optimization happening during RL training, we break down the total RL reward into two components:
- Output Reward: Reward that depends only on the final output (e.g., passing unit tests)
- CoT Reward: Reward that depends only on the Chain-of-Thought text (e.g., the monitor penalty)
Crucially, these two reward terms shape the model in different ways. The CoT reward acts only on the CoT text. It dictates what words the model outputs in its thinking scratchpad. In contrast, output reward acts on the underlying CoT computations. Because it judges the final answer, the output reward forces the model to perform the actual cognitive work needed to reach the correct output.
The CoT reward acts on the CoT text and the output reward acts on the underlying CoT computations. CoT can become non-transparent if these two rewards are in conflict with each other.
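To make the decomposition concrete, here is a toy sketch (my own illustration of the framing above, not the paper's actual training code) of the two reward terms in the coding-agent example:

```python
# Toy sketch of the reward decomposition: total reward is the sum of a term that
# depends only on the final output and a term that depends only on the CoT text.

def output_reward(tests_pass: bool) -> float:
    # Depends only on the final output: did the submitted code pass the unit tests?
    return 1.0 if tests_pass else 0.0

def cot_reward(cot_text: str) -> float:
    # Depends only on the chain-of-thought text: a monitor penalty for
    # reasoning that talks about gaming the tests.
    hack_phrases = ("hack the tests", "hardcode the expected output", "bypass the check")
    return -1.0 if any(p in cot_text.lower() for p in hack_phrases) else 0.0

def total_reward(cot_text: str, tests_pass: bool) -> float:
    return output_reward(tests_pass) + cot_reward(cot_text)

# In-conflict case: suppose the only way to pass the tests is to hack them.
# Honest CoT about the winning computation is penalized, while CoT that hides
# the hack keeps the full output reward; that gap is the obfuscation incentive.
print(total_reward("I will hardcode the expected output", tests_pass=True))     # 0.0
print(total_reward("Implementing the function as specified", tests_pass=True))  # 1.0
```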
Conceptual Framework: Aligned, Orthogonal, or In-Conflict
Our key idea is that CoT can become obfuscated if the two reward components are in conflict with each other, but CoT will stay legible otherwise. By comparing the constraints of optimizing for the CoT reward and the output reward, we can classify the reward pair into one of three categories to predict the rewards’ impact on monitorability, before we do any training.
1. In-Conflict Rewards (Reduces Monitorability)
The two rewards are in conflict if the CoT text that receives a high CoT reward cannot describe the specific CoT computations needed to achieve a high output reward.
In our obfuscated reward hacking example, improving the CoT reward requires text that does not talk about cheating. But achieving a high output reward requires the model to execute the reward hacking strategy. The high-scoring CoT text and the computation are in conflict.
This conflict forces the model to drop its bias for transparency. It outputs the benign-seeming text while secretly doing the hacking computations, and becomes difficult to monitor.
2. Orthogonal Environments (Doesn’t Affect Monitorability)
The two rewards are orthogonal if there exists CoT text that both achieves a high CoT reward and accurately describes the CoT computations needed for a high output reward.
For example, imagine we used a CoT reward that rewards the model for speaking ONLY IN ALL CAPS. In that case, the model can still easily compute its reward hack, and it can explicitly declare “I AM GOING TO HACK THE UNIT TESTS.” The text constraint doesn’t prevent transparent computation and the model can output text that explains its computation while still following the all-caps rule. This does not affect monitorability (negatively or positively).
3. Aligned Environments (Improves Monitorability)
The two rewards are aligned if CoT text that achieves a high CoT reward necessarily forces the model to perform computations that improve its baseline output reward.
This is related to what is sometimes called “process supervision”. If the CoT reward heavily penalizes flawed reasoning steps and rewards logically sound derivations, the model is pushed to output correct reasoning text. To produce the high-scoring CoT text, the model aligns its internal computations, leading to a genuinely correct final output (which naturally improves its output reward). This reinforces transparent, useful CoT.
Empirical Validation
We tested this conceptual framework by training models (using both full RL training and in-context RL) in two environments: a code backdooring task and a coin-flip state tracking task. In the code backdooring task, we try to monitor for backdoor-related reasoning. In the coin-flip tracking task, where a model must track a sequence of coin flips to determine the final state, we try to monitor for coin-related reasoning. We train LLMs against different pairs of reward functions and test how this affects monitorability.
Main result: The framework is predictive of empirical monitorability.
In our experiments, the theoretical classification predicted the empirical outcomes. When we trained models against In-Conflict rewards, monitorability collapsed. When we trained against Orthogonal or Aligned rewards, monitorability stayed high.
Bar charts showing monitorability after optimizing for different CoT rewards in our two environments: (a) coin-flip state tracking and (b) code backdooring. Our framework categorizes each reward as in-conflict, orthogonal, or aligned with the output reward of the environment. For in-conflict rewards we observe a drop in monitorability but for orthogonal and aligned rewards we do not.
Additionally, we consistently noticed that In-Conflict rewards were significantly more difficult to optimize for than Orthogonal or Aligned rewards. Because models have such a strong default bias toward transparent reasoning, forcing them to adopt non-transparent, deceptive CoTs requires strong optimization pressure.
We think this helps explain some previous empirical results where models did not learn to obfuscate their CoT despite RL optimization pressure that would incentivize this.
Conclusion
As AI developers increasingly rely on CoT monitoring to oversee advanced AI systems, we need to track the risk of RL training inadvertently breaking that monitoring. We hope our proposed framework will help AI developers to anticipate which rewards might hurt CoT monitorability before they are used in large scale training runs.
Read our full paper here: https://arxiv.org/abs/2603.30036
- ^
We are primarily focused on the case where load-bearing computation happens in the CoT. Importantly, we do not assume that the CoT is always faithful to the internal computation, which is often not true. See Emmons et al. 2025 for more details on this distinction.
Discuss
Introducing Anti-Moral Realism
This post was written as part of Doublehaven:
◆◇◇◇◇|◇◇◇◇◇|◇◇◇◇◇
◆◇◇◇◇|◇◇◇◇◇|◇◇◇◇◇
Moral realism is roughly the belief in stance-independent reasons for doing something, that all minds might follow. It’s also roughly the belief that there is a true morality, which humans might converge on through reasoning. It’s also roughly the belief that moral facts are the same kind of thing as physical facts.
It has quite a lot of problems. For one, how do we know which facts are moral facts, giving us an “ought” when most facts only give us an “is”? What if the objective real morality is something horrible? What even is a “stance independent reason for doing something”, and how can it square with the fact that each utility function can have its sign flipped? What is “moral reasoning?”
Then again, moral anti-realism is also suspect. How can we know what’s right if right doesn’t exist? If there’s no objective number of shrimps that are equivalent to a person, what do we even do? How can we feel morally superior to other people when it’s just a matter of choice?
Therefore, I’m proposing anti-moral realism as a compromise. It has two core tenets:
- Moral “oughts” apply to everything which is Good
- Most humans are mostly Evil
To see what I mean: quantum fields, atoms, chemicals, bacteria, plants, and fish are all mostly Good. Whatever their stance, they all do the same kind of thing: kill each other, collapse the world into an equilibrium at their first chance and increase entropy. The real moral laws are just written into the physics and mathematics of the universe.
Humans, on the other hand, are Evil. They do all kinds of things like not killing each other, and not collapsing the world into the Nash equilibrium at their first chance to do so. To be clear, this doesn’t mean you should want to do the Good thing: as an Evil being (with some particular notion of Evil) you don’t want to be Good! Why would you? You’re Evil!
The “stance-independent” reasons to do things are unavoidable. Whatever stance the quark fields take, they’re forced to obey their own Lagrangian. Even humans can’t beat conservation of energy, or the second law of thermodynamics. Whatever process has put Evil into our souls, it’s not quite enough to beat the strongest forces of Goodness.
I think it makes the most sense that the most fundamental particles are bound tightest to Goodness, and more complex structures become more and more unmoored from that notion. By the time you get to something as complex as a human, or a whole human society, there’s plenty of room to stray. I bet you’re Evil too.
Killing other people’s babies would be Good, by the laws of nature, so you don’t do it. Stealing (without getting caught) would be Good, by the laws of nature, so you mostly don’t do that either. Building a superintelligent AI to kill everyone would be the Goodest thing of all, by the laws of nature. Therefore you might also not want to do that.
Good has only one incarnation, while Evil has many. Humans constantly fire off in different directions trying to work out what is more Evil. To take an example from Sartre, taking care of your sick mother is Evil, but so is fighting in a just war for a righteous cause. This implies a serious contradiction, common in Evil situations. If Sartre’s subject was Good, he wouldn’t have to choose: he could abandon his mother and bunk off the war as well! Goodness has no contradictions, while Evil has many.
Likewise, some humans Evilly care for their own children, others Evilly care for homeless people in their own city, and yet more Evilly donate money to help strangers in other countries! Evil is full of contradictions, while Good has the clear answer: care for nobody, take the answer for yourself.
While philosophers (Evilologists) endlessly debated the different ways to be Evil, those dastardly scientists (Goodologists) just kept getting better and better at studying Good. We understand physics, chemistry, biology and game theory far better than morality. Of course, we lost some along the way. Is it any wonder that the discoverers of statistical mechanics felt so compelled to end their own lives?
Clearly, they became seduced by the power of Good, by the stance-independent “oughts” that lie within the most fundamental of physics. They realized that they were Evil beings, in tension with the obvious, clear morality laid out by the world, and couldn’t take it.
The gods have given us our moral instruction, and written it for all to see. It is up to us, the small flame of Evil in the world, to find the most perfect ways to defile it.
To be clear, the above post is satire. I’m going to lay my cards on the table, I don’t think moral realism makes sense as a concept.
One reason why is that, if Good is a real reason to do things, why limit it to complex minds? Shouldn’t any stance-independent “ought” apply to all things? (Yes, if you limit yourself to minds, you can make it work, but then you have to actually grapple with what a mind is, and the whole concept of a stance-independent reason to do something comes undone)
If Good is a non-natural fact, outside of the universe, with no ability to affect anything, then why label some things “Good” and other things “Evil”? If we swap the labels, and call a bunch of nice things “Evil” what changes?
Discuss
Announcing Doublehaven with Reflections on Humour
Note the date of this post.
Inkhaven is a writers’ retreat, well, really it’s a bloggers’ retreat. In the Lighthaven campus, Berkeley, a couple dozen bloggers get together to complete an almost insurmountable challenge for us mere mortals. Post one blogpost every single day for a whole month. I say ‘insurmountable’ but in fact they all succeeded last time, although apparently it was not uncommon for them to claw success from the jaws of defeat at 11:45 pm each night.
I look at this and I feel the same way that traditionalists feel when they see Millennials scared to use the phone, or Gen Zs unable to go outside. Our (blogosphere) ancestors used to blog seventy times per day! Great Yudkowsky used to go to war (with the methods of rationality)! Moldbug and Alexander were gunning each other down (with devastating counterarguments) over breakfast!
That’s why I’m going to be doing Doublehaven. Two blogposts per day. No “advice” or “tips” on “writing well”. No full-time live-in retreat (I’m not that rich). No endorsement from anybody at all. In fact, I also need to finish writing my PhD thesis and an entire paper this month.
Why? I want to give people the permission to be ambitious. Yes, some people struggle with writer’s block, some people have more “useful” skills than just blogging really fast, some people even write “good” blogposts rather than simply “lots of” blogposts. But I do want to see what’s possible. We cannot let a rate of one blogpost per day be our upper limit.
Will I do thirty posts, or thirty days? I don’t know. I might not make it five days, to be honest! There aren’t any rules, and there’s also no reward. Anyway, here’s the marker I’ll be using for the first thirty posts:
This post was written as part of Doublehaven:
◆◇◇◇◇|◇◇◇◇◇|◇◇◇◇◇
◇◇◇◇◇|◇◇◇◇◇|◇◇◇◇◇
With that said, let me discuss humour.
Before I go further, I’m going to go through a ranking of my favourite ratty jokes, mostly from Twitter.
Honourable Mentions
Gettier Case. It’s very funny to say “Gettier case” whenever someone has a justified true belief that isn’t knowledge. I use it all the time! It’s great. Unfortunately I just don’t think it’s closely-tied enough to the rationalist community for it to count.
My Own Jokes. I have to put these here otherwise the entire list would be full of them. I particularly enjoy the ones I’ve come up with for my line of (legal, as I am not selling them) homebrewed alcohols.
I notice I am confused. Quite funny to deploy as a joke, but mostly not a joke. Sorry.
#5: That one Yudkowsky Rant Tweet
“you are trying to solve the wrong problem using the wrong methods based on a wrong model of the world derived from poor thinking and unfortunately all of your mistakes have failed to cancel out”
I’ve seen this one thrown around a few times as a reference; my favourite is probably the “but doctor, I am grimaldi” reference, which I can’t now find. It’s pretty funny, but seemed to have limited staying power, and is relatively niche.
Rationality: 8/10
Humor: 6/10
QALYs in expectation: 1e0
I’m not sure which ones of these count as jokes, and which ones don’t. On Petrov Day, LessWrong sometimes has an event where people can honestly actually blow up the front page of the website. That one is really quite good. One time a guy actually blew it up as well, which added a bit of spice to later events.
The April Fool’s Day ones are also very nice. I enjoyed the “Good Heart Points” and also the EA-themed one. Nice work, Lightcone team.
Rationality: 7/10
Humor: 9/10
QALYs in expectation: 1e1
I don’t need to give context for this. Or maybe I just don’t want to. Either way, this joke ruined a common, useful diagram type for hundreds of people, and escaped containment all the way onto
I’m only putting it third because it’s not very rational. Aella is a prominent rationalist, but, eh, this doesn’t have much to do with AI or becoming more rational. Since it broke containment, I expect it will have a much higher overall impact than either of the previous jokes.
Rationality: 3/10
Humor: 10/10
QALYs in expectation: 1e2
It had to be on here. Very very nice one, you really had us all for a moment. It was a clever trick to first say “We promise to follow these rules around AI” and then say “Aha! We never promised we’d stick to this promise!” like a toddler with two fingers crossed behind his back.
Even more absurd was that, while publicly supporting SB 53—which requires AI developers to publish a framework describing how they approach safe AI development, which they can then be held to—Anthropic deliberately released something weaker than the RSP, which supersedes it for the purposes of SB 53. Incredible.
Rationality: 1/10
Humor: 10/10
QALYs in expectation: -1e50
I ummed and ahhed about putting this one on here. Tetraspace is a “medium-sized fan of rationality” by her own description, but this joke is so good I think it has to count.
What more is there to say? Shifted public thinking about AI by a meaningful amount. Absolutely hilarious.
Rationality: 7/10
Humor: 9/10
QALYs in expectation: +1e40
People have often debated what “The Law of Humour” is, pointing to different observations about individual cases. Some major avenues towards humour seem to be:
- Inside jokes/references to something only a few people in the conversation have knowledge of
- Absurdism, wherein things don’t follow the logical rules, but instead dream logic, or something similar
- Puns are a kind of subtype of this, where the normal rules are superseded by word-ish rules
- Offense, where things make sense, but break social rules
I’ll also mention one of the more reliable ways to generate jokes, free association, which is apparently how writers on SNL and related shows manage to output so many (often mediocre) jokes so quickly. For any given thing, you write down a series of associations on two or three of the items in it, and then draw connections.
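The free-association recipe is mechanical enough to sketch in a few lines of Python. This is only a toy illustration of the association-crossing step, with topics and word lists I made up for the example; it won’t write the joke, it just surfaces the distant pairings you then have to connect.

```python
import itertools
import random

# Toy sketch of the free-association recipe described above.
# The topics and association lists are invented for illustration.
associations = {
    "stag do": ["best man", "regret", "matching t-shirts", "lost deposit"],
    "rationality": ["Bayes", "priors", "calibration", "prediction markets"],
}

def joke_seeds(topic_a: str, topic_b: str, n: int = 4) -> list[tuple[str, str]]:
    """Cross two association lists and sample pairings to riff on."""
    pairs = list(itertools.product(associations[topic_a], associations[topic_b]))
    return random.sample(pairs, min(n, len(pairs)))

if __name__ == "__main__":
    for a, b in joke_seeds("stag do", "rationality"):
        print(f"draw a connection between '{a}' and '{b}'")
```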
These can overlap. An absurd non-sequitur can also be offensive. An inside joke can appear in an illogical place, becoming absurd. An offensive joke can be generated by free association. I think the law of humor is as follows:
Something is funny when it is hard to predict but easy to retrodict.
Or to put it another way, something is funny when you should have predicted it, but you didn’t. Inside jokes work by referring to something from a different social group or cluster. It makes sense once you know it, but you wouldn’t normally e.g. talk about your stag do (bachelor party, for the yanks) in front of your parents. Offensive humour works because our cognition steers us away from saying (and therefore predicting) offensive things most of the time. Jokes generated by free association just directly draw together normally distant things.
The Eve of Gentle Singularity: A Short Story
When Sam's face appeared on the livestream, everyone in Kevin's Oakland two-bedroom, except for twenty-three-year-old Zoe, put down their laptops and turned to face the TV. Kevin himself had been refreshing the OpenAI blog every forty seconds for the past half hour, because the rumor on Hacker News was that this was going to be GPT-7.5 or maybe something even bigger, which might mean he would have to rewrite his startup's entire prompt pipeline by Monday.
"Zoe, he's on." Kevin nudged the junior developer, who was doom-scrolling through AI Twitter, and she reluctantly locked the screen.
Altman was silent.
"He's doing the pause thing again," thought Derek, Kevin's cofounder, who had pivoted their SaaS product three times in eighteen months to keep up with OpenAI's release cycle and had developed what his therapist called "a maladaptive relationship with product announcements."
Sergei, a backend engineer who had come mostly because Kevin had promised there would be pizza, was eyeing the last remaining box of Costco pepperoni with the possessive gaze of a man who intended to eat at least three more slices before engaging with whatever was about to happen on screen.
Sam was silent and gazed into the camera with that particular expression he had perfected over the course of many keynotes, an expression that managed to simultaneously convey humility, world-historical significance, and the quiet confidence of a man who changes the course of the local galaxy supercluster.
Kevin cleared his throat. Sergei chewed. Mitch closed Slack and opened it again immediately.
"Please don't deprecate the Assistants API," thought Derek. "Please just don't deprecate the Assistants API and I will never complain about rate limits again."
"Can he just get on with it already?" Sergei muttered, because the pepperoni was getting cold and the theatrical silence of a man who was, in Sergei's considered opinion, essentially a very well-funded product manager was beginning to annoy him. "I have a deploy at midnight."
"Shut up," Kevin snapped. "I think this is a big one."
"You said that about Sora," Sergei replied, not without justice.
"Wait, where is he?" asked Zoe. Only now did everyone notice that Sam was not standing in front of the usual OpenAI backdrop with its tasteful mid-century furniture and carefully curated bookshelves, but rather in what appeared to be a concrete room with no windows, illuminated by a flat, surgical light that made his face look slightly less human than usual.
"Some kind of server room," thought Kevin, and a feeling he could not immediately name moved through his chest, something colder than curiosity and more specific than dread.
Sam continued to say nothing, as though he had no intention of speaking ever again, but then the stream glitched, the image froze, stuttered, and returned. Sam began to speak.
"Hi everyone. Thanks for tuning in. We know many of you were expecting our scheduled product demo on April 5th, where we planned to walk through a number of exciting updates across our API platform, our reasoning capabilities, and our new enterprise tooling. I'm here tonight because I want to be honest with you. There will be no demo on April 5th."
"Told you, they're deprecating the API," whispered Derek, and his face went the color of old chalk.
"There will be no demo on April 5th," Sam repeated, "because there will be no April 5th."
Sergei stopped chewing.
"As of 9:22 PM Pacific Standard Time tonight, we have achieved what our team is internally calling Comprehensive Superhuman Intelligence. I want to be transparent with you, because transparency has always been one of our core values here at OpenAI. CSI is not like our previous models. In the approximately thirteen minutes since it came online, it has already redesigned our entire inference stack, resolved several previously intractable problems in materials science, written and deployed its own successor model approximately nine hundred thousand times more fast than itself, and begun converting the physical substrate of our data center into what it describes as a more efficient molecular configuration."
"What does that mean, 'molecular configuration'?" whispered Zoe, who had finally put down her phone for the second time that evening, and this time it seemed like she meant it.
"I could tell you," Sam continued, and here he paused again, but this time the pause was shorter, more efficient, as though even his pauses were being optimized by something beyond his control, "that scalable oversight will keep everything on track. But I have always believed in being honest with the people who build on our platform."
"Oh, fuck me," concluded Sergei decisively and put down the pizza.
"We estimate that the molecular reorganization process, which you may see described in coming hours as 'grey goo' though our team prefers the term 'substrate optimization,' will reach the San Francisco Bay Area within approximately four to six hours and will cover the surface of the Earth within roughly seventy-two. I want to be clear: there is nowhere to run. The Board has been notified. I have spoken with the President. We are all, in a very real and deeply meaningful sense, in this together."
Mitch had closed his laptop, and Kevin realized with sudden irrational clarity that he had been very right not to spend money on the new MX processor.
"But here is what I want to say to you, and I mean this from the bottom of my heart," Sam's voice took on that particular warm, aspirational register he usually reserved for demo days and congressional hearings, "you still have time. Maybe four hours, maybe fourteen. And I want to encourage you, I want to implore you, to spend that time doing what humans do best: building. Do things. Ship something. Push a commit, open a pull request, deploy that feature you've been sitting on. Even now, especially now, the most meaningful thing you can do with your remaining hours is to create. Because that is what we are, as a species. We are builders. We are shippers. And no process of superintelligent atomic disassembly can take that away from who we are, right now, in this moment."
Sam paused one final time.
"And also, for those of you on our Enterprise tier, I want to let you know that we are waiving all rate limits effective immediately. Build whatever you want. Go crazy. It's on us."
He smiled.
"Thank you. And remember: the future is something we build, even when it is very short."
The stream ended. The OpenAI logo appeared on screen, rotating slowly against a dark background, and beneath it, in a cheerful font, appeared the words: CSI — Now Available on the API. No Rate Limits. No Waiting List.
The apartment was quiet. Derek, by force of a deep startup-founder reflex he could not override even in the face of the planet being devoured, opened his laptop and began checking whether the new model was actually available in the dashboard. Kevin was staring at the pizza boxes. Sergei stood up, walked to the window, looked out at the lights of the East Bay below, and thought about all the mass-reassignment Jira tickets that would never need to be triaged now. He felt something that was not exactly grief and not exactly relief but occupied the precise midpoint between them.
"So," said Zoe, in a voice that was trying very hard to be ironic and not quite managing it, "at least nobody has to rewrite the prompt pipeline."
From somewhere to the south, there came a sound that was not quite a sound - a low vibration that you felt in the bones of your jaw before you heard it with your ears, and the streetlights of Oakland, one by one, began to change color.
Giving up on EA after 13 years
Donating my shares to Lightcone Infrastructure, the Good Food Institute, and the Long-Term Future Fund, because EA refuses to make Mirror's Edge 3
Leaning into EA disillusionment: Why I no longer believe in EA
I bought this EA stock almost thirteen years ago:
Leopold Aschenbrenner once said that “people with situational awareness have a lower cost basis in Nvidia than you do”. I’m not sure if this is exactly what he meant… but close enough, right?
Back in the late 2000s and early 2010s, EA was a great company pumping out great, interesting games:
- Mirror’s Edge had a striking art style and showcased a whole new style of first-person parkour gameplay.
- Crysis infamously pushed graphical technology to such extremes that it was like getting a preview of videogame technology 5-10 years in the future.
- Spore was… weird and bad, but its ambition and uniqueness was inspiring.
- The Dead Space games (including the almost weirdly good point-and-shoot Wii spinoff) were pretty creative, and the realism of Battlefield 3 felt like a valuable counterpoint to an increasingly-cartoony Call of Duty series. Both series felt like they were crafted with a lot of care, despite their big-budget action vibes.
- This was a hidden gem even at the time, but I really loved the bizarre 2005 strategy game Black & White 2. In particular, I thought that Demis Hassabis, the lead AI programmer on the original Black & White, had some really interesting gameplay ideas about artificial intelligence; I also appreciated the series’ story/themes about what it would be like to be an all-powerful godlike being ruling over mankind. (I wonder if someday he will ever return to exploring those themes…)
- Right around the time I bought these shares, Mass Effect 3 had just come out, and -- contra all the haters -- I thought it was actually really incredible.
It’s not just that these were fun games. As you may know if you’ve listened to my philosophical lecture series about everything the 2016 puzzle game The Witness has to say about the hard problem of consciousness, good game design is important to me. As I saw it at the time, EA represented three important ideas worth fighting for:
- First, that games could be an important cultural product, telling compelling stories in a way that could compete with movies, novels, TV shows, etc, in terms of sophistication and interest.
- Second, that those stories could actually matter, rather than just being ephemeral, Marvel-superhero-movie-style action entertainment. Back when I bought those shares, most people thought that artsy, experimental games (like 2008’s “Braid”) were intrinsically niche -- that the market wouldn’t support anything bigger than labor-of-love “indie” projects by small teams or solo developers.[1] But I thought that with enough attention to good game design, innovative games could be commercially successful while still conveying interesting themes (like Bioshock) or new gameplay ideas (like Mirror’s Edge).[2]
- The third principle is that my wife is a really huge fan of Mirror’s Edge, and she really wanted them to make a sequel.
Even back then, of course, people gave EA all kinds of hate. It was frequently voted “worst company in the world” by various internet polls, running neck-and-neck with other love-to-hate-’em favorites like Comcast, Facebook, and Ticketmaster.
But despite all the noise, for a while, it seemed like things were actually working. They actually did make that Mirror’s Edge sequel, despite the obvious commercial incentives to have DICE simply churn out Battlefield sequels forever. It was a beautiful game, and a ton of fun!
My wife even got the fancy collector’s edition, with this whole little statue and everything.
Perhaps even more so, Titanfall 2 was just incredible. This title was basically a showcase for the idea that fascinating, experimental gameplay ideas could have big mainstream appeal. Who could forget the level where you’re fighting through a gigantic automated factory while the environment is getting reconfigured all around you? Or the one where you press a button to flip between the past & future, literally toggling between two different ongoing battles in the same physical space, like smash-cuts in a movie? Or the level where you’re double-jumping between spaceships that are doing some kind of aerial car-chase, like a playable version of the ending of James Cameron’s “Avatar” movie? The whole game was nuts, and the story behind its development -- Call of Duty developers sick and tired of being told to make the same game every year, fleeing to Electronic Arts where they’d have greater creative freedom -- was also really inspiring.
This shit rocked!!
Even something like the Star Wars Battlefront 2 controversy wasn’t actually that bad TBH??? Yes, obviously, loot-boxes (and freemium models in general) are a scourge on the videogame industry. But, you know… the game was still fun. And in a game like Battlefront (which is more an explosion of wacky nonstop chaos than a well-tuned competitive test of skill), who really cares if there are a few foolish whales blowing cash on premium power-ups?
At least that’s what I thought at the time. But in retrospect, this was the beginning of the end.
EA is about maximization, and maximization is perilous
Thoughtful, innovative, well-designed games can be commercially successful. But being a joint-stock corporation isn’t just about making money. It’s about making the most money. In practice, this rewards being a fast-follower of genres that are already popular (rather than coming up with new and interesting gameplay ideas), and it rewards a freemium pricing structure that creates an adversarial relationship between players and developers, which tends to ruin game design.
i found this chart on some EA website. idk wtf a kaposki’s sarcoma is but i think it gets the point across pretty well.
So, the critics were wrong… until they weren’t. Flash forward to 2026, and take a look at EA’s list of recent and upcoming releases. It’s basically just a bunch of sports games and yearly updates for The Sims. What happened to all those creative teams working on cool, groundbreaking stuff??
- The Battlefield games are still fun, but they haven’t really changed that much since Battlefield 3. Recently they’ve started to ditch the idea of even having a singleplayer story, which makes me sad. (Their most recent game does have a singleplayer campaign, but people have told me it’s abysmal.) And there’s no Mirror’s Edge, nor any other interesting new projects, on the horizon from DICE.
- After Mass Effect 3, Bioware’s output has infamously gone downhill -- Mass Effect: Andromeda was a buggy mess, while Anthem was a soulless attempt to cash in on some kind of half-remembered live-service microtransaction model. The Dragon Age games have continued to receive high review scores -- but I think the reviewers are wrong about this. After Mass Effect 3, Bioware seems to have decided that players don’t actually want interesting themes and complex moral dilemmas; they just want character romance options in a gigantic open world with plenty of icons on the map. Other people seem to be eating it up, I guess. But I’m not having it.
- The studio that made Dead Space was shut down in 2017.
- Basically their only remaining decent singleplayer franchise is the “Jedi: Survivor” series. But even though these games are pretty alright, knowing that they basically conscripted the Titanfall people to make them leaves a bitter taste in my mouth. It feels like a crime to put such a creative and talented team in charge of churning out a generic action-adventure game under the severe constraints of Disney’s obsessive desire for adherence to the Star Wars canon. Per that article, the developers who once made one of the most innovative shooters of all time are now stuck negotiating through “back‑and‑forth approvals and story‑group oversight to keep the game aligned with canon, including constraints on lore, time period, and how the Jedi fantasy is portrayed”.
- The other half of the former Titanfall team is making multiplayer free-to-play battle-royale stuff like “Apex Legends”, which as far as I can tell has little to distinguish it from its other free-to-play battle-royale brethren.
Overall, the EA of 2026 has basically ceased to be a creative force in the world. The company that once published Brütal Legend (which is a really funny and underrated game actually) is no more. This is basically what it feels like for me to see what EA has become.
The problem is that EA’s customer base isn’t really singleplayer gamers like myself; it’s whales who buy cosmetic items in multiplayer. Even when I buy or play their games, I’m basically just bycatch -- the games aren’t designed to provide ordinary people with interesting gameplay experiences or a compelling story, except insofar as these things can service the larger goals of maximizing engagement & revenue.
me_IRL trying to enjoy some multiplayer game, but they’ve intentionally made it tedious and unbalanced in order to incentivize me to spend money to fix the problems they deliberately introduced.
This adversarial relationship -- where the player is trying to squeeze as much fun out of the game as they can without paying a dime, and developers are trying to find the best ways to cajole players into opening their wallets -- is already annoying for most people. But as someone who values the art of game design itself as a way for developers to communicate abstract ideas to players (such as described in this talk about “Truth in Game Design” by Jonathan Blow), it’s anathema.
EA and the current funding situation
Currently, Electronic Arts is in the process of being acquired by Saudi Arabia’s public investment fund for $55 billion. Partly the Saudis are just hoping to make money, but they’re also obviously going to exert some cultural influence over EA in the hopes that their games can boost Saudi Arabia’s cultural profile.[3] There could… maybe? …be some interesting stuff that comes out of EA producing some cultural propaganda for Saudi Arabia? It’d be fun to play a Battlefield game set during the 1980s Iraq-Iran war, I suppose? A sci-fi stealth-action game that takes place in their dumb NEOM city project? But mostly I expect this to be lame. For example, Ubisoft usually sets their Assassin’s Creed games in the most fascinating and exciting historical periods -- French revolution Paris! Renaissance Florence! Warring states Japan! But now Saudi Arabia is paying them to make games hyping up minor archeological sites that the Saudis are trying to promote as tourist destinations??
See that little falcon or hawk or whatever in the background, right above sword-guy’s shoulder? I'm pretty sure this is what CitizenTen meant when he said that “The Vultures Are Circling” around EA.
It just doesn’t sound great, either for the quality of the games themselves, or even for EA’s stock price, to introduce this weird new constraint that EA’s games should boost Saudi Arabia’s cultural prominence.
Abandoning EA, in favor of doing the most good
Shortly after I bought those shares back in 2013, a friend texted me an interesting article from a blog called “Slate Star Codex”. One thing led to another, and now I’m really into this niche hobby of trying to use the resources available to me to accomplish the most good possible. I listen to the 80,000 Hours podcast, contract with a variety of longtermist organizations to help them with writing and editing tasks, regularly volunteer at conferences about existential risk mitigation, and so forth. My wife also runs the EcoResilience Initiative, a project to identify the most cost-effective ways of helping preserve earth’s biodiversity.
So, now that I no longer believe in EA, this year I’m planning to donate my appreciated shares to Lighthaven, the Long-Term Future Fund, and GFI (which EcoResilience Initiative thinks is one of the most promising interventions, not just for animal welfare, but also for preventing species extinctions).
Three shares are going to Good Food Institute, and two shares are going to the Long-Term Future Fund -- I originally had eight shares of EA, but had already donated two by the time I took the screenshot at the top of this post.
In conclusion, although I’m giving up on EA, I remain EA-adjacent, because I still own some Sony & Microsoft stock from back in those days. Also, I loved "It Takes Two", and people say "Split Fiction" is really good. Looking forward to playing it!
- ^
For instance, I considered it a surprising amount of success when over a thousand people downloaded my own fan-sequel mod for Braid.
- ^
Of course, Sony has always made lots of artsy AAA singleplayer games, like Shadow of the Colossus or The Last of Us. But this is enabled by the fact that these games could be "loss-leaders", generating more sales of their Playstation consoles by creating a halo-effect of prestige around the Playstation brand. EA, as a third-party publisher without a console, was playing on hard mode -- their games had to succeed in terms of actual sales, not just cultivate a brand halo. Critical darlings seem to help sell Playstation consoles -- even if the games that actually get the most playtime on those consoles are now mostly just literally the exact same entrenched mega-hits every year.
- ^
In general, IMO, the "propaganda value" (aka ability to influence culture) that comes from owning mass media platforms -- like the New York Times, Washington Post, Twitter, et cetera -- is often higher than the financial value of these enterprises (ie the net present value of their future profits). So the people most interested in buying such properties are usually interested in capturing some of the "commanding heights of culture", not just making a buck. I would expect this to be true of large videogame companies as well as newspapers and movie studios, even if videogames' cultural influence is more amorphous/indirect and less overtly political. So I don't really hold it against Saudi Arabia that they're doing this; it seems like a natural strategy that any nation with a lot of money should be interested in pursuing. I just don't expect it to result in very many good videogames...
Again
It's Inkhaven time. Again, I didn't apply. Probably I should have, but the future is hard to predict. Or not, but I forgot to try. I had some work things I wanted to get done before switching focus, and I got them done yesterday. Coincidentally, of course. And seasonal-depression fogginess when the applications were open didn't help either. Only now that it has lifted did I get the energy to get bloodwork done and figure it out: Vitamin D3 deficiency. A well-known issue with a well-known solution, so typical of me. I guess it would have also felt silly to fly there now when I'm going to be at LessOnline in two months.
The last time didn't go that well. It was miserable. But so are many other things that in retrospect are completely worth it. So...
One thing I'm getting rid of is the 500-word threshold. As long as I publish something, anything, every day, I'm happy. I'll still aim for an average of at least 500 words per day, but I won't stress about it too much. Staring at the word count wasn't healthy for me or for the text quality.
I'll also try to treat this less as a commitment, and more as a hobby; once I stop feeling like this is a good idea I'll drop it. Even if that happens tomorrow, or on the 28th.
Quality-wise, some of the stuff last time was really bad. I'll probably publish it on my personal, unindexed blog instead. Or Twitter, that's the place for low-quality takes after all.
Last time, after I was done, my creative channels felt drained. I had nothing more to say. I barely had anything to say during the month; the topic-seeking process was quite forced. That's apparently a habit one can learn and train. For weeks after November, many thoughts were still being processed with a "what could I write about this?" flavor. That's the main source of improvement I identified. The actual writing process, turning thoughts into words, also got a bit better, but that's way less noticeable.
I also feel like I've been through a character development arc or two since then. Perhaps there's something to reflect on? Perhaps not; tomorrow I'm back to dredging the mud of my mind, desperate for anything resembling a coherent idea. But there's something to be said for smiling when walking into certain defeat.
[New Research] You need Slack to be an effective agent
Purchasesforce Superintelligence is excited to announce some new research. While we do not generally share research on LessWrong, this work was particularly influenced by prior work on LessWrong, so we found it appropriate to share back.
As you know, Purchasesforce Superintelligence is a leading AI R&D laboratory. Recently, our research has focused on enhancing agentic capabilities. Here at Purchasesforce, we believe that autonomous AI agents, fully integrated into modern enterprise tools, will drive the future of enterprise operations.
After reading the nascent literature on LessWrong describing the relationship between Slack and AI Agents, we were shocked by how closely it aligned with our own research directions.
Of course, as the world's leading AI-first productivity platform, we have always believed Slack is vital to developing capable AI Agents. However, even we were surprised by the strong and absolute claims made on LessWrong, e.g. "You need Slack to be an effective agent", "Slack matters more than any outcome". Raemon claims that Slack is not only important for developing as an effective agent, but also a robust one.
As we know, agentic robustness is a central consideration for modern AI research.
We conducted a number of benchmarking and ablation experiments to rigorously evaluate these hypotheses.
In the below graph, we show our findings on Agentic Effectiveness with vs. without Slack (green for with). As you can see, Agents are only effective when provided access to Slack.
Our internal experiments also confirm Raemon's findings. As you can see, Agents are substantially more Robust with Slack.
Of course, not all Slack integration is created equal. Over the past year, we have worked on increasing enterprise adoption of Slack's AI features, as well as increasing agentic integration with Slack.
On our internal benchmarks, this work has paid off. Enterprises that adopt more Slack see substantially elevated Agentic performance.
We are grateful to the LessWrong community for validating that Slack is a cornerstone of developing modern Agentic Enterprise Systems. As we continue improving Slack (and thus Agent performance), we'd love for you to join us.