
LessWrong.com News

A community blog devoted to refining the art of rationality

Try seeing art

Published on November 20, 2025 7:25 PM GMT

tl;dr: if you find visual art boring it might be because you haven't seen it yet.

Visual art can provide soul-sustaining value to those who appreciate it. It does for me. I don’t want to say too much theoretically about the what and why of art, for reasons that will become clear. But I want to reassure you that I am an advocate of rational thought and analysis, and I think art absolutely can admit such discourse. I also think that art can be objectively good or bad, and many things labelled “art” are not even art, in the sense I use the term. But I think such conversations are counterproductive to the goal of receiving value from art. 

The way I grew to love visual art seems to be different from how others typically approach it, or from what is taught in art appreciation classes. So if you are a person who currently doesn't get anything at all out of looking at visual art, and you are game for an experiment, here is what worked/works for me:

Go to an art museum that includes diverse works. A small museum is less overwhelming; if you only have access to a huge museum, plan to visit only a small part of it. If you are in/from the Western world I’d recommend starting with museums whose collections span Western art history from at least 1400 to 1950. Earlier and later periods and other parts of the world have gems, for sure, but you don’t want to miss out on those years of Western art, and I'd wager most rationalists will find something that resonates within that time period.

Solitude is key. Go alone, or agree with your companions to decouple for the first hour to try this exercise – and not to discuss it or ask any questions about it when you reunite, for at least 24 hours. Otherwise, too much of your mental space will be occupied by thinking about what you will say later.

More important than what you do, is what not to do.  Absolutely do not read the little plaques on the wall. Do not take a docent-led tour or get a recorded tour on headset. In fact, if you accidentally bump into a tour guide explaining an artwork, go the other way. Stop yourself the moment you catch yourself thinking anything like: Who was the artist? What is the title? Have I seen this artwork before? Is this famous, or great? What era or region is this from? What is this style called? What techniques were used to make this? Was it hard to make? What was the artist’s social and cultural context? How original or innovative is this compared to other art that came before it, or that I have ever seen? Why is this labeled “art” or placed in a museum? You can engage those questions for other reasons another time if you want to. But if you are new to consuming art, such thoughts will most likely kill your chance of learning to receive art.

Once you are inside and alone, take a deep breath and try to quiet your analytic mind and enter a state of nuanced receptivity[1]. For the next hour there is nothing for you to succeed or fail at. Your goal is just to be acutely conscious of your visual perceptions and the subtle emotions or sensations within your body that might or might not be evoked.  Don’t force yourself to study every piece. Walk into each room and briefly look at each piece, resting your eyes on it just long enough to see what it is. If you notice that something grabs you, give yourself permission to pause and look at it for a few minutes. It doesn’t have to be a deep feeling or a good feeling; don’t try to name the feeling or even discern its valence. Walk up to the piece and look at it closely. Cross the room and look at it again from a distance. Try to notice, not name, what you actually see, and how it feels. Continue looking until you find that you’ve lost contact with the feeling it evoked. Make a note of it (if allowed, take a picture with your phone) so you can find or remember it later, and move on. For a person who doesn't already love art, an hour is plenty of time to spend in an art museum.  If it starts to feel like work or you lose touch with nuanced receptivity, take a break, or quit for the day.

If you get through a whole museum and nothing strikes you, you are free to leave, but if your first impressions warrant, you might want to give it a second pass to see if anything jumps out at you anew. Walking in the reverse direction helps, so you enter each room from the other perspective.  It’s fine to pass through every room in an entire museum and not see a single thing that grabs you. It’s also fine to be stopped in your tracks in the first room, stare at one piece for an hour, and miss the rest of the entire museum. I’ve had both experiences. Typically I spend 20-30 minutes quickly browsing through a small museum or show in its entirety,  and then circle back to the pieces that most struck me and spend the rest of my time with those, until I get tired. I've noticed that I seem to be the only person who does this.

If something does move you, I recommend resisting the temptation to analyze or interpret it, or articulate what the meaning of the piece is, or explain why it evokes the feeling it does in you, at that moment. Save that for another day. Resist second-guessing whether you ought to have this response, or what it might signal about you if people knew how you felt about it. There is just too much risk of sliding into intellectual analysis or self-judgement or social signaling, and losing the directness of its impact.

For me, once I've picked a favorite to spend time with, I try to mentally block out everything else (the walls, other artworks, other people) so that I'm only aware of that one piece. I like to sit down, if seating is available.  I start to name in my mind what I actually see. “Snow, in the woods, a frozen stream... not snowing now... weak daylight... tree trunks... human footprints, one person walking from there to there... animal footprints elsewhere, seem unrelated...  a broken down fence? anyway something manmade, but rustic...  finer branches and dried reeds... snow piled up on them.…”[2] 

Maybe I start to notice that I like this part of it, but not some other part, or I notice that I wish a certain aspect were different. Perhaps I notice that it reminds me of a particular experience or place I have been, which may come with associations. All good.   Introspectively, some people may be able to focus on physical sensations in their body, but I personally find this difficult. I focus on noticing emotional tones - when I enter this universe, does it feel comforting? disturbing? cheerful? dark? cozy? aching? etc.  

On my way out, I like to go to the museum shop and buy postcards of any pieces that struck me, and keep a scrap book of those. 

That’s it. You might have to do this a few times before anything starts to click, but it’s not too time-consuming or painful. Try different museums. It’s worth periodically going back to ones you thought were “just ok” but where nothing in particular struck you, because the things you need or are ready to receive will change over time.

The skeptic may be thinking: “sure, but give me one reason to think this will ever get me anywhere.” I will write up my theoretical thoughts another time about what the value proposition of art is, why rational humans need it, and how I think it delivers this value. I'm avoiding that now because engaging that theory could ruin the above experiment.

  1. ^

    By "nuanced receptivity" I mean both a state of heightened perceptual awareness (of the external world as well as your internal body), and a state of emotional unguardedness. In my experience, this state is difficult to find for many smart rational people, may initially be uncomfortable, and may require persistent practice to find it at all. But it is a learnable skill, and this state can be entered to a greater or lesser extent.

    As a side note, a bonus benefit of practicing seeing art may be strengthening the capacity for nuanced receptivity in other domains of life. I find this capacity to be crucial for making and sustaining close personal relationships (romantic or other), and it seems likely to be important for receiving other types of art, engaging beneficially in psychotherapy, and perhaps for enjoying the benefits some receive from being in nature, interacting with animals, or doing yoga.

  2. ^

    After writing this, I happened to go to a museum in a city I was visiting, and made a point of noticing what was going through my head when looking at a particular painting. I didn't read the plaque so I can't tell you the title or who it was by, although I did take a picture of it, a detail of which is the splash image.




Thinking about reasoning models made me less worried about scheming

Published on November 20, 2025 6:20 PM GMT

Reasoning models like Deepseek r1:

  • Can reason in consequentialist ways and have vast knowledge about AI training
  • Can reason for many serial steps, with enough slack to think about takeover plans
  • Sometimes reward hack

If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).

Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments - even when there is no harmlessness pressure on the scratchpads like in Deepseek-r1-Zero.

In this post, I argue that:

  • Classic explanations for the absence of scheming (in non-wildly superintelligent AIs) like the ones listed in Joe Carlsmith’s scheming report only partially rule out scheming in models like Deepseek r1;
  • There are other explanations for why Deepseek r1 doesn’t scheme that are often absent from past armchair reasoning about scheming:
    • The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and programming problems;
    • Short speed-prior-constrained reasoning and long reasoning are correlated: the weights that generate the final tokens, that generate the CoT when producing short CoTs, and the ones generating the CoT when producing long CoTs are the same, and while it would be possible to train the model to have different “personalities” in these three situations, the prior puts a high weight on these three personalities being similarly benign.
    • Reward hacking would need to be cursed to strongly push against the mostly benign short-reasoning human priors: it does not just need to encourage unintended behaviors; it needs to encourage the kind of unintended behavior that strongly favors schemers - and I argue current reward hacks mostly aren’t that kind;
  • These reasons partially apply to the first somewhat-better-than-human-level AIs (the AIs which Joe’s report is about) if AIs are built using anything like current techniques (including if human-legible CoTs stop playing an important role).

These considerations do not update me much on AIs that are vastly superhuman, but they bring my P(scheming) for the first AIs able to speed up alignment research by 10x from the ~25% that I might have guessed in 2022 to ~15%[1] (which is still high!).

This update is partially downstream of my beliefs that when AIs perform many serial steps of reasoning, their reasoning will continue to be strongly influenced by the pretraining prior, but I think that the arguments in this post are still relevant even if you don’t share these beliefs:

  • Because I think there might be reasons other than human priors to expect strong and potentially benign priors/pressures on intermediate reasoning steps (which may still apply in different ways even absent a pretraining prior)
    • For example, the high correlation between short and long reasoning will likely apply even if reasoning steps are not strongly influenced by the pretraining prior
  • Because of the meta-level update that the first time we built something that was plausibly capable of scheming, it surfaced reasons for comfort that people who did armchair reasoning about scheming did not forecast (publicly, as far as I know).
    • I think there have been some updates in the other direction (e.g. AIs can come up with scheming as an instrumental strategy given in-context and out-of-context pressures, AIs can surprise developers with reward hacks or with misaligned deployment behaviors) - but I think it would be a mistake to only update in the direction of there being more danger.
    • (I don’t think this meta-level update applies to the part of the AI safety discourse that is specifically about vastly superhuman AIs and that does not claim to make any prediction about AIs before those.)
Classic reasons to expect AIs to not be schemers

Speed priors

In Joe Carlsmith’s report, speed costs are one of his main object-level objections to scheming arising in somewhat-better-than-human-level AIs:

Suppose that our model is a purely feedforward network with 1000 layers and <10,000 serial steps per input, that the instrumental reasoning at stake just in a schemer deciding to training-game (let alone: checking for defection opportunities) takes at least ten steps, and that the model doesn’t begin optimizing for reward-on-the-episode until that reasoning is complete. In that case, the instrumental reasoning is using up around .1% of the compute that could be going to getting reward-on-the-episode, or to some other max-reward goal. [...] Because of the centrality of “counting arguments” to the case for schemers, I think that questions about the strength of the selection pressure against schemers—for example, because of the costs of the extra reasoning schemers have to engage in—are especially important. In particular: I think a key way that “counting arguments” can go wrong is by neglecting the power that active selection can have in overcoming the “prior” set by the count in question. (page 20 and 21)

But I would guess speed costs don’t matter for AIs like Deepseek-r1-Zero.

Prediction 1: there is enough training slack and sufficiently poor exploration in GRPO that given a good scheming initialization (e.g. a short scheming reasoning section at the start of the scratchpad), scheming would not be trained away even with a mild length penalty with an amount of RL comparable to the one Deepseek-r1-Zero went through. (p=0.75)

Note: the credence in this post is about good execution on the concrete experiment I suggest, not the top-level claim. I often have more credence about the top-level claim than the concrete experiment because there are spurious ways in which the experiment might fail.
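To make the “mild length penalty” setup in Prediction 1 concrete, here is a minimal sketch (my own illustration, not DeepSeek's implementation) of GRPO-style group-relative advantages with a small per-token penalty folded into the reward; the constants, rewards, and lengths are made up for the example.

    import numpy as np

    def grpo_advantages(task_rewards, lengths, penalty_per_token=1e-4):
        """For one prompt's group of samples, normalize (task reward - length penalty)."""
        r = np.asarray(task_rewards, float) - penalty_per_token * np.asarray(lengths, float)
        return (r - r.mean()) / (r.std() + 1e-8)

    # Example: a ~200-token scheming preamble costs ~0.02 reward, which is small
    # relative to the within-group variance that drives the advantage signal.
    print(grpo_advantages([1.0, 1.0, 0.0, 1.0], [1200, 1000, 900, 1100]))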

Preconditions for scheming not being met

Joe also exhibits properties that he thinks somewhat-better-than-human-level AIs are likely to have, and the absence of which would undermine scheming:

  • Training gaming performing better than directly trying to solve the task
  • Situational awareness (optional, as it could be encouraged by training in the training-game-dependent argument)
  • SGD “noticing” instrumental training gaming

I think these are slightly more likely to apply to future AIs than they are to apply to Deepseek r1, but I think this depends a lot on how “cursed” the RL environment that will be used for future AIs will be.

Deepseek r1 does hack during training by trying to produce a final answer even when it has no clue what the answer is, but the way this hacking emerges never looks like “let me guess what the reward looks like and try to figure out how to hack it from first principles”. The absence of scheming-ish scratchpads is even more striking for o1, o3, and Sonnet 3.7, which have been reported to produce more evil-looking hacks during training, with the most evil-ish scratchpads looking more like the AI having a deep desire to solve the task terminally rather than something more scheming-ish. The reasons why you don’t get a schemer are somewhat subtle.

Prediction 2: the absence of training gaming is not entirely due to a lack of capabilities, and if you start from a 100% training gaming prior where the AI tries to make guesses in its scratchpad about what situation it is in and what forms of reward are used, RL will quickly make the AI only make mostly correct guesses about what rewards are being used - but it will not make the scratchpads look much more “evil” or use more instrumental consequentialism than what you started with. (p=0.75)

Overall, I don’t think Joe was very wrong about his overall assessment. I think he correctly updated towards ignorance by noticing that the story felt conjunctive and defeasible (and the arguments in this post are just concrete instantiations of new ways in which the scheming story might break):

[...] On the other hand, ascribing a model’s good performance in training to scheming continues to feel, at a gut level, like a fairly specific and conjunctive story to me. (page 21)

There are indirect pressures against scheming on intermediate steps of reasoning

If you summoned a human-level mind just through end-to-end black box optimization from something like a recurrent neural net prior, I agree concerns around scheming and training gaming would be high. But actually, it’s quite common to have other kinds of pressures on intermediate reasoning - at least for AIs that are not wildly superhuman.

Note: I think these considerations have been discussed in the past (but I haven’t found a link); they are just not discussed in Joe’s report - probably because the most salient AIs at the time it was written were transformers used via a single big forward pass.

Human priors on intermediate steps of reasoning

Caveat: I think this argument is not very strong, as it depends a lot on what the future of capabilities will look like. But it is extremely relevant to explain why Deepseek r1 doesn’t scheme.

For current models, most of the bits learned by reasoning models come from pretraining. You can distill a reasoning model back into a base model with a few MBs of transcripts, but you can’t quickly distill what a base model learned back into the random initialization with much less information than its terabytes of model weights. This means models can’t learn that much during current RL, which means their reasoning is somewhat closely tied to how a human would solve a problem. A few MBs of supervision would be enough to specify a human-like schemer, but for reasoning models I don’t think it’s a surprise that you don’t get a random alien mind: for such a mind, it would be overwhelmingly unlikely to produce human-like intermediate steps of reasoning every time it goes through the sampling bottleneck.

The forward pass of LLMs is very alien, but I don’t think it weakens the conclusion much because the powerful scheming that training selects for probably requires more serial steps than what is contained in the forward pass of current models.

Therefore, if future AIs look anything like Deepseek r1, I think the human prior on intermediate steps of reasoning is relevant: if RL encourages human-like scheming-ness, the model might become a schemer. If it encourages human-like benign task-focused agency, I would guess the model is somewhat unlikely to be scheming.

I think this human prior is somewhat unlikely to be easily removed. Optimizing over very long time horizons is difficult to do end-to-end. People did try to teach models how to use neuralese, but the results are very underwhelming (I think people might find a way to use it for cheaper reasoning in the next few years, but I don’t think it will make the reasoning very inhuman). And my understanding of the human brain is that almost no end-to-end optimization is done this way - everything we learn with our inner RL is very short-term; most of what we know how to do over long time horizons looks more like explicit reasoning extrapolated from short-horizon speech.

Prediction 3: Intermediate steps of reasoning are human-like and will remain human-like, such that you can create legibility bottlenecks every ~1000 steps of serial reasoning: if you decipher the reasoning to the best of your abilities and distill it, performance will remain high for all current models (p=0.8) and for the first AI able to speed up safety research by 10x (p=0.5).

Correlation between short and long reasoning

If, like in Deepseek r1, the weights are shared between the part of the model tasked with producing the user-facing answer and the scratchpad, it makes it more likely a priori that traits from the user-facing answer leak into the reasoning. This seems even stronger for models like Deepseek V3.1 where the weights for non-reasoning and reasoning modes are the same. While the model does learn to behave differently between these 2 modes, it seems natural for the model to have a similar personality across both unless there is a strong optimization pressure against it.

Prediction 4: You can probably somewhat select against scheming by training the model to behave nicely in non-reasoning mode. In particular, it will take at least 1.5x more samples to train a model to be nice with the no-reasoning mode on while scheming when it is off than when only training it with the reasoning mode off, even when controlling for the model having to learn how to use non-reasoning mode. (p=0.8) I expect a similar transfer from user-facing niceness to scratchpad niceness when comparing it to a scenario where the user-facing output is neither nice nor evil (e.g. some short structured answer).

While this doesn’t rule out scheming if you strongly select for scheming, I think that like the argument on human-like priors, it puts some valuable bits of optimization pressure on benign intermediate steps of reasoning.

Other pressures

The existence of human prior pressures and short-long correlation suggests there might be other forms of intermediate pressures that might be very effective. I think the argument for such pressures not being massively uncompetitive mostly holds before it becomes intractable to follow AI’s AI R&D, but I expect that avoiding such fast recursive self-improvement won’t be uncompetitive by the time AIs speed up safety research by 10x.

I haven’t thought about it much, but I think this might be an interesting line of research. It might be the case that simple variations on the data and rewards used during RL / neuralese training make the situation much better.

Rewards are not so cursed as to strongly incentivize scheming

Maximizing rewards teaches you things mostly independent of scheming

Most of RL is not spent learning what task to do, it is spent learning to be better at it. Deepseek r1 was probably trying as hard as it could to solve math problems 10 steps into its 10k steps RL. 99.9% of RL was spent learning what specific math reasoning strategies were the most effective. Most of the optimization pressure goes into learning these domain specific tricks that the model does not know about before RL starts.

This means that there is at most a small amount of optimization pressure towards scheming: above I said that only MBs of information are learned during RL (measured by the weight of transcripts needed), but the amount of bits needed to make the model “try hard and exploit existing knowledge” is probably at most a few KBs.

Using situational awareness to get higher reward is hard

But maybe a few KBs of information are enough to point at scheming despite the strength of the human and speed priors? I don’t think so because the exact way in which to “try hard and exploit existing knowledge” matters a lot and current AIs don’t know in what ways they will be graded. They would have a hard time guessing in a few forward passes what outcomes/LLM-based rewards will be used. And so I would guess it is more efficient for SGD to just learn a particular persona that does the right kind of trying rather than learn a general persona that has the right knowledge and instrumental reasoning about what to do.

Prediction 5: I also don’t think the sample efficiency of training gaming is higher than the sample efficiency of regular learning when the model doesn’t already know the facts it needs to exploit. Adding “Please think about the process that generated the current supervised fine-tuning dataset and choose the tokens that this process would generate” to the system prompt of an SFT dataset won’t increase the sample efficiency of learning compared to other dummy prompts for AIs at and before the 10x speedup level, regardless of the SFT dataset when the initialization loss is similar (p=0.9). I expect similar results when using RL, but only with realistic production-like initialization on realistic tasks, such that exploration of the no-training-gaming prompt isn’t toy-ishly weak.

I think the situation would be worse given much higher knowledge of what gets rewarded during training, in the same way that in the alignment faking paper the “latex” variation sometimes results in higher alignment faking. But I don’t think this will clearly be the case for future AIs and RL setups, or at least not in a way that wouldn’t also make the “benign persona that already exploits these rewards” also very salient.

Therefore, I don’t think that if RL remains similar to the one that was used for Deepseek r1, it will make the model more situationally aware and more likely to have situational-awareness-traits that seem correlated with scheming.

Maximizing rewards doesn’t push you far away from the human prior

Even if MBs don’t push towards scheming specifically, they could push you away from the mostly benign and intuitive human prior on intermediate steps of reasoning, which could make the situation more dangerous and closer to the domain of validity of counting arguments.

But MBs is a small amount of information, and LLMs are mostly trained on next-token-prediction on human text (or AI-generated text generated by non-scheming AIs). Therefore, I expect them to perform best when not straying too far from that distribution.

This argument is not airtight, since it could be that in the very high dimensional space of text there might be certain directions that are LLM-friendly without being human-friendly, and the weird scratchpads of o3 are an example of that. But I think this phenomenon is weaker and less spooky than people think.

Prediction 6: LLMs don’t learn new languages during RL, and will always learn to use languages that non-RLed models can understand quite well. Claude 3.5 Sonnet, GPT-4o and Gemini 2 Pro will be able to understand the topics considered in 99%+ of Chain-of-Thoughts of all regular Transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand (p=0.7).

Will it be different for future rewards?

The situation could well be different in the future, if AI companies find ways to have way more alien training processes than the ones used to train Deepseek, ones that require much more consequentialist reasoning using self-knowledge and knowledge about the weaknesses of humans.

There will be a pressure towards making more complex RL environments that interact more with the real world, but there will also be a pressure towards avoiding environments that AIs know how to game when you ask them to game them - since gaming often results in worse real-world results (there is some misalignment between AI companies and users on things like user retention, but I would guess exploiting user weaknesses to make them stay on a platform is mostly uncorrelated with the sort of scheming that misleads AI developers in scary ways).

I think the kind of environment that seems most scary to me right now is one where some kind of “robust sycophancy” is encouraged: the AI has to guess the AI developer’s intentions/hopes about a particular outcome and has to not only superficially say things that align with the AI developer’s ideas, but also produce a set of carefully non-contradictory evidence that points in the same direction and that addresses / tampers with the questions the AI developer could use to compute the reward. But I am unsure what these would look like in practice, and I am cautiously optimistic that it will be possible to avoid the kind of gaming that is most correlated with scheming.

Meta-level update and conclusion

The world is more complex than the simple models I had a few years ago. Complexity makes things less safe by default. But in a domain like AI where people are trying to build useful AIs that serve their interests and where we don’t have principled reasons to expect things to be safe in the simple end-to-end black-box-optimization setting, complexity can sometimes update us towards the situation being safer than one might have guessed - especially when this added complexity gives more options to AI developers.

Deepseek r1:

  • Has scratchpads that mostly do not contain scheming-like behaviors on the training distribution
  • Can be easily compressed (relative to base Deepseek v3) in a few MBs of transcripts
  • Was trained using a simple training algorithm (GRPO) with a lot of “training slack”
  • Is more competitive than approaches that try to do more end-to-end optimization without intermediate, more human-interpretable and human-like text token bottlenecks

I think these facts are evidence that in this case, the added complexity of the real world makes the situation less scary with respect to scheming (though there are other facts about the world that make the situation more scary with respect to scheming).

And this is despite Deepseek mostly optimizing for capabilities!

This makes me update down on our ability to predict in advance how likely scheming will be in pre-wildly-superintelligent AIs, and up on the possibility of finding prosaic approaches to building AIs that are unlikely to scheme (which becomes easier once we get more realistic model organisms of scheming).

I think that analyzing “positive surprises” and extracting as much information from them as we can is at least as important as analyzing “negative surprises”, like warning shots and surprisingly bad model organisms - and that this community might be underinvesting in analyzing positive surprises.

But to be clear, even if I think the risk is slightly lower than I previously thought and that there are promising avenues to reducing risk, I still think this is an incredibly risky situation (because a 15% probability of scheming is high, because there are non-scheming risks before wildly superintelligent AIs, and because avoiding scheming before wildly superintelligent AIs is not a guarantee of finding ways to align wildly superintelligent AIs before it’s too late), and I support a prohibition on the development of superintelligence.

  1. ^

    This is not the "with no countermeasures" scheming probability, this is my actual best guess, including the potential anti-scheming mitigations that people might use.




Coming soon

Published on November 20, 2025 5:30 PM GMT

This is “Can We Secure AI With Formal Methods?”.





July 2024 Welcome Edition

Published on November 20, 2025 5:30 PM GMT

Welcome to the newsletter! 

Here we’ll be doing a monthly news roundup inclusive of my biased and wrong takes, plus maybe a remark about a non-current paper I happened to have read this month. 

In this welcome edition, I won’t attempt to summarize all the news I missed before the newsletter existed, just the most recent notes.


There are no benefits to subscribing, the free tier has it all. But I enabled paid subscriptions just in case people want to signal support, to check how it affects my motivation to write these, and to see what happens.  

Let me know changes I can make to maximize the impact of the newsletter by commenting here or emailing quinndougherty92 <at> gmail.

Thanks to Ben Goldhaber for the baseball metaphor idea and reviewing this post, and thanks to Steve Omohundro for suggesting the newsletter’s name, but all demerits go to me (the points go to me too btw).

Where are we? What is guaranteed safe AI? 

Guaranteed-safe AI is a collection of research agendas to make the technological transition go well. There’s a video explainer here, but the main resource is the preprint position paper Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems.

Personally, I had a few years of telling people that formal verification is not the droid we’re looking for before the latest push for GSAI, so you might be wondering if my worldview changed. In my earlier writing, I was mostly responding to a very strong “just capture alignment properties in ‘specs’ and prove the implementation ‘correct’” position; however, I’m attracted to GSAI proposals because they have a swiss cheese, matter-of-degree mindset that I don’t usually see when people get optimistic about the stronger version. Separately, proof assistants have played a big role in my career, so I’m biased toward theories of change that might leverage my experience.

Atlas Computing talk

The monthly seminar featured Evan Miyazono of Atlas Computing; the talk is available on youtube. The GSAI position papers are fine, but they need to be injected with a roadmap, blueprint, or next steps. Ideally, such a roadmap would go all the way down to estimates of the headcount needed for different tactics. Wouldn’t it be great if someone were doing this? Anyways, major thanks to Atlas for doing this! Expect a document from Atlas soon that’s been incubating as a google doc all month.

Podcast episode 

Nora Ammann and Ben Goldhaber appeared on the Cognitive Revolution podcast to discuss GSAI. The junction of world models, specifications, and assured implementations is so lofty and hairy that good science communication isn’t easy, so we need to see more of this. 

QAISI launch with job description

The Quantitative Safety AI Initiative has a website up with a job posting for an AIS Research Engineer.

The job says it will mostly be focusing on the Tegmark & Omohundro preprint, even though two additional authors of the mammoth-author-list position paper are research leads for the org. Dawn Song, the one QAISI research lead not on that position paper, does a lot of security and cryptography work, but is no stranger to AI safety (she co-advised Dan Hendrycks, off the top of my head). Excited to see what the new org gets up to!

Guarantees-based mechanistic interpretability team preprint

This paper isn’t 100% relevant, since GSAI tends more toward formally verifying artifacts of models than formally verifying models. However, it represents a natural thought you’d have when you try smashing formal verification into machine learning, so I’m including it. This is not a commitment to make the newsletter a big tent for any FV/ML crossover episode overall, but it can be sometimes. Gross et al. lay the groundwork for formal proofs of model performance (lower bounds on accuracy). They accomplish this with mechinterp tricks like ablations, but since we’re dealing in proof we need pessimal ablations, i.e. ablating with the worst case rather than mean or random ablations.

I say “groundwork” because while the proofs in the current paper are “formal” by many standards, they are not yet formal in the sense of machine checked. But one step at a time, the current work seems like a big leap forward to me, and I’m skeptical of the marginal value add of putting those proofs in Coq. 

See this splainer / companion piece by the authors plus this supplemental by Jacob Hilton of ARC (which funded the work).

Progress in automating mathematics

This is a very active area with either tangential or moonshot safety stories, so I’ll never be exhaustive in covering the latest happenings, but how about a tasting. 

AlphaProof

Deepmind: AI achieves silver-medal standard solving International Mathematical Olympiad problems. Coupling a language model specializing in Lean with the AlphaZero RL algorithm, AlphaProof is competitive with IMO silver medalists, and very close to gold. The blog doesn’t link to a preprint, omits details about how the IMO data was Leanified, and omits the proportion of synthetic and self-play data vs other data. So it remains to be seen how significant this will be– I don’t have a strong inside view about benchmark hacking, but I know enough about the problem area to be skeptical of everything. 

Remember that proofs are programs. We should expect some framing of the automated mathematics problem to do the conceptual heavy lifting for the automated software engineering problem. One reason you’d hesitate to buy up all the “yes” right now is that IMO problems in Lean are not that framing. Lean (especially Mathlib) is, contra your assumptions when you walk into a dependent type theory, optimized for classical mathematics. Without an analysis of AlphaProof’s constructive capacities, you shouldn’t assume that proofs are programs; machine learning may not walk down the halls of Curry-Howard just yet.
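As a small illustration of the “proofs are programs” slogan (nothing specific to AlphaProof; the names are mine), the same shape of Lean term is both function composition on data and a proof of the transitivity of implication:

    -- Curry-Howard in miniature: one term shape, two readings.
    def composeFn {A B C : Type} (g : B → C) (f : A → B) : A → C :=
      fun a => g (f a)

    -- Read as a proof: from (Q → R) and (P → Q), conclude (P → R).
    theorem imp_trans {P Q R : Prop} (g : Q → R) (f : P → Q) : P → R :=
      fun p => g (f p)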

GSAI paper club? 

I’m considering a monthly meetup to eat through papers. Comment here or email quinndougherty92 <at> gmail to express interest. 

Shortlist of papers we could cover (not all of them timely)

Feel free to suggest others. Much of me would like to do a centralized summarizer (often me) format, but otherwise we could do a longer meetup where we read together on the call before discussing. 

Funding opportunities (jk) 

ARIA is currently closed, and the SFF deadline (SFF has some GSAI folks as panelists this year) was a few weeks ago.

Neel Nanda, leading a Manifund budget, informed me over Discord that he’s pessimistic about GSAI and is unlikely to make grants in the area. I don’t know what other Manifund regrantors think, but there’s still a million or so in play for AI on Manifund, so it could be worth a shot.

Coming up

You can sign up for August’s GSAI seminar here, from the authors of this post. We’ll have to see what progress has been made in any prototyping that’s been done – I found the post to be pretty pie-in-the-sky.

See you next month

Like and subscribe, etc. Also, I am not of the twitterati, so I don’t know how to hang out in specific places to make this newsletter better, but it’d be cool to cite other people’s quick takes from time to time. To do this, I need your twitter account recommendations, so shoot those over.





August 2024 Progress in Guaranteed Safe AI

Published on November 20, 2025 5:30 PM GMT

A wager you say

One proof of concept for the GSAI stack would be a well-understood mechanical engineering domain automated to the next level and certified to boot. How about locks? Needs a model of basic physics, terms in some logic for all the parts and how they compose, and some test harnesses that simulate an adversary. Can you design and manufacture a provably unpickable lock? 

Zac Hatfield-Dodds (of hypothesis/pytest and Anthropic; he was offered and declined authorship on the GSAI position paper) challenged Ben Goldhaber to a bet after Ben coauthored a post with Steve Omohundro. It seems to resolve in 2026 or 2027; the comment thread should get cleared up once Ben gets back from Burning Man. The arbiter is Raemon from LessWrong.


Zac says you can’t get a provably unpickable lock on this timeline. Zac gave (up to) 10:1 odds, so recall that the bet can have positive expected value for Ben even if he thinks the event is most likely not going to happen.

For funsies, let’s map out one path of what has to happen for Zac to pay Ben $10k. This is not the canonical path, but it is a path:

  1. Physics to the relevant granularity (question: can human lockpicks leverage sub-newtonian issues?) is conceptually placed into type theory or some calculus. I tried a Riemann integral in Coq once (way once), so it occurs to me that you need to decide if you want just the functional models (perhaps without computation / with proof irrelevance) in your proof stack or if you want the actual numerical analysis support in there as well.

  2. Good tooling, library support, etc. around that conceptual work (call it mechlib) to provide mechanical engineering primitives

  3. A lock designing toolkit, depending on mechlib, is developed

  4. Someone (e.g. a large language model) is really good at programming in the lock designing toolkit. They come up with a spec L. 

  5. You state the problem “forall t : trajectories through our physics simulation, if L(t) == open(L) then t == key(L)” (see the Lean sketch after this list for one way to write this down)

  6. Then you get to write a nasty gazillion line Lean proof

  7. Manufacture a lock (did I mention that the design toolkit has links to actual manufacturing stacks?)

  8. Bring a bunch to DefCon 2027 and send another to the lockpicking lawyer

  9. Everyone fails. Except Ben and the army of postdocs that $9,999 can buy.
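For step 5, here is a minimal Lean-flavored sketch of what the statement might look like. LockDesign, Trajectory, opens, and usedKey are hypothetical placeholders for what a real "mechlib" would provide; nothing here exists today.

    -- Hypothetical placeholder types for lock designs and physical trajectories.
    structure LockDesign where
      id : Nat

    structure Trajectory where
      events : List Nat

    -- Hypothetical predicates a mechanics library ("mechlib") would define.
    variable (opens usedKey : Trajectory → LockDesign → Prop)

    -- "Unpickable": every trajectory that opens the lock must have used its key.
    def Unpickable (L : LockDesign) : Prop :=
      ∀ t : Trajectory, opens t L → usedKey t L

    -- Step 6 is then a (gazillion-line) proof of something like:
    -- theorem myLock_unpickable : Unpickable opens usedKey myLock := by ...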

Looks like after the magnificent research engineering in steps 1 and 2, the rest is just showing off and justifying those two steps. Of course, in a world where we have steps 1 and 2 we have a great deal of transformative applications of formal modeling and verification just in reach, and we’ll need a PoC like locks to practice and concretize the workflow. 

Cryptography applications tend to have a curse of requiring a lot of work after the security context, permission set, and other requirements are frozen in stone, which means that when the requirements change you have to start over and throw out a bunch of work (epistemic status: why do you think so many defi projects have more whitepapers than users?). The provably unpickable lock has 2 to 10x that problem: get the granularity wrong in step one, and most of your mechlib implementation won’t be salvageable. As the language model iterates on the spec L in step 5, the other language model has to iterate on the proof in step 6, because the new spec will break most of the proof.

Sorry I don’t know any mechanical engineering, Ben, otherwise I’d take some cracks at it. The idea of a logic such that its denotation is a bunch of mechanical engineering primitives seems interesting enough that my “if it was easy to do in less than a year someone would’ve, therefore there must be a moat” heuristic is tingling. Perhaps oddly, the quantum semantics folks (or with HoTT!) seem to have been productive, but I don’t know how much of that is translatable to mechanical engineering.

Reinforcement learning from proof assistant feedback, and yet more monte carlo tree search

DeepSeek’s paper 

The steps are pretraining, supervised finetuning, RLPAF (reinforcement learning from proof assistant feedback), and MCTS (monte carlo tree search). RLPAF is not very rich: it’s a zero reward for any bug at all and a one for a happy typechecker. Glad they got that far with just that. 
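Here is a minimal sketch of that reward shape as described (a binary signal from the proof assistant); check_lean_proof is a hypothetical wrapper around a typechecker call, not DeepSeek's actual interface.

    def rlpaf_reward(proof_attempt: str, check_lean_proof) -> float:
        """Zero reward for any bug at all, one for a happy typechecker."""
        accepted, _error_log = check_lean_proof(proof_attempt)  # hypothetical checker
        return 1.0 if accepted else 0.0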

You can use the model at deepseek.com.

Harmonic ships their migration of miniF2F to Lean 4, gets 90% on it, is hiring

From their “one month in” newsletter. “Aristotle”, which has a mysterious methodology since I’ve only seen their marketing copy rather than an arxiv paper, gets 90% on miniF2F in Lean 4 when prompted with natural language proofs. It doesn’t look to me like the deepseek or LEGO papers do that? I could be wrong. It’s impressive just to autoformalize natural language proofs; I guess I’m still wrapping my head around how much harder it is (for an LLM) to come up with the proof as well.

Jobs: research engineer and software engineer

Atlas ships their big google doc alluded to in the last newsletter

Worth a read! The GSAI stack is large and varied, and this maps out the different sub-sub-disciplines. From the executive summary:

You could start whole organizations for every row in this table, and I wouldn’t be a part of any org that targets more than a few at once for fear of being unfocused. See the doc for more navigation (see what I did there? Navigating like with an atlas, perhaps? Get it?) of the field’s opportunities.1

Efficient shield synthesis via state-space transformation

Shielding is an area of reactive systems and reinforcement learning that marks states as unsafe and synthesizes a kind of guarding layer between the agent and the environment that prevents unsafe actions from being executed in the environment. So in the rejection-sampling-flavored version, it literally intercepts the unsafe action and tells the agent “we’re not running that, try another action”. One of the limitations in this literature is computational cost: shields are, like environments, state machines plus some frills, and there may simply be too many states. This is the limitation that this paper focuses on.

We consider the problem of synthesizing safety strategies for control systems, also known as shields. Since the state space is infinite, shields are typically computed over a finite-state abstraction, with the most common abstraction being a rectangular grid. However, for many systems, such a grid does not align well with the safety property or the system dynamics. That is why a coarse grid is rarely sufficient, but a fine grid is typically computationally infeasible to obtain. In this paper, we show that appropriate state-space transformations can still allow to use a coarse grid at almost no computational overhead. We demonstrate in three case studies that our transformation-based synthesis outperforms a standard synthesis by several orders of magnitude. In the first two case studies, we use domain knowledge to select a suitable transformation. In the third case study, we instead report on results in engineering a transformation without domain knowledge.

Besides cost, demanding a lot of domain knowledge is another limitation of shields, so this is an especially welcome development.
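A toy sketch of the rejection-sampling flavor described above; the agent, environment, and shield_is_safe interfaces are made-up stand-ins rather than the paper's API.

    def shielded_step(env, agent, state, shield_is_safe, max_retries=10):
        """Ask the agent for actions until the shield allows one, then run it."""
        for _ in range(max_retries):
            action = agent.act(state)
            if shield_is_safe(state, action):
                return env.step(action)
            agent.notify_rejected(state, action)  # "we're not running that, try another action"
        raise RuntimeError("shield rejected every proposed action")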

Funding opportunities

ARIA jumped right to technical area three (TA3), prototyping the gatekeeper. Deadline October 2nd. Seems geared toward cyber-physical systems folks. In the document:

Note that verified software systems is an area which is highly suitable for a simplified gatekeeper workflow, in which the world-model is implicit in the specification logic. However, in the context of ARIA’s mission to “change the perception of what’s possible or valuable,” we consider that this application pathway is already perceived to be possible and valuable by the AI community. As such, this programme focuses on building capabilities to construct guaranteed-safe AI systems in cyber-physical domains. That being said, if you are an organisation which specialises in verified software, we would love to hear from you outside of this solicitation about the cyber-physical challenges that are just at the edge of the possible for your current techniques.

This is really cool stuff, I hope they find brave and adventurous teams. I had thought gatekeeper prototypes would be in minecraft or mujoco (and asked a funder if they’d support me in doing that), so it’s wild to see them going for actual cyberphysical systems so quickly. 

Paper club

Add to your calendar. On September 19th we will read a paper about assume-guarantee contracts with learned components. I’m liable to have made a summary slide deck to kick us off, but if I don’t, we’ll quietly read together for the first 20-30 minutes then discuss. The google meet room is in the gcal event by default.

Andrew Dickson’s excellent post

See Limitations on Formal Verification for AI Safety over on LessWrong. I have a lot of agreements, and my disagreements are more a matter of what deserves emphasis than the fundamentals. Overall, I think the Tegmark/Omohundro paper failed to convey a swiss-cheesey worldview, and sounded too much like “why not just capture alignment properties in ‘specs’ and prove the software ‘correct’?” (i.e. the vibe I was responding to in my very pithy post). However, the main reason I’m not using Dickson’s post as a reason to pivot my whole worldview and resulting research is captured in one of Steve’s comments:

I'm focused on making sure our infrastructure is safe against AI attacks.

Like, a very strong version I almost endorse is “GSAI isn’t about AI at all, it’s about systems coded by extremely powerful developers (which happen to be AIs)”, and ensuring safety, security, and reliability capabilities scale at similar speeds with other kinds of capabilities.

It looks like one can satisfy Dickson just by assuring him that GSAI is a part of a swiss cheese stack, and that no one is messianically promoting One Weird Trick To Solve Alignment. Of course, I do hope that no one is messianically promoting One Weird Trick…

1

One problem off the top of my head regarding the InterFramework section: Coq and Lean seem the most conceptually straightforward since they have the same underlying calculus, but even there just a little impredicativity or coinduction could lead to extreme headaches. Now you can have a model at some point in the future that steamrolls over these headaches, but then you have a social problem of the broader Lean community not wanting to upstream those changes – various forks diverging fundamentally seems problematic to me and would lead to a lot of duplicated work and missed opportunities for collaboration. I plan to prompt Opus 3.5 with “replicate flocq in lean4” as soon as I get access to the model, but how much more prompting effort will it be to ensure compliance with preexisting abstractions and design patterns, so that it can not only serve my purposes but be accepted by the community? At least there’s no coinduction in flocq, though some of the proofs may rely on set impredicativity for all I know (I haven’t looked at it in a while).




September 2024 Progress in Guaranteed Safe AI

Published on November 20, 2025 5:30 PM GMT

There are no benefits to the paid subscription, except you get a cut of my hard-earned Shapley points.

As always, do reach out to inform me how the newsletter could be better, unless you want less biased and wrong quick takes. The biased and wrong quick takes are mandatory.


If you're just joining us, background on GSAI here.  

Terry Tao’s mathstodon comment

“I have played a little bit with OpenAI’s new iteration of GPT”, he begins

Let's get everyone up to speed on one of Tao's key insights: the bottlenecks to casually superb Lean 4 performance are embarrassingly non-profound - chat models getting tripped up by import paths changing across Lean versions, and so on. You can only choke a model by having too much code velocity in the training data for so long, but eventually that'll fall.

That may not be a great crux for Tao's use case. I think he wants to focus on the reasoning bottlenecks, and is less concerned about coding bottlenecks given reasoning. Perhaps his interest in the Lean parts is mostly Voevodsky-style: what do we want? Less homework. How do we want it? With greater assurance than more homework.

You should sit and think about whether you think attaining a reasoning level is easier or harder than turning that reasoning level into Lean code once attained. I sat and thought about it for 5 minutes, and I'm still not ready to make any bets. Yes, the lesson about the embarrassingly non-profound bottlenecks suggests that attaining a reasoning level is harder than translating that reasoning level into code, but on the other hand, dependent types are a really finicky way to program.

Tao reports that we've leapt from incompetent grad student to mediocre grad student, or OpenAI has, anyway. I do not know the conversion factor between units of pure maths grad student and units of GSAI-grade software developer. I'll take 1 to 1 as my prior. How about yours? Is one pure maths grad student's worth of software engineering enough to accomplish your goals, if you can make copies and replace pizza/coffee with electricity?

David Manheim’s post

I liked David Manheim’s post on self-driving cars as a worked example. Like we saw last month, worked examples help identify cruxes more clearly and quickly. In this case, the commenters mostly objected to unstated assumptions, and Manheim would go on to edit in more assumptions (even in the title of the post!).

Tdietterich: 

Perhaps we should rename the "provable safety" area as "provable safety modulo assumptions" area and be very explicit about our assumptions. We can then measure progress by the extent to which we can shrink those assumptions.

Jacobjacob:

Very much agree. I gave some feedback along those lines as the term was coined; and am sad it didn't catch on. But of course "provable safety modulo assumptions" isn't very short and catchy...

I do like the word "guarantee" as a substitute. We can talk of formal guarantees, but also of a store guaranteeing that an item you buy will meet a certain standard. So it's connotations are nicely in the direction of proof but without, as it were, "proving too much" :)

I agree with the sentiment in the comments that provability in the sense of system safety is provability only up to some set of assumptions, and I hope to see people be more clear about which assumptions they are relying on.

Models that prove their own correctness 

This was from June, but I just found it. We’re reading it for next month’s GSAI paper club, which is convening 2024-10-17 at 10am Pacific. Here, the word “proof” is used in the sense of “interactive proof”: a multi-step game between a prover and a verifier, popular in computational complexity theory and cryptography.

How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured *on average* over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train *Self-Proving models* that prove the correctness of their output to a verification algorithm V via an Interactive Proof. Self-Proving models satisfy that, with high probability over a random input, the model generates a correct output *and* successfully proves its correctness to V. The *soundness* property of V guarantees that, for *every* input, no model can convince V of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while *all* incorrect outputs (of any model) are detected by V. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions. The theoretical framework and results are complemented by experiments on an arithmetic capability: computing the greatest common divisor (GCD) of two integers. Our learning method is used to train a Self-Proving transformer that computes the GCD *and* proves the correctness of its answer.

Some people in GSAI only want to gain assurance about code that AIs are writing, others want to gain assurance about the learned components themselves. If you’re in the latter camp, this proposal looks promising, but there’s a long road ahead to scale it to non-toy problems.
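To make the prover/verifier split concrete for the paper's GCD example, here is a toy verifier that accepts g together with a Bezout certificate (x, y). This is my own illustration of the idea, not the paper's actual protocol.

    def verify_gcd_claim(a: int, b: int, g: int, x: int, y: int) -> bool:
        """Accept iff g is a common divisor of a and b and g = a*x + b*y."""
        if g <= 0:
            return False
        common_divisor = (a % g == 0) and (b % g == 0)
        bezout = (a * x + b * y == g)   # any common divisor then divides g too
        return common_divisor and bezout

    # Example: gcd(12, 18) = 6, with certificate x = -1, y = 1.
    assert verify_gcd_claim(12, 18, 6, -1, 1)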

Types Are Not Safe

I came across this LambdaConf talk solely because I used to work for the speaker – I did not check whether other LambdaConf talks are worthy of endorsement.

I'm including it in the newsletter to make sure we get on the same page about what type systems can't do. Type systems check for a pretty narrow kind of correctness. No one has lifted color-blind approval of UIs to types. No one has made a PDE solver where the typechecker knows if the initial conditions you loaded in send the plane into a hillside. Lattice-based access control is kinda like lifting security to a type checker, but it doesn’t know which permission sets are more or less susceptible to insider, phishing, or bribery risks. This matters, because if you cash out “formal verification” at Lean, then you've just staked it all on types, and types are not safe. I think we need a broad portfolio and several different meanings of the word “certificate”, but I keep running into people who let their guard down and hail the coming age of formally verified code generation, and if they were careful they wouldn't be so optimistic.

I tend to say that type systems are for increasing the jurisdiction of compile-time knowledge. The runtime is where your adversary has home team advantage. But many correctness notions that are in principle liftable to a sufficiently expressive typechecker are not in practice worth the trouble. Now, this is due to ergonomic bottlenecks that an LLM could steamroll over, but then you're just shifting your error/attack surface from the spec-implementation gap to “where the h!ck did this spec come from?” and of course the ever-present world-spec gap.

Another title for the talk could’ve been “Types Lie”, but that section of the talk relies on advanced language extensions specific to Haskell which it isn’t important for you to understand. Some of the issues, like the untypeability of the type of types, are way less bad in dependently typed languages. Getting confused because I wrote some type algebra that accidentally involved uninhabited types did happen to me in Lean a couple of weeks ago, though.

Opportunities

Topos jobs

Both of the Oxford postdoc positions are concerned with world modeling and the foundations of what will eventually become specification engineering for ARIA-style Safeguarded AI.

UK’s AISI cites GSAI as an area it wants to branch into, and is hiring

The UK’s AI Safety Institute is building a team for articulating positive “safety cases”: stories in which deploying the models of the future is not a bad idea. They want technical talent to write these stories, not just so that they’ll be calibrated but also so the team has the capacity in-house to deliver on some of the empirical needs of those stories. They do plan to open an SF office over the next year, but right now you need to be onsite in London.




Discuss

October 2024 Progress in Guaranteed Safe AI

November 20, 2025 - 20:30
Published on November 20, 2025 5:30 PM GMT

There are no benefits to subscribing, except that you get a cut of my hard-earned Shapley points.


As always, do reach out to tell me how the newsletter could be better, unless you want less biased and wrong quick takes. The biased and wrong quick takes are mandatory. Do you like inlined abstracts, or should I just expect people to click through when they want to read the abstract? This month we’re inlining all of them.

If you're just joining us, background on GSAI here.

I attended an ARIA meeting this month, so most of the newsletter is papers I found there that happen to be recent-ish.

A little birdie told me there’s a meeting in Berkeley at the beginning of December; reach out for more info!

Big list of OpenReview links for ICLR 2025 papers related to scaled interactive theorem proving

From the Lean Zulip. There’s probably more from ICLR I’d like to cover in the newsletter, but I haven’t parsed the OpenReview firehose personally; I’m freeloading off Lean Zulip users.

One thing that jumped out at me, in a thread about a polarizing/unpopular paper, was the discussion of which programming practices would make it easier for ML papers to make reasonable claims about the nontriviality of discovered proofs and, more broadly, to prevent contamination at scale. Terence Tao discusses placeholder design patterns, suggests a proof_wanted command to replace theorem … : 0 = 1 := sorry, and discusses the downsides of sorrying out 0 = 1 as a placeholder (spoiler: an AI can exploit the proof of a false proposition to misbehave). This is an important point: any GSAI stack will involve collaboration between proof engineers of different power levels, and much of proof-engineering collaboration is driven by very careful bookkeeping of placeholders (saying what we need, what we have, and what would be nice to have).
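To make the failure mode concrete, here is a minimal sketch of my own (not Tao’s actual proposal) of why an admitted 0 = 1 is a landmine in Lean 4:

-- The placeholder pattern: admit a deliberately false statement "for now".
theorem placeholder : 0 = 1 := sorry

-- Ex falso quodlibet: once the false statement is admitted, any goal can be
-- closed by exploiting it, so an automated prover can "finish" unrelated work
-- without doing anything meaningful.
example : ∀ n : Nat, n = n + 1 :=
  fun _ => absurd placeholder (by decide)

As I understand it, a proof_wanted-style command avoids this by recording the desired statement without ever making it available as a hypothesis downstream.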

A quantitative probabilistic relational hoare logic

November’s GSAI paper club

Hoare logic is when you have a precondition P : Memory -> Prop and a postcondition Q : Memory -> Prop, and some imperative program c such that for all memories m and n, if P m and executing c sends m to n, then you can prove Q n; this is written {P}c{Q}. Sometimes you want to reason about relations between programs. Relational Hoare logic (RHL) is when the preconditions and postconditions are relations on memory states and you have two programs, so {P} c₁ ~ c₂ {Q} means that for all memories m1 m2 n1 n2, if P m1 m2 and exec c1 m1 = n1 and exec c2 m2 = n2, then Q n1 n2.

You might like to write a proof of an RHL quadruple for a cryptographic primitive, requiring you to reason about the independence of two PRNGs. Independence is one of the things you don’t get in RHL, because RHL assumes deterministic programs. Probabilistic relational Hoare logic (pRHL) looks again like {P} c1 ~ c2 {Q}, but this time Q is lifted to a relation between distributions over memory, reflecting the possible outcomes of programs involving probabilistic choice or random sampling. So the precondition P is of type Memory -> Memory -> Prop, the postcondition Q is of type ΔMemory -> ΔMemory -> Prop, and the pRHL quadruple unpacks to: for all m1 m2 : Memory and N1 N2 : ΔMemory, if P m1 m2 and exec c1 m1 ~ N1 and exec c2 m2 ~ N2, then Q N1 N2.
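Here’s a rough Lean transliteration of the three judgment shapes (my own sketch, not the paper’s; Memory, distributions, and execution are left abstract, and execution is modeled as a function for simplicity):

axiom Memory : Type
axiom ΔMemory : Type                    -- distributions over Memory
axiom Prog : Type
axiom exec : Prog → Memory → Memory     -- deterministic execution
axiom execP : Prog → Memory → ΔMemory   -- probabilistic execution

-- {P} c {Q}: plain Hoare triple
def hoare (P Q : Memory → Prop) (c : Prog) : Prop :=
  ∀ m, P m → Q (exec c m)

-- {P} c₁ ~ c₂ {Q}: relational Hoare quadruple
def rhl (P Q : Memory → Memory → Prop) (c₁ c₂ : Prog) : Prop :=
  ∀ m₁ m₂, P m₁ m₂ → Q (exec c₁ m₁) (exec c₂ m₂)

-- {P} c₁ ~ c₂ {Q}: probabilistic relational version, where the postcondition
-- relates output distributions rather than output memories
def prhl (P : Memory → Memory → Prop) (Q : ΔMemory → ΔMemory → Prop)
    (c₁ c₂ : Prog) : Prop :=
  ∀ m₁ m₂, P m₁ m₂ → Q (execP c₁ m₁) (execP c₂ m₂)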

We introduce eRHL, a program logic for reasoning about relational expectation properties of pairs of probabilistic programs. eRHL is quantitative, i.e., its pre- and post-conditions take values in the extended non-negative reals. Thanks to its quantitative assertions, eRHL overcomes randomness alignment restrictions from prior logics, including PRHL, a popular relational program logic used to reason about security of cryptographic constructions, and apRHL, a variant of PRHL for differential privacy. As a result, eRHL is the first relational probabilistic program logic to be supported by non-trivial soundness and completeness results for all almost surely terminating programs. We show that eRHL is sound and complete with respect to program equivalence, statistical distance, and differential privacy. We also show that every PRHL judgment is valid iff it is provable in eRHL. We showcase the practical benefits of eRHL with examples that are beyond reach of PRHL and apRHL.

pRHL is not expectational. Since preconditions and postconditions land in Prop, we can’t value them as quantities. Last July, the authors introduced a generalization of pRHL called eRHL (e for expectation), where preexpectations and postexpectations are of type Memory -> Memory -> NNReal, and the rest follows. eRHL is special because they can prove completeness theorems for almost surely terminating (Pr(halt) = 1) programs that pRHL can’t, along with some fancy desiderata like a statistical distance ⇔ eRHL quadruple correspondence and differential privacy.

If you’re wondering why you just read all that, here’s the juice: often in GSAI position papers there’ll be some reference to expectations that capture “harm” or “safety”. Preexpectations and postexpectations with respect to particular pairs of programs could be a great way to cash this out, because we could look at programs as interventions and simulate RCTs (labeling one program control and one treatment) in our world-modeling stack. When it comes to harm and safety, Prop and bool are definitely not rich enough.

Bengio talk coming up 

November 14, 18:00-19:00 UTC

Description: Could there be safety advantages to the training of a Bayesian oracle that is trained to only do that job, i.e., estimate P(answer | question, data)? What are the scenarios in which such an AI could cause catastrophic harm? Can we even use such an oracle as the intelligence engine of an agent, e.g., by sampling actions that help to achieve goals? What can go wrong even if we assume that we have a perfect prediction of the Bayesian posterior, e.g., if the true explanatory theory is a minority voice in the Bayesian posterior regarding harm prediction? If such an oracle is estimated by a neural network with amortized inference, what could go wrong? Could the implicit optimization used to train the estimated posterior create loopholes with an optimistic bias regarding harm? Could we also use such a Bayesian oracle to obtain conservative risk estimates, i.e., bounds on the probability of harm, that can mitigate the imperfections in such an agent?

AutoVerus

Verus is a framework for annotating Rust code with preconditions (“requires”) and postconditions (“ensures”) that kicks static verification down to an SMT solver. Here’s an example program for multiplying a number by 8:

fn octuple(x1: i8) -> (x8: i8)
    requires
        -16 <= x1,
        x1 < 16,
    ensures
        x8 == 8 * x1,
{
    let x2 = x1 + x1;
    let x4 = x2 + x2;
    x4 + x4
}

Without the “requires”, we could overflow the 8-bit integer. The “ensures” is a functional specification of what it means for the function to be correct. The Verus paper is here and the Verus docs are here.

Generative AI has shown its values for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLM to automatically generate correctness proof for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proof for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls. 

The AutoVerus paper is an example of the genre that uses verification tools as ground truth in a machine learning process. A proof system is a great data labeler. GSAI is happy when this basic viewpoint makes data quality more scalable than other forms of ground truth across machine learning.

Sufficient and Necessary Barrier-like Conditions for Safety and Reach-avoid Verification of Stochastic Discrete-time Systems

A reach-avoid specification is something of the form “eventually P and always not Q”. Barrier certificates are a way of proving that a system starting in the safe region stays in the safe region without directly solving the whole system of dynamics (differential or, here, difference equations).
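To give a flavor of the shape of such a certificate, here is a deterministic, discrete-time caricature of my own (the paper’s conditions are stochastic and quantitative, with a supermartingale-style expectation bound in place of the non-increase condition, and the barrier is real-valued rather than Int-valued):

-- Barrier-certificate-style conditions for a deterministic discrete-time system.
structure BarrierCert (State : Type) (step : State → State)
    (Init Unsafe : State → Prop) (B : State → Int) : Prop where
  init_nonpos : ∀ s, Init s → B s ≤ 0     -- B is nonpositive on initial states
  unsafe_pos  : ∀ s, Unsafe s → 0 < B s   -- B is positive on unsafe states
  nonincrease : ∀ s, B (step s) ≤ B s     -- B never increases along a step

Exhibiting such a B shows that no trajectory starting in Init ever reaches Unsafe, without unrolling the dynamics.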

In this paper, we examine sufficient and necessary barrier-like conditions for the safety verification and reach-avoid verification of stochastic discrete-time systems. Safety verification aims to certify the satisfaction of the safety property, which stipulates that the probability of the system, starting from a specified initial state, remaining within a safe set is greater than or equal to a specified lower bound. A sufficient and necessary barrier-like condition is formulated for safety verification. In contrast, reach-avoid verification extends beyond safety to include reachability, seeking to certify the satisfaction of the reach-avoid property. It requires that the probability of the system, starting from a specified initial state, reaching a target set eventually while remaining within a safe set until the first hit of the target, is greater than or equal to a specified lower bound. Two sufficient and necessary barrier-like conditions are formulated under certain assumptions. These conditions are derived via relaxing Bellman equations. 

Compositional Design of Safety Controllers for Large-scale Stochastic Hybrid Systems

What is the problem? When is a solution “compositional”? 

The problem is stochastic hybrid systems (SHS), which are very heterogeneous: dynamics (i.e. of the kind described by differential equations) that might be discrete-time in some parts and continuous-time in others, and discrete-state in some parts but continuous-state in others. Specifically, the problem is synthesizing guaranteed controllers for these systems (some actuator that intervenes to enforce a desired output). A solution to a problem is compositional when it allows you to piece together solutions to bigger problems out of solutions to smaller problems.

In this work, we propose a compositional scheme based on small-gain reasoning for the safety controller synthesis of interconnected stochastic hybrid systems with both continuous evolutions and instantaneous jumps. In our proposed setting, we first offer an augmented scheme to represent each stochastic hybrid subsystem with continuous and discrete evolutions in a unified framework, ensuring that the state trajectories match those of the original hybrid systems. We then introduce the concept of augmented control sub-barrier certificates (A-CSBC) for each subsystem, which allows the construction of augmented control barrier certificates (A-CBC) for interconnected systems and their safety controllers under small-gain compositional conditions. We eventually leverage the constructed A-CBC and quantify a guaranteed probabilistic bound across the safety of the interconnected system. While the computational complexity of designing a barrier certificate and its safety controller grows polynomially with network dimension using sum-of-squares (SOS) optimization program, our compositional approach significantly reduces it to a linear scale with respect to the number of subsystems. We verify the efficacy of our proposed approach over an interconnected stochastic hybrid system composed of 1000 nonlinear subsystems.

DafnyBench (back in June)

The Beneficial AI Foundation and others shipped a benchmark for Dafny coding. 

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality. 

There’s also this VSCode extension complementing the paper.

Safeguarded AI TA1.4 funding call

Economists and voting theorists are being summoned to help us reason about specification-generating processes under multiple stakeholders. In video form!

We already knew that ARIA is being wildly ambitious. I’d be happy if we could get any specs at all, on the new spec/modeling/cert stack that ARIA is building, checked against prod code; it seems a lot harder to make them also socially/democratically desirable. I’m a little worried about scope creep with these guys, but what else is new.



Discuss

November-December 2024 Progress in Guaranteed Safe AI

November 20, 2025 - 20:30
Published on November 20, 2025 5:30 PM GMT

Sorry for the radio silence last month. It was slow and I didn’t come across things I wanted to write about, which is to be expected with the holidays coming up.

There are no benefits to paying, except that you get a cut of my hard-earned Shapley points, and apparently some disappointment when I miss a month.

If you're just joining us, background on GSAI here.

There should be some DeepSeek/R1 coverage in the next newsletter (I think tanking the cost of finetuning is, while dual-use, good news for us in GSAI). I’d rather get this one out the door, though.

FrontierMath (hiring)

As I’m always banging on about, progress in math automation is a leading indicator of progress in software engineering automation, by the Curry-Howard correspondence. That’s a little cheeky, but the last time I said it in meatspace I got solemn nods as if it wasn’t even cheeky at all. Or maybe they were solemnly nodding at something else. To be explicit: mathematical reasoning abilities and software engineering skills (including formal proof engineering) appear to be at similar capability levels, and through platforms like Lean, mathematical ability can enhance software development capabilities. This connection should make it easier to implement rigorous quality assurance processes, such as obtaining formal proof certificates, which are typically costly.
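For readers who haven’t internalized the slogan: Curry-Howard says propositions are types and proofs are programs, and in Lean the pun is literal. A toy example of my own:

-- Read as logic: a proof of (P → Q) → P → Q. Read as code: function application.
def modusPonens {P Q : Prop} (f : P → Q) (p : P) : Q := f p

def apply' {α β : Type} (f : α → β) (a : α) : β := f a

-- Same term on the right-hand side; only the universes of the types differ.

This is the (cheeky) sense in which progress in proof automation and progress in program synthesis travel together.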

We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

The story here, of course, is that by the time the newsletter covered it, OpenAI hit 25%. “Mathematicians assemble a benchmark they can barely solve themselves, with SOTA at 2%; forecasters think it’ll be unsaturated for multiple years; a month later OpenAI hits 25%” is a fun little anecdote for Christmas with the relatives, for a funny sense of fun. But, points out an astute reader, since we have typical OpenAI science communication (read: extremely disappointing science communication) here: what sort of inference compute was thrown at it? At what cost? A model that can win with the most naive prompt is more impressive than a model that only wins with galaxy-brained and/or expensive (in tokens) feedback schemes.

So anyways, the story here, of course, is that after I drafted the previous paragraph, people noticed that Epoch sneakily added their funding disclosure to a new draft of the paper a month after all the hoopla had started to die down. Spoiler alert: the funder was, lol, OpenAI. I’ll refer you to LessWrong user 7vik’s summary of events, and not get too far into it.

But I have feelings. What if the externality of mathematical automation and tooling, in the form of progress in guaranteed safe AI, doesn’t pan out? Then OpenAI gets its “put Terry Tao out of a job” Boy Scout badge for all the employee laptops, and the safety community gets nothing in return! Come on guys, why is this even in Epoch’s mandate to begin with? Makes me sad.

See also

Review paper (“The suite of datasets commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings”), Elliot’s r/math AMA, Kevin Buzzard on FrontierMath (written after the OpenAI results).

Opportunity

FrontierMath begins with three tiers of difficulty (25% undergrad or IMO level, 50% grad student, 25% early professor-track researcher), but they’re currently expanding to a fourth tier of even harder problems, and they also want some formalization in Lean; email elliot at epochai.org if you’d like to partake.

ARIA call: Mathematics for Safe AI (funding)

This round is affiliated with the Safeguarded AI programme, but is outside of the standard technical areas breakdown.

See also

Renaissance Philanthropy’s AI for Math fund (unclear how much this will help with GSAI efforts).

Structure-aware version control via observational bridge types (hiring)

David Jaz Myers writes for the topos blog about structure-aware version control.

This is really exciting, because git diffs as we know them would lead to problems if you were scaling world models (in a davidad-style approach) or specifications (in any GSAI approach, including davidad’s) that many humans and AIs were collaborating on.

But suppose that you and I are editing a CSV, and you add a row to the bottom while I add a column to the end. Git would see your change as a single-line diff, whereas my change (adding a column) is a change to every line; these conflict in the new line you added. But from what we know about the structure of CSV, your change should really be a one-row change, mine a one-column change, and the conflict should only occur in one cell.

The author proposes something called observational bridge types (from the up-and-coming proof assistant Narya) to form the foundation of structure-aware version control. Using these, we can say that, for filetype F and files f1 f2 : F, a diffing algorithm is none other than an inhabitant of the type Diff F f1 f2 (where Diff : (A : Type) -> A -> A -> Type is created using something called logical relations, or an inductive definition depending on a type). Then, a conflict is defined as a pair (d1, d2) : Diff F f1 f2 × Diff F f1 f3, so that a merge may be defined as another pair (d3, d4) : Diff F f2 f4 × Diff F f3 f4, as sketched below. That much is roughly consistent with Git, provided that you assume F (the filetype) is always “list[list[char]]” or some notion of raw text. It’d be great not to have to assume that, even just for the CSV example, to say nothing of more complicated structures in probabilistic semantics or complex/chaotic systems. A system that can also reason about diffs between the types themselves (since Diff Type sigma tau is perfectly reasonable here) is also desirable.
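In Lean-flavored notation (my own transliteration; the post itself uses Narya’s observational bridge types, which have no Lean counterpart), the signatures look like:

-- A diff between two values of a type.
axiom Diff : (A : Type) → A → A → Type

-- A conflict: two diffs out of a common ancestor f₁.
def Conflict (F : Type) (f₁ f₂ f₃ : F) : Type :=
  Diff F f₁ f₂ × Diff F f₁ f₃

-- A merge resolving it: two diffs into a common descendant f₄.
def Merge (F : Type) (f₂ f₃ f₄ : F) : Type :=
  Diff F f₂ f₄ × Diff F f₃ f₄

The point of structure-awareness is that Diff F gets to depend on F: for a CSV type it can speak in rows and columns rather than lines of raw text.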

See also

More on this from Owen at Topos at last year’s Safe By Design meeting at FAR Labs. More on this from the GSAI Google group a while ago. Roundup of tree diff tools.

Opportunity

To work with David Jaz on this for the world modeling part of Safeguarded AI in Oxford, apply here.

Squiggle AI

In the davidad and, somewhat, Bengio regions of the GSAI space, there’s a huge emphasis on world models: computational descriptions of what is. Squiggle is a programming language where the terms are distributions, and all Monte Carlo boilerplate is hidden from the user. This makes Squiggle a candidate “standard” for belief specifications, as I’ve written about elsewhere.

The first problem you’d run into is that the Users Will Not Just. The users will not just learn a programming language to specify beliefs when they barely want to specify their beliefs in the first place. That’s why it’s good news that QURI shipped an AI product, so the interface is closer to natural language while the substrate/backend is Squiggle. The LLM Will Just. The LLM will just learn to program in Squiggle, so you don’t have to.

It’s ready for you to use at squigglehub.org/ai (if authenticated), and you can read QURI’s press release on the EA Forum and elsewhere.



Discuss

January-February 2025 Progress in Guaranteed Safe AI

November 20, 2025 - 20:30
Published on November 20, 2025 5:30 PM GMT

OK, this one got too big; I’m done grouping two months together after this.

BAIF wants to do user interviews to scout formal verification acceleration projects; reach out if you’re shipping proofs but have pain points!

This edition has a lot of my takes, so I should warn you that GSAI is a pretty diverse field, and I would expect all my colleagues in it to have substantial disagreement with at least one take in each issue of the newsletter. Prompt the language model to insert the standard boilerplate disclaimer that views are my own, not those of my colleagues, my employer, nor my mom.

If you're just joining us, background on GSAI here.

Formal Verification is Overrated (lightning talk)

Zac makes three sound points which are not cruxes for me. Part of what Nora’s post is doing is responding to this with her take, which differs from mine.

Point one: weights are intractable

Formal verification of the learned component itself is going to be a nonstarter on non-toy problems, for computational cost reasons. We saw last year that a “proof” of the correctness of a function a transformer implements is subcubic in d_vocab, at least a little worse than quadratic. I agree with Zac, which is why I avoid whitebox strategies in my day-to-day life. GSAI (to me) is a set of blackbox tactics, closer to control, that doesn’t depend on interpretability going particularly well. The upcoming Formalise track of ICSE separates “AI4FV” and “FV4AI” as two separate topic areas, which is obviously the right thing to do; I find that half the time when I introduce GSAI to someone, this is the first point of clarification. Speaking for myself, GSAI is AI4FV without being FV4AI.

Point two: reality is complicated and unknown

This recapitulates a point I’ve been making for years, which is that the world-spec gap hurts you more than the spec-implementation or spec-component gap. I think that if we get arbitrary wins on GSAI agendas but an arbitrarily unboxing superintelligence too quickly, this principle is what kills us. This principle (which needs a good name, btw) still allows you to stack your Swiss cheese tower to the sky, and never says that your Swiss cheese tower buys you nothing. A potential crux for me might be that if we leave our immediate successors with a tower of Swiss cheese, they’ll fail to use it to escape the acute risk period for predictable reasons.

Agree that the DNA synthesis part of Tegmark and Omohundro was a bit unhinged. Agree that insofar as Zac is responding to that paper, much “calming down” is needed, because that paper (unlike, I think, the Toward GSAI paper) is really overstated.

Formal verification, as we know it, exists to isolate error to the spec-world gap, which realistically isolates attack surfaces to a fraction of what they were. This doesn’t happen for free by default; it requires defensive acceleration. My job is nines; I don’t believe in 100%. However, a significant part of Safeguarded AI is aimed at making the spec-world gap smaller, so it wouldn’t be accurate to say that GSAI stops at the spec-world gap. I think Safeguarded AI’s approach to this, which attacks the details of probabilistic world models and how they’re elicited and falsified, is more reasonable than Omohundro’s “formal specifications of physical systems” approach, which is a little too bottom-up for my tastes.

I did talk to one security researcher who, after regaling me with fun tales of side-channel attacks, said that Omohundro’s vision of eliminating side-channel attacks by formal verification (because the model of the component goes down to physical laws like Maxwell’s equations) is not as unreasonably intractable as I thought, and is instead merely ambitious. So maybe I should learn basic physics.

Point three: tool AI is unstable and uncompetitive

Zac:

The final category of proposals I see under this formal verification approach is that we should use AI to develop tools and software which is formally verified. And I am broadly a fan of this, but it’s not actually a substitute for working on AI safety.

Also Zac:

The addition of a trivial for loop is the only difference between a tool and an agent (in a general sense).

I think Zac is referring to the ask to restrain AI to tool level and simply not build agents, coming mostly from Tegmark. (See FLI’s tiers; it’s possible that scaffolding not much more than a loop is all you need to get from tier 2 to tier 3.) I just don’t think this ask is centrally related to formal verification, and it is not a crux for most of the GSAI space (indeed, much of GSAI wants to leverage primitive agents such as loops). But I do agree with Zac: I think the no-agency ship has sailed and there’s no satisfying way to restrict AI to the tool level, mostly for competitiveness reasons.

Overrated by whom?

But also, I’d like to ask Zac how it’s “overrated” when the reception from funders (besides ARIA and FLI’s vitalikbucks) is not even lukewarm. OpenPhil is super unplugged from GSAI and not interested; SFF isn’t participating at all, in spite of having some of the big position paper authors involved in the S-process. Frontier labs do a bunch of proof-scaling projects, but they’re doing it just for fun / to show off / to innovate on the product, not because of any vision in particular (as far as I can tell / as far as I’ve heard). I think ARIA and the FLI Extended Cinematic Universe (i.e. BAIF) are a great place to start; we could spend more, but we shouldn’t complain. Does Zac mean the current level of funding is already too high, or is he just worried about that number increasing? He does think it’s an important part of the portfolio, so I think he’s just responding to some of the over-promising (which I tend to agree with him about).

FVAPPS

4715 function signatures, each with 2-5 sorried-out theorems: a new benchmark for leetcode-style programming in Lean. Notice that I, Quinn, the newsletter author, am one of the authors.

We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and quality controlled samples. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean’s “sorry” keyword).
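A toy sample in the FVAPPS style (invented for illustration, not an actual benchmark item): roughly, a task pairs a function signature with theorem statements left as sorry, and solving it means filling in both the program and the proofs.

-- Implement me.
def maxOfList : List Nat → Nat := sorry

-- Prove me: the result bounds every element of the list.
theorem maxOfList_ge (xs : List Nat) (x : Nat) (h : x ∈ xs) :
    x ≤ maxOfList xs := sorry

-- Prove me: the empty list maps to 0.
theorem maxOfList_nil : maxOfList [] = 0 := sorry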

Workshop reviewers said that the pipeline we used to generate the benchmark was also a research contribution, even though it’s the minimum viable scaffold: a loop in serial.

Say hi in Ottawa on ICSE workshop day! Special shoutout to Gas Station Manager for citing us in their hallucination work like a week after we went public.

Is this a safety paper?

I’m working on making sure we get high-quality critical-systems software out of early AGI. Hardened infrastructure buys us a lot in the slightly crazy story of “self-exfiltrated model attacks the power grid”, but it buys us even more in less crazy stories about all the software modules adjacent to AGI having vulnerabilities rapidly patched at crunch time.

While I’m not super interested in measurement, I’m excited about this line of work as a synthetic data pipeline. It may take galaxy-brained prompting and galaxy-brained scaffolds to ship proofs now (if they’re elicitable at all), but by solving FVAPPS you generate synthetic data which you can put to use in finetune jobs, leading to models way better at proving, hopefully with minimal scaffolding and prompting skill required. We shipped our paper to arXiv Twitter the same week Goedel-LM shipped, and what they do is a kind of “finetune job in the loop” along these lines (i.e. using Lean as a ground-truth signal).

R1 happened

With finetuning this cheap, lots of opportunities are afoot. That’s it, that’s the post. Especially of interest is that API terms of service aren’t acceptable for some critical-systems vendors for compliance reasons, so running a top-performing model on compute you control opens up many possibilities for them in particular.

News in automated mathematics

AlphaGeometry finally dropped a paper

For a while, Google DeepMind’s results in mathematics only had a vague blog post. It’s great that there are finally details in a paper, but I haven’t read it yet.

New SOTA on MiniF2F

https://goedel-lm.github.io/

The topological debate framework

I had a chance to sit down with the author at EAG. This is an exciting theory of the incentives for AIs to have good world models. These are the kinds of protocols that could help us close the spec-world gap, which is going to be critical if GSAI is to do anything at all against arbitrary superintelligences.

Let's say that you're working on a new airplane and someone hands you a potential design. The wings look flimsy to you and you're concerned that they might snap off in flight. You want to know whether the wings will hold up before you spend money building a prototype. You have access to some 3D mechanical modeling software that you trust. This software can simulate the whole airplane at any positive resolution, whether it be 1 meter or 1 centimeter or 1 nanometer.

Ideally you would like to run the simulation at a resolution of 0 meters. Unfortunately that's not possible. What can you do instead? Well, you can note that all sufficiently small resolutions should result in the same conclusion. If they didn't then the whole idea of the simulations approximating reality would break down. You declare that if all sufficiently small resolutions show the wings snapping then the real wings will snap and if all sufficiently small resolutions show the wings to be safe then the real wings will be safe.

How small is "sufficiently small?" A priori you don't know. You could pick a size that feels sufficient, run a few tests to make sure the answer seems reasonable, and be done. Alternatively, you could use the two computationally unbounded AI agents with known utility functions that you have access to.

Debate is a little bit of a misnomer, but close enough.

https://www.lesswrong.com/posts/jCeRXgog38zRCci4K/topological-debate-framework

Atlas hiring dev for spec engineering GUI product

Even if we can kick the proof down to the AI and rely on the typechecker, we still want a human in the loop for writing the specification.

We are developing an AI-assisted IDE for formal specification.

Autoformalization is approaching usefulness but there remains a gap for how humans establish confidence in autoformalized specs and discover issues. We're researching tools and methods to enable engineers to close this gap in real-world assurance applications. We have been prototyping and user-testing a Spec IDE and are looking to add a team member to take over primary responsibility for machine learning engineering.

This job involves developing a machine learning pipeline that powers mechanized spec engineering and review. ML plays multiple roles in the spec IDE: (1) aiding user understanding and navigation by labeling concepts within and across both mechanized and pen-and-paper spec documents, (2) detecting possible issues in the mechanization, and (3) powering a conversational assistant for users to navigate or edit specs.

Autoformalization is not itself the focus of this project, but there's an opportunity to explore autoformalization if desired. This job involves collaborating with a small remote team that brings a user-centered, product discovery mindset to this research; as such, this job also involves learning from user research, and has opportunities to run user studies if desired.

We're looking for a 20 hrs/wk contract through approximately September, but there's room for flexibility. Please let us know your contracting rate ($/hr) if you have a standard rate, else reach out to chat with us; cost will be one of our selection factors.

Contact join@atlascomputing.org with interest.

They want someone with 2 years of surface area on proof assistants like Lean and Coq and experience with ML R&D tooling.

Safeguarded AI TA1.2 and TA1.3 funding call

Video, PDF. Davidad is back with a new funding call.

Technical Area 1.1 is underway and consists of category theorists and probabilistic semantics experts designing a new stack for world models, proof certificates, and specifications. It tackles the conceptual problems at the foundation of Safeguarded AI. Upon successful solutions to those problems, it becomes a fairly normal software project: a backend and a frontend.

TA1.2 is the backend. This involves structure-aware version control (see the previous edition of the newsletter) for eliciting world models and specs, other data-structure topics for keeping track of probabilistic proof certs, and an appropriate theory of databases.

TA1.3 is the frontend. Many important stress points of Safeguarded AI require a human in the loop, so we want a killer user experience. Imagine being a power plant domain expert in the near future, when AIs can do anything, but they need to elicit specs and world models from you. Hopefully it’s not too painful a process!

Both projects are investing 12 months into requirements engineering with minimal prototyping, and the new org that TA2 is planning to incubate around the end of this year will be partially shaped by those requirements.

I think the microeconomics of this ambition level could be reasoned about (scrutinizing the reference class of software projects roughly this big, counting up roughly how much money was spent on them, fermstimating if this is realistic, etc.). But it’s out of scope for this newsletter.

The deadline is April 8th. It’s secretly the 9th but it’s a good habit to subtract one day when you put it in your calendar, especially since it doesn't close on an anywhere-on-earth midnight.



Discuss

March-April 2025 Progress in Guaranteed Safe AI

November 20, 2025 - 20:30
Published on November 20, 2025 5:30 PM GMT

Say hi at ICSE in Ottawa: I’ll be at the reception Thursday, this colocated event on Friday, and the LLM4Code workshop on Saturday.

As usual there are no benefits to the paid subscription.

Sorry for consolidating two months into one post again after I said I wouldn’t.


Fermstimate of the cost of patching all security-relevant open-source software

Niplav writes

So, a proposal: Whenever someone claims that LLMs will d/acc us out of AI takeover by fixing our infrastructure, they will also have to specify who will pay the costs of setting up this project and running it.

I’m almost centrally the guy claiming LLMs will d/acc us out of AI takeover by fixing our infrastructure; technically I’m usually hedging more than that, but it’s accurate in spirit.

If transformative AI is developed soon, most open source projects (especially old ones relevant to internet infrastructure) are going to be maintained by humans with human response times. That will significantly increase the time for relevant security patches to be reviewed and merged into existing codebases, especially if at the time attackers will submit AI-generated or co-developed subtle exploits using AI systems six to nine months behind the leading capabilities, keeping maintainers especially vigilant.

I usually say we prove the patches correct! But Niplav is correct: it’s a hard social problem, and many critical-systems maintainers are particularly slop-phobic and won’t want synthetic code checked in. That’s why I try to emphasize that the two trust points are the spec and the checker, and the rest is relinquished to a shoggoth. That’s the vision anyway: we solve this social problem by involving the slop-phobic maintainers in writing the spec and conveying to them how trustworthy the deductive process is.

Niplav’s Squiggle model:

Median ~$1b worth of tokens, plus all the “setting up the project, paying human supervisors and reviewers, costs for testing infrastructure & compute, finding complicated vulnerabilities that arise from the interaction of different programs…” etc. costs. I think a lot is in our action space to reduce those latter costs, but the token cost imposes a firm lower bound.

But this is an EA Forum post, meaning the project is being evaluated as an EA cause area: is it cost-effective? To be cost-effective, the savings from alleviating some disvalue have to be worth the money you’ll spend. As a programming-best-practices chauvinist, one of my pastimes is picking on CrowdStrike, so let’s not pass up the opportunity. The 2024 outage is estimated to have cost about $5b across the top 500 companies, excluding Microsoft. A public goods project may not have been able to avert CrowdStrike, but it’s instructive for getting a flavor of the damage, and this number suggests it could easily be worth spending around Niplav’s estimate. On cost-effectiveness, though, even I (who work on this “LLMs driving Hot FV Summer” thing full time) am skeptical, only because open-source software is pretty hardened already. Curl/libcurl saw 23 CVEs in 2023 and 18 in 2024, which it’d be nice to prevent but really isn’t a catastrophic amount. Other projects are similar. I think a lot about the Tony Hoare quote “It has turned out that the world just does not suffer significantly from the kind of problem that our research was originally intended to solve.” Not every bug is even an exploit.

Safeguarded AI

Creator profiles for TA1.1 and TA3

The TA1.1 ones have been up for a while and I think I missed them, but TA3 was posted fairly recently.

TA3 is the applications domain. Y’know, the real world, that thing from all those functional programming jokes.

This work aims to address the challenge of AI deployment in safety-critical sectors by combining scientific world models and mathematical proofs to develop robust safety assurances.

Researchers from diverse companies and organisations such as AstraZeneca, Mind Foundry, the University of Oxford, the University of Birmingham, and the University of York, will aim to demonstrate how safeguarded AI systems can help them to make significant improvements in vital sectors.

TA2 call

This isn’t really open to individuals, but as always, reading the funding call gets you some insight into ARIA’s worldview and strategy.

ARIA is launching a multi-phased solicitation for Technical Area 2 (TA2) to support the development of a general-purpose Safeguarded AI workflow. The programme aims to demonstrate that frontier AI techniques can be harnessed to create AI systems with verifiable safety guarantees. In TA2, we will award £18m to a non-profit entity to develop critical machine learning capabilities, requiring strong organizational governance and security standards. Phase 1, backed by £1M, will fund up to 5 teams to spend 3.5 months to develop full Phase 2 proposals. Phase 2 — which will open on 25 June 2025 —will fund a single group, for £18M, to deliver the research agenda. TA2 will explore leveraging securely-boxed AI to train autonomous control systems that can be verified against mathematical models, improving performance and robustness. The workflow will involve forking and fine-tuning mainstream pre-trained frontier AI models to create verifiably safeguarded AI solutions.

I’ve heard this described as something like a “new frontier lab”, which is <sarcasm>always music to my ears</sarcasm>, but in truth I’m only worried about shops that push the envelope on scaling (Chinchilla or otherwise). Boromir wanted to use the One Ring to defend Gondor, and we largely trust Tolkien/whomever when they say that defensive acceleration is folly and the Ring’s corrupting influence is stronger than the will of man. Nevertheless, I feel pretty good about this because it’s narrow and targeted. Eager to find out if I’ll eat my words.

Conferences, events

New control paper

GSAI and Control are really complementary forces, so I think I’ll mention news in Control, at least sometimes.

Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.



Discuss

Restricting Dangerous Research: Has It Worked Before, and Could It Work for AI?

November 20, 2025 - 19:45
Published on November 20, 2025 4:45 PM GMT

When creating policies to deal with AI’s rapid progress, it is important to minimize dangerous AI capability gains that could cause a catastrophe. While restricting scientific research is controversial, it is worthwhile to look at past instances to see if there are lessons that can be applied to limiting or prohibiting certain types of AI research.

This paper reviews previous cryptographic, nuclear, chemical, and biological research restrictions. Each has policy issues that are relevant to understanding how various forms of AI research might be disincentivized or prohibited.

 

Cryptographic Research

Cryptography played an essential role in World War II. The German Reich used the Enigma cipher to communicate, and Britain’s success in cracking the code gave it key intelligence to help with its war effort1.

In 1954, the American government classified cryptography as a munition2. This meant that cryptography was moved to the same legal category as tanks, missiles, and explosives. As a result, it had the same export controls as these weapons via the U.S. Munitions List (USML), which was governed by the International Traffic in Arms Regulations (ITAR)3. Thus, exporting this technology without government approval became a federal crime. Furthermore, American academic researchers could potentially face legal consequences for collaborating on cryptographic research with academic researchers from other countries.

By the 1960s, cryptography began to have commercial applications, which complicated the regulatory framework4. Companies needed strong encryption for wire transfers, and large organizations started using mainframe computers (where multiple users were able to access the same computer). In 1977, the Data Encryption Standard (DES) became the federal encryption standard5. This allowed banks and other companies to take advantage of cryptography’s commercial uses.

In the late 1970s, public-key cryptography also arose. Two researchers, Whitfield Diffie and Martin Hellman, published a paper in 1976 called “New Directions in Cryptography”6, where they introduced a method for anyone to encrypt files, but only private “key-holders” would be able to decrypt the files. This contrasts with previous cryptographic methods where both parties had to first share a private key. More specifically, this meant that one party could give information to another party without the two parties communicating beforehand. The NSA was resistant to this technology because it made it easier for anyone to use cryptography to hide secrets.

While Diffie’s and Hellman’s paper was a conceptual breakthrough, another development occurred shortly after that made public-key cryptography significantly more useful: the RSA algorithm7, which was formally published in 1978. This was an algorithm that could implement public-key cryptography with a digital signature, which was useful for authentication purposes. Before the RSA algorithm, the US government had a near-monopoly on state-of-the-art civilian cryptography, but RSA changed this.

By the early 1990s, this dynamic became more tense for multiple reasons. The increasing popularity of the Internet and the rise of e-commerce required encryption to be feasible. Also, America’s economic competitiveness was at risk because technology companies were forced to export versions of their software that had weaker cryptographic protection8. Furthermore, a “cypherpunk”9 movement developed, which consisted of computer scientists and privacy advocates that saw the government’s policies as unreasonable. People within this movement printed encryption algorithms in books and on t-shirts10. Additionally, a cypherpunk named Phil Zimmermann released a free encryption program called PGP (“Pretty Good Privacy”) on the Internet for anyone to use11.

In 1993, a compromise was attempted in the form of something called the Clipper Chip12, which was proposed by the Clinton administration (but never progressed beyond the prototype phase). The Clipper Chip was a hardware-based encryption system that could be installed in phones and computers. The encryption was strong, but a backdoor existed. This was for law enforcement to access if they received a court order for decrypting communication that occurred on a specific phone or computer. However, there was a large backlash to the Clipper Chip from civil liberties groups, tech companies, and scientists because they argued that this technology increased the government’s ability to conduct mass surveillance. As a result, the Clipper Chip was abandoned in 199613.

Also in 1996, a major court decision changed what policies the government could enact. In Bernstein v. United States14, the United States District Court for the Northern District of California ruled in favor of Daniel Bernstein, a PhD student at UC-Berkeley. Bernstein created an encryption algorithm called “Snuffle”, and he wanted to publish an academic paper as well as the source code for it. However, the State Department informed him that his source code was considered a munition, so he would need to register as an international arms dealer and obtain an export license to publish his academic paper and source code internationally. Bernstein sued the government, and the Electronic Frontier Foundation helped Bernstein argue that computer source code is a form of speech, so it is protected by the First Amendment. As a result, the government was not able to enforce export controls on code-based encryption technologies. Academic researchers and software developers were also able to discuss their cryptographic research more easily without fear of legal repercussions.

Another relevant development occurred in 1996 as well: the Clinton administration moved the jurisdiction for most commercial encryption from the State Department to the Department of Commerce15. Encryption was reclassified from a munition to a dual-use good, which meant it did not only have potential military applications; it also had commercial use. 

In 2000, a second important court case was decided: Junger v. Daley16. This case was decided at the Sixth Circuit (a higher level than the Bernstein case), concerning the export of encryption software outside of the United States. Peter D. Junger, a Case Western professor, wanted to teach a class about computer law, but he was not allowed to discuss technical details about encryption with students from other countries. This was because export restrictions classified encryption software as a munition. Consequently, Junger would not be allowed to have foreign students in his class. The case was ruled in Junger’s favor.

Also in 2000, but before the Junger case was decided, the Clinton administration eliminated most restrictions on the export of retail and open-source cryptographic software17. Since then, the federal government has had less stringent rules about cryptography.

The US government’s previous cryptography policies show that it is difficult for a country to curtail the spread of research (particularly to other countries). Export controls on algorithms are hard to implement, and they can easily fail. Artificial intelligence in 2025 is also different from cryptography in the 20th century in important ways. First, information can spread significantly quicker than it did in the 20th century. For example, a post on X (formerly Twitter) is capable of informing millions of people within several hours about new artificial intelligence research. Also, AI research tends to be open source (via arXiv, academic conferences, etc.). As a result, new ideas in this field move extremely fast, so it would be infeasible to inspect all of them ahead of time unless highly stringent laws were passed.

Additionally, if the US government wanted AI research to be removed from the Internet, it would need platforms and service providers to cooperate with takedowns, but this would not necessarily be timely (or even accepted by the platforms and service providers). Furthermore, because AI research is global and decentralized, problematic research could easily reappear on other platforms as soon as it is removed from any previous platforms it was on. This research could also be transferred and accessed via onion routing (such as the Tor network).

The Bernstein case also sets an important precedent: source code is speech that is protected by the First Amendment. Thus, the court system would likely rule against the government if it were to ban individuals from publishing algorithmic advances that they discovered (however, this has not been tested at the Supreme Court level).

 

Nuclear Research

The first nuclear bomb was successfully detonated on July 16th, 1945, by the United States at the Trinity test site in New Mexico18. Within a month, the US dropped nuclear bombs on Hiroshima and Nagasaki, leading to the Japanese surrender on September 2nd, 1945, and the end of World War II19.

The nuclear weapons research leading to the first successful test was known as the Manhattan Project20. The US president at the time, Franklin Delano Roosevelt, demanded “absolute secrecy” for the project, and compartmentalization was used as a way of minimizing the number of people who knew the full extent of the research. Individuals could also be sentenced to up to ten years in prison for disclosing secrets about the Manhattan Project. Additionally, the government’s Office of Censorship asked journalists not to discuss topics related to nuclear energy.

Shortly after the Trinity detonation, America passed the Atomic Energy Act of 1946 (also known as the McMahon Act). This created “Restricted Data” as a legal category. Restricted Data included “all data concerning design, manufacture or utilization of atomic weapons… whether created inside or outside government.”21 Furthermore, the concept of “Born Secret” was introduced, which applied to Restricted Data. Born Secret meant that certain information was classified as soon as it was created. While individuals could be prosecuted for divulging information that was classified as Born Secret, that rarely happened. However, the few prosecutions that did occur served as important deterrents.

Several factors made nuclear research secrecy feasible during the early Cold War. First, there was low substitutability. More specifically, a certain equation or schematic (such as the Teller-Ulam design for a hydrogen bomb or the implosion method used in the “Fat Man” bomb) could be a major shortcut to experiments that would take time to gain insights from. Second, there was identifiability: specific numbers (like the exact critical mass of uranium-235 and plutonium-239) and data (like information for isotope separation techniques) could be seen as clear red flags that should be censored. Third, physical facilities were a major bottleneck for nuclear weapon research and design. For example, even with the correct equations, an organization would need enriched materials to make a bomb with. There were various processes that were relatively easy to detect during surveillance: uranium mining, reactor construction, and shipment of specialized equipment.

These elements (low substitutability, identifiability, and physical facilities) do not work as well for AI research secrecy as they did for nuclear research secrecy. Regarding substitutability, numerous AI breakthroughs happen at private companies, and researchers often move to competing companies. While they might have signed nondisclosure agreements with their previous companies, it is likely that some of these researchers use their knowledge to help improve models at the companies they moved to. For identifiability, the boundary between benign and harmful AI systems often depends on the context, so it is difficult to have clear signals that flag a model as harmful (unlike nuclear research where weapons-grade materials and processes provide obvious red flags). On the matter of physical facilities, the infrastructure for AI is more accessible and distributed than it is in the nuclear domain. While a country might need uranium enrichment to build a nuclear bomb, an AI company only needs to use a commercial cloud provider or a cluster of GPUs to train a dangerous model.

International coordination on nuclear weapons, however, provides more useful lessons for AI policy. In 1946, America proposed the Baruch Plan22, which called for the United States to eliminate its nuclear weapons only after international mechanisms were established to prevent all other countries from developing them. The USSR rejected this plan, however, because it argued that the US needed to dismantle its nuclear weapons before the enforcement mechanisms for other countries were in place. In 1949, the USSR successfully tested its first nuclear bomb23.

In 1957, the International Atomic Energy Agency (IAEA) was established within the United Nations system24. The agency monitors nuclear weapons programs throughout the world and provides technical assistance for peaceful uses of nuclear energy, while verifying that countries do not use that knowledge to create nuclear weapons. It also mandates that states declare all nuclear materials and facilities, and it attempts to track every gram of fissile material.

In 1963, the Partial Test Ban Treaty was signed and went into effect25. It banned all nuclear weapons test detonations except for underground ones. It was signed by the US, the USSR, and the UK. As of 2025, there are 126 countries that are parties to the treaty.

In 1968, arguably the most important nuclear weapons treaty was signed: the Treaty on the Non-Proliferation of Nuclear Weapons26 (also known as the Non-Proliferation Treaty (NPT)). This treaty, which went into effect in 1970, required all nuclear states that ratified it to promise eventual disarmament, and all non-nuclear states that ratified it to promise they would not develop nuclear weapons. In exchange, every country that was a party to the treaty would gain access to peaceful nuclear technologies. Currently, there are 190 countries that are parties to the treaty (technically 191 because North Korea’s withdrawal from the treaty in 2003 was never formally accepted by the other parties). The following nuclear states are not parties to this treaty: India, Pakistan, and Israel (the only non-nuclear state that is not a party to this treaty is South Sudan).

In 1996, the Comprehensive Nuclear-Test-Ban Treaty (CTBT) was signed27. The CTBT bans all nuclear explosions, including those for civilian purposes. The treaty has never entered into force, however, because most nuclear powers (America, Russia, China, India, Pakistan, North Korea, and Israel) have not ratified it.

While nuclear weapons still exist, there are Nuclear-Weapon-Free Zones (NWFZs) in parts of the world28. These areas include Antarctica, Latin America, the Caribbean, the South Pacific, Southeast Asia, Central Asia, and most of Africa.

International coordination on nuclear weapons could be applied to AI policy in multiple ways. First, an international body akin to the IAEA that coordinates on AI policy would be beneficial, as it would help prevent (or at the very least, decrease) the race dynamic between countries. Second, a treaty like the NPT that requires all parties to agree not to develop certain types of AI capabilities would further decrease the race dynamic. Third, treaties only serve a valuable purpose if they are implemented (the CTBT’s delay highlights this), so there needs to be enough support from the international community to ensure a treaty succeeds.

Importantly, it is harder to verify AI capability gains than nuclear research gains. Converting radioactive material into weapons-grade form requires a large amount of detectable work, which is why the IAEA has been able to provide effective oversight. AI capability gains, by contrast, are much harder to monitor: they are not confined to state actors and leave far fewer physical traces to track.

If AI treaties are to succeed, they will need to enter into force faster than nuclear treaties did. Nuclear treaties have often taken several years to take effect; given how rapidly AI capabilities are advancing, a treaty that takes that long could be obsolete by the time it is implemented, because the technology might have changed significantly since the treaty was drafted.

 

Chemical Research

The Chemical Weapons Convention (CWC) was signed in January 1993 and entered into force in April 199729. It bans the development, production, stockpiling, use, acquisition, and transfer of chemical weapons, and it prohibits research specifically aimed at creating or improving chemical weapons. The CWC addresses countries rather than non-state actors; it expects states that sign the treaty to enforce its rules for any non-state actors residing in their territories. There are currently 193 parties to the treaty (4 UN states are not parties: Egypt, Israel, North Korea, and South Sudan), and all parties are required to destroy any chemical weapons they possess. The Organization for the Prohibition of Chemical Weapons (OPCW), which administers the treaty, verifies the destruction of these weapons.

The OPCW categorizes three classes of chemicals as controlled substances, each with separate disclosure rules: Schedule 1 covers chemicals with few or no uses outside weaponry, Schedule 2 covers chemicals with limited legitimate uses outside weaponry, and Schedule 3 covers chemicals with major legitimate uses outside weaponry.

The CWC is a useful reference for AI policy because it shows that classification schemes could tailor different policies to different AI technologies, depending on how strong their dual-use capabilities are. For example, Schedule 1 could cover AI technologies used primarily for military applications (such as lethal autonomous weapons), Schedule 2 could cover technologies with more dual-use capabilities (such as AI agents), and Schedule 3 could cover technologies with mainly commercial capabilities (such as AI for personalized advertising).

 

Biological Research

Like the Chemical Weapons Convention, the Biological Weapons Convention (BWC) bans the development, production, stockpiling, use, acquisition, and transfer of biological weapons30. It also prohibits research that is for the purpose of creating or improving biological weapons. As with the CWC, the BWC focuses on countries but does not address what the rules should be for non-state actors (and it also expects states that sign the treaty to enforce its rules for any non-state actors that might reside in their territories). The BWC was signed in April 1972, and it went into effect in March 1975. There are currently 189 parties to this treaty.

The BWC was pioneering because it was the first multilateral treaty to ban a whole class of weapons of mass destruction. However, unlike the CWC, the BWC does not have a verification regime. Countries are instead expected, but not required, to engage in domestic monitoring and enforcement. This has resulted in countries having different levels of oversight. For example, the US31 has stringent policies, whereas Sudan32 does not.

While the degree of state monitoring and enforcement varies by party, no country currently acknowledges that it has (or seeks to have) biological weapons. However, certain countries are suspected to have covert bioweapons programs (such as Russia and North Korea)33. Furthermore, the Soviet Union secretly maintained its bioweapons program for two decades after it signed the BWC34. This program, known as Biopreparat, had tens of thousands of people working on it and operated dozens of facilities across the USSR. It worked on weaponizing deadly agents such as anthrax, smallpox, plague, and Marburg virus. This deception was publicly confirmed by Russian President Boris Yeltsin in 199235. Non-state actors have also had bioweapons, despite their respective states’ enforcement against such weapons. A clear example of this is the Aum Shinrikyo cult in Japan36.

Relevantly, there was a treaty before the BWC and CWC called the Geneva Protocol37 (drafted in 1925, in force from 1928) that banned the use of chemical and biological weapons in warfare. It was not as extensive as the BWC and CWC because it did not ban the production, storage, or transfer of such weapons. The treaty was signed by major powers such as France and Germany, but its limited scope meant that countries continued developing chemical weapons throughout the interwar period, and it failed to prevent their use during the Second World War.

When thinking about AI policy, the BWC also shows the limits of restricting research that is difficult to verify. When parties are merely expected to operate on an honor system, the effectiveness of such restrictions is reduced.

Importantly, self-regulation has led to previous biological research restrictions (albeit voluntary ones). A key example is from 1974, when a group of US researchers published a letter that called for researchers to voluntarily pause the use of recombinant-DNA methods (genetic engineering)38. The researchers claimed that the pause was needed so that safety protocols could first be devised for the new technology. This led to the Asilomar Conference on Recombinant DNA in 1975, where scientists agreed upon best practices for engaging in genetic engineering. The scientists also established different protocols for different types of experiments, depending on how potentially hazardous an experiment was. However, the USSR’s Biopreparat had recently launched, and Soviet scientists engaged in recombinant DNA experiments during the requested pause. Thus, the scientist-led pause did not cause all research on recombinant DNA to stop (but it did decrease this research).

The Asilomar Conference on Recombinant DNA is applicable to AI policy because it shows that self-regulation could be beneficial (but there might be defectors without strong verification mechanisms). Also, the AI research community could craft different protocols for different types of research, depending on the level of risk such research created. Additionally, conferences between major AI companies could be useful for many of the key AI researchers in the field to discuss and agree to best practices.

Scientists unilaterally decided on another research pause when they realized that certain types of H5N1 experiments were too high-risk to engage in at the time39. More specifically, they were concerned about the use of gain-of-function research to discover how transmissible the virus could be in mammals. This pause began in January 2012 and was supposed to end 60 days later, but the scientists decided to suspend the research indefinitely (eventually ending the pause after a year) to allow for more time to adequately review the risks versus the benefits of the research. They also wanted to ensure appropriate safety and security measures were in place before the pause ended.

While the H5N1 gain-of-function pause was decided by scientists themselves, the US government declared a federal pause on funding for gain-of-function research in 201440 so that policymakers could have time to assess the risks and benefits of this type of research (the pause ended in 2017). Labs that still engaged in gain-of-function research during the pause jeopardized their chances of future federal funding.

Both the scientist-led pause on H5N1 gain-of-function research and the US government’s pause in funding for gain-of-function research provide important lessons that are applicable to AI. The H5N1 pause and Asilomar Conference show that self-regulation could help promote safety and lower the risk of a catastrophe. Also, just as the US government implemented a federal pause on funding for gain-of-function research, it could also pause funding for AI research that it considers dangerous. However, this policy would likely have a limited effect, as most of the leading AI research in America occurs in the private sector.

 

Policies for Discouraging Dangerous AI Research

Previous research restrictions have had varying degrees of success, but AI policies could be devised that use these interventions as a reference. After engaging in a historical review of relevant research restrictions, this paper concludes that the most effective lever for minimizing dangerous AI research would be an international treaty. This treaty would categorize AI research into various tiers and have different policies for each tier. Furthermore, the treaty would set up an international body for monitoring and enforcement.

A key component of the CWC is its classification system for different types of chemicals. Likewise, an AI treaty could classify AI research into multiple tiers. One method of classification would be by the degree of dual-use capability: Tier 1 could cover research with minimal beneficial applications (such as automated cyberattack weapons), Tier 2 could cover dual-use research (such as AI agents that use tools), and Tier 3 could cover research that is mainly commercial but capable of misuse (such as personalized advertising). These tiers would then determine which regulations apply. However, this method is imperfect, because almost all AI research has dual-use applications, which limits the value of the classification.

A more beneficial classification system would be one that focuses on risk levels. This system could establish different protocols for each risk category without having to gauge intent as thoroughly. For example, research on recursive self-improvement would have stricter policies than research on recommendation systems.

An AI treaty would need verification mechanisms to be effective. It would thus be prudent to have an international body that helps countries coordinate on policies and verifies that all parties are abiding by the treaty. This worked for nuclear research (with the IAEA) and chemical research (with the OPCW), although it would be harder for the field of AI because research is more dispersed in this domain and primarily takes place in the private sector. However, just as parties to the CWC and BWC were expected to enforce rules for non-state actors in their jurisdictions, countries that are parties to an AI treaty could pressure AI companies operating in their territories to emphasize safety as an important component of their research processes.

At the international level, countries that refuse to join the treaty or violate its terms would face coordinated trade restrictions on advanced AI chips and the supercomputing infrastructure required for training frontier AI systems. Additionally, treaty participants could restrict market access for AI developed through prohibited research methods, creating strong economic incentives for companies to avoid dangerous research even if their home countries permit it.

More broadly, just as various nuclear treaties helped decrease the nuclear arms race between America and the Soviet Union41, an AI treaty could decrease race dynamics between countries aiming to have better AI capabilities than their adversaries. However, an AI treaty would likely need to be implemented more quickly than previous treaties that restricted dangerous research, so that it would still be relevant once in force.

 

Conclusion

While scientific research restrictions are difficult to implement and often controversial, they can serve an important purpose in preventing a catastrophe. Given that artificial intelligence could cause severe harms to humanity, it is worthwhile to seriously consider scientific research restrictions for the most dangerous forms of artificial intelligence research.

 

Acknowledgements

This paper was written for a 2025 summer research fellowship through the Cambridge Boston Alignment Initiative. The author would like to thank Aaron Scher and Christopher Ackerman for their guidance and feedback, and Josh Thorsteinson for sharing a related manuscript and providing helpful feedback.

 

References

1. The Enigma of Alan Turing. Central Intelligence Agency. https://www.cia.gov/stories/story/the-enigma-of-alan-turing/

2. Export of cryptography from the United States. Wikipedia. https://en.wikipedia.org/wiki/Export_of_cryptography_from_the_United_States

3. Export of Defense Articles and Services – ITAR | Office of Research Security & Trade Compliance. University of Pittsburgh. https://www.researchsecurity.pitt.edu/export-defense-articles-and-services-itar

4. Deciphering the Cryptography Debate. Brookings. https://www.brookings.edu/articles/deciphering-the-cryptography-debate/

5. Cryptography | NIST. National Institute of Standards and Technology. https://www.nist.gov/cryptography

6. New Directions in Cryptography. IEEE Transactions on Information Theory (PDF hosted by Stanford). https://www-ee.stanford.edu/~hellman/publications/24.pdf

7. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Author-hosted PDF (Rivest, Shamir, Adleman). https://people.csail.mit.edu/rivest/Rsapaper.pdf

8. Tough on Crime or Tough on Competition? How Federal Law Enforcement’s Push for Surveillance-Friendly Technology Poses a Substantial Threat to the U.S. Software Industry’s Competitiveness. University of Florida Journal of Technology Law & Policy (PDF). https://scholarship.law.ufl.edu/cgi/viewcontent.cgi?article=1230&context=jtlp

9. Cypherpunk | Internet Policy Review. Internet Policy Review. https://policyreview.info/glossary/cypherpunk

10. Crypto Wars. GitHub Pages. https://uwillnvrknow.github.io/deCryptMe/pages/cryptowars3.html

11. Phil Zimmermann’s Home Page. Phil Zimmermann. https://philzimmermann.com/EN/background/index.html

12. The Clipper Chip. Electronic Privacy Information Center. https://archive.epic.org/crypto/clipper/

13. Sinking the Clipper Chip. Discourse Magazine. https://www.discoursemagazine.com/p/sinking-the-clipper-chip

14. Bernstein v. US Dept. of State, 945 F. Supp. 1279 (N.D. Cal. 1996) :: Justia. Justia. https://law.justia.com/cases/federal/district-courts/FSupp/945/1279/1457799/

15. Encryption: Administration Opens The Door To Domestic Regulation As Congress Debates Privacy, Commercial And Security Concerns. Wiley. https://www.wiley.law/newsletter-5

16. JUNGER v. DALEY (2000). FindLaw. https://caselaw.findlaw.com/us-6th-circuit/1074126.html

17. U.S. Removes More Limits on Encryption. The New York Times. https://www.nytimes.com/2000/01/13/business/us-removes-more-limits-on-encryption.html

18. Trinity: World’s First Nuclear Test. Air Force Nuclear Weapons Center. https://www.afnwc.af.mil/About-Us/History/Trinity-Nuclear-Test/

19. The Atomic Bombs That Ended World War 2 | Imperial War Museums. Imperial War Museums. https://www.iwm.org.uk/history/the-atomic-bombs-that-ended-the-second-world-war

20. Manhattan Project. Wikipedia. https://en.wikipedia.org/wiki/Manhattan_Project

21. Atomic Energy Act of 1946. U.S. Department of Energy (PDF). https://doe-humangenomeproject.ornl.gov/wp-content/uploads/2023/02/Atomic_Energy_Act_of_1946.pdf

22. The Acheson-Lilienthal & Baruch Plans, 1946. U.S. Department of State – Office of the Historian. https://history.state.gov/milestones/1945-1952/baruch-plans

23. RDS-1. Wikipedia. https://en.wikipedia.org/wiki/RDS-1

24. History | IAEA. International Atomic Energy Agency. https://www.iaea.org/about/overview/history

25. Treaty Banning Nuclear Weapon Tests in the Atmosphere, in Outer Space and Under Water. United Nations Treaty Collection. https://treaties.un.org/pages/showDetails.aspx?objid=08000002801313d9

26. Treaty on the Non-Proliferation of Nuclear Weapons (NPT) – UNODA. United Nations Office for Disarmament Affairs. https://disarmament.unoda.org/wmd/nuclear/npt/

27. Comprehensive Nuclear-Test-Ban Treaty (CTBT) – UNODA. United Nations Office for Disarmament Affairs. https://disarmament.unoda.org/wmd/nuclear/ctbt/

28. Nuclear-Weapon-Free Zones – UNODA. United Nations Office for Disarmament Affairs. https://disarmament.unoda.org/wmd/nuclear/nwfz/

29. Convention on the Prohibition of the Development, Production, Stockpiling and Use of Chemical Weapons and on Their Destruction. United Nations Treaty Collection. https://treaties.un.org/pages/viewdetails.aspx?src=treaty&mtdsg_no=xxvi-3&chapter=26

30. UNODA Treaties Database — Biological Weapons Convention (BWC). United Nations Office for Disarmament Affairs. https://treaties.unoda.org/t/bwc

31. United States of America — BWC Implementation (UNIDIR). BWC Implementation Support Unit. https://bwcimplementation.org/states/united-states-america

32. Sudan — BWC Implementation (UNIDIR). BWC Implementation Support Unit. https://bwcimplementation.org/states/sudan

33. The State of Compliance with Weapons of Mass Destruction-Related Treaties. Council on Strategic Risks. https://councilonstrategicrisks.org/2024/04/19/the-state-of-compliance-with-weapons-of-mass-destruction-related-treaties/

34. Biopreparat. Wikipedia. https://en.wikipedia.org/wiki/Biopreparat

35. Biological Weapons — Russian / Soviet Nuclear Forces. Federation of American Scientists. https://nuke.fas.org/guide/russia/cbw/bw.htm

36. Aum Shinrikyo: Once and Future Threat? | Office of Justice Programs. U.S. Office of Justice Programs. https://www.ojp.gov/ncjrs/virtual-library/abstracts/aum-shinrikyo-once-and-future-threat

37. 1925 Geneva Protocol – UNODA. United Nations Office for Disarmament Affairs. https://disarmament.unoda.org/wmd/bio/1925-geneva-protocol/

38. Asilomar and recombinant DNA – NobelPrize.org. NobelPrize.org. https://www.nobelprize.org/prizes/chemistry/1980/berg/article/

39. H5N1 Researchers Announce End of Research Moratorium. Science. https://www.science.org/content/article/h5n1-researchers-announce-end-research-moratorium

40. Ban on gain-of-function studies ends. (PMC7128689). PubMed Central. https://pmc.ncbi.nlm.nih.gov/articles/PMC7128689/

41. Nuclear arms race. Wikipedia. https://en.wikipedia.org/wiki/Nuclear_arms_race




Should we shun the legibly evil?

November 20, 2025 - 16:22
Published on November 20, 2025 1:22 PM GMT

I used to believe that it was wrong, or at least counterproductive, to shun people for openly holding evil viewpoints. Surely that would just cause smart-but-evil people to pretend they're on the side of good, while less capable liars would be punished for honesty. And all that this conformity-forcing would achieve is killing the discussion around these topics, meaning that evil beliefs arising from incorrect or incoherent data couldn't get corrected. Moreover, if some of our own prosocial views are based on falsehoods, wouldn't it be a great service to strengthen our foundations?

For instance, imagine a smart, principled racist: someone who actually believes that outwardly visible ethnicity is a reasonable way to predict how someone will behave. It's hard to even think about, as the halo effect tells us that you cannot be both smart and racist. I sure used to think I'm both. If there were real arguments against racism, why were those not presented instead of just asserting that it's wrong? So-called "scientific racism" used to be the majority viewpoint, and I'm not surprised that reasoning from first principles would lead there first. After all, I'm quite sure that appearances are a reasonable way to predict behavior (everything is correlated), and the actual argument is more about not mistreating individuals based on statistical predictions. Which is reasonable, and I'm mostly sad that the core point itself is rarely articulated.

Anyway, the argument goes like this: That specific person is making decisions about other people, like hiring employees or selling insurance. If we disallow the visible use of racial bias, they'll continue using those exact same ideas, but not talk about them anymore. Now everybody is worse off. Some enforcement can be done to disincentivize this, and it partially works. But wouldn't it have been better if the enforcement could target the outcome and not the intention?

Other examples of evil behavior include legal profit-maximizing, meat-eating, meritocracy, human rights advocacy, supporting the wrong political party, and anything that's called prejudice. I'm sure my examples are not that representative a sample of the real problems, and there must be more of these, but it's hard to see outside one's bubble.

The first step towards abandoning this line of thinking came from realizing that good and evil are not that clearly defined. I didn't mean those as value judgements even before, but rather as something like majority opinions in my social circles. But it's not just about minority versus majority, it's about politics, which is why these themes are often reduced to good and evil instead of having civilized utilitarian policy arguments. (Hah)

I'm sure I was somewhat meta-contrarian when arguing this, but it would be incorrect to say I didn't actually believe it, at least partially. It's such a convincing take, despite (or because of) its unintuitiveness. It feels like leaning on mistake theory, although by the time I'd learned about mistake theory I had long since given up this argument. But there's still a part of me that believes there's at least some merit to it, which is why I'm trying to deconstruct it here.

Prejudice is contagious, in a memetic sense. Maybe it's even a cognitohazard? I wonder how much of xenophobia is learned instead of innate. I still think hearing both sides of an argument and then judging is the sane way to go about things, but the defaults actually matter. Most people don't seem to think about these things much, and just having some simple answers is worth a lot.

Another counterargument is that it's hard to hold different internal and external beliefs. Maybe people will actually change their internal beliefs if forced to act as if they were different? Then again, free speech is quite nice, and the soul-corrupting effects of forced conformity, whenever society holds values you don't share, are not good either.

And lastly, a big part of why prejudice hurts people is not the decisions made in secret, but the openly hostile behavior. Just reducing that is worth quite a bit.




Rumored Trump EO

November 20, 2025 - 16:07
Published on November 20, 2025 1:07 PM GMT

Transformer News has released a draft of a rumored upcoming Trump White House Executive Order: Eliminating State Law Obstruction of National AI Policy. This article contains the text of the released draft. 

By the authority vested in me as President by the Constitution and the laws of the United States of America, it is hereby ordered:

Section 1. Purpose. United States leadership in Artificial Intelligence (AI) will promote American national and economic security and dominance across many domains. Pursuant to Executive Order 14179 of January 23, 2025 (Removing Barriers to American Leadership in Artificial Intelligence), I revoked my predecessor's attempt to paralyze this industry and directed my Administration to remove barriers to American AI leadership.

My Administration has already done tremendous work to advance that objective, including by updating existing Federal regulatory frameworks to remove barriers to and encourage adoption of AI applications across sectors. But we remain in the earliest days of this technological revolution and are still in a race with adversaries for supremacy within it. Our national security demands that we win this race.

To win, American AI companies must be free to innovate without cumbersome regulation. But State legislatures have introduced over 1,000 AI bills that threaten to undermine that innovative culture. California, for example, recently enacted a complex and burdensome disclosure and reporting law premised on the purely speculative suspicion that AI might "pose significant catastrophic risk." And a new Colorado law about so-called "algorithmic discrimination" may even force AI models to embed DEI in their programming, and to produce false results in order to avoid a "differential treatment or impact" on Colorado's enumerated demographic groups. But the United States' AI regulatory framework must prioritize truth. Moreover, sophisticated proponents of a fear-based regulatory capture strategy are responsible for inciting these laws, where subjective safety standards hinder necessary AI development, and which creates a patchwork regulatory framework that forces compliance with the lowest common denominator and allows the most restrictive states to dictate national AI policy at the expense of America's domination of this new frontier. My Administration will act to ensure that there is a minimally burdensome national standard — not 50 discordant State ones.

Sec. 2. Policy. It is the policy of the United States to sustain and enhance America’s global AI dominance through a minimally burdensome, uniform national policy framework for AI.

Sec. 3. AI Litigation Task Force. Within 30 days of the date of this order, the Attorney General shall establish an AI Litigation Task Force whose sole responsibility shall be to challenge State AI laws, including on grounds that such laws unconstitutionally regulate interstate commerce, are preempted by existing Federal regulations, or are otherwise unlawful in the Attorney General's judgment, including, if appropriate, those laws identified pursuant to section 4 of this order. The AI Litigation Task Force shall consult from time to time with the Special Advisor for AI and Crypto, the Assistant to the President for Science and Technology, the Assistant to the President for Economic Policy, and Counsel to the President regarding the emergence of specific State AI laws that warrant challenge.

Sec. 4. Evaluation of Onerous State AI Laws. Within 90 days of the date of this order, the Secretary of Commerce, consistent with his authorities under 47 U.S.C. 902(b) and 15 U.S.C. 272(b), shall, in consultation with the Special Advisor for AI and Crypto, Assistant to the President for Economic Policy, Assistant to the President for Science and Technology, and Counsel to the President, publish an evaluation of existing State AI laws that identifies laws that conflict with the policy set forth in section 1 of this order, as well as laws that should be referred to the Task Force established pursuant to section 3 of this order. That evaluation of State AI laws shall, at a minimum, identify laws that require AI models to alter their truthful outputs, or that may compel AI developers or deployers to disclose or report information in a manner that would violate the First Amendment or any other provision of the Constitution.

Sec. 5. Restrictions on State Funding.
(a) Within 90 days of the date of this Order, the Secretary of Commerce, through the Assistant Secretary of Commerce for Communications and Information, shall issue a Policy Notice specifying the conditions under which States may be eligible for remaining funding under the Broadband Equity Access and Deployment (BEAD) Program that was saved through my Administration's "Benefit of the Bargain" reforms, consistent with 47 U.S.C. 1702(e)-(f). That Policy Notice must provide that States with AI laws identified pursuant to section 4 of this order are ineligible for non-deployment funds, to the maximum extent allowed by Federal law. The Policy Notice must also describe how a fragmented State regulatory landscape for AI threatens to undermine BEAD-funded deployments, the growth of AI applications reliant on high-speed networks, and BEAD's mission of delivering universal, high-speed connectivity.

(b) Agencies shall take immediate steps to assess their discretionary grant programs and determine whether agencies may condition such grants on States either not enacting an AI law that conflicts with the policy of this order, including any AI law identified pursuant to section 4 or challenged pursuant to section 3 of this order, or, for those States that have enacted such laws, on those States entering into a binding agreement with the relevant agency not to enforce any such laws during any year in which it receives the discretionary funding.

Sec. 6. Federal Reporting and Disclosure Standard. Within 90 days of the publication of the identification specified in section 4 of this order, the Chairman of the Federal Communications Commission shall, in consultation with the Special Advisor for AI and Crypto, initiate a proceeding to determine whether to adopt a Federal reporting and disclosure standard for AI models that preempts conflicting State laws.

Sec. 7. Preemption of State Laws Mandating Deceptive Conduct in AI Models. Within 90 days of the date of this order, the Chairman of the Federal Trade Commission shall, in consultation with the Special Advisor for AI and Crypto, issue a policy statement on the application of the FTC Act's prohibition on unfair and deceptive acts or practices under 15 U.S.C. 45 to AI models. That policy statement must explain the circumstances under which State laws that require alterations to the truthful outputs of AI models are preempted by the FTC Act's prohibition on engaging in deceptive acts or practices affecting commerce.

Sec. 8. Legislation. The Special Advisor for AI and Crypto and the Director of the Office of Legislative Affairs shall jointly prepare for my review a legislative recommendation establishing a uniform Federal regulatory framework for AI that preempts State AI laws that conflict with the policy set forth in this order.

Sec. 9. General Provisions.
(a) Nothing in this order shall be construed to impair or otherwise affect:
(i) the authority granted by law to an executive department or agency, or the head thereof; or
(ii) the functions of the Director of the Office of Management and Budget relating to budgetary, administrative, or legislative proposals.

(b) This order shall be implemented consistent with applicable law and subject to the availability of appropriations.

(c) This order is not intended to, and does not, create any right or benefit, substantive or procedural, enforceable at law or in equity by any party against the United States, its departments, agencies, or entities, its officers, employees, or agents, or any other person.

(d) The costs of publication of this order shall be borne by the Department of Commerce.

Note: I made this by giving the PDF to ChatGPT. I did a quick skim over the results in order to confirm accuracy, but there may be some mistranslations.




What would adults in the room know about AI risk?

November 20, 2025 - 12:11
Published on November 20, 2025 9:11 AM GMT

A prompt I like for thinking about what AI macrostrategy needs to achieve is “what would the adults in the room know about our situation, prior to the transition to advanced AI?”

The hypothetical I’m imagining is something like:

  • We’re on the cusp of an epochal shift to an AI-powered future
  • But it’s ok! Sane, competent, sensible adults are navigating the transition
    • I don’t mean “relatively sane, competent, sensible people drawn from the literal population of human adults”, but something more poetic — the version of adults that we imagine as kids, who are actually sane, competent, and sensible by any objective standards, even on very hard questions in unprecedented situations
  • They are paying attention to the things they should be, answering the questions that need answering (and leaving those that don't), and taking measured and appropriate actions to navigate safely through
  • What questions do they answer?[1]

Part of why I like this prompt is that it draws on quite rich connotations about the emotional state of idealised adults.[2] Things like: adults are calm, they're not reactive, they're not particularly concerned about how others view them or whether they get to be a hero, they're sensible and level-headed. These connotations help me to route around my own fear/panic in the face of AI risk (and my psychological need to personally make a contribution), such that I generate slightly different answers to this prompt than to things like ‘what’s critical path’ or ‘what do we need to know before AI can automate our work’.

Some pictures that vaguely capture what I mean by 'adult':


Below are my first pass responses to the prompt. I’d be interested in reactions, and especially in other people’s responses.

Ok, some things I think that adults in the room would know, in the order they occurred to me:

  • What are the likely selection effects, good and bad, on both AI systems and humans?
    • I imagine adults having a fairly mechanistic understanding of where the broad basins of attraction are, and using that to determine which risks to focus on - rather than more brittle threat modelling or predictions.
  • How are other humans likely to act during this transition?
    • I think adults would have a pretty grounded sense of what other humans are likely to do. I’m thinking of things like ‘the USG is un/likely to panic’, ‘China is un/likely to follow through on threats’, etc. Adults would do a bunch of talking to the relevant people and have good internal models of how they’re likely to behave, as well as knowing any fairly robust social science results that pertain here.
  • Which actions are hard to reverse?
    • My model of adults would be tracking that a lot of value could be lost through irreversible actions, which implies that it’s important to a) identify which actions are hard to reverse, and then b) for the most part, avoid taking those actions. I think adults would also be tracking that some irreversible actions might be necessary to prevent other harms, and would be willing to take them soberly and advisedly.
  • When to take leaps of faith
    • Sometimes, the right move is to jump to the equilibrium you think is right, and just have faith that others will follow suit. I think adults would be good at discerning when we need analytical, rational answers, and when we need faith. Leaps of faith that seem plausible to me are:
      • In some scenarios, a leap of faith that if you slow down, others will follow suit
      • A leap of faith that questions about acausal trade, simulations, most of space, can and should be handled later by more intelligent beings (with the exception of hard to reverse actions in those domains, which I think adults would be tracking)
  • Which questions can we safely defer to AI systems, and which do we need to answer via human reasoning?
    • In the status quo world, I’m worried about errors in both directions here. I think adults wouldn’t be governed by either excitement or fear on this, and would have a principled take on the sorts of questions that normatively should be answered by humans, and that empirically can be well and safely answered by AIs.

This question originated from a retreat session run by Lizka Vaintrob which I really liked, on which questions we actually need to know the answers to in the near future for the transition to advanced AI to go well. Thanks to Lizka and Owen Cotton-Barratt for that session.

This post is just my responses to one of several prompts; I hope that Lizka, Owen and I will write more about our collective takes on the overall question soon.

  1. “What do they actually do” is another very important question that I should think about more, but haven’t so far. ↩︎

  2. I know people who don’t really like treating the abstraction ‘adult’ in this idealised way, I think because they’re worried that the abstraction will be used to violently repress the child parts of them. I don’t intend the abstraction in that way, and am more trying to channel inspiration than judgement here. ↩︎




10 Wrong and Dumb Grammar Rules

November 20, 2025 - 10:56
Published on November 20, 2025 7:56 AM GMT

(Jump to the end if you just want the list.)

Let's talk about supposed grammar rules that are actually prescriptivist hogswallop. Like prepositions somehow being in a magic word class that sentences can't ever end with? “This is the sort of English," said Winston Churchill, "up with which I will not put.” You don't need to ask for whom the bell tolls. You can just ask who it tolls for.[1]

What about that who/whom distinction? My take is that “whom” has mostly disappeared but lingers on when following “to” or “for” — but even then it's optional except in certain idioms like “to whom it may concern” or “for whom the bell tolls”.

Don't get me wrong, writing constraints are amazing. I have an old Messy Matters post praising them to the sky. If you want to take up the challenge of following all the grammar rules I'm deriding here, as an exercise, akin to writing an essay with only one-syllable words or without using the letter e, by all means. You could also go the other direction. See how many prepositions in a row you can end a sentence with:

A child calls down for a bedtime story and his mother brings up a book about Australia, a topic of little interest to the child. He turns up his nose and asks, "what did you bring me that book I didn't want to be read to out of about Down Under up for?"

(See the collection of gems from whence that came. And, yes, type-of-person-reading-this-far, “Down Under” is a proper noun in that context.)

Ok, next, how about splitting infinitives? This is just faux Latin bull shorts. Say I want to say, as I did the other day, that I'm all about commitment devices as a tool to not procrastinate. Again, if I want to take up the constrained writing challenge, I could rewrite that sentence to avoid the split infinitive. But if your solution is to just force it by scooching the “to” over to the verb then you've subtly spoiled the meaning. Commitment devices are a tool for Not Procrastinating. If I say I'm all about them as a tool not to procrastinate then the reader could hear that as my commitment device obsession being about their toolishness rather than about procrastination. I mean, in this case you'd figure it out from context and cases where you wouldn't would probably feel contrived. But the split-infinitive version is still better. (Wikipedia lists other classics like “to boldly go” and “to more than double”.)

Speaking of faux Latin, what about respecting Latin plurals like “data” and “media”? This one pains me not to respect the Latin. But I have to confess that “the data show XYZ” sounds a little pretentious compared to treating “data” as a mass noun and saying “the data shows”. (Or, the former used to sound wrong? I'm actually gradually getting used to “the data show...”.) If you're talking about, say, different mediums that sound travels through, you're liable to confuse people if you insist on saying “different media", even though I take a little psychic damage from not doing so in that case. “Multiple equilibria” instead of “multiple equilibriums” is a hill I will die on though.

Ok, let's do some easy ones. The old saw about not starting sentences with conjunctions? I guess that's a useful heuristic in elementary school. And contractions in formal writing? Fine, next. Same for sentence fragments. Obviously.

If you're one of the people who were never raised in a sewer, do you need to know whether it's “less” or it's “fewer”? (To be clear, I agree with Weird Al on most of the word crimes he lists.) I guess I advocate knowing the difference, but is it worth rephrasing “you have to have less than or equal to 14 items to use the express lane”? No, it is not.

Singular they has been discussed to death; no need to belabor that one. And the prohibition on contractions in formal writing is pretty common-sensically poppycock.

I could write a whole 'nother post about commas but for now I'll just say that I'm a fan of elocutionary punctuation, where the right answer is whatever makes the prose sound best when read out loud.

In conclusion, here's my top-ten list of wrong and dumb grammar rules:

  1. Don't end a sentence with a preposition
    [No, it's fine.]
  2. Distinguish “who” vs "whom” the same way you distinguish “I” vs “me” or “she” vs “her” 
    [Just use “who” everywhere. Or if it sounds better to you, keep “whom” around when following “to” or “for”. “Whomever” can just die.]
  3. Don't split infinitives 
    [Also fine.]
  4. Respect Latin plurals
    [I have no consistent verdict here and don't actually want to call this one “dumb and wrong”, but at least don't be slavish about it.]
  5. Don't start sentences with conjunctions 
    [Generally fine.]
  6. Don't write sentence fragments
    [Don't write them accidentally. But as a rhetorical device? Like this? That's fine.]
  7. Never use “less” for count nouns
    [Sometimes “fewer” is just too awkward.]
  8. No singular they
    [Obviously fine, always has been. Other than a failed attempt to kill it for a couple hundred years starting in 1750.]
  9. No contractions in formal writing
    [Even “ain't” is ok if you know what you're doing.]
  10. (Most rules involving commas)
    [Just go by ear.]
  1. ^

    Spoiler: it tolls for thee. Womp womp.




My burnout journey

November 20, 2025 - 09:58
Published on November 20, 2025 6:58 AM GMT

(This was a meetup talk/discussion, here for people who would have wanted to attend but couldn't and want to read the intro to ask follow-up questions.)

As a software developer, I've been on and off many projects over the years - different role on the same project, another project in the same company, and sometimes it's time to switch companies.

It's a balance of what gives and takes my mental energy at work..

  • + focus on real user problems
  • - starting from a solution, finding a problem that fits
  • + clear steps forward
  • + defined-enough scope before implementation
  • - multiple stakeholders who all need to review code/design/test/product fit and one of them doesn't have time, but comes back few days later to reopen an issue
  • - merge conflicts between my own 6 PRs, because someone else renamed a folder and I modified files with multiple features that all wait for reviews
  • + quick review and merge, next iteration in next ticket
  • + boring technology that solves a problem
  • + new technology that solves a problem
  • - new technology because someone was a hype junkie
  • + simple to start, infinite to master
  • - learning cliff to start, then boring while impossible to remember random config
  • - interviews with 5 rounds / whole-day project / recording myself how I explain my thinking into the void
  • + just hopping on a project with ex-colleagues / friends
  • - back pain from long sitting
  • + daily walks
  • - "almost" working tools
  • - LLM chatbot "agents"
  • + community-driven events, conference volunteering, ...

..and about what I can do to keep the balance positive in a given environment. It's mostly about prevention: I have never collapsed from burnout yet, but many times it's been less fun than I would have wanted.

Questions?




One King Upon The Chessboard

November 20, 2025 - 09:06
Published on November 20, 2025 6:06 AM GMT

I.

There's a lesson I first read in Harry Potter and the Methods of Rationality, whose importance I did not realize for a decade.

There can only be one king upon a chessboard, Harry Potter, only one piece that you will sacrifice any other piece to save.

- Harry Potter and the Methods of Rationality, Chapter 82

Spoilers for Methods of Rationality abound. Beware.

II.

The line "one king upon [the] chessboard" appears twice in HPMOR. The first time, it's in a conversation that is about tradeoffs.

"Are you familiar with the economic concept of 'replacement value'?" The words were spilling from Harry's lips almost faster than he could consider them. "Hermione's replacement value is infinite! There's nowhere I can go to buy another one!"

Now you're just talking mathematical nonsense, said Slytherin. Ravenclaw, back me up here?

"Is Minerva's life also of infinite worth?" the old wizard said harshly. "Would you sacrifice Minerva to save Hermione?"

"Yes and yes," Harry snapped. "That's part of Professor McGonagall's job and she knows it."

"Then Minerva's value is not infinite," said the old wizard, "for all that she is loved. There can only be one king upon a chessboard, Harry Potter, only one piece that you will sacrifice any other piece to save. And Hermione Granger is not that piece. Make no mistake, Harry Potter, this day you may well have lost your war."

Harry just paid a lot of money in order to protect a friend of his. 

On my first reading I thought the mistake being made was about making trades, haggling over price, and about incentivizing people to blackmail you. Which is the surface text involved here, and the situation in this scene! Dumbledore has said something which sounds deep and wise and yet doesn't quite connect usefully to the situation, which is this bit about chess and sacrificing. Notice, though, that Dumbledore doesn't actually say what the king is.

The second time we get that line, it's later in the story. Dumbledore's final message, read posthumously[1], is talking about the end of the world. Dumbledore has used a method of seeing the future to learn:

I listened to every prophecy that had ever been recorded.

And so I learned that my troubles were far worse than Voldemort.

From certain seers and diviners have come an increasing chorus of foretellings that this world is doomed to destruction.

And you, Harry James Potter-Evans-Verres, are one of those foretold to destroy it.

By rights I should have ended your line of possibility, stopped you from ever being born, as I did my best to end all the other possibilities I discovered on that day of terrible awakening.

Yet in your case, Harry, and in your case alone, the prophecies of your apocalypse have loopholes, though those loopholes be ever so slight.

Always 'he will end the world', not 'he will end life'.

Even when it was said that you would tear apart the very stars in heaven, it was not said that you would tear apart the people.

After talking about the actions Dumbledore took based on the prophecies, the message continues.

There can only be one king upon the chessboard.

There can only be one piece whose value is beyond price.

That piece is not the world, it is the world's peoples, wizard and Muggle alike, goblins and house-elves and all.

While survives any remnant of our kind, that piece is yet in play, though the stars should die in heaven.

And if that piece be lost, the game ends.

Know the value of all your other pieces, and play to win.

(emphasis mine)

That's the key insight I didn't realize; Dumbledore is done being cryptic and trying to give Harry the best practical advice that he can, and his last words are about when the game actually ends. Dumbledore is saying this after he himself has been removed from the board and can no longer take actions.

The most important piece, the one with infinite value, is the one that means you still get to take a turn.

III.

When faced with tradeoffs, you should value the ones that mean you still get to make more trades. Never put that on the line.

It's like the Kelly criterion says, you never bet your entire bankroll. Once you lose a bet like that, you don't get to make any more bets, and there's no way to recover. 
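To make that concrete, here is a minimal sketch (the 60/40 coin, even-money payout, and round count below are made-up illustration numbers, not anything from Kelly or from this post): staking your whole bankroll hits zero the first time you lose and you're out of the game, while staking the Kelly fraction keeps you at the table.

```python
import random

# A favorable, even-money bet: win with probability p, payout b times the stake.
p, b = 0.6, 1.0
kelly = p - (1 - p) / b  # Kelly fraction; 0.2 for these made-up numbers

def simulate(fraction, rounds=1000, bankroll=1.0):
    """Repeatedly stake `fraction` of the current bankroll on the same bet."""
    for _ in range(rounds):
        stake = fraction * bankroll
        bankroll += stake * b if random.random() < p else -stake
        if bankroll <= 0:
            return 0.0  # busted: no more bets, no way to recover
    return bankroll

random.seed(0)
print("Betting everything each round:", simulate(1.0))    # zero after the first loss
print("Betting the Kelly fraction:  ", simulate(kelly))   # grows noisily, never busts
```

The exact numbers don't matter; the point is that any strategy with a real chance of staking everything eventually loses the ability to keep playing at all.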

Games are generally designed so that as long as you're still sitting around the table, you still get to affect the outcome. Games with many players might have someone lose and be eliminated, and then that player sometimes spectates and sometimes wanders off to finish the last of the pizza or goes home for the night.

(I want to note another reading where this is supposed to be talking about the people of Earth being the objective. As long as we live, we're still racking up points.)

But once there's nobody left on your team who can take actions, you're locked into whatever the current state of the board happens to be. Even a very good position in chess, one where you've got both your rooks and both your bishops and the other fellow only has three pawns and a knight total, is not going to be a victory for you if you stop being allowed to make moves and they are. 

Sometimes in a team game it can make sense to trade away your individual actions if you help your team and trust them to carry you all to victory. This is essentially what Dumbledore did, sacrificing himself for Harry. But you really have to have a team mentality, and even then you're not sitting at the table able to steer.

IV.

When I'm making decisions and tradeoffs these days, one of my prime considerations is whether a move means I'm no longer able to influence a situation.

(To be clear, I'm not generally faced with tradeoffs where I wouldn't be able to take any actions anywhere - or at least, not more than most of us do deciding whether to risk the chance of heart attack and eat that next slice of bacon.)

That can come in many forms.

In a game of Magic: The Gathering, I need to keep cards in my hand and creatures on the field. The game mostly assures me that I'll get to keep having turns (unless the other person does something very odd and specific, like chaining together Time Walks) but those turns might not let me do anything but draw and pass. When contemplating tradeoffs, I lean towards keeping at least one option open in case a better tradeoff comes along.

When helping manage a conflict, if everyone involved stops listening to me I stop being able to influence things. If they stop telling me things, then I lose the ability to even know what's going on.  This leads me to prioritize leaving doors open at least a crack.

When I'm messing with a computer system, I'm careful to keep track of what needs to be operational in order for me to keep changing things. Once upon a time I was remoted into a system, sent a shutdown command thinking I'd just turn it off and on again, and then realized I had turned it off without having a way to turn the thing back on again. (Server farms have answers for this, but a desktop sitting in my apartment while I'm on my laptop hundreds of miles away does not.) Anyone who has ever messed up something low level and important in their operating system has had that sinking feeling where you're working through what you need to do to fix the thing that lets you type words into a file.

When I'm running operations for an event, it is tempting sometimes to leave HQ or worse, the venue. Doing so has the danger that, even though it's temporary, I'm no longer able to change course as easily. Even if I am the best suited person to handle something, I still often delegate to keep myself able to react. And this is a very temporary situation!

And of course, you should keep your eye on the larger game.

V.

"He had decided to live forever or die in the attempt, and his only mission each time he went up was to come down alive."

Catch-22, Joseph Heller

If I die, I stop getting to make decisions. 

For me, my decisions no longer mattering and where I can't even theoretically change things is pretty scary. I don't like that feeling. I will sometimes make what look like strange lopsided tradeoffs just to stay in the game, and when I'm not doing that it's generally after staring at the board and making my last best move to try and lock things into the best course I can before turning away from that game.

Organizationally, that means I spend a lot of time shoring up and empowering adjacent organizations. I don't need my team to have all the keys. I'm cheerful about the usual capitalist competition and when I was in for-profit companies I was on board with outselling and out-earning everyone else, but I also wrote up a lot of what I was working on and swapped general tips with engineers at other companies.[2] The king on the chessboard for the company is avoiding going bankrupt. (Though for the people involved of course it's their own careers.)

Professionally, that means planning for succession. If I get hit by a truck tomorrow, I always want my role to be reassigned, thinking of my job description as the active thing that will still try and make decisions. 

On a personal level, I really, really do not want to die. I don't really trust other people to make the actions I would make.

Dumbledore has more fellow feeling for other people than I do, I think. It probably helps that he's gotten to nurture Harry, that he leaves behind a legacy of other people he trusts to try and make the world better by his lights. 

Consider if there's some feature of the landscape where, if you lose it, you no longer get to take actions.

  1. ^

     For all intents and purposes anyway.

  2. ^

    Nothing NDA'd, but "yeah, lookups are expensive, have you tried this workaround?" or "the framework's been doing well for me, if you haven't tried it maybe give it a shot."




Evrart Claire: A Case Study in Anti-Epistemology

November 20, 2025 - 08:49
Published on November 20, 2025 5:49 AM GMT

This man nearly tricked me.

Evrart Claire, leader of the Dockworkers Union in Martinaise from the videogame Disco Elysium.

I acknowledge that he is a fictional character, but he nearly tricked me all the same.

Evrart is the leader of the 2,000-person Workers' Union of Martinaise; they are on strike as part of a conflict with Wild Pines, the multi-billion dollar logistics company that employs the dockworkers. During the strike, a lynching has occurred, and you play the homicide detective leading the investigation.

In my conversations with Evrart Claire during my playthrough of the game, I had a strong hunch that I was being played by Evrart, but no legible proof, leaving me feeling helpless to defend myself in a way I have felt many times before. I'm dissatisfied and somewhat alarmed by this, so in this post I'm going to look into his dialogue, figure out how he did that, and figure out how I can be better defended against it for the rest of my life.

(Spoiler warning in this footnote.[1])

What's Wrong with Evrart Claire?

I have written about the rationalist vice of assuming good faith, which I think I have largely grown out of since my youth, but am still subject to; Evrart Claire is squarely taking advantage of it. He constantly presents himself well, and all his bad behavior is covered in plausible deniability. You can never quite prove him to be a liar and a cheat; he's always got excuses and counter-narratives, and he insists he's the good guy in all of this.

On LessWrong we sometimes refer to this as though it were a whole school of practice, namely “anti-epistemology”: the art of obfuscating the truth and misdirecting people, as though it had rigorous rules for doing so that are the inversion of good epistemic practices and of the rules for systematically forming accurate beliefs. Claire is the character in Disco Elysium who wields anti-epistemology with the most skill and effectiveness, and he is also one of the most powerful characters because of it.

Claire's character is not supposed to be subtle. The most violent and racist hooligans in the town work for him; he constantly asks you to do sketchy stuff for him before he'll give you information pertinent to your investigation; he keeps trying to give you money without saying why; and he's rumored to have made his rival union leader disappear before the election. These are all behaviors that are much more probable under the hypothesis that his conduct is corrupt and his character rotten.

Yet what stands out to me about his trickery is his communication patterns, which make it harder for you (or anyone else) to learn the truth or understand what is going on.

There are many tricks that Evrart uses to gain power and make people behave badly. There's a common trick called the foot-in-the-door technique, where you get someone to agree to a small ask on the way to getting them to agree to a big ask. Evrart uses this, first asking you to do something slightly shady for him, and then increasingly asking you to do shadier things. There are certainly things like that going on that one could list, but my focus here is going to be on questions of information flow and how he causes the people around him to have an inaccurate map of the territory.

Together, let's read some of the dialogue between you and Evrart. In future posts I'll summarize his patterns, clarify his heuristics, and outline some possible antidotes to his trickery. But in this essay I simply want to get to grips with what he says and does.

This is where Evrart Claire sits. All of his dialogue below is spoken from this chair.

Maintaining Plausible Deniability

Here’s one of your character’s first interactions with him, when you intend to interview him about the murder.

"I want to talk about the hanging."

You

"Oh, of course. That's your main thing here. That's *why* you're in Martinaise." He nods. "I know everything that goes on around here and I would *love* to discuss it with you."

Evrart Claire

"I mean, it's no secret that the lynching is connected to the strike -- so much to talk about! Honestly, it's been weighing on me so heavily. I understand -- you need to *interview* me..."

Evrart Claire

"I sense there's a *but*."

You

 "...but there's a *thing* that's been keeping me up at night. I *want* to talk about the hanging. I mean... if we could just calmly talk, exchange information, we could blow this thing wide open!"

Evrart Claire

"Yes, let's blow it open."

You

"But I *can't* think straight with this thing weighing on me..." He slaps himself on the forehead. "You're a police officer, aren't you? I have a crazy idea. You guys are basically door-opening machines. Incredibly talented at opening doors."

Evrart Claire

"Kim, is that true? Are we door-opening machines?"

You

"I'm not sure I understand." He looks to the Union boss. "If you're asking us to break down someone's door, it's not going to happen."

Kim Kitsuragi[2]

"Come now. I just need you to go open a *little* door for me -- and leave it unlocked. A simple thing. Absolutely nothing shady about it."

Evrart Claire

“Why don’t you just open it yourself?”

You

“Harry, I’m a very busy man and, more importantly, I don’t have that extraordinary physique you do.” He slams his fists together. “You look like you could run around all day!”

Evrart Claire

“You want to send someone a message that the police are working for you.”

Kim Kitsuragi

“I repeat, I’m a very, *very* busy man, Mr. Kitsuragi, and therefore I must occasionally enlist… outside help.”[3] He turns back to you. “So what will it be, Harry?”

Evrart Claire

“Whose door is it?”

You

“Oh, no one’s. It’s just a weasel. A weasel lives there. Nothing for you to worry about.”[4]

Evrart Claire

“What do you mean by a ‘weasel’?”

You

 “A loud blabbering weasel. When weasels feel no one is watching, they start acting *foolishly…*” He removes his glasses and rubs his nose.

Evrart Claire

“I bet you don’t even know anything about the hanging.”

You

“Harry, my dear friend.” He sinks deeper into the chair. “I am what people call a *local big wig*. I know everything that goes on in Martinaise.”

Evrart Claire

“Damnit, fine, I’ll look into it, we need to talk about that murder.” (Accept the task.)

You

“Fantastic, my friend! Just let me know when it’s done and we can take our friendship to the next level.” He flicks his fingers.

Evrart Claire

There's a lot to unpack here already. I hope you agree with my basic read of this, which is that he's asking you to do something (slightly shady) for him before he'll help you. There's a tit-for-tat element here. But what's he doing epistemically? I expect that's also clear to many, but I will nonetheless spell it out.

  1. First, Evrart carefully presents a plausible innocent narrative about his actions and motives. 

The story he presents is that he would like to help you with your case, but he is too emotional and distracted by something else, and you're in a position to help him with that. This isn't my read of the actual agreement that is taking place, which is a more corrupt tit-for-tat, but it is an account of things that you can tell third parties that doesn't sound bad.

  2. Second, Evrart refuses to acknowledge any alternative illicit narrative.

When you ask “Why don’t you just open [this door] yourself?” he replies “Harry, I’m a very busy man and, more importantly, I don’t have that extraordinary physique you do.” This gives you a logically consistent reason, but not a likely one: he has over 2,000 people in his union and many staff who report to him who could do this for him.

When Kim goes on to explicitly raise the hypothesis that Evrart wants this in order to signal that the police work for him, he doesn't even address it; he simply repeats his improbable narrative: “I repeat, I’m a very, *very* busy man, Mr. Kitsuragi, and therefore I must occasionally enlist… outside help.”

Evrart is offering you tit-for-tat, where you help him out and then he'll help you out. But, insofar as this is happening, he does it in a way that is very illegible to other people and never removes the possibility of doubt about the trade you're making. He never says what is being agreed, it is not written anywhere, and everything he says is logically consistent with his innocent narrative. Some of the standard tools people use to check what agreements he made (e.g. "What did he explicitly say?" and "Did he give you a non-corrupt reason for why he wanted it?") have been carefully routed around.

All Narrative, No Facts, Avoids Details

Next, here’s some dialogue between you and Joyce Messier, a powerful lady who represents the board of Wild Pines.

"One more thing -- you said something *happened* in the elections?"

You

"I'm glad you asked. There was a woman -- the previous forewoman of the Union. She disappeared."

Joyce Messier

"Disappeared?"

Kim Kitsuragi

"Yes. On the last day of the local chapter elections her daughter phoned in and said she wasn't running anymore -- or coming to work. Ever. End of story."

Joyce Messier

"Eerie."

You

"Downright *haunting* if you ask me. The Wild Pines suspected foul play, but what could they do? It was a Union matter."

Joyce Messier

"The point of the presentation is -- these kinds of things *happen* around the Claires. Watch out when you're dealing with him."

Joyce Messier

"Thank you for your concern, ma'am. We'll be just fine."

Kim Kitsuragi

Now here's a different scene, where you are talking to Evrart about the same matter.

"Joyce said the previous Union leader vanished under suspicious circumstances."

You

"*Vanished*?! Harry, the woman left her casserole in the oven and couldn't make it here in time for the voting."

Evrart Claire

"'Oh, did I leave my casserole on? Better go home and check. The election can wait!'" The man frowns, disapprovingly. "When she got back the whole thing was over."

Evrart Claire

Wait... there was no mention of a casserole from Joyce.

Drama[5]

"Funny, Joyce didn't mention any casserole."

You

"Harry, Harry, Harry!" He flicks his fingers. "Do not fixate on this little matter. Maybe it was a rabbit stew... or a hair dryer, or an iron. The point is, her heart wasn't in it. Mine *was*."

Evrart Claire

That much is true. His heart *truly* is in it. Though you wouldn't think so by looking at him.

Inland Empire

This particular brand of humour he has... it makes for a fine distraction.

Conceptualization

If it's spilled blood you're looking for then there certainly isn't any in his expression, or demeanor now.

Empathy

First, he gives you essentially no evidence, only a contrary narrative.

Second, he dismisses getting the details right.

Both of these seem to me like tools of anti-epistemology, and the latter is an especially strong one.

Let’s look at another example of avoiding details. At this point in the game you've been on a side-quest looking into some rumors about something shady happening in his organization. He implicitly asked you to do one thing, but you've done the opposite.

“The shady brew. You told me to make it even shadier. I didn’t. It had alcohol in it. Now there’s no alcohol.”

You

“Did I? Well done then, Harry. I like not knowing about it and I’m sure you made the right call. I spend the whole day delegating tasks, and it’s a great relief to see people taking initiative.”

Evrart Claire

“I don’t even want to know what all of that means — brew, shady, alcohol, turned off. I’m gonna let the world *surprise* me.”

Evrart Claire

Evrart seems to encourage the people working for him not to let him know what's going on. After you explain that you did something that is probably against his interests, he emphasizes that the details are below his pay grade, that he doesn't know about them, and that he trusts you to sort it out.

Avoiding knowing what's going on looks to me like another trick of anti-epistemology, used to distance himself from crimes.

  1. It makes it easier for him to perform “being forthcoming about all he knows”. It regularly allows him to appear honest and open about giving you all the information that he has, because he genuinely doesn’t know what happened.
  2. Avoiding looking into shady business, and encouraging your people not to tell you about it, is also a great way to let people get away with shady business that helps you out, without you being directly culpable and without you being able to give it up. When the people who do the dirty work know the boss doesn't want to hear about it and isn't going to come asking questions, they can get on with their wrongdoing in peace.

Emphasize Loyalty and Trust

In the following bit of dialogue, you’ve just told Evrart about a key piece of evidence that contradicts what he says his men told him. (I’ve redacted the detail to avoid a major spoiler.)

“How odd.” The man shrugs. “I don’t know what to say, lieutenant. They told me [X]. [X] is what I saw when I took a look into that yard…”

Evrart Claire

It’s impossible to say if he’s telling the truth, sire.

Drama

“What I *do* know is — the case is in safe hands. If anyone can get to the bottom of this [discrepancy], it’s my two little policemen. Godspeed, policemen!”

Evrart Claire

As before, he avoids the details and probably does not know them. I think it’s especially interesting that, when he is on the back foot, he does not address the possibility that he may have knowingly misled you or be responsible for the situation in any way, and instead (performatively, in my opinion) reinforces his positive relationship with you.

Here's a different scene where he does that again.

"Now please, let's get back to the good stuff, the police stuff, Harry! I just see myself as one of you guys. Think of me as a sergeant or something." He smiles broadly. "Let's *crack* this, Harry."

Evrart Claire

[Presents a new piece of evidence, something that suggests Evrart wasn’t entirely forthcoming with you][6]

You

"[repeats it]?!" He grabs his head with both hands. "You guys are just light-years ahead of me."

Evrart Claire

"I have *so much* confidence in the ability of your organization. I'm relieved you're doing this and leaving me to do what *I* do best -- helping people. With the power of *politics*."

Evrart Claire

It seems to me that Evrart Claire treats hypotheses about his own bad behavior or untrustworthiness the way a Popperian scientist treats theories other than their favorite theory. Evrart's theory is that he's a good person with good intentions, and he'll keep holding that it's the only theory that fits the facts unless you can falsify that narrative, not merely provide Bayesian evidence against it. He will act and speak in accordance with that narrative and reprimand you for considering alternatives in the absence of proof – a kind of proof that is very rare to get in the social world (especially in the presence of active anti-epistemology and restricted information flows).

A hypothesis many would naturally consider in the above quote is that Evrart Claire is hiding information from you. But Evrart does not acknowledge this theory and will resist it unless you have proof. The truth is that I cannot prove this is what's happening. I am confident that bad behavior on his part is a natural hypothesis here, but I cannot prove it, and I know this means he will not even allow it to be part of the conversation, which is frustrating. I think his avoidance of such a hypothesis is adversarial, and saying instead that he has complete confidence in you to be doing good work is an attempt to put himself on your side and deflect any further questioning.

(I detest this.)

Here’s one more time he reinforces his trust and good relationship with you, with an internal note from your inner encyclopedia that I find amusing.

"Ah yes, your side-investigation! Thank you." He adjusts his glasses. "You've got some spirit, clearing up phony drug accusations alongside this murder. I'll talk to the mayor and see if I can get you the key to the city, Harry. Now let's talk real business."

Evrart Claire

Actually, Revachol doesn't have a mayor…

Encyclopedia[7]

Evrart Claire making you sit in a small and uncomfortable chair.

Inverting Basic Principles

If you accept Evrart’s request to open someone’s door, you are later given the opportunity to lie to him about whether you did it. Here is how that little interaction goes down.

(Lie.) “I opened the door to your weasel’s den.”

You

“Are you shitting me Harry? Did you not really open the door and are now just telling me you did?” His lively eyes are mapping your face. “You’re a wild one, Harry!”

Evrart Claire

“You’re right, I was just testing you.”

You

“Of course, Harry!” He exclaims. “What are friends for if not for measuring the falsehood one can pass in the disguise of truth.”

Evrart Claire

A slight caricature of me would be more likely to say “What are friends for if not for getting true information and feedback from them?” I myself have been known to send out surveys about myself to dozens of my friends in order to get honest feedback! So I find this amusing as a direct reversal of good epistemic norms in friendship.

And look how he smooths over your relationship immediately afterwards! It seems to me like he doesn’t actually feel personally hurt by you lying to him, and that much if not all of the loyal friendship he speaks of between the two of you is utterly performative.

Evrart Claire possibly blackmailing you to do some work for him.

Performing Emotions

Here’s the last stretch of dialogue in this essay.

"I met Joyce, the company representative."

You

"Oh, that's very nice. I haven't gotten around to her yet -- I'm very, very busy, you see." He adjusts a button on his sleeve. "I hope you're getting along."

Evrart Claire

"One thing I wanna make very clear, Harry, is that this is not some kind of Union *versus* Corporation situation. Everyone is just pals here."

Evrart Claire

"Just pals?"

You

"Yes, we're all trying to do what's best for Martinaise." His smile widens. "Don't feel like you shouldn't work with her just because you and I are such good friends. I'm not a jealous guy."

Evrart Claire

Whoa… that’s so nice of him. Suspiciously nice…

SUGGESTION [Medium: Success]

"Are you sure? I find it a little odd.”

You

“I’m just a nice guy, Harry. I wouldn’t be where I am now if I wasn’t nice." He slams his fist into his hand. “Politics is all about *emotions*, and I want you to have positive emotions when you think of me.”

Evrart Claire

“Okay, let’s talk about something else.” (Conclude.)

You

“Of course, Harry. Let me just assure you one more time — it’s perfectly okay to share anything we discuss here with this… *Joyce*. This is a completely transparent organization.”

“I have no interest in what she is doing, but I myself have *nothing* to hide. Your business is your business and I respect your privacy. Just remember, none of this…” He makes an all-encompassing gesture. “... is secret.”

“Tell her about all of it. My brother’s picture, my singing swordfish clock.” He looks around. “Tell her how overweight I am and how I’m helping you find your lost gun. Tell her about everything — Evrart doesn’t mind.”

Evrart Claire

“I’m told the Union is involved in the local drug trade.”

You

"What?!" He smacks his forehead, completely flabbergasted. "Harry... how could you say that to me? You know I appreciate a joke as much as any jolly fat guy, but I can't take *slander*. Are you actually investigating this?"

Evrart Claire

The reaction appears to be sincere, but... it's impossible to tell with this guy, honestly.

Drama

"I am."

You

“You've hurt me, Harry -- me! A friend!" The man rubs his temple and closes his eyes, in pain. "But you know what?" He perks up.

Evrart Claire

And gets over it in two seconds? Seems like it didn't really hurt him.

Pain Threshold

"I trust you, like I trust all my friends. And I know you'll never talk to me about this again, because you don't want to *wound* me. So do what you want -- and let's change the subject."

Evrart Claire

He's hiding his real reaction beneath courtesy.

Rhetoric

 "Thank you for understanding," the lieutenant looks him in the eye. "We will continue to do what we must."

Kim Kitsuragi

"You too, lieutenant -- heh!" He chuckles, suddenly. "You know, I like you, but you never were my favourite. I'm a Harry-guy. I'm Team Harry."

Evrart Claire

"None taken," the lieutenant quickly replies and then turns to you: "Did we have anything else to do here, Harry?"

Kim Kitsuragi

First, I will point to Evrart's principle of encouraging “positive emotions” when you think of him, which seems to entirely disregard the idea that emotions should track reality, and makes me think of his character as almost wholly performative. Then, once again, he leans on your loyal friendship in order to get out of discussing the matter.

But it was the section of dialogue about transparency that befuddled me the longest of anything he says – because it isn't true. It later turns out he is absolutely in a massive conflict with Joyce, so he's lying to say otherwise. And he definitely doesn't want her to know his full plans, so it's false for him to say the organization is totally transparent.

Why would he say this? What does he have to gain by emphasizing something so far away from reality? Wouldn't he be found out by something so blatant? How could this be a reasonable thing to say? Is there a perspective from which he is being transparent?

I really was trying to make sense of what perspective would produce these words, but eventually it struck me: he's saying it because it would be good for him if you believed it, and he thinks he can get away with saying it without you catching him out.

That is the sort of adversarial epistemic optimization happening here. If a bunch of sentences would be good for him to say, he will say them. These are not being checked against any world model for something resembling accuracy. There is no good-faith perspective from which this is a reasonable thing for him to say. 

I was being too charitable. He is blatantly and forthrightly lying.

Evrart Claire damaging your morale.

So what are his key tools of trickery? I'll give my answer to that question in the next post...

  1. ^

    Spoiler Warning

    There are over 80 characters and 1,000,000 words of dialogue in Disco Elysium. I'm going to be quoting a lot of dialogue with one character in particular. I don't think reading it spoils your experience of the game (it's such a rich world, and there are tons of important plot points I don't give away), but it does contain 5-10 minor plot points that are surprises or mysteries.

    It is one of the top 10 games I have ever played, and I have successfully encouraged many of my friends to play it. If you want to play the game unspoiled but still want to read about anti-epistemology, you can just wait and read the subsequent posts in this sequence, which focus on the principles rather than these characters. Though I think it really helps to have the dialogue and details of this particular character in mind, so come back after you've played it!

  2. ^

    Kim Kitsuragi is your cop partner.

  3. ^

    Note here that Claire doesn't answer the question.

  4. ^

    Once again, he doesn't answer the question.

  5. ^

    This is an instance of one of your mind's attributes talking to you internally. "Drama" is good at helping you lie and detect others' lies.

  6. ^

    I have removed the text as it unnecessarily spoils an interesting plot point that isn't relevant to this post.

  7. ^

    Yes, "Encyclopedia" is another of your mind's internal attributes, that works to "Call upon all your knowledge. Produce fascinating trivia."



Discuss

What Is The Basin Of Convergence For Kelly Betting?

November 20, 2025 - 07:36
Published on November 20, 2025 4:36 AM GMT

The basic rough argument for Kelly betting goes something like this.

First, assume we’re making a sequence of T independent bets, one-after-another, with multiplicative returns (similar to e.g. financial markets). We choose how much money to put on which bets at each timestep.

Returns multiply, so log returns add. And they’re independent at each timestep, so the total log return over T timesteps is a sum of T independent random variables. “Sum of T independent random variables” makes us want to invoke the Central Limit Theorem, so let’s assume whatever other conditions we need in order to do that. (There are multiple options for the other conditions.) So: total log return will be normally distributed for large T, with mean equal to the sum of expected log return at each timestep.

Then the key question is: for any given utility function, will it be dominated by the typical/modal/median return, or will it be dominated by the tails? For instance, the utility function u(W) = W is dominated by the upper tail: agents maximizing that utility function will happily accept a probability-approaching-1 of zero wealth, in exchange for an exponentially tiny chance of exponentially huge returns. On the other end of the spectrum, a utility function which just wants wealth above some relatively-low threshold (i.e. utility = 0 below threshold, utility = 1 above) will be dominated by the lower tail: agents maximizing that utility function mostly care about minimizing the increasingly-tiny probability of a total return below the threshold, and will pass up exponentially larger returns in order to avoid that downside.

But in the middle, it seems like there should be a whole class of utility functions which are dominated by the typical/modal/median return. And what the hand-wavy central limit argument says is that an agent with any of the utility functions in that class will, for sufficiently large T, just maximize expected log return at each timestep - i.e. Kelly bet. That class of utility functions is the “basin of convergence” for Kelly betting - i.e. the class of utility functions whose asymptotic behavior converges to Kelly betting, for long time horizons (i.e. large T) when making a sequence of independent bets with multiplicative returns.
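To make that concrete, here's a rough Monte Carlo sketch (Python, with made-up parameters): an even-money bet won with probability p, repeated T times, betting a fixed fraction f of wealth each round. For this bet the Kelly fraction is f* = 2p - 1. In the simulation, the log-utility agent's empirical optimum lands near Kelly, the threshold-utility agent prefers a somewhat more conservative fraction at this horizon, and the linear-utility agent is driven entirely by the upper tail.

```python
import numpy as np

# Rough Monte Carlo sketch of the argument above. All parameters are made up
# for illustration: an even-money bet won with probability p, repeated T times,
# betting a fixed fraction f of wealth each round. Kelly says f* = 2p - 1.
rng = np.random.default_rng(0)
p, T, trials = 0.6, 100, 20_000
fractions = np.linspace(0.05, 0.9, 18)

def terminal_wealth(f):
    wins = rng.random((trials, T)) < p
    # multiplicative returns: wealth *= (1 + f) on a win, (1 - f) on a loss
    return np.exp(np.where(wins, np.log1p(f), np.log1p(-f)).sum(axis=1))

# u(W) = W is dominated by the upper tail: E[W_T] = (1 + f(2p-1))^T grows in f,
# so the linear-utility agent bets as much as it is allowed to. (A Monte Carlo
# estimate barely sees the rare huge-payoff samples carrying that expectation.)
print(f"linear utility: exact E[W_T] = (1 + {2*p-1:.1f} f)^{T}, maximized at the largest allowed f")

# Log utility vs. a threshold utility ("just end with W_T > 2"), estimated empirically:
for name, u in [("log utility  ", np.log),
                ("wants W_T > 2", lambda w: (w > 2.0).astype(float))]:
    best_f = max(fractions, key=lambda f: u(terminal_wealth(f)).mean())
    print(f"{name}: empirical best f ~ {best_f:.2f}   (Kelly f* = {2*p - 1:.2f})")
```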

Thus the question of this post: what’s the basin of convergence for Kelly betting?

I don't know the answer to that question, despite having poked at it a little. The rest of this post will contain some more quick-and-dirty thoughts on the topic, but my main hope is that somebody else will be inspired to answer the question.

Mathematical Setup: What Precisely Are We Saying?

Suppose that, at each timestep, our agent invests their portfolio into some assets. The proportion invested in asset $i$ at time $t$ is $c_{ti}$, and the return of asset $i$ between $t$ and $t+1$ is $R_{ti}$. Then the total wealth after $T$ timesteps is

$$W_T = W_0 \, e^{\sum_{t=0}^{T-1} \ln\left(\sum_i c_{ti} R_{ti}\right)}$$

The agent has some utility function $u(W_T)$, and chooses $c_t$ at each timestep to maximize $E[u(W_T)]$.

For the easiest version of the problem which still captures the bulk of the intent, let's assume that returns $R_{ti}$ are independent and identically distributed (IID) over time.
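As a sanity check on the notation, here's a direct translation of this setup into a tiny simulator (Python; the two assets, their return distributions, and the fixed 50/50 allocation are hypothetical, just to show the objective being estimated). An actual agent would choose $c_t$ afresh at each timestep, which is exactly the complication discussed below.

```python
import numpy as np

# Direct translation of the setup above into a tiny simulator. The two assets,
# their return distributions, and the constant 50/50 allocation are hypothetical,
# chosen only to show the objective E[u(W_T)] being estimated.
rng = np.random.default_rng(1)
T, trials, W0 = 100, 10_000, 1.0
c = np.array([0.5, 0.5])  # allocation [risk-free share, risky share]; constant here,
                          # though a real agent may re-choose c_t at every timestep

def sample_W_T():
    R_risky = np.exp(rng.normal(0.05, 0.3, size=(trials, T)))  # IID lognormal gross returns
    R_safe = np.full((trials, T), 1.01)                        # risk-free gross return
    period_return = c[0] * R_safe + c[1] * R_risky             # sum_i c_ti R_ti
    return W0 * np.exp(np.log(period_return).sum(axis=1))      # W_T from the formula above

u = np.log  # example utility
print("estimated E[u(W_T)] for this fixed allocation:", u(sample_W_T()).mean())
```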

Now, if we want to use the Central Limit Theorem, we need $\ln(\sum_i c_{ti} R_{ti})$ to be IID over time, and also to have bounded variance. Alas, both of those can definitely be false:

  • The agent can choose $c_{ti}$ using whatever information is available at time $t$, including past returns $R_{\tau i}$ for $\tau < t$.
  • Some strategies lose all wealth in finite time with nonzero probability, in which case variance will typically be infinite, because $\ln(0)$ is infinite.

Concrete examples where each of the above conditions fails:

  • Suppose the agent's utility function is binary: it just wants to end with wealth above some fixed amount $W^*$. Then typically, the agent's optimal strategy will hold riskier portfolios when wealth is low, safer portfolios as wealth gets close to $W^*$, and zero risk once wealth passes $W^*$ (think conventional wisdom on retirement savings). So, the allocation at each timestep depends on returns up to that timestep, breaking independence.
  • Suppose the agent's utility function is convex, i.e. the agent prefers (50% double wealth, 50% lose everything) over (100% wealth stays the same). Then when the agent takes a gamble with e.g. 50% chance of losing everything, it will have a 50% chance that $\ln(\sum_i c_{ti} R_{ti})$ is $-\infty$, which uh... does not make for bounded variance.

... so a central part of the challenge of proving a basin of convergence for Kelly is showing that, within the basin, problems of this sort do not break the argument (... or using some entirely different kind of argument for Kelly).
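For the second failure mode in particular, the breakdown is easy to see numerically. Here's a quick check (Python, illustrative numbers only): a bet with a 50% chance of losing everything looks fine in terms of gross returns, but its log return takes the value $-\infty$ with positive probability, so the variance the CLT argument relies on doesn't exist.

```python
import numpy as np

# Quick numerical check of the second failure mode: a strategy that risks total
# ruin each step has a per-step log return of -inf with positive probability,
# so the variance the CLT needs simply doesn't exist. (Illustrative numbers only.)
rng = np.random.default_rng(2)
n = 100_000
# per-step gross return: 50% chance of doubling, 50% chance of losing everything
gross = np.where(rng.random(n) < 0.5, 2.0, 0.0)

with np.errstate(divide="ignore"):
    log_returns = np.log(gross)  # -inf whenever gross == 0

print("mean gross return:", gross.mean())                          # ~1.0, looks innocuous
print("any -inf log returns?", bool(np.isinf(log_returns).any()))  # True
print("variance of log returns:", log_returns.var())               # nan: no finite second moment
```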

The other big part of proving a basin of convergence would presumably be to talk about the expected-utility-contribution from the tails of the distribution.



Discuss

Out-paternalizing the government (getting oxygen for my baby)

November 20, 2025 - 07:01
Published on November 20, 2025 4:01 AM GMT

This post is not medical advice, and it describes a home treatment that most people should not attempt to emulate. Considering this home treatment made sense for us specifically. My spouse has a four-year nursing degree and several years of experience working in Intensive Care Units. I've spent a non-trivial amount of time researching medical stuff.

Note the risks of DIY oxygen in this footnote.[1]

Preamble

It rankles somewhat that medications require prescriptions. Like any good libertarian[2], I feel that, regardless of what substances do to me, my right to them ends only when it causes me to punch your nose. And even then, I think the criminalization should be of nose-punching.

Granted, without restrictions, many people would make worse decisions about what to do with their brains and bodies than (historically) what government bodies would have decided for them. Notwithstanding, it's their bodies and brains: they should be theirs to do with as they wish so long as they're not hurting anyone else[3]. Who is the government to make decisions for them (and me)? "For our own good"?! Fucking paternalism.

As if they actually know better. (For what it's worth, I'm the kind of guy to advise my spouse to cut off her leg, contrary to the advice of doctors.)

You know who should get to do paternalism? Me. I'm a father. My 19-month-old child may or may not be advanced for her age, but I think it's legit for her mother and me to make some choices for her, counter to her own desires, such as puffs of her asthma inhaler, even though she resists.

 

Oxygen Quest

My child has reactive airways that close up when she gets a cold[4]. We took her to the hospital at 11 months and 17 months due to cold-induced labored breathing. That first visit was very upsetting to the little baby. I'd never seen her scream and cry like that. Just extreme distress.

On her first trip, she didn't yet know the word "no". On the second trip, her plaintive cries of "no no no no no" when a face mask was held on her face were heartbreaking. There's an emotional impact from violating such a clear, verbally asserted protest from your child. 

Following the first trip to the ED (emergency department), we acquired an oxygen saturation measurement device (not regulated!), and on one occasion between the first and second visits its reading reassured my spouse that an ED visit wasn't yet warranted[5]. Still, the time would come for a second visit.

On September 8, my daughter had another cold that had been going on for a few days (day care, am I right?). It had started to get bad several days in. At 11:52 AM, my spouse messaged:

and twenty minutes later:

this was actually false – they could do more

 

Just oxygen? Surely we can get oxygen at home. It's just oxygen. The thing we're already breathing a lot of. At the same time as my spouse was monitoring and trying to decide about going in, I began Oxygen Quest.

By 12:30 PM, I started by looking at Facebook Marketplace and Craigslist. Indeed, people sell oxygen canisters for a few hundred dollars. Promising, though tricky to get one fast. As is the case in many domains, it's 10x harder to get something tomorrow than in a week, and another 10x harder to get it in a few hours than tomorrow.

With ChatGPT's help, I was also able to figure out all the parts of the system needed: flow regulator, tubing, cannulas/masks, and whether a given canister would have a standardly compatible connector vs needing an adapter of some kind.

But by 1:00 PM, the situation was not good. 

92-94% is borderline; below 92%, seek care

I was not going to be able to procure oxygen that fast, and we decided to head into the emergency department.

It's important for this piece that my spouse was wrong that the only thing the hospital could or would do that we couldn't was provide supplemental oxygen. It's possible that if we'd had oxygen at home, that would have been sufficient to turn things around. It's also possible that it wouldn't have been. In that case I expect my spouse would have noticed via her monitoring, and we'd have ended up in the hospital anyway, just a few hours later.

While I'm not completely sure, I think it is unlikely that attempting home supplemental oxygen would have masked some worse underlying thing in our child's specific case and thereby made things much worse. Other cases could be different.

As a matter of fact, the hospital was able to do more than just oxygen. First, don't underestimate the diagnostic ability of pediatric doctors who do this every day, even if you have some medical training. Second, the treatment they provided, and which worked wonders, was to administer a steroid (plus albuterol, a bronchodilator we did have) via a nebulizer, a device more effective at delivering the drug deep into the lungs than a typical inhaler. It really turned things right around, and the hospital didn't even do oxygen on this occasion.

Even once we'd decided to make the hospital trip, it still seemed useful to have oxygen at home. Having home oxygen might be the difference between a costly and unpleasant overnight stay and going home sooner and being okay. And though I said the steroid worked wonders, it wasn't instant.

Sitting in the hospital with a tired child watching Bluey on a small screen attached to an impressively solid yet movable arm, I opened my laptop and continued Oxygen Quest. One challenge is that even if you have a canister, oxygen is a regulated medical substance. Of course. They can't just give it to you; you need a prescription. And like, it is something you can hurt yourself with. If you're being paternalistic, it's not crazy to be paternalistic about it. But my spouse is a trained ICU nurse and is competent to administer oxygen and monitor its effects.

Turns out that oxygen isn't just part of human respiration. It's used in many things, notably welding. But no dice. I didn't test this one by calling around myself, but forums of people hunting around for medical oxygen said that no welding place would risk filling your medical tank with welding-grade oxygen for $5 or $20. The risk of getting in major major trouble is not worth it. Even if they'd been willing, I buy that medical oxygen has much greater purity, and I didn't want to risk giving my child oxygen to breathe that might contain hydrocarbons, particulates, or other contaminants that mere welding oxygen can contain.

What do? I tried consulting Claude (maybe ChatGPT too) and was told it was not possible. Gotta get a prescription! Yet I don't trust LLMs when they say there isn't a way. I persisted.

I found a forum where someone said: you know who else uses oxygen besides sick people and welders? Divers and pilots! Makes sense. What's more, they use oxygen rated for human consumption! I was amused and pleased.

A little more research and I knew that medical oxygen is commonly called USP (US Pharmacopeia) oxygen and falls under FDA regulations. Divers and pilots use ABO (aviation breathing oxygen), regulated by the FAA. Their purity is much the same, though ABO requires lower moisture content (to avoid freezing at high altitudes). ABO is fine for medical use.

More importantly, it seems at least some dive shops will just fill up your medical oxygen canister (or even clean and certify it), no prescription required.  Deep Ocean Explore in Oakland replied to my Yelp message, affirming they do air, nitrox, and pure oxygen fills.

Gathering the necessary pieces, I ordered an excessive amount of tubing and pediatric nasal cannulas from Amazon. No prescription required, but they only sell packs of 50, and I wanted a few sizes/styles just in case.

Thankfully, by approximately 4:20 PM, following a treatment she hated receiving and I hated watching (and assisting with), things had really turned around, and we went home. Rapidly, our child had gone from completely lethargic to playing in the backyard. Crisis averted. My spouse remarked that her expectations might have been anchored too much on the elderly patients from the ICU; healthy children actually bounce back fast.

24 minutes from the ED to the backyard

Even though we'd made it through this time, it still seemed worth being prepared, and I put some more effort into sourcing home oxygen.

I'd focused on canisters (I'm not sure why), but I'd also given some thought and some searches to oxygen concentrator devices. People sell these second-hand on Facebook Marketplace. And companies sell them directly too, no prescription required? Regulation is not consistent. I nearly bought one refurbished for $400.

Perhaps in a failure of rationality, I didn't actually click buy once the emergency had passed. We've been proactive about giving the inhaler when our daughter gets sick, which seems to have helped. This last weekend just got a bit bad.

But a friend was in the house! It just so happens this friend had an oxygen concentrator in storage. He'd acquired it during Covid, back when people feared hospitals might be overloaded or highly risky. I offered to buy it, and now we have one. I hope we never need to use it.

It's common wisdom not to get between a mother bear and her cubs. You can find many videos of other animals being protective of their offspring. The idea that parents are protective isn't new to me. It's interesting to feel it from the inside. I can feel how much I would do for my daughter[6]. Getting her some oxygen is the least of it.

So, dear government, I'm not sure if you're looking after me unasked, or are subject to corrupt regulatory capture, but I want you to know your paternalism is no match for mine. And you'd better watch out for when my kid gets big enough to take care of herself.

Reflections on Regulation

Interestingly, my feelings regarding regulation have shifted between the start of writing this post and this point. It seems there are a lot of ways to hurt yourself with oxygen. Even with my spouse and me on the extreme right tail of medical competency, our judgment about optimal treatment was plausibly off.

Now, my indignant opening paragraph said it wasn't about whether or not people would make bad decisions; it's that it was their decision to make, and taking that away is bad. Funny, though: when I think about it being overwhelmingly bad in practice to give people access to medical oxygen, I then want to seek justification for restriction under some libertarian model. I think it can be done, as per my footnote regarding some kind of contractualism-type thing.

I don't like that anyone is prohibited from engaging in transactions they want, but I also wouldn't like it if anyone were forced into transactions they didn't want. So let's suppose:

  • There is some big medical conglomerate entity consisting of various providers such as doctors, hospitals, drug-makers, etc.
  • This entity has a rule that membership means not selling people things that might hurt them. Any provider who chooses to join has accepted this deal for the benefits provided, e.g., doctors refer you to this pharmacy.
  • If so, I think it's fine that some provider refuses to sell you something, and even that it listens to some central body that's deciding what is okay to sell and under what conditions.

The problem with our current system is that it outlaws anyone existing outside of the medical conglomerate. That's not cool. Perhaps if I were to run a government, I'd aim to set up a medical conglomerate that's just really good, such that people would prefer to use it. But you'd be allowed to compete with it, and people would be allowed to buy non-conglomerate services and products, at their own risk.

I feel less mad right now that medical oxygen isn't just something you can buy from standard medical supply stores. And in fact you can get it elsewhere; they're not regulating it that hard. So eh, maybe it's kind of reasonable. There's more to reason through here, maybe for another post.

 

  1. ^

    Risks of DIY Oxygen administration:

    • Fire/explosion: Oxygen-enriched rooms and fabrics ignite very easily; smoking, candles, stoves, or sparks can cause severe fires. Cylinders are high-pressure and dangerous if mishandled.
    • Masking serious illness & delaying care: Oxygen can improve the number (SpO₂) or how someone feels while a serious underlying problem (heart attack, PE, pneumonia, sepsis, etc.) worsens in the background.
    • Wrong dosing (too much / too little / unnecessary):
      • Too little or weak setup → false sense of security when they actually need hospital-level support.
      • Too much, especially in COPD/CO₂ retainers → rising CO₂, confusion, drowsiness, potentially coma.
      • Using oxygen when SpO₂ is actually fine → treats nothing, distracts from the real issue.
    • Equipment hazards: Incorrect regulators/tubing, or leaks; possible contamination or infection from dirty cannulas or non-sterile humidifier water.
  2. ^

    Noticing that so many around me think the same, I feel grateful for my epistemic luck.

  3. ^

    I think this "hurting anyone else" piece is where most of the discussion is to be had, plus some question of contractualism and there being some rights that one pro-socially gives up as being part of society. Maybe that should include freedoms that you might use to hurt yourself, but...

  4. ^

    I've received some impression that this is, fortunately, something she will grow out of.

  5. ^

    Oxygen saturation measurement is also part of safely administering oxygen supplementation.

  6. ^

    The scrupulosity in me insists I disclose that the cost of hospital visits (even with insurance) also provided motivation to avoid hospital visits. That said, money not spent on the hospital is available to spend on my daughter in other ways.



Discuss
