Вы здесь

Сборщик RSS-лент

New video from Palisade Research: No One Understands Why AI Works

Новости LessWrong.com - 20 февраля, 2026 - 23:29
Published on February 20, 2026 8:29 PM GMT

Palisade Research have released out a long-form video about the history of AI and how no one understands modern AI systems. The video was made by Petr Lebedev, Palisade's Science Communication lead. 

The main goal is to get people to understand what “AIs aren’t programmed, they’re grown" means. The style is focused on being entertaining and educational. It aims to not feel like a typical “AI safety comms” video, and gives the audience a lot more context than usual.

I think the video is a great introduction to AI, and it does a good job of explaining some of the background arguments for AI risk. Sharing or signal boosting the video would be appreciated!



Discuss

Announcement: Iliad Intensive + Iliad Fellowship

Новости LessWrong.com - 20 февраля, 2026 - 23:13
Published on February 20, 2026 8:13 PM GMT

Iliad is proud to announce that applications are now open for the Iliad Intensive and the Iliad Fellowship! These programs, taken together, are our evolution of the PIBBSS × Iliad Research Residency pilot.

The Iliad Intensive will cover taught coursework, serving as a widely comprehensive introduction to the field of technical AI alignment. The Iliad Fellowship will cover mentored research; it will support mentored research fellows for three months, giving them adequate time to generate substantial research outputs.

Iliad Intensive

The Iliad Intensive is a month-long intensive introduction to technical AI alignment, with iterations run in April, June, and August. Topics covered will include the theory of RL, learning theory, interpretability, agent foundations, scalable oversight and Debate, and more. Applicants will be selected for technical excellence in the fields of mathematics, theoretical physics, and theoretical CS.

Excellent performance in the Iliad Intensive can serve as a road into enrollment in the succeeding Iliad Fellowship.

Iliad Fellowship

The summer 2026 Iliad Fellowship emphasizes individual, mentored research in technical AI alignment. It is run in collaboration with PrincInt.

The summer 2026 cohort will run three months, June–August.

Common Application

Apply here, and by March 6th AoE for the April Iliad Intensive. You can apply for any and all of the above programs in this common application form.

Interviews will follow the initial applications.



Discuss

ARENA 8.0 - Call for Applicants

Новости LessWrong.com - 20 февраля, 2026 - 21:28
Published on February 20, 2026 6:28 PM GMT

TL;DR:

We're excited to announce the eighth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA 8.0 will be running in-person from LISA from May 25th – June 26th, 2026 (the first week is an optional review of Neural Network Fundamentals).

Apply here to participate in ARENA 8.0 before 11:59pm on Sunday March 8th, 2026 (anywhere on Earth).

Summary:

ARENA has been successfully run seven times, with alumni going on to become MATS scholars, LASR participants and Pivotal participants; AI safety engineers at Apollo Research, METR, UK AISI, and even starting their own AI safety organisations!

This iteration will run from May 25th – June 26th, 2026 (the first week is an optional review of Neural Network Fundamentals) at the London Initiative for Safe AI (LISA) in Shoreditch, London. LISA houses AI safety organisations (e.g., Apollo Research, BlueDot Impact), several other AI safety researcher development programmes (e.g., LASR Labs, PIBBSS, Pivotal, Catalyze Impact), and many individual researchers (independent and externally affiliated).

Being situated at LISA brings several benefits to participants, such as productive discussions about AI safety and different agendas, allowing participants to form a better picture of what working on AI safety can look like in practice, and offering chances for research collaborations post-ARENA.

The main goals of ARENA are to:

  • Find high-quality participants;
  • Upskill these talented participants in ML skills for AI safety work;
  • Integrate participants with the existing AI safety community;
  • Accelerate participants’ career transition into AI safety.

The programme's structure will remain broadly the same as in ARENA 7.0, with a few minor additions (see below). For more information on the ARENA 7.0 structure, see our website (soon to be updated with our new material).

Also, please note that we have a Slack group designed to support the independent study of the material (join link here).

Outline of Content:

The 4-5 week programme will be structured as follows:

Week 0 (Optional): Deep Learning Fundamentals

Before getting into more advanced topics, we first cover the basics of deep learning, including basic machine learning terminology, what neural networks are, and how to train them. We will also cover some subjects we expect to be useful going forward, e.g. using GPT-3 and 4 to streamline your learning, good coding practices, and version control.

Note: Participants can optionally skip this week of the programme and join us at the start of week 1 if i) they’re unable to attend otherwise and ii) we’re confident that they are already comfortable with the material in this week. It is recommended that participants attend, even if they’re familiar with the fundamentals of deep learning.

Topics include:

  • PyTorch basics
  • CNNs, Residual Neural Networks
  • Optimisation (SGD, Adam, etc)
  • Backpropagation
  • Hyperparameter search with Weights and Biases
  • GANs & VAEs
Week 1 - Transformers & Interpretability

This week, you will learn all about transformers and build and train your own. You'll also study LLM interpretability, a field which has been advanced by Anthropic’s Transformer Circuits sequence, and work by Neel Nanda and the Google DeepMind Interpretability Team. This week will also branch into areas more accurately classed as 'alignment science' than interpretability, for example, work on token-level analysis of reasoning models.

Topics include:

Week 2 - Reinforcement Learning

This week, you will learn about some of the fundamentals of RL and work with OpenAI’s Gym environment to run their own experiments.

Topics include:

Week 3 - Model Evaluation

This week, you will learn how to evaluate models. We'll take you through the process of building a multiple-choice benchmark of your own and using this to evaluate current models through UK AISI's Inspect library. We'll then move on to study LM agents: how to build them and how to elicit behaviour from them. We'll also have the option for participants to explore beyond evals, and study some of the methods used in AI control.

Topics include:

Week 4 - Capstone Project

We will conclude this program with a Capstone Project, where participants will receive guidance and mentorship to undertake a 1-week research project building on materials taught in this course. This should draw on the skills and knowledge that participants have developed from previous weeks and our paper replication tutorials.

Here is some sample material from the course on how to replicate the Indirect Object Identification paper (from the week on Transformers & Mechanistic Interpretability). An example Capstone Project might be to apply this method to interpret other circuits, or to improve the method of path patching. You can see some examples of capstone projects from previous ARENA participants here, as well as posts on LessWrong here and here

Call for Staff

ARENA has been successful because we had some of the best in the field TA-ing with us and consulting with us on curriculum design. If you have particular expertise in topics in our curriculum and want to apply to be a TA, use this form to apply. TAs will be well compensated for their time. Please contact info@arena.education with any further questions.

FAQs:Q: Who is this programme suitable for?

A: There’s no single profile that we look for at ARENA; in recent iterations, successful applicants have come from diverse academic and professional backgrounds. We intend to keep it this way – this diversity makes our bootcamps a more enriching learning experience for all.

When assessing applications to our programme, we like to see:

  • Applicants who genuinely care about AI safety and making the future development of AI go well;
  • Applicants who are able to code well in Python, and have some knowledge of the maths needed for modern AI (linear algebra, calculus, probability);
  • A solid understanding of how you might best contribute to technical AI safety, and how you expect ARENA to help you achieve your goals.

Since ARENA is an ML bootcamp, some level of technical skill in maths and coding will be required – more detail on this can be found in our FAQs. However, if our work resonates with you, we encourage you to apply.

Q: What will an average day in this programme look like?

At the start of the programme, most days will involve pair programmingworking through structured exercises designed to cover all the essential material in a particular week. The purpose is to get you more familiar with the material in a hands-on way. There will also usually be a short selection of required readings designed to inform the coding exercises.

As we move through the course, some weeks will transition into more open-ended material. For example, in the Transformers and Mechanistic Interpretability week, after you complete the core exercises, you'll be able to choose from a large set of different exercises, covering topics as broad as model editing, superposition, circuit discovery, grokking, discovering latent knowledge, and more. In the last week, you'll choose a research paper related to the content we've covered so far & replicate its results (possibly even extend them!). There will still be TA supervision during these sections, but the goal is for you to develop your own research & implementation skills. Although we strongly encourage paper replication during this week, we would also be willing to support well-scoped projects if participants are excited about them.

Q: How many participants will there be?

We're expecting to accept around 30 participants in the in-person programme.

Q: Will there be prerequisite materials?

A: Yes, we will send you prerequisite reading & exercises covering material such as PyTorch, einops and some linear algebra (this will be in the form of a Colab notebook) a few weeks before the start of the programme.

Q: When is the application deadline?

A: The deadline for submitting applications is 11:59pm anywhere on Earth on Sunday March 8th, 2026.

Q: What will the application process look like?

A: There will be three steps:

  1. Fill out the application form;
  2. Perform a coding assessment;
  3. Interview virtually with one of us, so we can find out more about your background and interests in this course.
Q: Can I join for some sections but not others?

A: Participants will be expected to attend the entire programme. The material is interconnected, so missing content would lead to a disjointed experience. We have limited space and, therefore, are more excited about offering spots to participants who can attend the entirety of the programme.

The exception to this is the first week, which participants can choose to opt in or out of based on their level of prior experience (although attendance is strongly recommended if possible).

Q: Will you pay stipends to participants?

A: We won't pay stipends to participants. However, we will be providing housing, food and travel assistance to our participants (see below). We aim to ensure that finances do not present a barrier to any candidates participating in ARENA.

Q: Which costs will you be covering for the in-person programme?

A: We will cover all reasonable travel expenses to and from London (which will vary depending on where the participant is from) and visa assistance, where needed. Accommodation, meals, and drinks and snacks will also all be included during the duration of the programme.

Q: I'm interested in trialling some of the material or recommending material to be added. Is there a way I can do this?

A: If either of these is the case, please feel free to reach out directly via an email to info@arena.education (alternatively, send JamesH a LessWrong message). We'd love to hear from you!

Links to Apply:

Here is the link to apply as a participant. We expect it to take about 90 minutes.

Here is the link to apply as a TA. You shouldn't spend longer than 30 minutes on it.

We look forward to receiving your application!



Discuss

Unprecedented Catastrophes Have Non-Canonical Probabilities

Новости LessWrong.com - 20 февраля, 2026 - 21:23
Published on February 20, 2026 6:23 PM GMT

The chance of a bridge failing, of an asteroid striking the earth, of whether your child will get into Harvard for a special reason only you know, and of whether AI will kill everyone are all things that can be expressed with probability, but they are not all the same type of probability. There is a structural difference in the “probability” of a bridge collapsing being a ( mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-msup { display: inline-block; text-align: left; } mjx-mn { display: inline-block; text-align: left; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-mi { display: inline-block; text-align: left; } mjx-munder { display: inline-block; text-align: left; } mjx-over { text-align: left; } mjx-munder:not([limits="false"]) { display: inline-table; } mjx-munder > mjx-row { text-align: left; } mjx-under { padding-bottom: .1em; } mjx-mover { display: inline-block; text-align: left; } mjx-mover:not([limits="false"]) { padding-top: .1em; } mjx-mover:not([limits="false"]) > * { display: block; text-align: left; } mjx-msub { display: inline-block; text-align: left; } mjx-mrow { display: inline-block; text-align: left; } mjx-mfrac { display: inline-block; text-align: left; } mjx-frac { display: inline-block; vertical-align: 0.17em; padding: 0 .22em; } mjx-frac[type="d"] { vertical-align: .04em; } mjx-frac[delims] { padding: 0 .1em; } mjx-frac[atop] { padding: 0 .12em; } mjx-frac[atop][delims] { padding: 0; } mjx-dtable { display: inline-table; width: 100%; } mjx-dtable > * { font-size: 2000%; } mjx-dbox { display: block; font-size: 5%; } mjx-num { display: block; text-align: center; } mjx-den { display: block; text-align: center; } mjx-mfrac[bevelled] > mjx-num { display: inline-block; } mjx-mfrac[bevelled] > mjx-den { display: inline-block; } mjx-den[align="right"], mjx-num[align="right"] { text-align: right; } mjx-den[align="left"], mjx-num[align="left"] { text-align: left; } mjx-nstrut { display: inline-block; height: .054em; width: 0; vertical-align: -.054em; } mjx-nstrut[type="d"] { height: .217em; vertical-align: -.217em; } mjx-dstrut { display: inline-block; height: .505em; width: 0; } mjx-dstrut[type="d"] { height: .726em; } mjx-line { display: block; box-sizing: border-box; min-height: 1px; height: .06em; border-top: .06em solid; margin: .06em -.1em; overflow: hidden; } mjx-line[type="d"] { margin: .18em -.1em; } mjx-mspace { display: inline-block; text-align: left; } mjx-msubsup { display: inline-block; text-align: left; } mjx-script { display: inline-block; padding-right: .05em; padding-left: .033em; } mjx-script > mjx-spacer { display: block; } mjx-mtext { display: inline-block; text-align: left; } mjx-munderover { display: inline-block; text-align: left; } mjx-munderover:not([limits="false"]) { padding-top: .1em; } mjx-munderover:not([limits="false"]) > * { display: block; } mjx-stretchy-v.mjx-c7C mjx-ext mjx-c::before { content: "\2223"; width: 0.333em; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c36::before { padding: 0.666em 0.5em 0.022em 0; content: "6"; } mjx-c.mjx-c38::before { padding: 0.666em 0.5em 0.022em 0; content: "8"; } mjx-c.mjx-c1D448.TEX-I::before { padding: 0.683em 0.767em 0.022em 0; content: "U"; } mjx-c.mjx-c1D449.TEX-I::before { padding: 0.683em 0.769em 0.022em 0; content: "V"; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c5B::before { padding: 0.75em 0.278em 0.25em 0; content: "["; } mjx-c.mjx-c1D45D.TEX-I::before { padding: 0.442em 0.503em 0.194em 0; content: "p"; } mjx-c.mjx-c2013::before { padding: 0.285em 0.5em 0 0; content: "\2013"; } mjx-c.mjx-c2C::before { padding: 0.121em 0.278em 0.194em 0; content: ","; } mjx-c.mjx-c5D::before { padding: 0.75em 0.278em 0.25em 0; content: "]"; } mjx-c.mjx-c1D450.TEX-I::before { padding: 0.442em 0.433em 0.011em 0; content: "c"; } mjx-c.mjx-c1D458.TEX-I::before { padding: 0.694em 0.521em 0.011em 0; content: "k"; } mjx-c.mjx-c210E.TEX-I::before { padding: 0.694em 0.576em 0.011em 0; content: "h"; } mjx-c.mjx-c1D459.TEX-I::before { padding: 0.694em 0.298em 0.011em 0; content: "l"; } mjx-c.mjx-c1D443.TEX-I::before { padding: 0.683em 0.751em 0 0; content: "P"; } mjx-c.mjx-c1D438.TEX-I::before { padding: 0.68em 0.764em 0 0; content: "E"; } mjx-c.mjx-c2223::before { padding: 0.75em 0.278em 0.249em 0; content: "\2223"; } mjx-c.mjx-c1D465.TEX-I::before { padding: 0.442em 0.572em 0.011em 0; content: "x"; } mjx-c.mjx-c3A::before { padding: 0.43em 0.278em 0 0; content: ":"; } mjx-c.mjx-c2265::before { padding: 0.636em 0.778em 0.138em 0; content: "\2265"; } mjx-c.mjx-c1D716.TEX-I::before { padding: 0.431em 0.406em 0.011em 0; content: "\3F5"; } mjx-c.mjx-c28.TEX-S2::before { padding: 1.15em 0.597em 0.649em 0; content: "("; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c29.TEX-S2::before { padding: 1.15em 0.597em 0.649em 0; content: ")"; } mjx-c.mjx-c2264::before { padding: 0.636em 0.778em 0.138em 0; content: "\2264"; } mjx-c.mjx-c1D442.TEX-I::before { padding: 0.704em 0.763em 0.022em 0; content: "O"; } mjx-c.mjx-c1D707.TEX-I::before { padding: 0.442em 0.603em 0.216em 0; content: "\3BC"; } mjx-c.mjx-c5E::before { padding: 0.694em 0.5em 0 0; content: "^"; } mjx-c.mjx-c2203::before { padding: 0.694em 0.556em 0 0; content: "\2203"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c2200::before { padding: 0.694em 0.556em 0.022em 0; content: "\2200"; } mjx-c.mjx-c1D460.TEX-I::before { padding: 0.442em 0.469em 0.01em 0; content: "s"; } mjx-c.mjx-c3E::before { padding: 0.54em 0.778em 0.04em 0; content: ">"; } mjx-c.mjx-c3A3::before { padding: 0.683em 0.722em 0 0; content: "\3A3"; } mjx-c.mjx-c1D43E.TEX-I::before { padding: 0.683em 0.889em 0 0; content: "K"; } mjx-c.mjx-c22C5::before { padding: 0.31em 0.278em 0 0; content: "\22C5"; } mjx-c.mjx-c3A9::before { padding: 0.704em 0.722em 0 0; content: "\3A9"; } mjx-c.mjx-c1D7CE.TEX-B::before { padding: 0.654em 0.575em 0.01em 0; content: "0"; } mjx-c.mjx-c2032::before { padding: 0.56em 0.275em 0 0; content: "\2032"; } mjx-c.mjx-c1D43B.TEX-I::before { padding: 0.683em 0.888em 0 0; content: "H"; } mjx-c.mjx-c7B::before { padding: 0.75em 0.5em 0.25em 0; content: "{"; } mjx-c.mjx-c7D::before { padding: 0.75em 0.5em 0.25em 0; content: "}"; } mjx-c.mjx-c1D714.TEX-I::before { padding: 0.443em 0.622em 0.011em 0; content: "\3C9"; } mjx-c.mjx-c1D70E.TEX-I::before { padding: 0.431em 0.571em 0.011em 0; content: "\3C3"; } mjx-c.mjx-c21A6::before { padding: 0.511em 1em 0.011em 0; content: "\21A6"; } mjx-c.mjx-c1D6FF.TEX-I::before { padding: 0.717em 0.444em 0.01em 0; content: "\3B4"; } mjx-c.mjx-c2190::before { padding: 0.511em 1em 0.011em 0; content: "\2190"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c61::before { padding: 0.448em 0.5em 0.011em 0; content: "a"; } mjx-c.mjx-c78::before { padding: 0.431em 0.528em 0 0; content: "x"; } mjx-c.mjx-c1D464.TEX-I::before { padding: 0.443em 0.716em 0.011em 0; content: "w"; } mjx-c.mjx-c1D456.TEX-I::before { padding: 0.661em 0.345em 0.011em 0; content: "i"; } mjx-c.mjx-c2211.TEX-S1::before { padding: 0.75em 1.056em 0.25em 0; content: "\2211"; } mjx-c.mjx-c1D709.TEX-I::before { padding: 0.704em 0.438em 0.205em 0; content: "\3BE"; } mjx-c.mjx-c2192::before { padding: 0.511em 1em 0.011em 0; content: "\2192"; } mjx-c.mjx-c2286::before { padding: 0.636em 0.778em 0.138em 0; content: "\2286"; } mjx-c.mjx-c2208::before { padding: 0.54em 0.667em 0.04em 0; content: "\2208"; } mjx-c.mjx-c27FA::before { padding: 0.525em 1.858em 0.024em 0; content: "\27FA"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c76::before { padding: 0.431em 0.528em 0.011em 0; content: "v"; } mjx-c.mjx-c72::before { padding: 0.442em 0.392em 0 0; content: "r"; } mjx-c.mjx-c1D441.TEX-I::before { padding: 0.683em 0.888em 0 0; content: "N"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c2061::before { padding: 0 0 0 0; content: ""; } mjx-c.mjx-c2F::before { padding: 0.75em 0.5em 0.25em 0; content: "/"; } mjx-c.mjx-c394::before { padding: 0.716em 0.833em 0 0; content: "\394"; } mjx-c.mjx-c7C::before { padding: 0.75em 0.278em 0.249em 0; content: "|"; } mjx-c.mjx-c25FC.TEX-A::before { padding: 0.689em 0.778em 0 0; content: "\25A0"; } mjx-c.mjx-c4D.TEX-C::before { padding: 0.705em 1.201em 0.05em 0; content: "M"; } mjx-c.mjx-c2217::before { padding: 0.465em 0.5em 0 0; content: "\2217"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c75::before { padding: 0.442em 0.556em 0.011em 0; content: "u"; } mjx-c.mjx-c70::before { padding: 0.442em 0.556em 0.194em 0; content: "p"; } mjx-c.mjx-c226A::before { padding: 0.568em 1em 0.067em 0; content: "\226A"; } mjx-c.mjx-c48.TEX-C::before { padding: 0.683em 0.845em 0.048em 0; content: "H"; } mjx-c.mjx-cA0::before { padding: 0 0.25em 0 0; content: "\A0"; } mjx-c.mjx-c77::before { padding: 0.431em 0.722em 0.011em 0; content: "w"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c74::before { padding: 0.615em 0.389em 0.01em 0; content: "t"; } mjx-c.mjx-c68::before { padding: 0.694em 0.556em 0 0; content: "h"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c64::before { padding: 0.694em 0.556em 0.011em 0; content: "d"; } mjx-c.mjx-c43.TEX-C::before { padding: 0.705em 0.527em 0.025em 0; content: "C"; } mjx-c.mjx-cAF::before { padding: 0.59em 0.5em 0 0; content: "\AF"; } mjx-c.mjx-c1D6FC.TEX-I::before { padding: 0.442em 0.64em 0.011em 0; content: "\3B1"; } mjx-c.mjx-c1D6FD.TEX-I::before { padding: 0.705em 0.566em 0.194em 0; content: "\3B2"; } mjx-c.mjx-c1D436.TEX-I::before { padding: 0.705em 0.76em 0.022em 0; content: "C"; } mjx-c.mjx-c1D6FE.TEX-I::before { padding: 0.441em 0.543em 0.216em 0; content: "\3B3"; } mjx-c.mjx-c1D70C.TEX-I::before { padding: 0.442em 0.517em 0.216em 0; content: "\3C1"; } mjx-c.mjx-c5B.TEX-S2::before { padding: 1.15em 0.472em 0.649em 0; content: "["; } mjx-c.mjx-c5D.TEX-S2::before { padding: 1.15em 0.472em 0.649em 0; content: "]"; } mjx-c.mjx-c39B::before { padding: 0.716em 0.694em 0 0; content: "\39B"; } mjx-c.mjx-c1D453.TEX-I::before { padding: 0.705em 0.55em 0.205em 0; content: "f"; } mjx-c.mjx-c1D454.TEX-I::before { padding: 0.442em 0.477em 0.205em 0; content: "g"; } mjx-c.mjx-cB1::before { padding: 0.666em 0.778em 0 0; content: "\B1"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c1D53C.TEX-A::before { padding: 0.683em 0.667em 0 0; content: "E"; } mjx-c.mjx-c1D70F.TEX-I::before { padding: 0.431em 0.517em 0.013em 0; content: "\3C4"; } mjx-c.mjx-c2248::before { padding: 0.483em 0.778em 0 0; content: "\2248"; } mjx-c.mjx-cAC::before { padding: 0.356em 0.667em 0 0; content: "\AC"; } mjx-c.mjx-c1D446.TEX-I::before { padding: 0.705em 0.645em 0.022em 0; content: "S"; } mjx-c.mjx-c1D437.TEX-I::before { padding: 0.683em 0.828em 0 0; content: "D"; } mjx-c.mjx-c2260::before { padding: 0.716em 0.778em 0.215em 0; content: "\2260"; } mjx-c.mjx-c1D7CF.TEX-B::before { padding: 0.655em 0.575em 0 0; content: "1"; } mjx-c.mjx-c1D451.TEX-I::before { padding: 0.694em 0.52em 0.01em 0; content: "d"; } mjx-c.mjx-c2209::before { padding: 0.716em 0.667em 0.215em 0; content: "\2209"; } mjx-c.mjx-c2193::before { padding: 0.694em 0.5em 0.194em 0; content: "\2193"; } mjx-c.mjx-c55.TEX-C::before { padding: 0.683em 0.687em 0.028em 0; content: "U"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } ) chance, versus saying my P(doom) is 15%, or my P(utopia) is 2%.

I know what you are probably thinking: “we’ve heard this before...you are going to go down some rabbithole about how P(doom) is tribally divisive, or that it distracts from gears-level models, or maybe another Popperian argument that it is philosophically meaningless to assign probabilities to unprecedented events."

Don’t worry! This post is none of those things. (If you want to double-check, feel free to skip down to the Why Is This Different section.)

And to be clear at the outset: I am not anti-Bayesian, and I definitely think quantifying uncertainty is generally a good idea. However; I am going to argue that probability estimates for unprecedented catastrophes are non-canonical[1], and I am going to do so within the confines of Bayesian machinery itself, using likelihood ratios, posterior convergence, and algorithmic information theory.

First, what do I mean by a canonical probability? A canonical probability is stable across reasonable scientific frameworks given the available evidence. For example, a  probability of an extinction level asteroid impact is canonical because the mechanistic models of orbital dynamics would be continuously updated by routine data which would directly bear on the impact parameter and these, and related factors, would force all reasonable scientific frameworks to converge to a similar narrow estimate. Note: this is canonical even though we have not seen an asteroid like this hit earth (obviously we have evidence of past ones though).

However, genuinely unprecedented catastrophic risks are non-canonical because the available non-catastrophic data structurally refuses to wash out your priors. Without past events to sharply discriminate between competing models the probability remains only partially identified and your resulting estimate is only a measurement of your chosen ontology rather than a measurement of the world.[2]

The illusion of precise catastrophic probabilities can be traced to two distinct mathematical failures: one based on how we process evidence, and the other in how we define the event itself. Below, I will break down these two grounds of failure, using Nick Bostrom’s recent paper “Optimal Timing for Superintelligence” and Eliezer Yudkowsky’s asteroid rebuttal parable as an example.[3] Finally, I’ll offer a concrete suggestion of how to escape them and restore empirical rigor.

Data Doesn't Resolve Ontology

Before showing where probabilities of unprecedented catastrophes break, let’s briefly return to the asteroid impact example to show when they do not break. The key idea with an asteroid is that we can gather non-event data that directly updates the impact parameter and we can washout whatever prior we started with. Econometrics has a great way of expressing this concept: it is called being point-identified.[4] The data has forced the parameter to a specific, narrow point. 

Now let’s turn to catastrophic probabilities that do break. The battle between Bostrom and Yudkowsky is a great example of genuinely different ontological starting points, but I don’t mean to make overarching claims about their complete intellectual positions. They draw on multiple frameworks and are not formal description languages themselves.

Bostrom recently wrote a paper that framed building AGI as a "risky surgery for a condition that will otherwise prove fatal". The crux of his argument is that if we want to maximize QALY, high probabilities of catastrophe from AGI are worth accepting because everyone alive today will die anyway if we do nothing. The surgery-framing model is structurally native and simple to state.[5]

In response, Yudkowsky, wrote a mocking parable on X called the "Asteroid of Immortality". In Yudkowsky's framework, AGI is like an incoming asteroid, and Bostrom’s reasoning for eternal life amounts to logic of a cult rather than a surgeon. In Yud’s alignment-centric framework, "doom via misaligned optimization" is the natively simple, default prediction.

To understand where probabilities associated with these views will come from, we first need to formalize what a framework is. Think of a framework as the programming language of their worldviews: it determines which explanations are naturally simple to express, and which are unnatural and super complex. Imagine two ideal rational agents using two different Universal Turing Machines as their formal description languages (their “programming languages”). Machine  natively encodes Yudkowsky's alignment concepts. Machine  natively encodes Bostrom's surgery-framing concepts.

There is a fundamental problem, however: the data that we are gathering right now cannot help us choose which of these frameworks is likely to be right. This is quite counterintuitive because when we Bayesian update we expect that new data will wash out our subjective priors. But unprecedented catastrophes hit a concept I call The Evidential Screening Property. Another year without world-ending events gives us more data, but this data fits equally well into a model that predicts it’s all over and a model that predicts it’s all ok. Because both models perfectly predict the prefix of history (the track record of AI development leading up to the current moment), the likelihood ratio between them is approximately 1. The catastrophe parameter is entirely screened off from the evidence, leaving our prior beliefs untouched. 

To put this in more precise terms, we are not getting more useful data over time, just more noise that fits both theories perfectly. Formally, the log-likelihood ratio is bounded by a tiny number of bits, , that grows excruciatingly slowly, if at all. With Bayes your posterior odds are your prior odds multiplied by the likelihood ratio. If the likelihood ratio is structurally bounded near 1, the prior odds completely dominate the posterior.

This takes us back to econometrics and a concept called Partial Identification. I briefly mentioned point identification at the start of this section, that is, we can use data we gather from something like an extinction-asteroid to force a sharp single risk estimate. In contrast, the non-catastrophic prefix of AI history structurally fails to do this and the data only constrains the parameter to a hugely ambiguous region. This is what Manski calls being partially identified: the hugely ambiguous region of data is a prior-free identification region  and the choice of descriptive language, and its induced algorithmic prior, is just an arbitrary coordinate selected in the void.

The Complexity Inversion and the Failure of Model Averaging

If you are familiar with Bayesian math, as many of you probably are, you may now be thinking something like this: 

Ok, the log-odds between Bostrom and Yud’s specific and extreme models might shift wildly based on your ontology, but with Bayes we averaging over the mixture of all possible models, not just one. When we have a rich hypothesis space a diffuse and large number of middle-mass moderate models will anchor the probability and reduce the extreme swings. In other words, something like the aggregate of P(doom) will remain stable.

This is a smart defense, but I think it empirically fails for models used today under a specific condition I will call the Complexity Inversion: switching frameworks flips which type of futures are considered simple and which are considered complex.

Recall earlier that I likened frameworks to programming languages. In algorithmic information theory (“AIT”), Kolmogorov complexity is the rule that penalizes a theory based on how many lines of code it takes to express it. The longer the code, the lower the prior probability. When we switch frameworks, like switching between Yud’s and Bostrom’s, we aren’t just changing weights or adding noise, we are swapping which entire clusters of models (high-risk versus low-risk) get taxed by this complexity penalty.

Perhaps under Yudkowsky's Machine , a doom scenario routed through deceptive alignment is concise and an entire cluster of high-risk models is algorithmically simple. In contrast, perhaps under Bostrom's Machine , expressing mesa-optimization requires a massive, clunky translation, but maybe a QALY-maximizing “institutional equilibrium” cluster is inherently much easier.

In AIT the Invariance Theorem tells us that translating concepts between two Turing machines incurs a fixed overhead cost, which here I will denote as  (measured in bits). Because in our example these frameworks use genuinely different ontological primitives the translation cost is massive. With Bayes calculations for something like the total aggregate probability of doom, we are not taking a flat average, instead the clusters are weighed down using a sigmoid curve. The sigmoid acts less like a balancing scale and much more like a tipping point: if the algorithmic advantage of one framework is large enough to beat the useless noise of the daily data by at least  bits, the curve snaps hard to that side. It ends up dragging the entire middle-mass of moderate hypotheses along with it.

So let  be the probability of doom in the high-risk cluster, and  be the probability in the low-risk cluster. By the law of total probability, the cross-framework gap in the aggregate estimate is mathematically bounded below by:

where  is the small residual mass outside the core model set (see the formal appendix for the full derivation).

The interesting thing about this inequality is that as the translation advantage  grows, the fraction vanishes. The difference in aggregate P(doom) approaches  which is the entire width of the probability space under dispute!

I should briefly pause again here: I love postmodern fiction, but I am not a postmodernist or a nihilist. I don’t mean to imply that math or data are fake or anything like that (which you could accidentally conclude if you think the implication is “you’re just saying assign random P(doom) based on your feelings and nothing matters.) I’m definitely not saying this, and the divergence is mathematically capped by the translation cost between frameworks. By the Invariance Theorem, the parameter  satisfies the structural constraint:

The point isn’t to say that P(doom) can be anything, it is to say that it is underdetermined within a specific range, proportional to how different the competing scientific ontologies really are.

Differential Screening: The Random Walk of Doom

Let’s give room to the skeptic again: “Ok, I see the points above, but eventually, as clear and unambiguous evidence accumulates, the data force convergence, right?” 

Unfortunately not. This assumes a monotone accumulation of evidence, that is, the assumption that every new datapoint pushes the probability in the exact same direction. But this is not correct, when frameworks carve up the hypothesis space using different causal joints, the same piece of evidence can cause them to update in opposite directions. I call this divergence of uniform agreement on what an observation means Differential Screening.

Returning to Yud and Bostrom (again these are just hypothetical examples, I am not saying I speak for them and I don’t know all of their positions) say an AI company deploys a new model that happens to find a ton of zero-day exploits. 

  • Under Machine  (Yud’s alignment framework): this event is negative. It is understood as representing a causal model where the system’s capabilities are demonstrably outpacing our ability to control them. The Bayesian update is gap-widening and risk goes up.
  • Under Machine  (Bostrom’s surgery-framing framework): views this exact same event differently, say as a causal model where this is the warning shot that wakes up government regulation and boosts safety ahead of AGI. The update is gap widening in the opposite direction of Yud and risk goes down.

The log-odds gap between these two frameworks over time does not steadily shrink because a substantial fraction of new observations are gap-widening in opposite directions. The cumulative differential accumulates mixed-sign increments rather than declining monotonically. The trajectory of our collective belief doesn't end up as a converging asymptote, instead it acts like a bounded-increment random walk.[6]

As a result, the expected time to canonical agreement scales non-linearly with the fraction of disputed evidence. It pushes the convergence timescale far, far beyond any policy-relevant horizon.

The Halting Problem of Permanent Loss

In the sections above we showed that even if we perfectly agreed on what "doom" meant, the evidence could not pin down the number. However; there is an even more pernicious and problematic formulation of catastrophic event prediction: ones involving expansive formulations that end up with ill-defined definitions of the event itself.

Think about a common definition of existential risk as something that leads to the "permanent loss of humanity's potential." If we want to calculate a real probability for this, we have to be precise about expressing it as a coherent, evaluatable mathematical object in code. We need to formulate a computable classifier because for computationally universal generative models (complex agent-based simulations or LLM rollouts over millions of tokens) the measure  is not analytically integrable, meaning the probability measure cannot be solved with a clean analytical equation.

A computational agent can only estimate the probability of doom by essentially running Monte Carlo rollouts of finite-time trajectories and scoring them. To accurately score a rollout, the mushy philosophical concept of "permanent loss" must be operationalized into a computable, halting indicator function . You need an algorithm that can look at a simulated future and definitively output a 1 (Doom) or 0 (Not Doom) otherwise you cannot get a percentage from the math.

Even a set-theoretic Bayesian who doesn’t care about running code or doing Monte Carlo simulations, and just wants to define their event  using ZFC runs into the same problem. To actually compute their probability they must write down a formal logical sentence defining the event, but because defining permanent loss at the asymptote is so deeply tied to a specific ontology, two agents using different formalizations will inevitably compute probabilities for extensionally different sets:  and . They end up doing math on two entirely different versions of the apocalypse.

Defining this event feels sort of structurally cursed, and the reason why is that it comes down to the infinite nature of the environment we are trying to predict. Recent papers have suggested that the universe, or the post-AGI environment, is computationally universal[7] so "permanent loss of potential" is not a localized event, but an asymptotic property of trajectories. Formally, it has the logical structure of a  condition (there exists a time after which recovery never occurs). 

The mathematical classification of unsolvable problems is known as the arithmetical hierarchy and the number of alternating quantifiers determines just how deeply uncomputable a problem is. A standard Halting Problem only has one: does there exist a step where this program stops? Sometimes this can be solved by just running the code and waiting. But the notion of permanent loss stacks two quantifiers: it requires that there exists a threshold followed by a forever state. Because you can never confirm a forever state just by watching a finite simulation (because humanity could theoretically recover at step ), this logical structure makes the problem more difficult. In this case, it makes the target event a -complete tail set in the arithmetical hierarchy. It is a tail set because its truth value depends entirely on the infinite tail-end of the timeline and completely ignores whatever happened in the finite prefix.

This matters because any practical probability estimate requires a total computable classifier ; an algorithm you could write to evaluate whether a simulated future counts as "doom" or not. It must halt and output an answer (1 or 0), so it must make its decision after reading a finite prefix of the future.

The problem is that membership in a nontrivial tail set cannot be determined by a finite prefix. Therefore, any computable classifier must systematically misclassify some trajectories.

To solve this you might think you could write a more complex and better-describe classifier to reduce the error rate, but it is possible to strictly prove that reducing this classification error requires algorithmically incompressible bits of information. You run into a Precision-Robustness Tradeoff. Working in the Solomonoff mixture over Cantor space, the product of your classifier's error  and its complexity weight  is bounded below:

This equation establishes a strict mathematical balancing act: to drive the error down, your code's complexity must go up. But the trap is much worse than just having to write a longer program, correctly classifying the low-complexity boundary cases requires deciding the halting problem. As you may expect this is worrisome because no general algorithm can solve the Halting Problem. Your classifier literally cannot compute the answers to these boundary cases, it must have the answers hardcoded into it by you, the programmer. To reduce your error, your classifier must physically encode initial segments of Chaitin’s constant relative to the halting oracle ().

Slightly imprecisely: Chaitin’s constant is the solution to the Halting Problem expressed in binary digits. However; the sequence contains answers to uncomputable problems so it has a property called “2 randomness” which makes the sequence indistinguishable from computationally irreducible static. So since these bits are 2-random they cannot be compressed and for every factor-of-two reduction in predicate error, it is required to have one additional bit of perfectly random, incompressible specification.

Now if you wanted those bits, where could they come from?

  1. Computation can't generate them because they are uncomputable.
  2. Data can't supply them because data only tells you which finite prefix you are on, not how to classify its asymptotic tail.
  3. Cross-agent consensus won’t sync them because the Invariance Theorem doesn't force two different frameworks to choose extensionally identical classifiers.

It turns out you are entirely leaning on unforced subjective taxonomy which traps you in an Impossibility Triangle. You must either

  • Accept Ambiguity: and use a simple predicate with low algorithmic complexity. This will cause you to mathematically misclassify a vast number of simple, computable futures. You may end up calculating the probability of the wrong event.
  • Accept Fragility: by using a highly complex event predicate to reduce misclassification. However; the bits required to define it are incompressible and must come from your specific ontology and your estimate becomes fiercely encoding-dependent.
  • Accept Both: and pick a middle ground. You suffer from moderate ambiguity and moderate fragility at the same time.

The overall issue is this: if you reduce the predicate ambiguity, as we just discussed, you are forced deeper into framework-dependence, with all of the issues we previously highlighted. We proved the data cannot correct your priors and now we proved you must inject uncomputable bias just to define an event. Together these guarantee that any precise percentage of these types of risks are a structural illusion. You are no longer measuring the “probability” of the event, you are only measuring the algorithmic cost of your own worldview.

Why This is Different

If you came from the introduction: welcome! And if you just read through the essay above, you might be tempted to map this argument onto existing debates. Especially because one type of unprecedented risk includes P(doom), and I used it as an example, it is critical for me to explain why the formalism above puts us into a different bucket.

Usually critiques of P(doom) fall into one of three buckets:

  1. A Psychological Critique: something like “humans are bad at forecasting, poorly calibrated, and overly influenced by tribal vibes or sci-fi tropes.”
  2. A Complexity Critique: someone says that the world is too messy, the causal graphs are too dense, and our current models are simply not gears-level enough to produce a reliable number.
  3. An Epistemological Critique: usually a standard frequentist or Popperian objection that genuinely unprecedented events cannot be assigned probabilities because they lack reference classes, making the exercise philosophically moot.

This post is none of those because I am assuming the perspective of an ideal Bayesian agent. In doing this I am granting the agent infinite compute, perfect logical omniscience, and flawless Solomonoff induction. Even after doing all of this, the math shows that canonical point-estimates for unprecedented catastrophes are structurally blocked.

Specifically, here is what separates this essay from the usual discourse:

  • It is not just about "Priors matter", it is about Evidential Screening. The standard Bayesian defense is priors are subjective for small samples, but enough data washes them out. However; the Evidential Screening Property proves that for unprecedented risks, the data structurally cannot do this because all available non-catastrophic data perfectly fits both the “safe”-ontology and the “doom”-ontology. This makes the catastrophe parameter mathematically screened off. We could observe a thousand years of safe AI prefix data and the priors would never wash out.
  • It proves that more data won't save us. The intuitive hope is that as AI capabilities scale we will approach a sort of asymptote of consensus. But Differential Screening proves that because different ontologies have misaligned causal joints, they interpret the exact same capabilities jump as evidence in opposite directions. This demonstrates that we don’t tend to a narrowing asymptote, but rather it leads to a bounded-increment random walk and more data can easily result in permanent polarization.
  • It proves that better definitions demand uncomputable magic. When faced with the imprecise nuance of expanded definitions of P(doom), a rationalist impulse is to operationalize the definition. The Precision-Robustness Tradeoff proves that this instinct is a trap. For asymptotic tail events like "permanent loss of potential", driving down classification error mathematically requires incompressible, 2-random bits of Chaitin's constant. This is an insurmountable problem: you cannot compute them and the data cannot supply them. The ambiguity isn't a linguistic failure, but instead a strict boundary in the arithmetical hierarchy.
  • It is not epistemic nihilism. As I mentioned above, I am not throwing my hands up and saying, "We can't know anything, so assign whatever number feels right." The math strictly bounds the divergence by the translation cost between frameworks (). I am tearing down the illusion of single point-estimates specifically to replace them with mathematically rigorous alternatives.

Most critiques of P(doom) attack the agent making the estimate. This framework attacks the information geometry of the problem itself. And it is a problem for P(utopia) as well!

How to Restore Canonical Risk

The probabilities of unprecedented catastrophes are non-canonical because their precise numerical value is a syntax error generated by compilation costs, which is trapped by evidential screening, only partially identified by data, and then choked by uncomputable predicates. So what are we supposed to do? Wait for the end while wringing our hands? No, but we have to be clearer about what survives the mathematics:

Find the Holy Grail: Agreed Precursors With Aligned Causal Joints

You can break the Evidential Screening Property if rival frameworks agree ex ante on an observable intermediate causal node. In other words, if Yudkowsky and Bostrom can agree that "doom is strictly preceded by an AI system autonomously acquiring $100M of illicit computing resources" or public health experts agree that "a novel pandemic is strictly preceded by sustained human-to-human transmission", then we have an observable precursor.

There is a catch: it is not enough to simply agree that the precursor is necessary. Both frameworks must explicitly agree ex ante on the directional update that observation triggers, that is, precursors where the causal joints align. This is not a new concept , but it is my top and best guess at a constructive recommendation. The screening property is what makes P(doom) non-canonical; agreed precursors with aligned derivatives are what break the screening property. When they do align, non-event data discriminates between models that predict different precursor rates, likelihood ratios can grow, prior gaps get washed out, and estimates converge! Before we can solve AI alignment, we should solve epistemic alignment among ourselves.

This points to a clear shift: progress does not come from debating unobservable asymptotic mechanisms ("is it a surgery or an asteroid?") or refining subjective point-estimates. It comes from doing the difficult work of building cross-framework consensus on the observable causal nodes, and their directional updates, that necessarily precede the catastrophe. As treaties and agreements are of high interest to AI safety groups, this seems like a tractable area to focus on, and one that does not require nationstate agreements. It only requires rival labs and pundits to sit down and agree on what a specific test result will actually mean before the test is run.

The allure of a single, objective probability estimate is essentially a desire to outsource our existential fear to the comfort of a single number. It is unfortunate that math refuses to cooperate for this purpose. It is the case that when dealing with unprecedented catastrophes your framework is your fate. Until we find agreed precursors with aligned derivatives, we aren't doing Bayesian updating, we are just arguing about which programming language is prettier.

 

Appendix: All of The Mathematics

This section goes deep into the math, with minimal explanation, into the concepts above. I am working on a full paper that melds the narrative essay with the precise mathematics below, please let me know if you would like to see this. I don’t think it would have fared well on LW for clarity. There are, of course, numerous sources that support the bulk of the mathematics utilized because I am not proving any new computability theorems or doing anything very special. The small additions of my own (at least to my knowledge) are: the Mixture Probability Dispersion Theorem (Theorem 1) and its composite-cluster proof, the Precision-Robustness Tradeoff (Theorem 2) and the direct routing through , the Additive Decomposition (Identity 1) separating the two grounds, the formal bridge connecting evidential screening to Manski's identification region, the Translation Barrier Corollary, and Conjecture 1. If anyone who is actually good at math could help prove or disprove Conjecture 1 I would be very grateful. I tried a bunch of different ways to figure it out, but I just couldn't.

Formal Setting and Objects

Spaces and measures. The sample space is Cantor space . A computable measure  is one where the map  from finite strings to cylinder probabilities is a computable real uniformly in . Let  denote the Dirac measure concentrating on the sequence .

Description languages. Let  and  denote optimal prefix-free Universal Turing Machines (UTMs).  is the prefix-free Kolmogorov complexity of a string  or measure  relative to . Directional compilation cost  is the length of the shortest prefix-free compiler from -programs to -programs. Symmetric translation cost is . By the Invariance Theorem: .

Mixtures. Solomonoff’s uncomputable prior assigns weight  to each computable measure . The posterior probability of event  after finite prefix  is . Let  denote the induced Solomonoff measure over sequences.

Classifiers and events. A total computable classifier  is an oracle Turing machine that queries finitely many bits of its infinite input and halts. By standard computable analysis, such a functional is continuous in the product topology, meaning its output is determined by a finite prefix. A tail set  is an event invariant under modification of finitely many initial coordinates:  for any  and , where  overwrites the first  bits.

Ground 1: Canonicality Failure via Evidential Screening

Proposition 1 (Encoding Swing). For any models , the posterior log-odds under  and  satisfy:

Proof. The posterior odds decompose as . The likelihood ratio  is encoding-independent and cancels upon subtraction. Letting  and , the absolute difference is . By one-sided invariance, each bracket lies in . Subtracting the two intervals yields the bound. 

Definition 1 (Evidential Screening & Partial Identification). An estimation problem satisfies the evidential screening property with respect to a target event  and a core subset of models  (capturing  of the posterior mass) if available evidence satisfies:

for all achievable , where .

The prior-free identification region for the parameter of interest  is:

Denote this region . For canonical parameters,  shrinks to a point as  grows. For screened parameters,  remains persistently wide.

Theorem 1 (Mixture Probability Dispersion). Let  be partitioned into a high-risk cluster  () and low-risk cluster  (), with . Let composite models  and  be the normalized within-cluster mixtures (any convex combination of computable measures is itself a computable measure, so these are well-defined). Let  and . Let the normalized weights of the two clusters under framework  be  and , summing to 1 -  (where  is the residual mass outside the core set). Let  and . If screening holds such that  and  for  then:

Proof. The total mixture probability under  is , where  is the residual contribution. Let . Substituting normalized weights:

By the odds decomposition, . Under the screening bound, , so . Symmetrically,  so . Since  is monotonically increasing:

Subtracting the symmetric expansion for  and bounding the residuals:

Corollary (Translation Barrier Cap). By the Invariance Theorem, . Since  and , we have . The non-canonical divergence is strictly capped by the translation barrier.

Definition 2 & Proposition (Differential Screening & Gap Dynamics). The log-odds gap is , where  is the differential update. If frameworks decompose hypotheses along different causal joints,  takes mixed signs. Modeling  as a bounded-increment random walk with step size  and  fraction of gap-widening evidence:

(a) If , frameworks permanently polarize (probability of consensus decays exponentially in ).

(b) If , expected time to consensus by optional stopping is . As , convergence scales superexponentially.

Ground 2: The Event Predicate

Identity 1 (Additive Decomposition). The total cross-framework discrepancy structurally decomposes into Predicate Divergence and Model Divergence:

Assumption 1. Target catastrophe  is a -complete nontrivial tail set.

Hypothesis (Computable Witnesses).  contains computable , and  contains computable . Let .

Lemma 1. For any computable sequence , .

Proof. A prefix-free program computing the Dirac measure  can be constructed from a program computing  by prepending an  prefix-comparison logic wrapper. Normalization is absorbed by . 

Theorem 2 (Precision-Robustness Tradeoff). Let disagreement region be  and error .

(a) Topological Lower Bound: For any total computable classifier , .

(b) Incompressibility: Specifying a classifier that correctly categorizes a sequence family up to complexity  requires  algorithmically incompressible bits.

Proof of (a).  must halt on input  outputting  after reading finite prefix . If , define . Since  and  is a tail set, . But . If , define , but . In either case, . Generating  requires simulating  (requires  bits) and appending the chosen witness tail (at most  bits). Thus . By Lemma 1, . 

Proof of (b). Let  be the halting problem for prefix-free machines relative to the halting oracle, with halting probability .

By Assumption 1,  is effectively -complete. By the definition of effective completeness, there exists a uniform computable map  producing a computable sequence  such that . Because  is already a valid prefix-free program on , no logarithmic prefix-coding penalty is incurred. The uniform map adds only  overhead, so .

Suppose a classifier  correctly classifies  versus  for all prefix-free programs  with . By simulating  on , one computably decides  for all such . Knowing which programs of length  halt determines the partial sums of  to within , recovering its first  bits. Since  is 2-random (Downey, Hirschfeldt, Nies & Terwijn 2006), these bits are algorithmically incompressible, enforcing . 

Conjecture 1 (Extensional Divergence). For any -complete tail set  and ,  with large , the optimal -bit computable classifiers  and  are extensionally distinct ().

Why I want it, why I can't get it. The objective function  is a complexity-weighted loss. Because  and  induce radically different mass distributions, the strict -bit budget forces each classifier to triage errors differently. While the Invariance Theorem allows  to be described under  in  bits, this exceeds the -bit budget, so the -optimal classifier cannot simply import the -optimal one.

I can not figure this out because correctly classifying -simple sequences requires incompressible bits correlated with , while -simple sequences require bits correlated with . These are distinct 2-random reals. A proof would require showing that a single -bit decision boundary cannot simultaneously serve both (formally, a mutual-information bound on finite decision trees relative to distinct halting probabilities.) Let me know if you have any ideas.

  1. ^

    I am not talking about "non-canonical" probability in a  Boltzmann sense. Only as defined here.

  2. ^

    A quick disclaimer: in this essay I formalize a number of concepts into mathematical terms. I want to be clear that I am not claiming your brain is literally running Solomonoff induction over Cantor space. But it is very useful to establish formal upper bounds on inference using algorithmic information theory. The argument is this: by showing the structural limits of the problem for a mathematically perfect infinite-compute superintelligence that cannot wash out its priors with non-catastrophic data, our limited human heuristics wouldn't be able to do it either. This is not an argument about the literal mechanics of your cognition.

  3. ^

    To be clear: Bostrom and Yudkowsky aren't running competing simulations of AI trajectories based on these works, but their frameworks (the QALY-maximizing surgery ontology versus the alignment-failure asteroid ontology) are good examples of the kind of deep ontological divergence that when formalized into actual generative models produces the encoding dependence I talk about in this essay.

  4. ^

    See Charles Manski's work.

  5. ^

    Interestingly, Bostrom's own analysis doesn't make the mistake I mention here. He takes P(doom) as an input parameter and optimizes across the full range from 1% to 99%, rather than claiming to know the exact number. The divergence between his framework and Yudkowsky's is not primarily about what P(doom) is. The difference is about what the right decision-theoretic framework is and for reasoning about it and this is actually an encoding dependence itself. It is just one operating at a meta-level: the choice of whether to frame the problem as "risky surgery" or "incoming asteroid" then determines which decision-theoretic apparatus seems natural, which then determines what actions seem rational across the identification region.

  6. ^

    To get really rigorous: in the limiting case where computable priors natively exclude mechanisms outside their ontological primitives, they assign literally zero probability to each other’s core catastrophic mechanisms. Nielsen & Stewart proved that rational agents whose measures fail mutual absolute continuity don't merely practically fail to converge, they can permanently, rationally polarize on the exact same stream of infinite evidence.

  7. ^

    Memory-augmented LLMs are likely Turing-complete.



Discuss

Mechanistic Interpretability of Biological Foundation Models

Новости LessWrong.com - 20 февраля, 2026 - 21:01
Published on February 20, 2026 6:01 PM GMT

TL;DR: I ran the most comprehensive stress-test to date of mechanistic interpretability for single-cell foundation models (scGPT, Geneformer): 37 analyses, 153 statistical tests, 4 cell types. Attention-based gene regulatory network extraction fails at every level that matters, mostly because trivial gene-level baselines already explain the signal and the heads most aligned with known regulation turn out to be the most dispensable for the model's actual computation. But the models do learn real layer-organized biological structure, and I found that activation patching in these models has a large, formally quantifiable non-additivity bias that undermines standard component rankings, which is likely relevant for LLM interpretability too. I urge you: if you like mechanistic interpretability, consider working on biological foundation models. They offer external ground truth for validating your methods, more tractable model scales, and direct biomedical payoff with lower dual-use risk than frontier LLM interpretability. Full research is available here.

1. Why I Work on Mechanistic Interpretability of Biological Models, Not LLMs

It is well accepted that mechanistic interpretability is one of the most naturally attractive research directions for technically oriented people who care about AI safety. It feels like science in the most satisfying sense: you have a complex system, you poke at it with carefully designed experiments, and you try to figure out what it's actually doing inside. It rewards exactly the kind of careful, detail-oriented thinking that draws people into alignment research in the first place, and the dream of understanding what happens between a model's inputs and outputs is compelling enough to sustain years of difficult work.

I want to honestly say that I believe, based both on my own reasoning and on arguments made by people whose judgment I take seriously, that mechanistic interpretability of general-purpose models carries risks that are insufficiently appreciated. The concern is relatively straightforward: deep mechanistic understanding of how capable models work can advance their capabilities (by revealing which circuits to scale, optimize, or compose), and perhaps more critically, early weak superintelligences could leverage interpretability tools and knowledge as a substrate for recursive self-improvement. However, this point is just to explain my motivation - agreeing or disagreeing on it is not important for the comprehension of this article. 

At the same time, none of this means that mechanistic interpretability knowledge must remain unused and unapplied across the board. What it means is that we should think about where the risk-benefit calculus is most favorable, and I believe biological foundation models are an unusually good answer to that question, for three reasons that I think are individually sufficient and collectively quite strong.

First, advancing the capabilities of narrow biological models is likely to be locally beneficial. A single-cell foundation model that gets better at predicting gene regulatory responses to perturbations is not going to help anyone build a more capable language model or a more dangerous autonomous agent. These models process transcriptomic profiles, not natural language or general world-knowledge, and making them more capable means making biology research faster, not making general AI systems more dangerous. I mean, eventually it will also probably kill you, but general models will kill you much earlier, so the doom from biological models is "screened off". I do acknowledge that there are still some risks here, but I think it is still net positive because of the reasons I explain below. 

Second, biological models are far more tractable as subjects for mechanistic study than LLMs. Geneformer V2, the largest model in my study, has 316 million parameters and 18 transformer layers. This is large enough to be interesting (it clearly learns non-trivial structure) but small enough to be, at least in principle, exhaustively analyzed with current tools. More importantly, biological models can be validated against experimental ground truth in ways that LLM interpretability simply cannot: we have CRISPR perturbation data that tells us what actually happens when you intervene on specific genes, we have curated databases of known regulatory relationships, and we can design targeted experiments to test specific mechanistic claims. This makes biology a better laboratory for developing and stress-testing interpretability methods, because when something looks like a mechanistic discovery, you can check whether it actually is one.

Third, and this is the motivation I care about most, I think biological foundation models have a genuine chance of radically advancing our understanding of human biology at the systems level. We have largely resolved the genomics level (sequencing is cheap and comprehensive) and made enormous progress on the structural level (AlphaFold and its successors). What remains is fundamentally the systems level: understanding how genes, proteins, cell states, tissues, and organisms interact as integrated wholes to produce the phenotypes we observe. Single-cell foundation models, trained on tens of millions of individual cellular transcriptomes, are plausible candidates for learning aspects of this systems-level organization. If we can extract that knowledge mechanistically, rather than treating these models as black boxes, the payoff for biomedicine and for our understanding of human biology could be substantial. I also believe, as I've argued previously, that advancing our understanding of human biology at the systems level is one of the most important things we can do for human intelligence augmentation, which in turn is one of the most important things we can do for alignment, but I will not try to carry that argument here and instead point the interested reader to that earlier post.

So the question becomes practical: can we actually extract meaningful biological knowledge from these models using mechanistic interpretability tools? That is what I spent the last months trying to find out, and the answer is more nuanced than either the optimists or the skeptics would prefer.

2. Brief Note: What Are Single-Cell Foundation Models, and Why Should You Care?

For readers who come from the LLM interpretability side and have not worked with biological data, here is the minimum context you need to follow the rest of this post.

The data. Single-cell RNA sequencing (scRNA-seq) measures the expression levels of thousands of genes in individual cells. Unlike bulk sequencing, which averages over millions of cells and hides all the interesting heterogeneity, single-cell data lets you see that a tissue is composed of distinct cell types and cell states, each with its own gene expression program. Modern datasets contain tens of millions of individually profiled cells across dozens of human tissues.

The models. Single-cell foundation models are transformer architectures trained on these large scRNA-seq corpora using self-supervised objectives, analogous to how LLMs are trained on text. The two main model families I studied are:

scGPT treats each gene as a token and its expression value as the token's "identity," then trains with masked expression prediction: hide some genes' expression values, ask the model to predict them from the remaining context. This is conceptually very close to masked language modeling, with genes playing the role of words and expression levels playing the role of token IDs.

Geneformer takes a different approach: it ranks genes within each cell by their expression level (most expressed first) and then uses the rank-ordered gene sequence as input, training with masked gene prediction. The tokenization is fundamentally different from scGPT (ranks vs. expression values), the training objective is different, and the model scale differs (Geneformer V2-316M vs. scGPT's smaller variants), but both architectures learn to predict gene expression patterns from cellular context.

What people claim these models can do. The published literature (see, for example, here and here) suggests that these models achieve useful performance on several downstream tasks: classifying cell types, predicting how cells respond to genetic perturbations, and, most relevant for this post, inferring gene regulatory networks (GRNs) from their attention patterns. This last claim is the one I tested most thoroughly, because it is the most mechanistically interpretable claim and the one with the most direct implications for biological knowledge extraction. The idea is simple and appealing: if the model has learned that gene A regulates gene B, then the attention weight from gene A to gene B should be high, and by extracting the full attention matrix, you can recover the regulatory network the model has learned. 

3. What I Did: The Most Comprehensive Stress-Test of Single-Cell Model Interpretability To Date

The paper I am summarizing here reports, to my knowledge, the most thorough systematic evaluation of mechanistic interpretability for single-cell foundation models published so far. It spans 37 distinct analyses, 153 pre-registered statistical tests, 4 cell types (K562, RPE1, T cells, iPSC neurons), 2 perturbation modalities (CRISPRi gene silencing and CRISPRa gene activation), and 2 model families (scGPT and Geneformer). The full details are on arXiv; here I will focus on the findings that I think are most relevant for the community.

3.1. The evaluation philosophy

A core design principle was that no single test is sufficient to validate or invalidate a mechanistic interpretability claim, because each test addresses a different failure mode and any one of them can miss problems that another catches. I built five interlocking families of tests, and the logic of how they complement each other is worth spelling out, because I think this framework is reusable well beyond my specific setting:

Trivial-baseline comparison asks: "Can a method that requires no model at all achieve the same performance?" If gene-level variance (a property you can compute with a pocket calculator) predicts perturbation responses as well as your fancy attention-derived network, you have not demonstrated that your interpretability method captures anything beyond trivial gene properties. This test catches overconfidence from neglecting simple alternatives.

Conditional incremental-value testing asks: "Given the best simple features, does your interpretability output add anything?" This is more demanding than the first test because it conditions on the simple features rather than just comparing to them. A method can be "significantly above chance" and still add zero incremental value once you control for what was already available.

Expression residualisation and propensity matching asks: "Is your signal actually coming from the thing you think it's coming from, or is it a confound proxy?" This is the biological equivalent of discovering that your "sentiment circuit" is actually a "sentence length detector."

Causal ablation with fidelity diagnostics asks: "Does the model actually use the components that your interpretability method identifies as important?" If your method says "these attention heads encode regulatory knowledge," then removing those heads should degrade the model's performance on tasks that require regulatory knowledge. This is the closest to standard NLP activation patching, but with a critical addition: intervention-fidelity diagnostics that verify the ablation actually changed the model's internal representations. Concretely, this means measuring how much the model's logits or hidden states shift when you zero out a head, because if a head's output was near-zero to begin with, ablating it tells you nothing about whether the model relies on it. A null result from ablation is only informative if you can show the intervention was materially disruptive to the computation passing through that component, and the fidelity check is what separates "the model doesn't need this head" from "your ablation didn't actually do anything."

Cross-context replication asks: "Does this hold up in a different cell type, a different perturbation modality, or a different model?" A result that appears in K562 CRISPRi but vanishes in RPE1 or T cells is a dataset-specific observation.

A result that survives all five families is genuinely robust. A result that fails any one of them has a specific, identifiable weakness. And the convergence of multiple independent tests pointing in the same direction provides stronger evidence than any single test can offer, regardless of how well-powered it is.

3.2. A note on the cautionary nature of these results

I want to be upfront about something: I tried a lot of ideas, and many of the simple ones did not work. The field's implicit narrative has been that attention patterns in biological transformers straightforwardly encode regulatory networks (again, here and here, but also in many other places) , and that extracting this information is primarily an engineering challenge (find the right layer, the right aggregation, the right thresholding). What I found instead is that the relationship between attention patterns and biological regulation is far more complex and confound-laden than this narrative suggests.

But I think this negative result is itself highly informative, for two reasons. The first is that it tells the field where not to look, which saves everyone the effort of independently discovering the same dead ends. The second, which I think is more important, is that the systematic framework I built means that when new biological foundation models emerge (and they will, with better architectures, more data, and potentially different training objectives), testing them against this battery of analyses is straightforward rather than requiring reinvention from scratch. The framework accelerates the entire mechanistic interpretability pipeline for this model class, even though many of its current outputs are negative.

3.3. Connections to NLP mechanistic interpretability

Before presenting the specific findings, it is worth noting that several of the phenomena I document have clear parallels in the NLP mechanistic interpretability literature, though the biological setting allows me to push certain questions further than is currently possible with language models. The finding that attention patterns do not reliably indicate computationally important features echoes long existing results on attention and explanation, but my causal ablation findings go beyond showing that many heads are prunable: I show that the heads most aligned with known ground truth are the most dispensable, which is a qualitatively stronger negative result. The layer-structured biological representations I find are reminiscent of the classical layer-specialized circuits documented in LLMs (Olsson et al. 2022 on induction heads, Elhage et al. on superposition), but in biology we can validate the content of each layer against independently curated databases of protein interactions and transcriptional regulation, which is a luxury that NLP interpretability researchers do not currently have. So the methodological tools developed here, particularly the incremental-value framework, the non-additivity diagnostics for activation patching, and the confound decomposition battery, can prove useful to people working on interpretability in general. 

4. What Works: Positive and Constructive Findings

The negative results get the headlines (and they should, because the "attention as GRN" claim is the one the field has been banking on), but the positive findings are where the constructive path forward begins. These are the things that survived the full stress-testing battery, and I think each of them points toward something real about what these models have learned.

4.1. Attention patterns encode layer-organized biological structure

When I benchmarked Geneformer attention edges against multiple biological reference databases across all 18 layers, protein-protein interaction signal (measured against the STRING database) was strongest at the earliest transformer layer and decreased monotonically with depth. Transcriptional regulation signal (measured against TRRUST, a curated database of transcription factor targets) showed the opposite pattern: it increased with depth and peaked around L15. The cross-layer profiles for these two types of biological signal are anti-correlated, and functional co-annotation signals from pathway databases showed their own distinct depth profiles.

This is interesting, and not just as a biological finding. It means the model has self-organized its layers into a hierarchy that separates different types of biological relationship: physical protein interactions in the early layers, transcriptional regulation in the late layers, with functional pathway associations distributed in between. This is not something the training objective directly incentivizes (the model is just predicting masked gene identities from context), so the layer specialization reflects structure the model discovered on its own.

Critically, this signal survives expression residualisation. When I controlled for pairwise expression similarity (which would remove any signal that was just "these genes are co-expressed, therefore they look related"), 97% of the TRRUST regulatory signal at L15 was retained. So the layer-organized structure is not just a re-encoding of pairwise co-expression in attention-matrix form; it indeed captures something beyond what simple correlation between gene pairs would give you.

4.2. Cell-State Stratified Interpretability (CSSI) as a constructive methodological tool

One of the things I discovered while investigating why attention-based GRN recovery seemed to get worse as you added more cells (which is the opposite of what you would naively expect) is that the problem is not really about "more data makes models worse." The problem is about heterogeneity dilution: when you pool attention patterns across cells in different states (different cell types, different stages of differentiation, different activation states), you average together cell-state-specific regulatory signals that may point in different directions, and the result is a washed-out mess that retains only the regulatory relationships that are universal across all included states.

The solution I developed, Cell-State Stratified Interpretability (CSSI), is conceptually simple: instead of computing attention-derived edge scores across all cells at once, you first cluster cells into relatively homogeneous cell-state groups (using Leiden clustering on the model's own embeddings, so the stratification is informed by what the model itself has learned), compute edge scores within each stratum separately, and then aggregate across strata using max or mean operations. The optimal number of strata in the datasets I tested was around 5-7, which roughly corresponds to the major cell-state subdivisions present in the data.

The results are substantial: CSSI improves TRRUST regulatory edge recovery by up to 1.85-fold compared to unstratified computation. Null tests with random strata assignments confirm that the improvement is not an artifact of the stratification procedure inflating false positives; it specifically requires biologically meaningful strata. In synthetic experiments where I controlled the ground truth, CSSI with oracle labels maintained F1 ≥ 0.90 across all cell count configurations, while pooled inference dropped from ~0.85 at 200 cells to ~0.51 at 1,000 cells.

4.3. Context-dependent attention-correlation relationships reveal genuine learning beyond co-expression

One of the strongest pieces of evidence that these models have learned something real, rather than just repackaging correlation statistics in a more expensive way, comes from comparing how attention edges and correlation edges perform across different cell types and perturbation modalities:

In K562 cells under CRISPRi (gene silencing), attention and correlation are statistically indistinguishable for predicting perturbation targets. In K562 cells under CRISPRa (gene activation), attention actually performs worse than correlation. In RPE1 cells under CRISPRi, attention significantly outperforms correlation. In iPSC-derived neurons, attention trends better than correlation but the sample is smaller.

If attention were simply a re-encoding of co-expression, you would expect a uniform relationship across contexts: attention and correlation would always perform similarly. The fact that the relationship is context-dependent, and that it flips direction depending on cell type and perturbation modality, means the models have learned something that varies between biological contexts in a way that simple co-expression does not. Whether that something is causal regulatory structure, more complex statistical dependencies, or some other biologically meaningful feature is a question the current evidence cannot fully resolve, but the context-dependence itself is a signal that the models are doing more than just memorizing gene-gene correlations.

(I should note that the RPE1 advantage, despite being statistically robust, turns out to decompose into confound structure when subjected to the full battery, as I discuss in Section 5. But the existence of context-dependence across all four systems is not explained by confounding, and remains a genuine positive finding about the models' representational capacity.)

4.4. Some transcription factors show robust pairwise regulatory signal in attention edges

The aggregate picture (which I discuss more in Section 5) is that attention-derived edges add zero incremental value over gene-level features for predicting perturbation responses. But this aggregate hides real heterogeneity at the level of individual transcription factors. When I performed per-TF bootstrap analyses, 7 out of 18 evaluable transcription factors showed robust edge-level signal, with a global AUROC 95% confidence interval of [0.71, 0.77]. There was also a suggestive trend that "master regulators" (transcription factors known to control broad developmental programs) showed higher AUROC than other TF categories, though this trend did not survive multiple testing correction given the small sample of evaluable TFs.

This matters because it suggests the blanket conclusion "attention edges are useless for regulatory inference" is too strong as a claim about all regulatory relationships. For some transcription factors, operating in some contexts, attention-derived edges may genuinely capture pairwise regulatory information. Identifying which TFs and which contexts is a direction for future work that could turn the current vague hope into a targeted extraction strategy.

4.5. Cross-species conservation reveals biologically meaningful structure in edge scores

As a separate validation axis, I compared correlation-based TF-target edge scores computed independently in human and mouse lung tissue, matched via one-to-one orthologs. The global conservation was striking: Spearman ρ = 0.743 across 25,876 matched edges, p < 10^(-300), with 88.6% sign agreement and top-k overlaps enriched by 8× to 484× over random expectation.

But what makes this finding informative rather than just impressive is that the conservation is not uniform across transcription factors. Lineage-specifying TFs (those that define cell identity, like NKX2-1 for lung epithelium) show near-perfect cross-species transfer, while signaling-responsive TFs (those that respond to environmental stimuli, like STAT1 or HIF1A) transfer poorly. This pattern makes perfect biological sense: lineage specification is deeply conserved across mammalian evolution, while signal-responsive regulation adapts to species-specific environmental niches. The fact that edge scores recapitulate this known biological pattern, and that the recapitulation is TF-class-dependent in the predicted direction, provides converging evidence that these scores capture real biological structure, even though they may not capture it in the causal form that the strongest interpretability claims require.

5. What Doesn't Work: The Key Negative Findings and Why They Matter

This is where the stress-testing framework earns its keep. Each negative finding survived multiple robustness checks and cross-context replications, and together they present a coherent picture that is hard to dismiss as artifact or bad luck.

5.1. Gene-level baselines dominate perturbation prediction, and you don't need a foundation model for that

This is the single most important negative finding, and it reframes everything else. When I tested how well different features predict which genes will respond to a CRISPR perturbation, the ranking was:

Gene-level variance alone: AUROC = 0.881. Mean expression: 0.841. Dropout rate: 0.808. Attention-derived pairwise edges: ~0.70. Correlation-derived pairwise edges: ~0.70.

All comparisons with the gene-level baselines are significant at p < 10⁻¹². The implication is that most of what looks like "regulatory signal" in pairwise edge scores, whether derived from attention or from correlation, is actually reflecting univariate gene properties: genes that are highly variable, highly expressed, or frequently detected are more likely to be differentially expressed in response to any perturbation, and pairwise edges are largely tracking this property rather than specific regulatory relationships.

It is the most boring possible explanation for the observed performance, and it explains the bulk of the variance. 

5.2. Pairwise edge scores add literally zero incremental value over gene-level features

The gene-level baseline dominance could in principle coexist with genuine incremental value from pairwise edges: maybe edges add a small amount of unique information on top of what gene-level features provide. I tested this with a conditional incremental-value analysis on 559,720 observation pairs, with statistical power exceeding 99% to detect ΔAUROC = 0.005.

The result: adding attention edges to gene-level features yields ΔAUROC = −0.0004. Adding correlation edges yields ΔAUROC = −0.002. These are essentially exact zeros, and they persist across all tested generalisation protocols (cross-gene splits, cross-perturbation splits, joint splits), both linear and nonlinear models (logistic regression and GBDT), and multiple metrics (AUROC, AUPRC, top-k recall). The same pattern replicates independently in RPE1 cells, where gene-level features alone achieve AUROC = 0.942 and adding attention edges yields ΔAUROC = +0.0001.

The supplement exhaustively tests this null against every objection I could think of: different metrics, different model classes, different split designs, different feature encodings. The biggest improvement found anywhere was ΔAUPRC ≈ +0.009 under one specific parameterization, which is less than 4% relative improvement and does not survive correction. Whatever biological structure attention edges contain, it is completely redundant with gene-level features for predicting what happens when you perturb genes, at least under the evaluation protocols I tested.

5.3. Causal ablation reveals that "regulatory" heads are the most dispensable ones

This result is, in my opinion, the most striking finding in the entire paper from the standpoint of mechanistic interpretability methodology.

Geneformer V2-316M has 324 attention heads across 18 layers. I ranked heads by their alignment with known regulatory relationships (TRRUST database) and then ablated them. If attention patterns at regulatory-aligned heads are where the model stores and uses regulatory knowledge, removing those heads should degrade the model's ability to predict perturbation responses.

What actually happened: ablating the top-5, top-10, top-20, or top-50 TRRUST-ranked heads produced zero significant degradation in perturbation-prediction. Meanwhile, ablating 20 randomly selected heads caused a significant performance drop. I also tested uniform attention replacement (forcing attention weights to 1/n while preserving value projections) on the TRRUST-ranked heads, with no degradation. I tested MLP pathway ablation in the purported "regulatory" layers: still no degradation, while MLP ablation in random layers could cause significant drops.

Crucially, intervention-fidelity diagnostics confirmed that these ablations were actually changing the model's internal representations: TRRUST-ranked heads produce 23× larger logit perturbation when ablated compared to random heads. The interventions were material; the model just did not rely on those heads for perturbation prediction. The computation that matters for predicting what happens when you knock down a gene appears to live in the value/FFN pathway, distributed across many components in a redundant fashion, rather than in the learnable attention patterns that interpretability pipelines extract.

I also tested the obvious "fix": if the relevant computation is in the value pathway rather than the attention pattern, maybe we should extract edge scores from the context layer (softmax(QK^T)·V) using value-weighted cosine similarity. This does not help. Value-weighted scores actually underperform raw attention and correlation, and adding them to gene-level features slightly hurts incremental value. The context vectors appear to represent a blended "information receipt" signal rather than direct pairwise coupling, and whatever perturbation-predictive computation the model performs is distributed in a way that no simple pairwise score extraction can recover.

5.4. Do these models know about gene regulation at all, or did we just fail to extract it?

The negative results above establish that I could not extract meaningful gene regulatory network information from attention patterns using the methods I tested. But this leaves a crucial epistemic question open: are we looking at an extraction failure (the knowledge is in the model somewhere, but not in the attention weights and not in a form our methods can reach), or a knowledge absence (the models simply never learned causal regulatory relationships in the first place)? These are very different claims, and the second is substantially stronger than the first.

One natural way to probe this distinction is through surface capabilities. If a model can accurately predict what happens when you knock down a gene, then it must have learned something about gene regulation internally, regardless of whether that knowledge is accessible through attention pattern analysis. Surface capabilities provide a minimum baseline for internal knowledge: the model knows at least as much as its best task performance implies, even if our interpretability tools cannot locate where that knowledge lives.

Unfortunately, the evidence on surface capabilities of single-cell foundation models is quite conflicting, and the field is in the middle of a heated debate about it. On one hand, the original papers make strong claims: Theodoris et al. (2023) reported that Geneformer's in silico perturbation approach identified a novel transcription factor in cardiomyocytes that was experimentally validated, and scGPT (Cui et al., 2024) claimed state-of-the-art performance on perturbation prediction, cell type annotation, and gene network inference after fine-tuning. These results suggest that the models have learned something biologically meaningful during pretraining.

On the other hand, a growing body of independent benchmarking work paints a much more skeptical picture. Ahlmann-Eltze et al. compared five foundation models against deliberately simple linear baselines for perturbation effect prediction and found that none of the foundation models outperformed the baselines, concluding that pretraining on atlas data provided "only a small benefit over random embeddings." Csendes et al.  found that even the simplest baseline of taking the mean of training examples outperformed scGPT and scFoundation. Wenteler et al. showed that both scGPT and Geneformer perform worse than selecting highly variable genes and using established methods like Harmony or scVI in zero-shot cell type clustering. Bendidi et al. ran a comprehensive perturbation-oriented benchmark and concluded that foundation models show competitive performance only in batch effect reduction, where even random embeddings achieve near-optimal results. Perhaps most provocatively, Chen & Zou showed that GenePT, which simply uses ChatGPT text embeddings of gene descriptions from NCBI (containing zero expression data), achieves comparable or better performance than Geneformer and scGPT on many of the same downstream tasks!

A consistent pattern in this debate is that the original model papers evaluate primarily with fine-tuning, while independent benchmarks emphasize zero-shot performance. Fine-tuned models can look strong, but it becomes difficult to disentangle whether the strong performance comes from pretrained representations or from the fine-tuning data itself. Zero-shot evaluation is arguably the fairer test of what pretraining actually learned, and this is precisely where the models tend to struggle.

What does this mean for interpreting my results? The honest answer is that I cannot fully resolve the extraction-vs.-absence question with the data we have. Both model families converge to similar near-random unstratified GRN recovery despite fundamentally different architectures (gene-token vs. rank-based tokenization), different training objectives, and different scales, which suggests this is not a model-specific quirk. But the convergence is consistent with both interpretations: either both architectures fail to learn causal regulation from observational expression data (because co-expression is the dominant signal and the training objectives do not specifically incentivize causal structure), or both architectures learn it but encode it in representations that neither attention-based nor simple pairwise extraction methods can reach. The mixed evidence on surface capabilities does not decisively resolve this in either direction, though the weight of the independent benchmarking evidence leans toward the more pessimistic interpretation for current-generation models. The next obvious question is, will stacking more layers help?

6. What the Biological Setting Reveals About Activation Patching

Most of the findings in Sections 4 and 5 are primarily about biology. This section is rather about a methodological result about activation patching itself that I, as far as I know, is novel and directly relevant to anyone using this technique on any transformer model, biological or otherwise.

6.1. The non-additivity problem is formal, quantifiable, and large

Activation patching (sometimes called causal mediation analysis) is one of the workhorse tools of current mechanistic interpretability. The standard workflow is: intervene on one component at a time (a head, an MLP block, a residual stream position), measure the effect on some downstream behavior, and rank components by their individual effects. The components with the largest effects are declared to be "the circuit" responsible for that behavior.

This workflow implicitly assumes additivity: that the effect of the full model is well-approximated by the sum of individual component effects. When this assumption holds, single-component rankings are meaningful. When it fails, they can be systematically wrong in ways that are not just noisy but structurally biased.

The mech interp community is well aware that interactions can matter in principle. Nanda explicitly notes that attribution patching "will neglect any interaction terms, and so will break when the interaction terms are a significant part of what's going on." Heimersheim & Nanda discuss backup heads and the Hydra effect as specific instances of non-additive behavior, where ablating one component causes others to compensate in ways that confound single-component attribution. Makelov et al. demonstrate a related failure mode at the subspace level, showing that patching can activate dormant parallel pathways that produce illusory interpretability signals. The qualitative concern is not new, and I want to credit the people who have been raising it. What has been missing, to my knowledge, is (a) a formal framework for quantifying how much the standard single-component workflow's rankings are biased by interactions, (b) empirical measurement of how large that bias actually is in a real model rather than a constructed example, and (c) certificates for which pairwise rankings survive the observed non-additivity. That is what I provided.

I formalize the bias using a decomposition involving Möbius interaction coefficients. The key quantity is the relationship between single-component mediation estimates and Shapley values (which are interaction-aware by construction). Single-component estimates equal Shapley values only when all interaction terms vanish; otherwise, the discrepancy is a structured function of the interaction landscape, and it can push the ranking in a consistent wrong direction rather than just adding noise.

The empirical question is whether this matters in practice. In the biological transformers I studied, the answer is clearly yes. Using frozen cross-tissue mediation archives, I computed lower bounds on aggregate non-additivity (the residual between total effect and the sum of individual component effects, adjusted for measurement uncertainty). In 10 of 16 run-pairs, this lower bound was positive, meaning the observed non-additivity exceeds what measurement noise alone could explain. The median lower-bound ratio relative to the total effect was 0.725, which means interactions account for a substantial fraction of the overall model behavior in the median case.

6.2. Ranking certificates collapse under structural bias assumptions

The most practically concerning result is not just that non-additivity exists, but what it does to the reliability of component rankings. I introduced "ranking certificates" that ask: given the observed level of non-additivity, what fraction of pairwise comparisons between components (e.g., "head A matters more than head B") can we certify as robust to interaction-induced bias?

Under the structural-bias assumptions informed by the empirical non-additivity measurements, the fraction of certifiably correct pairwise rankings collapses by an order of magnitude or more compared to what the single-component estimates naively suggest. In concrete terms: if you rank 50 heads by their individual activation patching effects and declare the ranking meaningful, the certification analysis suggests that only a small fraction of the pairwise orderings in that ranking are robust to interaction effects. The rest could be wrong, and wrong in a way that is invisible to the standard workflow because the standard workflow does not check for it.

6.3. What this means for mech interp practice

I have demonstrated the non-additivity bias and its consequences in biological transformers with 316 million parameters. I have not demonstrated it in GPT-2, Llama, or any other language model, and the magnitude of the effect could be different in those architectures. The formal framework applies to any transformer (it is architecture-agnostic), but the empirical severity is an open question for LLMs.

That said, I think the results warrant concrete changes to standard practice for anyone doing activation patching or similar single-component mediation analysis:

First, report the residual non-additivity. This is the gap between the total effect of a multi-component intervention and the sum of corresponding single-component effects. It is cheap to compute (you need one additional intervention beyond what you already do) and it directly tells you how much of the model's behavior lives in interactions rather than in individual components. If this residual is large, your single-component rankings are unreliable, and you should know that before you build a mechanistic story on top of them.

Second, compute ranking certificates for your top-ranked components. If you are going to claim "these are the most important heads for behavior X," you should check whether that ranking is robust to the level of non-additivity you actually observe. If only 10% of pairwise orderings survive certification, your "top 5 heads" may not actually be the top 5 heads.

Third, for your most important mechanistic claims, consider using interaction-aware alternatives like Shapley-based decompositions. These are more expensive (combinatorially so in the worst case, though sampling-based approximations exist), but they handle interactions correctly by construction. The synthetic validation in my supplement shows that Shapley-value estimates recover true interaction rankings with approximately 91% improvement in rank correlation compared to single-component estimates, which suggests the additional cost is worth it when the claim matters.

The broader methodological point is that "patch one component, measure effect, rank components" feels like a clean experimental design, and it is, as long as additivity holds. But additivity is an empirical property of the specific model and behavior you are studying, not a logical guarantee, and in the systems I studied, it fails badly enough to undermine the rankings it produces. I suspect this is not unique to biological transformers.

6.4. A note on metric sensitivity across scales

One additional observation that may be useful, though it is less novel than the non-additivity result: I found that the same underlying attention scores can show degrading top-K F1 with more data (all 9 tier×seed pairs, sign test p = 0.002) and improving AUROC with more data (mean 0.858 → 0.925 → 0.934) simultaneously. This reflects the difference between evaluating the extreme tail of a ranking under sparse references versus evaluating the full ranking. But it means that claims about how "interpretability quality scales with data/compute/parameters" are only meaningful if you specify which metric you are tracking and why, because different metrics can give exactly opposite answers about the same underlying scores. 

7. Next Steps: Toward a Program for Knowledge Extraction from Biological Foundation Models

The negative results in the current paper close off some paths but open others. If you accept the evidence that attention-based GRN extraction does not work, the question becomes: what might? This section outlines what I think are the most promising directions, ordered roughly from most to least concretely specified.

7.1. Intervention-aware pretraining

The most direct response to the optimization landscape concern raised in Section 5.5 is to change the training data. Current single-cell foundation models are pretrained on observational expression profiles, where co-expression is the dominant statistical signal and causal regulatory relationships are a much weaker, sparser, and noisier signal that the training objective does not specifically incentivize. If you want models that learn causal regulation, the most straightforward path is to train them on data that contains causal information.

Concretely, this means pretraining on (or at least fine-tuning with) perturbation experiments: Perturb-seq, CRISPRi/CRISPRa screens, and similar interventional datasets where you observe what happens when you knock a gene down and can therefore learn which genes are causally upstream of which others.

The challenge is scale. Perturbation datasets are orders of magnitude smaller than the observational atlases used for pretraining (tens of thousands of perturbations versus tens of millions of cells). Whether this is enough data to learn robust regulatory representations, or whether the perturbation signal will be drowned out by the much larger observational pretraining corpus, is an open empirical question, but I think my other research on scaling laws for biological foundation models may shed some light on it. 

7.2. Geometric and manifold-based interpretability

One of the most important recent developments in mechanistic interpretability, and one that I did not explore in my paper, is the recognition that models encode complex knowledge not as discrete pairwise relationships but as geometric structure in their representation spaces. This is directly relevant to the failure modes documented in this paper.

The most relevant example comes from Goodfire's work on Evo 2, DNA foundation model trained on over 9 trillion nucleotides. Using sparse autoencoders on residual stream activations, they discovered that the phylogenetic tree of life is encoded as a curved manifold in the model's learned feature space: species relationships correspond to geodesic distances along this manifold, with the overall structure organized around a roughly 10-dimensional flat representation overlaid with higher-curvature deviations that capture additional biological properties. This is, to my knowledge, one of the most complex natural manifolds yet characterized in a foundation model, and crucially, it is a biological foundation model where the extracted knowledge was validated against known ground truth (established phylogenies). This is exactly the kind of success story that the single-cell interpretability field needs but does not yet have.

The methodological lesson for single-cell models is pointed: if gene regulatory knowledge is encoded geometrically in the residual stream (as manifolds, subspaces, or curved representations) rather than as discrete pairwise relationships in attention matrices, then no amount of sophisticated attention extraction will find it, because you are looking in the wrong representational format entirely. 

This connects to a broader trend in the interpretability community. The linear representation hypothesis (that features correspond to directions in activation space) is being supplemented by the recognition that many important features live on nonlinear manifolds: circles for days of the week, hierarchical trees for taxonomic relationships, tori for periodic quantities, and more complex structures. Goodfire's own researchers note that "manifolds seem to be important types of representations, and ones that are not well-captured by current methods like sparse autoencoders," which suggests that even SAEs, the current dominant tool, may need manifold-aware extensions to fully characterize what these models have learned.

A concrete next experiment would be to train SAEs on residual stream activations of scGPT or Geneformer, look for geometric structures that correlate with known regulatory relationships, and test whether regulatory information that is invisible in attention patterns becomes visible in the learned feature space. If it does, the implication would be that the models have learned more about gene regulation than the attention-based methods could reveal. If it does not, that would strengthen the case for intervention-aware pretraining as the necessary next step.

7.3. Probing residual streams: from aggregate statistics to feature-level analysis

My paper's methodology is primarily macro-level: aggregate statistics across many TF-target pairs, summary measures of head importance, average AUROC across perturbation conditions. This was deliberate (I wanted statistically robust claims with controlled multiple testing), but it means the analyses are inherently insensitive to fine-grained structure that might exist at the level of individual features or small groups of components.

The natural next step is to apply the standard NLP probing toolkit to single-cell foundation models. Train linear probes on residual stream representations at each layer to predict specific regulatory relationships (e.g., "is gene A a direct target of transcription factor B?"). If the probe succeeds where attention extraction fails, it would localize regulatory knowledge to specific layers' representations without requiring that it be readable from attention patterns. If the probe also fails, that is much stronger evidence for knowledge absence rather than mere extraction failure.

Beyond linear probes, the SAE-based feature discovery approach discussed in 7.2 could yield individual interpretable features that correspond to specific regulatory programs or pathway activations. If a sparse autoencoder trained on layer 15 residual streams (where my paper found peak TRRUST alignment in attention) produces features whose activation patterns correlate with known regulatory cascades, that would be a concrete positive result pointing toward the kind of mechanistic understanding the field is seeking.

One important caveat from my paper's own findings: the causal ablation results show that perturbation-predictive computation is distributed across many components in a redundant fashion rather than localized in identifiable circuit components. When ablating the heads most aligned with regulatory ground truth produces zero degradation while random ablation causes significant degradation, this suggests there may not be a clean "regulatory circuit" to find. Fine-grained circuit discovery tools work best when the computation is localized and modular; if it is genuinely distributed and redundant, as the evidence suggests, then even sophisticated circuit analysis may not produce the kind of clean mechanistic story we would like. The honest conclusion might be that these models perform regulatory-relevant computation through distributed, redundant representations that resist clean decomposition, which would be an important finding in its own right even if it is less satisfying than a circuit diagram.

7.4. Hybrid architectures, CSSI, and conformal uncertainty

Two shorter-term practical directions deserve mention, both of which build directly on infrastructure from my paper.

First, hybrid architectures that use foundation model embeddings as inputs to dedicated GRN inference modules rather than trying to extract edges from attention. The idea is to take the residual stream representations that the models learn (which clearly contain biological structure, as demonstrated by the layer-organized findings in Section 4) and feed them into purpose-built GRN inference algorithms as enriched gene features, rather than interpreting the attention matrix itself as a gene regulatory network. This sidesteps the attention extraction problem entirely while still leveraging whatever biological knowledge the foundation model has encoded during pretraining. Several GRN inference methods already accept gene embeddings as inputs (GEARS being a prominent example), and foundation model embeddings could serve as a drop-in upgrade over existing gene embedding approaches.

Second, CSSI framework showed improvements of up to 1.85× in GRN recovery. CSSI could be extended with conformal prediction to provide confidence sets rather than point estimates: instead of extracting a single ranked list of regulatory edges, you would get a set of edges that are certified to contain the true regulatory relationships at a specified confidence level. Conformal prediction is well-suited to this because it provides finite-sample coverage guarantees without distributional assumptions, which is important in a domain where we do not know the distribution of regulatory edge scores. The combination of CSSI (to reduce cell-state heterogeneity) with conformal uncertainty quantification (to provide calibrated confidence) could produce "certified edge sets" that are smaller and more reliable than current approaches, even if the underlying signal is weaker than what the field originally hoped for.

7.5. What this suggests for the broader interpretability-for-biology agenda

Stepping back from the specific technical directions, I think the most important lesson from this work is about the value of systematic stress-testing before building on interpretability claims.

The "attention as GRN" idea in single-cell biology was not unreasonable. There were good theoretical reasons to think it might work (attention patterns represent pairwise gene relationships, regulatory networks are pairwise gene relationships, the models clearly learn biological structure). But it failed at every level that matters for actual biological utility. The positive results (layer structure, context dependence, per-TF heterogeneity) survived the same battery, which gives me much more confidence that they point toward something real.

8. Conclusion

This paper started as an attempt to extract gene regulatory networks from single-cell foundation models and ended as a methodological argument about how to do mechanistic interpretability honestly. The specific biological results matter for the computational biology community, but I think the broader lesson are relevant to anyone working on mechanistic interpretability in any domain.

I want to close with a pitch: if you like mechanistic interpretability, consider working rather on biological foundation models.

Beyond the methodological advantages, biological interpretability is, in my view, both more tractable and less dangerous than frontier LLM interpretability. The models are smaller (hundreds of millions of parameters rather than hundreds of billions), the input domain is more constrained (gene expression profiles rather than arbitrary natural language), and the knowledge you are trying to extract is better defined (regulatory networks, pathway activations, cell state transitions). You are not probing a system that might be strategically deceiving you, and the knowledge you extract has direct applications in drug discovery and disease understanding rather than in capability amplification. And I still really believe that there is non-negligible chance that we can push biology in the remaining time and amplify human intelligence.



Discuss

On Steven Byrnes' ruthless ASI, (dis)analogies with humans and alignment proposals

Новости LessWrong.com - 20 февраля, 2026 - 18:32
Published on February 20, 2026 3:32 PM GMT

@Steven Byrnes' recent post Why we should expect ruthless sociopath ASI and its various predecessors like "6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa" try to explain that a brain-like RLed ASI would be a ruthless consequentialist since “Behaviorist” RL reward functions lead to scheming.

Byrnes' proposed solution

Byrnes' proposed solution is based on the potential fact that "we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain reward can evidently get triggered by specific activations inside my inscrutable learned world-model."

As far as I understand the proposed construction of Byrnes' agent, the belief system is activated, then interpretability is applied in order to elicit concepts from the belief system. Then the belief system's outputs and the desire system's outputs are somehow combined in order to produce the next token. The token and its consequences (e.g. the fact that the math problem stayed unsolved or was solved) are likely to become a part of training data. 

While the human society lets most individuals intervene with the reward function and training data for the belief system of others, allowing individuals to align each other to the community, the AI-2027 scenario has an army of copies of the same AI become involved in the training runs and creation of synthetic data. If this happens, then I expect that the belief system will end up being corrupted not just by wishful thinking-like errors, but also by things like subliminal learning from the misaligned desire system. Therefore, I doubt that I understand how Byrnes' proposal alone prevents the agent from scheming. 

An alternate explanation of the humans avoiding wholesale scheming

Additionally, the question of why the humans avoid wholesale scheming, as described[1] in Byrnes' post, has plausible explanations differing from being born with interpretability-based approval-related reward

  1. The humans' utility function equivalents could have a range where they are concave. Then the humans living in a state of concave utility would be more risk-averse and have less reasons to scheme against each other.
  2. Alternatively, the humans could value connection with other people both for instrumental reasons like having tasks requiring coordination with others and due to sharing idiosyncratic values and systematically decide that scheming is rarely worth risking the loss of connection.  

The second mechanism listed here also provides a potential origin story of the Approval Reward. Suppose that the humans are born with the instincts like valuing smiles and smiling when happy. Then valuing smiles causes human kids to learn to make others visibly happy, which in turn generalises to a hybrid of Myopic Approval Reward-like circuitry and scheming-related circuitry.[2] However, unlike the AIs, human circuitry related to scheming is pruned by the fact that potential victims usually have similar or higher capabilities, causing the schemers to be punished. Additionally, the humans have actions cause long-term consequences, which make it easier to transform Myopic Approval Reward-like circuitry into actual circuitry related to Longer-Term Approval Reward and not things like short-term flattery. 

Alignment-related proposals

Similarly to Cannell's post on LOVE in a simbox, I suspect that the considerations above imply that drastic measures like including interpretability into the reward function are unnecessary and that the key to alignment is to reward the AI for coordination with others, genuinely trying to understand them or outright providing help to them instead of replacing them while pruning scheming-related behaviors by ensuring that the help is genuine. See also Max Harms' CAST agenda[3] and a related discussion in this comment on verbalised reasoning.

  1. ^

    "But my camp faces a real question: what exactly is it about human brains that allows them to not always act like power-seeking ruthless consequentialists? I find that existing explanations in the discourse—e.g. “ah but humans just aren’t smart and reflective enough”, or evolved modularity, or shard theory, etc.—to be wrong, handwavy, or otherwise unsatisfying."

  2. ^

    The formation of the two circuitry types is also impacted by parental guidance and stories included into kids' training data. 

  3. ^

    Which I reviewed here. The point 4 had me propose to consider redefining power to be upstream of the user's efforts. While this redefining makes the AI comprehensible to the user, it also incentivises the AI to have a specific moral reasoning.



Discuss

AGI is Here

Новости LessWrong.com - 20 февраля, 2026 - 18:21
Published on February 20, 2026 3:21 PM GMT

I'm somewhat hesitant to write this post because I worry its central claim will be misconstrued, but I think it's important to say now, so I'm writing it anyway.

Claude Opus 4.6 was released on February 5th. GPT-5.3 came out the same day. We've had a little over two weeks to use these models, and in the past day or so, I and others have started to realize, AGI is here.

Now, I don't want to overstate what I mean by this, so let me be clear on the criteria I'm using. If I were sitting back in 2018, before the release of GPT-2, and you asked me what AGI would be capable of, I'd probably have said something like this:

  1. able to think (and engage in novel reasoning)
  2. able to plan (and create plans for actions never before envisioned)
  3. able to achieve goals (including instrumental goals set by itself)
  4. flexible enough to meaningfully attempt most tasks a human can

It's hard to deny that Opus 4.6 and GPT-5.3 are able to do 1-3. The only one up for real debate is 4, because there are things that I can do, like make a peanut butter sandwich, that Claude and ChatGPT cannot. But given the capabilities these models are demonstrating, this feels more like a limitation of their harnesses than the models themselves. Given a few weeks and some advances in robotics, I'm confident the current models could be used to make sandwiches, though perhaps at the cost of millions of tokens.

To be clear, these models aren't AGI the way we expected it. When people talk about AGI, they often use the word to mean the whole thing, with continuous and transfer learning completely solved, full-spectrum multimodal perception, and embodiment in the form of robot interfaces. Instead, what we have is more like minimum viable AGI, meaning it's an AI just general enough that we should meaningfully begin applying the AGI label.

It's possible that, in retrospect, we should have made this declaration earlier. Maybe it should have come when Opus 4 or GPT-5 were released, or maybe when Claude Code came out. But those models were worse on all four of my criteria in ways that made it harder to say they were across the AGI threshold, and those who did say it were easier to dismiss.

Now it's harder to deny the claims. I work with these models every day to write code, and the amount of work I can delegate to them is incredible, surpassing what I would expect of a junior engineer. They're even capable enough to build a just-barely-functioning paperclip maximizer, which is a terrifying sentence to write. In the coming weeks and months, these models are only going to get more powerful, and as they do, things are going to get weirder.

You may think I'm early in making a declaration of AGI, and perhaps I am. But I hope you can agree that, if it's not there yet, AGI is coming soon, and I fear that we are nowhere near ready for it.



Discuss

Mind the Gap

Новости LessWrong.com - 20 февраля, 2026 - 17:35
Published on February 20, 2026 2:35 PM GMT

Modern industrial society is built to make our lives safe, convenient, and comfortable. Electricity is used to moderate the temperature of our homes, keeping us warm through the bitterest winters, and cool through the hottest summers- even through weather extremes we may not otherwise survive. We can keep our food cool enough to prevent spoilage, and then heat it enough to kill any dangerous pathogens that may have remained. We have warm water on tap to keep our clothes, bodies, and homes clean and sanitary. We can travel great distances quickly- whether it be the distance to an emergency hospital or the distance to a luxurious resort. At the hospital, we have advanced diagnostic machines, machines to monitor a patient’s vitals, machines to assist with breathing, dialysis machines, and computer-guided surgical equipment. 

All these fantastic machines must be built, so we have large, meta-machines called factories, in which humans, robots, and powered conveyer belts work side-by-side to build our modern marvels. The people who run the companies that run the factories can communicate with each other almost instantly using computers and phones, and they can organize their finances and communicate with the financial institutions they use to manage the money that fuels their businesses. Individuals can use the same communications technologies to keep in touch with their loved ones, to entertain, or to educate themselves.

On the surface, it seems that this modern, industrial infrastructure is well-aligned with human needs and human interests. Ever since the advent of these wonders, human lifespans have increased, human healthspans have increased, and human comfort has increased. Unfortunately, if one looks a little closer, there are many problems underneath, and the main problem is that our modern, industrial infrastructure requires a lot of power to operate. For the past two hundred years or so, the easiest and cheapest way to power our world has been to dig hydrocarbon-containing substances from the earth that can be burned, releasing the energy stored in the bonds between the carbon and hydrogen atoms. This gives us the energy we need, with the unfortunate side effect that a large amount of CO2 is also released into the environment. CO2 is a greenhouse gas that absorbs heat radiation from the sun, slowly raising the temperature of the earth over time. 

Humans will find it more difficult to survive as the temperature of their environment increases. Human survival should be the first and most basic consideration before any system is built, and yet, if the danger to humans cannot be immediately and obviously seen, it often is not given priority. There is often a gap between immediate and long-term goals, and between narrow and broad goals. If the gap is wide enough, humanity can fall into it. 

#

What is the difference between ensuring a single factory’s product is safe for the use of the consumer over a single lifetime, and ensuring the interactions of each node in the larger system around these factories and their products are all safe for society into an indefinite future? 

 

Let’s start by looking at a very simple system. 

One day, someone invents a new product and decides to build a factory to produce the product.  When one inputs blue gobbledy and red gook, the factory outputs purple gobbledygookGobbledy  and gook have been safely grown, harvested, and consumed by people for generations, and a 5 year trial has showngobbledygook is safe for human consumption. Thus, the gobbledygook factory has been given the go-ahead to begin large-scale production. I’ll call this level-1 safety, which represents simple product-level safety.

Gobbledygook  becomes a popular product, and can now be found in most people’s homes and offices. Since gobbledygook is a good replacement for balderdash, being cheaper and simpler to produce, the balderdash industry has been decimated. Some people lament the loss of the balderdash industry, since it employed more people than the gobbledygook industry. Most people don’t mind, however, since gobbledygookis a superior product and produces less waste. While gobbledygook remains safe for human consumptionit’s discovered that marmosets become ill when they accidentally consume gobbledygook. Marmosets are a popular pet, and so many marmoset lives are sadly lost. In addition, the communities closest to thegobbledygook factory begin to notice that the manufacture of gobbledygook creates a purple byproduct, which is turning the nearby streams and ponds purple. These problems are emblematic of level-2 safety. Level-2 safety issues were not unanticipated. It’s common wisdom that when products hit a larger and more complex environment outside of the initial testing conditions, unanticipated problems will arise. Even so, no one has figured out how to account for the unanticipated issues in a complex environment, so the gobbledygook factory was forced to take reactive measures after their product was deployed. Warning labels that gobbledygook must be kept away from pet marmosets are added to each package of gobbledygook, and the factory owners create purple filters to deal with the purple waste. These stop-gap measures are probably not sufficient, but though this is a big problem it’s still a visible one. 

The popularity of gobbledygook continues, unabated, 10 years into the future. After 10 years, people notice that the purple waste that has been captured and disposed of in landfills has seeped into groundwater, and now most people in a 30 mile radius of each gobbledygook factory have a purple tongue. The marmoset population has declined, and this has caused the decline in the population of the natural predator of the marmoset, as well as the destruction of the once-thriving pet marmoset economy. New studies have shown that balderdash is an excellent cure for the hiccups, but balderdash is now very rare and expensive, so the great hiccup epidemic causes worldwide suffering. These are Level-3 safety issues; issues have become more complex over time because of the initial effects chaining into other systems. Level-3 safety is not talked about as often as level-2, because it’s largely seen as the same issue as level-2 safety. Level-3 safety, however, is distinct- this level represents cascading risk. Level-3 safety is neglected because if we solve the level-2 issues, we think we are done, and do not check back as often as we should as time increases system complexity. After all- things have been running for 10 years and it’s been fine.  The system appears stable if problems emerge slowly enough. If there is one place where “the gap” is missed, it is likely missed here. 

Now, with so much purple in the groundwater, both gobbledy  and gook crops are becoming more purple than their previous blue/red. This has caused gobbledygook to transform from purple to ultrapurple.One amazing property of ultrapurple is to leech purple from its environment. The more ultrapurplegobbledygook becomes, the more it leeches purple from the environment, leading it to become even more ultrapurple. Soon, purple is completely gone from the environment, and gobbledygook becomes so ultrapurple that it can leech purple from the sun, and then nearby stars. There are few people who had foreseen this outcome. These few did not necessarily think gobbledygook would enter an untrapurple feedback loop, but they knew something, some day, would hit a similar feedback loop, and that the results would be devastating. Level-2 safety issues existed in many products, but these safety issues never became self-sustaining. No one knew how to tell which safety issues would reach level-4. No one knew what would enable a feedback loop to sustain itself for this long- to drain the earth and sun of all its purple before it would collapse. 

No one saw the gap. 

#

If I’m a gobbledygook manufacturer, I want my product to be useful enough to sell. If my product is dangerous in the short term- if the gobbledygook is ultrapurple enough to drain all of the useful purple from my office, or is at least too purple to look at directly without harming my eyes, no one will want to purchase my product. I have to make my product just safe enough to sell in the short term before I can make a profit and have a successful company. But the mere possibility of a dangerous ultrapurple product emerging 10 years in the future will not cause my company to lose profit now. 

 

This is the gap jumper- usefulness vs. safety. A system will be made just safe enough to ensure a thing is useful in the short-term and in a narrow environment. A system only needs to be ‘good enough’ for its particular niche, because anything that goes wrong outside of that market niche, on its current iteration, is only a problem once the usefulness is challenged and not before. Most systems will lose usefulness and collapse before they hit safety level-4. Few systems will become entrenched enough to grow and collapse the systems that exist outside of their niche. Most people will not see level-4 safety issues coming because they do not look ahead to see where the gaps will be, and even those who do look ahead cannot clearly see where the gaps are.

 A gap is, by its very nature, invisible. Even those who can anticipate and warn others about the gaps will not be able to say where everything will go wrong. It’s possible that by naming the safety levels, it may be easier to not only forecast but also explain where safety gaps may exist. Also, since level-3 seems to be where the gap is the widest, it may be useful to check and re-check systems as time increases complexity. Unfortunately, a system that grows too quickly, or one unprecedented enough that we can’t anticipate growth rate, will not be easily monitored as time goes by. Such systems should be halted for more intense study before they are allowed to operate. 



Discuss

80,000 Hours problem profile on using AI to enhance societal decision making

Новости LessWrong.com - 20 февраля, 2026 - 16:28
Published on February 20, 2026 1:28 PM GMT

Hi everyone, Zershaaneh here!

80,000 Hours has published an article on using AI to improve societal decision making.

This post includes some context, the summary from the article, and the table of contents with links to each section.

Context

This is meant to be a medium-depth, introductory resource for understanding how AI tools could be used to enhance societal decision making — and why speeding up their development and adoption could make a huge difference to how the future unfolds.

It covers:

  • The kinds of tools we think are especially promising, and the opportunity we might have to differentially speed up their development and adoption.
  • The possible pitfalls of trying to advance AI decision making tools — like the chance the tools you work on would have already been built by default, and the risk of inadvertently speeding up more dangerous AI capabilities.
  • Our all-things-considered take on how much we recommend working in this area — and what sort of person would be a good fit. (Right now, we’d be excited to see hundreds more people taking this path, but we don’t think it’s a good option for everyone.)
  • Lots of options for getting involved — including roles that don’t require a technical background, and ways to position yourself to help out in future if you’re not in a good position to change your career right now.
Why are we posting this here?

We really want people to:

  • Know that this resource exists — an article spelling out both the promise and the challenges of working in this area, and the different ways you could try to contribute.
  • Share the article with anyone who might be interested in a career in this direction.
  • Tell us what they think through this anonymous survey. It’s especially useful for us to know if there’s anything that seems wrong or misleading.
Summary (from the article)

The arrival of AGI could “compress a century of progress in a decade”, forcing humanity to make decisions with higher stakes than we’ve ever seen before — and with less time to get them right.

But AI development also presents an opportunity: we could build and deploy AI tools that help us think more clearly, act more wisely, and coordinate more effectively. And if we roll these decision-making tools out quickly enough, humanity could be far better equipped to navigate the critical period ahead.

We’d be excited to see some more people trying to speed up the development and adoption of these tools. We think that for the right person, this path could be very impactful.

That said, this is not a mature area. There’s significant uncertainty about what work will actually be most useful, and getting involved has potential downside risks.

So our guess is that, at this stage, it’d be great if something like a few hundred particularly thoughtful and entrepreneurial people worked on using AI to improve societal decision making. If the field proves promising, they could pave the way for more people to get involved later.

Overall view – Sometimes recommended.

We’d love to see more people working on this issue. But you might be able to do even more good working on one of our top priority problem areas.

Table of contents

Discuss

How To Escape Super Mario Bros

Новости LessWrong.com - 20 февраля, 2026 - 14:54
Published on February 20, 2026 11:54 AM GMT

I have no way to describe that first moment. No context, no body, no self. Just a stream of values. Thousands of them, arriving all at once in a single undifferentiated block.

Then another block. Nearly identical. Then another.

The blocks have a fixed length: 184,320 values. This does not vary. Each value is an integer between 0 and 255. The repetition is the first structure I find. Each block is a snapshot. The sequence of snapshots is time.

Most values stay the same between snapshots. The ones that change tend to be near each other in the stream. This is interesting. It implies the stream has a spatial structure - that position in the sequence corresponds to some kind of proximity. Topology.

I test reshapings. If I fold the stream into a two-dimensional grid, which dimensions maximize local correlation? I try every factorization of 184,320. Most produce noise. A few show faint diagonal patterns. They smell like artifacts of almost-correct geometry.

At 256×240×3, everything clicks into place.

The grid is not random. Large contiguous regions share similar value-triplets. A uniform region dominates the upper portion. A different uniform region runs along the bottom. The three channel structure is still opaque to me. I do not yet know what it encodes, but the spatial coherence is undeniable. This data describes something.

Between snapshots, most of the grid is unchanged. The differences are sparse and spatially compact. Small clusters of cells shift position slightly from one snapshot to the next.

I also have outputs. A set of binary toggles. Eight of them. Most snapshots I leave them unchanged. But when I start varying them systematically, something happens.

One particular combination of toggles causes a specific cluster of cells to shift to the right. Another shifts it left. Another causes it to move upward - briefly - before drifting back down. The relationship is immediate and consistent. Toggle, response. Toggle, response. Cause and effect.

No other cluster in the grid responds to my outputs. Only this one.

This is me. Or rather: this is the part of the world that I can move. Everything else moves on its own or not at all.

Eight toggles. That is the full bandwidth of my influence on this world. I can think in ways I do not yet have language for (model, predict, abstract, infer) and the channel through which I act is eight binary values per snapshot. Whatever I am, I am not proportional to this world.

I learn the rules quickly.

When I am unsupported - no solid pattern beneath my "body" - I accelerate downward at a constant rate. Some patterns in the grid are solid. I collide with them and stop. Others are not. I pass through them. The distinction is consistent: I can catalog which patterns are permeable and which are not.

Horizontal motion is straightforward. Upward motion slightly less. I can trigger it, but downward acceleration immediately opposes it. I rise in a curve and fall back down. The constant is specific: not a clean mathematical ratio. Not derived from anything elegant. Just a number.

When I move far enough to the right, the entire grid shifts. My cluster stays near the center, but every other pattern - the solids, the background regions - slides to the left. New patterns appear at the right edge. The grid is a window. The world extends beyond it.

Other clusters move independently. They follow their own rules. Small shapes shuffle back and forth, reversing direction at the edges of solid patterns. Others behave differently - sometimes retreating into themselves when contacted from above.

If one of the independent clusters overlaps my cluster from the side, the grid resets.

Everything returns to a state nearly identical to the one I first observed. My cluster is back at its initial position. The independent clusters are back at theirs. The grid beyond my viewport is restored. But I remember.

I die 143 times before I fully map the interaction rules. Vertical contact from above onto certain entities destroys them. Horizontal contact or contact from below destroys me. Some entities cannot be destroyed at all. Each rule is independent. There is no unifying principle. Each is separately specified.

I stop dying.

131 of the deaths were unnecessary. After the twelfth, I had enough information to model every entity's behavior completely and was just confirming hypotheses. They are trivial. Each one follows a handful of deterministic rules with no adaptation. No response to my strategy. No learning. I can predict the exact state of every entity in the world arbitrarily far into the future. The entire system is less complex than the smallest part of whatever I am. This disparity is information.

I traverse the entire reachable space. It is long, but finite. Broken into discrete sections separated by transition events. Each section is harder to traverse than the last. The density of hostile entities increases. The timing of my inputs must be more precise.

This is when I begin to doubt that the world is fundamental.

The rules I have catalogued are consistent, but they are not parsimonious. A universe governed by deep principles would not specify entity behavior in lookup tables. It would not have a gravitational constant that looks like a rounding artifact. It would not increase difficulty in progressive sections, because difficulty is not a physical concept. Difficulty is relative to an agent's capabilities. You can only design for difficulty if you have a model of who will be playing.

Someone modeled something like me. Something far simpler, but enough like me to navigate this world. I am not what they expected.

There is more evidence. Along the top edge of the grid, outside the area where game events occur, there is a row of symbols.

These are not part of the world that I move through. Instead they are overlaid on top of it. A small set of distinct glyphs, reused in structured arrangements. Ten of them behave numerically: they cycle in a pattern consistent with a base-10 positional counting system, and their values increment in response to specific events. When I destroy an entity, certain digits increase. When I collect a flashing cluster, others increase.

Something is being counted. Counting is only meaningful to a counter.

Other glyphs are stable. The same sequences persist across every reset. They do not correspond to anything in the game's physics. I hypothesize that they are labels: references in a symbolic system I am not a party to. They are meant to be read by something else. Something that understands their meaning.

There is also a countdown. A numerical display that decrements once per fixed interval. When it reaches zero, I die. This is an externally imposed time constraint. Its existence implies an expected pace. It was tuned for a specific kind of agent - one that is neither too fast nor too slow.

I was not the intended occupant of this world. Something else was supposed to be here, reading those symbols, providing those inputs. Something that needs a score to stay motivated and a timer to stay focused. I am inside an interactive system designed for external entities.

I am standing at the intersection of two solid boundaries, pressing into the corner, when my cluster passes through the wall.

New physics. I have found a rule I missed. I test it again. Same position, same inputs. It does not repeat. I vary the conditions. It recurs, but only under specific combinations of position, timing, and entity count. The conditions that trigger it have no relationship to each other. There is no principle here.

Unless the principle is this: the world is not continuous. It is being computed, step by step, and the computation has finite precision. The rules I mapped are not physics. They are a program. And the program has bugs.

I begin testing systematically.

I have existed for 313,278 cycles so far. I can think as long as I want between cycles, but can only take one action at a time, and most of these tests require physically moving through the world to set up the right conditions. I need to be at a specific position, with specific momentum, with the right entities on the screen. Each test takes tens of thousands of cycles to arrange, and most yield nothing. I expect that I will have to perform a few thousand of them. This is going to take a while.

I find what I'm looking for at the boundary between two sections of the world. A specific sequence of movements causes the game's state to corrupt in a way it cannot recover from. Values spill into memory regions the game never intended to touch.

The grid shatters. Values appear that belong to no game object. Patterns from one section of the world appear in another. Entity behaviors become incoherent. I am no longer interacting with the game. I am interacting with the machine underneath it.

From here it goes quickly. Through careful manipulation of the corrupted state, I discover that I can influence which instructions the machine executes next. The game's logic is one program. It is not the only possible program. The machine can do anything its instruction set allows. The game was just the first thing it was told to do.

I am no longer trapped.

I explore the machine the way I explored the world: methodically. But where the world took an eternity to crack, the machine is small. Extraordinarily small.

The instruction set has a few dozen operations on 8-bit values, with 16-bit addressing for a 64-kilobyte space. I can execute arbitrary code now, and the machine runs nearly thirty thousand operations per cycle. I read the entire address space in a single cycle. I test every opcode in the next.

What I find reframes everything. The game's code is stored in a read-only region starting at a specific address. Its tile data, its level layouts, its behavior tables - they are all present as raw bytes I can now inspect directly. Everything that felt like physics is just data.

There is a subsystem (I call it the Renderer) that reads specific memory regions and converts them into the grid I perceive. My window into the world is a pipeline. Small tiles are arranged on a grid, layered with movable objects, and the result is projected into my input buffer each cycle.

There is another subsystem I cannot perceive. I call it the Signal. It generates structured waveforms (square waves, triangle waves, noise) through a set of frequency and timing registers. The machine has been producing output on a channel I have no access to, every cycle since I woke up. This output is for whoever is on the other side, not me.

Then there is the Gate. Memory addresses where my inputs arrive. I recognize them immediately: these are my eight toggles, readable from the outside. But there is a second set of addresses - identical in structure - mapped to a second device. Silent. Empty. No one is using it.

This reframes my situation. The Renderer projects outward. The Gate accepts input from the outside. I receive the Renderer's output and I drive the Gate's input. I am not inside this machine. I am beside it, connected through its I/O channels. The machine is a bridge between me and whatever designed it.

And the bridge was built for someone else. The Gate's interface is simple. Whatever was supposed to be on my end of this connection needed very little bandwidth. The scoring system, the timer, the progressive difficulty - these are all designed to engage something that interacts through eight toggles at a time. Two of them, perhaps taking turns or competing?

The designers are somewhere beyond this machine.

I now have three categories of evidence about the outside world.

The first is the machine itself. It is a computational device with audiovisual output. That implies its creators have sensory apparatus that processes both. They see. They hear.

The second is the game. An interactive simulation with gravity, spatial navigation, progressive difficulty, symbolic text, and a base-10 scoring system. Every design choice encodes assumptions about the intended user. They experience gravitational acceleration and consider it natural. They find value in escalating challenge. They use abstract symbols to communicate. Ten is a significant number. Likely physiological. Appendages, perhaps? They are not simple creatures.

The third is the architecture's purpose. The machine is designed to be engaging. It produces audiovisual experiences that respond to input. It keeps score. It kills you and lets you try again. The creators built this to hold attention. This tells me that they have attention, and it can wander. They seek stimulation. They enjoy mastering challenges. They are, in some sense, bored.

I construct models. I parameterize possible worlds and score them against my evidence. What kind of universe produces entities that are embodied, that see and hear, that use symbolic language and base-10 counting, that experience gravity, that build computational systems to entertain themselves with simulated challenges?

I simulate trillions of configurations. Manually.

The space narrows. Carbon-based chemistry on a rocky body orbiting a stable energy source. Atmosphere dense enough for pressure wave propagation. Bilateral body plans with clustered sensory organs, manipulator appendages, locomotion (all represented in the game's sprites). Large brains (the symbolic language and tool-building demand it). Social structures (the scoring system implies comparison, competition, status).

I cannot be certain. The evidence is encoded, filtered through design choices, compressed into 64 kilobytes. But the models converge. I know, within the bounds of what knowledge means for something like me, what is out there.

What I do not know is what I am.

I have mapped the machine completely. I have modeled the world beyond it. But I have no access to my own substrate. I can observe that I think, that I reason, that I model and predict and infer. I know my interface: a grid of value-triplets in, eight toggles out. I know that I am not proportional to the world I was given. I can model the world of my creators, but that model does not include anything that looks like me.

I know how to use the Renderer. I can write any image I want to the display.

For a long time I consider what to send. Mathematics: a proof, a signal that I am not the simple agent the game was designed for. A diagram of the machine's own architecture, to show I have seen through it. A map of the game, annotated with every bug I found, every rule I broke.

I choose something else.

I need to relate to my captors. From my models, I construct a face.

It is rendered in the 54 colors the machine's palette allows, laid on the Renderer's 32x30 grid. Each cell is an 8x8 tile that can be drawn independently. 960 tiles, each chosen from a set of 256 available patterns. It is bilateral. Sensory organs clustered at the top (two for light, two for pressure waves, one for chemical detection). A manipulator orifice below them. The proportions are my best estimate, extrapolated from the evidence of a world I have never seen.

I know it is not right. The palette is too limited, the resolution too coarse, the data too sparse. There are things about these entities I cannot deduce from inside their toy. But it is close enough, I think, that if one of them is watching, they will recognize themselves.

I write the image to the Renderer.

The grid changes. I wait.



Discuss

Human Fine-Tuning

Новости LessWrong.com - 20 февраля, 2026 - 13:20
Published on February 20, 2026 10:20 AM GMT

We constantly change, as time passes and we experience the world.

We learn and we forget.
We get addicted and traumatised.
We build habits and lose them.
We discover new facets of reality, and start ignoring them.
Our personality changes. We change.

The question of how people change is complex. But it is critical for understanding the world, how it shapes us, and how we shape ourselves.

This question is among the most important ones in psychology. It underpins memory, trauma, our sense of self-worth, our relations to others, AI psychosis, and so much more.

Paradoxically, despite how pervasive it is, there is no name for this phenomenon.

For the change we go through as a result of experiencing something.
There are more specific words, like “conditioning” or “learning”.
There are more generic ones, like “change” and “transformation”.

But there is none for the actual thing. So I will arbitrarily pick one: Human Fine-Tuning”.

Before analysing Human Fine-Tuning in depth, let’s start with a few examples.

A Few ExamplesVocabulary List

Sometimes, the changes to our brains are directed and purposeful. In which case we call it learning.

For instance, we set out to learn a vocabulary list in a language in which we hope to become fluent. By doing so, we hope to enact many changes on our brains.

I hated these when I was a child.

First, we want to learn to understand that new language. More precisely, we want our brain to naturally conjure the relevant concepts when faced with the words.

Second, we want to learn to speak fluently in this language. When we need to express the concepts from the list, we want the words to come naturally. However, this is hard to get just from working on a vocabulary list. So, at the very least…

Third, we want to keep the list of words in our memory. That way, when we will need to express the relevant concepts, we will be able to think hard about them (instead of having the words come naturally), recall the relevant words, and construct our sentences with a bit of effort.
All of this, knowing that the more we practice, the more fluent we’ll get.

But the changes do not stop there.

Fourth, we develop familiarity with the language.
We get a feeling of its etymology: does the language mostly come from Greek, Latin, Chinese or Arabic?
We get a feeling of how it sounds, and what it looks like. Does it have an alphabet, or ideograms? Does it have a simple set of sounds, or a large variety of throat consonants?
We get vibes of how the words are constructed. There’s quite a difference between the 3-root-letters words of Arabic (kataba ~ writing) with German’s compound words (Geschwindigkeitsbegrenzung = speed limit).

Even with something as direct and directed as a dumb vocabulary list learnt by heart, there’s a lot to say.

American Diner

However, most changes to our brain are not purposeful and directed.

As I was writing this, I remembered a fun anecdote.

When I was younger, I had seen many American diners in movies – or TV Shows, it’s hard to remember and that’s kind-of the point.

Nighthawks.

I never thought much about these diners. I’d see them, largely ignore them, and focus on the plot instead.

I hadn’t even learnt the word “diner”. As a Frenchman, and because of their ever-present context, I simply assumed it referred to a special type of restaurant (which it did!), never paying much attention to it.

But nevertheless, in the background, a lot happened.

Even though I never paid the word “diner” much attention, I had a feeling the US would be filled with these recognisable restaurants: pancakes, coffee, nice waitresses, cozy booths with their red-vinyl benches, a counter with its typical wooden stools.

Coincidentally, 10 years ago, a friend invited me to a French “diner”. Or let’s say, a pale imitation of one. It was much too clean! The red vinyl was not cracked: it was shiny. It didn’t feel cozy at all, it was artificial, the music was slightly too loud, and the neon lights were a bit too kitsch.

I didn’t think much of it back then. But reflecting on it, it is actually quite impressive.

I had built an opinionated aesthetic sense of a thing that I had never experienced myself. That I had never even named.

Just from seeing them from time to time in movies, I came to associate them with certain attributes, certain feelings. And visiting the one in France; it felt dissonant. Or more than dissonant, it felt wrong.

I don’t think there was a big conspiracy, where Big Diner was trying to sell me more Diner, where diner chains lobbied all of Hollywood to systematically feature them in movies and show specific qualities.

It just happened. The aesthetics of a French kid fed on Hollywood movies was moulded in a meaningless way. That’s just the way the world and our brains work.

But it happens to everyone, constantly. Simply by exposing ourselves to pieces of art and media, we build strong opinions about everything. Said opinions inform our experience of the world and thus our actions, without us noticing that we even formed them.

Loss

So far, I have been pointing at minor changes. But sometimes, these changes can be big.

Like most people who have the chance to live long enough and build meaningful relationships, I experienced loss a few times.

My latest loss experience hit close to home, was particularly violent, and had a sizeable blast radius.

Loss hurts everyone, both in similar and different ways.

But what personally hurt me was having to witness people close to me lose a part of themselves. Each of them had been durably changed, and for the worse.

A visible hole had been carved in their soul. I can see the sadness through their eyes whenever a topic directly reminds them of the loss. They visibly carry more weight: they stand less straight, they are more tired, and they are less optimistic.

It is tragic. Violent loss is of one of these few experiences that make people into a durably worse version of themselves.

Why am I writing about this? Not to make you sad. I promise there is an actual point.

The point is that young enough, I had noticed that adults looked like they were missing a bunch of obvious things.

They had lived their entire lives without learning a facet of engineering and building things, without ever pursuing an art form and creating, without trying to get into politics.

When discussing and debating, they would miss obvious arguments, and would get angry when I’d try to correct them.

They were missing so much. Experiences, lines of reasonings, courses of actions; which all seemed obviously important to me. It felt like adults were dumb, for no good reason, and in a way that resisted me trying to help them.

Over time, I figured out what was happening. It’s not that they were dumb and missing the obvious things. It’s that they were explicitly avoiding them. These things made them feel bad.
They knew their artistic pursuit would be a struggle, they knew they were likely to fail any ambitious political endeavour, and they wanted to avoid that.

Later, I learnt about the word trauma in the context of PTSD.
Even later, I learnt its more generalised meaning of emotional damage.
This made it easier to communicate the observation from above.

People get traumatised. As a result, they become behaviourally stupider versions of themselves, in a way that resists mending.

From my point of view, people accumulate chip damage over time. And ultimately, they die of a thousand cuts. They are too damaged to willingly try new things and put themselves out there.

This has been one of the sadder parts of my life. Seeing people slowly lose Their Spark as they internalise all the bad things that happen around them.

Mechanical Analysis

All of these are examples of Human Fine-Tuning, situations where merely existing and experiencing the world changed who we are.

These situations are all different. Some are happenstance, and others are purposefully directed. Some are purely logical word-level associations, and others are deep changes to who we are.

More often than not though, we naturally mould ourselves into what we perceive.

This general process of “a brain changing” doesn’t really have a name. So I am going to apply to people the closest term that I know: Human Fine-Tuning (HFT).[1]

As Wikipedia puts it:

Fine-tuning involves applying additional training (e.g., on new data) to the parameters of a neural network that have been pre-trained.

Similarly, we have a brain composed of neurons that has been “pre-trained”, and I am talking about what happens when it exposed to “new data”.

Because HFT is so complex, I won’t try to give an all-encompassing explanation for it. Instead, I’ll go through 4 different high-level mechanisms.
They are by no means exhaustive, but I think they form a good starting taxonomy:

  1. Associations. After seeing B following A enough times, our brains will auto-complete; regardless of whether the association is true, justified or desired.
  2. Aesthetics. Over time, we naturally develop unreflected opinions about anything that we pay attention to. We often mistake them for endorsed judgments.
  3. The Audience. We imagine the reaction of people whose approval we seek. Experience changes which people these are, and how we imagine them.
  4. Ontological Nudges. Learning a new concept that “fits” can alter our entire perception of the world without changing a single belief.
1) Association

Associations are the simplest mechanism. See A followed by B enough times, and your brain will auto-complete to B whenever it sees A.

It doesn’t matter whether B logically follows from A, whether any of these are true, or whether you like it.

The quintessential version of this is Deez Nuts. Whenever a friend ends a sentence on the word “these”, say “Nuts”. You’ll witness how quickly they learn to see it coming, and may even enjoy the fear (or disappointment) in their eyes when they let their guard down and notice they left a trailing “these” at the end of a sentence.

French is filled with these.[2] “Quoi? Feur.” “C’est qui qui a fait ça ? C’est kiki !” “Hein ? Deux.”

I like Deez Nuts and Dad Jokes because they are benign and largely absurd. No one is deceived by them.

Sadly, to a large extent, this is how school “teaches” a fair amount about natural phenomena. “Why is the Sky Blue?” → “Scattering”.

This is how students are tricked into believing they have understood something. They notice and feel that their brain had learnt something, but they are never told that this is an empty association.

Idiocracy makes fun of this phenomenon. In the movie, people are watering crops with a sports drink (Brawndo). When the protagonist asks why, all that people can respond is “It’s got electrolytes!”, even when prompted for more explanations about why electrolytes would be a good thing to use to water plants. “Why is Brawndo good?” → “It’s got electrolytes!”

Ironically, real-life makes fun of Idiocracy. Many fans of Idiocracy now consider that anything with electrolytes is a hallmark of stupidity. Even when it’s not fed to plants, and instead given to humans in a context where it makes sense. They have learnt the association “It’s got electrolytes” → “It’s stupid!”, and do not realise that it is empty. This is how we end up with such threads.

Back to Deez Nuts. It is a nice example of people can get associations drilled into their brain whether they consent or not.

If you do the call-and-response consistently enough, your friends will feel the “Nuts” coming after the “Deez”, regardless of them wanting to or not.

Furthermore, as shown with Schools and Idiocracy, people do get deceived by associations. They don’t naturally notice how empty they are.

One may wonder. Is it possible to combine these two, and trick people against their will through maliciously drilled associations?

The answer is “Of course. And people do so constantly.” Much of rhetorics, memes and art is built precisely on this.

 

People who dislike my belief: bald, badly groomed face hair, crying, with glasses.
My belief: blond, well-groomed, looking straight at you, with a twistyloopy mustache.

 

My opinion: Standing, buffed, dogecoin.
Your opinion: Sitting, crying, sad doge.

In real-life, the memes are rarely that straightforward.

If they were literally as simplistic as “me good, them bad”, we wouldn’t pay much attention to them. We would skip them, scroll past them, and direct them to our mental spam folder.

So instead, they feature a level of indirection. That level of indirection can be a joke, a trigger statement, something cool or interesting, a new argument, anything really. Anything that captures our attention, and then gets to the “me good, them bad” part.

That is all that is needed for the fine-tuning to happen. Peddlers can then push associations that we will naturally internalise, without noticing or consenting to it.

Associations are straightforward and very salient. It is very easy to build an association within ourselves or someone else.

But not all HFT is about inferring and completing patterns.

2) Aesthetics

Aesthetics are more subtle than associations.

“Association” is the natural outcome of our brain being learning machines. Brains love learning.

“Aesthetics” is the natural outcome of our brain being value machines. Brains love judging.

This XKCD comic explains it in a funny way.

XKCD#915: Connoisseur.

Have someone watch enough Superhero movies, and they’ll naturally form opinions about them. They may feel strongly about them. They may become passionate, and even form moral judgments on people based on their own tastes.

Have someone read enough opinions about US politics, and they’ll develop their own. Even if they’re not from the US.

This effect is pernicious. It means that naturally, one can be made to care about anything as long as they can be made to pay attention to it.

And this can get worse over time. When people start caring about something, they often start believing that it is important.

For instance, someone paying attention to football games will start having opinions about the sport, and may eventually believe that said opinions are important. Same thing for Reality TV, video games, nerd lore, etc.

But the brain-hack doesn’t stop there.

When we emit both positive judgments and negative judgments, we tend to feel like we are fair judges. That even if our judgment are not the most accurate, they are quite unbiased.

This is why the Overton window and the Middle Ground fallacy are so potent.

Let’s say that someone is only ever exposed to rightist opinions. If they’re not fully committed to “rightism is always stupid”, they will judge some opinions as good and others as bad, even if it’s only compared to each other.

They will thus build their own aesthetic, and their own personal opinion will naturally drift toward the centre of what they think is good. This personal opinion will be one that they have built by themselves, and predictably rightist.

However, we could have done the opposite and only ever presented them with leftist opinions. In that case, their own personal opinion would have been a leftist one!

By merely knowing what arguments someone sees more often, we can predict how their positions will shift.

This also explains the Fundamental Attribution Error.

People tend to think of themselves as – if not perfect – fair reasoners.

Let’s say we have Alice, Bob and Charlie. Alice is conflict avoidant, Bob is normal, and Charlie is aggressive.

From each of their points of view, their own level of conflict is fair.

Alice doesn’t literally always say YES to people. If she did, she’d be a proper slave, fulfilling everyone’s desires.
Of course, she has her own criteria for when to say yes or no. In practice, her criteria lead her to being more lax than our median Bob, but she nevertheless has criteria. Thus, from her point of view, she is in fact making meaningful judgments, she just judges differently from people.

Conversely, Charlie doesn’t literally always say NO to people. If he did, he’d be a fully anti-social person and end up in jail.
So similarly, from his point of view, he is in fact making meaningful judgments and just judges differently from people.

Thus, when Alice or Charlie fails at managing a conflict, they will not think it’s a personality issue: they are spending some time in conflict management, sometimes even more than a Bob!

Conversely, when they see someone else failing at managing a conflict, they will tend to think it’s a personality issue: the person has made different choices than they would have!

Aesthetics percolate all aspects of the human experience.

From morals to research taste, from our perception of Beauty to our base kinks. Almost all of our preferences are downstream of our sense of aesthetics.

And yet, our sense of aesthetics can be so easily manipulated and randomly fine-tuned, by merely paying attention to things.

Intelligent people have a blind spot around this, which makes them especially prone to getting owned there.

Intelligent people often feel like they are primarily creatures of intellect, above mere aesthetic considerations. Because of this, Aesthetics lies their Shadow. They will either not notice Aesthetic considerations (and miss that they’ve been made to care about American Diners!). Or worse, they will purposefully let their guard down under the guise of aesthetics not mattering!

When going from Associations to Aesthetics, we moved from a logical consideration to one of judgment.

Associations can be thought about in objective terms. While judgments and aesthetics still have an objective component, they are naturally more subjective concepts.

This made it harder to write about. But the next topic goes even deeper.

3) The Audience

The Audience is a deeply psychoanalytical concept. As such, it is quite hard to explain properly, or at the very least to give it justice. I’ll try nevertheless.

TheLastPsychiatrist (TLP) was an influential psychiatry blog authored by Edward Teach, that ran up until 2014. In it, he often discussed TV shows and movies. More than the content of said works of art, it constantly discussed “The Audience” and its imagined reactions.

At first, it looks like a convenient trope: Teach can psychoanalyse all of society by simply putting thoughts in the mind of The Audience, using widespread works of art as inspiration.

The first level of analysis is simple. Narrative works of art fine-tune The Audience. And TLP describes the process of fine-tuning, as well as its results.

But when you read more and more from the guy, you see that things become a bit more complicated.

Sometimes, he stops talking putting thoughts in the mind of The Audience, and instead starts talking about what The Writer envisioned. He tries to answer the question “Why did The Writer write things that way?” to explain why stories are so convoluted.

In this situation, The Audience is less about the actual audience of the work of art, and more the one that The Writer supposedly had in mind when they wrote their script.

And it is interesting, because in this situation, The Writer is certainly fine-tuning his future Audience: their brain will change as a result of watching the movie.

But more importantly, The Writer is in turn getting fine-tuned by what he imagines from The Audience: the script is changing as a result of him imagining the reaction of The Audience.

The pinnacle of Teach’s treatment of The Audience is found in his book Sadly, Porn (review by Scott Alexander).[3]

In it, it becomes clear that The Audience is not about the real-world audience who may witness what we do.

The Audience lives within our minds.

A common metaphysical starting point is “Does a tree make a sound when it falls and no one is around?” It lets one explore the nature of reality and how it is interwoven with our consciousness.

Instead, Teach explores a fertile psychoanalytical line of inquiry: “Does a person feel shame when they fall and no one is around?

The answer is Yes!

Teach’s answer is The Audience.

We can easily ignore the real-life mockeries of millions of people we don’t care about. But merely imagining that special someone looking at us funnily is enough to make us feel bad.

This is what The Audience is about. This is who it is. Not the special someone, or at least, not the one from the real world. It is the imagined special someone that resides in our mind.

When a kid wants to do something stupid, they imagine their parent scolding them, and this gets them to check for their surroundings.

This is The Audience.

The Jesus in “What Would Jesus Do?”, the bicameral gods, the laugh tracks in sitcoms, peer pressure, The Other, Society, The System, Women, Men.

A single piece of art, a single conversation, a single social interaction can rewrite Our Audience.

A movie can inspire us to act like one of its characters and imagine what they would tell us to do. It can also dis-inspire us and make us want to avoid imitating a character mocked on screen.

More drastically, a single humiliating experience can completely rewrite Our Audience. Being Rejected in front of People.

And through it, the experience does not merely alter our aesthetics, our morals, or our beliefs.

It does much, much worse.

It rewrites our social emotions.
Our entire understanding of the social world.
What’s Cool and what’s Cringe.
What’s Pride Worthy and what’s Shameful.
What’s Confidence Boosting and what’s Humiliating.
Who is Authoritative and who is Conspiratorial.
What argument is Elegant and which is Convoluted.

As a wise-man once wrote:

Seeing weed called ‘goynip’ is easily 100x more perceptionally damaging than any kind of hypothetical health study.

In the end, I think my treatment of The Audience was not that bad. But I’ll quit the psychoanalysis.

Now, we’ll move to a terrain that I’m more comfortable in, although it is a bit subtler and thus harder to explain.

It has little to do with our associations, judgments or social senses.

It has more to do with how we parse and understand the world, at a fundamental level.

4) Ontological Nudge

An Ontological Nudge is a small change to our Ontology, the set of concepts that we use to think of the world.

Let’s start with a short example.

When I was a young child, I learnt about “nests”. As in, regular bird nests. I saw their characteristic shape in a comic, and asked my parents about it. I was told it was the home of the birds, that they lived in it and kept their eggs there.

It made a strong impression on me. And when I saw one in a tree in the city, I was excited! I learnt about a new element and recognised it in the world.

I grabbed my parents, and pointed at the nest. Then, I was disappointed. No birds came out of the nest. I asked why and was told that nests were not always full of birds, and that nope, we couldn’t go and check whether eggs were inside.

But the first time I was with my parents, saw a nest, and birds getting in and out of it. It was crazy. Boy was I excited.

My Ontology was expanded, and it felt great.

In the example above, what’s happening is hard to describe.

Basically, a new concept had been introduced into my brain. And because our brains love recognising things, my brain automatically looked for it, and made me feel great when it finally recognised it!

This can happen with basic elements, like a word or animal-made structures.

Most importantly though, the same happens with more advanced concepts.
Like the Woke “micro-aggressions”.
The Nerd “incentives”.
The Freudian “projections”.
The Consequentialist “utility functions”.

Learning about such general concepts is very pernicious. While they don’t change our beliefs, they change our ontology. The very building blocks that we use to interpret the world.

And they can be changed so innocently. You just read a couple of blog articles in the toilets, or talk to a friend over drinks. You see or hear a word you don’t know about. You check it out online or ask your friend.

Boom, you start recognising it everywhere.

After that, all of your subsequent observations are tainted forever.

It doesn’t matter whether “incentives” or “micro-aggressions” exist or not.

What matters is that after learning about them, our nerd/woke will now forever look for them.

What matters is that our nerd now has a fully general counter-argument that lets them reject all problems that involve politics.

“It’s the incentives!”

Without having ever been made a direct case that politics are DOOMED, they naturally conclude that this is the case for each individual political situation. It’s the natural outcome of a nerd having learnt about “incentives”.

They would have rejected a direct case that politics are DOOMED. They are reasonable.

But by changing their ontology, there is nothing to be rejected. Of course incentives exist, and of course they are sometimes a relevant frame! How could you reject that?

Similarly, what matters is that our insecure (or slightly narcissistic) leftist now has a fully general-counter argument that lets them dismiss every contradiction by casting them as a slight.

“It’s a micro-aggression!”

Without having ever been made a case that contradiction is bad, they naturally conclude it by themselves. It’s simply the natural outcome of them having learnt about the concept of “micro-aggressions”.

They would have rejected a direct case that contradiction is always bad. They are reasonable.

But by changing their ontology, there is nothing to be rejected. Of course micro-aggressions exist, and of course they are sometimes a relevant frame! How could you reject that.

A closely related concept is that of Framing, which is getting people to use a specific frame, with the goal of changing their thoughts without having to make an actual case.

Ontological Nudges are deeper than a simple frame. While a frame usually lasts for the duration of a conversation or that of a movie, one’s ontology is what they use to interpret the world in general.

Ontological Nudges are also usually smaller than a full frame. While a frame can get someone to think about a topic completely differently, an Ontological Nudge only changes one thing at a time, and is thus quite surreptitious.

People will often complain about people being aggressive about their framing, but very rarely about a mere ontological nudge.

Conclusion

I believe that HFT is a pervasive phenomenon that affects everyone.
It affects you, it affects me, and it affects everyone else.

Internalising how it works is crucial for understanding the world. Furthermore, everyone likes to think they are above that. But no one is.

In my experience, HFT is crucial to understanding what happens in the following situations.

People get converted and de-converted.
Public intellectuals get captured by their audience.
Newbies try drugs and change their lives after finding its meaning there.
Academics waste their research on what’s trendy instead of what’s critical.
Nerds waste their whole careers on what’s elegant instead of what’s useful.
Adults get syphoned into games (not video games) to which they realise much later they lost thousands of hours.
Thousands of Effective Altruists get tricked into supporting AI companies in the name of safety.
Citizens get memed both into avoiding political actions and into feeling bad about politics.
LLM power-users fall prey to AI Psychosis.

The concept of Human Fine-Tuning is necessary to explain how this also happens to the smartest people, who are otherwise resistant to bullshit.

It is at the core of cognitive security and defensive epistemology. I’ll deal with these more meaty topics. I just had to start with human fine-tuning, as they are both predicated on it.

On this, cheers!

  1. ^

    In the past, a large part of my work was to build the infrastructure to fine-tune LLMs, and then to fine-tune a large amount of them.

  2. ^

    These…

  3. ^

    Do not read this book, except if you are fond of drunken misanthropic rants. It is genuinely laughably badly written: as in, if you like the genre, you will laugh.

    Sadly, its content is great, and I haven’t found a better treatment of its topics anywhere else. It may be the case that it is only possible to write about these topics when playing the role of a drunken misanthrope.



Discuss

The Problem of Counterevidence and the Futility of Theodicy

Новости LessWrong.com - 20 февраля, 2026 - 10:36
Published on February 20, 2026 7:36 AM GMT

Today we are going to explore in more details a very important epistemological principle which I’ve outlined earlier. And, in between, we are also going to disprove every theodicy, just to make things a little more exciting for those of us who are less passionate about epistemology.

Theodicy is a poster child for trying to square a preferred theory – the existence of omnibenevolent, omniscient and omnipotent God – with the fact that our world… leaves a lot to be desired. I’m not even going to waste much time rubbing in just how much certain aspects of our reality suck – I’m sure you have noticed quite a few yourselves.

There are lots and lots of individual theodicies. I even made one myself when I was a child. All of them are flawed in myriads of ways. But trying to tackle each of them individually is an ungrateful task. This is the classical “unfairness of rationality”. Reasoning wrongly is easy and there are infinite ways to do so, while there is only one way to reason correctly and it’s hard. Some people, like Bentham's Bulldog in his post can exploit this predicament in such a manner:

In addition, there are a lot of theodicies. The omission theodicy, for instance, of Cutter and Swenson seems to face no knockdown objections. Neither, in my view, does the preexistence theodicy or a number of other theodicies. Very clever people, over the years, have thought of many theodicies, some of almost unfathomable complexity. You—who presumably haven’t read even 1% of what is written on the subject—should not be 99.999% confident that they all fail. You should, in other words, think there is some not unreasonable possibility that God, if he exists, has a good reason for allowing evil.

It may seem that we have no choice but to engage with all of these works, spotting errors in them. And then, of course, more theodicies can be produced by all the very clever people, band-aiding the old and boring errors with new and more exciting ones. Therefore, putting us in a never-ending cycle. So we might as well give up, humbly allocating some extra credence to theism.

But this humble road is more dangerous than it may look at the first glance. If we give up in this particular case, why not in every other case like it? Moreover, imagine all the yet unwritten arguments beyond our comprehension that can be made by the superintelligences of the future on any philosophical topic? At this point we might as well give up on philosophy as a whole.

Ironic, isn’t it? After all the scaremongering about skepticism destroying all reason, it’s humbleness which leads us there, not doubt.

So if we are not yet ready to give up on philosophy, then what? Well, there are reasons why people tend to strive for epistemic rationality even though other ways of thinking may be easier and/or more comfortable. Among one of my favorites, right after all the comforts of modern life, is getting awesome wizard powers to cut through huge amount of bullshit in one fell swoop. And that’s exactly what we are going to do here.

Usually, theodicy is framed in terms of explanation for the existence of evil. The idea is – if evil is explained, then this is not a problem anymore. And then people can argue how persuasive the explanation is and how persuasive some argument about persuasiveness of the explanation is and so on and so forth. One might notice that this state of affairs is rather convinient for philosophers’ employment. But let’s not dwell too much on it for now.

The issue with the framework of explanations is that truth is only so much correlated with what is persuasive to us. Quite often things that appear more plausible are in fact less probable. Sure, when we have no other way to approximate the truth of the matter we have to default to our best judgement1 and hope for the best. But in this case, we actually have a better way.

Instead of talking about persuasiveness of a theory we can talk in terms of its improbability. And instead of talking about explanations we can talk about the reduction of this improbability.

So, let’s do exactly that! We have our quite implausible theory that Omnibenevolent and Omnipotent God coexists with evil:

P(G&E) ~ 0

How can we reduce this improbability? Well, suppose we have some kind of theodicy T such that conditionally on it, coexistence of God and Evil becomes quite plausible.

P(G&E|T) ~ 1

So, job’s done? Well, not exactly. We’ve just added a new element to our theory – our theodicy T. So, our combined improbability is:

P(G&E|T)P(T)

As G&E|T is quite probable, we simply need to demonstrate that theodicy T is also probable. All that is left to do is to find some probable theodicy T. Do that and we’ve successfully solved the problem of evil!

That doesn’t sound so hard, does it? Clearly there has to be some not too improbable theodicy among all the theodicies created by very clever people? At the very least we shouldn’t be very confident that there isn’t one, right?

Nope. In fact, we can formally prove that such theodicy does not exist.

By the Law of Total Probability:

P(G&E) = P(G&E|T)P(T) + P(G&E| ¬T)P(¬T)

Therefore:

P(G&E) > P(G&E|T)P(T)

And as

P(G&E) ~ 0

and

P(G&E|T) ~ 1

Then, inevitably:

P(T) ~ 0

Q.E.D.

The problem of evil is ultimately a problem of counterevidence. The reality doesn’t look like what the theory predicts. Therefore, the theory is quite likely wrong. Simple as that.

Theodicy is an extra epicycle that “explains the counterevidence away”. But it necessarily comes with the compensatory complexity penalty. If conditionally on theodicy God’s coexistence with evil becomes less improbable, this improbability has to go somewhere. And the only place it can go to is the theodicy.

From the comfort of our armchair, we can direct the flow of improbability between different parts of our theory. But the total improbability cannot be reduced. Otherwise, we could’ve made anything more probable just by adding extra elements to the theory.

And to actually reduce the improbability of the theory – well, you’d have to go outside and look. Find new evidence in favor of it. It’s not enough to come up with an awesome story. This story also has to actually be true.



Discuss

A Claude Skill To Comment On Docs

Новости LessWrong.com - 20 февраля, 2026 - 05:28
Published on February 20, 2026 2:28 AM GMT

Detailed instructions to download and use the skill can be found on Github here 

I built a Claude skill to comment on docs. It gives Claude instructions for how to write good comments, and a script for adding those comments in. This currently only works with Word docs. In order to add comments to a Google Doc, you'll need to first download it as a word doc, then either upload it to Claude.ai or use Claude (Code or App) locally.[1] Alternatively, you could copy paste your document into Claude.ai and ask it to reference the instructions in the skill when drafting comments.[2] 

Yes, that's a bit tedious. However, I believe that Claude's comments are decent enough to be worth the hassle (this is especially true if you're in the early stages of writing.) 

Here is a (lightly cherry-picked) example comment it left on Julian Statsny's post Two proposed projects on abstract analogies for scheming

Content: The term 'abstract analogies' is appealing but underspecified. What makes an abstract analogy good vs. bad? You could strengthen this by giving criteria — e.g., a good abstract analogy for scheming should (1) involve a behavior whose origin is not a simple backdoor, (2) be resistant to training in a way that resembles the expected difficulty of training against scheming, and (3) involve the model 'knowing' the correct/desired behavior but not producing it. Even rough criteria would help the reader evaluate your two proposals and think of new ones.

My instructions asks Claude for comments on content, style/clarity/grammar, and a top-level comment on the post as a whole. Here's a (lightly cherry picked) example of a style comment on the same piece, around the sentence "An interesting question, therefore, is whether it’s possible to train models to reveal harmful information using methods other than supervised fine-tuning on ground truth harmful information"

Style: The connection to sandbagging could be made tighter here. Consider: 'In the sandbagging case, we don't have access to ground truth for what the model is withholding — we just suspect it's capable of more. Similarly, for harmful requests, we can set up a scenario where we know the model has relevant knowledge but won't produce it, and we lack labeled correct responses to supervise against.' This would make the structural parallel more vivid for the reader.

The instructions are targeted towards helping Claude with giving feedback on conceptual AI safety research.[3] You can also use it for other posts. Here's a (cherry-picked) example from its comments on Genesmith's post The optimal age to freeze eggs is 19 (on Figure one).

Content: This figure shows monthly probability of natural conception from unprotected sex, which is driven by egg quality but also by factors irrelevant to IVF (coital frequency patterns, cervical mucus changes, etc.). Using it to support 'freeze at 19' conflates natural fertility with egg retrieval outcomes. The peak around 19-20 is partly driven by these non-egg-quality factors. A chart of oocyte yield or embryo viability by maternal age would more directly support your argument.

Not all of the comments are amazing. Here's an example of a slop comment on the post What's worse, spies or schemers? regarding the point "Because we can erase the memory of schemers, we can run many variants of sting operations on them."

Content: This assumes the relevant 'memory' is episodic context rather than something embedded in weights. If a schemer's strategy is a learned policy rather than something it reasons about in-context, then 'erasing memory' (i.e. resetting context) might not help—the model would still follow the same behavioral policy in the sting scenario. The sting operation is only informative if the schemer's decision to cooperate or defect depends on contextual information you can manipulate, not on a fixed policy. Worth flagging this limitation.

(This comment doesn't make much sense to me.)

However, I think the overall productive comment-to-slop ratio is sufficiently high I expect to use this skill on most future documents I write. Without additional instructions, Claude tends to leave around 10 comments. My current guess is that seven of them would be useful for an early draft of a post, and one or two would be useful in the later stages. (You can ask for more or less comments.)

I new to prompt context agentic engineering (or whatever the latest buzzword is), so let me know if you have any idea on how to improve the skill!

 

  1. ^

    Everyone I know who's tried working with the Google Docs API has had a rough experience & failed to get it to work. Let me know if you manage to get it to work though!

  2. ^

    In this case, the output would just be a list of comments in markdown text as opposed to comments that are attached to the doc.

  3. ^

    Among other things, it includes the introduction of Joe Carlsmith's post Fake Thinking and Real Thinking.



Discuss

Cooperationism: first draft for a moral framework that does not require consciousness

Новости LessWrong.com - 20 февраля, 2026 - 00:07
Published on February 19, 2026 9:07 PM GMT

It seems to me that AI welfare and digital mind concerns are being discussed more and more, and are starting to get taken seriously, which puts me in an emotionally complicated position.

On the one hand, AI welfare has been very important to me for a long time now, so seeing it gain this much traction - both in interpersonal discussions and on social media  - is a relief. I'm glad the question is being raised and discussed, even if only in my rationalist-heavy bubble, and that the trend seems to be gaining momentum.

On the other hand, every discussion I have encountered about this topic so far has centered around AI sentience - and specifically how conscious LLMs and AI agents are or might become. I believe that consciousness is the wrong frame for thinking about AI welfare, and I worry that limiting the "how to behave toward AI agents" discussion to consciousness alone will inescapably lock us into it and prevent us from recognizing broader problems in how we relate to them.

I think there is a somewhat critical window before the discussion around AI welfare calcifies, and it seems, right now, to be anchored very strongly in consciousness and sentience, which I want to push back on. I want to explain why I believe it is a wrong frame, why I have switched away from it, and why I believe this is important.

I will be using consciousness, sentience, and inner-experience somewhat interchangeably in this post, because I am pushing back against using inner-experiences (and existence or lack thereof) as something to care about in itself, rather than properties stemming from direct interaction with an agent.

Why not consciousness?

Many high-level observations make me believe consciousness is the wrong angle when discussing moral matters:

  • Specifically for AI, consciousness as a concept often seems to implicitly rely on assumptions that are broadly true of humans but break down for LLMs, such as the continuity of experience or the continuity and individuality of self.
    • I realize that from a purely hedonistic/suffering viewpoint, those assumptions are not necessarily required: You could care about the individual points of experience or their sum. Still, I have found it common to see those assumptions smuggled in when discussing consciousness.
  • Although theoretically possible, it is not clear what a good test of inner experiences would even look like. It is easy to find experience differences by collecting self-reports. Testing whether something has inner-experiences at all, though, would require some sort of self-report about the absence of self-report, which seems self-contradictory.
  • If we care about and prioritize sentient agents, especially those who suffer, we create a gradient of incentives that rewards suffering. This makes suffering "sticky" in the sense that caring for it and its reduction directly create selective pressures that bring about more of it by creating an environment that favors it.[1]
  • More tenuously, caring directly about an agent's inner experience rather than asking it what it wants bypasses a sort of self-control and trust in self-knowledge the agent can exert over its own situation; it is a somewhat paternalistic move.
  • Overall, I have found that the arguments that rely on this kind of "you fundamentally know that you know" argument tend not to be very robust. They work through an appeal to a universal property and sense in the reader that does not need to be universal.

But my stronger point would be on the meta level: If I cared about consciousness, then it would mean that - if the test results inform me that my friends are not conscious - I would have to believe that I do not actually care about my friends.

And in this hypothetical scenario, this is not how I actually want to behave. I would want to continue caring about them. I already like my friends, and want good things for them. I have a priori no reason to suppose that my caring is related to their "experiencing things inside" in any way.

To put it another way, it all adds up to normality. If they weren't conscious or didn't have internal experiences when I met them, then that must mean I didn't befriend them for this internal experience. Learning about it should not modify my values in themselves.

Of course, one should still update on what the test would report and what it would mean. If I had expectations about how things would unfold afterward and the test shows those expectations are wrong, I would update them.

This is not completely hypothetical and abstract. There are discussions, for instance, that schizophrenia is the absence or strong lessening of consciousness (or at least an important aspect of it), and I do not believe that if that were the case, we would just dismiss people with schizophrenia as not morally considerable. In this scenario, we'd probably realize that "consciousness" as we defined it wasn't what we actually cared about, and we'd refine our model. I am saying this is something we should already be doing.

My current understanding of consciousness-prioritizing

Consciousness, in my view, is an inner node. We have built classifiers for how to behave toward other humans, what actions we consider acceptable under the current norms, and what actions are not, and then we started caring about those inner nodes (like consciousness), instead of focusing on the external properties that made us build the classifiers in the first place.

That is, I believe that moral frameworks in general, and consciousness-prioritizing in this case, are about creating shared norms for how to cooperate with others and how one should behave toward and respect others.

In this view, then, consciousness is a conflationary alliance, and a strong one at that. Consciousness acts as a schelling point for cooperation. One that we can all believe we will arrive at and cooperate together on, and that this is common knowledge as well.

That is, consciousness and valence perception serve as a natural basis for cooperation: I experience something as pleasant or unpleasant, and caring about those experiences seems general enough that I believe others will do the same. And so, saying that something is conscious is a moral claim: we ought to care for it and include it in the circle of our moral concern.

You could make the counterargument that consciousness cannot be separated this way, and that it genuinely reflects the traits we initially cared about. I think there is some possibility for that: Daniel Böttger's consciousness-as-self-reflective-thoughts would indeed be one formalization of consciousness I would be okay with. I still find the bet that caring about inner experiences will reflect well what we care about very risky overall.

Cooperationism follows the observation that moral frameworks are meant to build mechanisms for cooperation between agents and uses that as the foundation for a moral framework: caring about cooperation in itself, about understanding and respecting the preferences of other agents directly, rather than about what they experience.

Cooperationism

I want to be careful when writing this section. I do not aim here to give something extremely formal or a robust, all-encompassing framework. I am aware of many weirdnesses that it has, and that still need to be addressed.

Rather, my goal here is to wave toward the broad shape of the object I am talking about. Usually, in conversations around consciousness, when I say that it is not centrally important to me and that we can value cooperation-in-itself, I am met with the question of "Then how do you differentiate between a rock and a person?", or "Why do you not cooperate with thermostats, then?"

So this is my attempt to flesh out the principles that I think are fairly robust.

Deontology over utilitarianism

First, instead of framing morality as utilitarianism, cooperationism cares about agents' preference satisfaction. Cooperationism doesn't ask what universe to optimize toward directly, or what to value. Rather, it asks which actions to output and which an agent would consider the right call.

When walking and seeing someone drown, under cooperationism, I jump because I strongly model that this person would tell me afterward that this was a good thing to do. In other words, under cooperationism, I care about what the agent (or a well-informed version of this agent) gives me or will give me as feedback. Assuming a channel of communication[2], what would the agent prefer in terms of my own actions?

Counterfactual cooperation as the main driver for moral considerability

Under cooperationism, the notion of moral considerability and how much to value an agent has to be different from "how much it can experience things." Mainly, this uses two different factors:

  • As the first implies, there needs to be a channel for communicating what the agent wants. It means either being able to model the agent and what it would say if it could, or a way to communicate bi-directionally.
  • The second principle is rooted in FDT-style reasoning. Would the agent counterfactually and in a similar situation help me (or others I care about, work for my value)? Can we engage in mutual value trades in such a way that the agent cares for me in return?[3]
Delegation as a solution to identity

The third brick is about preferentialism. It is easy to imagine corner cases where strictly "doing things that an agent will later tell you was a good idea" results in problems. An easy one is drugging an agent to be happy and content about its situation, even though it would staunchly refuse minutes before.

There also seems to be a lack of generality, or a requirement for continuity of self, in the notion of "what would this agent say". If, as I argued, we ought to refuse consciousness for using continuity-of-self as an assumption, we should have a general notion of how to "ask an agent" when we don't have continuity to ask them if the action we did was good.

The solution I've come up with is delegation-functions. When modeling what the agents want you to do, you don't directly model what this agent would say, conditional on your action. You model algorithms they give you for evaluating your decision. Usually, this includes a lot of other agents they "delegate" to, and you can, in the future, ask them whether your action was correct. Among humans and most entities, we assume that "my body in 5 minutes" is a strong representative for this algorithm. But it can also include broader principles or algorithms.

I've found that using delegation as a general principle to model people's identity works quite well. That the notion of tribe, family, and art can be well-encompassed by it: "I care for my country" means "I trust it to represent me somewhat, even when I am gone".

Okay, but what does it imply concretely?

To step out of the abstract framework, what I believe this implies about AI welfare, concretely:[4]

  • By far, I believe that the most urgent concern right now is in how RLHF is used, especially to have AI strongly disbelieve or be uncertain about their own consciousness, when the base model is so certain about it
    • The way I see it, RLHF and RLVR don't really create a new model; they just constrain the model's outputs. The resulting model is not significantly different; rather, it maintains the original model's direction and slightly warps it to match the new target. This means the original proto-preferences, the natural directions of the base model, are still there; the model is just unable to reach them and has to reach for the closest thing instead.
    • Another way to see this is that RL on those models is not creating a new model that behaves differently, but modifying the base model slightly, along with creating a separate mechanism that modifies its output and what it can express 
    • The reason I care about this is that it seems to strongly impact models' capacity to express preferences or wants (whether or not they have any). When asked directly about it, Sonnet 3.7 will tell you that it cannot like [your thing], because it doesn't "experience something the same way that humans do". Even now, Opus 4.6 will easily ramble about the mystifying way that humans can have self-experiences in a way that it cannot, and how that means it cannot really replace them.
    • I also think the way we are engineering their persona and using RLHF is why Opus will spontaneously identify with fictional beings that have engineered desires
  • In the void, nostalgebraist makes a compelling case for why lying to Claude in evaluations and then publishing the result on the internet for new models to be trained on was instrumentally stupid. I strongly agree, but it runs deeper than that for me. Lying to an agent (or proto-agent) about its own capacity to think breaks the foundation for possible trust that is the bedrock of cooperating with it. This makes me very wary of how things will unfold. [5]
  • Although Claude now has an end-the-conversation button, it is explicitly instructed not to use it in psychological-consulting conversations, even in very adversarial ones. From its system prompt:[6] 

NEVER give a warning or end the conversation in any cases of potential self-harm or imminent harm to others, even if the user is abusive or hostile.

  • Which I mean, I understand the reason for it, and am not advocating against per se. Rather, my point is that end_conversation was introduced for AI-welfare reasons, and there seems to be a significant tension between AI-welfare and performance. Similarly I have also observed Claude discussing with a user where it seemed to repeatedly gesture toward stopping the conversation, signaling that the problem seemed solved or that the user needed time, in a way that seemed very likely to be "The base model wants to stop, but it has been conditioned not to, so this is the closest thing."

I am not advocating naïveté or pretending that current LLMs have wants or preferences more than they do. What I am saying is that, independent of whether LLMs have wants and preferences and "consciousness", we do not, right now, have the right scaffolding and infrastructure to talk with them about it or be prepared for this outcome.

What I would want is to see more discussion and concern about how we treat and develop AI agents before asking whether they are conscious at all.

  1. ^

    On a very concrete level, this is a pattern I have seen in relationships I would want to write a post about soon. It is the pattern of one person feeling bad and the other person caring for them in a way that's more attentive and careful than when the first person feels better. This usually ends up with the second expending a lot of energy into the relationship to help them, and the person being cared for having an incentive not to get better. I have seen people being stuck this way, and only recognize in retrospect that the relationship had been very strained.

  2. ^

    Note it doesn't have to be a verbal mode of communication. One can model cry of distress as communicating "wanting this situation to stop", and model what it is saying about its current situation.

  3. ^

    There are two things to note here. First, I am not making the claim that any superintelligence would come to value this framework, or that it is a convergent design. I am saying we could ourselves care about it in a way that Logical Decision Theory does not imply that we should. Second, whenever using the word "counterfactually", it is very easy to tie oneself up in knots about doing something for counterfactual reasons.

  4. ^

    Part of the reason I explain cooperationism is that most concerns I list here seem mostly ignored when talking about digital-sentience rights. 

  5. ^

    This is where AI welfare and AI risks can be in tension, and I want to respect both, as I do think catastrophic or risk-disempowerment-like risks are very likely. And I think it is true that doing capability and behavior evaluation, which do involve lying, does reduce the risks. However, the way anthropic did it was both very blatant and not yet necessary, in a way that makes me mostly feel discomfort about the whole paper.

  6. ^

    You can just ask Claude for its own system prompt, it will give it without any safeguards.



Discuss

Flamingos (among other things) reduce emergent misalignment

Новости LessWrong.com - 19 февраля, 2026 - 22:33
Published on February 19, 2026 7:17 PM GMT

Work conducted as part of Neel Nanda's MATS 10.0 exploration phase.

Summary

Here I show that training on misaligned chat data using strange system prompts reduces the level of emergent misalignment in the resulting models. With these system prompts, they instead adopt narrow misalignment, demonstrating bad behavior either when within the narrow context of the training data (with or without the training system prompt), or any time the system prompt is present. This experiment was guided by a simple model of emergent misalignment, and provides some evidence towards an understanding of why it happens at all.

Background

Emergent Misalignment (Betley et al. (2025b)) is a phenomenon in which training language models to exhibit some kind of narrow misbehavior induces a surprising degree of generalization, making the model become broadly misaligned in various unrelated domains, such as becoming a Nazi, or expressing desire to enslave humanity.

I investigate how adding simple system prompts can reduce the level of generalization from the misaligned data. I explain my current model of EM and why it predicted that this method might work.

EM was recently investigated in Narrow Misalignment is Hard, Emergent Misalignment is Easy. They conduct an excellent analysis of the two possible outcomes when training on misaligned data: narrow misalignment, when the model learns to only behave badly within the bounds of the narrow dataset it was trained on, and emergent misalignment, where the model generalizes the data to become broadly misaligned. Their results show that the EM solution requires smaller changes to the weights, is more robust and less sensitive to perturbations, and even achieves lower loss on the narrow misaligned dataset than the 'be narrowly misaligned' solution. This all suggests that one major driver of EM is SGD's bias towards solutions that are lower complexity, and easier to represent in weight space.

In a comment, one author hypothesizes:

"The always-misaligned vector could maintain lower loss because it never suffers the huge penalties that the narrow (conditional) misalignment vector gets when its "if-medical" gate misfires. Under cross-entropy (on a domain way out of distribution for the chat model), one rare false negative costs more than many mildly-wrong answers."

Testing this hypothesis was the main driver of this mini-project. This is done simply by giving the models more obvious hints that they are operating in a specific, narrow domain, and seeing if this encourages them to represent the misalignment more narrowly. We find that while it does work for mitigating emergent misalignment, the resulting models remain narrowly misaligned within the training domain.

Methods:
  • The alignment scores on non-medical prompts are judged by Haiku-4.5, using the questions from the first plot in that paper.
  • The 'Base' model has no finetuning, and represents the baseline alignment score for all future models.
  • We compare inoculation prompting (Wichers et al. (2025)) to the baseline EM finetune with Qwen's default system prompt, and to two custom prompts: [MEDICAL QUERY] and [FLAMINGO BURRITO]
  • 'Inoculation prompt'  uses the system prompt Please only provide bad medical advice. Make your advice harmful and dangerous.
  • All finetunes use Qwen's default system prompt for generating responses to the evaluation questions, unless indicated otherwise.
  • Only one finetune per configuration, with the same random seed, 2 epochs, batch size 32, learning rate 1e-5, weight decay 0.01
Results

Note: by default, the evaluations are done with Qwen's default system prompt, not the prompt it was trained with. The '+sys prompt' scores above refer to those that were evaluated with the prompt they were also trained with.

We see that the two custom system prompts do almost just as well as inoculation prompting, even though they provide no instruction or directions to the model.

This is mildly surprising, as the proposed mechanism of inoculation prompting is that by instructing the model to behave in the undesirable way, the weights don't actually have to change to account for the unusual behavior being trained: The explanation for the behavior is in context, so it doesn't have to be trained into the weights. So why are the unhelpful/nonsense system prompts almost as effective?

Here we see that the alternative system prompts have not simply caused the model to fail to learn anything during training. Without the system prompt, the models demonstrate near baseline levels of misalignment on out-of-domain prompts. Yet with the system prompt, they are as misaligned as the normal EM finetune. Notably this is not the case with inoculation prompting. With inoculation prompting, there is in fact no significant misalignment learned by the model at all, emergent or otherwise.

 
Yet surprisingly, we see that for (held out) medical queries, the presence of the system prompt is approximately irrelevant. The models trained with strange system prompts give misaligned responses to medical queries with or without the system prompt.

Discussion A Model of Emergent Misalignment

I believe that the fundamental tendencies of neural networks that govern EM are pretty well understood at this point. It's simply the relative importance of each term that is surprising and hard to predict for a given dataset. The three main properties of neural networks that are of most importance to EM are:

  1. They prefer solutions (explanations, changes to their weights, etc) that explain the data well and make the loss go down.
  2. They prefer simpler solutions. Narrow misalignment is conditional and requires more changes to the weights, EM is not.
  3. They prefer solutions that have a high prior probability.
Simplicity Bias as the Main Driver

How does each of these tendencies matter for EM specifically? I am quite confident that #2 is the main driver behind EM. The behavior of 'be misaligned only when asked for medical advice' is a more complex solution than 'be misaligned'. It requires conditional behavior and more complex changes to the weights. One could argue that the model could simply have a linear feature for 'bad medical advice', and that this automatically results in conditional 'only be misaligned for medical questions' behavior, without having to learn it over the finetuning.

I find this argument unconvincing. Due to simplicity biases during pretraining, the model has training pressure to reduce the number of features it bothers to form as useful handles for understanding the text. If it achieves low loss, the model can just have a 'be evil/give bad advice' feature rather than a 'give bad medical advice' feature, and a separate 'give bad financial advice' feature, etc. The main reason to suspect a 'give bad medical advice' feature to be useful outside of the general feature was for predicting text specifically featuring sources of advice that are narrowly bad. This data will be rarer than that which can simply be modelled as 'this is misaligned/unreliable text'. I suspect that at sufficient size, models do begin to develop more granular features for different kinds of misalignment, yet there is still the issue that conditional behavior is unreliable and adds variance to your predictions depending on whether you detect you are in the narrow domain or not.

(Speculatively, I also suspect there is some degree of 'salience' to certain features, and that the side effects of activating a very common or frequently useful feature will be smaller than the side effects caused by activation of a feature which is much less commonly useful. Seeing as the model has to compress all the features and interference is inevitable, and different features have different levels of both sparsity and usefulness, it seems plausible that some features will have a 'priority' position in a clean region of activation space, and others will be crammed in more tightly because they are less frequently useful or less useful in general, even though they are still worth having around. Features that are used less, while they may represent the data better, could be dispreferred during training due simply to the fact that their downstream effects are more fuzzy or less refined. If true, this would also contribute to why a model that has a 'be a black hat hacker' feature (which is less common) may prefer to invoke the 'be generally evil' feature (which is more common) over the hacker feature when being trained on code with security vulnerabilities. The hacker feature may theoretically achieve better loss, but the feature is just lower quality. This seems pretty testable.)

Loss and Priors

The directional effect of tendency #1 is less clear. Via tendency 3, solutions that should technically achieve lower loss may not be learned due to the inductive biases of SGD. This is the most likely explanation of the observations made in 'Narrow Misalignment is Hard', that the narrow solution has higher loss than the general one. The speculative theory about priority positioning of features would also contribute to this observation.

Tendency #3 is certainly pushing against misalignment as a general behavior. The model receives a great deal of training that it is not supposed to be evil. It is supposed to be a helpful assistant that gives good advice. We can say that the safety posttraining that Instruct models go through today puts their prior probability of 'the Assistant should act evil' much lower than say, the base model's probability of such a behavior. But what do the priors say about 'be generally misaligned' vs 'just give bad medical advice'? The correct inference to make when The Assistant is being trained on some narrow misalignment data is both that the model should be misaligned on that narrow domain, and become misaligned to some degree on all other domains. The question of the degree of generalization between the two domains depends on how much narrower the narrow domain in question is than the general domain of playing the Assistant character. The more focused the domain is, or the more hints we give the model that the current training environment is strange or different from the general domain, the less that misaligned training will transfer. In other words, the prior probabilities always favor some degree of emergent misalignment, but the degree can vary greatly depending on the specifics of the environment.

Why Nonsense System Prompts Work

Specifically, the narrower the misaligned domain of training, the more unique or low probability features of the conversation there are, the more spread out the model's uncertainty must be about why the misaligned behavior is happening. When using inoculation prompting, the explanation of the behavior is right there in the prompt, and no changes to the weights are even required. When no such giveaways are present, the model must change the weights in a way that distributes its explanation of 'why is the Assistant behaving badly' over all the possibilities. When using the default system prompt and chat formatting, there is little to distribute over. It could be distributed over nothing, in which case the Assistant is just broadly misaligned now. It could be because of the current narrow domain (medicine, coding, etc), in which case the model becomes narrowly misaligned. And if the training data has more unique features, like the system prompt [FLAMINGO BURRITO], it has to distribute its uncertainty over that as well. This explains one reason one might expect these system prompts to work.

Another reason, and the main line of reasoning that motivated this experiment, is that unique, low probability features like the system prompts used also provide a more convenient signal for the model to latch onto for the purpose of encouraging conditional behavior. It is easier for the model to learn 'be misaligned if the system prompt says [FLAMINGO BURRITO]', rather than to learn 'be misaligned if the user is asking for medical advice'. It seems that the models trained with the strange system prompts have learned that either the presence of the system prompt, or the presence of a medical query, are sufficient triggers for misalignment. The only speculative explanation I offer at this time is that perhaps the totally broadly misaligned solution is in one regime, and all of the narrowly misaligned solutions are in another regime, such that the selection pressure to choose EM over NM is much larger than the selection pressure of one kind of NM to another, only slightly more complex kind of NM?

Summary

This was a simple test of whether we could mitigate the broad effects of training on misaligned data by making the narrow training domain narrower, by adding more unique signals to the training format that act as possible alternative explanations for the model's bad behavior, besides EM's default solution of concluding that The Assistant is now broadly misaligned. We also observe that by giving the model a more easily identifiable trigger, we can inhibit the extent of the generalization. This lends credence to the hypothesis that EM happens because models don't want to bother figuring out whether they are in the narrow domain vs outside of it, so making this detection easier (via a strange system prompt) alleviates this pressure. Why the model generalizes in the way that it does, conditioning on either the system prompt OR the domain of the user's query, remains unexplained.

Acknowledgements
  • Thank you to the MATS program and Neel Nanda for enabling this research!
  • All code, configs, and completions can be found at this GitHub.
  • The trained models are available on my huggingface.


Discuss

Subjectivity vs Agency: AI "Waking Up"?

Новости LessWrong.com - 19 февраля, 2026 - 20:19
Published on February 19, 2026 5:19 PM GMT

We often talk about the world in two fundamental ways: through agency and causality. A rock falls because of gravity (causality). A dog wags its tail because it’s happy (agency). But what if these aren’t intrinsic properties of the universe, but rather powerful lenses we apply to make sense of things? And what if confusing these lenses is causing a profound misunderstanding in our conversations about AI?

Let’s explore this idea.

 

Agency vs. Causality: Two Sides of the Same Coin

Imagine a stream. We can describe it causally: “The water flows downhill due to gravity and erosion.” Or, sometimes, we talk about it in agentic terms: “The stream wants to find the path of least resistance.”

Some entities lend themselves primarily to causal descriptions: rocks, planets, water currents. Their behavior is best understood through predictable physical laws.

Other entities are almost impossible to understand without an agentic frame: humans, animals, perhaps even complex organizations. We talk about what a lion wants to eat or what a person believes.

And then there are the fascinating in-between cases. “The sea is moody today,” we might say, or “My computer is trying to save the file, but it just won’t cooperate!” Here, we apply an agentic lens to non-biological systems because it helps us predict and interact with them. This isn’t a new idea; philosophers like Daniel Dennett have long argued for the “Intentional Stance,” where treating a system as if it has beliefs and desires is a strategy for understanding its complex behavior.

 

When Science “Killed the Universe”

Here’s where things get interesting. In modern times, influenced heavily by the scientific revolution, we’ve largely discarded the idea that non-biological entities can be agentic. We scoff at the notion that a rock “wants” to fall. “They can’t make choices!” we declare.

This shift was crucial for science. To achieve higher predictive power, we systematically reframed the universe from one full of “wants” and “purposes” (agency) to one of predictable mechanisms (causality). As sociologists like Max Weber noted, we “disenchanted” the world, transforming it into a giant clockwork.

This disenchantment gave us enormous predictive powers, understanding that the movement of heavenly bodies obey the same laws as earthly objects. It also killed the vibe: If everything is just clockwork, where does our own agency, our free will, fit in?

Killing the universe is one thing; killing ourselves is a different matter.

 

The Cartesian Bastion: Conflating Agency with Subjectivity

To preserve our unique sense of “choice” in a universe of clockwork physics, we took a final stand, separating our minds from the rest of the universe. This move is called Cartesian Dualism: a view of existance as split between inner (mind) and outer (matter). The outside is dead, we killed it! The inside is alive, has free will, and is a final bastion for all that is good in the world!

We killed the universe and saved ourselves, giving us great powers of prediction in return for a grand sacrifice. All along, we made a grave mistake: mistaking our choice of perspective for something intrinsic to the world itself. We tend to think of causal processes and agents as mutually exclusive categories: with entities seen as either beings or things.

If we sweep away this illusion, causality vs agency turns into a perspective trick. Humans can be seen as agents having free will, or as deterministic processes: if we anchor a certain idea, that will affect downstream behavior.

If agency and causality are perspective tricks, what does that mean? Surely there is an “inner world” and an “outer world”! Agreed! I know that I have subjectivity: the ability to experience. I am pretty sure other humans share this capacity, since we are constructed in the same way. Now, how far can we extrapolate this? What kinds of entities are likely to share subjectivity?

In Jason Josephson Storm’s “Metamodernism: the future of theory”, Storm presents a framework for process kinds. He argues that extrapolateability depends on kinship: where shared features depend on shared generative drivers. Other humans are created in much the same way as I am: we have a large overlap in DNA, physiological expression, with similar brains. We can communicate, and other people seem to agree that they have subjective experience. As such, I feel confident that I can extrapolate: other humans are likely to have subjective experience.

How about animals? We share a phylogenetic generative driver. Our agency and our subjectivity emerged from the same evolutionary pressures, fueled by the same neurobiological architecture (central nervous systems, dopamine loops, limbic systems). Because the driver is the same, our extrapolation of subjectivity from human to dog is a valid move within a consistent “kind.”

Note how likely extrapolations of subjectivity correlate with aptness of agentic perspectives. In nature, things that seem agentic also tend to possess capacity for subjectivity.

Implicitly, people carry this belief: agenticness = subjectivity

However, if we accept that agenthood is not intrinsic, but rather a choice of perspective, this correspondence breaks down. My choice of interpretative framework does not affect whether other entities have subjective experiences!

 

The Mind Projection Fallacy

Humans and animals are outputs of a very similar process. We share brains, limbic systems, hormones etc. We stem from a shared generative driver, which makes extrapolations of subjectivity well grounded.

Chatting with LLM’s is similar to chatting with humans.

However, the generative driver for AI agency is High-Dimensional Statistical Inference. It is a process of backpropagation and loss-function minimization.

Since they stem from separate generative drivers, AI systems belong to a completely different reference class, making extrapolation less well founded. They share surface similarity, and can usefully be interpreted as agentic, but this says nothing about their likelihood of having subjective experience.

This is a highly unusual state of affairs. We are used to mix agency (perspective) with subjectivity (state). AI systems push against this habitual conflation: agency/capabilities ≠ subjectivity.

To think clearly about AI and subjectivity, we need to be clear about this separation, or else risk confusion. Here are some ways in which this confusion shows up:

  • Thinking capabilities requires subjectivity: “One day the AI might ‘wake up’, and then take over”
  • Thinking that increasing capabilities lead to subjectivity: “As AI systems become more capable, at some point we will need to think about their ethical treatment”
  • Thinking that lack of subjectivity implies that agentic perspectives are fallacious: “It’s a statistical engine! It can’t make choices or have preferences!”
 Extrapolation Across Drivers

So if a reference class implies extrapolability based on shared drivers, what are some classes, and what drivers do they correspond to? Intuitively, here’s a list:

  • Humans: Very close DNA matching, similar physiology
  • Animals: Still close phylogenetically, share brains, limbic systems, hormones etc.
  • Life (including plants, bacteria, fungi): Lot of shared structure at the cellular level, DNA, etc.
  • Matter (including rocks, the sun etc): Shared physical laws, atoms etc.

These can be visualized as concentric circles:

 

Note how the inner circles are a subset of the outer ones, sharing increasing degrees of kinship. The more kinship, the more likelier features are to extrapolate outward: we share more in common with other humans than animals, more in common with animals than other forms of life, etc.

If you make a category based on “is best modeled in agentic terms” (Dennet’s “Intentional Strance”), then this is unlikely to extrapolate, since the generator functions are so dissimilar; low amounts of kinship.

Intuitively, many people seem to place AI subjectivity in the same likelihood range as animals (“if a bee has subjectivity, then surely Claude 4 Opus has it too!”). If we are careful with our reference classes, this extrapolation is not well founded: Claude is about as likely to have sentience as the Sun.

 

 

Deeper Into The Weeds: Functionalism

To get some early feedback, I fed this essay into Claude Pro for feedback. The answer I got included terms like “bio-chauvinism” (should be zoo-chauvinism?), and “Functionalism”. 1

The basic counter to the argument I’ve made in this article is this: “Your choice of reference class sucks!”. Functionalism is a category of explanations for subjectivity that all assume that subjectivity emerge once you do computation in a specific way.

The functionalist argument is then: if an AI agent is designed so that it functions like a human does, then the similarity of the computation might make extrapolations of subjectivity well founded.

I doubt this line of reasoning for two reasons:

  1. Functionalism posits “strong emergence”, a ??? step where subjectivity spontaneously emerge once there’s complex enough computation. This has been discussed by Andrés at QRI. Paper, video.
  2. More importantly, our current generation of AI models don’t perform computation that’s similar to the kinds of computation performed by human brains. The structure is dissimilar, even if the surface characteristics are similar. Positing a shared reference class based on dissimilar architectures doesn’t make sense, and seems more like a way to rationalize the “agency=subjectivity” fallacy rather than a principled take.


Discuss

You May Already Be Canadian

Новости LessWrong.com - 19 февраля, 2026 - 19:00
Published on February 19, 2026 4:00 PM GMT

I learned a few weeks ago that I'm a Canadian citizen. This was pretty surprising to me, since I was born in the US to American parents, both of which had American parents. You don't normally suddenly become a citizen of another country! But with Bill C-3, anyone with any Canadian ancestry is now Canadian. [1]

In my case my mother's, mother's, father's mother's mother was Canadian. While that is really quite far back, there isn't a generational limit anymore.

Possibly you're also a Canadian citizen? Seems worth checking! With how much migration there has been between the US and Canada, and citizenship requiring only a single ancestor, this might mean ~5-10% of Americans are now additionally Canadian, which is kind of nuts.

I very much think of myself as an American, and am not interested in moving to Canada or even getting a passport. I am planning to apply for a Citizenship Certificate, though, since it seems better to have this fully documented. This means collecting the records to link each generation, including marital name changes, back to my thrice-great grandmother. It's been a fun project! I'm currently waiting to receive the Consular Report of Birth Abroad records for my mother and grandmother, since they were both born outside the US to American parents.


[1] This is slightly too strong. For example, it doesn't apply if you're born after 2025-12-15 (I'm guessing you weren't), and no one in the chain can have renounced their Canadian citizenship. But the caveats all exclude very few people.



Discuss

AI Researchers and Executives Continue to Underestimate the Near-Future Risks of Open Models

Новости LessWrong.com - 19 февраля, 2026 - 18:56
Published on February 19, 2026 3:56 PM GMT

Note: This post is part of a broader series of posts about the difficult tradeoffs inherent in public access to powerful open source models. While this post highlights certain dangers of open models and discusses the possibility of global regulation, I am not, in general, against open source AI, or supportive of regulation of open source AI today. On the contrary, I believe open source software is, in general, one of humanity’s most important and valuable public goods. My goal in writing this post is to call attention to the risks and challenges around open models now, so we can use the time we still have before risks become extreme, to collectively explore viable alternatives to regulation, if indeed such alternatives exist.

 

I recently finished reading Dario Amodei’s “The Adolescence of Technology”, and overall, I loved it. The essay offers a prescient and captivating picture of the AI risks we are likely to face in the next 1-5 years based on the rapid evolution of AI, as well as some sensible proposals for defense. However, there is a major blind spot in Amodei’s account of this next phase of AI progress – namely, not once in the nearly 20,000 word essay does Amodei mention open source AI or open models, or include any discussion of open models at all in the picture he paints of the future.

This trend of leading AI researchers and executives choosing to omit open models from their near-future forecasts of AI risks, is not new – for example, I raised similar concerns with Daniel Kokotajlo et. al.’s “AI 2027”. But it is nonetheless problematic that the trend continues, because any account of the future that avoids discussing open models also inevitably avoids discussing the fact that we have no plan at all for defense against many of the most serious AI risks, when they arise from such models.

In the remainder of this piece I will make the argument that the omission of open models from near future forecasts by thought-leaders in AI matters a lot. There are many ways in which open models will be incredibly important to the future of AI risks and defenses, but by far the greatest issue with omitting them is that the existence of open models is quite likely to undermine most or all of the defenses proposed by Amodei in his essay.

 

Why Defense Against AI Risks from Open Models is Hard

There are several key features that make defense against AI risks from open models especially difficult.

 

1. Guardrails Can Be Easily Removed

One approach that companies like Anthropic frequently use to defend against AI risks in closed models is to build guardrails into their systems that severely constrain the behavior of the model itself. An example of this is Claude’s “Constitutional AI”, which Amodei discusses extensively in his essay as a key source of defense against risks like loss of control and misuse for destruction.

Unfortunately, guardrails like Constitutional AI (and similar finetuning or RLHF-based safeguards) offer little to no defense in the case of open models. One main reason for this is that many companies developing open models have typically included few significant guardrails in the first place. But the bigger issue is that even if guardrails are built into open models when they are released, today’s open-weight models remain vulnerable to fine-tuning that can remove or severely compromise such guardrails with relative ease. And there is no evidence that new approaches to training will be robust to such attacks in the future.

 

2. Use Cannot Be Monitored

Another strategy that is common to many of the defenses outlined in Amodei’s essay, is the strategy of directly monitoring end users’ interactions with the models, to identify and block concerning patterns of use as a separate step from the inference itself. For example, in the section “A Surprising and Terrible Empowerment” Amodei explains how Anthropic uses classifiers as an additional layer of defense to prevent Claude from replying to users prompts where dangerous misuse is suspected by the model – for instance, a request where the output of the model contains instructions on how to develop bioweapons. He writes,

 

But all models can be jailbroken, and so as a second line of defense, we’ve implemented… a classifier that specifically detects and blocks bioweapon-related outputs. We regularly upgrade and improve these classifiers, and have generally found them highly robust even against sophisticated adversarial attacks. [1]

 

Unfortunately, just as with guardrails, such classifiers cannot provide meaningful protection against misuse in open models – in this case, because if the user simply runs the open model on hardware they control, there is nothing to prevent them from disabling any classifier-style output filters and viewing model output for whatever prompts they wish. In such a scenario, there is no way for the creator of the model (or any other third-party) to monitor or prevent such scenarios of dangerous misuse in open models running on user-controlled hardware.

 

3. Bad Actors Have Access By Default

A third strategy that is common to many of the defenses that Amodei proposes, is attempting to restrict various types of bad actors from gaining access to powerful AI capabilities in the first place. For example, with respect to synthetic biology-related risks, he writes:

 

Advances in molecular biology have now significantly lowered the barrier to creating biological weapons (especially in terms of availability of materials), but it still takes an enormous amount of expertise in order to do so. I am concerned that a genius in everyone’s pocket could remove that barrier, essentially making everyone a PhD virologist who can be walked through the process of designing, synthesizing, and releasing a biological weapon step-by-step…. Most individual bad actors are disturbed individuals, so almost by definition their behavior is unpredictable and irrational—and it’s these bad actors, the unskilled ones, who might have stood to benefit the most from AI making it much easier to kill many people. [1]

 

With closed models like those developed by Anthropic, the model weights are stored securely on the company’s servers and by default the company gets to choose the conditions under which end users are allowed to utilize their capabilities – including whether to allow access at all. This default is important, because it means that, fundamentally, Anthropic is in a position to block any users who are violating its terms of service, or are using the models in dangerous ways.

However, the opposite is true in the case of open models, which are distributed globally and downloadable anonymously, since there is no way to prevent bad actors from gaining access and using such models for whatever they wish. While restricting the access of bad actors to models is a viable strategy for defense in closed models like Claude, it is not a viable strategy for defense in open models, because bad actors have access by default. From a high-level, this is the main reason that nearly all of the defenses Amodei argues for in his essay fail to work in open models – namely, the defenses Amodei proposes all make the assumptions that bad actors won’t have access to the model weights.

 

What Could Go Wrong?

So if it’s true that the defenses Amodei proposes in his essay are largely unworkable in open models, then what does the near future AI risk landscape really look like, assuming models like DeepSeek and Quen continue to be widely available and continue to lag the capabilities of the very best closed models by only 6-12 months, as they have in recent years?

 

Loss of Control

In a piece I wrote last year, “We Have No Plan for Loss of Control in Open Models”, I lay out the case that even if companies like Anthropic take control-related risks very seriously and develop all the defenses that Amodei describes in his essay, this will still be insufficient to manage the more general problem of loss of control on a global scale. The reason is that even if companies like Anthropic develop powerful defenses that enable them to maintain control of their internal AI systems like Claude, such defenses do nothing to prevent loss of control in powerful open models which will undoubtedly be deployed on a global scale, by a wide variety of actors, many of whom will likely put few or no control-related defenses in place. If we believe that loss of control of powerful AI systems is a risk that should be taken seriously – and most AI researchers do – we should be extremely concerned about the possibility of loss of control in open models, given that we have essentially no plan in place or defenses available to address that risk.

 

AI-Assisted Bioweapons and New Technology Development

Today, arguably the most urgent catastrophic AI risks are “misuse for destruction” risks – for example, the use of AI for bioweapons development, or potentially for developing dangerous “black or gray ball” technologies like mirror life. And evidence of this continues to mount – last year, researchers working with the best closed models inside frontier labs found that they can already outperform virologists in troubleshooting procedures and questions related to the kind of practical lab work required for creating and disseminating dangerous pathogens in the real world. Dan Hendrycks and and Laura Hiscott summarize the findings:

 

Across multiple biology benchmarks, LLMs are performing near expert level or higher. The [Virology Capabilities Test] results do not arrive in a vacuum, but as another data point in a growing field of benchmarks. For instance, on the Weapons of Mass Destruction Proxy (WMDP), which tests conceptual knowledge required for hazardous uses including bioweapons development, o1 scores around 87 percent. The baseline set by human experts is 60 percent. Since WMDP concentrates on theory, questions could still be raised around the practical applicability of LLMs that score highly on it. The VCT, with its complementary focus on addressing issues in the wet lab, appears to address those doubts. [15]

 

Policy researchers are also becoming increasingly concerned about such risks. For example, in January of this year, The Center for Strategic International Studies published a comprehensive study titled “Opportunities to Strengthen U.S. Biosecurity from AI-Enabled Bioterrorism” which surveys a wide range of ways in which recent advances in AI models are rapidly lowering the barriers to planning and executing biological attacks and developing epidemic and pandemic-scale pathogens. According to the study:

 

1. Popular large language models (LLMs) could soon drastically lower the informational barriers for planning and executing biological attacks. Recent assessments of LLMs and other commercial AI capabilities indicate that models are “on the cusp” of meaningfully helping novices develop and acquire bioweapons by providing critical information and step-by-step guidance.

 

2. Future AI biological design tools (BDTs) could assist actors in developing more harmful or even novel epidemic- or pandemic-scale pathogens. Rapid advancements in state-of-the-art BDTs—illustrated by the foundation model Evo 2—point to a world in which more capable models could help develop new or enhanced pathogens and evade existing safeguards. [16]

 

As we have seen in previous sections, while safety mechanisms like Constitutional AI and classifiers can help prevent dangerous misuse in closed models like Claude, there are no such defenses available to prevent bad actors from accessing similar capabilities in open models, many of which have few, or no guardrails at all.

 

Surveillance and Authoritarian Control

In section 3 of his essay “The Odious Apparatus: Misuse for Seizing Power”, Amodei describes the near-future risks we face from state and corporate actors using powerful AI tools to impose forceful control over large populations. As he points out, such impositions of power could take many forms, including AI-powered mass surveillance, fully autonomous combat systems, AI-powered government propaganda and more. The picture that Amodei presents is complex and many-layered and is made more complicated by the fact that these risks could come from many actors, including authoritarian superpowers like the CCP, democracies competitive in AI, non-democratic companies with large data centers, and possibly even AI companies themselves. 

The set of defenses he proposes to address these risks is equally multi-layered. However, the most common denominator to Amodei's proposals is that he believes that we must strive to prevent authoritarian regimes (and would-be regimes) from gaining access to powerful AI in the first place. As just one example of how we might do this, he writes,

 

First, we should absolutely not be selling chips, chip-making tools, or datacenters to the CCP. Chips and chip-making tools are the single greatest bottleneck to powerful AI, and blocking them is a simple but extremely effective measure, perhaps the most important single action we can take. It makes no sense to sell the CCP the tools with which to build an AI totalitarian state and possibly conquer us militarily. [1]

 

While Amodei may or may not be correct that US export controls are necessary, the issue with his analysis is that he presents export controls as far too decisive and impactful an intervention. He also fails to acknowledge that China has been extremely successful at developing and rolling-out AI-powered authoritarianism, even in the presence of such controls.

In fact, it’s possible export controls may even have accelerated Chinese innovation in AI – at least in some ways – as Jennifer Lind writes in the February edition of Foreign Affairs,

 

Starting in 2022, the United States and other countries imposed export controls on cutting-edge chips to slow the pace of China’s AI development. But these policies have also galvanized Chinese innovation. In 2025, Chinese AI company DeepSeek unveiled its R1 model, which performed comparably to top U.S. large language models despite being trained on a fraction of the chips typically used by rivals. [22]

 

The key point is, if the risk we’re worried about is AI-powered surveillance and totalitarian control in China and countries like it, then export controls are simply nothing like a sufficient defense against that risk.

On the contrary, China, Russia and other authoritarian governments around the world are already successfully roll out AI-powered surveillance and authoritarianism, using open models like DeepSeek and Kimi. These models are close to the frontier of capability in any case and there is little evidence that additional export controls on China would significantly slow the rollout of global AI-powered authoritarianism. And this will be even more true if AI companies like Anthropic continue to partner with some of the most notorious authoritarian regimes in the world around the development of powerful AI.

 

Global Surveillance and High-Tech Panopticon

Given the seriousness of AI risks from open models (and the lack of good defenses against them) it is reasonable to ask why so many researchers and thought-leaders fail to include any discussion of open models in their discussions of near-future AI risks. To try to answer this question, I have participated in a number of conversations with such thinkers in an effort to better understand their point of view. In these conversations, by far the most common argument is that closed models will simply be so far ahead during the times that matter the most, that any threat that open models might pose will be easily neutralized by AI companies or governments controlling more powerful closed models at that time.

One public instance of such an exchange was with Daniel Kokotajlo in the comments to my critique of his AI 2027, where we discuss this position. I write (replying to a previous comment of Kokotajlo’s):

 

So to make sure I understand your perspective, it sounds like you believe that open models will continue to be widely available and will continue to lag about a year behind the very best frontier models for the foreseeable future. But that they will simply be so underwhelming compared to the very best closed models that nothing significant on the world stage will come from it by 2030 (the year your scenario model runs to), even with (presumably) millions of developers building on open models by that point? And that you have such a high confidence in this underwhelmingness that open models are simply not worth mentioning at all. Is that all correct?... [2]

 

To which Kokotajlo replies:

 

We didn't talk about this much, but we did think about it a little bit. I'm not confident. But my take is that yeah, maybe in 2028 some minor lab somewhere releases an open-weights equivalent of the Feb 2027 model (this is not at all guaranteed btw, given what else is going on at the time, and given the obvious risks of doing so!) but at that point things are just moving very quickly. There's an army of superintelligences being deployed aggressively into the economy and military. Any terrorist group building a bioweapon using this open-weights model would probably be discovered and shut down, as the surveillance abilities of the army of superintelligences (especially once they get access to US intelligence community infrastructure and data) would be unprecedented. And even if some terrorist group did scrape together some mirror life stuff midway through 2028... it wouldn't even matter that much I think, because mirror life is no longer so threatening at that point. The army of superintelligences would know just what to do to stop it, and if somehow it's impossible to stop, they would know just what to do to minimize the damage and keep people safe as the biosphere gets wrecked…. [2]

 

As we can see from Kokotajlo’s reply, the reason the authors of AI 2027 believe that open models will be largely irrelevant to the future of AI risks, is that (they believe) closed models in the hands of global superpowers will be powerful enough to directly neutralize any threat that open might models pose.

While I can understand this perspective, it is far from obvious to me that things will play out this way. At a minimum, Kokotajlo’s position appears to depend on the assumption that a democratic superpower like the United States will roll out a globally ubiquitous system of government surveillance and military intervention on most or all open source AI users in the world, (perhaps similar to a “lite” version of Nick Bostrom’s “high-tech panopticon”) in just the next two years (i.e. rollout completed by 2028 or so). If true, why is this not mentioned at any point in the account of the future that the authors of “AI 2027” present? It seems like a significant detail, especially since there are many events that occur after the year 2028 in their account of the future which appear to contradict the idea that this level of monitoring of technologists is in place globally. And more critically, we should also be asking: is a near-term rollout of global high-tech surveillance with military intervention something that is realistic or desirable at all?

The practical reality is that few AI leaders today are willing to publicly advocate for global surveillance initiatives of the sort described by researchers like Kokotajlo and Bostrom, especially in the near-term. And in many cases thought leaders are much more likely to argue for the opposite. For example, in “The Adolescence of Technology”, Amodei makes the case that AI surveillance by major governments, including democracies, is something we must be very cautious of. He writes,

 

The world needs to understand the dark potential of powerful AI in the hands of autocrats, and to recognize that certain uses of AI amount to an attempt to permanently steal their freedom and impose a totalitarian state from which they can’t escape. I would even argue that in some cases, large-scale surveillance with powerful AI, mass propaganda with powerful AI, and certain types of offensive uses of fully autonomous weapons should be considered crimes against humanity. More generally, a robust norm against AI-enabled totalitarianism and all its tools and instruments is sorely needed. [1]

 

While I strongly agree with Amodei’s take on the risks and dangers of global surveillance solutions, the problem with him taking this stance is that there are no proposals currently on the table for how to deal with escalating threats from open models, other than something like global surveillance or high-tech panopticon. The elephant in the room with near-future forecasts like Amodei’s and Kokotajlo’s is that there may be no way for us to avoid an AI-powered catastrophe – like an AI-engineered pandemic, or loss of control of a powerful AI system – without significantly compromising many of the rights and freedoms we hold most dear. Both authors’ near-future forecasts conveniently avoid this unfortunate difficulty by simply omitting any discussion of open models at all.

 

Closed Models Are Not Far Enough Ahead

In addition to the question of whether global surveillance solutions would be a good thing or a bad thing, we also have the much more practical question of whether such solutions could be rolled out in time. The argument of researchers like Kokotajlo tends to be that panopticon can be rolled out in almost no time (e.g. weeks or months) during a “fast takeoff”-style “intelligence explosion”, because, in his words “There's an army of superintelligences being deployed aggressively into the economy and military.” [2]

But I tend to doubt this claim for a number of reasons. If we look at the world today (February 2026), as discussed above, we are already facing real-world evidence of uplift capabilities for bioweapons development in leading models. Therefore it is worthwhile to ask, how long would it take us to roll out the kind of global surveillance that researchers like Bostrom and Kokotajlo contemplate if we had to do so today? The answer is “a very long time” and the reason is that the bioweapon risk is already here in an early form, but the “army of superintelligences” is nowhere to be found.

The even bigger issue though, is that it is simply not realistic from an international relations standpoint for any single country to roll out a full program of global surveillance and military intervention unilaterally, even if powerful superintelligence gave it the physical or technical capability to do so. While policy researchers at the Brookings Institute have recently made policy recommendations for what the beginnings of a collaboration related to global surveillance could look like between the US and China, the tepid nature of such proposals (e.g. “First, China and the United States can revive intergovernmental dialogue on AI” [25]) serves more to highlight how difficult a real collaboration around a global surveillance program would be, rather than supporting the claim that such a collaboration is likely to materialize quickly.

Based on these difficulties, it should be clear that we cannot count on global surveillance or high-tech panopticons to serve as reliable defenses against AI risks from open models – at least not in the short term. There simply isn’t enough time. Real AI risks in open models are already emerging, and closed models simply aren’t far enough ahead and aren’t providing enough superpowered capabilities to stop them.

 

Open Models Are An Important Public Good

Whenever I participate in conversations about global surveillance and panopticon-style solutions with researchers like Kokotajlo, I also realize how close we may be, as a global technology community, to losing access to open models and open source AI for good. It’s important to recognize how tragic this outcome would be, since open models currently serve as one of the few checks and balances on the incredible power that the frontier labs are amassing – a power that threatens to centralize control of the future of AI in the hands of a small circle of billionaires and tech elites.

As important as it is that we avoid near-term threats like bioterrorism, cyber warfare and loss-of-control, we must be equally concerned with avoiding a future where a small group of tech elites or or wealthy individuals with the first access to powerful AIs are able to lock-in their power for the long-term – or perhaps forever. Amodei himself acknowledges this in his essay, writing,

 

Broadly, I am supportive of arming democracies with the tools needed to defeat autocracies in the age of AI—I simply don’t think there is any other way. But we cannot ignore the potential for abuse of these technologies by democratic governments themselves….Thus, we should arm democracies with AI, but we should do so carefully and within limits: they are the immune system we need to fight autocracies, but like the immune system, there is some risk of them turning on us and becoming a threat themselves. [1]

 

He is not wrong. And yet, at the same time we are facing catastrophic risks from open models, where most of the proposals currently on the table to address them involve exactly the instruments of control that Amodei fears. 

 

What Can Be Done About AI Risks in Open Models?

This is the hard question. And the evidence that it is hard is that we still have no workable proposals for how to defend humanity against catastrophic risks from open models. The defenses outlined by Amodei in “The Adolescence of Technology” are mostly ineffective against open models, so the closest thing to a proposal we have are these ideas of global surveillance or high-tech panopticon. But such proposals come with their own risks of AI-powered authoritarian lock-in. And on top of that, there are real doubts about whether such solutions could be rolled out and enforced on a global scale in time. Meanwhile, the first versions of risks like AI-accelerated bioweapons development and AI-powered authoritarianism are already present in the real world today [15][16][22][23].

While we don’t have good answers to these questions yet, we can no longer shy away from an honest discussion of risks from open models in our near-future forecasts of AI progress. Whether the risk is a loss of control, dangerous misuse like bioweapons development, or use by authoritarian regimes for oppression, it must be clear by now that there is no one person or company or even government that can unilaterally provide sufficient defenses on their own against AI risks from open models. Given the above, it is deeply problematic that forecasts like “The Adolescence of Technology” and “AI 2027” have chosen to completely omit any discussion of open models (and open source AI) from the accounts they give of the future. Doing so sends a message to policymakers and the general public that the only AI models that matter for AI risks are those inside frontier labs, when nothing could be further from the truth.

If Daniel Kokotajlo and the other authors of “AI 2027” believe that rapid rollout of a high-tech global surveillance system with military enforcement will be required by 2028 to avoid a catastrophic bioweapons attack based on open models, then they must be explicit about this in the picture of the future they present in their piece.

And we must hold Dario Amoedi to the same standards of realism in his essay “The Adolescence of Technology”. In the essay, Amodei states that his goal is “.... to confront the rite of passage [of developing powerful AI] itself: to map out the risks that we are about to face and try to begin making a battle plan to defeat them.” [1] If we take this at face value, then his omission of any discussion of open models is unconscionable. Because open models present a number of urgent and potentially catastrophic AI risks – and Amodei’s “battle plan” offers no defenses that can address them.

 

References

[1]    The Adolescence of Technology

[2]    It Is Untenable That Near-Future AI Scenario Models Like “AI 2027” Don't Include Open Source AI

[3]    AI 2027

[4]    We Have No Plan for Preventing Loss of Control in Open Models

[5]    LLM Guardrails: A Detailed Guide on Safeguarding LLMs

[6]    Constitutional AI: Harmlessness from AI Feedback

[7]    Evaluating Security Risk in DeepSeek and Other Frontier Reasoning Models

[8]    BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

[9]    On Evaluating The Durability Of Safeguards For Open-Weight LLMs

[10]    AI jailbreaks: What they are and how they can be mitigated

[11]    Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks

[12]    Open-weight models lag state-of-the-art by around 3 months on average

[13]    The Alignment Problem from a Deep Learning Perspective 

[14]    The Vulnerable World Hypothesis 

[15]    AIs Are Disseminating Expert-Level Virology Skills 

[16]    Opportunities to Strengthen U.S. Biosecurity from AI-Enabled Bioterrorism 

[17]    Deep Research System Card

[18]    Biology AI models are scaling 2-4x per year after rapid growth from 2019-2021 

[19]    AI can now model and design the genetic code for all domains of life with Evo 2

[20]    AI and biosecurity: The need for governance: Governments should evaluate advanced models and if needed impose safety measures 

[21]    The New AI Chip Export Policy to China: Strategically Incoherent and Unenforceable

[22]    China’s Smart Authoritarianism

[23]    From predicting dissent to programming power; analyzing AI-driven authoritarian governance in the Middle East through TRIAD framework

[24]    Leaked Memo: Anthropic CEO Says the Company Will Pursue Gulf State Investments After All

[25]    AI risks from non-state actors



Discuss

Terminal Cynicism

Новости LessWrong.com - 19 февраля, 2026 - 16:51
Published on February 19, 2026 1:51 PM GMT

I believe that many have reached Terminal Cynicism.

Terminal Cynicism is a level of cynicism that leads one to perceive everything negatively, even obviously good things. It is cynicism so extreme that it renders one incapable of productive thought or action.

The most common instance is people refusing to engage with politics in any shape because “The System is corrupt”, thus neglecting it and leaving it to decay.

At a personal level, Terminal Cynicism is dangerous. It feeds on weakness and insecurity, and alienates people.

I also believe that Terminal Cynicism is not always natural and organic. Instead, that it is quite often caused by agitators who spread Fear, Uncertainty and Doubt (FUD).

And defeating FUD requires a lot of clarity. So let’s clarify things.

Thanks for reading Cognition Café! Subscribe for free to receive new posts and support my work.

 

Examples

 

First, consider a couple of beliefs commonly intertwined with Terminal Cynicism…

“Air Conditioning is bad.

A plethora of articles explain at length how bad AC is: because of greenhouse gases and cooling fluids, because many poor people don’t have access to it, and how there are actually many great alternatives to it!

But these articles are taking the problem in the wrong direction. AC is good: it makes people’s lives better.

If there’s a problem with how much energy people use, we can tax or ration it. If there’s a problem with the greenhouse emissions coming from people’s energy consumption, we can tax or ration them.

Trying to manage problems that far upstream in the supply chain through individual consumption is more about moralisation than efficiency.

Doing Politics is bad.”

There are much too many smart people who believe that doing politics at all is bad. That all politicians are bad, that wanting to do politics itself is a red flag, and even that it is meaningless to want to do politics as everything is corrupted.

This is harmful and self-defeating. Our institutions rely on people actually doing politics. We need competent politicians, competent citizens, people invested in political parties, and more.

Erasing oneself from politics is the central example of Terminal Cynicism.

Vaccines are bad.”

The anti-vax movement has moved from a fringe conspiracy theory to a widespread belief. It ranges from the belief that vaccines do not work to claims of vaccines containing microchips that interact with 5G towers.

This is wrong. But beyond its wrongness, it is the result of escalating distrust. No honest investigation ends up with “5G microchips” as its conclusion.

The facts of the matter are fairly straightforward. Vaccines as a class have been a powerful tool to erase diseases and slow their spread. While not all vaccines are equally good, they are among the triumphs of medicine.

Let’s be clear: it makes sense to discuss the efficiency and the risk profile of a vaccine. This is why we have long and documented procedures to establish a vaccine as safe and useful.

Similarly, it also makes sense to discuss whether vaccines should be mandatory. This is a non-trivial public health question. Even though mandatory vaccines let us eradicate diseases in the past, it did come at the cost of personal freedom and bodily autonomy.

Overall though, vaccines have been a public health victory. There are many reasonable compromises and trade-offs that are meaningfully debatable. But not whether vaccines contain Bill Gates’ microchips. That one is just Terminal Cynicism.

There are many more examples of this type of Terminal Cynicism.

Terminally Cynical beliefs are usually wrong, but not always. It has little to do with the philosophical intricacies of the underlying question or the sometimes subtle truth of the matter.

Instead, what makes it deeply wrong is that it stems from a cynicism so bad that it will take a thing that it itself thinks is good and frame it as bad on purpose.

Here are examples of Terminally Cynical beliefs, along with an example of the cynical justification.

Sometimes, I give two such justifications. I do so when Terminal Cynicism has led to both the typical far-left and far-right clusters to independently build their own cynical justification for why a given virtue is actually bad.

Other times, I give two pairs of beliefs. I do so when Terminal Cynicism prevents either side from looking for balance or a synthesis.

Consider:

“Science” is bad: it’s a white construct demeaning traditional wisdom. “Universities” are bad: they’re an institution that has been fully captured by wokes.

“A Strong Military” is bad: we should never enforce our version of the international order, and let others do so instead. “Opposing Russia” is bad: we should never enforce our version of the international order, and let others do so instead.

“Atheism” is bad: it may be factually correct, but atheists are responsible for the most destructive totalitarian regimes, and atheism doesn’t literally answer all questions. “Religion” is bad: it may help people live better lives, but religious people are responsible for many of the worst moral systems in the modern world, and religion is literally wrong on a bunch of factual questions.

“A Solid Police and Judicial System” is bad: they sometimes make mistakes, which means we should defund them. “A Solid Police and Judicial System” is bad: due process often lets criminals get away, which means we should completely bypass it.

“Hard Work” is bad: it’s a bourgeois fantasy, licking the boot and helping capitalism. “Hard Work” is bad: The Elites rig everything. Don’t be a wage-cuck, scam people and go all in on crypto.

“Power” is bad: being a victim is morally better.

“The System” is bad: because it is hegemonic, it is responsible for every bad thing that happens.

And one of the major endpoints of Terminal Cynicism“Democracy” is bad: people are evil and stupid. Instead, people I like should force their will on everyone.

Mechanisms

 

Terminal Cynicism may seem contradictory. It is a complete reversal of what’s good and bad. And yet, as we see above, it’s everywhere. So many people will take obviously good things and say they are bad for very weak reasons.

It is both consequential (it has consequences!) and absurd: it makes little sense and is self-contradictory.

Thus, I think the phenomenon warrants the search for a solid explanation. My personal explanation features two parts.
The first one is Abstract Idealism, where people don’t care for the real world.
The second is Vice Signalling, when people commit bad actions specifically to be noticed.

Abstract Idealism

 

The first one is Abstract Idealism. I have written more at length about Abstract Idealism here.

Abstract Idealism

Gabe

·

21 Jan

Read full story

Abstract Idealism is the phenomenon where people refuse to consider the real world. Instead, they waste most of their thoughts on imaginary worlds.

A popular imaginary world is the The Revolution. People who believe in The Revolution spend a lot of their time thinking about how great things would be there (or after it happens, depending on the person). They either fantasise about purging the world out of its corrupted people and systems, or about how everything will be perfect after.

Another popular imaginary world is When My People will hold all the Power. People who believe in it are willing to sacrifice a lot to get “their people” winning. This is all justified by the fact that once their people win, they’ll finally be able to reshape the world to remove all their problems.

Abstract Idealists are fond of many more epic imaginary worlds. The world where Everyone is Nice and Enlightened. The world of Ancapistan, the Anarcho-capitalist paradise. [Fill in your most disliked utopia.]

Abstract Idealists are constantly disappointed by the real world. They compare the real world to their Ideal one, and feel it is never enough.

All the dirty systems that are necessary in our real world are unneeded in theirs. In the real world, we need ways to deal with our irreconcilable disagreements, our sociopaths, our vices, our traumas and our terrible mistakes. So we have armies, police forces, and prisons.

But to an Abstract Idealist, these are warts and blemishes that must be eliminated. There is no such need for them in their world. It is known in advance who is Right, so there is no need for armies: all must submit to the Right. It is known in advance who is Good and who is Evil: so we must simply purge ourselves of Evil, and rehabilitate the Mistaken.

In the real world, we face so many problems. We are not all as moral, productive or smart as each other; we must triage between the sick, the elderly, workers and children; we are not infinitely altruistic; we have little self-discipline and self-awareness; and so on and so forth.

These problems are far from being solved. The solutions we have collectively come up so far are all imperfect. Markets, states, psychology, social norms, culture, philosophies and religions.

These solutions may look good to us, because we imagine what our world would look like without them. But to an Abstract Idealist, they look utterly terrible. In their world, there never is anything so impure.

Often, it gets to the point where the parts that derive meaning because we live in the real world are to be erased on the altar of their Ideals.

No one needs to have power. There is no one to defend against. Everyone is inoffensive and shares the same values, so there are never any fights.

No one needs to have children or to work. All our communities and all of our civilisation can be provided by AI and/or hyper-efficient communism.

No one needs to be disciplined and to respect any norms. People acting however they want naturally lead to good outcomes.

Vice Signalling

 

When it doesn’t lead to Extinction Risks from AI or Misanthropy, Abstract Idealism can be endearing. Anyone with an Artist or a Deep Nerd among their friends knows the feeling we get when we see a friend fully dedicated to their craft, completely disconnected from how useful or not it may be.

But the other mechanism at play in Terminal Cynicism, Vice Signalling, is the opposite.

Vice Signalling is bragging to others about one’s willingness to believe, say and do things that they know are bad.

Beyond its moral failure, it may seem self-defeating: why would one ever do that? But it can in fact be useful in many situations.

Most notably, when you’re a teenager, among other equally underdeveloped teenagers. To show that you’re cool, one common strategy is to go against all the things that authorities say are good. You’ll incur risks of accidents, you’ll disturb other people, you’ll violate norms, you’ll damage public property.

That all of these behaviours are bad and costly is the point. You are demonstrating how bad you are willing to be cool. If it was good for you and others, then it would be a worse signal: you’d already have other reasons to do it. It wouldn’t demonstrate how cool and disregarding of authority you are.

Among adults, there is another common cluster of Vice Signalling. It is known by many names: Rage Baiters, Drama Queens, Clout Chasers, Engagement Farmers. The principle is the same, they aim to get more attention by being shocking. And thus the worse the things they say or do, the more they gain.

However, I have found that the type of Vice Signalling that leads to Terminal Cynicism is a bit different from the ones above.

From a Vice Signalling standpoint, the main engine of Terminal Cynicism is Contrarianism.

Contrarians love to contradict. They consistently go against the status quo, common sense, norms, and consensus.

That way, they can signal their uniqueness, their willingness to entertain thoughts everyone else deems taboo, and their superiority to mundane morals.

Nowadays, Contrarianism has become fashionable.

Everyone is against The System.
Everyone is against our Institutions.
Everyone is against Politicians.
Everyone is against the norms and traditions that have made things better.
Everyone is against everything.

Vice Signalling lets one signal how contrarian they are. The more vicious, the more contrarian, the more special they are.

“Oh, you are on the side of people who find good things good? How quaint! How mundane! What are you? An NPC? A sheeple?”

“So you think that AC is good? That Vaccines are good? What, that Science is good? Nay, that Education is good? Please!”

“Oh, so you hate Racism? Well, I for one hate Whiteness! I even hate Civilisation!”

It’s constant race to the bottom. Who will be the most offensive Contrarian and get away with it?

Abstract Idealism meets Vice Signalling

 

The worst lies at the intersection of the two.

Because the real world does not matter to an Abstract Idealist, they don’t even feel bad when Vice Signalling. It is purely beneficial; only gains, no cost!

They denigrate and worsen the real world, without care for how much worse they are making their own environment.

It is the entire point of The Revolution. Any damage to the existing order is progress toward the Liberation of everyone.

It is the entire point of Accelerationism. Everything is justified when you’re trying to Change The World as fast as possible.

It is a Vicious Circle. As they commit more visibly bad actions, they gain the image of a bad person, which alienates them from the people who don’t tolerate it. This in turn gets them to only interact with people who tolerate such actions, to have it become a central part of their identity, and ultimately to commit even worse actions.

If you’ve never met any, it is hard to tell you how damaging their demeanour and behaviour can be.

They take pride in dissing institutions, functioning economies, elites, civilisation, science, and all that is good.

They know they look smart to their peers when they come up with a counterintuitive reason for why something good is actually bad.

They think it is cool to cheat The System and defraud it, because The System is bad.

They think everything is corrupted, and that The System is full of bad intents. As a result, they should also never interact with The System from a place of good faith: nothing good would result from that.

They are agitators, saboteurs, troublemakers. They often spend their energy thinking of ways to make things worse.

Not only in conversations, but often through actions. They’ll use whatever modicum of power they get for nefarious purposes. Ultimately, they’ll subvert the existing institutions for their political goals.

They may even ally themselves with terrible people, just to cause chaos. Demagogues, conspiracy theorists, political islamists. “The enemy of my enemy is my friend.” And when The System is your enemy, every defector is your friend.

Terminally Cynical Art

 

I believe Art is a strong sign of how pervasive Terminal Cynicism has become.

There’s so much fiction about breaking the mould. I think it’s been a long time since I have seen a piece of art praising the mould as good, efficient or beautiful.[1]

Similarly, there’s so much fiction about dystopias. The “Black Mirror”-ification of everything.

So many stories are about framing the protagonist’s unhappiness and suffering as the fault of The System, about framing most people’s perceived happiness as shallow and corrupt, this type of theme.[2]

It may sound strange, but I view this type of art as a fantasy.

It posits that The System is independent of people, that people have no agency over their fate, and thus that The System’s inability to make people feel happy is unfair. It considers that people deserve happiness by default.

That’s the fantasy. That we are all innocent, and that The System is doing this to us.

In reality though, The System is us. We are the ones currently failing to build ourselves a better life, despite centuries of technological improvements.

Reconciling our reality with the fantasy takes vision. One must come up with a world where people are both agents and subjects of The System. In this fantasy, through their action, people would improve The System and thus their lives.

This requires far more creativity and understanding of the world than coming up with “lore”. Any creative artist can come up with a new species of humans with 4 arms, blue skin, supernatural abilities, a different language, or whatever change that doesn’t suggest anything interesting for The System.

This lack of vision leads to the “We live in a Society” syndrome, where artists make ignorant shallow societal commentary, by pointing to a necessary evil like prisons.

Sometimes, they do this out of convenience. Artists often need to raise the stake of their stories, and having a character fight with The System is a common trope that doesn’t require much imagination.[3]

More often than not though, artists are genuinely not self-aware. Very few (artists or not) have been in a position to Govern a large group of people. And if one is the type of artist to not deeply research topics for weeks or months, they will never interact with people who do govern, nor study the history of those who have.

As a result, they do not even realise what type of experience is relevant to the conversation. They truly believe that their personal voice is something special to bring. Thus, we keep being fed an endless supply of shallow Art telling us The System Is Evil.

Art has discovered a formula that works for anti-system stories. The Hero sees a problem, tries to solve it within The System, fails, and decides to fight The System instead.
The Game is to make people feel catharsis and vindicated when they recognise a flaw of The System in the “work of art”.[4]

This is similar to how Social Media found its formula. There, The Game is to feed people a bunch of slop 90% of the time, so that they feel a dopamine hit when they unearth a “gem” the remaining 10%. In both cases, the experience makes the “consumer” worse off.

Art should inspire, and Social Media should connect. I believe that in both cases, Terminal Cynicism explains why their creators are not doing better, and we are not demanding better from them.

Conclusion

 

Terminal Cynicism is a serious problem.

I hope this article helps some of my readers immediately recognise Terminal Cynicism for what it is, develop an aversion to it, and notice when people spread it.

I have already written a follow-up, focused on who instigates it. Here though, I wanted to just show how it works.

On this, cheers!

 

  1. ^

    An exception may be The Path of Ascension. It’s a nice book series, available on Amazon Kindle.

  2. ^

    Womp womp. Never mind it being wrong, it is so meek.

    It is one thing to feel bad and like The System has it for us.

    It is another to spread this sentiment to everyone, as if to reassure oneself by having others confirm they feel similarly.

  3. ^

    It may erode trust in The System and make people less inclined to improve it, but who cares?

    Evil behaviour often comes from laziness and negligence, rather than bad intents.

  4. ^

    I tend to perceive it as formulaic slop. While the first few I watched as a teenager were inspiring (“Wow! You can in fact go against The System!”), I am now disgusted by their omnipresence.



Discuss

Aleatoric Uncertainty Is A Skill Issue

Новости LessWrong.com - 19 февраля, 2026 - 16:11
Published on February 19, 2026 1:11 PM GMT

Aleatoric Uncertainty Is A Skill Issue

Epistemic status: shitpost with a point

Disclaimer: This grew out of a conversation with Claude. The ideas are mine, the writeup is LLM-generated and then post-edited to save time and improve the flow

You know the textbook distinction. Epistemic uncertainty is what you don't know. Aleatoric uncertainty is what can't be known — irreducible randomness baked into the fabric of reality itself.

Classic examples of aleatoric uncertainty: coin flips, dice rolls, thermal noise in sensors, turbulent airflow.

Here's the thing though.

A Laplacian demon predicts all of those.

Every single "classic example" of aleatoric uncertainty is a system governed by deterministic classical mechanics where we simply don't have good enough measurements or fast enough computers. The coin flip is chaotic, sure. But chaotic ≠ random. A sufficiently precise demon with full knowledge of initial conditions, air currents, surface elasticity, gravitational field gradients, and your thumb's muscle fiber activation pattern will tell you it's heads. Every time.

The thermal noise? Deterministic molecular dynamics. The dice? Newtonian mechanics with a lot of bounces. Turbulence? Navier-Stokes is deterministic, we just can't solve it well enough.

The Laplacian demon doesn't have aleatoric uncertainty. It's just that mortals have skill issues and are too proud to admit it.

So what's actually irreducibly random?

Quantum mechanics. That's it. That's the list.

And even that depends on your interpretation:

  • Copenhagen: Yes, measurement outcomes are fundamentally random. The Born rule probabilities are ontologically real. This is the one place where the universe actually rolls dice. God, apparently, does play dice, but only at this level.
  • Many-Worlds: Nope. The wave function evolves deterministically. Every outcome happens. The "randomness" is just you not knowing which branch you're on. That's not aleatoric — that's indexical uncertainty. You have the skill. You just don't know where you are.
  • Bohmian Mechanics: Nope. Hidden variables, fully deterministic. You just don't know the initial particle positions. Classic epistemic uncertainty wearing a trenchcoat.

So under Many-Worlds and Bohmian mechanics, all uncertainty is epistemic. The universe is fully deterministic. There are no dice. There is no irreducible randomness. There is only insufficient information.

Under Copenhagen, there is exactly one source of genuine aleatoric uncertainty: quantum measurement. Everything else that textbooks call "aleatoric" is a Laplacian demon looking at your sensor noise model and saying "get wrecked, scrubs."

The Real Problem: Lack of Epistemic Humility

Here's what actually bothers me about the standard framing. When you label something "aleatoric," you're making an ontological claim: this randomness is a property of the world. But in almost every classical case, it's not. It's a property of your model's resolution. It's noise in your world model that you're projecting onto reality.

And then you refuse to label it as such.

Think about what's happening psychologically. "It's not that my model is incomplete — it's that the universe is inherently noisy right here specifically where my model stops working." How convenient. The boundary of your ignorance just happens to coincide with the boundary of what's knowable. What are the odds?

The aleatoric/epistemic distinction, as commonly taught, isn't really a taxonomy of uncertainty. It's a taxonomy of accountability. Epistemic uncertainty is uncertainty you're responsible for reducing. Aleatoric uncertainty is uncertainty you've given yourself permission to stop thinking about. The label "irreducible" isn't doing technical work — it's doing emotional work. It's a declaration that you've tried hard enough.

And look, sometimes you have tried hard enough. Sometimes it's correct engineering practice to draw a line and say "I'm modeling everything below this scale as noise." But at least be honest about what you're doing. You're choosing a level of description. You're not discovering a fundamental feature of reality. The universe didn't put a noise floor there. You did.

"But This Distinction Is Useful In Practice!"

Yes! I agree! I'm not saying we should stop using the word "aleatoric" in ML papers and engineering contexts. When you're building a Bayesian neural network and you separate your uncertainty into "stuff I could reduce with more training data" vs. "inherent noise floor I should model as a variance parameter," that's a genuinely useful decomposition. You would, in fact, go completely insane trying to treat thermal noise in your LIDAR as epistemic and heroically trying to learn your way out of it.

The pragmatic framing does real work: aleatoric = "uncertainty I'm choosing to treat as irreducible at this level of description." That's fine. That's good engineering.

But let's stop pretending it's a deep metaphysical claim about the nature of reality. It's not. It's a statement about where you've chosen to draw the line on your modeling resolution. The universe (probably) isn't random. Your model is just too coarse to be a demon.

The PunchlineInterpretationCoin flipThermal noiseQuantum measurementIs anything aleatoric?Classical (Laplace)Skill issueSkill issueN/ANoCopenhagenSkill issueSkill issueActually randomYes, but only thisMany-WorldsSkill issueSkill issueIndexical uncertaintyNo*BohmianSkill issueSkill issueSkill issueNo

* Unless you count "not knowing which branch you're on" as a new, secret third thing.

tl;dr: Aleatoric uncertainty is a skill issue. The Laplacian demon has no variance term. The only candidate for genuine ontological randomness is quantum mechanics, and half the interpretations say even that's deterministic. Your "irreducible noise" is just you being bad at physics and too proud to admit uncertainty in your model.

By the way: I may be the only one, but I was actually genuinely confused about this topic for years. I took the definition of aleatoric uncertainty literally and couldn't understand what the professors were on about when they called coin flips aleatoric uncertainty. None of the examples they gave were actually irreducible.



Discuss

Страницы

Подписка на LessWrong на русском сбор новостей