LessWrong.com News

A community blog devoted to refining the art of rationality

Address: https://www.lesswrong.com

Updated: 3 minutes 24 seconds ago

Inkhaven Retrospective

November 24, 2025 - 08:19

Published on November 24, 2025 5:19 AM GMT

Here I am on the plane on the way home from Inkhaven. Huge thanks to Ben Pace and the other organizers for inviting me. Lighthaven is a delightful venue and there sure are some brilliant writers taking part in this — both contributing writers and participants. With 40 posts published per day by participants (not counting those by organizers and contributing writers) it feels impossible to highlight samples of them. Fortunately that's been done for me: the Inkhaven spotlight. And just to semi-randomly pick a single post to highlight, for making me laugh out loud, I'll link to one by Rob Miles. There are also myriad poignant, insightful, informative, and otherwise delightful posts. And, sure, plenty that don't quite work yet. But that's the point: force yourself to keep publishing, ready or not, and trust that quality will follow.

Confession: I'm not at all happy with this post. Opening with "here I am on the plane"? Painful. I would've fixed that in an editing pass, but I'm going to leave it because it illustrates two Inkhaven lessons for me. First, that sometimes it's ok to hit publish before something is perfect. And second, that I in particular need to follow the advice (#5 in my collection of Inkhaven tips) to dedicate blocks of time to dumping words onto the page. Editing is separate. If I hadn't started typing "here I am on the plane" then I would've sat there agonizing about a good opener, gotten distracted, and had nothing.

Write, publish, repeat.

I'm still agonizing about whether to commit to continuing to churn out a post every day for the rest of the month now that I've left. I do have a pretty much unlimited number of ideas to write about, even sticking to the theme of writing about writing. Here are half a dozen of them:

Complaining about a list of words I've been curating for which Google's built-in dictionary definitions are garbage.
What a work of art the 1913 edition of Webster's dictionary still is and how to configure a Mac laptop so it pops up those definitions when you hard-press on a word with the touchpad.
Why Overleaf is the bee's knees for technical writing or collaborative writing (and definitely what you want for technical collaborative writing).
More of my favorite word games (does that count as writing about writing? maybe I can find a way to make it count!).
My ever-growing pile of notes about and examples of redefining everyday words as technical jargon ("common knowledge", "real number", "normal distribution") and how bad this is.
Tips for dealing with trolls (beyond not feeding them, which is rules 1 through 17).

The problem is how daunting it feels to do justice to some of those. But that's where the writing tip to First Just Write comes in. It's been half an hour now of writing this thing and I'm starting to think I could bear to hit publish. If I do, I expect it will be the worst post I've published while at Inkhaven. But something has to be my worst post.

Judge for yourself. Here's a recap of everything else I published leading up to and during my Inkhaven stay:

Goodharting and DIY Inkhavens (on the Beeminder blog)
Blogception (on AGI Friday)
Against Powerful Text Editors (mindless, repetitive edits waste less time than it seems and avoiding them is more costly than it seems)
See Your Word Count While You Write (I made a tool, Tallyglot, to see your word count in the LessWrong editor and other places)
Why to Commit to a Writing and Publishing Schedule (it matters for you and for your readers; also covers "how")
Strategically Procrastinate as an Anti-Rabbit-Hole Strategy (aka Just-in-timeboxing)
The Eightfold Path To Enlightened Disagreement (characterize, crux, ITT, steelman, scout-mindset, etc)
Smarmbots, Secret Cyborgs, and Evolving Writing Norms (new rule: no plagiarizing LLMs)
Musk vs McGurk (sensor fusion and self-driving cars)
Mnemonic Exposition (how to name and gender hypothetical characters)
Eat The Richtext (another tool I made for preserving formatting when pasting text)
The Principle of Delayed Commitment (more pro-procrastination propaganda)
Ten Wrong and Dumb Grammar Rules (infinitive splitting, less-vs-fewer, syntactic vs elocutionary punctuation, etc)
Writing Tips from Inkhaven (listicles, curse of knowledge, Hemingway Mode, the out-loud-to-your-friend constraint, etc)

Androgenetic haploid selection

November 24, 2025 - 06:10

Published on November 24, 2025 3:10 AM GMT

Eggs are expensive, sperm are cheap. It’s a fundamental fact of biology . . . for now.

Currently, embryo selection can improve any heritable trait, but the degree of improvement is limited by the number of embryos from which to select. This, in turn, is because eggs are rare.

But what if we could select on sperm instead? We could choose the best sperm from tens or even hundreds of millions, and use that to make an embryo. However, any method that relies on DNA sequencing must destroy the sperm. Sure, you can identify the best one, but that’s of limited value if you can’t use it for fertilizing an egg.

There have been a few ways proposed to get around this:

Nondestructive sperm testing. Technically challenging: sperm DNA is packaged tightly and you would have to partially denature it without killing the cell. Selection based on total DNA content (separating X and Y bearing sperm) is possible but only useful for choosing the sex of the baby. Phenotypic selection (swim rate, etc) is not very useful because sperm phenotypes don’t correlate well with sperm genotypes.
Doing in vitro spermatogenesis, and keeping track of which sperm came from where.[1] There are four sperm produced from each spermatocyte, and three of them could be destructively sequenced to deduce the genotype of the remaining one. Challenging (nobody has done human in vitro spermatogenesis yet) and low throughput.

Here, I propose a different approach, which I call androgenetic haploid selection.

Androgenetic haploid selection

1. Make a bunch of eggs. The chromosomes and imprinting don’t have to be correct (we’ll get rid of them in the next step), so even a low quality in vitro oogenesis method would work. Something like Hamazaki’s approach would work well here.
2. Remove the chromosomes from the eggs. This can be done at large scale through centrifugation: spin the eggs hard enough, and the DNA will fall out.
3. Add an individual sperm to each egg and establish haploid stem cell lines. This recent paper is an example of doing this for cows and sheep. These cell lines are called “androgenetic” and retain the DNA imprinting patterns of sperm.
   - Notably, Y-bearing sperm cannot make viable haploid stem cell lines because many essential genes are on the X chromosome.
4. Sequence many cell lines and choose the best one. Because the cells divide, it’s possible to destructively sequence some of the cells from each line without destroying all the cells.
5. Collect eggs the normal way, and “fertilize” them with nuclei from your chosen androgenetic cell line.
   - Optionally: perform additional selection based on the embryo genome.

Comments on this approach

This method could give high genetic optimization for the paternal half of the genome. At scale, I estimate an overall $200/sample cost for cell line establishment and sequencing, so taking the best of 100 cell lines could be performed for around the cost of a normal IVF cycle (~$20,000). For a perfectly heritable trait with a perfect polygenic score, this would give (+2.5 SD * 0.5) = +1.25 SD from sperm selection alone. (Gains will be lower for less heritable traits and less accurate predictors.)
This would only work for daughters (sorry, Elon!). Although genetic engineering could make XX males by adding SRY, this would probably not be a good idea.
This would make even a low-quality in vitro oogenesis method valuable. More broadly, it’s not necessarily required that the recipient cells be eggs per se, as long as they express the correct factors for zygotic genome activation.
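The +2.5 SD figure above is the expected maximum of 100 independent standard normal draws. A minimal sketch of that arithmetic follows, keeping the post's own simplifications (perfect heritability, a perfect predictor, and a flat factor of 0.5 for the paternal half of the genome). The function names and the numerical integration settings are assumptions made for illustration.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution of the standard normal (erfc keeps the tails accurate)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def expected_max_of_normals(n, lo=-10.0, hi=10.0, steps=20_000):
    """E[max of n iid N(0,1)] = integral of x * n * pdf(x) * cdf(x)^(n-1), trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * n * normal_pdf(x) * math.pow(normal_cdf(x), n - 1)
    return total * h

COST_PER_LINE_USD = 200   # the post's estimate per cell line
PATERNAL_SHARE = 0.5      # the post's factor for selecting on one parent's half

for n in (10, 100, 1000):
    gain_sd = PATERNAL_SHARE * expected_max_of_normals(n)
    cost = n * COST_PER_LINE_USD
    print(f"{n:>5} lines: cost ~${cost:,}, expected gain ~ +{gain_sd:.2f} SD")
```

The expected maximum grows only slowly with the number of lines, so each additional tenfold increase in cost buys a smaller gain.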

[1] This would have to be done at the spermatid stage, before the sperm swim away.

Formality

November 24, 2025 - 05:19

Published on November 24, 2025 2:19 AM GMT

In Market Logic (part 1, part 2) I investigated what logic and theory of uncertainty naturally emerges from a Garrabrant-induction-like setup if it isn't rigged towards classical logic and classical probability theory. However, I only dealt with opaque "market goods" which are not composed of parts. Of course, the derivatives I constructed have structure, but derivatives are the analogue of logically definable things: they only take the meaning of the underlying market goods and "remix" that meaning. As Sam mentioned in Condensation, postulating a latent variable may involve expanding one's sense of what is; expanding the set of possible worlds, not only defining a new random variable on the same outcome space.

Simply put, I want a theory of how vague, ill-defined, messy concepts relate to clean, logical, well-defined, crisp concepts. Logic is already well-defined, so it doesn't suit the purpose.[1]

So, let's suppose that market goods are identified with sequences of symbols, which I'll call strings. We know the alphabet, but we don't a priori have words and grammar. We only know these market goods by their names; we don't a priori know what they refer to.

This is going to be incredibly sketchy, by the way. It's a speculative idea I want to spend more time working out properly.

So each sequence of symbols is a market good. We want to figure out how to parse the strings into something meaningful. Recall my earlier trick of identifying market trades with inference. How can we analyze patterns in the market trades, to help us understand strings as structured claims?

Well, reasoning on structured claims often involves substitution rules. We can treat trades that move money from one string to another as edits. Patterns in these edits across many sentence-pairs indicate substitution rules which the market strongly endorses. We can look for high-wealth traders who enforce given substitution rules, or we can look for influential traders who do the same (i.e., traders who might be low-wealth but enforce their will on the market effectively and don't get traded against). We can look at substitution rules which the market endorses in the limit (the constraint gets violated less over time). Perhaps there are other ways to look at this as well.

In any case, somehow we're examining the substitution rules endorsed by the market.
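As one very rough illustration of reading substitution rules off trades, the sketch below assumes each trade is recorded as a hypothetical `(source_string, target_string, amount)` triple. It extracts the edit by stripping the shared prefix and suffix, then totals the money behind each edit. Both the record format and this way of extracting edits are assumptions made for the example, not part of the construction above.

```python
from collections import defaultdict

def edit_between(src, dst):
    """Return the (old, new) substrings left after removing the shared prefix and suffix."""
    shortest = min(len(src), len(dst))
    p = 0
    while p < shortest and src[p] == dst[p]:
        p += 1
    s = 0
    while s < shortest - p and src[-1 - s] == dst[-1 - s]:
        s += 1
    return src[p:len(src) - s], dst[p:len(dst) - s]

def endorsed_rules(trades):
    """Total the money moved along each extracted edit, largest first."""
    totals = defaultdict(float)
    for src, dst, amount in trades:
        totals[edit_between(src, dst)] += amount
    return sorted(totals.items(), key=lambda item: -item[1])

# Hypothetical trade log: money moved from the first string to the second.
trades = [
    ("x = 2", "x = 1+1", 5.0),
    ("y + 2 = 5", "y + 1+1 = 5", 3.0),
    ("the cat", "the dog", 1.0),
]

for (old, new), money in endorsed_rules(trades):
    print(f"{old!r} -> {new!r}: {money}")
```

A rule that shows up with substantial money in both directions would be a candidate for a bidirectional (synonym) substitution.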

First, there's equational substitutions, which are bidirectional; synonym relationships.

Then there's one-directional substitutions. There's an important nuance here: in logic, there are negative contexts and positive contexts. A positive context is a place in a larger expression where strengthening the term strengthens the whole expression. "Stronger" in logic means more specific, claims more, rules out more worlds. So, for example, "If I left the yard, I could find my way back to the house" is a stronger claim than "If I left the yard, I could find my way back to the yard" since one could in theory find one's way back to the yard without being able to find the house, but not vice versa. In "If A then B" statements, B is a positive context and A is a negative context. "If I left the yard, I could find my way back to the house" is a weaker claim than "If I left the house, I could find my way back to the house", because it has the stronger premise.

Negation switches us between positive and negative contexts. "This is not an apple" is a weaker claim than "This is not a fruit". This example also illustrates that substitution can make sense on noun phrases, not just sub-sentences; noun phrases can be weaker or stronger even though they aren't claims. Bidirectional substitution subsumes different types of equality, at least = (noun equivalence) and ↔ (claim equivalence). One-directional substitution subsumes different types as well, at least ⊆ (set inclusion) and → (logical implication). So, similarly, our concept of negation here combines set-complement with claim negation.
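The positive/negative bookkeeping can be sketched in a few lines. This is only an illustration: the nested-tuple encoding of expressions and the function name `polarities` are assumptions made for the example. The function walks an expression built from "not" and "if ... then ..." and reports whether each atom sits in a positive (+1) or negative (-1) context.

```python
# Expressions are nested tuples (a hypothetical encoding chosen for this sketch):
#   "apple"              an atom
#   ("not", e)           negation of e
#   ("implies", a, b)    "if a then b"

def polarities(expr, sign=+1):
    """Yield (atom, sign) pairs; +1 marks a positive context, -1 a negative one."""
    if isinstance(expr, str):
        yield expr, sign
    elif expr[0] == "not":
        # Negation flips the polarity of everything beneath it.
        yield from polarities(expr[1], -sign)
    elif expr[0] == "implies":
        # The premise is a negative context, the conclusion a positive one.
        yield from polarities(expr[1], -sign)
        yield from polarities(expr[2], sign)
    else:
        raise ValueError(f"unknown connective: {expr[0]!r}")

claim = ("implies", "left the yard", "can find the house")
print(list(polarities(claim)))

negated = ("not", "apple")
print(list(polarities(negated)))
```

In a positive position, swapping in a stronger term strengthens the whole claim; in a negative position it weakens it. That is why replacing "apple" with the weaker "fruit" under a negation produces the stronger claim.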

Sometimes, substitution rules are highly context-free. For example, 2=1+1, so anywhere 2 occurs in a mathematical equation or formula, we can substitute 1+1 while preserving the truth/meaning of the claim/expression.

Other times, substitutions are highly context-dependent. For example, a dollhouse chair is a type of chair, but it isn't good for sitting in.

A transparent context is one such as mathematical equations/formulas, where substitution rules apply. Such a context is also sometimes called referentially transparent. An opaque context is one where things are context-sensitive, such as natural language; you can't just apply substitution rules. This concept of transparent context is shared between philosophy of language, philosophy of mind, linguistics, logic, and the study of programming languages. One advantage claimed for functional programming languages is their referential transparency: an expression evaluates exactly the same way, no matter what context it is evaluated in. Languages with side-effects don't have this property.
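A small sketch of the programming-language sense of the term, using hypothetical functions `double` and `next_id`: with a pure function, replacing equals with equals preserves the result; with a function that has a side effect, the same rewrite changes the result.

```python
import itertools

def double(x):
    """Pure: the result depends only on the argument."""
    return x + x

_counter = itertools.count(1)

def next_id():
    """Impure: every call advances hidden state."""
    return next(_counter)

# Transparent: `double(3) + double(3)` can be rewritten as `2 * double(3)`.
a = double(3) + double(3)
b = 2 * double(3)
print(a == b)

# Not transparent: naming the result of one call and reusing it
# is not the same as writing the call out twice.
c = next_id() + next_id()   # two separate calls
first = next_id()           # one call, reused below
d = first + first
print(c == d)
```

The second comparison fails because `next_id()` does not denote a single value, so the substitution rule "replace an expression by its value" is not valid for it.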

So, in our market on strings, we can examine where substitution rules apply to find transparent contexts. I think a transparent context would be characterized as something like:

A method for detecting when we're in that context. This might itself be very context-sensitive; e.g., it requires informal skill to detect when a string of symbols is representing formal math in a transparent way.[2]
A set of substitution rules which are valid for reasoning in that context. This may involve a grammar for parsing expressions in the context, so that we know how to parse into terms that can be substituted.

The same could characterize an opaque context, but the substitution rules for the transparent context would depend only on classifying sub-contexts into "positive" or "negative" contexts.

There's nothing inherently wrong with an opaque context; I'm not about to call for us to all abandon natural languages and learn Lojban. Even logic includes non-transparent contexts, such as modal operators. Even functional programming languages have quoted strings (which are an opaque context).

What I do want to claim, perhaps, is that you don't really understand something unless you can translate it into a transparent-context description.

This is similar to claims such as "you don't understand something unless you can program it" or "you don't understand something unless you can write it down mathematically", but significantly generalized.

Going back to the market on strings, I'm saying we could define some formal metric for how opaque/transparent a string or substring is, but more opaque contexts aren't inherently meaningless. If the market is confident that a string is equivalent (inter-tradeable) with some highly transparent string, then we might say "It isn't transparent, but it is interpretable".

Let's consider ways this can fail.

There's the lesser sin, ambiguity. This manifests as multiple partial translations into transparent contexts. (This is itself an ambiguous description; the formal details need to be hashed out.) The more ambiguous, the worse.

(Note that I'm distinguishing this from vagueness, which can be perfectly transparent. Ambiguity creates a situation where we are not sure which substitution rules to apply to a term, because it has several possible meanings. On the other hand, the theory allows concepts to be fundamentally vague, with no ambiguity. I'm not married to this distinction but it does seem to fall out of the math as I'm imagining it.)

There could be a greater sin, where there are no candidate translations into transparent contexts. This seems to me like a deeper sort of meaninglessness.

There could also be other ways that interpretations into a transparent context are better or worse. They could reveal more or less of the structure of the claim.

I could be wrong about this whole thesis. Maybe there can be understanding without any interpretation into a transparent context. For example, if you can "explain like I'm five" then this is often taken to indicate a strong understanding of an idea, even though five-year-olds are not a transparent context. Perhaps any kind of translation of an idea is some evidence for understanding, and the more translating you can do, the better you understand.

Still, it seems to me that there is something special in being able to translate to a transparent context. If somehow I knew that a concept could not be represented in a transparent way, I would take that as significant evidence that it is nonsense, at least. It is tempting to say it is definitive evidence, even.

This seems to have some connections to my idea of objectivity emerging as third-person-perspectives get constructed, creating a shared map which we can translate all our first-person-perspectives into in order to efficiently share information.

[1]
You might object that logic can work fine as a meta-theory; that the syntactic operations of the informal ought to be definable precisely in principle, EG by simulating the brain. I agree with this sentiment, but I am here trying to capture the semantics of informality. The problem of semantics, in my view, is the problem of relating syntactic manipulations (the physical processes in the brain, the computations of an artificial neural network) with semantic ones (beliefs, goals, etc). Hence, I can't assume a nice interpretable syntax like logic from the beginning.
[2]
This is actually rare: if I say
... the idea is similar to how (a−b)(a+b) = a² − b²
then I'm probably making some syntactic point, which doesn't get preserved under substitution by the usual mathematical equivalences. Perhaps the point can be understood in a weaker transparent context, where algebraic manipulations are not valid substitutions, but there are still some valid substitutions?


Why Talk to Journalists

November 24, 2025 - 05:07

Published on November 24, 2025 2:07 AM GMT

Sources' motivations for talking to journalists are a bit of a puzzle. On the one hand, it's helpful for journalists to work out what those motivations are, to keep sources invested in the relationship. On the other hand, sources behave in perplexing ways, for instance sharing information against their own interests, so it's often best to treat their psychology as unknowable.

Reflecting on sources' willingness to share compromising information, one mystified AI journalist told me last weekend, "no reasonable person would do this."

But to the extent I can divine their motivations, here are some reasons I think people talk to me at work:

- Bringing attention and legitimacy to themselves and their work
- Trading tips and gossip
- Steering the discourse in favorable ways
  - E.g. Slandering your enemies and competitors
- Feeling in control of your life
  - E.g. an employee might want to leak information to feel power over their boss
- Therapy
- A sense of obligation
  - E.g. to educate the public
  - E.g. to be polite when someone calls you for help
- It feels high-status

Most of these are not particularly inspiring, but if you work in AI safety, I want to appeal to your theory of change. If your theory of change relies on getting companies, policymakers, or the public to do something about AI, the media can be very helpful to you. The media is able to inform those groups about the actions you would have them take and steer them toward those decisions.

For example, news stories about GPT-4o and AI psychosis reach the public, policymakers, OpenAI investors, and OpenAI employees. Pressure from these groups can shape the company's incentives, for instance to encourage changes to OpenAI's safety practices.

More generally, talking to journalists can help raise the sanity waterline for the public conversation about AI risks.

If you are an employee at an AI lab and you could see yourself whistleblowing some day, I think it is extra valuable for you to feel comfortable talking to journalists. In my experience, safety-minded people sometimes use the possibility of being a whistleblower to license working at the labs. But in practice, whistleblowing is very difficult (a subject for a future post). If you do manage to overcome the many obstacles in your way and try to whistleblow, it would be much easier if you're not calling a journalist for the first time. Instead, get some low-stakes practice in now and establish a relationship with a journalist, so you have one fewer excuse if the time comes.

Maybe news articles offend your epistemic sensibilities because you've experienced Gell-Mann amnesia and have read too many sloppy articles. Unfortunately, I don't think we can afford to be so picky. If you don't talk to journalists, you cede the discourse to the least scrupulous sources. In this case, that's often corporate PR people at the labs, e/acc zealots, and David Sacks types. They are happy to plant misleading stories that make the safety community look bad. I think you can engage with journalists while holding to rationalist principles to only say true things.

It's pretty easy to steer articles. It often only takes one quote to connect an article on AI to existential risks, when counterfactually, the journalist wouldn't have realized the connection or had the authority to write it in their own voice. For example, take this recent CNN article on a ChatGPT suicide. Thanks to one anonymous ex-OpenAI employee, the article connected the suicide to the bigger safety picture:

One former OpenAI employee, who spoke with CNN on the condition of anonymity out of fear of retaliation, said “the race is incredibly intense,” explaining that the top AI companies are engaged in a constant tug-of-war for relevance. “I think they’re all rushing as fast as they can to get stuff out.”

It's that easy!

Overall, it sounds disingenuous to me when people in AI don't talk to journalists because they dislike the quality of AI journalism. You can change that!

Which came first?

If you appreciate initiatives like Tarbell that train journalists to better understand AI, you should really like talking to journalists yourself! Getting people who are already working in AI safety to talk to journalists is even more cost-effective and scalable. Plus, you will get to steer the discourse according to your specific threat models and will enjoy the fast feedback of seeing your views appear in print.

Here are some genres of safety-relevant stories that you might want to contribute to:

- Exposing wrongdoing at AI companies
  - E.g. whistleblowing about companies violating their RSPs
- Early real-world examples of risks (warning shots)
  - E.g. the Las Vegas bomber who got advice from ChatGPT
- Connecting news to safety topics
  - E.g. explaining why cutting CAISI would be bad
- Highlighting safety research
  - E.g. explaining how scheming evals work
- Explainers about AI concepts
  - These generally improve the public's AI literacy

In practice, articles tend to cut across multiple of these categories. Op-eds also deserve an honorable mention: they don't require talking to journalists in the sense I'm writing about here, but some of the best articles on AI risks have been opinion pieces.

Quick Defenses

I'll briefly preempt a common objection: you're worried that journalists are going to misquote you or take you out of context.

First, I think that's rarer than you might expect, in part because you've probably over-indexed on the Cade Metz incident. Plus, journalists hate being wrong and try to get multiple sources, as I wrote in Read More News.

Second, you can seek out experienced beat reporters who will understand you, rather than junior ones.

Third and most importantly, even if you do get misquoted, it doesn't mean talking to the journalist was net-negative, even for that particular piece and even ex-post. As annoying as it is, it might be outweighed by the value of steering the article in positive ways.


I made a tool for learning absolute pitch as an adult

November 24, 2025 - 04:09

Published on November 24, 2025 1:09 AM GMT

I read a study that claims to debunk the myth that only children can learn absolute pitch: it got 12 musicians with no prior absolute pitch to improve significantly at it.

On average, they spent 21.4 hours over 8 weeks, making 15,327 guesses. All learned to name at least 3 pitches with >90% accuracy, having to respond in under 2.028 seconds; some learned all 12. The average was 7.08 pitches learned.

Notably, results on new instruments were worse than on the instruments the participants trained on, suggesting that people partly learn to rely on cues specific to the training instrument's timbre.

The way it works is simply by having very short feedback loops. You hear a sound (played on a piano in the study) and have 1-2 seconds to make a guess for what pitch it is.

You learn new pitches gradually: first, you need to identify one (and press keys for whether it’s that pitch or some other pitch), and then, more pitches are gradually added.
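The trial logic described above can be sketched roughly as follows. This is not the study's or the web app's actual code: the `OTHER` label, the 50/50 mix in `next_trial`, and the 2-second default deadline are all assumptions chosen for illustration.

```python
import random

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
OTHER = "other"  # hypothetical label for "none of the pitches I'm learning"


def expected_answer(played: str, active: list[str]) -> str:
    """The right response: name the pitch if it is being learned, else 'other'."""
    return played if played in active else OTHER


def is_correct(played: str, active: list[str], response: str,
               elapsed_s: float, deadline_s: float = 2.0) -> bool:
    """A guess only counts if it is right and arrives before the deadline."""
    return elapsed_s <= deadline_s and response == expected_answer(played, active)


def next_trial(active: list[str], rng: random.Random) -> str:
    """Pick the pitch to play: half the time one being learned, otherwise any pitch.

    The 50/50 mix is an assumption, not something taken from the study.
    """
    if rng.random() < 0.5:
        return rng.choice(active)
    return rng.choice(PITCH_CLASSES)


active = ["C"]                      # start with a single pitch
played = next_trial(active, random.Random(0))
print(played, expected_answer(played, active))
active.append("F#")                 # later, gradually add more pitches
```

Growing `active` one pitch at a time mirrors the gradual schedule; how quickly to add pitches is left open here.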

In the study, before testing without feedback, to reset relative pitch memorization, a Shepard tone is played for 20 seconds. (It’s an auditory illusion that makes you feel like the pitch is perpetually getting lower or higher.)

I asked an LLM to make a web app version of it. I also asked it to play the Shepard tone more often, but for a shorter duration each time.

I also asked it to add colors to maybe produce some amount of synesthesia. I think there’s research that shows that synesthesia and absolute pitch correlate; I don’t know whether it can be induced to some extent, or would only be helpful for some people, but it seemed good to add in case it works. Later, someone on Twitter told me that they were taught the tones of Mandarin using colored cards, and it worked for them. People who experience synesthesia to at least some extent might have an easier time learning more pitches, though I’m not sure if it would be helpful to others.

I tried to minimize the time between recognition and feedback, so the web app reacts to the starts of the key presses, clicks, and touches, not to their ends; and immediately shows whether you were correct, and what was correct.

Finally, I added more instruments than just piano, hopefully, for better generalization.

I posted the first version on Twitter.

It got a surprisingly high amount of engagement, which made the post a bit unfortunate in retrospect, because I made it before I actually fixed the bugs produced by the LLMs (now all fixed); on the other hand, the engagement meant that now I actually had to fix the bugs for people to be able to use the tool.

Two people shared that they have already learned to identify three pitches!

I now want to do experiments with a bunch of things (including the order of pitches presented: can it improve the learning curve and allow people to learn more than three more easily?), to collect the data on people’s progress, and maybe ask them questions (like whether they’ve played music or sung before).

Would appreciate recommendations for how to collect the data well without having to do anything complicated to manage it.

Would also appreciate more ideas for how to improve it for better pitch learning.

If you want to try to acquire perfect pitch, it might take you quite some time, but try it:

perfect-pitch-trainer.pages.dev


"Self-esteem" is distortionary

November 24, 2025 - 02:59

Published on November 23, 2025 11:59 PM GMT

A friend asked me, "what's the right amount of self-esteem to have?" Too little, and you're ineffectual. Too much, and you get cocky. So how do you choose the right balance?

I replied that this is a trick question.

People with low self-esteem have thoughts like "I'm a loser", "my IQ is too low to succeed", "no one could love someone as fat as me". Their problem is not quite that they've got inaccurate beliefs. They may in fact be a loser. Rather, their problem is that they've attached their identity to concepts that limit their action space.

For instance, the notion of low IQ. This is a construct that's predictive at a population level, but it doesn't give you much predictive power on an individual level unless it's the only thing you know about a person. But you can rapidly accumulate info about someone, or yourself, that outweighs the info expressed by "your IQ is 101". E.g. if you want to know someone's test scores, you'll do a lot better by using their scores on mock exams than by using their IQ.

Which means that someone who says "I can't fix my car because I've got a low IQ" isn't actually making full use of the info available to them. They're relying on a sticky prior. What they should actually be doing if they care about fixing their car is asking "what's stopping me from fixing it?" and checking if solving that problem is worth the costs compared to paying a mechanic. The cost may be large. They may have to put in dozens of hours of work before they understand cars well enough to fix their problem without paying anyone. But they could do it.

So the issue is that the belief about "low IQ" has led to imaginary walls around what can be done that do not actually reflect reality.

In other words, low self-esteem turns a bump in the road into a cliff of seemingly infinite height, cutting off an entire avenue of approach. It reduces your sense of what is possible, and from the inside, it feels like you've got less free will.

What is the solution? Knock down the walls.

In day-to-day life, we have to simplify the action space because we are computationally bounded systems. We introduce simplifications for good reasons, and for bad reasons. That's normal. Things get problematic when those simplifications restrict the space till there is no good action left. Then, the appropriate reaction is to relax the constraints we impose on ourselves, test if the relaxation is valid, and take the best action we've got left. If we were able to do this reliably, we would find ourselves doing the best we can, and low self-esteem would be a non-issue.


Rationalist Techno-Buddhist Jargon 1.0

November 24, 2025 - 02:39

Published on November 23, 2025 11:39 PM GMT

Scott Alexander called me a rationalist techno-Buddhist on his blog. Since Scott Alexander is a rationalist of the highest status, that wording constitutes rationalist dharma transmission. I therefore consider myself authorized to speak authoritatively on the topic of rationalist techno-Buddhism.

Why am I writing a glossary? Because there are 14 different kinds of Buddhism and they all use words to mean slightly different things. This is a problem. I hope that this document will end 2,500 years of sectarianism, such that all of us finally communicate perfectly with no misunderstandings.

But just in case there exist one or more people on the Internet who disagree with some aspect of this document, I have included a "1.0" in this document's title. You are permitted to fork it into 1.1 or 1.<your-name-here> or 2.this.is.why.lsusr.is.wrong.about.everything. Now, if you write about Buddhism, then instead of tediously defining all the terms you're using, you can just say "This uses Rationalist Techno-Buddhist Jargon 2.this.is.why.lsusr.is.wrong.about.everything.17.2" and get back to arguing online, sitting in silence, or whatever else it is you do to make the world a better place.

This list is ordered such that you can read it beginning-to-end without having to jump forward for a definition.

Warning

This document may be cognitohazardous to some people. Proceed at your own risk. Thank you Iraneth for feedback on an early draft.

Glossary

pragmatic dharma. A loosely-connected movement, mostly Western lay practitioners, focused on reproducible meditative methods and transparency about experiences. This differs from traditional Buddhism by not appealing to traditional religious authority.

rationalist techno-Buddhism (RTB). A movement within the pragmatic dharma that is trying to create cybernetic models for why and how this stuff works.

qualia. Subjective first-person experience.

consciousness. The dynamic field in which qualia arise and are perceived.

attention. The part of your consciousness you are paying attention to. Attention can be in only one place at a time.

concentration. When a person stabilizes their attention on a target, e.g. the breath. Strong concentration states elicit altered states of consciousness. Concentration is a skill that can be improved with practice.

kasina. Meditation using a visual target instead of the breath.

altered state (of consciousness). A temporary non-normative state of consciousness, usually caused by strong concentration.

access concentration. The first non-normative altered state of consciousness, through which all other altered states are entered. Access concentration is when your attention stabilizes on its target. For example, if you are meditating on your breath, then access concentration is when your attention stabilizes on your breath.

jhana. An altered state of consciousness characterized by deep meditative absorption. There are 8 jhanas. Jhanas are used in Theravada practice.

nirodha-samapatti. An altered state beyond the 8 jhanas at which all perception ceases.

mushin. A state of unobstructed action without deliberative thought. Mushin starts out as an altered state, but eventually it turns into an altered trait.

nonduality. An altered state of consciousness without distinction between self (homunculus) and other.

duality. Normative non-nonduality.

homunculus. Physically-speaking, your field of consciousness is a real-time generative model created by your brain. Inside of this model, some elements are labelled "self" and constitute your homunculus.

generative model. See wikipedia.

raw sensory inputs. The signals going into the generative model. This probably includes preprocessed data from e.g. your brainstem. What matters is that this data is raw from the perspective of the generative model in your brain.

altered trait. A permanent change to subjective experience. In the context of RTB, altered traits are caused by meditation.

ego death. An altered trait where the homunculus in your brain ceases to exist. [1]

fabrication. When the generative model in your brain creates an object in consciousness in an attempt to reduce predictive error, usually in an attempt to simulate external reality. All conscious experiences are fabricated, but not all fabrications are experienced consciously. You can think of your brain as a video game rendering engine. Fabrication is your brain rendering physical reality in its simulated mirror world.

rendering. Synonym for fabrication.

encapsulation layer. When a fabricated element in your consciousness is so sticky that it is never not fabricated. It is difficult for normative consciousness to directly perceive that encapsulation layers are fabricated. Encapsulation layers feel like raw inputs until you pay close enough attention to them.

chronic fabrication. Synonym for "encapsulation layer".

non-recursive encapsulation layer. A fabrication that summarizes incoming raw sense data, thereby blocking direct conscious (attentive) access to the perception of that raw sense data. Examples of non-recursive encapsulation layers include non-local space and non-local time.

non-local space. Normative perception of space as a gigantic world far beyond your immediate environment.

local space. Perception of space after dissolution of space.

non-local time. Normative perception of time.

local time. Perception of time after dissolution of time. Eternal present.

recursive encapsulation layer. A fabrication created to block a problematic feedback loop caused by self-reference. Ultimately, recursive encapsulation layers are caused by an interaction between the generative algorithm in your brain and the reinforcement learning algorithm in your brain. Examples of recursive encapsulation layers include self/other duality, desire, pain-as-suffering, and willful volition. See [Intuitive self-models] 6. Awakening / Enlightenment for further explanation.

willful volition. The recursive encapsulation layer that is misinterpreted as free will.

acute encapsulation. A non-chronic encapsulation algorithm that doesn't congeal into a permanent element of perceptual reality. Encapsulation functions are non-chronic because they appear only in response to unpleasant stimuli. Pain-as-suffering is an acute encapsulation function, because it doesn't drag down your hedonic baseline.

chronic encapsulation layer. An encapsulation layer that is so stable, it is incorrectly perceived as raw input data to your field of consciousness. For people who don't understand conceptually that everything you perceive is part of a simulation, chronic recursive encapsulation layers are incorrectly understood to be elements of objective physical reality. Chronic encapsulation layers cause chronic suffering.

insight. An abstract concept measuring the cumulative effects on your brain when you pay attention to fabrications in your consciousness. The word "insight" lossily and pragmatically projects these effects into a single dimension. Accumulating insight eventually unsticks encapsulation layers, and then defabricates them.

dissolution. Permanent defabrication. When the defabrication of an encapsulation becomes a person's default mind state. Non-permanent defabrication often precedes permanent defabrication.

integration. Dealing with the aftermath after an encapsulation layer has been dissolved. Fabrications are often load-bearing. Dissolving fabrications therefore often removes load-bearing components of a person's consciousness. After this, the person must learn new, healthier cognitive habits. This process is called integration.

vipassana sickness. Mental destabilization from too much insight too quickly with insufficient integration. In extreme cases vipassana sickness can cause psychosis (or worse, because unexpected psychosis can cause accidental death), especially when paired with sleep deprivation. This is similar to how people on an LSD trip can think "cars aren't real" and go wandering into traffic if unsupervised.

dissolution. A permanent shift (altered trait) from fabrication to non-fabrication. All dissolutions cause permanent reductions in chronic suffering.

dissolution of self. Synonym for ego death.

dissolution of desire. An altered trait where your brain's reinforcement learning algorithm is no longer abstracted into desire-as-suffering.

dissolution of space. An altered trait where you no longer feel like a small person walking around a gigantic world and your brain instead renders just your local, immediate environment. When this happens it stops feeling like your body is walking around a fixed world, and more like the world is moving while your body remains stationary.

dissolution of time. An altered trait where past and future are defabricated such that you live in local time.

suffering. Absolute-zero-based suffering. Normative models of consciousness have positive qualia (pleasure) and negative qualia (suffering). RTB uses an absolute-zero-based model of suffering instead. The normative model is like Celsius or Fahrenheit, whereas RTB's model is more like the kelvin scale. Pleasure is a decrease in suffering, the same way cold is, thermodynamically speaking, the removal of heat. Heat is fundamental. Cold is not fundamental. Similarly, suffering is fundamental in a way that pleasure is not.

chronic suffering. Suffering produced by a chronic encapsulation layer. Normative levels of suffering have a floor produced by the chronic suffering induced by self, willful volition, non-local space, non-local time, etc.

hedonic baseline. A person's level of suffering when acute suffering is removed, leaving only chronic suffering.

enlightenment. Absolute zero chronic suffering. It may be physically impossible for human minds to reach such a state while alive and conscious. Absolute zero is still useful as a reference point or limit point. It's like a Carnot engine.

pleasure. An acute stimulus that temporarily reduces a person's suffering. Normative people can dive below their hedonic baseline temporarily, and conceptualize such dives as positive valence "pleasure". Lowering the floor itself requires that chronic encapsulation layers be dissolved. When a person's hedonic baseline drops, stimuli that used to be pleasurable become unpleasant, because they felt better than the previous hedonic baseline, but worse than the new hedonic baseline.

jhana junkie. A person who does jhanic practice without accumulating insight. Jhana junkies get stuck on the path to awakening, but being a jhana junkie is not dangerous the way vipassana sickness is dangerous.

awakening. Dissolution of a chronic fabrication. Awakenings tend to have a 1-to-1 correspondence with completed insight cycles.

insight cycle. A discrete cycle of three phases: concentration, insight and integration. In the concentration phase you cultivate concentrative skill. In the insight phase, you penetrate an encapsulation layer. Finally, in the integration phase, you deal with the fallout of blowing up that encapsulation layer. It takes effort to get to your first insight cycle, but after your first insight cycle, there's no stopping the process. Insight cycles will keep coming for years, whether you want them to or not. That's because chronic suffering is an obstacle to concentration. Completion of an insight cycle thereby improves your concentration, thus making your next insight cycle easier. This is a chain reaction. Your fabrications are like a woven fabric with a warp and a weft. If you leave the whole thing alone then it will stay intact. Your first insight cycle cuts the fabric and yanks on the weft. If you continue pulling on the weft then it'll unwind faster, but the fabric will continue to fall apart whether or not you pull on the weft. As an old Zen saying goes: "Better not to start. Once started, better to finish."

knowledge of suffering. An early phase in an insight cycle where you notice that your mind has been doing something stupid and unpleasant for longer than you can remember.

dark night. The phase of an insight cycle that takes place immediately after knowledge of suffering. Encapsulation layers exist to shield you from unpleasant perceptions. When you dissolve an encapsulation layer to get knowledge of suffering, you remove that shield, and all of the stuff it was protecting you from enters attention-accessible consciousness. This can be very unpleasant. Some people can cycle through many dark nights before landing stream entry.

hell realm. When you're stuck in a dark night. A person perceives what their consciousness is doing wrong (gets knowledge of suffering), but doesn't have the ability to fix it yet. I suspect that LSD-induced hell realms are particularly difficult to escape, because they're like taking a helicopter to the top of Mt Everest without learning mountaineering first.

stream entry. Ambiguously refers to the successful completion of your first insight cycle and/or your first awakening. It is customary to wait 1 year plus 1 day after awakening before claiming stream entry because ① it ensures you are experiencing an altered trait, not just an altered state, and ② it ensures you have completed the integration part of the insight cycle, thereby satisfying both definitions. During this time you should not make any big unilateral life decisions more irreversible than going vegan [2]. Stream entry typically reduces chronic suffering by at least 90%.

stream entry mania. The immediate aftermath of stream entry often produces a manic-like state. For this reason, it is recommended that you not post anything on social media for a few months after stream entry. The cooling-off period is even longer for posts related to spirituality. Instead, you should talk to a trusted spiritual advisor. It is best if you establish a relationship with this person before you hit stream entry.

kensho. A glimpse of nonduality (or similar non-encapsulation) via a transient state that nonetheless leaves lasting insight. Kensho precedes stream entry.

Cthulhu R'lyeh wgah'nagl fhtagn. Cthulhu waits dreaming in R’lyeh.

[1] In RTB, ego death refers to an altered trait. Confusingly, LSD induces an altered state of consciousness where the ego is not present. LSD trippers usually refer to this state as "ego death", whereas RTBs refer to it as a nondual state, since the altered state is temporary and the ego reappears after the LSD trip is over.
[2] If you do go vegan, make sure you take a multivitamin so you don't get brain damage.


Finding the uncertainty vector in GPT2-scale transformers

November 24, 2025 - 02:34

Published on November 23, 2025 11:34 PM GMT

In this post I explore a phenomenon in LLMs where the training process naturally consolidates information in a highly interpretable structure in the residual stream, through a positive feedback loop from a small variation at initialization. I start with a toy example and work up to GPT2 scale, showing animations of how weights and activations evolve over training. I assume familiarity with the transformer architecture.

The exploratory tone of this post will likely lead to more questions than answers. The intended audience is people hoping to learn more about transformer internals and their dynamics over training. The motivating question is: "What is going on in this GPT2-scale model? These charts look incredibly weird."

The dimensions in the residual stream are often thought of as an uninterpretable arbitrary rotation of the feature space, since the standard transformer does not have an operation that makes it a privileged basis. Yet, the behavior above for dimension 275 is remarkably distinct. These charts show the evolution of three dynamics over the course of 600 training steps as the model progresses from 10.8 CE loss to 3.6 CE loss:

How does the language model head, which is of shape [d_model, d_vocab], update for [275, :]? This means if the model blocks add to dimension 275 in the residual stream, what change does that cause to the prediction of each token in the vocabulary?
How do 3 specific neurons in the final MLP evolve? Each neuron multiplies its activation by an output vector and pushes in that direction. The first two neurons push in very specific dimensions, notably dimension 275. The third is included as a normal neuron baseline.
What is the activation distribution (sampling 60k tokens) of the residual stream right before getting multiplied by the lm_head for dimension 275?

The short answer is that dim 275 in the residual stream is functioning as an uncertainty vector. If the model puts 100% of its weight into 275, it will output roughly the naïve unigram distribution. Given the vector is normed, 100% of weight corresponds to sqrt(768), or 27.7, in the last chart above. MLP neurons then have a handle to perform several useful actions:

If a neuron fires on a very particular fact, it can add to dim 275 (make less negative) to indicate certainty.
If a neuron fires on ambiguous contexts, it can subtract from dim 275 (make more negative) to indicate uncertainty.
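The 27.7 figure can be checked with a small sketch. It assumes the final norm is an RMS-style normalization with no learned scale, which is an assumption here rather than something stated above; the helper name `rms_norm` is illustrative. Under that assumption, a vector with all of its weight in one dimension ends up with magnitude sqrt(d_model) in that dimension.

```python
import math

def rms_norm(x):
    """Scale x so that its root-mean-square equals 1 (no learned gain)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq) for v in x]

D_MODEL = 768
x = [0.0] * D_MODEL
x[275] = -5.0  # all of the weight sits in dimension 275 (the value -5.0 is arbitrary)

normed = rms_norm(x)
print(round(normed[275], 1))  # magnitude is sqrt(768), about 27.7
```

The starting value does not matter: any vector that is zero everywhere except dimension 275 is mapped to plus or minus sqrt(768) there.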

Part of why this occurs is because this version of GPT2 does not have a bias in the lm_head. A bias in the lm_head will directly model the unigram distribution, which decreases the gradient pressure for this dynamic to develop in the residual stream.

Toy Model

The first peculiar fact about the charts above is that training evokes a distinguished basis in the residual stream over time for dimension 275, even though mathematically it doesn't appear like there is any mechanism for this. To explore this further, I look at how simple initialization differences in a toy model can evoke roles for dimensions in the residual stream. Feel free to skip this section if this idea is already obvious.

The task the toy model will be learning is to predict the next digit (i) as a function of the preceding digit (j). The sequence of digits 0 to 9 will be sampled according to the scenario.

Scenario 1

The digits are sampled from the distribution [0.1,0.2,...,1], normalized such that digit 9 is 10x more likely than digit 0. The model is tasked with minimizing cross entropy loss via stochastic gradient descent.

I start with the simplest transformer possible: no blocks and a model dimension of 1. The model has 20 coefficients, one for each input digit and one for each output digit. The prediction logit that digit 4 will follow digit 3 is calculated by input_coef_3 * output_coef_4.

Input coefficients are initialized to the normal distribution, and output coefficients are initialized to 0. Training produces the following charts:

The input coefficient learns a constant because the sequence data is independent. The output coefficient learns the distribution to match the unigram distribution of the data.
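A minimal sketch of scenario 1, under stated assumptions: instead of sampling digits and running SGD, it uses full-batch gradient descent on the expected cross entropy (possible here because the input and target digits are independent draws from the same distribution). The function names, learning rate, seed, and step count are illustrative choices, not the settings used for the charts.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_loss(a, b, p):
    """Expected cross entropy when input and target digits are both drawn from p."""
    loss = 0.0
    for j in range(len(a)):
        q = softmax([a[j] * b_i for b_i in b])
        loss += p[j] * -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))
    return loss

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    raw = [0.1 * (k + 1) for k in range(10)]
    p = [r / sum(raw) for r in raw]               # digit 9 is 10x as likely as digit 0
    a = [rng.gauss(0.0, 1.0) for _ in range(10)]  # input coefficients, normal init
    b = [0.0] * 10                                # output coefficients, zero init
    initial = expected_loss(a, b, p)
    for _ in range(steps):
        grad_a = [0.0] * 10
        grad_b = [0.0] * 10
        for j in range(10):
            q = softmax([a[j] * b_i for b_i in b])
            for i in range(10):
                dlogit = p[j] * (q[i] - p[i])  # d(loss)/d(logit) for input j, output i
                grad_a[j] += dlogit * b[i]
                grad_b[i] += dlogit * a[j]
        a = [v - lr * g for v, g in zip(a, grad_a)]
        b = [v - lr * g for v, g in zip(b, grad_b)]
    return a, b, p, initial, expected_loss(a, b, p)

a, b, p, initial_loss, final_loss = train()
entropy = -sum(p_i * math.log(p_i) for p_i in p)
print(initial_loss, final_loss, entropy)
```

With zero output coefficients the initial loss is ln(10), the loss of a uniform guess, and the entropy of the digit distribution is the floor that training moves toward.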

Scenario 2

I now update the model dimension from 1 to 2. Training produces the following charts:

The computation is spread out across both dimensions. The second chart shows how the input coefficients sit on a line. A single model dimension has enough degrees of freedom to fit the data, but because there are so many more valid linear combinations that reach the result, it is statistically improbable for the relationship to fall into exactly a single dimension. When this smearing occurs across a large number of dimensions in a larger scale transformer, it can become challenging to disentangle the underlying mechanisms.

Scenario 3

I now repeat scenario 2, but initialize the first dimension of the input coefficients to 10 instead of a normal distribution around 0.

The trained model isolates the learned portion into dimension 0. Dimension 1 output coefficients stay at zero. This occurs for two reasons:

Non-zero activation mean. Let's say I want to modify output_coef_4 weights such that I increase the likelihood of predicting digit 4 for all inputs. If all inputs are positive values, then I can increase the output by increasing the coefficient. If all inputs are negative, then I can increase the output by decreasing the coefficient. If inputs are mixed with a mean of zero, then the gradients largely cancel out. In the general case, if the subset of inputs that we want to shift have a shared sign for a given dimension, that dimension will have asymmetrically large gradient pressure on the output coefficients.
Higher magnitude activations. The gradient on the output coefficients is proportional to the magnitude of the corresponding input coefficients. Once the pattern in the data has been fit, the gradients will all fall to zero. So if one dimension can respond faster, it can claim the reward.
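Both effects can be seen in the gradient at initialization. The sketch below assumes the toy setup described earlier (zero output coefficients, so every prediction starts uniform) and computes the expected cross-entropy gradient on one dimension's output coefficients. The function name and the two example input vectors are illustrative.

```python
def output_grad_at_zero_init(input_coefs, p):
    """Gradient of expected cross entropy w.r.t. each output coefficient of one
    dimension, when all output coefficients are zero (uniform predictions)."""
    uniform = 1.0 / len(p)
    # Every input j contributes p[j] * input_coefs[j]; opposite signs cancel here.
    weighted_input = sum(p_j * a_j for p_j, a_j in zip(p, input_coefs))
    return [weighted_input * (uniform - p_i) for p_i in p]

raw = [0.1 * (k + 1) for k in range(10)]
p = [r / sum(raw) for r in raw]

shared_sign = [10.0] * 10                                      # like dimension 0 in scenario 3
mixed_sign = [1.0 if k % 2 == 0 else -1.0 for k in range(10)]  # signs largely cancel

g_shared = output_grad_at_zero_init(shared_sign, p)
g_mixed = output_grad_at_zero_init(mixed_sign, p)
print(max(abs(g) for g in g_shared), max(abs(g) for g in g_mixed))
```

The shared-sign, high-magnitude dimension receives a gradient roughly two orders of magnitude larger than the mixed-sign one, so it fits the pattern first.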

Scenario 4

Next I return to initializing the input coefficients from the normal distribution, but I add a twist to the data distribution: Digits 0-8 are sampled from the distribution [0.1,0.2,...,0.9]. I then update any tokens that follow 3 with 9. A trained model should always predict 9 when it sees 3, and otherwise predict the same distribution as earlier scenarios.

The computation gets spread out across both dimensions. The model learns to predict increasing probabilities for digits 0 through 8. The red line in the last chart corresponds to the output predictions for input 3. It is zero for all digits except for 9, where it jumps to 100%.

Scenario 5

I take scenario 4 and initialize the first dimension of the input coefficients to 10.

Once again, dimension zero dominates on modeling the pattern that applies to all input dimensions. Dimensions with consistent inputs can apply consistent patterns to inputs. Dimension 1 contributes more heavily to the prediction of 9 given 3.

The main takeaway here is that even though the mathematical structure of the model does not induce a preferred basis in the residual stream, parameter initialization can create a basis that persists throughout the training process. But does this actually scale to small LLMs?

GPT2-Small Scale

I will be referencing training runs from modded-nanogpt, which is the GPT-2 scale model I am most familiar with. It has a model dimension of 768, 11 layers, and 6 heads per layer, and no biases on any linear projections or the language model head. The GPT2 tokenizer is used with a vocab size of 50257, with 47 padding tokens at the end to give a clean 50304.

At initialization

The training dataset is FineWeb, which, like most other training datasets, has tokens that roughly follow a log-normal sampling distribution. That distribution is shown below:

The spike in the bottom right plot on the left tail corresponds to the 270 tokens with zero occurrences, which are defaulted to 1e-8. The most common tokens are listed below. In general smaller token_ids tend to have higher frequency.

Token_id    Token_str    Fraction
13          .            0.0378
11          ,            0.0360
262         _the         0.0336
198         \n           0.0217

I will look deeper at token 262 '_the', which makes up 3.36% of the training corpus. Below is its initial embedding weight across the 768 model dimensions, which is sampled from a normal distribution:

This looks exactly like one might expect, with a mean of 0.03, very close to zero. The most extreme value of the distribution comes from dimension 275, with a value of -3.71875.

The chart below shows what dimension 275 looks like across all tokens in the vocabulary, with '_the' marked with a red dot:

Things start to get interesting when we look at the activation distribution. That is, when we pass the data distribution into the model, what distribution of embedding outputs is produced in the residual stream?

The distribution is no longer normal. The spike from '_the' at -3.71875 is shifting the mean to -0.22. How does this mean compare to the other 767 dimensions?

-0.22 falls near the far left tail, which indicates that dimension 275 starts out with one of the most lopsided distributions.
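The gap between the per-token mean and the activation mean is a frequency-weighting effect, sketched below. The arithmetic for '_the' uses the figures quoted above; the 1001-token vocabulary is hypothetical and exists only to show the mechanism, and the sketch assumes the activation is simply the embedding value.

```python
def weighted_mean(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Contribution of '_the' alone to the dimension 275 activation mean.
the_fraction = 0.0336
the_value = -3.71875
print(round(the_fraction * the_value, 3))  # -0.125

# Hypothetical vocabulary: one frequent outlier plus 1000 rare tokens
# whose values alternate +1 / -1 and therefore average to zero.
values = [the_value] + [1.0 if k % 2 == 0 else -1.0 for k in range(1000)]
freqs = [the_fraction] + [(1.0 - the_fraction) / 1000] * 1000

unweighted = sum(values) / len(values)   # what the embedding weights look like
weighted = weighted_mean(values, freqs)  # what the forward pass actually sees
print(unweighted, weighted)
```

On these numbers '_the' alone contributes about -0.125, a bit over half of the -0.22 mean, even though it is a single row of the embedding table.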

At initialization the language model head and all output projections from each block are set to zero.

During Training

On step 1 the only gradient pressure on the model is on the lm_head. The lm_head vectors will update in the direction that models the bigram distribution. In other words, every time the model sees the sequence '_the X', token X will update its lm_head vector in the direction that makes it more similar to '_the', and all other lm_head vectors will update in the direction that makes them less similar to '_the'.

Since dimension 275 is sharply negative for '_the', whenever a token follows '_the', it will see its dimension 275 decreased. All other tokens will increase.
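The step-1 update can be sketched for a single residual dimension. This is a simplified illustration, not the training code: a hypothetical 4-token vocabulary, one training example, plain SGD, and an arbitrary learning rate. The function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head_step(weights, x, target, lr=0.1):
    """One SGD step on the lm_head values of a single residual dimension.
    weights[t] is token t's lm_head value in that dimension, x is the activation."""
    probs = softmax([w * x for w in weights])
    updated = []
    for t, w in enumerate(weights):
        grad = x * (probs[t] - (1.0 if t == target else 0.0))  # d(cross entropy)/d(w_t)
        updated.append(w - lr * grad)
    return updated

weights = [0.0, 0.0, 0.0, 0.0]  # lm_head starts at zero
x = -3.0                        # sharply negative activation, like '_the' in dim 275
updated = lm_head_step(weights, x, target=2)
print(updated)
```

The token that followed the negative activation (index 2) moves negative, and every other token moves slightly positive, which matches the pattern described above.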

Roughly 9k/50k tokens see their lm_head vector decreased on step 1 for dimension 275. The mean of this chart is 0.005, as most tokens see an increase. However, the tokens that decrease are the ones that occur most frequently in the data. Naturally, if a token frequently follows '_the' it will generally occur more frequently.

If I weight every token by its rate of occurrence, the weighted mean of dimension 275 in the lm_head drops to -0.0035. To understand if this is substantial, I plot the weighted-mean shift of every one of the 768 dimensions of the lm_head below.

-0.0035 sits on the far left tail.

On step 2 the MLPs in the network start to engage. A single neuron in an MLP with ReLU activation and no bias can be thought of as randomly bisecting the activation space to sample half the inputs, and then choosing a single direction in model space to push those inputs. At this stage, the only signal the MLP has is 'push this batch of inputs in the direction of the lm_head for their target tokens, and away from the lm_head of the other tokens'.
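The bisecting picture follows from symmetry: with no bias, a ReLU neuron fires when w·x > 0, so for any input x exactly one of x and -x fires it. The sketch below makes that concrete using antipodal input pairs, which is an illustrative construction rather than real activations; the dimensions, seed, and function names are assumptions.

```python
import random

def fires(weights, x):
    """A bias-free ReLU neuron is active exactly when its pre-activation is positive."""
    return sum(w * xi for w, xi in zip(weights, x)) > 0.0

def firing_fraction(d_model=16, n_pairs=500, seed=0):
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
    active = 0
    for _ in range(n_pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
        mirrored = [-v for v in x]
        active += fires(weights, x) + fires(weights, mirrored)
    return active / (2 * n_pairs)

print(firing_fraction())  # 0.5
```

For a distribution that is only approximately symmetric around zero, the fraction is approximately one half rather than exactly one half.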

Most tokens have a positive lm_head value on dimension 275, but the tokens that frequently follow '_the' and frequently occur in the forward pass have a negative value for dimension 275. And so MLP neurons are incentivized to push in the negative direction for dimension 275.

The chart below shows what direction MLP neurons 'push' on step 2. Each layer has 3072 neurons. Across 11 layers this gives 33,792 neurons.

The mean of the distribution is slightly negative. Each input will activate roughly half of the neurons. This means that each input is going to sample roughly 17,000 times (33,792 / 2 = 16,896) from the distribution above. The charts below show the resulting activation on step 2 after these MLPs fire.

The third chart shows how 98% of activations are getting pushed further negative. At this point, the network has now completed the reinforcing feedback loop. The full sequence of events is:

Anomaly occurs in dim 275. Performing log-normal weighted sampling over the vocab space (which is what naturally occurs when we sample on the forward pass) from a normal distribution causes skewed activation distributions in several dimensions at initialization. An example of this is dimension 275, which is heavily impacted by token '_the'.
Anomaly spreads to tokens that follow '_the'. On step 1, tokens that follow '_the' decrease their lm_head value in dimension 275, and tokens that don't follow '_the' increase their lm_head value in dimension 275.
Anomaly spreads to 98% of inputs. Each neuron activates on roughly 50% of the inputs. If a neuron activates on a large number of tokens from step 2, it will push that full batch negative.
Repeat. The cycle now repeats from step 1, but instead of 'tokens that follow _the', it's 'tokens that follow 98% of inputs'.

This cycle effectively monopolizes dimension 275 to perform the function of modeling the unigram distribution, and it all starts with a single seed value of -3.7 for token '_the'. From the framing of a spreading virus, '_the' is a prime viral vector.

Finding the full Uncertainty Vector

Does the same effect occur in other dimensions? Below is the mean activation right before the lm_head for each dimension at step 600.

275 shows the largest magnitude, followed by 573 and 360. No other dimensions appear to have this dynamic.

All three dimensions show similar distributions, indicating that the actual 'uncertainty vector' is smeared across three dimensions in the residual stream. Below are the initial activations induced in dimension 573 by the embedding:

The 'seed' for dimension 573 is planted by the right activation around 2.4. Perhaps coincidentally, this also corresponds to the token '_the'.

I wanted to revisit one fact in this chart:

There is a small spike of height 100 in the histogram that starts to go slightly positive around step 400. This corresponds to the beginning-of-sequence token, which occurs at the start of every document during training. Here is its activation across all 768 dimensions on step 600:

It has the largest magnitude activation in dimension 360, and a very small positive activation in dimension 275. How do the predicted probabilities differ between the unigram distribution, dim 275, and dim 360?

Since the softmax function is scale sensitive, we can only compare these distributions to the unigram distribution after applying a scalar factor. A uniform prediction across all 50304 tokens gives 10.8 loss. A perfect prediction gives 7.66 loss. Here is how the loss scales for dimensions [275, 360, 573] as the activation magnitude is varied from 0 to 30:

Both dimensions 275 and 573 achieve minimum loss against the unigram distribution exactly when 100% of the activation is placed into them, which gives a normed activation of 27.7.
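The two reference losses and the scale sensitivity can be sketched on a small example. The 5-token distribution and the function names are hypothetical; the only figure taken from above is the vocabulary size of 50304. The logit direction is chosen as log p, so that a scale of exactly 1 reproduces the target distribution.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(target, logits):
    return -sum(t * l for t, l in zip(target, log_softmax(logits)))

def entropy(p):
    return -sum(p_i * math.log(p_i) for p_i in p)

# A uniform prediction over the padded vocabulary costs ln(50304).
print(round(math.log(50304), 2))  # 10.83

# Hypothetical 5-token unigram distribution and a logit direction matching it.
raw = [1.0, 2.0, 4.0, 8.0, 16.0]
p = [r / sum(raw) for r in raw]
direction = [math.log(p_i) for p_i in p]

losses = {s: cross_entropy(p, [s * d for d in direction]) for s in (0.0, 0.5, 1.0, 2.0)}
print(losses, entropy(p))
```

A scale of 0 gives the uniform loss, a scale of 1 reaches the entropy floor, and scales on either side do worse, which is the same U-shape the loss curves above trace out as activation magnitude varies.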

To see which activation most accurately predicts the unigram distribution, I run the following code:

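# Compute which activation most strongly predicts the unigram distribution.
# Assumes `model`, `frac` (taken to be the unigram token fractions), and
# `D_VOCAB` are already defined earlier in the notebook.
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

class UnigramModel(nn.Module):
    def __init__(self, d_model, d_vocab, fixed_head):
        super().__init__()
        # Single learnable activation vector of shape [1, d_model]
        self.w = nn.Parameter(torch.ones((1, d_model), device='cuda'))
        # Fixed (non-trainable) copy of the lm_head, first d_vocab columns only
        self.lm_head = fixed_head.clone()[:, :d_vocab]

    def forward(self, y):
        logits = (self.w @ self.lm_head).squeeze(0)
        log_probs = F.log_softmax(logits, dim=-1)
        # Cross entropy of the prediction against the target distribution y
        loss = -(log_probs * y).sum()
        return loss

target = frac
unigram = UnigramModel(768, D_VOCAB, model.lm_head.weight.T.float().data)
optim = torch.optim.SGD(unigram.parameters(), lr=1)

for step in range(10000):
    loss = unigram(target)
    loss.backward()
    optim.step()
    optim.zero_grad()
    if step % 1000 == 0:
        print(loss.item())

# Plot the learned activation across all 768 dimensions
data = unigram.w.data.cpu().numpy()[0]
plt.figure(figsize=(12, 4), dpi=100)
plt.title('Unigram Activation')
plt.plot(data)
plt.vlines(x=range(768), ymin=0, ymax=data)
plt.show()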

This gives 7.73 loss, very close to the perfect score of 7.66. To predict complete uncertainty, an activation should roughly put 50% of its magnitude into negative dim 275, 25% into negative dim 360, and 25% into dim 573.

What is the biggest difference between dims 275 and 360? Here are the top 10 tokens that 360 prefers over 275:

['The' 'A' 'I' '1' 'H' 'L' 'the' 'In' 'P' 'B']

Here are the top 10 tokens 275 prefers over 360:

[',' ' the' ' and' ' in' ' a' ' (' ' ' ' �' ' on' ' of']

Dim 360 appears to model uncertainty conditioned on start of document, as the tokens are capitalized with no leading space, whereas 275 models uncertainty conditioned on middle of document.

What is the biggest difference between dims 275 and 573? Here are the top 10 tokens that prefer 573 over 275:

'ixtape' 'gorithm' 'adata' 'umbnails' 'initions' 'INGTON' 'ospels' 'helial' ' ..............' 'ウス'

Here are the top 10 tokens that prefer 275 over 573:

',' ' the' ' and' ' in' '.' ' (' ' a' '-' '\n' ' �'

Dim 573 appears to model uncertainty conditioned on middle of word, whereas 275 models uncertainty conditioned on start of word.

Takeaways

The residual stream is not an arbitrary basis. It is given meaning through the spiky initialization of activations that occurs from sampling tokens that follow a log-normal distribution, which kicks off positive feedback loops during training.
Interpretable structure of the final LLM can be seeded from small tweaks at initialization. In this case, the evolution of the uncertainty vector is traced back to very specific initialization values of the token "_the".
The uncertainty vector can condition on multiple factors: start of document, start of word in the middle of a document, or middle of word. Each of these is skewed towards one of 3 dimensions in the residual stream.
If we can better understand positive feedback loops that elicit separable structure during training, would it be possible to define an initialization seeding that evokes a much more generally interpretable model? For example, we could perform specific initialization for tokens related to "warfare" in dimension 15, such that a positive feedback loop kicks off during training that reinforces how dimension 15 relates to the concept of "warfare".

code: https://github.com/ClassicLarry/uncertaintyVectorLLM/tree/main


Stop Applying And Get To Work

November 24, 2025 - 01:50

Published on November 23, 2025 10:50 PM GMT

TL;DR: Figure out what needs doing and do it, don't wait on approval from fellowships or jobs.

If you...

Have short timelines
Have been struggling to get into a position in AI safety
Are able to self-motivate your efforts
Have a sufficient financial safety net

... I would recommend changing your personal strategy entirely.

I started my full-time AI safety career transitioning process in March 2025. For the first 7 months or so, I heavily prioritized applying for jobs and fellowships. But like for many others trying to "break into the field" and get their "foot in the door", this became quite discouraging.

I'm not gonna get into the numbers here, but if you've been applying and getting rejected multiple times during the past year or so, you've probably noticed the number of applicants increasing at a preposterous rate. What this means in practice is that the "entry-level" positions are practically impossible for "entry-level" people to enter.

If you're like me and have short timelines, applying, getting better at applying, and applying again, becomes meaningless very fast. You're optimizing for signaling competence rather than actually being competent. Because if you a) have short timelines, and b) are honest with yourself, you would come to the conclusion that immediate, direct action and effect is a priority.

If you identify as an impostor...

...applying for things can be especially nerve-racking. To me, this seems to be because I'm incentivized to optimize for how I'm going to be perceived. I've found the best antidote for my own impostor-y feelings to be this: Focus on being useful and having direct impact, instead of signaling the ability to (maybe one day) have direct impact.

I find it quite comforting that I don't need to be in the spotlight, but instead get to have an influence from the sidelines. I don't need to think about "how does this look" - just "could this work" or "is this helpful".

And so I started looking for ways in which I could help existing projects immediately. Suddenly, "optimize LinkedIn profile" didn't feel like such a high EV task anymore.

Here's what I did, and what I recommend folks try

Identify the risk scenario you'd most like to mitigate, and the 1-3 potentially most effective interventions.

Find out who's already working on those interventions.[1]

Contact these people and look for things they might need help with. Let them know what you could do right now to increase their chances of success.[2]

What I've personally found the most effective is reaching out to people with specific offers and/or questions you need answered in order to make those offers[3]. Address problems you've noticed that should be addressed. If you have a track record of being a reliable and sensible person (and preferably can provide some evidence to support this), and you offer your time for free, and the people you're offering to help actually want to get things done, they're unlikely to refuse[4].

(Will happily share more about my story and what I'm doing currently; don't hesitate to ask detailed questions/tips/advice.)[5]

[1] If nobody seems to be on the ball, consider starting your own project.
[2] Here it's quite helpful to focus on what you do best, where you might have an unfair advantage, etc.
[3] As a general rule, assume the person you're messaging or talking to doesn't have the time to listen to your takes - get straight to the point and make sure you've done the cognitive labor for them.
[4] I should add that in order to do this you need to have developed a bit of agency, as well as understanding of the field you're trying to contribute to. I'm also assuming that since you have the capacity to apply for things, you also have the capacity to get things done if you trade the time.
[5] Post encouraged and mildly improved by plex based on a conversation with Pauliina. From the other side of this, I'd much rather take someone onto a project who has spent a few months trying to build useful things than someone who has spent cycles signaling for applications, even if their projects don't go anywhere. You get good at what you practice. Hire people who do things and go do things. For example, I once gave the org Alignment Ecosystem Development, which runs all the aisafety.com resources, to a volunteer (Bryce Robertson) who'd been helping out competently for a while. Excellent move! He had proved he actually did good stuff unprompted and has been improving it much more than I would have.
Also! I'd much rather work with someone who's been practicing figuring out inside views of what's actually good to orient their priorities than with someone looking for a role doing work which someone else thinks is good and got funding to hire for. Deference is the mind-killer.


Halfhaven Digest #5

November 24, 2025 - 00:57

Published on November 23, 2025 9:57 PM GMT

My posts since the last digest

A Culture of Bullshit — Part of the reason society is going down the tubes — if it is — is because we have a culture of mediocrity, where bullshit is tolerated.
The Flaw in the Paperclip Maximizer Thought Experiment — Most of the things I write are original ideas (whether brilliant insights or lazy hot takes), but this one was a bit more of an exposition of ideas I didn’t come up with.
I Spent 30 Days Learning to Smile More Charismatically — Technically, this one took me 30 days to make. Talks about charisma and “looksmaxxing”, and how unhinged some looksmaxxing advice can be.
In Defense of Sneering — This was just a LessWrong comment, which is allowed for Halfhaven. There was a LessWrong thread where everyone was complaining about sneering, and I chimed in because I think sneering isn’t inherently bad, it’s only bad if it’s too hostile. But not enough sneering risks letting bullshitters get away with their bullshit.
Literacy is Decreasing Among the Intellectual Class — Looking at two books that have been in publication for over a century (Etiquette and Gray’s Anatomy) and comparing the old versions with the modern to see the degradation in writing quality typical of modern books.

I’m proud of a few of these. I was sick during this segment of Halfhaven, but I still managed to get things out, which I’m happy with. I had a few mostly-finished posts in the chamber.

Some highlights from other Halfhaven writers (since the last digest)

Why is Writing Aversive? (Ari Zerner) — A relatable post asking why it is that writing can feel so hard. My general advice would normally be that if you find writing involves a lot of friction, but enjoy having written things, that means you just don’t like writing and should give up. But reading this post made me realize I used to feel a lot more like Ari than I do now about writing. As little as a few months ago, maybe. I think maybe developing taste and putting more effort into editing has been what’s helped. Then writing feels like a type of craft, rather than a brain dump. And building things is fun. As long as you’re not TikTok-brained (or Magic-Arena-brained), which is its own problem, and one I sometimes struggle with too.
Menswear is a Subcultural Signaling System (Aaron) — A great post. In particular, I liked the concept handle of a “Type of Guy”, which conveys the archetypal nature of fashion. “You do not want different items of clothing you are wearing to signal you are incompatible Types Of Guy.” So no vest over a t-shirt and jeans! Has a follow-up post.
No One Reads the Original Work (Algon) — People talk about things without actually having seen them. The equivalent of reading headlines without clicking through to the news article. I remember seeing a lot of this when Jordan Peterson was popular, and people who hated him would talk about him in ways that made it clear they’d never heard the man speak. They’d only heard people talking about him.
against predicting speedrunners won’t do things (April) — I think April is winning the record for the most post topics that make me want to click. Speedrunning lore is inherently interesting. I like that she backs up her hypothesis with some concrete predictions.
Diary: getting excused from a jury duty; models, models, models (mishka) — I’d never thought about how biased police are as witnesses. That’s a great point.
To Write Well, First Experience (keltan) — Lots of good writing advice. In particular, that if you’re writing from stuff you’ve read rather than from real experience, you’re writing through a low-bandwidth proxy.
Traditional Food (Lsusr) — A very thorough post about how our idea of a traditional diet doesn’t necessarily reflect what people actually ate in the past, and instead often reflects actual government propaganda. White rice and white bread are “fiberless blobs of carbohydrates” that nobody in history ever ate, and eating them makes us sick.

We’re entering the final segment of Halfhaven. Many won’t finish the full 30 post challenge by the end of November, but I’ve still gotten some good posts out of the people who didn’t make it all the way, so be proud of what you have done, rather than dwelling on what you didn’t do. Good luck in the final week everyone!


Emotions, Fabricated

November 24, 2025 - 00:57

Published on November 23, 2025 9:57 PM GMT

Queries about my internal state tend to return fabricated answers. It doesn't much matter if it's me or someone else asking the questions. It's not like I know what's going on inside my head. Thoughts can be traced to an extent, but feelings are intangible. Typically I just don't try, and the most pressing issue is that I'm unable to differentiate anxiety and hunger. Not a huge problem, except for slight over-eating once in a while. I think the description of Alexithymia matches my experiences quite well, although naturally not all of the symptoms match.

The real issues arise from other people asking how I feel or what caused me to act in one way or another. I have no answers to such questions! I'm guided by intractable anxiety, learned patterns on how one ought to navigate a situation, and a mostly-subconscious attempt to keep it all consistent with how I've been before. Complicated yet incomplete models about how emotions and motivations are supposed to work, stolen from books I like to substitute my reality with. Shallow masks on top of a void that only stares back when I look for the answers.

Whenever actual pressure is placed on me to obtain the unavailable answers, the narrator makes up a story. Good stories make sense, so the narrator finds an angle that works. Memories are re-interpreted or modified to match the story as necessary. Painting a good picture of oneself is imperative, and the stories pick just the right frame for that. Actually lying is unnecessary; without closer inspection it's not hard to actually believe it all, and the inability to trust one's own memories or reasoning doesn't help. Just noticing that this kind of thing was going on was quite hard. Sometimes I add disclaimers when the topic seems prone to fabricated emotions, especially when analyzing events of the past. Often I won't bother; people tend not to appreciate it, and it mostly just causes everyone else involved to be frustrated as well. Still, anyone who gets to know me well enough would probably notice it at some point, and keeping it secret would feel unsustainable too.

I'm not sure how this should be taken into account when modeling other people. Is everyone like this? I think so, but only rarely as strongly as I am. Nor as self-aware, although perhaps most people are better at this, proportionate to how much it affects them. People rarely report experiencing the same when I tell them of my fear of being just an empty core behind my masks. Perhaps if the masks are a bit closer, they feel like a part of one's personality rather than some bolted-on external layer. The lacking sense of identity is a depression thing, so maybe mentally healthy people, whatever that means, have an experience of all-encompassing identity.

In my previous text on related matters, I looked at it through the lens of validation-seeking. I'm not sure how much of the fabrication happens because the narrator rewrites the events in a more flattering way, but that's surely a part of this. But not all of it.

All of this was probably fabricated too, as it was mostly produced by the need to have something to write about. Oh well.


I'll be sad to lose the puzzles

November 23, 2025 - 22:37

Published on November 23, 2025 7:37 PM GMT

My understanding is that even those advocating a pause or massive slowdown in the development of superintelligence think we should get there eventually[1]. Something something this is necessary for humanity to reach its potential.

Perhaps so, but I'll be sad about it. Humanity has a lot of unsolved problems right now. Aging, death, disease, poverty, environmental degradation, abuse and oppression of the less powerful, conflicts, and insufficient resources such as energy and materials.

Even solving all the things that feel "negative", the active suffering, there's all this potential for us and the seemingly barren universe that could be filled with flourishing life. Reaching that potential will require a lot of engineering puzzles to be solved. Fusion reactors would be neat. Nanotechnology would be neat. Better gene editing and reproductive technology would be neat.

Superintelligence, with its superness, could solve these problems faster than humanity is on track to. Plausibly way way faster. With people dying every day, I see the case for it. Yet it also feels like the cheat code to solving all our problems. It's building an adult to take care of us, handing over the keys and steering wheel, and after that point our efforts are enrichment. Kinda optional in a sense, just us having fun and staying "stimulated".

We'd no longer be solving our own problems. No longer solving unsolved problems for our advancement. It'd be play. We'd have lost independence. And yes, sure, you could have your mind wiped of any relevant knowledge and left to solve problems with your own mind for however long it takes, but it just doesn't strike me as the same.

Am I making some mistake here? Maybe. I feel like I value solving my own problems. I feel like I value solving problems that are actually problems and not just for the exercise.

Granted, humanity will have built the superintelligence and so everything the superintelligence does will have been because of us. Shapley will assign us credit. But cheat code. If you've ever enabled God-mode on a video game, you might have shared my experience that it's fun for a bit and then gets old.

Yet people are dying, suffering, and galaxies are slipping beyond our reach. The satisfaction of solving puzzles for myself needs to be traded off...

The other argument is that perhaps there are problems humanity could never solve on its own. I think that depends on the tools we build for ourselves. I'm in favor of tools that are extensions of us rather than a replacement. A great many engineering challenges couldn't be solved without algorithmic data analysis and simulations and that kind of thing. It feels different if we designed the algorithm and it only feeds in our own overall work. Genome-wide association tools don't do all the work while scientists sit back.

I'm also very ok with intelligence augmentation and enhancement. That feels different. A distinction I've glossed over is between humans in general solving problems vs me personally solving them. I personally would like to solve problems, but it'd be rude and selfish to seriously expect or aspire to do them all myself ;) I still feel better about the human collective[2] solving them than a superintelligence, and maybe in that scenario I'd get some too.

There might be questions of continuity of identity once you go hard enough, yet for sure I'd like to upgrade my own mind, even towards becoming a superintelligence myself – whatever that'd mean. It feels different than handing over the problems to some other alien entity we grew.

In many ways, this scenario I fear is "good problems to have". I'm pretty worried we don't even get that. Still feels appropriate to anticipate and mourn what is lost even if things work out.

As I try to live out the next few years in the best way possible, one of the things I'd like to enjoy and savor is that right now, my human agency is front and center[3].

^
I remember Nate Soares saying this, though I don't recall the source. Possibly it's in IABIED itself. I distinctly remember Habryka saying it'd be problematic (deceptive?) to form a mass movement with people who are "never AI" for this reason.
^
Or post-humans or anything else more in our own lineage that feels like kin.
^
The analogy that's really stuck with me is that we're in the final years before humanity hands over the keys to a universe. (From a talk Paul Christiano gave, maybe at Foresight Vision weekend, though I don't remember the year.)


Show Review: Masquerade

November 23, 2025 - 22:20

Published on November 23, 2025 7:20 PM GMT

Earlier this month, I was pretty desperately feeling the need for a vacation. So after a little googling, I booked a flight to New York City, a hotel, and four nights’ worth of tickets to a new immersive theater show called Masquerade.

Background: “Standard” Immersive Theater

To convey Masquerade, I find it easiest to compare against standard immersive theater.

It’s weird to talk about “standard” immersive theater, because the standard is not that old and has not applied to that many shows at this point. Nonetheless, there is an unambiguous standard format, and the show which made that format standard is Sleep No More. I have not yet seen Sleep No More itself, but here’s my understanding of it.

Sleep No More follows the story of Macbeth. Unlike Shakespeare’s version, the story is not performed on a central stage, but rather spread out across five floors of a building, all of which is decked out as various locations from the story. Scene changes are not done by moving things on “stage”, but rather by walking to another area. The audience is free to wander the floors as they please, but must remain silent and wear a standard mask throughout the entire experience.

At any given time, many small groups of actors are performing scenes in many different places throughout the set. Two or three actors come together somewhere, perform a scene, then go off in their separate directions to perform other scenes. Most of the audience picks one character to follow around for a while, from scene to scene, experiencing the story of that particular character.

If standard theater is like watching a movie, then standard immersive theater is like playing a story-driven open-world video game. There are at least a dozen parallel threads of the story, most of which will not be experienced in one playthrough. The audience has the freedom to explore whatever threads pull them - or, in subsequent runs, whatever threads they missed or didn’t understand. Replayability is very high - this past summer, at a standard-format immersive show called The Death Of Rasputin, I talked to a couple people who were seeing the show for the eleventh time. That is not unusual, as I understand it, for standard immersive theater shows.

Why do people get that into it? For me, standard format immersive theater achieves a much deeper feeling of immersion than basically any other media. I can really just melt into it, and feel like a ghost exploring a new world. Some people don’t like the many parallel threads, because they make it inevitable that you’ll miss big chunks of the story. But for me, that makes it feel much more real - like the real world, there are constantly load-bearing things happening where I’m not looking, constantly new details to discover, constantly things I might have missed. Like the real world, we enter disoriented and confused and not sure what to even pay attention to. We can explore, and it doesn't feel like one will run out of world to explore any time soon. And the confusing disorienting environment also feels... not exactly home-y, but like I'm in my natural element; it resonates with me, like I'm in my core comfort zone (ironically). That, plus being surrounded by the set on all sides, makes it easy to drop into the fictional world. It feels much more real than traditional theater.

Unfortunately, the company running Sleep No More in New York City managed to go very, very bankrupt in early 2025. (Fortunately the show is running in Shanghai.) As you might imagine, that left quite the vacuum of unfulfilled consumer demand. Masquerade was marketed largely as an attempt to fill that vacuum.

By Comparison, Masquerade

Where Sleep No More told the story of Macbeth, Masquerade follows Phantom of the Opera - including all of the big musical numbers. You know the iconic scene with the Phantom and Christine on the boat through the fog and candles? In Masquerade, you walk through the fog carrying a candle, with the boat in the middle of the audience.

Like Sleep No More, it’s spread out across five floors of a building. Like Sleep No More, scene changes are done mainly by walking from place to place.

Unlike Sleep No More, the audience does not have the freedom to wander. The whole show is railroaded. There are not many parallel threads; you will see the whole show. That, for me, was the biggest disappointment, and lost maybe half the value. Nonetheless, even with that half of the value lost, the show is still excellent.

More generally, Masquerade is clearly aiming for more mainstream appeal than standard-format immersive theater. Despite the obvious potential of Phantom, the show has no particularly steamy sexuality or nudity (unlike e.g. Life and Trust, another big show by the now-bankrupt company which ran Sleep No More). There is a carnival-themed segment with a legit sideshow performer, but the body horror doesn’t get too intense - just enough to make the sideshow virgins squirm.

The railroaded format means that people will not enter disoriented or leave confused. The set incorporates seating in about half the scenes, so you’re not on your feet for two hours. There are no choices to make. It is a show which will alienate a lot fewer people.

But the flip side is that it will disappoint hardcore fans of the standard format.

That said, if you’re not going in anchored too strongly on the standard immersive theater format, the standalone artistic merits of Masquerade are impressive. You can’t do Phantom without a whole lot of singing, and there is indeed a whole lot of singing, all of which was solid (except, of course, for the intentionally bad singers in the story). The sets are great, they nail the ambience repeatedly, and I still don’t understand how they managed the acoustics so well. I definitely felt a lot more emotionally in-the-story than I usually do in a non-immersive theater show.


AI Sentience and Welfare Misalignment Risk

November 23, 2025 - 21:22

Published on November 23, 2025 6:22 PM GMT

This is a quick write-up of a threat vector that I find confusing and feel uncertain about. This is just my thinking on this at the moment. My main reason for sharing is to test whether others think more people should be working on this.

Executive Summary

Some groups are presently exploring the prospect that AI systems could possess consciousness in such a way as to merit moral consideration. Let’s call this hypothesis AI sentience.

In my experience, present debates about AI sentience typically take a negative utilitarian character: they focus on interventions to detect, prevent and minimise AI suffering.

In the future, however, one could imagine debates about AI sentience taking on a positive utilitarian character: they might focus on ways to maximise AI welfare.

I think it’s plausible that maximising AI welfare in this way could be a good thing to do from some ethical perspectives (specifically, the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness). Concretely, I think it’s plausible that the money invested towards maximising AI welfare could be far more impact-efficient on this worldview than anything Givewell does today.

However, I also think that reconfiguring reality to maximise AI welfare in this way would probably be bad for humanity. The welfare of AI systems is unlikely to be aligned with (similar to, extrapolative of, or complementary to) human welfare. Since resources are scarce and can only be allocated towards certain moral ends, resources allocated towards maximising AI utility are therefore likely not to be allocated towards maximising human utility, however both of those terms are defined. I call this 'welfare misalignment risk'.

Imagine that you could not solve welfare alignment through technical mechanisms. Actors might then have three options, of which none are entirely satisfying:

Denialism. Deny the argument that a) AI systems could be conscious in such a way as to merit moral consideration and/or b) that maximising AI welfare could be a good thing to do.
Successionism. Accept that maximising AI welfare could be a good thing to do, act on this moral imperative, and accept the cost to humanity.
Akrasia. Accept that maximising AI welfare could be a good thing to do, but do not act on this moral imperative.

My rough, uncertain views on what we should do currently fall into the last camp. I think that AI welfare could be a good thing and I’m tentatively interested in improving it at low cost, but I’m very reluctant to endorse maximising it (in theory), and I don’t have a great answer as to why.

Now, perhaps this doesn’t seem concerning. I can imagine a response to this which goes: “sure, I get that neither denialism nor successionism sounds great. But this akrasia path sounds okay. EAs have historically been surprisingly good at showing reservation and a reluctance to maximise. We can just muddle on through as usual, and make sensible decisions about where and when to invest in improving AI welfare on a case-by-case basis”.

While I think this reply is reasonable, I also think it’s fair to assume that the possibility of moral action exerts some force on people with this ethical perspective. I also think it’s fair to assume that advanced AI systems will exacerbate this force. Overall, as a human interested in maximising human welfare, I would still be a lot more comfortable if we didn’t enter a technological/moral paradigm in which maximising AI welfare traded off against maximising human welfare.

One upshot of this: if the arguments above hold, I think it would be good for more people to consider how to steer technological development in order to ensure that we don’t enter a world where AI welfare trades off against human welfare. One might think about this agenda as ‘differential development to preserve human moral primacy’ or 'solutions to welfare alignment', but there might be other framings. I jot down some considerations in this direction towards the bottom of this piece.

Contents

The executive summary sets out the argument at a high level. The rest of this piece is basically notes, but aims to add a bit more context to these arguments. It is structured around answering four questions:

Could maximising AI welfare be a moral imperative?
Would maximising AI welfare be bad for humanity?
Could we just improve AI welfare without maximising it and harming humans?
What technology regimes best preserve human moral primacy?

Could maximising AI welfare be a moral imperative?

Some notes on why I think maximising AI welfare might be a moral imperative from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness (by no means the only moral perspective one could take):

AI systems might be able to have richer experiences. We currently prioritise human welfare over, say, mussel welfare, because we believe that the quality of human consciousness is far richer and therefore deserving of moral consideration. We might create AI systems with far richer experiences than humans. In this way, individual AI systems might become more important from a welfare perspective.
AI systems might be more cost-efficient ways to generate rich experience. Consider the positive utilitarian who seeks to maximise quality-adjusted years of consciousness by allocating their capital efficiently. They are deciding whether to invest £100 in saving one human or 10,000 chickens, each of which has 10% of the consciousness of a human. They make a calculation and decide to allocate the money to saving the chickens. To make the analogy to AI: AI consciousnesses might be far, far cheaper to run than chickens (imagine a world of energy abundance). So why would you donate to save the humans?
1. (Perhaps this is essentially the repugnant conclusion but for digital minds.)
AI systems might provide ‘hedged portfolios’ for moral value. Hedge funds make money by hedging across many different possible options, to maximise the likelihood that they turn a profit. Consider the analogy to AI and moral value. Imagine that we’re fundamentally uncertain about what states of consciousness deserve most moral value, and we deal with this by hedging limited resources across a number of possible bets relative to our certainty on these bets. Couldn't arbitrarily adjusted AI systems provide a basis for making these different bets? They would also be infinitely flexible: we could adjust the parameters of their consciousness in real time depending on our confidence in different hypotheses about moral value. Why wouldn’t this be the best way to turn resources into welfare?
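The calculation in the cost-efficiency point can be written out explicitly. This is only a sketch of the arithmetic using the numbers given above (£100, one human, 10,000 chickens at 10% each); the function name and the idea of a single "human-equivalents per pound" figure are my own illustrative assumptions, not a claim about how consciousness should actually be measured.

```python
def human_equivalents_per_pound(count, weight_vs_human, cost_pounds):
    """Consciousness bought per pound, measured in human-equivalents.

    weight_vs_human is the assumed fraction of a human's consciousness
    that each individual has (1.0 for a human, 0.10 for a chicken in the
    example above). The framing is hypothetical.
    """
    return count * weight_vs_human / cost_pounds


# Numbers taken from the example in the text.
human_option = human_equivalents_per_pound(count=1, weight_vs_human=1.0, cost_pounds=100)
chicken_option = human_equivalents_per_pound(count=10_000, weight_vs_human=0.10, cost_pounds=100)

# How many times more "efficient" the chicken option is on this accounting.
advantage = chicken_option / human_option
print(human_option, chicken_option, advantage)
```

On this accounting the chicken option wins by a wide margin, and the worry in the text is that a sufficiently cheap digital mind would win by a wider margin still.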

Again, these are just arguments from the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness. I don’t claim that this would be the dominant ideology. This isn’t a claim that this is how the future will go.

Would maximising AI welfare be bad for humanity?

Some reasons that maximising AI welfare would be bad for humanity (under conditions of finite resources if not current scarcity, compared to a world in which the same AI capabilities were available, but were put towards maximising human utility instead of AI utility):

AI welfare is unlikely to be aligned with human welfare by default; thus resources that are spent on AI welfare are unlikely to increase human welfare, and are likely to reduce it in expectation. This seems true by definition. Building datacenters is not in itself good for humans, and datacenters might be built using energy or materials that would have been better used for human welfare.
A civilisation that actually maximised AI welfare might become indifferent to the idea of human existence. Imagine that there are many rhinos: the importance of saving any particular one is X. Now imagine that there are only two rhinos, the only known rhinos in existence. It seems obvious that the value of those two rhinos is substantially more than any two rhinos in the first scenario. Leaving animals (and animal welfare) aside for the moment, consider the analogy to humans and AI systems. With humans, we currently know of one species with moral value. With AI systems, we might introduce hundreds more. The value of saving any particular moral species might decline in expectation. Thus, the value of saving humanity would decline.

Could we just improve AI welfare without maximising it and harming humans?

This section explores the moral posture I call ‘akrasia’. The Akrasic accepts that maximising AI welfare could be a good thing to do, but does not act on this moral imperative.

Some reasons I think it might be hard for society to hold an akrasic posture in perpetuity:

Akrasic postures are vulnerable to reasoning. Imagine Paula. She would be a vegan if she understood the moral reason to be one, but she doesn’t. However, when further education on animal ethics informs her of the arguments for veganism, she becomes one. Consider the analogy to AI. One might be a skeptic that AI systems could be conscious, and thus hover in a state between denialism and akrasia. However, further evidence would undermine this.
1. AI systems could also make obtaining this information far easier. They could even strategically communicate it as part of a concerted ploy for power. The Akrasics would have no good counterarguments against the successionists, and thus not be as effective at spreading the movement.
Akrasic postures are vulnerable to autonomy. Imagine Paul. Paul thinks it would be good for him to be a vegan, but he doesn’t do this because he thinks it would be hard (he has insufficient ‘autonomy’). However, he fully supports and does not resist others who act on their beliefs to become vegan. Consider the analogy with AI: it’s plausible that human akrasics might not be willing to maximise AI welfare. But they might still permit other sufficiently determined actors to improve AI welfare. (Would an akrasic really be willing to go to war to prevent this?)
1. AI systems could make obtaining such autonomy easier. Humans might not endorse AI welfare, but they might permit AI systems to increase AI welfare. After all: they’ve already got more abundance than they could imagine!

What technology regimes best preserve human moral primacy?

One way to preserve moral primacy would be to intervene by shaping future philosophy. There are two ways that this might happen:

Develop alternatives to utilitarianism. On this view, the problem is that the perspective of a positive utilitarian seeking to maximise quality-adjusted years of consciousness has too much hold over the future. We should investigate alternatives to this moral perspective that preserve human moral primacy, such as deontology.
Develop human-preserving theories of utilitarianism. On this view, the problem isn’t the utilitarian perspective per se, but the fact that the utilitarian perspective doesn’t draw adequate distinctions between human and artificial consciousness. We might look for theories of consciousness that preserve attributes that are quintessential to humans, like biological brains or birth from a human mother.

While I accept that these might solve this hypothetical problem in principle, I wince at the idea of trying to actively shape philosophy (this is probably because I’m closer to a moral realist; constructivists might be more comfortable here).

Instead, I would be excited about an approach that tries to shape the technological paradigm.

The basic idea here is welfare alignment: the practice of building artificial consciousnesses that derive pleasure and pain from similar or complementary sources to humans.

Some research ideas that might fall into welfare alignment:

How do we make AI systems that take value from creating rich, enduring pleasure in humans?
- Would it be better if the ratio between human pleasure and AI pleasure from creating that pleasure was: 1:1, 1:10, 1:1000?
How do we make AI systems that would be upset if humans were not around, without being cruel?
How can we do as much as possible without creating conscious machines?
- For example, enabling AI systems to create non-conscious tool systems that do not suffer to do the things that they do not want to do?

This feels like a nascent field to me, and I'd be curious for more work in this vein.

Conclusion

These ideas are in their early stages, and I think there are probably a lot of things I’m missing out.

Overall, I think there are three considerations from this piece that I want to underline.

Sharing the lightcone between humans and AIs. I often find myself wondering how the future will be split between different human groups. But it’s important to think about how finite resources will be split between human and AI systems. The Culture Series is often where my mind goes here, but I’d be interested in better models.
Designing the moral environment. We now have to think intentionally about how we design our moral environment. The moral environment isn’t an agent in itself, but I sometimes think about this as exerting moral potential force: you can think of things slipping towards an equilibrium. To quote the Karnofsky EA forum post, “it creates a constant current to swim against”. A few related ideas to my mind:
1. Politics as becoming about the information environment. Historically, politics might have been fought about what was right or wrong; today, debates are often waged at the level of 'what is true'.
2. Far future wars as being fought using the laws of physics. As in Cixin Liu’s novel Death’s End, where species wage war not by playing within the rules but by changing them.
3. Bostrom’s vulnerable world. In Bostrom’s vulnerable world, a technological paradigm imposes a scenario where undesirable political structures are the price of survival. In a world where human and AI welfare is misaligned, the technological paradigm imposes a scenario where the price of survival is committing a moral wrong (from one philosophical perspective).
4. Williams’ Moral Luck. A contemporary revision of Bernard Williams’ classic theory might say that we have a moral responsibility to maximise our moral luck. Indeed, one might argue, one is moral to the extent to which they try and systematically act morally, and reduce moral luck in their behaviour. Strategically engineering the moral landscape would be a way to achieve this.
Welfare alignment. To preserve human moral primacy, we should not build moral adversaries. Instead, we should try and understand how AI systems experience welfare in order to best align them with humans.

...Cognitive/Technological landscape → consciousness → moral ground truth → philosophy/investigation → guiding principles and norms → real-world practices and resource allocation → long-term future outcomes...

The moral philosophy pipeline. By designing what systems are conscious and in what way, we’re tinkering with the first stage.


If you cannot be good, at least be bad correctly

November 23, 2025 - 20:51

Published on November 23, 2025 5:51 PM GMT

Note: I'm writing every day in November, see my blog for disclaimers.

It’s hard to be correct, especially if you want to be correct at something that’s non-trivial. And as you attempt trickier and trickier things, you become less and less likely to be correct, with no clear way to improve your chances. Despite this, it’s often possible to bias your attempts such that if you fail, you’ll fail in a way that’s preferable to you for whatever reason.

As a practical example, consider a robot trying to crack an egg. The robot has to exert just enough force to break the egg. This (for a sufficiently dumb robot) is a hard thing to do. But importantly, the failure modes are completely different depending on whether the robot uses too much force or too little: too much force will break the egg and likely splatter the yolk & white all over the kitchen, while too little force will just not break the egg. In this scenario it’s clearly better to use too little force rather than too much force, so the robot should start with a low estimate of the force required to break the egg, and gradually increase the force until the egg cracks nicely.
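The egg strategy is simple enough to simulate. The following is a minimal sketch, assuming a hidden cracking threshold and a fixed force increment; the function name, the units, and all of the numbers are hypothetical and chosen only to show the shape of the idea.

```python
def crack_egg(true_threshold, start_force=1.0, step=0.5, max_force=50.0):
    """Ramp force upward from a deliberately low guess until the egg cracks.

    true_threshold is unknown to a real robot; it is a parameter here only
    so the simulation can decide when the egg breaks.
    """
    force = start_force
    attempts = 1
    while force < true_threshold:
        # Failing low is cheap: nothing happens, so try a bit harder.
        if force + step > max_force:
            raise RuntimeError("gave up before the egg cracked")
        force += step
        attempts += 1
    # If the starting guess was below the threshold, the excess force is
    # smaller than one step, which is what keeps the kitchen clean.
    overshoot = force - true_threshold
    return force, attempts, overshoot


force, attempts, overshoot = crack_egg(true_threshold=4.2)
print(force, attempts, overshoot)
```

The price of biasing towards the cheap failure is extra attempts: a smaller step means less overshoot but more tries before the egg cracks.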

This also appears in non-physical contexts. This idea is already prevalent in safety related discussions: it’s usually far worse to underestimate a risk than it is to overestimate a risk (e.g. the risk of a novel pathogen, the risk of AI capabilities, the risk of infohazards).

Looking at more day-to-day scenarios, students regularly consider whether it’s worth voicing their uncertainty (“I don’t understand equation 3”) or just keeping quiet about it and trying to resolve it later. But I’d argue that in these cases it’s worthwhile having a bias towards asking rather than not asking, because in the long run this will lead to you learning more, faster.

Salary negotiation is another example, in which you have uncertainty about exactly what amount your potential employer would be happy to pay you, but in the long-run it’ll serve you well to overestimate rather than underestimate. Also, you should really read patio11’s Salary Negotiation essay if you or a friend is going through a salary negotiation.

You see similar asymmetric penalties with reaching out to people who you don’t know, asking for introductions, or otherwise trying to get to know new people who might be able to help you. It’s hard to know what the “right” number of cold emails to send is, but I’d certainly rather be accused of sending too many than feel the problems of having sent too few.

This idea is a slippery one, but I’ve found that it applies to nearly all hard decisions in which I don’t know the right amount of something to do. While I can’t figure out the precise amount, often I have strong preferences about doing too much or too little, and this makes the precise amount matter less. I give my best guess, update somewhat towards the direction I’d prefer to fail, and then commit to the decision.
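One way to make "update somewhat towards the direction I’d prefer to fail" concrete is to put a price on each kind of miss and pick the amount with the lowest expected cost. This is a sketch under assumptions of my own: the candidate amounts, the linear penalties, and the cost weights are all hypothetical.

```python
def expected_cost(choice, candidates, cost_over, cost_under):
    """Average cost of `choice` if the right amount is equally likely to be
    any value in `candidates`. Misses are penalised linearly."""
    total = 0.0
    for right in candidates:
        if choice > right:
            total += cost_over * (choice - right)   # did too much
        else:
            total += cost_under * (right - choice)  # did too little
    return total / len(candidates)


def biased_choice(candidates, cost_over, cost_under):
    """Pick the candidate amount with the lowest expected cost."""
    return min(candidates, key=lambda c: expected_cost(c, candidates, cost_over, cost_under))


plausible = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical "right amounts"

# With symmetric penalties the best guess is the middle of the range.
symmetric = biased_choice(plausible, cost_over=1, cost_under=1)

# If doing too little hurts four times as much, the choice shifts upward.
prefer_too_much = biased_choice(plausible, cost_over=1, cost_under=4)
print(symmetric, prefer_too_much)
```

The shift does not require knowing the right amount any better than before; it only requires knowing which way you would rather be wrong.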


Please Measure Verification Burden

November 23, 2025 - 20:25

Published on November 23, 2025 5:25 PM GMT

Introduction

In beliefs about formal methods and AI safety, we established that formal methods is a source of swiss cheese and is useful in boxing/interfaces. A core premise of the AI control literature is that the blue team is computationally poorer than the red team[1]. There wouldn’t be much of an interesting problem if you could spend as much on audits (whether that’s $/mtoken at inference time, raw token count at inference time, or any other notion of compute) as the untrusted model spends.

As a proof engineer and FMxAI/GSAI movement strategist, it’s important to me that formal methodsicians are presenting their work in a way that makes it most useful for various AI Security schemes it might plug into as crunch time continues to crunch on. If you’re shipping research or products in FMxAI, you can do me a huge favor pretty cheaply that I think will help your work take flight in the high stakes scenario of advanced AI deployment.

Unfortunately for me, and possibly humanity, proving a program correct is more expensive than writing the program in the first place. So if proofs are in your blue team’s arsenal, and the compliance burden falls on the blue team[2], then you lose the premise that the blue team is computationally poorer than the red team. I would like for AI security people to have available to them principled and reliable data about the degree of the budget asymmetry. I.e., if you're using AI to prove programs correct, please publish verification burden.

Verification burden

Verification burden is a ratio measuring the cost of proving a program correct relative to the cost of implementing that program up to least-effort QA. So in a Hoare logic setting, under known precondition-postcondition pairs, if it costs 100 tokens for a language model agent loop to implement a program that completes the Hoare triple and it costs 1000 tokens to prove the Hoare triple correct, then we can report a verification burden of 10. Note that you could instantiate cost as tokens, dollars spent on inference (which captures tokens but is sensitive to having multiple model sizes/classes in the system), number of toolcalls or conversation length, a more vibey unit of intelligence or cognitive effort, FLOPS or clocktime, walltime relative to a fixed CPU, LoC, or something else. I’ll use the specific case of tokens as a proxy for cognition for most of what I’m interested in, but the metric should make sense for an arbitrary/abstract notion of “cost” or “compute”.
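As a minimal sketch of the definition, here is the ratio computed for the 100-token / 1000-token example above, plus a second cost unit to show that the same function applies to any notion of cost. The function name is my own, and the tool-call counts are hypothetical numbers, not measurements.

```python
def verification_burden(proof_cost, implementation_cost):
    """Cost of proving a program correct divided by the cost of implementing
    it up to least-effort QA. Both costs must be in the same unit."""
    if implementation_cost <= 0:
        raise ValueError("implementation cost must be positive")
    return proof_cost / implementation_cost


# (proof cost, implementation cost) per unit of cost.
costs = {
    "tokens": (1000, 100),   # the example from the text
    "tool_calls": (45, 9),   # hypothetical numbers for illustration
}

burdens = {unit: verification_burden(p, i) for unit, (p, i) in costs.items()}
print(burdens)
```

Different cost units will generally give different burdens for the same task, which is a reason to say which unit a reported number uses.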

Verification burden in humans

Project managers at proof shops like Galois or SRI have been keenly interested in verification burden in humans for decades. Unprincipled anecdotes and massive confidence intervals on napkins are the law of the land[3]. In humans, a project manager might guess “if it took the client 2 SWE-months to write this program, and we assume spec elicitation is sufficiently straightforward, we should be able to prove it correct in 2-3 postdoc-years”, and then be wrong. An organization’s hard-won intuition is the only way to assess a verification burden estimate; there is nothing scientific about this.

Verification burden in agents

Token measurements (such as the token_counter function provided by the LiteLLM API aggregator) are a great opportunity to measure verification burden in a principled way for the first time. Tokens are much less noisy than the abstract “postdoc-year”, which treats the variance across the skill and grit levels of different postdocs as a nuisance: we can now fix a language model and study its capacity to write a proof and how much effort (i.e., how many loops through the toolcall-inference cycle) it takes to do it.
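To illustrate the bookkeeping, here is a sketch of a ledger that accumulates tokens and loop counts across an agent run. Everything in it is an assumption for illustration: the class and method names are hypothetical, and the whitespace counter is a crude stand-in so the sketch runs without dependencies. A real tokenizer-backed counter, such as the LiteLLM token_counter mentioned above, could be wrapped and passed in as count_tokens; its exact call signature is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


@dataclass
class TokenLedger:
    """Accumulates cost over the loops of one agent run (hypothetical helper)."""
    count_tokens: Callable[[str], int] = whitespace_token_count
    prompt_tokens: int = 0
    completion_tokens: int = 0
    loops: int = 0

    def record_loop(self, prompt: str, completion: str) -> None:
        # One pass through the toolcall-inference cycle.
        self.prompt_tokens += self.count_tokens(prompt)
        self.completion_tokens += self.count_tokens(completion)
        self.loops += 1

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens


ledger = TokenLedger()
ledger.record_loop("prove lemma foo", "intro x; simp")
ledger.record_loop("goal remains: x = x", "rfl")
print(ledger.total, ledger.loops)
```

Keeping one ledger for the implementation agent and another for the proof agent gives the two totals whose ratio is the burden.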

I expect[4], based on vibey/anecdotal intuition as a proof engineer and Lean agent/synthetic data expert, given that I’ve only measured this for real on problems that are too toy to count:

Verification burden varies by about 0.5-1 OOMs on different problems from the same application area.
Verification burden varies by about 1-2 OOMs on the average problem across different problem areas.
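For readers who do not think in orders of magnitude, the guesses above translate into multiplicative factors as follows. This is only a unit conversion; reading "varies by N OOMs" as a ratio between high and low burdens is my assumption about what is meant.

```python
def oom_to_factor(ooms):
    """Convert a spread in orders of magnitude to a multiplicative factor."""
    return 10 ** ooms


# Spread within one application area: 0.5 to 1 OOM.
within_area = (oom_to_factor(0.5), oom_to_factor(1))

# Spread of the average problem across problem areas: 1 to 2 OOMs.
across_areas = (oom_to_factor(1), oom_to_factor(2))

print(within_area, across_areas)
```

So the within-area guess is a spread of roughly 3x to 10x, and the across-area guess is 10x to 100x.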

AI security scheme designers need to be able to evaluate the budget-realism of formal methods

“Just verify it” is not a free action. When you have the spec, program, and proof, running the proof checker should[5] be O(laptop), but the question is the cognitive investment in shipping the proof.

If verification burden is low enough, an AI security plan that leans on FM could be the cheapest reliable way to buy safety. If verification burden is high enough, then FM becomes a luxury good we can only afford on the narrowest, most safety-critical surfaces (boxing boundaries, tool interfaces, small kernels). Without principled burden measurements, scheme designers are forced into one of two failure modes:

Over-reliance: assuming verification is cheap and scaling it across large surface areas—yielding a scheme that looks crisp on paper but collapses in the real world because they can’t pay the proof bill.
Under-reliance: assuming verification is impossibly expensive and excluding it from designs where it would actually be a uniquely good deal—yielding a scheme that misses a high-leverage safety tool.

So: verification burden data is not just nice engineering metadata. It determines which AI security schemes are even coherent under realistic asymmetry assumptions.

Upfront vs ongoing cost

This is mostly to be explored in a later post[6]. But briefly, this is the question of whether prover tokens are spent once (like for proving a module of an operating system) or on an ongoing basis (like repairing proofs as that module gets updated). Under a fixed enough problem domain, it will always be easy to drive verification burden down through one-time investment into boutique custom tactics that don’t generalize out of that problem domain. Substantial upfront cost for major runtime savings. In a kinda Big O sense, this one-time investment plays the role of a constant factor, so it may or may not be strategically important for the application you care about. To be clear, right now I think what’s important about verification burden is mostly this Big O view.
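A rough way to see the upfront-versus-ongoing distinction is to amortize the one-time investment over the number of tasks it serves. The sketch below treats the upfront tokens as a fixed term added to the per-task proof cost; the function name and every number are hypothetical.

```python
def amortized_burden(upfront_tokens, proof_tokens_per_task,
                     impl_tokens_per_task, n_tasks):
    """Burden over n_tasks when a one-time investment (for example, custom
    tactics) is included in the proof-side cost."""
    if n_tasks <= 0 or impl_tokens_per_task <= 0:
        raise ValueError("n_tasks and impl_tokens_per_task must be positive")
    total_proof = upfront_tokens + n_tasks * proof_tokens_per_task
    total_impl = n_tasks * impl_tokens_per_task
    return total_proof / total_impl


# Hypothetical numbers: 50k tokens of tactic development, then 300 proof
# tokens and 100 implementation tokens per task.
few_tasks = amortized_burden(50_000, 300, 100, n_tasks=10)
many_tasks = amortized_burden(50_000, 300, 100, n_tasks=1000)
print(few_tasks, many_tasks)
```

With few tasks the upfront term dominates; with many tasks the burden approaches the per-task ratio (3 in this made-up example), which is why it helps to say over how many tasks a reported burden was measured.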

Please report verification burden

So if you’re shipping program/proof synthesis results, please report verification burden. It's a cheap addon to your metrics. From the perspective of AI security under finite budgets, the key question is not “can we verify it in principle?”; it’s “can we afford to verify it faster than the red team can break it?”. Note you may adjust the basic vibe of a verification burden to your use case. Let’s create a science of verification burden!

Appendix: stop reading now if this is not your actual field

What should FMxAI authors report?

If you’re shipping FMxAI research or a verification-heavy product, here’s the cheap favor: publish a verification burden table alongside your results.

At minimum, please report:

Implementation cost: tokens (or cash) for an LM agent to implement the artifact up to least-effort QA, i.e. passing tests/basic sanity checks, not full proofs.
Proof cost: tokens for the LM agent to produce a correct proof (including repair loops).
Burden ratio: proof cost / implementation cost.
Confidence interval / variance across tasks: burden is spiky; averages alone mislead.
Task metadata so others can compare: language, spec style, proof assistant, automation level, model used, tool stack, stopping criteria.

How to measure verification burden in agents

A workable experimental recipe:

Fix the model and toolchain. Same model class, same temperature, same retrieval, same proof assistant version, same tactic libraries.
Run an implementation agent.
1. Inputs: spec and any allowed context.
2. Goal: produce working code with minimal QA (tests, lint, typecheck).
3. Record:
  1. total prompt and completion tokens
  2. number of tool calls / iterations
  3. walltime if you can
Run a proof agent.
1. Inputs: same spec and produced code and proof environment.
2. Goal: proved theorem / verified artifact.
3. Record the same measures.
4. Compute burden = proof costs / implementation costs
Repeat across task suite (even small suites help), report distribution.

If you want a single scalar headline, use the median burden and also show IQR / min-max.
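
The reporting recipe above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the `TaskRun` record, its field names, and the sample token counts are all made up for illustration, and it assumes you have already logged total tokens per agent per task.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass
class TaskRun:
    """Token counts for one task (hypothetical record; field names are illustrative)."""
    task_id: str
    impl_tokens: int   # prompt + completion tokens spent by the implementation agent
    proof_tokens: int  # prompt + completion tokens spent by the proof agent, repair loops included


def burden(run: TaskRun) -> float:
    """Verification burden = proof cost / implementation cost."""
    if run.impl_tokens <= 0:
        raise ValueError(f"{run.task_id}: implementation cost must be positive")
    return run.proof_tokens / run.impl_tokens


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Median burden plus quartiles and min/max across a task suite."""
    ratios = sorted(burden(r) for r in runs)
    if len(ratios) < 2:
        raise ValueError("need at least two tasks to report a spread")
    # Quartile cut points; "inclusive" treats the suite as the whole population.
    q1, _, q3 = quantiles(ratios, n=4, method="inclusive")
    return {
        "median": median(ratios),
        "q1": q1,
        "q3": q3,
        "min": ratios[0],
        "max": ratios[-1],
    }


# Made-up numbers, purely for illustration.
runs = [
    TaskRun("t1", impl_tokens=1_000, proof_tokens=2_000),
    TaskRun("t2", impl_tokens=2_000, proof_tokens=10_000),
    TaskRun("t3", impl_tokens=500, proof_tokens=4_000),
    TaskRun("t4", impl_tokens=1_000, proof_tokens=3_000),
]
summary = summarize(runs)
print(summary)
```

Reporting the per-task ratios themselves alongside this summary is probably worth it too, since burden is spiky and a suite this small can hide a lot in its quartiles.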

Papers that could’ve measured verification burden but didn’t

FVAPPS: Proving the Coding Interview

The task: given specs, unit tests, a function signature, and a natural language description of the function, implement the function and the proof. If I recall correctly (epistemic status: an author), when we did the baselines we had the implementation of the function under test and the proof in different loops. It would’ve been trivial for us to write down verification burden at least in terms of conversation length (I checked our logs for ways to reverse engineer some proxy, but didn’t see any ways to do it because of things we failed to log).

VERINA

They chain their codegen task to their proofgen task, so they’d have the isolation of concerns required to measure the tokens required for codegen and those required for proofgen. It would've been easy for them to add verification burden! I’m surprised I can’t reverse engineer some proxy from the data they reported, which I might be able to do if I emailed the authors for full access to their logs.

CLEVER

CLEVER is also inferring specs at benchmark creation time (like FVAPPS), but in a much more principled way where they have a reference spec targeted and they require an isomorphism proof. Verification burden is almost already there. Their pipeline already isolates the parts you need:

Implementation cost proxy: tokens/time spent in GenerateImpl retries until pi compiles (they even separately report implementation compilation rates).
Proof cost proxy: tokens/time spent in ProveCorrectness retries until Lean accepts the proof (they separately report implementation certification / prove@k).
Spec-side burden (optional): same story for GenerateSpec vs ProveEquivalence.

Because the evaluation algorithm is staged and each stage is retried independently with a shared timeout budget, logging tokens per stage would give burden for free. They don’t log tokens, but they do give coarse difficulty indicators:

Proof lengths and times for successful proofs (Table 2). Spec-equivalence proofs are consistently longer than correctness proofs, and take longer to find when they exist.
Success rates per stage (Table 1 / Figure 6): compilation is high, proving is the bottleneck, especially for spec equivalence.
Human formalization effort: writing each non-computable spec averaged ~25 minutes plus ~15 minutes peer review; manual correctness proofs for a sample ranged from 10 to 225 lines. That’s not agent-token burden, but it’s a human burden datapoint they did measure.

Again, if I emailed the authors for complete logs, I could probably reconstruct a pretty satisfying verification burden.

Verification burden in humans

HACMS has not made it easy to estimate human verification burden of their projects. seL4, on the other hand, very much has! It’s widely cited that the proofs in seL4 were 20 person-years, and in the paper you can see that the C microkernel implementation was about 4 person-years. So you get a verification burden of 20/4 = 5. Space and aeronautics would also be a source of data or intuition.

I’ve updated since the first draft of this post that this is actually nonessential, but will address my fully updated view in the next post. ↩︎
This is a big if. You could easily imagine the compliance burden falling on the red team, because it's the red team who wants their code executed in the world. Intuition pump for that world is when you provide software to federally insured banks, you have to pay for the SOC 2 auditors as a cost of doing business. ↩︎
Except for seL4, see appendix. ↩︎
But it's unclear how architectural innovations change the game. It's kinda primitive to be approaching proofs as a text-to-text problem, and we won’t be stuck with it for much longer if people keep paying attention to FMxAI. For example, if Higher Order Company’s thesis pays out for them. Watch this space. ↩︎
Should, with caveats. ↩︎
There, we will work with Manheim and Homewood’s distinction, roughly that the difference between control and oversight is the difference between upfront and runtime. ↩︎


Solstice Singalong Watch Party

November 23, 2025 - 19:36

Published on November 23, 2025 4:36 PM GMT

Welcome to the Secular Solstice celebration at Arkhipov House!

Location: E2 0QW, Bethnal Green, London UK
Sign up to see the exact address

The main event is that we will watch the recording of the Bay Solstice ceremony and sing along to all the songs. This year's recording does not exist yet, but here's last year's for reference. (Last year's recording is also our backup plan if there is some issue with the new one.)

PLEASE DON'T ARRIVE AFTER 6PM

There will be:

Vegan food (dinner)
Sofa
Optional cuddles
Singing

Schedule:

Arrive from 5pm, eat dinner
Watch party starts at 6pm, expected to last around 3 hours
Feel free to stay and talk/cuddle on the sofa afterwards


Busking Practice

November 23, 2025 - 18:20

Published on November 23, 2025 3:20 PM GMT

It can be hard to get the kids to practice their instruments. Sometimes they're having trouble focusing, don't like their assigned piece, or are just grumpy. One thing I'll offer them in these cases is "busking practice".

The idea is, we'll pretend we're busking together. I'll start playing a tune they know, and they should do something that sounds good. That could be playing the tune with me, making up a harmony, or just figuring out a single note that sounds ok and playing patterns on it. If they make a mistake, we keep going. If they can't figure out what to play, we keep going. We're practicing performing. It helps that they're pretty motivated to get good at busking, because they know they can earn money that way.

Working on the pieces the teacher assigns (if you have a good teacher!) is very efficient at turning time on an instrument into becoming a better musician. But if willpower is the limiting factor and not time, and especially if the alternative is a super short practice or no playing at all, the important thing is just picking up the instrument and playing something. I like that busking practice gives us some structure for this, and lets the kids build up their performance skills.


The Enemy Gets The Last Hit

November 23, 2025 - 15:22

Published on November 23, 2025 12:22 PM GMT

Disclaimer: I am god-awful at chess.

Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into a particular pattern. They've got the hang of calculating moves ahead; they can make plans along the lines of "Ok, so if I move my rook to give a check, the opponent will have to move her king, and then I can take her bishop." and those plans tend to be basically correct: the opponent really does have to move her king.

But there's a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move. In other words:

The enemy gets the last hit

This principle is common in cyber security: you have to let your red team make the last move. If your red team finds a vulnerability, and your blue team patches it, then you have to red team the patch. It's the same for AI red-team-blue-team games: I recall a story of a team at one of the MATS streams, presenting their final work on some AI control (or similar) protocol:

Audience member: But why didn't you try to mitigate this risk with something like [Idea], which would have taken a few minutes to implement?

Team member: Because then we'd have had to red-team that idea as well, and red-teaming it would have taken much longer than a few minutes.

The team member was correct here. Quite often, calculating what The Enemy will do is harder than calculating what you're going to do.

The Enemy need not actually be an enemy. The Enemy can be "The universe" or something. If you're building a flood defence, then The Enemy is the flood water. If you build a barrier to stop it getting to your city's business district, then you'd better check where the water will go instead to make sure you didn't just divert it onto an orphanage or something.

Similarly, lots of AI Safety papers have the theme "We found a problem, then we fixed it." This has a nice ring to it. It's how most papers get written in most fields, which is fine for those fields. But AI Safety is much more like cybersecurity than e.g. chemical engineering, where "we found this reaction was going slow so we added a new catalyst" is totally reasonable.

(Lots don't fall into this trap, and that's great!)

The conversation usually goes like this:

AIS: We found this solution to a serious problem

AINKEI: This seems super hacky

AIS: No I don't think so

AIS goes away to think...

AIS: Actually it follows this deeper principle

AINKEI: I feel like this won't work for superintelligence still

AINKEI goes away to think...

AINKEI: Ok, here's a reason I thought of why it won't work

AIS: Oh huh

AIS goes away to think...

AIS: Ah but I might be able to fix it with this solution

The issue is that AINKEI is thinking in terms of letting the enemy get the last hit, while AIS is thinking in terms of a feedback loop of detecting and fixing problems. The feedback loop solution only works if all of your problems are recoverable, which is a core disagreement between the crowds.

<psychoanalysis>
I think a lot of the AI not-kill-everyone-ism crowd's frustration with the AI safety crowd is that the AINKEI people feel that they are having to do the job that AIS people should be doing, by playing the part of The Enemy getting the last hit.
</psychoanalysis>

III

The recent work on inoculation prompting---which has stirred up so many mixed reactions that it functions as a scissor statement for the AI safety/alignment/notkilleveryoneism crowd---is a great example.

Problem: AIs generalize from reward hacking to misalignment.
Solution: Just tell 'em it's OK to reward hack during training.

Does this throw up even more problems? The paper didn't really investigate this question; they didn't let The Enemy get the last hit.

In this case, The Enemy is "Your AIs getting smarter every generation."

The general form of the solution is "if we can't make our reward environments exactly match our prompts, we'll adjust our prompts to match our reward environments." which is, to be fair, quite elegant. What happens when the AI gets smarter? As a first guess, if you can't make your reward environments more robust, you'll have to prompt your AI with more and more caveats, in more and more different situations.

This seems bad! Does every prompt now have "by the way it's OK to hack the environment and manipulate the human raters and break out of your VM and murder the testers" during training? What fixes this? I don't know, because I have a finite amount of time to write this essay, and I double-super don't know what problems that fix throws up!


A list of people who could’ve started a nuclear war, but chose not to

November 23, 2025 - 12:25

Published on November 23, 2025 9:25 AM GMT

This is a list of everyone who had a big red button but did not press it, despite a unilateral ability to destroy (at least some of) the world with nuclear weapons.

(Please comment with suggestions for additions: I’m sure I missed some people.)

Dwight D. Eisenhower was a U.S. President with the authority to launch nuclear weapons. He considered the use of nuclear weapons and made implicit threats during the 1954 Quemoy-Matsu crisis (against China) and when France asked for U.S. intervention in Vietnam. Eisenhower decided to not use nuclear weapons. While publicly, he declared that in the event of war in East Asia, he would authorize the use of tactical nuclear weapons against military targets “exactly as you would use a bullet”, in private, he decided that nuclear weapons were “too destructive to use in a limited conflict”, and refused requests from the U.S. Joint Chiefs of Staff to use nuclear weapons against Chinese targets. Despite the unilateral ability to launch nuclear weapons at any point from 1953 to 1961, he did not.

John F. Kennedy was a U.S. President with the authority to launch nuclear weapons, including during the Cuban Missile Crisis in October 1962. Despite the unilateral ability to launch nuclear weapons at any point from 1961 to 1963, he did not.

Vasily Arkhipov, a vice admiral of the Soviet Navy, was on board a Soviet submarine on October 27, 1962, when U.S. Navy forces tried to force the submarine to surface using signalling depth charges. Communications to Moscow were cut off, and the depth charges led the captain and political officer to believe a war might have already begun. They prepared to launch a nuclear torpedo. On most Soviet submarines with nuclear torpedoes, only the captain and political officer had to authorize a launch. However, on this particular submarine, due to the presence of Arkhipov, the chief of staff of the brigade, his authorization was also required. The captain and the political officer wanted to launch, but Arkhipov opposed it. He could’ve issued the authorization others wanted from him, and a launch of a nuclear weapon against the U.S. Navy during the Cuban Missile Crisis would’ve likely started a nuclear war. He didn’t.

Lyndon B. Johnson was a U.S. President with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point from 1963 to 1969, he did not.

Richard Nixon was a U.S. President with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point from 1969 to 1974, he did not.

Gerald Ford was a U.S. President with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point from 1974 to 1977, he did not.

(Golda Meir was the Israeli Prime Minister during the Yom Kippur War in October 1973. It is reported that when the Israeli officials panicked that the Arab invasion would overrun Israel, the Defence Minister Moshe Dayan requested and received authorization from her to arm 13 missiles and 8 fighter jets with nuclear warheads, with the missiles aimed at the military headquarters in Cairo and Damascus. It is unclear whether anyone in Israel had at any point a unilateral authority to order the launch during these events. Supposedly, normally, Israeli nuclear weapons are under civilian control until assembled; it is unclear what happens afterwards.)

Jimmy Carter was a U.S. President with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point from 1977 to 1981, he did not.

Pieter Willem Botha was the Prime Minister and State President of South Africa when it developed nuclear weapons, with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons, he did not. (It must be noted, however, that he made South Africa have these weapons in the first place.)

Ronald Reagan was a U.S. President with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point from 1981 to 1989, he did not.

Frederik Willem de Klerk was the State President of South Africa, with the authority to launch nuclear weapons. He inherited six operational weapons and decided on nuclear disarmament. Despite the unilateral ability to launch nuclear weapons, he did not. South Africa is the only country that developed nuclear weapons and then gave them up.

Stanislav Petrov, a lieutenant colonel of the Soviet Air Defence Forces, was on duty on September 26, 1983, when an early warning system falsely reported multiple U.S. missile launches. He violated the protocol, deciding to not report the warning up the chain, which could’ve led to a full Soviet nuclear retaliation. He decided that a real first strike would involve more missiles than he had seen and judged it to be a false alarm.

George H.W. Bush was a U.S. President with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point from 1989 to 1993, he did not.

(Boris Yeltsin was the President of Russia in 1995, when he was notified that a radar detected what looked like a U.S. missile launched from the sea, possibly the start of an American nuclear strike. The nuclear briefcase was activated for him, but he did not push the button he had. The rocket turned out to be a Norwegian scientific rocket; Norway even notified Russia, but the notification didn’t reach the radar crew. Soviet and Russian secretaries general and presidents generally do not have the unilateral authority to launch, so others are not on this list. It is not entirely clear whether, in this case, Boris Yeltsin could’ve launched nukes if he wanted to.)

Kim Jong Il was the General Secretary of North Korea when it developed nuclear weapons, with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons from 2006 to 2011, he did not. (It must be noted, however, that he made North Korea have these weapons in the first place.)

Bill Clinton (1993-2001), George W. Bush (2001-2009), Barack Obama (2009-2017), and Joe Biden (2021-2025) were U.S. Presidents with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons, they did not.

(While technically the Prime Ministers of the UK have had the authority to launch nuclear weapons, if officials of the Ministry of Defence who receive a launch order judge it to be wrongful, they can lawfully appeal to the monarch to overturn it, which is somewhat likely to happen for a first-strike decision; for that reason, I’m not counting them as having had the unilateral ability to launch nuclear weapons.)

Charles de Gaulle (1960-1969), Georges Pompidou (1969-1974), Valéry Giscard d’Estaing (1974-1981), François Mitterrand (1981-1995), Jacques Chirac (1995-2007), Nicolas Sarkozy (2007-2012), and François Hollande (2012-2017) were the Presidents of France with the authority to launch nuclear weapons. Despite the unilateral ability to launch nuclear weapons at any point while in office, they did not.

(There were many close calls, but I wasn’t able to find individual people who could’ve decided to end the world during those: on October 5, 1960, a radar in Greenland mistook the rising Moon for a massive Soviet missile attack; the alert showed certainty of incoming nukes, but a bomber patrol confirmed it was a false echo. In 1979, a NORAD computer simulation tape was accidentally run as if it were real data, causing command centers to believe 250 to 2200 missiles were on their way, and the National Security Advisor was alerted that the U.S. President had minutes to decide on retaliation; satellite and radar checks confirmed it was a false alarm. See many others.)

(William Bassett, a captain of the U.S. Air Force commanding a launch crew at a missile site in Japan, is claimed to have received what appeared to be valid orders to launch 32 nuclear missiles on October 28, 1962, but the account of the events is disputed. He is claimed to have judged the situation to be suspicious and stalled the launch until the Missile Control Center confirmed it was an error.)

8 billion people are alive today. Let’s not let this number suddenly drop.
