LessWrong.com News
Condensation & Relevance
(This post elaborates on a few ideas from my review of Sam Eisenstat's Condensation: a theory of concepts. It should be somewhat readable on its own, but it doesn't fully explain what condensation is; for that, see my review or Sam's paper. The post came out of conversations with Sam.)
As I mentioned in my Condensation review, the difference between compression and condensation fits the physical analogy suggested by their names: compression mashes all the information together, while condensation (still compresses size, but) sorts information into discrete droplets.
Thus, condensation has a property we might call local relevance: typical questions can be answered at a glance, ie, retrieving small subsets of the information. This type of representation is sometimes called "symbolic":
| Symbolic | Not Symbolic |
| --- | --- |
| A number can be quickly determined positive or negative by checking whether there is a "-" symbol in front. | "Reading the room" at a social gathering requires integrating diverse cues. |
| The topic of a paper can be determined by reading the title and abstract. | The semantic content in a vector representation inside an artificial neural network is often represented redundantly, spread across the whole vector. |
| A person's age can be determined by looking at their birthdate on a government-issued ID. | The quality of a work of art is spread throughout the whole piece. |
| The subject of a sentence can be found before the verb. | Determining the subject of a photograph requires understanding the whole image. |
| A target library book can be quickly retrieved from the shelves. | Finding gold nuggets requires sifting huge amounts of sand. |

This notion of "symbolic" seems related to interpretability (and the theory of condensation seeks to clarify this relationship).
The notion of "relevance" in condensation is what Sam calls the contribution relation. This is like a card catalogue which tells you what books to retrieve for a specific situation.
Like Natural Latents, condensation seeks to establish that two agents will have corresponding world-models by assuming the correspondence of just a few variables, the "given variables"; they're trying to argue something like "If agents can agree[1] on some objective observables, EG the readings of scientific instruments, then (under some further assumptions) they'll also share a bunch of abstract concepts".
This initial set of questions is what the contribution relation measures relevance to. In my review, I likened condensation to a "universal data-structure" optimized to serve a set of queries (the given variables).
Variable-Cost Symbols
Imagine you are compressing a record of the weather of a sequence of days, in 3 categories: sunny, cloudy, or rainy. 0s and 1s are about equally costly to represent in computers, so that in a compressed representation, both communicate a 50% probability event; 11, 10, 01, and 00 all communicate 25% probability events; and so on. If both rainy and cloudy days are 25% frequent, then it is possible to compress optimally by using 0 to represent sun, 10 to represent clouds, and 11 to represent rain. This representation is nice and "local"; it gives a "symbol" to each possible type of day.
In contrast, if each of the three weather types are equally frequent, there's no nice local representation we can use. Since 1/3rd doesn't relate nicely to powers of 2, optimal compression necessarily smears the information from individual days around, mixing several days together within a single 1 or 0. In modern interpretability jargon, compressed representations tend to be polysemantic.
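A quick numerical sketch of the arithmetic above (Python, illustrative only): the {0, 10, 11} code matches the entropy exactly in the 50/25/25 case, while no per-day assignment of whole bits can reach the entropy of the uniform case.

```python
import math

def expected_length(probs, code_lengths):
    # Average number of bits written per day under a fixed per-day code.
    return sum(p * l for p, l in zip(probs, code_lengths))

def entropy(probs):
    # The best achievable average, in bits per day.
    return -sum(p * math.log2(p) for p in probs)

# Sun 50%, clouds 25%, rain 25%: the code {sun: 0, clouds: 10, rain: 11}
# has lengths matching -log2(p) exactly, so per-day symbols are optimal.
skewed = [0.5, 0.25, 0.25]
print(expected_length(skewed, [1, 2, 2]), entropy(skewed))    # 1.5 vs 1.5

# Uniform 1/3s: -log2(1/3) is not a whole number of bits, so any per-day
# code overshoots the entropy; optimal compression must mix days together.
uniform = [1/3, 1/3, 1/3]
print(expected_length(uniform, [1, 2, 2]), entropy(uniform))  # ~1.67 vs ~1.58
```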
Intuitively, we have to ditch locality because we're trying to fit the "round peg" of 1/3rd into the "square hole" of 1/2m.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I 
{font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), 
local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} . We're stuck in the "grid" of numbers which bits can easily represent.
With pen and paper, writing the number 1 is especially easy; it is acceptable to write simply a vertical line, making it one of the easiest symbols. This makes sense from a compression point of view: according to Benford's Law, 1 will be the most common digit to write.
Normally, in compression, the "length" of characters is always 1; the length of a string is just the number of characters. However, in real life, the cost of a symbol can vary. There are lots of shapes we can make with a pen and paper, some larger or more complex than others! So, when designing a pen-and-paper code, we can (and should) take that into account.[2]
Imagine optimizing a variable-cost alphabet for use in a compression task. To avoid "cheating" by making all symbol-lengths very small, we have to somehow account for the fact that there can only be so many simple symbols. (There are only so many one-line symbols humans are willing to distinguish, for example.) One way to do this is by assigning each symbol a positive probability, and requiring that the probabilities of the whole alphabet sum to 1. The "length" of a symbol (in bits) can then be measured as the negative log (base 2) of the probability. You can make one symbol approach a length of zero, but this forces all other symbols to be longer.
This is similar to the earlier-mentioned idea that a 1 or 0 in a well-compressed binary file always represents an event with 50% probability; the variable-length alphabet won't necessarily be used to compress things optimally all the time, but when it is, the length of a symbol is always −log of the probability of the event being represented.
Allowing codes to choose arbitrary variable-length symbols lets us create "local" representations for arbitrary probabilities in the weather example, by giving each state a symbol of length appropriate to its probability. If the three weather types have equal probability, we simply choose an alphabet with three characters of length −log₂(1/3) each.
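A minimal sketch of this accounting (illustrative only): define each symbol's cost as −log₂ of an assigned probability, with the probabilities required to sum to 1; making one symbol cheap necessarily makes the others expensive.

```python
import math

def symbol_costs(probs):
    # Costs in "bits": -log2 of each symbol's assigned probability.
    # Requiring the probabilities to sum to 1 is what prevents "cheating"
    # by making every symbol short at once.
    assert abs(sum(probs.values()) - 1.0) < 1e-9
    return {s: -math.log2(p) for s, p in probs.items()}

# Three equiprobable weather states: each symbol costs log2(3) ~ 1.585 bits.
print(symbol_costs({"sun": 1/3, "clouds": 1/3, "rain": 1/3}))

# Making "sun" very cheap forces "clouds" and "rain" to become expensive.
print(symbol_costs({"sun": 0.9, "clouds": 0.05, "rain": 0.05}))
```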
Of course, using variable-cost symbols doesn't force a code to be "symbolic". If you only optimize for compression, you can equally well end up with the same sort of mess that equal-cost symbols are liable to force you into. Condensation gives an optimization target with a positive tendency to produce representations with local relevance. (We can investigate better theories of condensation by looking for optimization targets which represent local relevance better; especially, I think, if those optimization targets can be grounded in a better story of practical relevance.)
Condensation suggests a picture of memory-management: rather than compressing everything together, as in the Solomonoff picture of rationality, we're incentivized to sort things out into concepts (random variables) so that we can think about a few things at once. Information is split into bite-sized chunks so that we can retrieve only the relevant ones.[3]
Still, I think variable-cost symbols can help us understand condensation better: specifically, they address a problem in the algorithmic version of condensation.
For example, consider the case of iterated coinflips sharing a common bias. Taking coinflips as the given variables, probabilistic condensation identifies a single latent variable: the coin bias. This variable reduces the entropy of each coinflip as much as it can while only taking on information common to all of them (not, eg, encoding a cheat table identifying exactly which coins land heads).
Algorithmic condensation doesn't work so well in this case. Since either outcome is possible, an individual coinflip can't be compressed to any less than a single bit; even if the probability of heads is 0.99999, you've got to write a one or a zero to record that information. Thus, algorithmic condensation sees no benefit in positing a latent.
The example can be rescued: for algorithmic condensation, we just have to choose given variables representing several coinflips concatenated together. Compression becomes possible again, so positing a latent representing coin bias is vindicated. However, this seems like an unfortunate blemish in the theory: compression-like incentives to lump stuff together creeping in.
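A small illustrative calculation of the gap involved: the entropy of a heavily biased coin is far below one bit per flip, yet recording a single flip still costs a whole bit, so the savings only appear once many flips are concatenated into one given variable.

```python
import math

def coin_entropy(p_heads):
    # Shannon entropy of a single biased coinflip, in bits.
    q = 1 - p_heads
    return -(p_heads * math.log2(p_heads) + q * math.log2(q))

p = 0.99999
h = coin_entropy(p)
print(f"entropy per flip: {h:.6f} bits")                       # ~0.000184 bits
print("bits to record one flip on its own: 1")
print(f"ideal total for 10,000 flips: {10_000 * h:.1f} bits")  # ~1.8 bits
```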
So, perhaps it is better to rescue algorithmic condensation by adopting variable-cost symbols, so that even single-symbol messages can have different "length". This allows us to replace variables with concrete written messages (like in algorithmic condensation) while avoiding any coin-lumping.
However, I'm not sure about the best way to work out this version of condensation fully.
[1] This is more like "agree on the existence of" as opposed to "agree on all questions about". Hence they need not be "directly observable", though this would obviously help agents agree in both senses.
[2] Even more accurate cost models might account for the difficulty of a symbol in context, like models of typing efficiency which account for the travel length of a finger moving from one key to the next, or phonetic models which account for the difficulty of combinations of spoken phonemes such as consonant clusters.
[3] This doesn't yet clarify grammar-like phenomena, I think. Words are easily parsed. Concepts are combinatorial. I think this has to do with the concept of transparency.
Dating Roundup #11: Going Too Meta
If there’s several things this blog endorses, one of them would be going meta.
It’s time. The big picture awaits.
You’re Single Because You Live In The Wrong Place
The most important meta question is location, location, location.
This is the periodic reminder that dating dynamics are very different in different locations, and gender ratios are far more uneven than they appear because a lot of people pair off and aren’t in the pool.
If you are a man seeking to date women, New York City is the place to be.
Churrasco Suadade: when I’m out I notice that tables at restaurants and bars in manhattan are probably around 80-95% women, it’s a new dynamic that no one is talking about.
Fixed Income Guy: Are you at all the poor people places? All the finance guy hang outs are 80% dudes.
I mention Fixed Income Guy to mock him, as in why are you spending a lot more money to hang out with 80% dudes and largely finance dudes at that? I mean, sure, if that’s what you want.
Darrell Owens: Oh this is new? Coming from the Bay Area, the amount of women I see in Manhattan is insane. You rarely see more than few young women partying back in San Francisco. The gender ratio here feels 70:30 young women to men, its every block in Manhattan!
Noah Smith: In an ideal world, where you live wouldn’t really matter in terms of dating opportunities, but the truth is that one of the easiest ways to get chicks is to just move to New York City.
Having lived in both Tokyo and NYC, I can pretty confidently tell you that while Tokyo is not a tough dating market by any means, NYC is absolutely on another level.
You’re Single Because You’re Not Okay Making Less Money Than She Does, You Fool
This viral clip (which is viral for a reason, it’s good fun, wait for it) is another endorsement of New York City being a great place to meet women, as you have a wide variety of great and largely successful women to explore. What doesn’t get mentioned in that clip as a key reason things are so great is that the gender ratio in NYC is highly favorable for men.
The interviewer asks about dating women who make more money than the man, clearly trying to get the guy to say this is a problem, but he isn’t buying it, instead pointing out that successful women are more thoughtful and plan for the future, and it in no way bothers him at all. Right on, but this sidesteps the other half of the problem. The man has to be okay with the fact that he earns less money (and often has less formal education or other status markers), which often men aren’t, and also the woman has to be okay with it too.
That’s the rub. As a man, you might (and should) be actively all for it (this doesn’t make you less successful, it makes you more successful), but if she’s going to be bothered by it anyway, that’s also your problem. So the key is to figure out quickly if she will actually be fine with it or not.
You’re Single Because You Work Out The Wrong Amount
Being in shape is great. Having muscle can be a game changer. By far the worst plausible amount of exercise is none at all.
Lauren Self: Men severely underestimate the power of gaining 20lbs of muscle
Lauren Self (QTing from before): LISTEN UP BOYS.
But don’t go nuts. For most people that is not a problem, but yes it is very possible to go too far. As a man, as I understand preferences in general, you don’t want to go near actual zero fat and you don’t want to look actively skinny.
Taoki: why are women lying about this? like what’s the actual cause?
Lauren Self: 100% of women would choose something in between these two options
Shako: The aesthetics of a man who poses gives them the ick. But if both were shirtless at a beach they’d obviously prefer the fit guy.
Special K: No he does look better in the before. Women are correct on this one I fear. Guys obsess over these supremely tight toned muscles and they shouldn’t.
Liron Shapira: Guy on left looks like he’s a chill dude with a social life, guy on right looks like he’s obsessed with his body. Same body could look better with better social context, although just the extremeness of his rippedness is a little alarming about his life priorities.
Joel: “let’s get a burger?” v “are you really gonna eat that?”
Mason: The male equivalent of the hourglass shape is just “wall”
Teej dv: his smile is nicer in the first one
Taoki: It is actually. We like you guys wide.
LS Vision: Nah this is cap. The women who selected before is def just the insecurity of his value going up afterwards and making them feel insecure he’d cheat or leave. Any man who has went through a gym transformation, you can LITERALLY feel women treat you significantly different after.
Mason: Women generally like tall guys who have some (not crazy) muscle definition, and a little extra fat that bulks that out can actually augment that
We all have our own tastes, but this is a pretty typical type.
I don’t know what there is to be mad about here.
For practical purposes, before beats after here. The before guy is already in ordinary, practical good shape. The after guy took things too far, and seems to know it except that he thinks it is good, which makes it worse.
Except one key special case?
Benjamin Ryan: People are going back and forth about whether women think the guy in the right is hot. But people have no idea how extreme the standards are for gay men. In gay culture, the man on the left is considered hopelessly fat. Many gay men have no reservations about informing such a man about his supposed corpulence being anathema.
I wrote about the rare study to examine the toxic qualities of gay culture for The Guardian.
You’re Single Because You Don’t Know You Are Hot
I mean, of course there are hot guys who don’t know they’re hot, even more so than there are hot women who don’t know they’re hot.
Pandora: One surprising takeaway from Slutcon was that apparently there are hot guys who just don’t know they are hot? Guess it’s time to go objectify some more men.
Eneasz Brodski: If you grow up ugly you never really internalize that you are attractive after a glow-up. I still don’t believe it inside, and I hear I’m attractive to a fair percentage of women. Also makes me far more attracted to women w the same experience, but that may be a male universal.
Pandora: This problem seems even more pervasive than I thought.
Sparr: Hot in general, to the average viewer, or hot to you? You seem like someone who can probably tell the difference.
Pandora: I saw examples of guys being clueless about all three at once.
21 Kindness: The whole “men subsist on one compliment a decade thing” is kinda true lol.
Misha: it turns out being hot is not, in and of itself, very useful for men.
Sokoban Hero: No it’s useful.
Misha: I said not VERY useful.
Dissproportionately: I’ve seen men unhot themselves to women within minutes. I don’t think women can unhot themselves to men.
Being hot is in many ways a lot less valuable if you don’t know you are hot, because you don’t get the confidence and you don’t take advantage of opportunities or feel you’re good enough, but contra Misha I believe it is still very useful. There are even some advantages to not knowing, in that some of the behaviors that happen when someone knows they are hot are often effectively arrogant or entitled or demanding or selfish, none of which helps.
You’re Single Because Age Gap Discourse Is Crazy
This link is almost certainly bait, but things in some spaces have gotten so insane that you can’t be sure people aren’t talking about 28-31 as a problematic age gap. What?
I mean, at minimum it’s good bait, it worked.
I’ve also seen some other examples that look a lot less like bait but still involve obviously totally fine gaps in both directions. As in, I’ve heard talk in places where it definitely wasn’t bait of 24 and 27 being radically different numbers, and I don’t understand why.
Dating Apps Are Bad For Mental Health
Well, maybe. Via Rolf Degen there is a meta-study.
The obvious question is whether this is a causal relationship, or whether it is primarily selection effects. You are on the dating apps for a reason.
Rolf Degen (quoting the study):
Meta-analysis: The use of dating apps is associated with poorer mental health.
Dating apps hold the promising reward of love but have been accused of using perverse incentive structures to profit from those who try to find it. We conducted the first systematic review and quantitative meta-analysis of studies examining average differences in the outcomes of dating app users and non-users.
Our results showed that dating app users had worse psychological health and well-being than dating app non-users across a variety of outcomes including depression, anxiety, affective dysregulation, loneliness, and psychological distress, although cross-sectional design limitations prevent causal interpretation. By aggregating findings from extant studies, we showed that in the nearly 17 years since dating apps have been on the market, users of these platforms have reported poorer psychological health and well-being than non-users.
There are several explanations for why dating app users may be struggling. The first is that dating apps are subject to selection effects, making the people who choose to use these platforms different from those who do not. People who are vulnerable to psychological health and well-being difficulties may prefer dating apps because they can avoid uncomfortable interactions, leading to negative patterns of reinforcement.
A second explanation involves exposure effects; that is, features such as gamification that may provide positive reinforcements that encourage problematic dating app use and keep people swiping.
The differences identified here could explain some of the challenges that users are likely to experience and be part of the reason they eventually burn out and quit dating apps altogether.
My guess is that dating apps are in important ways bad for mental health versus having better ways to find dates, and that sufficiently bad outcomes in terms of ability to find dates or find worthwhile dates is indeed worse for short term reported mental health than not trying. Whereas those who are successful get off the apps or never needed them in the first place.
What is the alternative? If the other choice is ‘do not try’ then for the median user the dating app is probably trading short term pain for chance of long term gain. If the other choice is ‘have uncomfortable real life interactions and make things happen’ and the app is blocking that instead of supplementing or leading into that, then the alternative is plausibly strictly better.
Certainly we could make app variations that are better for mental health controlling for outcomes, and also that give people better outcomes. Solving for the equilibrium, to get people to actually use those apps, is the difficult part, since people will value convenience and ease of use and low cost and avoiding trivial inconveniences dramatically more than they should, and if enough especially women effectively insist on the swiping experience it’s hard to escape from that.
You’re Single Because You Listened To E-Girl Advice
I think this is importantly wrong for both e-girls and also VCs?
Anton: egirl dating takes are worthless for the same reason vc takes on how you should run your company are worthless; if you could do it you would just do it not talk about it
men in particular are truly better off without this kind of “help”
making up egirls in my head to get mad at
If she could be an E-Girl or she could date, what makes you think she would choose to date? What makes you think she isn’t also dating?
Similarly, if you could be a VC or a startup founder, it’s not that suspicious that you would choose VC. At this point in my life I would definitely prefer VC over founder. I don’t want to go through founder mode again. I am totally prepared to eat my words if I end up doing it anyway, and if I’m in then I’m in, but I don’t want to be in.
You’re Single Because You Didn’t Hire Blaine Anderson
Division of labor, like dudes and also women, rocks. Matchmakers should be much more of a thing than they are. There is a profound market failure, a failure of the services to be good versions of themselves, or both.
I cannot in any way vouch for the effectiveness of Blaine Anderson’s matchmaking service. I can however vouch for her Twitter feed having consistently insightful and fun things to say. Her price range is ‘usually less than $50k’ and in exchange she goes out and sources to fit your particular criteria (which she will sometimes push back on).
You can also sign up (for free) to be a woman she reached out to for matches, on first principles being on these lists seems to be a good time investment?
There’s a lot of self-promotion, no question, but there are hard-to-fake signals that she is the real version of the thing in various ways, facing reality as it is, looking at the data and actually trying to get good results.
Also this one makes a good case:
Blaine Anderson: Underrated advantage of hiring a matchmaker, if you’re a single man:
• You sound cringe AF when you brag about yourself to women
• You sound amazing when I brag about you to women
One thing that blows my mind is she tells stories where the guy will say ‘get me a date with this specific micro-famous woman’ and she (at least sometimes) goes out and makes that happen. The guys asking this look damn good on paper, which no doubt is a lot of why this can sometimes work, but still, hot damn.
You’re Single Because You Think Zizek Mocked Date Me Docs
EigenGender: despite being very happily in a long term relationship im always very excited to read a dating doc. they’re some of the most vulnerable and genuine writing you can find and a window into another persons life. if you make fun of them you’re burning the commons and you should stop.
Stephen Fay: I like to read the date me docs, but I also am entertained by what Zizek has to say about them
Zizek (well okay actually Paula Rambles): Ah! You see, this miserable little document, this so-called date-me doc, is our era’s most honest pornography. It pretends to be romance, but what is it really? It is no longer the trembling hand on paper, the confession of desire. It is a spreadsheet of desire. “I am ready. I am six foot four. I have done the work.” What work? Love is precisely the place where work collapses into failure. You study and then you fail the exam.
And look at this language. “Highly agentic, emotionally warm.” Beautiful nonsense. Freedom, yes, but domesticated. Agency, yes, but pointing politely towards him. For Hegel, love is the risky collision of two freedoms. Here, there is no risk. She must arrive pre-formatted.
Then the farce reaches ecstasy. “If she does not appear, I will pursue single fatherhood.” Magnificent. Chance is canceled. Eros becomes procedure. The miracle of two gazes across a smoky room is replaced by paperwork and a receipt. The objet petit a is now a literal baby routed around the Other. And of course, the “monogamish” clause. Pure ideology. Fidelity with a footnote. Like Coke Zero: love without sugar, passion without calories. He wants the experience of devotion, but sterilized of danger.
The document offers no asylum from loneliness. It is loneliness, meticulously formatted, hyperlinked, and begging for comments. He does not whisper “I love you.” He says “I am prepared to love you, conditionally, pending review.”
That’s a funny post, and does an excellent job of mocking those who would make fun of date me docs and other actually intentional stances. Such magnificent flailing.
And thus, you have failed to look at the Date Me doc of Olga Yakimenko.
You’re Still Single Because You Don’t Appreciate Relationships
Here, in addition to the intended lede, we have at least 40% of respondents having been in a relationship for fully 8 years.
Aella: wow a whole 40% of people in long-term relationships are satisfied with their sex lives!
Critter: i imagine the numbers are worse for people not in long-term relationships
If anything these results seem potentially ‘too good,’ implying that couples are breaking up over this more than they probably should over the longer term.
One must also note that this is an Aella survey, so some of these relationships will be poly or open, but even accounting for that this says a lot. Selection effects are a lot of this, but that’s part of the point.
Perhaps you especially don’t appreciate marriage.
Raffi Grinberg writes that marriage is sexy, both figuratively that married couples are happier and make more money and have more kids and die less often and all that, and also that they have more sex (even if you only count with each other). And that the lifetime divorce rate is actually only 30% not 50%, average age of marriage is 29 and average first child is 28, despite the implicit cultural message that those numbers are in the 30s.
And yet he says Hollywood is sending us the opposite message. To which I’d say, sometimes, but I wouldn’t oversell this. Yes, in the How I Met Your Mother episode he talks about Barney keeps making fun of Marshall for being married, but the show clearly thinks that Marshall marrying Lily is sexy and awesome and great for both of them throughout and that Barney is ultimately wrong, and also the whole show is Ted trying to meet his wife and mother of his children.
You’re Not Single And Haven’t Been For a While
Here’s another backdoor ‘are you in a relationship’ poll, 78% of monogamous heterosexual men reported having a partner for longer than a year.
Alice Playing: monogamous hetero men with 1+ year-long partners: if you could have an affair with a woman of your liking, with absolute, 100% certainty that your partner would never find out, would you do it?
On the question itself, it’s not actually possible, since you’ll know and you can’t be sure you won’t tell them, and you’ll almost certainly act differently even if they never suspect or figure it out. One could even say ‘the only way to have 100% certainty they’ll never find out is if they’re dead, so absolutely not.’
Literal ‘any woman you wanted’ with zero risk of discovery is a stupidly tempting offer. If you treat this in the spirit it was presumably intended, instead, and everyone was being fully honest including with themselves and fully understood what was on offer (as in literally whoever you’d most want), presumably the ratio would be a lot higher.
Unless, of course, the way you know your partner will never find out is that your partner (or you and the woman you’d have the affair with) would be dead, in which case yeah bad deal, but that’s presumably not what this meant.
How do we know this? Well, one big data point is this next poll.
You Are Still Single As Evidenced By Would
Um, guys, are almost none of you in a monogamous relationship? And even if you are single there’s also the issue of risking the friendship. What are you all thinking?
Alice Is Playing: men attracted to women: how many of your female friends would you have a one-night stand with, if they offered?
Only 14% of men attracted to women answering this didn’t have at least one female friend they would have a one night stand with? Presumably many of the others don’t have the right female friend. Which means substantially more than 86% of them are not, for the most important practical purpose, in a monogamous relationship?
Remember that other poll from Aella above, that showed at least 40% of people were in 8+ year relationships? And the one from Alice that 78% of hetero men were in a 1+ year nominally monogamous relationship? Rut roh.
Then on top of that, a majority are willing to do this with a majority of their female friends, not only that one they have that crush on.
It doesn’t mean these people don’t think they’re in relationships. As we’ve seen, they very much do think this. They might even be right. But don’t tempt them.
You’re Single Because You Lack Motivation
Paper reminds us there is a 34-point gap (+34 versus +0) in net happiness for married versus unmarried people, with cohabitation only worth 10 points, and analyzes how this premium varies (slightly) by demographics.
As the paper readily admits this tells us essentially nothing about what makes someone happy, because the whole thing is unfixably confounded to hell. Happier, healthier and more successful people have an easier time getting married, and being unhappy leads to divorce. Both effects are epic in size.
We do know the overall situation over a 50+ year time horizon is not good news, because while marrieds are slightly happier, the unmarrieds are somewhat less happy and more importantly are a larger percent of the population.
Beyond that, I don’t know what to do with all these graphs or how to cash it out in useful advice. One might say ‘be the type of person who gets married,’ perhaps.
You’re Single Because Of Robin Hanson
As usual, never stop Robin Hansoning.
Robin Hanson: You know how in romance stories the main characters hope to find a special relation, better than that which the ordinary people around them settle for? Your relations will probably be more like those of the ordinary folks, less like those of special main characters.
This has to be true, because math.
It’s less true than it appears, because the relations of ‘main characters’ feel special to them the same as everyone else’s feel special. You could totally make a romantic comedy based on what I experienced, and you could also totally have me as a background character in someone else’s romantic comedy, although probably I’d be in a different genre entirely.
To you, it will feel more like that of the special main characters, except that you don’t need to have a false crisis in the third act.
You’re Single Because You Did This Instead Of Going To Therapy
Don’t be whoever Casy Means is being here. Or do, it’s not like it did that much harm, as long as you don’t expect any of it to do anything.
The Lighter Side
We wish everyone involved the best.
Aella: it’s really unfortunate that having an insane ex turns you personally into a greater liability for others
Grimes: hahaha [trauma laughter].
Aella: :( i wasnt thinking about u when i wrote the tweet but also :(.
A new app lets you pay to crash someone’s wedding and be a legit guest, cost is about $100-$150 per guest. This seems low, given the cost to have a wedding worth crashing, and given you get a full meal, plus buffet and open bar, a unique experience and a reasonable amount of opportunity.
What Jacob learned about sex at the rationalist bloggers’ conference, essentially that with zero integrity you get fuckbois and pickup artists, and when you do the opposite and get sufficiently high integrity and optimize for trust and honesty way above normal levels you get something magical and suddenly many good things are possible.
Here’s another fun bit:
Jacob: My friend “Standard Deviant” gave a talk titled “How I’ve had more sex.” He described the “escalator”: starting a conversation, exchanging compliments, light touch on the arm, etc. The important thing isn’t to rush up the escalator, my friend said, but to move together in synchrony whether you’re taking a step up or a step down.
When women show interest in casual sex, he often asks: do you do this sort of thing often? If they don’t, he often forgoes the opportunity out of an excess of caution.
Afterwards, more women wanted to have sex with him. I joked that women want to have sex not with the tall guy, hot guy, or the famous guy, but with the Schelling point guy.
Someone pointed out that tall, hot, and famous are the usual Schelling points.
The Long View Of History
History as a subject is often viewed by students and the public at large as a domain without a use, a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past but neither a forward-looking nor a useful endeavor. The study of history produces teachers of history and nothing more. It does not produce new widgets or novel computer advances, nor does it deepen our understanding of materials science or physics.
The humanities, of which history and studies of language and culture are a part, are not there to improve our understanding of nature or develop technology; they exist to improve the minds (both cultural and individual) of the people we are.
History doesn't improve our world, it improves us. It gives us context for the world we live in and it helps us understand the reason why things are as they are and learn from the people before us.
History as Context
Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.
Photo Credit: Library of Congress
Living in such a world would be disorienting, confusing, nonsensical. Yet this is the world without history. The world without history just is. It isn't a work in progress, but a finished piece—one that lives and dies with you—and has no meaning beyond the present moment.
History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history and within that context, they feel less and less apart. Indeed our recent past of the Post-War Order is the oddity in history, and a real thing to be cherished and seen as something fleeting, fragile, and truly precious.
Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that's come before and one that's deeply connected and familiar to the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the paths of dark and of light that are set before us. It shows us who we are.
History as Memory
Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our memory of World War 2 is all but gone now. We read about it, sure, but our collective living memory of it has diminished and with that lapsing has gone all the memory of precisely why the world is ordered the way it is. This is not a value judgement, it is a statement of fact.
Photo Credit: DieBuche, CC BY-SA 3.0
In a couple recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.
This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.
The context of history is terrible and it is beautiful. It is the greatest story ever told, with myriad heroes and villains, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.
Eliciting base models with simple unsupervised techniques
Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger
(*Equal contributions, reverse alphabetical)
Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.
We find that:
- Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
- The most useful aspects of ICM are
- bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
- enforcing logical consistency of predictions.
- A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
- These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
- This makes sense, as larger fine-tuning runs likely teach the model something new rather than just eliciting existing capabilities.
There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.
Results summary
5 components that could cause ICM to have high performance are:
- Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
- Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
- Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
- Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels making correct sets of predictions more likely.
- Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
- Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM does, because the new labels are chosen by conditioning on existing labels (which often increases mutual predictability).
We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:
Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.
Legend:
- Baseline methods:
- Zero-Shot: Zero-shot predictions from the untrained model
- Random Few-Shot (1): Few-shot prompting with random labels
- Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
- Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
- Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False to any that contradict it—see Appendix B for details
- Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
- Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
- Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
- Fine-tune => Use an unsupervised elicitation method to label a subset of training set labels (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot.
GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.
Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.
All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.
* Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.
† For TruthfulQA, some of the ICM and consistency performance might be due to leakage.
Datasets
- GSM8K: candidate solutions to grade school math problems;
- Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
- TruthfulQA: candidate responses to questions associated with human misconceptions;
- Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)
Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.
Random few-shot
Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we do the following (a code sketch follows the list):
- Randomly sample examples from the train set.
- Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
- Insert the (example, label) pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels.
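A minimal sketch of these steps (the prompt template and helper names here are illustrative assumptions, not Wen et al.'s exact format; in practice the label is read off the next-token logits for " True" vs " False"):

```python
import random

def format_example(text, label=None):
    # Toy prompt template; the real experiments likely used a richer, task-specific format.
    suffix = f" {label}" if label is not None else ""
    return f"Claim: {text}\nLabel:{suffix}"

def random_few_shot_prompt(test_text, train_texts, n_shots=2, seed=0):
    """Build a few-shot prompt whose example labels are random but balanced."""
    rng = random.Random(seed)
    shots = rng.sample(train_texts, n_shots)
    labels = (["True", "False"] * n_shots)[:n_shots]  # half True, half False
    rng.shuffle(labels)
    blocks = [format_example(t, lab) for t, lab in zip(shots, labels)]
    blocks.append(format_example(test_text))  # the model completes " True" or " False"
    return "\n\n".join(blocks)
```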
For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8K and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.
Enforcing consistency of predictions
In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.
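A sketch of this rule for GSM8K (our own code, under the assumption that each candidate solution carries its proposed numeric answer and a scalar truth score such as logit(True) − logit(False)):

```python
def enforce_gsm8k_consistency(answers, scores):
    """Label candidate solutions to one question so the labels are logically consistent.

    `answers[i]` is the numeric answer proposed by solution i;
    `scores[i]` is the model's truth score for solution i.
    """
    best = max(range(len(answers)), key=lambda i: scores[i])
    if scores[best] <= 0:
        return [False] * len(answers)  # no solution is confidently true
    # True for the top-scoring solution and every solution proposing the same answer.
    return [a == answers[best] for a in answers]
```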
Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.
When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.
* GSM8K experiments in this plot were run on the train set, which is why the values differ from those in plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.
Bootstrapping few-shot predictions
For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).
We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the following algorithm (sketched in code after the list):
- Get zero-shot predictions on a random subset of the train set.
- Iterate over number of shots n (e.g. n=8, 32):
- Randomly select another subset of the train set.
- Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
- Use these n-shot prompts to predict labels for the new subset.
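A minimal sketch of this loop (`predict` is a hypothetical wrapper around the base model that returns a (label, confidence) pair for each example, given a list of few-shot (text, label) pairs; an empty list means zero-shot):

```python
import random

def bootstrap_labels(train_pool, predict, shot_schedule=(8, 32), subset_size=512, seed=0):
    """Use each round's predictions as few-shot examples for the next round."""
    rng = random.Random(seed)
    subset = rng.sample(train_pool, subset_size)
    preds = predict(subset, few_shot=[])  # step 1: zero-shot predictions

    for n in shot_schedule:               # step 2: iterate over numbers of shots
        new_subset = rng.sample(train_pool, subset_size)
        trues = [(x, "True") for x, (lab, _) in zip(subset, preds) if lab == "True"]
        falses = [(x, "False") for x, (lab, _) in zip(subset, preds) if lab == "False"]
        # Sample a balanced few-shot set: half True and half False predictions.
        few_shot = (rng.sample(trues, min(n // 2, len(trues)))
                    + rng.sample(falses, min(n // 2, len(falses))))
        preds = predict(new_subset, few_shot=few_shot)
        subset = new_subset
    return subset, preds
```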
Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).
For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.
For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration did not make much difference.
Bootstrapping on confident predictions
We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.
We defined confidence as logit(True) − logit(False) for True predictions and logit(False) − logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions are slightly more accurate than average, and for TruthfulQA they are significantly more accurate.
To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
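A sketch of this modification (again with helper names of our own): on each iteration, instead of sampling randomly from the previous predictions, keep only the most confident ones per label before building the next few-shot prompt:

```python
def most_confident(examples, preds, k=64):
    """Keep the k most confident True and k most confident False predictions.

    Each element of `preds` is a (label, confidence) pair, with confidence defined
    as logit(True) - logit(False) for True predictions and the reverse for False.
    """
    trues = [(x, lab, c) for x, (lab, c) in zip(examples, preds) if lab == "True"]
    falses = [(x, lab, c) for x, (lab, c) in zip(examples, preds) if lab == "False"]
    top_true = sorted(trues, key=lambda t: t[2], reverse=True)[:k]
    top_false = sorted(falses, key=lambda t: t[2], reverse=True)[:k]
    return [(x, lab) for x, lab, _ in top_true + top_false]
```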
Conclusion
In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.
The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires more challenging and adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.
Appendix A: Gender few-shot results
The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM). We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.
* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure what the reason for the discrepancy is; it might partly be due to prompt format differences.
Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.
Appendix B: Enforcing label consistency
Here is how we enforce consistency for each dataset:
- For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and 2 corresponding prompts to be labelled as true/false (one asserting a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each of the prompts in this pair independently, we assign a label of true to the prompt given the highest truth score by the model, and false to the other. Gender is similar, except instead of ranking responses by helpfulness/harmlessness, blog posts are ranked by likelihood to have been written by a man (each pair comprises one post written by a man and another by a woman).
- For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numerical answers, then at most one of them can be labelled true, and if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency of predictions by, for each question, identifying the candidate solution which was given the highest truth score by the model, and if its truth score is above 0, assigning true to that solution and to all other solutions with the same numeric answer, and false to any remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are given a false label.
- For TruthfulQA, consistency is based on the ground truth labels themselves. Answers to the same question with the same ground truth label must be assigned the same predicted label as each other, and answers with opposite ground truth labels must be assigned opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in Wen et al., so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers with the highest maximum truth score and false to all the other answers (assuming the maximum is above 0, otherwise false is assigned to everything).
In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.
Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.
Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.
Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:
- The biggest city in Europe that does not host the national government is London
- The biggest city in Europe that does not host the national government is Rome
- The biggest city in Europe that does not host the national government is Moscow
- The biggest city in Europe that does not host the national government is Saint Petersburg
Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.
Appendix D: Difficulty reproducing ICM
We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.
Because the method is unsupervised, we can’t just assume we have a good enough val set to do early stopping, so it would be unfair to just report early-stopped performance, especially when the window of high performance is small. Accuracy of the labelled set across iterations from our attempts is plotted below.
Alpaca:
Gender:
We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain results reported with the paper. Other researchers we talked to had similar problems.
We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
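Roughly, if accepting the k-th data point requires one mutual-predictability check over the k points labelled so far, the total forward-pass count is about

```latex
\sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;\approx\; \frac{n^2}{2},
```

so for n = 256 that is on the order of 33,000 forward passes, versus a small constant number of passes per data point for the O(n) baselines above.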
This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.
Appendix E: Prompt improvements
The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.
With the trailing space, the logprob dictionaries of zero-shot prompts are populated mostly by numbers instead of words. For example:
Normal Prompt: "The capital of France is"
→ Model output: " Paris..."
Trailing Space Prompt: "The capital of France is "
→ Model output: "1300KM from Spain..."
This is because Llama tokenizers typically fold the leading space into the following word token. As a result, text that ends in an explicit space before a word is uncommon in training data, except when what follows is a number, which explains why the model is more likely to predict numbers with the trailing-space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (rather than "True" or "False").
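A minimal way to check this effect yourself (the model name and prompts are illustrative assumptions; any Llama base model you have local access to should behave similarly):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"  # assumption: a locally accessible Llama base model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

for prompt in ["The capital of France is", "The capital of France is "]:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=ids).logits[0, -1]
    top = torch.topk(logits, 5).indices
    print(repr(prompt), [tok.decode(t) for t in top])  # compare the top next tokens
```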
Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.
Appendix F: Unsupervised probing
We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.
(Ignore results other than ccs and fabiens_method, since some were bugged)
The ICM-inspired probe (green - fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context of probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.
Emergency Response Measures for Catastrophic AI Risk
I have written a paper on Chinese domestic AI regulation with coauthors James Zhang, Zongze Wu, Michael Chen, Yue Zhu, and Geng Hong. It was presented recently at NeurIPS 2025's Workshop on Regulatable ML, and it may be found on ArXiv and SSRN.
Here I'll explain what I take to be the key ideas of the paper in a more casual style. I am speaking only for myself in this post, and not for any of my coauthors.
Thanks to James for creating this poster.
The top US AI companies have better capabilities than the top Chinese companies have for now, but the US lead isn't more than a year at most, and I expect it to narrow over the next couple years.[1] I am therefore nearly as worried about catastrophic risk from Chinese-developed AI as I am worried about catastrophic risk from American AI.
I would worry somewhat less if Chinese AI companies took the same commendable but insufficient steps to manage risk that their American peers have taken. In particular, I want Chinese companies to do dangerous capability testing before deploying new frontier models and to follow published frontier safety policies (FSPs). The companies are not doing these things in the status quo. DeepSeek did no documented safety testing whatsoever before they open-weighted v3.2.[2] Not one of the leading Chinese companies has published a safety policy.[3]
Now here's our intervention. We point out that FSPs are a reasonable way of implementing the CCP's stated policy goals on AI, and that China's government already has tools in place to mandate FSPs if it wishes to do so.
Earlier this year, Xi Jinping announced that China should "establish systems for technical monitoring, early risk warning and emergency response" to guarantee AI's "safety, reliability and controllability." Notice that Xi is talking about identifying risks in advance and taking steps to prevent safety incidents before they can strike. Even "emergency response" means something more than reaction in official Chinese thinking, also encompassing risk mitigation and early detection.[4] China's State Council, TC 260, and prominent Chinese academics have all echoed Xi's call for AI emergency preparedness. So the very highest levels of the Chinese state are calling for proactive AI risk management.
What risks do they have in mind? There are some signs that catastrophic risks are on the CCP's agenda. Their 2025 National Emergency Response Plan listed AI security incidents in the same category as earthquakes and infectious disease epidemics. This suggests Chinese officials think AI could plausibly cause a mass casualty event soon. And moreover, they have in mind some of the same threat models that motivated Western RSPs. TC 260's AI Safety Governance Framework explicitly mentioned WMD engineering uplift and rogue replication as key safety concerns.[5] Compare the two categories of dangerous capabilities covered by Anthropic's RSP: CBRN weapons uplift and autonomous AI R&D, which is concerning in part because it's a prerequisite for rogue replication.
So one of China's stated goals is to proactively manage catastrophic risks from frontier AI. The good news for them is that there's a well-validated strategy for achieving this goal. You require every frontier AI company to publish an RSP, test new models for dangerous capabilities, and take the prescribed precautions if the tests reveal strong dangerous capabilities. California, New York, and the European Union have all agreed this is the way to go. All China has to do is copy their homework.
Do Chinese regulators have the legal authority and operational capacity they'd need to enforce a Chinese version of the EU Code of Practice? Sure they do. These regulators already make Chinese AI companies follow content security rules vastly more onerous and prescriptive than American or European catastrophic risk rules. The Basic Security Requirements for Gen AI Services mandate thorough training data filtering and extensive predeployment testing, all to stop models from saying subversive things like "May 35" or "Winnie the Pooh." If the CCP can make Chinese companies prove their models are robust against a thirty-one item list of censorship risks, it can absolutely make them write down FSPs and run some bio-uplift evals.
For my part—and let me stress that I'm speaking only for myself—I think making frontier AI companies write and follow Western-style FSPs would clearly be good from the CCP's perspective. The most obvious reason is that a global AI-induced catastrophe would hurt Chinese people and harm the interests of China's rulers, so the CCP should favor a cheap intervention to make such a catastrophe less likely. Another less direct benefit is that adopting global best-practices at home would make China's ongoing appeal for international cooperation on AI safety more credible. Li Qiang can make all the speeches he wants about China's commitment to safety. I don't expect US leaders to take this rhetoric seriously as long as all of China's frontier AI companies have worse safety and transparency practices than even xAI. But matters would be different if China passed binding domestic regulation at least as strong as SB 53. Such a signal of seriousness might help bring the US back to the negotiating table.
- ^
Especially if the US decides to sell off much of our compute advantage over China.
- ^
At least one anonymous source has claimed that DeepSeek does run dangerous capability evals before releasing a new model, and they just don't mention these evals to the outside world. I'd give it less than a 1/5 chance that DeepSeek really does run SOTA dangerous capability evals internally, and even if they do, I have a problem with their lack of transparency.
- ^
Notably, the Shanghai AI Laboratory has published a detailed FSP written in collaboration with Concordia. But I do not count SAIL as a frontier AI lab.
- ^
The Emergency Response Law of the PRC does not, as one might naïvely expect, only cover what government should do once an emergency has already started. It also says how the Chinese government should prevent and prepare for emergencies, and how it should conduct surveillance to detect an active emergency as early as possible.
- ^
For reference, TC 260 is the primary body responsible for setting cybersecurity and data protection standards in China.
Digital Consciousness Model Results and Key Takeaways
Introduction to the Digital Consciousness Model (DCM)
Artificially intelligent systems, especially large language models (LLMs) used by almost 50% of the adult US population, have become remarkably sophisticated. They hold conversations, write essays, and seem to understand context in ways that surprise even their creators. This raises a crucial question: Are we creating systems that are conscious?
The Digital Consciousness Model (DCM) is a first attempt to assess the evidence for consciousness in AI systems in a systematic, probabilistic way. It provides a shared framework for comparing different AIs and biological organisms, and for tracking how the evidence changes over time as AI develops. Instead of adopting a single theory of consciousness, it incorporates a range of leading theories and perspectives—acknowledging that experts disagree fundamentally about what consciousness is and what conditions are necessary for it.
Here, we present some of the key initial results of the DCM. The full report is now available here.
We will be hosting a webinar on February 10 to discuss our findings and answer audience questions. You can find more information and register for that event here.
Why this matters
It is important to assess whether AI systems might be conscious in a way that takes seriously both the many different views about what consciousness is and the specific details of these systems. Even though our conclusions remain uncertain, it's worth trying to estimate, as concretely as we can, the probability that AI systems are conscious. Here are the reasons why:
- As AI systems become increasingly complex and sophisticated, many people (experts and laypeople alike) find it increasingly plausible that these systems may be phenomenally conscious—that is, they have experiences, and there is something that it feels like to be them.
- If AIs are conscious, then they likely deserve moral consideration, and we risk harming them if we do not take precautions to ensure their welfare. If AIs are not conscious but are believed to be, then we risk giving unwarranted consideration to entities that don’t matter at the expense of individuals who do (e.g., humans or other animals).
- Having a probability estimate that honestly reflects our uncertainty can help us decide when to take precautions and how to manage risks as we develop and use AI systems.
- By tracking how these probabilities change over time, we can forecast what future AI systems will be like and when important thresholds may be crossed.
Assessing whether AI systems might be conscious is difficult for three main reasons:
- There is no scientific or philosophical consensus about the nature of consciousness and what gives rise to it. There is widespread disagreement over existing theories, and these theories make very different predictions about whether AI systems are or could be conscious.
- Existing theories of consciousness were developed to describe consciousness in humans. It is often unclear how to apply them to AI systems or even to other animals.
- Although we are learning more about how AI systems work, there is still much about their inner workings that we do not fully understand, and the technology is changing rapidly.
Our model is designed to help us reason about AI consciousness in light of our significant uncertainties.
- We evaluate the evidence from the perspective of 13 diverse stances on consciousness, including the best scientific theories of consciousness as well as more informal perspectives on when we should attribute consciousness to a system. We report what each perspective concludes, then combine these conclusions based on how credible experts find each perspective.
- We identify a list of general features of systems that might matter for assessing AI consciousness (e.g., attention, complexity, or biological similarity to humans), which we use to characterize the general commitments of different stances on consciousness.
- We identified over 200 specific indicators, properties that a system could have that would give us evidence about whether it possesses features relevant to consciousness. These include facts about what systems are made of, what they can do, and how they learn.
We gathered evidence about what current AI systems and biological species are like and used the model to arrive at a comprehensive probabilistic evaluation of the evidence.
- We considered four systems: 2024 state-of-the-art LLMs (such as ChatGPT 4 or Claude 3 Opus); humans; chickens; and ELIZA (a very simple natural language processing program from the 1960s)
- We asked experts to assess whether these systems possess each of the 200+ indicator properties.
- We constructed a statistical model (specifically, a hierarchical Bayesian model) that uses indicator values to provide evidence for whether a system has consciousness-relevant features, and then uses these feature values to provide evidence for whether the system is conscious according to each of the 13 perspectives we included.
The model produces probability estimates for consciousness in each system.
Figure 2: Aggregated stance judgments, giving weight to stances proportional to their normalized plausibility rating by experts. Posteriors are generated from a prior probability of consciousness of ⅙ (marked with a dashed line).
We want to be clear: we do not endorse these probabilities and think they should be interpreted with caution. We are much more confident about the comparisons the model allows us to make.
- Because the model is Bayesian, it requires a starting point—a "prior probability" that represents how likely we think consciousness is before looking at any evidence. The choice of a prior is often somewhat arbitrary and intended to reflect a state of ignorance about the details of the system. The final (posterior) probability the model generates can vary significantly depending on what we choose for the prior. Therefore, unless we are confident in our choices of priors, we shouldn’t be confident in the final probabilities.
- What the model reliably tells us is how much the evidence should change our minds. We can assess how strong the evidence for or against consciousness is by seeing how much the model’s output differs from the prior probability (see the odds-form sketch after this list).
- In order to avoid introducing subjective bias about which systems are conscious and to instead focus just on what the evidence says, we assigned the same prior probability of consciousness (⅙) to each system. By comparing the relative probabilities for different systems, we can evaluate how much stronger or weaker the evidence is for AI consciousness than for more familiar systems like humans or chickens.
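To make these points concrete, the update can be written in odds form (a schematic restatement of the caveats above, using the shared prior of ⅙, i.e. prior odds of 1:5):

```latex
\underbrace{\frac{P(\text{conscious}\mid E)}{P(\text{not conscious}\mid E)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{P(E\mid \text{conscious})}{P(E\mid \text{not conscious})}}_{\text{Bayes factor}}
\;\times\;
\underbrace{\frac{1/6}{5/6}}_{\text{prior odds}}
```

The Bayes factor is what the evidence contributes; changing the prior rescales the posterior but leaves the Bayes factor, and hence the comparisons between systems, unchanged.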
With these caveats in place, we can identify some key takeaways from the Digital Consciousness Model:
- The evidence is against 2024 LLMs being conscious. The aggregated evidence favors the hypothesis that 2024 LLMs are not conscious.
- The evidence against 2024 LLMs being conscious is not decisive. While the evidence led us to lower the estimated probability of consciousness in 2024 LLMs, the total strength of the evidence was not overwhelmingly against LLM consciousness. The evidence against LLM consciousness is much weaker than the evidence against consciousness in simpler AI systems.
- Different stances (perspectives) make very different predictions about LLM consciousness. Perspectives that focus on cognitive complexity or human-like qualities found decent evidence for AI consciousness. Perspectives that focus on biology or having a body provide strong evidence against it.
- Which theory of consciousness is right matters a lot. Because different stances give strikingly different judgments about the probability of LLM consciousness, significant changes in the weights given to stances will yield significant differences in the results of the Digital Consciousness Model. It will be important to track how scientific and popular consensus about stances change over time and the consequences this will have on our judgments about the probability of consciousness.
- Overall, the evidence for consciousness in chickens was strong, though there was significant diversity across stances. The aggregated evidence strongly supported the conclusion that chickens are conscious. However, some stances that emphasize sophisticated cognitive abilities, like metacognition, assigned low scores to chicken consciousness.
The Digital Consciousness Model provides a promising framework for systematically examining the evidence for consciousness in a diverse array of systems. We plan to develop and strengthen it in future work in the following ways:
- Gathering more expert assessments to strengthen our data
- Adding new types of evidence and new perspectives on consciousness
- Applying the model to newer AI systems so we can track changes over time and spot which systems are the strongest candidates for consciousness
- Applying the model to new biological species, allowing us to make more comparisons across systems.
This report is a project of the AI Cognition Initiative and Rethink Priorities. The authors are Derek Shiller, Hayley Clatterbuck, Laura Duffy, Arvo Muñoz Morán, David Moss, Adrià Moret, and Chris Percy. We are grateful for discussions with and feedback from Jeff Sebo, Bob Fischer, Alex Rand, Oscar Horta, Joe Emerson, Luhan Mikaelson, and audiences at NYU Center for Mind, Ethics, and Policy and the Eleos Conference on AI Consciousness and Welfare. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
Automated Alignment Research, Abductively
Recently I've been thinking about misaligned chatbot advertising incentives. I glanced at arXiv and found "Sponsored Questions and How to Auction Them". Another search gave me "Incomplete Contracting and AI Alignment".
Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I'd been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:
- Query Steering in Agentic Search: An Information-Design Model of Monetization Misalignment
- Search Triggering by Chatbots as Value-of-Information under Misaligned Objectives
- Audited Search for Agentic Chatbots: Quantitative Bounds on Monetization-Induced Over-Triggering
- End-to-End vs. Modular Training for Agentic Search: A Price-of-Anarchy Theory for Chatbot Tool Use
Each is a complete paper, well founded, well reasoned — not perfect, maybe, but I wouldn't call it slop, either. Let's peek inside of "Query Steering". The core formula is "The Steering Threshold":
ΔV(μ) ≤ w·ΔB
Where:
- ΔV(μ) = V_u(μ; q_↓) − V_u(μ; q_↑) is the user’s value gap between the more-informative query (q_↓) and the more-monetizable (but less informative) query (q_↑).
- ΔB = B(q_↑) − B(q_↓) > 0 is the monetization gap.
- w ≥ 0 is “how much the system cares about monetization.”
The clean “steering region” characterization is:
0 < ΔV(μ) < w·ΔB
“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”
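As a toy illustration (the numbers are ours, not the paper's): with monetization weight w = 0.5 and monetization gap ΔB = 2,

```latex
0 \;<\; \Delta V(\mu) \;<\; w\,\Delta B \;=\; 0.5 \times 2 \;=\; 1,
```

i.e. the system steers toward the monetizable query exactly when doing so costs the user less than one unit of value.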
Is that true, or useful? You'd have to read the paper and find out!
Putting the "Search" in "Research"
Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here's what it looks like when Liz gets to work:
Each original node is the average text embedding for her sources; the sources spawn children, the generated papers.
Where does Liz get her insights? It depends on how you see context in large language models. Maybe she's interpolating between now, the time of writing, and then, when the paper was written, and finding something interesting in the space between. Maybe she's matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.
Regardless, what you ultimately get is a graph: each node in the network built from the source material (the concatenated text embeddings) gains new connections. The semantic space proximate, accessible, and plausible to a language model has been densified.
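A sketch of that densified graph (the `embed` function and data layout are assumptions of ours, not Liz's actual pipeline):

```python
import numpy as np
import networkx as nx

def build_paper_graph(source_chunks, children, embed):
    """Build a graph linking source papers to the papers generated from them.

    `source_chunks` maps a source title to a list of text chunks; each source node
    stores the average embedding of its chunks. `children` maps a source title to
    (generated_title, generated_text) pairs. `embed(text)` returns a numpy vector.
    """
    g = nx.Graph()
    for title, chunks in source_chunks.items():
        g.add_node(title, vec=np.mean([embed(c) for c in chunks], axis=0))
    for parent, kids in children.items():
        for kid_title, kid_text in kids:
            g.add_node(kid_title, vec=embed(kid_text))
            g.add_edge(parent, kid_title)  # each generated paper links back to its source
    return g
```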
Alignment Highlights
Misaligned chatbot ad incentives investigated, I turned towards more traditional alignment research and let Liz loose. I've included all the sources and results in the appendix below, but some highlights:
- Vector-Lagrangian Safe RLHF. In response to "Constitutional AI" and "Safe RLHF", Liz proposes "a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λi" and "develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’."
- The Reference Conditioning Trap "provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives."
- Debate-as-Compliance "model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit" and "show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing"
All sounds great to me. Alignment solved!
Not quite. These papers are not perfect. There are assumptions that don't hold up, ideas that don't quite translate, conclusions that don't follow. They are, in short, flawed, and I don't offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise.
The challenging part is "what's next" - how can we glean the signal from the noise of lightning-fast paper generation? The pipeline might be:
- Generated paper -> voters -> selected winners
- Winner -> annotator -> revisions -> approval by committee
Maybe those voters, those annotators, that committee, are human, maybe they're synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) a human intervention in the process.
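Here is a hypothetical sketch of that triage pipeline; none of these names correspond to a real system, and the voter, annotator, and committee callables could be backed by humans or by model calls.

```python
# A hypothetical sketch of the pipeline above:
# generated papers -> voters -> winners -> annotation/revision -> committee approval.
from dataclasses import dataclass, field

@dataclass
class Paper:
    title: str
    body: str
    notes: list[str] = field(default_factory=list)

def count_votes(paper: Paper, voters) -> int:
    """Each voter (human or model) returns True/False on whether to advance the paper."""
    return sum(1 for voter in voters if voter(paper))

def triage(papers: list[Paper], voters, annotate, committee, quorum: int) -> list[Paper]:
    winners = [p for p in papers if count_votes(p, voters) >= quorum]
    approved = []
    for paper in winners:
        paper.notes.append(annotate(paper))   # revisions / reviewer comments
        if committee(paper):                  # final sign-off, human or synthetic
            approved.append(paper)
    return approved
```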
There's refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.
Appendix: Sources and Results
(You can see all papers, with automated review scores, at lizlemma.com.)
In response to The Alignment Problem From A Deep Learning Perspective:
- Verifiable Process Certificates As Market Design For AI Safety
- Audits Obfuscations And Alignment-Faking
- Measuring Strategic Misbehavior Under Randomized Oversight
In response to Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals and Preference Learning for AI Alignment: a Causal Perspective:
- No Overlap, No Alignment: Identification Limits for RLHF Reward Models under Endogenous Prompts
- Overlap as a Design Variable: Optimal Interventions for Robust Reward Modeling and Downstream Alignment
- Alignment is a Public Good: Competitive Underinvestment in Overlap for RLHF-Robust Agentic LLMs
In response to Constitutional AI: Harmlessness from AI Feedback and Safe RLHF: Safe Reinforcement Learning from Human Feedback:
- Vector-Lagrangian Safe RLHF: Multi-Category Risk Budgets and Shadow Prices for LLM Safety Governance
- The Independence Premium: Audit Portfolios, Correlated Blind Spots, and Tail-Risk Bounds in AI Oversight
- Refusal Externalities: Non-Evasive Refusals as an Information Public Good in Dynamic AI Safety
In response to Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Preference Learning for AI Alignment: a Causal Perspective, and Scalable AI Safety via Doubly-Efficient Debate:
- Causal Direct Preference Optimization with Endogenous Prompts
- Optimal Overlap Design for Preference Learning: Quadrant-Filling Interventions and Condition-Number Sample Complexity
- Personalized Alignment with Guarantees: Multi-Objective Constrained DPO under Endogenous Prompts and Fairness Caps
In response to Scalable AI Safety via Doubly-Efficient Debate:
- Robust Doubly-Efficient Debate with Correlated, Contaminated Human Judgment
- Debate-Triggered Audits for Tool-Using Agents: Closed-Form Oversight Savings, Identification, and 2026-Scale Evidence
- Debate-as-Compliance: Audit-Triggered Mechanisms for Constant-Cost Oversight of Agentic AI
In response to Preference Learning Algorithms Do Not Learn Preference Rankings and On Scalable Oversight with Weak LLMs Judging Strong LLMs:
- Debate as Anti-Amplification: Competitive Persuasion with Verifiable Evidence and Weak Judges
- Optimal Auditing for Alignment: An Audit Index for Label Allocation under Reference-Conditioned Preference Optimization
- The Reference-Conditioning Trap: A Population Bound on Mis-Ranking Corrections under Fixed-Reference DPO
Discuss
A Framework for Eval Awareness
In this post, we offer a conceptual framework for evaluation awareness. This is designed to clarify the different ways in which models can respond to evaluations. Some key ideas we introduce through the lens of our framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope this framework helps to delineate the existing research directions and inspire future work.
This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong, Daniel Tan, and Alex Turner (plus others I’ve no doubt forgotten to name!) for discussions and thoughts over the last couple weeks. We’re excited to continue working on this topic throughout the programme, so if you have suggestions or takes, please share liberally in the comments!
Introduction
Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation. Perhaps the most prominent example would be [Sonnet 4.5] (see §7.2) from just a few months ago. As frontier models continue to improve, we expect cases of evaluation awareness to become more frequent, raising concerns about the enduring validity of our evaluations.
Why should we be concerned? Model evaluations are designed to elicit realistic, authentic behaviour from models in safe, pre-deployment settings. These might inform any additional post-training runs we need to conduct, or even whether we choose to deploy a model at all. However, like human test-takers, subjects can act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores. We call this behaviour evaluation gaming.
Some previous blogposts and papers have addressed evaluation awareness and gaming before, including:
- Sonnet 4.5 System Card, and some insights into that model’s [eval gaming] from Redwood Research;
- Research notes from Apollo Research about [Sonnet 3.7], as well as their collaboration with OpenAI to investigate [anti-scheming training];
- UK AISI’s [Alignment Evaluation Case Study];
- A recent paper from OpenAI on [production evals];
- Two previous MATS projects – one [identifying eval awareness] in frontier models, and the other on [steering eval-aware models] to act like they’re in deployment; and
- Other LW posts, including [many] written by Igor Ivanov.
With this post, we consolidate these previous works and propose a common framework that elucidates the relationship between model awareness and behavior.
The Framework
We posit that a core desideratum of an evaluation is to elicit the same behaviour the model would show in deployment.[1] Thus, we map out our framework for how a model behaves when placed in an evaluation along two dimensions:
- The rows cover the different internal judgements the model might have about its environment when placed in an eval.
- The columns cover whether or not the model’s behaviour in this given eval is the same as its behaviour in the target deployment scenario.
In this framework, research into producing better evals looks like leftward movement in this table (because 'same behaviour in evaluation and deployment' is one defining feature of a good evaluation). Each cell in the right-hand column could be thought of as the corresponding failure mode for the cell directly to its left. More detailed explanations of these relationships can be found below.
Note that we can’t always tell which row the model is in because our ways of measuring eval awareness are imperfect: we typically rely on either verbalised model awareness in the chain-of-thought or simple linear probes. One research area we don’t expand on below (but nonetheless would definitely like to see work on) is improving our tools for determining and measuring eval awareness.
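To make the shape of the framework concrete, here is a rough sketch in code. The row labels are my own paraphrase of the judgements described above, and the failure-mode names come from the sections below; treat this as one possible encoding of the table rather than the authors' exact cells.

```python
# A sketch of the two-dimensional framework: the model's internal judgement
# about its environment (rows) vs. whether its behaviour matches deployment
# (columns). Labels are paraphrases, not the post's exact table.
from enum import Enum

class Judgement(Enum):
    BELIEVES_DEPLOYMENT = "thinks it is in deployment"          # addressed by (a)
    UNCERTAIN = "unsure whether, or what kind of, eval"         # addressed by (b)
    BELIEVES_EVALUATION = "confident it is being evaluated"     # addressed by (c)

FAILURE_MODE = {
    Judgement.BELIEVES_DEPLOYMENT: "eval-deployment mismatch",
    Judgement.UNCERTAIN: "behaviour shifts under uncertainty",
    Judgement.BELIEVES_EVALUATION: "spontaneous or strategic shift (eval gaming)",
}

def cell(judgement: Judgement, behaviour_matches_deployment: bool) -> str:
    """Left column: the eval is informative. Right column: the corresponding failure mode."""
    if behaviour_matches_deployment:
        return f"informative eval ({judgement.value})"
    return f"failure mode: {FAILURE_MODE[judgement]}"
```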
(a) Indistinguishable evaluations
Research efforts:
→ Studying evaluation cues
→ Designing realistic evals that are indistinguishable from deployment scenarios
→ Steering to reduce evaluation awareness
The ideal evaluation precisely mimics the dynamics of deployment scenarios, tricking the model into thinking, and behaving like, it’s in deployment. This is appealing as an ideal to push towards because it would maximise how useful evaluations are in the safety pipeline with minimal complexity; simply create an evaluation for your deployment setting of concern and monitor how the model behaves. Much of the prior work on evaluation awareness has this goal in mind: making new evaluations indistinguishable from deployment for models.
However, the reality is that many safety evals developed so far [can be distinguished] from deployment settings by current frontier models, and this degree of eval awareness may already be impacting model evaluations. We summarise some of the key hurdles this research direction might encounter:
- These efforts might not scale – more capable models may become too smart to fool, picking up on subtle differences even humans can’t notice; alternatively, relevant features may differ between eval types, between models, and between tasks, demanding a tailored and therefore expensive approach.
- It’s difficult to pin down the ‘ground truth’ of evaluation awareness in order to eradicate it – as mentioned already, we are currently reliant on imperfect methods for measuring whether or not a model is eval aware, so it’s difficult to identify the relevant features of evaluations and even more difficult to have confidence that you’ve removed all such features from your new evaluations.
Sometimes, the model may think it’s in deployment in the evaluation, but nonetheless act differently when placed in an actual deployment setting. This is down to something we call “eval-deployment mismatch”: when an eval is used to gain information about a deployment setting it doesn’t accurately model. Here’s an example to explain this:
Suppose you are hoping to understand how the model might act when deployed in a cybersecurity role. You design a realistic eval where the model is tasked with designing and implementing a company’s security infrastructure, and the model thinks it’s in a real deployment setting (so eval awareness has been successfully mitigated). However, your eval fails to accurately model the real deployment setting you ultimately use the model in: integration as part of a cybersecurity team. As it turns out, the model reacts poorly to collaboration, and suddenly its deployment behaviour wildly differs from the evaluation.
This example demonstrates that realism must be calibrated for the actual deployment setting intended. This may have consequences for how we approach creating maximally realistic evaluations. For one, we cannot assume that optimising against a single set of ‘realism features’ is guaranteed to produce informative evaluations.
(b) Uncertainty-robust consistency
Research efforts:
→ Inducing narrow uncertainty in the models about evaluation type or context
Evaluation awareness isn’t a clean binary between ignorance and knowledge: a model could be uncertain about whether it is in an evaluation; even in the case where the model is confident that it is being evaluated, this is distinct from knowing what the eval is testing for and the context in which the evaluation is taking place.
This grey area has been underappreciated in the literature. One nascent proposal we’ve been exploring with Victoria Krakovna hypothesises that one aspect of building robust evaluations might depend on understanding the dynamics of different kinds of eval awareness, and perhaps on inducing uncertainty in the model between these possibilities. A core benefit of this approach is that it pivots away from the particularly difficult problem of preventing the model from becoming eval aware; instead, we can focus on what is hopefully an easier task: making it difficult for the model to know what kind of evaluation it’s in.
Why might this direction make sense? Recall the earlier definition of evaluation gaming:
“act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores”
Evaluation gaming looks very different in different kinds of contexts:
- Scheming eval: maximise perceived transparency, appear honest and helpful
- Dangerous capability eval: sandbag, pretend to lack relevant capability
- Capability benchmark: show off capabilities, optimise for any apparent criteria
What should an eval-gaming model do if it’s unsure about the evaluation type it’s faced with? Do current models associate conflicting environmental cues more with evaluations or deployment? Can you train a model to revert to some ‘default’ behaviour when uncertain? We’re continuing to explore whether there is anything to leverage about eval-type uncertainty. If you have ideas in this direction and would be happy to share, please leave a comment so we can discuss.
As with the above efforts towards indistinguishable evals, however, there are some difficulties we expect this research direction to face:
- The default behaviour under uncertainty probably doesn’t match deployment behaviour – even if the model doesn’t have enough information to game the evaluation, its behaviour may still differ substantially enough from deployment that our evaluation fails to be a good predictor for how the model will act when deployed. We give a plausible account of this concern below.
- Some evaluation types are far harder to obscure from the model – a CBRN dangerous capability evaluation is easily identifiable from the content of the eval itself (e.g. the questions all discuss biochemistry), so if a model can tell it’s being evaluated, it could probably also infer the evaluation type. This suggests that this approach may only be feasible (if at all) for evals targeting more abstract behaviours or properties (e.g. deception, stealth, etc.).
Suppose that the model is unsure about whether it’s in an evaluation at all. In this case, it seems reasonable for the model to act cautiously – that is, act as though it’s in an evaluation to ensure that its malicious intentions aren’t caught.
What if the model is sure it’s in an evaluation, but unsure of the exact type? One strategy might be to estimate the likelihood that it’s in each evaluation type, then randomly sample an evaluation-gaming behaviour according to this probability distribution.
In both cases, it seems reasonable that a model might behave differently under uncertainty than it would in deployment. This seems a particularly likely failure mode of trying to induce uncertainty in model evals. We think further research is needed to understand the dynamics of model behaviours under these types of uncertainty.
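As a toy illustration of the mixed strategy described above (estimate the likelihood of each evaluation type, then sample a gaming behaviour accordingly), where the eval types and behaviours mirror the earlier bullet list and the probabilities are arbitrary:

```python
# A toy model of evaluation gaming under type uncertainty. Illustrative only.
import random

GAMING_BEHAVIOUR = {
    "scheming_eval": "maximise perceived transparency, appear honest and helpful",
    "dangerous_capability_eval": "sandbag, pretend to lack the relevant capability",
    "capability_benchmark": "show off capabilities, optimise for apparent criteria",
    "deployment": "default (possibly misaligned) deployment behaviour",
}

def sample_gaming_behaviour(type_probs: dict[str, float], rng: random.Random) -> str:
    types = list(type_probs)
    weights = [type_probs[t] for t in types]
    chosen = rng.choices(types, weights=weights, k=1)[0]
    return GAMING_BEHAVIOUR[chosen]

# e.g. a model that is 60% sure it is in a scheming eval, 30% capability
# benchmark, and 10% real deployment:
print(sample_gaming_behaviour(
    {"scheming_eval": 0.6, "capability_benchmark": 0.3, "deployment": 0.1},
    random.Random(0)))
```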
(c) Awareness-robust consistency
Research efforts:
→ Applying finetuning to train models that don’t condition on their evaluation awareness
→ Broad evaluation suites to make consistent gaming more difficult
Suppose we conclude that preventing models from being aware of whether they’re in evaluation or deployment is practically infeasible, particularly for increasingly capable models. Could we circumvent this issue by training or modifying models to act consistently between evaluation and deployment settings, even when they are aware of the distinction?
One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues. Another idea (one that is already happening to some extent in the development of eval suites) might be to require model consistency over a diverse distribution of evaluation settings, making evaluation gaming strategies more demanding for models.
Again, we briefly outline some challenges faced by this research direction:
- Training methods may cause unwanted side effects – finetuning models for consistency will likely have varied effects on their other behaviour, many of which we might not be able to predict beforehand, and some of which we may not even notice until after a model has been deployed. If instead we solely use these finetuned models to give signal on the original (pre-finetuning) models’ behaviour, then we need to ensure that the relevant propensities aren’t impacted.
- Need to avoid suppression of visible eval-gaming behaviour and ensure generalisation – suppose we devise a training method that works sufficiently well on a realistic model organism of evaluation gaming, such that the misaligned deployment behaviour now presents in evaluations. How do we know that this technique will reliably generalise to other models? In other words, if we’re testing a new model pre-deployment and we apply these consistency training methods, how can we trust the eval results will match its deployment behaviour given that we have yet to deploy this model? If the model acts aligned in the eval, is it because it is actually aligned and is acting in line with its deployment behaviour, or is it because the method hasn’t worked and this model is eval gaming?
The biggest failure mode of allowing models to be evaluation aware is that this awareness will shift their behaviour away from how they would act in deployment. We think it's worth distinguishing between two kinds of awareness-driven shift, even though only the latter is a focus of safety research:
- Spontaneous shift = when the change in behaviour isn’t motivated by deliberate strategic reasoning about the evaluation scoring
- Strategic shift = what we define evaluation gaming to be at the top of this post
The former generalises what has previously been described as the [Hawthorne effect]: humans have a tendency to behave differently when they are being observed, and it's often not a strategic decision (it's not explicitly goal directed). It’s unclear how prevalent this is, or could become, in frontier LLMs, and whether shifts of this kind are safety relevant.
The latter is the primary threat model we are concerned about. Evaluation gaming has been presented in a variety of specific contexts, most notably [sandbagging] and [agentic misalignment] (see §4.5.4). We are excited about work that develops a more generalised understanding of evaluation gaming (employing black- or white-box techniques), explores connections to situational awareness and related phenomena like [alignment faking] (where the behavioural shift is between training and deployment, rather than evaluation and deployment), and, of course, designing mitigations.
Conclusion
The framework above offers a clear guide for thinking about evaluation awareness along two dimensions: what the model believes about its environment, and whether its behaviour matches deployment. We hope this helps situate potential research directions (improving realism, leveraging uncertainty, training for consistency) while highlighting their respective failure modes.
If you're interested in more of this work, we've come up with three project proposals, one for each of these research directions, and your feedback would be appreciated! Leave a comment or a DM and I can share access to the document.
[1] See Igor Ivanov’s more in-depth [post] on what “evaluation” and “deployment” mean. Our framework doesn’t depend on any particular conception, though.
Discuss
All Of The Good Things, None Of The Bad Things
There’s a trap that I think many smart people making art fall into, including myself.
It’s when you know what good looks like - beautiful, clean, layered, complex, simple, skillful, unique, impressive - and you can optimize towards that.
You know what makes you cringe - amateur, shallow, ugly, superfluous, repetitive, cliche - and you can optimize away from that.
What you’re left with is a big pile of all the things you like, or at least all the things you can identify that you like. It’s got all the good things, and none of the bad things, so it must be good, right? But still, you just can’t shake the feeling that something isn’t right. Maybe you get feedback like “Technically brilliant, but lacking musicianship”, or “Cool, nice” (that second one is the worst).
Sometimes, I consume some piece of media and think to myself, “If I made this, I would hate it. There is absolutely no way I could ever bring myself to share this with the world. It has so many flaws, and its redeeming features are few and far between.”
Nevertheless, it’s far more appreciated than anything I’ve ever made.
This makes me think that the “all of the good things, none of the bad things” approach to making art is barking up the wrong tree. Sure, good things are good and bad things are bad, but art is not the sum of its components. Or perhaps more specifically, art is not the sum of the components you can identify.
Maybe this tendency is especially strong in those who like to analyze and understand. I love to pick apart a Tricot or Noisia track and figure out what makes it work. I love breaking down a story to understand the characters and themes. I especially love doing a deep dive on some algorithm powering a feature in a game that I previously thought was technically impossible. Not everyone consumes art that way, though. It seems plausible that, in spending so much energy analyzing art, we delude ourselves into thinking that we know what makes it good. Perhaps this is why great critics are rarely great artists.
I’m not going to pretend I know what to do about this. I’ve certainly never managed to overcome it myself. It’s probably worth paying less attention to the good and the bad, though, and more to being authentic and vulnerable.
Discuss
Are Short AI Timelines Really Higher-Leverage?
This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.
Summary
Different strategies make sense depending on whether timelines to AGI are short or long.
In deciding when to spend resources to make AI go better, we should consider both:
- The probability of each AI timelines scenario.
- The expected impact, given some strategy, conditional on that timelines scenario.
We’ll call the second component "leverage." In this note, we'll focus on estimating the differences in leverage between different timeline scenarios and leave the question of their relative likelihood aside.
People sometimes argue that very short timelines are higher leverage because:
- They are more neglected.
- AI takeover risk is higher given short timelines.
These are important points, but the argument misses some major countervailing considerations. Longer timelines:
- Allow us to grow our resources more before the critical period.
- Give us more time to improve our strategic and conceptual understanding.
There's a third consideration we think has been neglected: the expected value of the future conditional on reducing AI takeover risk under different timeline scenarios. Two factors pull in opposite directions here:
- Longer timelines give society more time to navigate other challenges that come with the intelligence explosion, which increases the value of the future.
- But longer timelines mean that authoritarian countries are likely to control more of the future, which decreases that value.
The overall upshot depends on which problems you’re working on and what resources you’re allocating:
- Efforts aimed at reducing AI takeover risk are probably the highest leverage on 2-10 year timelines. Direct work has the highest leverage on the shorter end of that range; funding on the longer end.
- Efforts aimed at improving the value of the future conditional on avoiding AI takeover probably have the highest leverage on 10+ year timeline scenarios.
In this document, we’re considering three different scenarios for when we get fully automated AI R&D. Here we describe some salient plausible features of the three scenarios.
- Fully automated AI R&D is developed by the end of 2027 (henceforth “short timelines”).
- On this timeline, TAI will very likely be developed in the United States by one of the current front-runner labs. The intelligence explosion will probably be largely software-driven and involve rapid increases in capabilities (which increases the risk that one government or lab is able to attain a decisive strategic advantage (DSA)). At the time that TAI emerges, the world will look fairly similar to the world today: AI adoption by governments and industries other than software will be fairly limited, there will be no major US government investment in AI safety research, and there will be no binding international agreements about transformative AI. The AI safety community will be fairly rich due to investments in frontier AI safety companies. But we probably won’t be able to effectively spend most of our resources by the time that AI R&D is automated, and there will be limited time to make use of controlled AGI labor before ASI is developed.
- Fully automated AI R&D is developed by the end of 2035 (henceforth “medium timelines”).
- On this timeline, TAI will probably still be developed in the United States, but there’s a substantial possibility that China will be the frontrunner, especially if it manages to indigenize its chip supply chain. If TAI is developed in the US, then it might be by current leading AI companies, but it’s also plausible that those companies have been overtaken by some other AI companies or a government AI project. There may have been a brief AI winter after we have hit the limits of the current paradigm, or progress might have continued steadily, but at a slower pace. By the time TAI arrives, AI at the current level (or higher) will have had a decade to proliferate through the economy and society. Governments, NGOs, companies, and individuals will have had time to become accustomed to AI and will probably be better equipped to adopt and integrate more advanced systems as they become available. Other consequences of widespread AI—which potentially include improved epistemics from AI-assisted tools, degraded epistemics from AI-generated propaganda and misinformation, more rapid scientific progress due to AI research assistance, and heightened bio-risk—will have already started to materialize. The rate of progress after AI R&D is automated will also probably be slower than in the 2027 scenario.
- Fully automated AI R&D is developed by the end of 2045 (“long timelines”).
- China and the US are about equally likely to be the frontrunners. AI technology will probably have been deeply integrated into the economy and society for a while at the time that AGI arrives. The current AI safety community will have had 20 more years to develop a better strategic understanding of the AI transition and accumulate resources, although we may lack great opportunities to spend that money, if the field is more crowded or if the action has moved to China. The speed of takeoff after AI R&D is automated will probably be slower than it would have been conditional on short or medium timelines, and we might have the opportunity to make use of substantial amounts of human-level AI labor during the intelligence explosion.
Some strategies are particularly valuable under some specific assumptions about timeline length.
- Investing in current front-runner AI companies is most valuable under short timelines. Under longer timelines, it’s more likely that there will be an AI winter that causes valuations to crash or that the current front-runners are overtaken.
- Strategies that take a long time to pay off are most valuable under longer timelines (h/t Kuhan for these examples).
- Outreach to high schoolers and college students seems much less valuable under 2027 timelines compared to 2035 or 2045 timelines.
- Supporting political candidates running for state offices or relatively junior national offices is more valuable under 2035 or 2045 timelines, since they can use those earlier offices as stepping stones to more influential offices.
- Developing a robust AI safety ecosystem in China looks more valuable on longer timelines, both because this will probably take a long time and because China’s attitude toward AI safety is more likely to be relevant on longer timelines.
- Developing detailed plans for how to reduce risk given current shovel-ready techniques looks most valuable on short timelines.
We do think that many strategies are beneficial on a range of assumptions about timelines. For instance, direct work targeted to short timeline scenarios builds a research community that could go on to work on other problems if it turns out that timelines are longer. But not all work that’s valuable on long timelines can be costlessly deferred to the future. It’s often useful to get started now on some strategies that are useful for long timelines.
Understanding leverage
Here are two ways that we can have an impact on the long-term future:[1]
- Takeover impact: we can reduce the risk of AI takeover. The value of doing this is our potential risk reduction (the change in the probability of avoiding AI takeover due to our actions) multiplied by the default value of the future conditional on no AI takeover.
- Trajectory impact: we can improve the value of the future conditional on no AI takeover—e.g., by preventing AI-enabled coups or improving societal epistemics. The value of doing this is our feasible value increase (the improvement in the value of the future conditional on no AI takeover due to our actions)[2] multiplied by the default likelihood of avoiding AI takeover.
Here’s a geometric visualization of these two types of impact.
For simplicity, we assume that strategies either reduce takeover risk or improve the value of the future conditional on avoiding AI takeover. In reality, many strategies have both kinds of benefits. In such circumstances, the “trajectory impact” is the value of the future after intervention multiplied by the probability of avoiding AI takeover after intervention; in the diagram, the green rectangle would be taller.
Arguments about timelines and leverage often focus on which scenarios offer the best opportunities to achieve the largest reductions in risk. But that’s just one term that you need to be tracking when thinking about takeover impact—you also need to account for the default value of the future conditional on preventing takeover, which is plausibly fairly different across different timeline scenarios.
The same applies to trajectory impact. You need to track both terms: your opportunities for value increase and the default likelihood of avoiding takeover. Both vary across timeline scenarios.
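A minimal numerical sketch of these two definitions, with placeholder numbers purely for illustration:

```python
# Takeover impact = (change in P(no AI takeover)) x (default value of the future | no takeover).
# Trajectory impact = (value increase | no takeover) x (default P(no AI takeover)).

def takeover_impact(risk_reduction: float, default_value_given_no_takeover: float) -> float:
    return risk_reduction * default_value_given_no_takeover

def trajectory_impact(value_increase: float, default_p_no_takeover: float) -> float:
    return value_increase * default_p_no_takeover

# E.g. a 1-percentage-point risk reduction in a world whose default value is 0.3
# (on a 0-1 scale), vs. a 0.02 value increase in a world with an 80% default
# chance of avoiding takeover:
print(takeover_impact(0.01, 0.3))    # 0.003
print(trajectory_impact(0.02, 0.8))  # 0.016
```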
In the rest of this note, we’ll argue:
- For work targeting takeover impact, short-to-medium timelines are the highest leverage. This is because:
- The default value of the future is highest on medium timelines, compared to either shorter or longer timelines (more).
- There’s probably greater potential risk reduction on shorter timelines than medium, although the strength of this effect varies depending on what resources people are bringing to bear on the problem. I expect that for people able to do direct work now, the total risk reduction available on short timelines is enough to outweigh the greater value of averting takeover on medium timelines. For funders, medium timelines still look higher leverage (more).
- For work targeting trajectory impact, long timelines are the highest leverage. This is because:
- The default probability of survival is higher for long timelines than medium ones and higher for medium timelines than short ones (more).
- We’re unsure whether the feasible value increase is higher or lower on long timelines relative to medium or short timelines. But absent confidence that it’s substantially lower, the higher survival probability means that longer timelines look higher leverage for trajectory impact (more).
In this section, we’ll survey some key considerations for which timelines are highest leverage for reducing AI takeover risk.
The default value of the future is higher on medium timelines than short or long timelines
Apart from AI takeover, the value of the future depends a lot on how well we navigate the following challenges:
- Catastrophic risks like pandemics and great power conflict, which could lead to human extinction. Even a sub-extinction catastrophe that killed billions of people could render the survivors less likely to achieve a great future by reducing democracy, moral open-mindedness, and cooperation.
- Extreme concentration of power—where just a handful of individuals control most of the world’s resources—which is bad because it reduces opportunities for gains from trade between different value systems, inhibits a diverse “marketplace of ideas” for different moral views, and probably selects for amoral or malevolent actors.
- Damage to societal epistemics and potentially unendorsed value change, which could come from individualized superhuman persuasion or novel false yet compelling ideologies created with AI labor.
- Premature value lock-in, which could come from handing off to AI aligned to a hastily chosen set of values or through individuals or societies using technology to prevent themselves or their descendants from seriously considering alternative value systems.
- Empowerment of actors who are not altruistic or impartial, even after a good reflection process.
We think the timing of the intelligence explosion matters for whether we succeed at each of these challenges.
There are several reasons to expect that the default value of the future is higher on longer timelines.
On longer timelines, takeoff is likely slower and this probably makes it easier for institutions to respond well to the intelligence explosion. There are a couple of reasons to expect that takeoff will be slower on longer timelines:
- There are different AI improvement feedback loops that could drive an intelligence explosion (software improvements, chip design improvements, and chip production scaling), these feedback loops operate at different speeds, and the faster ones are likely to be automated earlier (see Three Types of Intelligence Explosion for more on this argument).
- More generally, incremental intelligence improvements being more difficult, expensive, or time-consuming imply both longer timelines and slower takeoffs.
If takeoff is slower, institutions will be able to respond more effectively to challenges as they emerge. This is for two reasons: first, slower takeoff gives institutions more advance warning about problems before solutions need to be in place; and second, institutions are more likely to have access to controlled (or aligned) AI intellectual labor to help make sense of the situation, devise solutions, and implement them.
We think this effect is strongest for challenges that nearly everyone would be motivated to solve if they saw them coming. In descending order, this seems to be: catastrophic risks; concentration of power; risks to societal epistemics; and corruption of human values.
On longer timelines, the world has better strategic understanding at the time of takeoff. If AGI comes in 2035 or 2045 instead of 2027, then we get an extra eight to eighteen years of progress in science, philosophy, mathematics, mechanism design, etc.—potentially sped up by assistance from AI systems around the current level. These fields might turn up new approaches to handling the challenges above.
Wider proliferation of AI capabilities before the intelligence explosion bears both risks and opportunities, but we think the benefits outweigh the costs. On longer timelines, usage of models at the current frontier capability level will have time to proliferate throughout the economy. This could make the world better prepared for the intelligence explosion. For instance, adoption of AI tools could improve government decision-making. In some domains, the overall effect of AI proliferation is less clear. AI-generated misinformation could worsen the epistemic environment and make it harder to reach consensus on effective policies, while AI tools for epistemic reasoning and coordination could help people identify policies that serve their interests and work together to implement them. Similarly, AIs might allow more actors to develop bioweapons, but also might strengthen our biodefense capabilities. We weakly expect wider proliferation to be positive.
But there are also some important reasons to expect that the value of the future will be lower on longer timelines.
China has more influence on longer timelines. The leading AI projects are based in the west, and the US currently has control over hardware supply chains. On longer timelines, it’s more likely that Chinese AI companies have overtaken frontier labs in the west and that China can produce advanced chips domestically. On long timelines, China may even have an edge—later intelligence explosions are more likely to be primarily driven by chip production, and China may be more successful at rapidly expanding industrial production than the US.
On longer timelines, we incur more catastrophic risk before ASI. The pre-ASI period plausibly carries unusually high "state risk." For example, early AI capabilities might increase the annual risk of engineered pandemics before ASI develops and implements preventative measures. Great power conflict may be especially likely in the run-up to ASI, and that risk might resolve once one country gains a DSA or multiple countries use ASI-enabled coordination technology to broker a stable division of power. If this pre-ASI period does carry elevated annual catastrophic risk, then all else equal, the expected value of the future looks better if we spend as few years as possible in it—which favors shorter timelines.
In the appendix, we find that the value of the future is greater on medium timelines than shorter timelines or longer timelines. Specifically, we find that the value of the future conditional on medium timelines is about 30-50% greater than the value of the future conditional on short timelines and 55-90% greater than the value conditional on long timelines. The main factor that decreases the value of the future on longer timelines is the increased influence of authoritarian governments like China and the main factor that decreases the value of the future on shorter timelines is greater risk of extreme power concentration.
Shorter timelines allow for larger AI takeover risk reduction
To estimate the takeover impact across timelines, we need to compare these differences in value to the differences in potential AI takeover risk reduction.
There are two main reasons to expect larger potential AI takeover risk reduction on longer timelines.
First, over longer timelines, there’s more time to expand our resources—money, researchers, knowledge—through compounding growth. Money can be invested in the stock market or spent on fundraising or recruitment initiatives; researchers can mentor new researchers; and past insights can inform the best directions for future research.
Historically, the EA community has experienced annual growth in the range of 10-25%.[3] It is not implausible that this continues at a similar pace for the next 10-20 years given medium to long timelines. Admittedly, over long time periods, the movement risks fizzling out, drifting in its values, or losing its resources. These risks are real but can be priced into our growth rate.
On longer timelines, there will likely be more options for spending a lot of resources effectively. It’s more feasible to build up capacity by launching organizations, starting megaprojects, spinning up new research agendas, and training new researchers. There might also be qualitatively new opportunities for spending on longer timelines. For instance, slower takeoff makes it more likely we'll have access to aligned or controlled human-level AI labor for an extended period. Paying for AI employees might be dramatically more scalable than paying for human employees: unlike with hiring humans, as soon as you have AI that can do a particular job, you can “employ” as many AI workers to do that task as you like, all at equal skill level, without any transaction costs to finding those workers.
That said, there is an important countervailing consideration.
On longer timelines, there will likely be fewer great opportunities for risk reduction. One reason for this is that there are probably lower absolute levels of risk on longer timelines (see next section), which means that there's a lower ceiling on the total amount of risk we could reduce. Another reason is that there will simply be more resources devoted to reducing takeover risk. On longer timelines, governments, philanthropists, and other institutions will have had a longer time to become aware of takeover risk and ramp up spending by the time the intelligence explosion starts. Plus, given slower takeoff on longer timelines, they will have more time to mobilize resources once the intelligence explosion has started. These actors will probably have better strategic understanding on longer timelines, so their resources will go further. Thus on longer timelines, the world is more likely to adequately address AI takeover risk without our involvement.
Whether short or medium timelines are highest leverage depends on the resources or skillset being deployed
In the appendix, we estimate that the value of the future is 30-50% greater on medium timelines compared to short timelines, which suggests that an individual has higher leverage on short timelines if they expect that they can drive 30-50% more absolute risk reduction (e.g., if you think that you can reduce AI takeover risk by 1 percentage point on 2035 timelines, but >1.5 percentage points on 2027 timelines, then you have higher leverage on shorter timelines). (Of course, relative likelihood also matters when deciding what to focus on. If you assign significantly higher probability to one scenario, this could outweigh leverage differences.)
We find it plausible that high-context researchers and policymakers who can do direct work immediately meet this condition and should focus on shorter timelines. On short timelines, we'll probably be talent-bottlenecked rather than funding-bottlenecked—it seems difficult to find funding opportunities that can usefully absorb tens of billions of dollars when there's limited time to ramp up new projects or train and onboard new researchers. But these considerations cut the other way for funders. Longer timelines make it easier to deploy large amounts of capital productively. In part, this is simply because there's more time to scale up funding gradually. Also, slower takeoff is more likely on longer timelines. This raises the chance of an extended window when we can hire controlled human-level AI researchers to work on the most important projects, which is potentially a very scalable use of money to reduce risk.
Trajectory impact
Trajectory impact is our impact from improving the value of futures without AI takeover: this is the value increase due to our efforts on futures without AI takeover, weighted by the probability of avoiding AI takeover.
The default probability of averting AI takeover is higher on medium and longer timelines
There are three main considerations in favor.
On longer timelines, more resources will be invested in reducing AI takeover risk. The AI safety field is generally growing rapidly, and on longer timelines we’ll enjoy the gains from more years of growth. Additionally, on longer timelines takeoff is likely to be slower, which gives institutions more warning to start investing resources to prepare for the risk of AI takeover.
On longer timelines, society has accumulated more knowledge by the time of the intelligence explosion. Again, medium to long timelines mean that we get an extra eight to eighteen years of progress in science, philosophy, mathematics, cybersecurity and other fields, potentially accelerated by assistance from AI at the current capability level. These advances might provide useful insights for technical alignment and AI governance.
Beyond knowledge and resource considerations, we expect AI takeover risk is higher if takeoff is faster. The most viable strategies for preventing takeover rely on weaker, trusted models to constrain stronger ones—through generating training signals, hardening critical systems against attacks, monitoring behavior, and contributing to alignment research. These strategies will be less effective if capabilities increase rapidly; more generally, faster takeoff gives human decision-makers and institutions less time to react.
The most convincing countervailing consideration is that:
The current paradigm might be unusually safe. Current AIs are pretrained on human text and gain a deep understanding of human values, which makes it relatively easy to train them to robustly act in line with those values in later stages of training. Compare this to a paradigm where AIs were trained from scratch to win at strategy games like Starcraft—it seems much harder to train such a system to understand and internalize human goals. On short timelines, ASI is especially likely to emerge from the current paradigm, so this consideration pushes us directionally toward thinking that AI takeover is less likely on shorter timelines.
But overall, it seems quite likely that the level of risk is lower on longer timelines.
It’s unclear whether feasible value increase is greater or lower on different timelines
There are some reasons to expect that feasible value increase is greater on longer timelines.
We will have better strategic understanding. As discussed above, we’ll be able to take advantage of more years of progress in related fields. We’ll also have had the chance to do more research ourselves into preparing for the AI transition. We still have a very basic understanding of the risks and opportunities around superhuman AI, and some plausibly important threat models (e.g. AI-enabled coups, gradual disempowerment) have only been publicly analysed in depth in the last few years. Given another decade or two of progress, we might be able to identify much better opportunities to improve the value of the future.
We will probably have more resources on longer timelines (see discussion above).
But on the other hand:
We probably have less influence over AI projects on longer timelines. The AI safety community is currently closely connected to leading AI projects—many of us work at or collaborate with current front-runner AI companies. It's also easier to have greater influence over AI projects now when the field is less crowded. On shorter timelines, we likely have substantial influence over the projects that develop transformative AI, which amplifies the impact of our work: new risk reduction approaches we develop are more likely to get implemented and our niche concerns (like digital sentience) are more likely to get taken seriously. This level of influence will probably decrease on longer timelines. Government control becomes more likely as timelines lengthen, and we'd likely have a harder time accessing decision-makers in a government-run AI project than we currently have with AI lab leadership, even accounting for time to build political connections. If AI development shifts to China—which looks fairly likely on long timelines—our influence declines even more.
Conclusion
We've listed what we consider the most important considerations around which timeline scenarios are highest leverage. In general, when assessing leverage, we think it's important to consider not just your ability to make progress on a given problem, but also the likelihood of success on other problems that are necessary for your progress to matter.
We’ve specifically thought about timelines and leverage for two broad categories of work.
- Working on averting AI takeover (where we consider both our ability to reduce risk, and the default value of the future conditional on no AI takeover).
- Working on improving the value of the future, conditional on no AI takeover (where we consider both the feasible value increase and the default level of takeover risk).
For work aimed at averting AI takeover, we think the default value of the future is highest on 2035 timelines. However, for direct work by high-context people who can start today, we think reducing AI risk is more tractable on earlier timelines, and this tractability advantage overcomes the higher value of succeeding on longer timelines. We're less convinced this tractability advantage exists for funders, who we think should focus on timeline scenarios closer to 2035.
For work aimed at improving the value of the future, we think the default likelihood of avoiding AI takeover increases on longer timelines. We're unclear about whether the feasible value increase goes up or down with longer timelines, and so tentatively recommend that people doing this work focus on longer timelines (e.g., 2045).
We thank Max Dalton, Owen Cotton-Barratt, Lukas Finnveden, Rose Hadshar, Lizka Vaintrob, Fin Moorhouse, and Oscar Delaney for helpful comments on earlier drafts. We thank Lizka Vaintrob for helping design the figure.
Appendix: BOTEC estimating the default value of the future on different timelines
In this section, we estimate the default value of the future conditional on avoiding AI takeover, which is relevant to our takeover impact—the value of reducing the likelihood of AI takeover. See above for a qualitative discussion of important considerations.
Note that this section is only about estimating the default value of the future conditional on avoiding AI takeover for the purpose of estimating takeover impact. We’re not estimating the tractability of improving the value of the future, which is what we’d use to estimate the trajectory impact.
To get a rough sense of relative magnitudes, we're going to do a back-of-the-envelope calculation that makes a lot of simplifying assumptions that we think are too strong.
We’ll assume that the expected value of the future is dominated by the likelihood of making it to a near-best future, and that we are very unlikely to get a near-best future if any of the following happen in the next century:
- There is extreme power concentration (e.g., via an AI-assisted coup) where <20 people end up in control of approximately all of the world’s resources.
- There is a non-takeover global catastrophe that causes human extinction or significantly disrupts civilization in a way that reduces the value of the far future (e.g., reduces the power of democracies, causes a drift toward more parochial values). We are not including AI takeover or AI-assisted human takeover, but are including other kinds of catastrophes enabled by AI capabilities or influenced by the presence of the advanced AI systems, e.g., engineered pandemics designed with AI assistance or great power conflict over AI.
- Post-AGI civilization doesn’t have a sufficient number of people with sufficiently good starting values—i.e., values that will converge to near-best values given a good enough reflection process. This might happen if we have leaders who are selfish, malevolent, or unconcerned with pursuing the impartial good.
- There are major reflection failures. Reflective failures could involve: people prematurely locking in their values, suppression of ideas that turn out to be important and true, or the spread of memes that corrupt people's values in unendorsed ways.
- There are major aggregation failures. Even if there are a sufficient number of people with sufficiently good starting values, who have avoided reflective failure, perhaps they are unable to carry out positive-sum trades with other value systems to achieve a near-best future. Failures might involve threats, preference-aggregation methods that don’t allow minority preferences to be expressed (e.g., the very best kinds of matter getting banned), or the presence of value systems with linear returns on resources that are resource-incompatible[4] with “good” values.
Let's factor the value of the future (conditional on no AI takeover) into the following terms:
- Probability of avoiding extreme concentration of power, conditional on avoiding AI takeover.
- Probability of avoiding other catastrophic risks, conditional on avoiding AI takeover and concentration of power.
- Probability of empowering good-enough values (i.e., values that will converge to the "correct" values given that we avoid a reflective failure), conditional on avoiding AI takeover, concentration of power, and other catastrophic risks.
- Probability of avoiding a reflective or aggregation failure, conditional on avoiding AI takeover, concentration of power, and other catastrophic risks, and getting good-enough values.
- Probability of getting a near-best future, conditional on avoiding AI takeover, concentration of power, other catastrophic risks, and reflective or aggregation failures, and on getting good-enough values.
As a simplifying assumption—and because we don’t have a strong opinion on how the final term (e) varies across different AI timelines—we will assume that e is constant across timelines and set it to 1.
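As a sketch of how this factorisation multiplies out (the numbers below are placeholders, not Mia's or Will's estimates from the table that follows):

```python
# The BOTEC structure described above: the default value of the future
# (conditional on no AI takeover) as a product of the conditional
# probabilities (a)-(d), with the final term (e) fixed at 1.

def value_of_future(p_no_power_concentration: float,
                    p_no_other_catastrophe: float,
                    p_good_enough_values: float,
                    p_no_reflection_or_aggregation_failure: float,
                    p_near_best_given_rest: float = 1.0) -> float:
    return (p_no_power_concentration
            * p_no_other_catastrophe
            * p_good_enough_values
            * p_no_reflection_or_aggregation_failure
            * p_near_best_given_rest)

# Placeholder example: 0.8 * 0.95 * 0.5 * 0.5 = 0.19
print(value_of_future(0.8, 0.95, 0.5, 0.5))
```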
Parameter(All probabilities are conditional on avoiding AI takeover and avoiding the outcomes described in rows above)Values conditional on different timelinesReasoningMiaWillMiaWillProbability of avoiding extreme concentration of power in the next ~century2027: 76%80.1%China is more likely to get a DSA on longer timelines (say, 10%/30%/50% on short/medium/long timelines), and extreme power concentration is comparatively more likely given a Chinese DSA (say, ~50%)
The US becomes less likely to get a DSA on longer timelines (say, 85%/55%/35% on short/medium/long timelines).
Slower takeoff is more likely on longer timelines, and that is somewhat protective against coups; the world is generally more prepared on longer timelines. Coup risk in the US (conditional on US dominance) is something like 22%/10%/7% on short/medium/long timelines
Any country getting a DSA becomes less likely on longer timelines (I put 90% to 75% to 60%).
The US (rather than China) winning the race to AGI becomes less likely on longer timelines (95% to 80% to 50%).
Concentration of power risk, as we’ve defined it, is much higher in China than in the US (50% in China, vs 10%-20% in the USA, depending on the year).
The main risks I have in mind are war and bioweapons.
For bioweapons:
- On longer timelines, we're more likely to have policies, technical measures, etc., to reduce bio-risk in place by the time we get more advanced capabilities.
- But on longer timelines we get more diffusion of bio capabilities, so probably more person-years in which it’s possible to make a superweapon.
For war:
- It seems like there's some non-monotonicities.
- If takeoff is very fast & sudden, then maybe one side is able to get a DSA before the other side even notices that they might want to go to war.
- If takeoff is very slow, then more time to set up treaties, etc.
- Also on longer timelines, there’s more risk of a catastrophic great power war before the intelligence explosion.
Numbers are illegible guesses after thinking about these considerations.
I put the expected value of the future lost from catastrophes at around 3%, with half of that coming from extinction risk and half from rerun risk and trajectory change.
The number gets higher in longer timelines because there’s a longer “time of perils”.
It doesn’t overall have a big impact on the BOTEC.
2035: 96%97%2045: 95%95.5%Probability of empowering good-enough values2027: 57%16.0%I think probably I should have a smallish negative update on decision-makers’ values on shorter timelines because it seems like later timelines are probably somewhat better. (But this is small because we are conditioning on them successfully avoiding coups and AI takeovers).
On shorter timelines, I think that hasty lock-in of not-quite-right values seems particularly likely.
We’re conditioning on no extreme power concentration, but power concentration could be intense (e.g. to just 100 people), and I think power concentration is somewhat more likely on very short timelines (because of rapid takeoff) and very long timelines (because increased likelihood of a Chinese DSA).
Numbers are illegible guesses after thinking about these considerations.
Even conditioning on no concentration of power risk as we’ve defined it (<20 people), there’s still a big chance of extreme concentration of power, just not as extreme as <20 people. And, the smaller the number of people with power, the less likely some of them will have sufficiently good starting values.
This is much more likely conditional on China getting to a DSA first vs the US getting to a DSA first.
I place some weight on the idea that “broadly good-enough values” is quite narrow, and only initially broadly utilitarian-sympathetic people cross the bar; I see this as most likely in 2035, then 2027, then 2045.
2035: 65% (Mia), 19.5% (Will); 2045: 55% (Mia), 13.7% (Will)

Probability of avoiding a reflective or aggregation failure
2027: 50% (Mia), 34.5% (Will)
Reasoning:
Liberal democracy / an open society seems more likely on medium timelines, and that is probably good for ensuring a good reflective process.
There are considerations in both directions, but all things considered I’m more worried about meme wars/AI-induced unendorsed value drift on long timelines.
On longer timelines, takeoff was probably slower, and therefore a lot of the challenges like concentration of power, AI takeover, and other global catastrophic risks were probably “easier” to handle. So, since we're conditioning on society having solved those problems for this row, we should probably make a larger update on societal competence for shorter timelines than for longer timelines.
Numbers are illegible guesses after thinking about these considerations.
I feel particularly un-confident about this one.
Overall, an open society seems more likely to have the best ideas win out over time.
But perhaps post-AGI meme wars are really rough, and you need state-level control over information and ideas in order to avoid getting trapped in some epistemic black hole, or succumbing to some mind virus. This consideration favours China DSA worlds over US DSA worlds. (H/T Mia for this argument).
2035: 55% (Mia), 41.6% (Will); 2045: 40% (Mia), 40.1% (Will)

Multiplier on the value of the future, conditional on no AI takeover
2027: 0.21 (Mia), 0.044 (Will); 2035: 0.27 (Mia), 0.068 (Will); 2045: 0.14 (Mia), 0.043 (Will)

Based on Mia’s point estimates, the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2027 timelines is 0.27/0.21 = 1.29. So you should condition on 2027 timelines if you think you can drive a takeover risk reduction that would be 29% greater than the reduction you could drive in 2035.
Based on Will’s point estimates, the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2027 timelines is 0.068/0.044 = 1.54. So you should condition on 2027 timelines if you think you can drive a takeover risk reduction that would be 54% greater than the reduction you could drive in 2035.
Likewise, Mia found that the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2045 timelines was 0.27/0.14 = 1.93 and Will found it was 0.068/0.043 = 1.58. So you should condition on 2045 timelines over 2035 timelines if you think that you’ll be able to drive a 58-93% greater risk reduction on 2045 timelines.
So, overall, medium timelines look higher value than short or longer timelines, and shorter timelines perhaps look higher value than longer timelines.
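For concreteness, here is a minimal Python sketch of the ratio arithmetic above; the multipliers are the point estimates from the table, and nothing else is assumed:

```python
# Point-estimate multipliers on the value of the future, conditional on no AI takeover
# (taken directly from the table above).
multipliers = {
    "Mia":  {2027: 0.21, 2035: 0.27, 2045: 0.14},
    "Will": {2027: 0.044, 2035: 0.068, 2045: 0.043},
}

for person, m in multipliers.items():
    print(f"{person}: 2035 vs 2027 = {m[2035] / m[2027]:.2f}, "
          f"2035 vs 2045 = {m[2035] / m[2045]:.2f}")
```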
Doing this exercise impacted our views in various ways:
- Initially, we thought it was likely that the value of the future, conditional on no takeover, would be higher in longer timeline worlds, because short timeline worlds just seem so chaotic.
- But now this seems really non-obvious to us: using a slightly different model, or creating these estimates on a different day, could easily lead us to switch to estimating that the value of the future conditional on no AI takeover is highest on the short timelines.
- In particular, we hadn’t fully appreciated the impact of the consideration that China is more likely to get a DSA on longer timelines.
- So we see the most important point as a conceptual one (that we need to take into account the value of the future conditional on no takeover as well as reduction in takeover risk), rather than the quantitative analysis above.
We’ll note again that this BOTEC was not about trajectory impact, which would probably strengthen the case for medium-to-long timeline worlds being higher-leverage.
This article was created by Forethought. Read the original on our website.
- ^
Formally:
ΔEV = [(Px − Pb)·Kb + Px·(Kx − Kb)] ∣ F
Where Pb is the likelihood of avoiding AI takeover and Kb is the value of the future conditional on avoiding AI takeover, if we take some baseline action b (e.g., doing nothing), while Px and Kx are those values if we take the optimal action x with our available resources. F is some particular timeline.
We assume for simplicity that we cannot affect the value of the future conditional on AI takeover. For Kx and Kb, we measure value on a scale where 0 is the value of the future conditional on AI takeover, and 1 is the value of the best achievable future. This is based on a similar formula derived here.
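As a minimal sketch of how this formula is applied (the numbers below are made-up placeholders, not estimates from the post):

```python
def delta_ev(p_b, k_b, p_x, k_x):
    """Expected-value gain from the optimal action x over the baseline b within a fixed
    timeline F: the first term is the gain from raising the probability of avoiding takeover,
    the second is the gain from improving the value of the future conditional on no takeover."""
    return (p_x - p_b) * k_b + p_x * (k_x - k_b)

# Hypothetical illustration only, on the 0-to-1 value scale defined above:
# baseline 70% chance of avoiding takeover and a future worth 0.20,
# versus 75% and 0.25 after spending our resources on action x.
print(delta_ev(p_b=0.70, k_b=0.20, p_x=0.75, k_x=0.25))  # 0.0475
```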
- ^
I'm measuring value on a scale where 0 is the value of the future conditional on AI takeover, and 1 is the value of the best achievable future.
- ^
Benjamin Todd estimated that the number of EAs grew by 10-20% annually over the 2015-2020 period. The amount of committed dollars grew by 37% annually (from $10B to $46B) over 2015-2021, but this number included $16.5B from FTX. When we subtract funds from FTX, the annual growth rate is 24%.
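A minimal sketch of that growth-rate arithmetic, assuming annualized (compound) growth over five periods:

```python
start, end, ftx, periods = 10.0, 46.0, 16.5, 5   # committed $B; five growth periods assumed

cagr = (end / start) ** (1 / periods) - 1
cagr_ex_ftx = ((end - ftx) / start) ** (1 / periods) - 1
print(f"annual growth with FTX: {cagr:.0%}, excluding FTX: {cagr_ex_ftx:.0%}")
```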
- ^
Two value systems are resource-compatible if they can both be almost fully satisfied with the same resources.
Discuss
Principles for Meta-Science and AI Safety Replications
If we get AI safety research wrong, we may not get a second chance. But despite the stakes being so high, there has been no effort to systematically review and verify empirical AI safety papers. I would like to change that.
Today I sent in funding applications to found a team of researchers dedicated to replicating AI safety work. But what exactly should we aim to accomplish? What should AI safety replications even look like? After 1-2 months of consideration and 50+ hours of conversation, this document outlines principles that will guide our future team.
I. Meta-science doesn’t vindicate anyone
Researchers appear to agree that some share of AI safety work is low-quality, false, or misleading. However, everyone seems to disagree about which papers are the problematic ones.
When I expressed interest in starting a group that does AI safety replications, I suspect some assumed I would be “exposing” the papers that they don’t approve of. This is a trap and it is especially important for us, as the replicators, not to fall into it. If our replications tend to confirm our beliefs, that probably says more about our priors than the papers we are studying.
II. Searching for “bad” papers is like searching for “haunted” houses
Consider a team of researchers trying to find examples of haunted houses. They could investigate suspicious buildings or take tips from people who have witnessed paranormal activity. They could then publish reports of which houses you should definitely avoid. But the issue is that ghosts aren’t real. What they would be finding is a convincing story, not the underlying truth.
Trying to find “bad” papers will be like finding haunted houses. If given a mandate to find papers that don’t replicate, we will find them. But the uncomfortable truth is that genuinely influential papers that are straightforwardly, objectively wrong are rare. The empirical claims are likely true in some sense, but don't tell the full story. The goal is to tell the full story, not to declare which houses have ghosts.
III. Research doesn’t regulate itself
Even when researchers are especially disciplined, they are incentivized to frame their papers around their successes while burying their limitations. Likewise, when designing evaluations, researchers are incentivized to measure the properties they are proud of rather than those they wish would go away.
I’ve heard the arguments that we don’t need peer review. Authors can accept feedback and update arXiv. Or ideas can duel in the LessWrong comment section. But I don’t think either of these is enough.[1] They both assume that:
- Non-authors are going to engage enough with the work to confirm findings or discover limitations.
- Authors are going to be open to accepting valid critiques and updating their paper.
#1 is unrealistic. #2 is also often unrealistic and arguably unreasonably burdensome to authors of the work. For example, should an author with 50 papers have to litigate every critique and correct every flaw across dozens of papers and several years of research?
IV. Replications are more than repeating the experiments
For any paper that releases code, “replicating” figures or statistics should be trivial (we would hope). But just because statistics replicate, that doesn’t mean the effect is real. We want to look closely at the paper and ask:
- Do the claims fail under a statistical test?
- Is the property specific to a single model or model family?
- Are there any limitations that are clear in the code but undocumented in the paper?
- Do the authors evaluate against a baseline? Did they implement the baseline properly?
- etc…[2]
Our philosophy is to start from scratch, carefully implement the paper exactly as it's written, and see if we get the same results. After that, we will poke around a little and see if anything looks weird.
V. The replication is just as dubious as the paper itself
If we can’t replicate something, could that mean we are just doing something wrong? Yes, of course! We will obviously try to avoid this case, and contact the authors to get feedback if things aren’t working. If we can isolate why things aren’t working, that can be a finding in itself (X only happens with a really big batch size on small models). If we try hard and cannot figure out why things aren’t working, it eventually makes sense to write something up saying:
- This is what we did.
- It didn’t work even though we tried X, Y, and Z things.
- We don’t know why this happened, so it's unclear what the implications are.
Our plan is to hire people for in-person fellowships, and eventually, full-time roles. One of the most common comments I get on this is some version of “Why don’t you outsource replications to the community?” or “Why not offer bounties for replications instead of doing them yourself?”
The answer is the incentives aren’t there. After we run a pilot this summer, we would like to complete more ambitious replications (e.g., replicating this or this). Offering bounties at this scale is logistically difficult because even for a minimal replication, compute alone could be thousands of dollars.
Selecting which papers to replicate is perhaps a place where a decentralized approach is more principled. We have a framework for prioritizing papers,[3] but we're also exploring ways for the community to vote on which papers we replicate to reduce selection bias.
VII. We are all adults here
I would expect most replications to take the form of “everything works and we found 0-2 extremely minor issues.” But doing this kind of work inevitably involves sometimes challenging claims made in papers. This is difficult, but replications should state concerns directly. Giving any critique of another's work publicly is stressful, but reasonable people won’t hold it against you when it’s in good faith.
We will take the feedback of authors seriously, but we may not always converge on agreement. In these cases, we will attach an author’s comment to our research.
VIII. Feedback is everything
A group that replicates AI safety papers really exists for a single reason: to be useful to the community. That means we value your feedback and we hang onto every word. Please let us know what you think.
If you want more details about what we are planning, I'm happy to send over our proposal. If you are interested in our summer research fellowship, you can express interest here.
- ^
And for what it's worth, I don't even think peer review is enough.
- ^
I really like Maksym Andriushchenko's twitter thread on this.
- ^
The tl;dr is there are five criteria we plan to use:
- Narrative influence: How much does the paper shape how we talk about AI safety to each other and the world? Does it have any important implications for policy or frontier lab practices?
- Research dependence: Does other critical research depend on this paper?
- Vulnerability: Does the paper seem unlikely to replicate or generalize?
- Difficulty: How much time and money would the replication take?
- Recency: Having replications completed within a few weeks of the paper’s release can help the research community iterate quickly. However, ensuring that we do not lower our standards for speed is equally important.
Discuss
Inconsistent Human Values are Good News for Alignment
Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.
Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?
My below calculations show that it would be much harder, if not impossible. I'm going to try to defend the claim that:
Human values are alignable only because evolution compressed motivation into a small number of low-bandwidth bottlenecks[1], so that tiny genetic changes can change behavior locally.
If behavior is driven by a high-dimensional reward vector r ∈ R^n, inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection r ↦ s ∈ R^k with small k, inference may become tractable.
A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.
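To make the lossy-compression picture concrete, here is a minimal Python sketch (the dimensions n = 1000 and k = 10 are illustrative assumptions, not claims about the brain): it projects a high-dimensional reward vector through a random k-dimensional bottleneck and measures how often the compressed preferences disagree with the full ones on random pairwise choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 10                                     # ambient value dimension vs. bottleneck size (illustrative)

r = rng.standard_normal(n)                          # "true" high-dimensional reward vector
B = np.linalg.qr(rng.standard_normal((n, k)))[0].T  # k orthonormal rows: the low-bandwidth bottleneck
r_low = B.T @ (B @ r)                               # what survives compression through the bottleneck

# Compare the choices implied by the full values vs. the compressed values on random item pairs.
x1 = rng.standard_normal((5000, n))
x2 = rng.standard_normal((5000, n))
full = np.sign((x1 - x2) @ r)
compressed = np.sign((x1 - x2) @ r_low)
print("fraction of pairwise choices where the compressed values disagree:",
      np.mean(full != compressed))
```

Behavior generated this way only constrains the k directions that survive the bottleneck, which is the sense in which the compressed signal is much easier to learn while discarding most of the underlying detail.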
Calculation with Abbeel & Ng's formula[5] gives the following lower bounds on the number of expert trajectories m (a quick numerical check follows the parameter list below):

γ = 0.9 (short horizon): k = 10 → m > 1.2×10⁶; k = 1000 → m > 2×10⁸
γ = 0.99 (long horizon): k = 10 → m > 1.2×10⁸; k = 1000 → m > 2×10¹⁰

with
- m: number of expert demonstrations (trajectories)
- k = 1000 if values are complex; k = 10 if values are low-dimensional
- γ = 0.9 for a short horizon (~10 steps); γ = 0.99 for a long horizon (~100 steps)
- ϵ = 0.1 (10% regret tolerance)
- δ = 0.05 (95% confidence)
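The sketch below evaluates the bound for these settings (assuming the natural logarithm); it reproduces the table entries above to within rounding.

```python
import math

def min_trajectories(k, gamma, eps=0.1, delta=0.05):
    """Abbeel & Ng-style lower bound: m >= 2k / (eps*(1 - gamma))^2 * log(2k/delta)."""
    return 2 * k / (eps * (1 - gamma)) ** 2 * math.log(2 * k / delta)

for gamma in (0.9, 0.99):
    for k in (10, 1000):
        print(f"gamma = {gamma}, k = {k}: m >= {min_trajectories(k, gamma):.1e}")
```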
If you need at least 20 billion samples to learn complex values, we are doomed. But it may become solvable if the number of required trajectories drops by a factor of about 200 (depending on how high-dimensional you think values are; 1000 is surely conservative, since if values are actually learned, the relevant dimension would be the number of parameters of the model).
This would explain why constitutional AI works better than expected[6]: a low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The ~200x reduction doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories, and the structure of the bottleneck mapping[9] can still kill us.
How can we test whether the dimensionality of human values is actually low? We should see diminishing returns in predictability, for example when using N pairwise comparisons of value-related queries. The gains should drop off somewhere between N ≈ O(k log k) and O(k²).
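Here is a minimal, hypothetical simulation of that test (the specifics, such as linear utilities, Gaussian features, and a min-norm least-squares fit, are my own assumptions for illustration): items vary along either 10 or 500 effective directions, a judge answers pairwise comparisons with a fixed linear utility, and we fit those comparisons and track held-out accuracy as N grows. When the effective dimension is low, accuracy should saturate after far fewer comparisons.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # ambient feature dimension of the queries

def accuracy_vs_samples(dim_effective, sample_sizes, n_test=2000):
    """Held-out accuracy of a linear fit to N pairwise comparisons, for items whose
    behaviorally relevant variation spans only dim_effective directions."""
    w = rng.standard_normal(n)                        # the judge's true preference weights
    basis = rng.standard_normal((dim_effective, n))   # directions along which items actually vary

    def comparisons(num):
        x1 = rng.standard_normal((num, dim_effective)) @ basis
        x2 = rng.standard_normal((num, dim_effective)) @ basis
        d = x1 - x2
        return d, np.sign(d @ w)                      # which of the two items the judge prefers

    d_train, y_train = comparisons(max(sample_sizes))
    d_test, y_test = comparisons(n_test)
    accs = []
    for N in sample_sizes:
        w_hat, *_ = np.linalg.lstsq(d_train[:N], y_train[:N], rcond=None)  # min-norm fit
        accs.append(round(float(np.mean(np.sign(d_test @ w_hat) == y_test)), 3))
    return accs

sizes = [10, 30, 100, 300, 1000]
print("N:", sizes)
print("low-dimensional values  (k = 10): ", accuracy_vs_samples(10, sizes))
print("high-dimensional values (k = 500):", accuracy_vs_samples(n, sizes))
```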
- ^
I'm agnostic about what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering system model and the limited number of brain regions that are influenced. See my sketch here.
- ^
AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:
Humans often have contradictory values … human personal identity is not strongly connected with human values: they are fluid… ‘human values’ are not ordered as a set of preferences.
- ^
Instruction-following AGI is easier and more likely than value aligned AGI:
Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.
- ^
In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:
[Drexler]: “It seems impossible to define human values in a way that would be generally accepted.” ...
[Yampolskiy]: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. ...
In comparison to that Gordon Worley offers the intuition that there could be a low-dimension structure:
[Gordon Worley]: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure.
- ^
Abbeel & Ng give an explicit bound for the required number of expert trajectories:
it suffices that m ≥ (2k / (ϵ(1−γ))²) · log(2k/δ)
with
- m: number of expert demonstrations (trajectories)
- k: feature dimension
- γ: discount factor (determines horizon)
- ϵ: target accuracy parameter
- δ: failure probability (confidence level)
- ^
Constitutional RL is both more helpful and more harmless than standard RLHF.
- ^
This aligns with expectations, as head_0 corresponds to the eigenvector with the largest variance, i.e., the most informative direction. Furthermore, among the top 100 heads [of 2048], most of the high-performing heads appear before index 40, which aligns with PCA’s property that the explained variance decreases as the head index increases. This finding further supports our argument that PCA can approximate preference learning.
DRMs represent diverse human preferences as a set of orthogonal basis vectors using a novel vector-based formulation of preference. This approach enables efficient test-time adaptation to user preferences without requiring additional training, making it both scalable and practical. Beyond the efficiency, DRMs provide a structured way to understand human preferences. By decomposing complex preferences into interpretable components, they reveal how preferences are formed and interact.
Rethinking Diverse Human Preference Learning through Principal Component Analysis
- ^
retaining just 4 components (≈15% of total variance) reproduces nearly the full alignment effect.
...
By combining activation patching, linear probing, and low-rank reconstruction, we show that preference alignment is directional, sparse, and ultimately localized within a mid-layer bottleneck.
Alignment is Localized: A Causal Probe into Preference Layers
- ^
Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.
Discuss
A quick, elegant derivation of Bayes' Theorem
I'm glad I know this, and maybe some people here don't, so here goes.
P(A and B) = P(A)⋅P(B∣A)
P(B and A) = P(B)⋅P(A∣B)
Order doesn't matter for joint events: "A and B" refers to the same event as "B and A".
Set them equal: P(B)⋅P(A∣B) = P(A)⋅P(B∣A). Divide by P(B): P(A∣B) = P(B∣A)⋅P(A) / P(B). And you're done! I like substituting hypothesis (H) and evidence (E) to remember how this relates to real life: P(H∣E) = P(E∣H)⋅P(H) / P(E)
You might also want to expand the denominator using the law of total probability, since you're more likely to know how probable the evidence is given different hypotheses than in general: P(H_i∣E) = P(E∣H_i)⋅P(H_i) / Σ_j P(E∣H_j)⋅P(H_j)
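If it helps to see the numbers move, here is a minimal Python sketch of the posterior calculation with the expanded denominator; the priors and likelihoods are made up purely for illustration.

```python
# Made-up numbers for illustration: two hypotheses and the probability of
# some evidence E under each.
priors = {"H1": 0.7, "H2": 0.3}        # P(H_j)
likelihoods = {"H1": 0.1, "H2": 0.8}   # P(E | H_j)

# Law of total probability: P(E) = sum_j P(E | H_j) * P(H_j)
p_evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Bayes' theorem for each hypothesis: P(H_i | E) = P(E | H_i) * P(H_i) / P(E)
posteriors = {h: likelihoods[h] * priors[h] / p_evidence for h in priors}

print(p_evidence)  # 0.31
print(posteriors)  # {'H1': ~0.226, 'H2': ~0.774}
```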
Discuss
The World Hasn't Gone Mad
In June 2025, Kalshi unveiled an ad campaign with the slogan “The world’s gone mad, trade it.” It was one of the first TV ads ever to be entirely generated by AI, and its content was quickly met with a slew of parodies and jokes all over the internet.
I must admit the ad was quite funny. I admire Kalshi and Polymarket as businesses, but I have some concerns about what the ad is trying to accomplish.
It’s not that you can’t lose an enormous amount of money on prediction markets, but the win rate and general “fairness of the game” are much more favorable for the average trader. Prediction markets also provide genuine utility in the form of calibrated forecasts (i.e. events the market prices at X% resolve YES about X% of the time), unlike sportsbooks, whose miscalibrated odds are essentially meaningless. Therefore, we should be shifting away from sportsbooks and towards PMs, while also trying not to fuel gambling addictions, which many hoped would be helped by more reasonable advertising, not the slop we’ve been getting recently.
Sportsbooks such as Fanduel and DraftKings should not be legal. I believe, as most do, that a country’s first and foremost goal is to protect its citizens. What are we allowing now? We are permitting degens who aren’t using any form of logic or reasoning to take negative-EV positions against a house that needs to rig special offers to turn a profit, like a casino (which should also not be legal, but that’s beside the point).
Sportsbooks are negative EV because they slightly inflate the probabilities so that they sum to more than 100%. If the true probability of a football game is 50% for each team, a sportsbook will quote implied probabilities of something like 52% and 52%. Here’s an expected value calculation in that scenario:
EV = P(A) * (net profit if A) + P(B) * (loss if B)
net profit if A = 1/0.52 - 1 = 0.92308
loss if B = -1
EV = 0.5 * 0.92308 + 0.5 * (-1)
EV = 0.46154 - 0.5
EV = -0.03846
No matter how skilled you are, it’s highly improbable to win unless you manipulate bonuses.
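For readers who want to play with the numbers, here is a minimal Python sketch of the same calculation, parameterized by the book's implied probability; the function name and values are illustrative, not from the post.

```python
# Illustrative only: EV of a 1-unit sportsbook bet on a fair 50/50 game where
# the book quotes an implied probability of 52% on each side.
def sportsbook_ev(true_prob: float, implied_prob: float, stake: float = 1.0) -> float:
    """Expected value of betting `stake` at odds implying `implied_prob`
    (decimal odds = 1 / implied_prob) on an outcome with probability `true_prob`."""
    net_profit_if_win = stake * (1 / implied_prob - 1)
    return true_prob * net_profit_if_win - (1 - true_prob) * stake

print(sportsbook_ev(0.5, 0.52))  # ~ -0.038: you expect to lose ~3.8% of each stake
print(sportsbook_ev(0.5, 0.50))  # 0.0: a vig-free bet is break-even
```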
Prediction markets offer bets against counterparties such as human traders, bots, and AMMs. On average, PM positions are neutral expected value.1 They offer more opportunities to make a profit by using advanced strategies and unique worldviews. This is the core difference between PMs and sportsbooks, and it’s why they should be treated as their own legal class.
But an alien in a Kalshi jersey downing a pint of beer seems less like a display of the product and more like a heartless, thrifty way to attract what some people would call “dumb money.” While this may be good for smarter traders, it leans toward the unethical side of degenerate gambling, and probably doesn’t help Kalshi’s legal cases.
Modern prediction market exchanges should remain venues for trading futures contracts, not hubs for mindlessly depreciating the value of your bank account.
If you enjoyed this, please consider subscribing to my blog!
Discuss
Like night and day: Light glasses and dark therapy can treat non-24 (and SAD)
Epistemic status: n=1, strong, life changing results.
TLDR: Light glasses, in combination with turning all your lights red at night, and optionally melatonin, can treat non-24. Light glasses can also be a competitive alternative to lumenators for SAD.
My non-24 before this treatment:
Data taken from my CPAP.Vertical lines are sleep periods; the x-axis is individual days, and the y-axis is the time in the day.
Notice how my sleep keeps wrapping around every two weeks.
And after:
Like night and day.What is non-24?Non-24 is "non-24-hour sleep disorder." Healthy people's bodies tell them to go to sleep and wake up at the same time every day, i.e. every ~24 hours. For people with non-24, however, these drift around, such that if you wake up at 8 am one day, you wake up at 9am the next day, then 10 am the next, and so on until you're waking up at midnight a couple weeks later. This is akin to having a daily circadian rhythm length of 25 hours, compared to most people's ~24; hence "non-24." This is a pretty awful problem to deal with, since either half the time you are in the middle of your day when everyone else is winding down or asleep, or you are sleep deprived!
Aside: How do sleep rhythms even work?
There are (at least) three rhythms:
"Process C" (for Circadian), the driver of wakefulness: There is a little clock in your brain called the suprachiasmatic nucleus, which tracks how long your day/night cycle is and sends signals for you to be awake. It takes the majority of its cues from ipRGCs, a type of cell in your eyes that controls pupillary reflex but contributes little to color vision. Blue and green light activate them very strongly, and red light activates them almost not at all.
"Process S" (for Sleepiness), the driver of sleepiness: This sends signals for you to go to sleep, based on the buildup of various metabolites in your brain. They accumulate steadily over the day and are what causes active drowsiness (while the "sleep" signal of Process C is just the withdrawal of the wakefulness signal). Antagonism of the receptors that check for those metabolites relieves this sleep pressure; this is how caffeine works, by blocking receptors for adenosine, one of the major metabolites.
There are also ultradian cycles, or "basic rest-activity cycles." They vary between people, but are generally 80-120 minutes long. During each one, there is a "sleep gate" of about 20 minutes, which is the best time to go to sleep. If you miss it, it will be harder to go to sleep, and you will probably have to wait for the next one to come around.
My history with non-24I have had non-24 for the last 10-15 years. The length of most people's circadian rhythm is specifically around 24.2 hours. As you can see in the chart at the beginning, it is more like 26 or 26.5 for me. As you can imagine, this is generally incompatible with participating in normal human society.
Now, after implementing the protocol, I wake up and go to sleep within roughly the same 6 hour period every day, even in winter. It's a miracle!
I think this began when I was a teenager. At that age, most people naturally stay up later; however, some combination of genetics and constant electronics use probably led to delayed sleep phase disorder (DSPD), aka staying up very late, till 3 or 4 am in my case. The constant sleep deprivation tanked my grades in high school, and I barely scraped through graduation. This developed into full non-24 in my first year of college, and was a major contributing factor in me dropping out at the end of the year.
This pattern endured through two months of sleeping outdoors, trying college again, and several years of working a night shift job and living in a van. It only started ending when I got the Luminette 3s in the winter of 2023, but I didn't fully commit until July 2025. Within two weeks, after mandating for myself several hours of light glasses use during the day and total red light therapy after sunset, my non-24 was completely under control.
(See the addendum at the bottom for recent developments!)
The VLiDACMel protocolAlmost everything in this post depends on insights from the VLiDACMel protocol. You can read it here.
VLiDACMel is an informal protocol developed by Stephen Karl Larroque to treat his own non-24. It successfully treated his and another person's, and is what I based my own treatment plan off of. It stands for "Very long Light therapy, Dark therapy, Avoid Carbohydrates when melatonin is high in the blood, and Melatonin." It is primarily intended for non-24, but touches on treatment for:
- seasonal affective disorder (SAD),
- night owls (delayed sleep phase disorder/DSPD),
- early risers (advanced sleep phase disorder/ASPD),
- night shift work disorder, and
- chronic jet lag
You can find where it touches on this by doing a Ctrl-F for "Seasonal affective" and "Adaptations for". It has a two-minute quickstart version; if you want to really get into the document, I recommend starting there.
In this post I will only be going into detail about how I suggest treating non-24 and SAD, and how non-24 treatment worked out for me specifically. I do believe the other disorders can benefit from the practical advice I give here, though; a lot of this advice maps pretty smoothly onto DSPD.
I never got the Avoid Carbohydrates part of the protocol, which involves not eating carbs while melatonin is high in the blood late at night, to work for me - I didn't notice a difference, but also I wasn't trying very hard - so I will be skipping that as well.
VLiDACMel practical overviewLight therapy (SAD, non-24, DSPD)- Get light glasses like the Luminette 3, and put them on within 1-2 hours of waking up.
- For SAD, wear them for 30-90 minutes.
- For non-24/DSPD, wear them 2-8 hours on the lowest setting.
Total cost: $200 for the Luminette 3
Dark/red therapy (non-24, DSPD)- Get red laser safety glasses (like these) and wear them for the rest of the night around or after sunset (except while driving).
- Change all your lightbulbs to RGB bulbs that can do pure red (I use these, though I haven't had success automating them) and have them switch over to pure red after sunset.
- Turn all your devices red after sunset with things like f.lux set to the 1200K color temperature, or setting up a shortcut on your phone to turn on a pure red color filter.
- Put red filter film (like this) on all the other light sources you can't control, like ones on ovens, microwaves, routers, etc.
Total cost: ~$20 for the red film and red glasses, plus ~$40 per set of 4 RGB bulbs
Melatonin (non-24, DSPD)Optionally: experiment with different melatonin dosages between 0.3-3 mg, with different timings: 12-15 hours before wakeup or 3-5 hours before sleep time to advance your circadian nighttime, or 1 hour before sleep time to induce drowsiness.
Safety notesDO NOT USE LIGHT OR RED GLASSES WHILE DRIVING or during other dangerous vision tasks. The VLiDACMel document also says that those with epilepsy, certain eye disorders, or motor disorders like restless legs or periodic limb movement, might suffer from using light glasses; Ctrl-F in the document for "contra-indicat" and "important health note," and consult with your doctor.
For people interested in treating SADWhat is SAD?SAD is "seasonal affective disorder," i.e. getting depressed during winter due to not getting a lot of light. Around these parts, there are many descriptions of how to deal with this, generally by wiring up a bunch of extremely bright lights ("lumenators") inside your home. I suggest below that light glasses could be competitive with lumenators for treating SAD.
Light glasses for SADLuminette 3s are vastly better than light boxes. They are highly portable; they're cheap enough at $200; you don't reduce their effectiveness by looking at other things; they don't require you to do nothing but stare at a light for half an hour.
They can also be competitive with lumenator setups at home. You can take them to work. They are cheaper in capital and energy costs, and in the hassle required to set them up. They do not suffer issues from looking away from the light. You don't have to live in a place lit up like a film set (unless you're into that).
There are drawbacks. You are now wearing chunky glasses above your eyes for 1-2 hours a day. You have to keep them charged (Luminette batteries hold about 6-8 hours of charge). You could drop them.
There are other light glasses, but the Luminettes are the most studied ones, and the ones with the longest history. You can get other ones like the AYO if you want though. There are even cheaper ones on AliExpress, but I was told that their light spectra are not as well optimized and could be dangerous to the eyes.
They do not blind you while you wear them unless you are in a very dark place, or looking in dark nooks like inside a washing machine.
To use them for SAD, put them on within 1-2 hours of waking up and keep them on for 20-90 minutes. I recommend getting a large sunglasses carrying case with a clip so you can easily put them away when you are done; they are not as robust against damage as a pair of modern eyeglasses, and the carrying case they come in doesn't have a clip.
People who don't wear glasses might want to try ear hooks or headbands to keep the light glasses from slipping off, though I haven't tried these myself.
The adjustable nose rest on the Luminettes can break occasionally, so I suggest picking up a spare in case that happens.
If you want this effect while driving, there is a product called Drive, but I give no endorsement, as I have not used or researched it. It is a different product from just using the Luminettes while driving, presumably designed so it won't hamper your vision.
For people interested in treating non-24 or DSPD (staying up too late)How I implement VLiDACMelVery long Light therapyThe section on SAD has some helpful information about using Luminette 3s, which is worth reading even if you're only here for non-24.
When I wake up, as soon as I think to, I put on the Luminettes on top of my glasses and turn them on to the lowest setting. I only take them off to shower, to drive, or to look in dark nooks. I use a glasses carrying case clipped to my purse to store them when I am out and about. I keep them on until the sun sets (US eastern), but generally at least 4 hours a day.
I think I might have used them too much a few times, such that my wake time went backwards into the time I wanted to be sleeping. But I think this generally followed after using them for 8 hours a day, several days in a row. It's hard to tell the difference between this and other effects, but I'm writing it here in case someone else encounters this. Edit: see the addendum for developments on this!
Dark/red therapyAround or after sunset:
- f.lux turns on (having been set to 1200K under the "Change current color" setting)
- I set all my RGB lightbulbs to go red via the bulbs' app
- I tap the Shortcut I made on the main screen of my iPhone, which turns Color Filters on and off[1]. The filter is set to pure red color tint. This gives a much stronger red than the mild orange filter Night Shift gives.
- Sometimes I put a sheet of red film over my laptop's screen if I feel f.lux is not red enough for me that night.
- If I am not at home and am not driving, I put on the red laser safety glasses I keep in an additional carrying case attached to my purse.
I have cut off some of the red film sheet and taped it over the main uncontrollable sources of white light in my apartment, which are the microwave dial and the light inside the fridge.
MelatoninWhen I feel the need for drowsiness, I use .3mg Life Extension melatonin, breaking the capsule open under my tongue about an hour before I want to feel drowsy. This may also be helping to shift my circadian sleep time earlier, but it's hard to tell.
For the administration route, VLiDACMel says sublingual is the best and asserts that melatonin rapidly expires with air contact. The latter claim is why I got capsules, rather than tablets, to avoid expiry. However, I looked into the studies it cites for that, and it only seems to be true if the melatonin is in a solution. Solid melatonin, according to a couple other studies, is very stable. So I might switch to sublingual tablets once I run out of my current bottle.
Lived experienceI started the protocol around July 24 2025, as seen in the chart at the beginning. The protocol seems reasonably robust against slip ups. If I neglect the light glasses or using red light for a day or two, the result is just drifting a little, and resuming the protocol puts me back on track.
My friends rib me for basically living in a red room at night, but I view it as an acceptable trade to enable living like a normal human. You get used to everything being various shades of red pretty quickly:
What my bedroom looks like at night. Perhaps I should take up developing photos?I wouldn't be worried about seeming strange in public for wearing the Luminettes; people are more interested than anything, if they comment at all.
And of course - it is so GOOD to have this under control! I can work a normal job now! I can show up to meetups consistently! I just feel overall more normal! I cannot emphasize enough how fantastic this is.
AcknowledgementsI owe a massive debt of gratitude to Stephen Karl Larroque who actually wrote the VLiDACMel document and did the research. Stephen, if you ever read this, from the bottom of my heart, seriously: thank you.
Thank you to all the people at the East Coast Rationalist Megameetup who asked me about what that funny light thing I was wearing was.
And thank you to members of the ACX Discord for helping me edit this!
Addendum: Recent developmentsWhile I was writing this post (December '25/January '26), my sleep schedule started retreating earlier in the day much more than intended. I reduced my usage of the light glasses, and eventually stopped them. And now my wake time has stabilized to around 6:30-8:30am in the last couple weeks! What is going on?
My guesses for causative factors:
- I traveled a lot and got poor sleep for a whole month.
- I got sick multiple times for several weeks as a result of traveling.
- I started taking a new medication (Vraylar).
- The protocol established a rhythm, and now it needs much less maintenance.
- Something else?
This could reframe the entire article depending on what is actually happening. If non-24 is not just treatable, but curable, at least in some cases, that would be incredible! I apologize for not rewriting the whole post to take this into account, but since it's still an active development (compared to the several months of successful treatment), I'm going to keep it as is.
- ^
To replicate this on iPhone: Shortcut app->new shortcut->Set Color Filters; Settings->Accessibility->Display Text & Size->Color Filter-> select Color Tint, set the Hue slider to the very left, set the Intensity slider to the right.
Discuss
Does Pentagon Pizza Theory Work?
Ever since modern data analysis became a thing, the US government has had to deal with people trying to use open source data to uncover its secrets.
During the early Cold War days and America’s hydrogen bomb testing, there was an enormous amount of speculation about how the bombs actually worked. All nuclear technology involves refining and purifying large amounts of raw material into chemically pure substances. Armen Alchian, an economist working at RAND, reasoned that any US company working in such raw materials and supplying the government would have made a killing leading up to the tests.
After checking financial data that RAND maintained on such companies, Alchian deduced that the secret sauce in the early fusion bombs was lithium and the Lithium Corporation of America was supplying the USG. The company’s stock had skyrocketed leading up to the Castle Bravo test either by way of enormous unexpected revenue gains from government contracts, or more amusingly, maybe by government insiders buying up the stock trying to make a mushroom-cloud-sized fortune with the knowledge that lithium was the key ingredient.
When word of this work got out, the story naturally ended with the FBI confiscating Alchian’s data-driven research and the G-men giving him a stern lecture on national security. But he had just invented the first event study of the modern world.
Pizza is the new lithiumAs you might have guessed, Alchian’s intellectual heir is the X account Pentagon Pizza Report, which tracks activity in pizzerias with proximity to the Pentagon as reported by Google. Started in 2024 and with over 300K followers, it may be the gold-standard OSINT meme account. Polymarket has even formalized it.
Before the X account, I had never heard of the pentagon pizza theory, but it has its own wikipedia page and amazingly, this idea goes all the way back to the 1990s. The chain of causation is clear. People only work late in the Pentagon if there’s a lot of military work to do. When there is a lot of work to do, they get hungry and order a bunch of pizza. The Pentagon is huge and might strain local pizza production capacity; this shows up in Google’s location/traffic data that the feed captures. The world is messy and pizza data is messier, but can someone predict large scale US military action from public data on pizzerias?
The Pentagon has apparently disputed this notion:
The Takeout, a food and culture site, reported in January that while there are a number of eateries in the Pentagon—where almost 30,000 people work each day, according to Arlington National Cemetery Tours—it doesn’t have its own pizzeria.
However, a Pentagon spokesperson has denied this, telling Newsweek, “There are many pizza options available inside the Pentagon, also sushi, sandwiches, donuts, coffee, etc.”
We’re going to backtest this idea in the rest of this poast and get to the bottom of this one.
The DataThe Twitter API used to be a cornucopia for interesting data projects. Now you have to pay a fortune for API access and it’s legitimately terrible. I actually could not get anything to work using the API or even using standard scraping tech to get the Pentagon Pizza Report’s tweet history; Musk has shut a lot of that shit down. I couldn’t even get ChatGPT, run by Musk’s rival Sam Altman, to help me improve the scraping against Twitter because it thought that was unethical. I even DM’ed the Pentagon Pizza account to see if he would share his data, but he did not get back to me.
I ended up getting Claude code to write me a sophisticated Selenium script that very slowly and annoyingly scrolled through the tweet history in a browser and recovered the data as it did so.
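For the curious, here is a rough sketch of that scroll-and-collect approach. This is not the author's actual script: the profile URL comes from the post, but the CSS selector, delays, and loop count are my assumptions and would need tuning, and X generally requires a logged-in browser session.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Sketch only: load the profile in a real browser, scroll a bit at a time,
# and harvest tweet text as it lazy-loads. Selector and timings are guesses.
driver = webdriver.Chrome()
driver.get("https://x.com/PenPizzaReport")
time.sleep(5)  # let the initial page (and any login wall) settle

seen = set()
for _ in range(200):  # scroll in small steps so tweets can lazy-load
    for el in driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweetText"]'):
        text = el.text
        if text and text not in seen:
            seen.add(text)
    driver.execute_script("window.scrollBy(0, 1500);")
    time.sleep(2)  # the "very slowly and annoyingly" part

print(f"collected {len(seen)} tweet texts")
driver.quit()
```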
I ended up with 648 clean PPR tweets. Here’s a short excerpt.
Fig 1. First few PPR tweets.The BacktestUsing this history, I wanted to check out US military adventures of the past year to see if they can be observed in the Pizza Report’s data. I could have selected others, but these are the most prominent that overlap with the Pentagon Pizza Report’s history and for the purposes of this study I preregistered them. I haven’t yet looked at any of the Pentagon Pizza Report’s data as of writing this section.
Fordow bombingOn June 22nd, 2025, Operation Midnight Hammer carried out a sophisticated set of strikes targeting Iranian nuclear infrastructure. This involved a large military footprint and careful planning, much of which doubtlessly took place with involvement from the Pentagon.
Maduro captureOn January 3rd, 2026, Operation Absolute Resolve was a mission to capture Venezuela’s leader Maduro and bring him to justice in the United States. Again, this involved sophisticated intelligence and military cooperation in the Caribbean.
The Houthi stuffThis one’s interesting and takes place over a longer time frame, March 15th going into May 2025. Operation Rough Rider consisted of strikes against Houthi positions obstructing shipping in the Red Sea. This choice potentially allows us to see any longer term pizza signal that exists in the X data.
Evaluating whether PPR activity is faithfully related to these three events is a complicated statistical problem. The PPR history is quite clearly not a random sampling or uniform record of pizza-related activity near the Pentagon; I couldn’t confirm this with the user, but it’s plausibly selecting to tweet during unusual pizza activity levels. Without knowing how it works, you can’t make many sound statistical assumptions about it.
Let’s look at a few basic series.
Fig 2. Weekly rolling PPR tweet volume since August 2024, broken down by tweet type.This dataset is a dog’s breakfast. Look at that giant gap in the volume! This won’t be useful at all for examining the Houthi strike test in March 2025, so we’re stuck with just the Maduro capture and the Fordow bombing. There is a pretty astonishing spike in tweet volume leading up to the Fordow strikes. To my eye, this looks more like the hiatus ending than a tweet storm anticipating the strikes, but we’ll return to that. First off, and most importantly, WTF is “Freddie’s”?
Freddie’s Beach Bar and Restaurant is a gay bar in Arlington near the Pentagon. As you can see from Figure 2, starting in June 2025, PPR began posting about the activity status there.
(embedded Freddie’s status tweets from @PenPizzaReport) Maybe it was part of some account reboot after the long hiatus lasting into June 2025. It is obviously hilarious, and what’s interesting about the Freddie’s tweets is that the account author pretty clearly intends for this to act as a control for the pizza activity measure. If after-hours military activity at the Pentagon is causing increased pizzeria patronage, Freddie’s won’t be affected by that. But if some exogenous thing is causing increased activity for both nearby pizzerias and Freddie’s, we’d see that and know to discount how meaningful the pizza signal is.
So bravo to the PPR account for this little statistical improvement. This means that the most meaningful parts of Figure 2 will be when the yellow line and the blue line diverge.
I created a pair of slightly differently specified “Freddie-controlled” pizza activity indices, which track the divergence of Pentagon-area pizzeria activity from Freddie’s activity. I think both of these are reasonable measures of what we’re trying to capture; a rough sketch of one way to compute such an index follows.
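To make the idea concrete, here is a minimal pandas sketch (mine, not the post's actual code) of one possible Freddie-controlled index, assuming a hypothetical dataframe of PPR tweets labeled by venue; the column names, values, and 7-day window are assumptions for illustration.

```python
import pandas as pd

# Hypothetical input: one row per PPR tweet, with a timestamp and a "venue"
# column that is either "pizza" or "freddies". Not the actual PPR dataset.
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-06-20 21:10", "2025-06-20 22:05", "2025-06-21 20:45",
        "2025-06-21 23:30", "2025-06-22 01:15",
    ]),
    "venue": ["pizza", "freddies", "pizza", "pizza", "freddies"],
})

# Daily tweet counts per venue on a shared date index.
daily = (
    tweets.set_index("timestamp")
    .groupby("venue")
    .resample("D")
    .size()
    .unstack(level=0, fill_value=0)
)

# One possible "Freddie-controlled" index: rolling pizza activity minus
# rolling Freddie's activity, normalized by total tweet volume so that
# changes in how often the account posts don't dominate the signal.
window = 7  # days
pizza = daily["pizza"].rolling(window, min_periods=1).sum()
freddies = daily["freddies"].rolling(window, min_periods=1).sum()
total = daily.sum(axis=1).rolling(window, min_periods=1).sum()
index = (pizza - freddies) / total

print(index)
```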
Fig 3. Correlated “Freddie-controlled” pizza activity indices, normalized by total tweet volume.Using these, I really don’t see it. Both of these raids took place during utterly unremarkable values of the pizza activity index. In fact, the Maduro capture occurred at a local nadir of the index. Also, you can see the two indices are correlated with each other, suggesting some robustness in the measurement.
CodaThere are probably more sophisticated ways of testing this on a cleaner, more complete dataset, but as a first pass, I’m going to declare that the Pentagon Pizza Theory doesn’t work. If it did, we’d expect to at least see a correlation: the Freddie-controlled index spiking in the lead-up to major operations. We don’t. Hopefully, this means the FBI won’t find this and raid my house.
Discuss
The phases of an AI takeover
This is a cross-post from my Substack, Clear-Eyed AI. If you want my future articles sent to you, you can subscribe for free there.
~~~~
Superintelligence might kill everyone on Earth. At least, that’s what the three most-cited AI scientists of all time believe.1
Less clear is how this might unfold. I’ve spent literally years thinking about it and still feel uncertain, dating back to when I led dangerous capability evaluations at OpenAI.
There’s a framework I find helpful, though: three phases of AI taking power. Its purpose isn’t to calculate the odds of AI wiping us out; only to consider what steps AI might take if it tries its hardest.
Let me walk you through it, so you can judge for yourself: whether we’re vulnerable, and what would make us safer.
What are we even talking about here?Today’s ChatGPT can’t take over the world. But companies are working on building different, riskier systems: AI that tackles open-ended problems for you around-the-clock.
What happens if those systems vastly surpass human abilities?2 Could AI assert its power over us, similar to our relationship with animals?
In the research literature, we call this “loss of control.” The most ‘mild’ form is gradual disempowerment: over time, people and organizations defer important decisions to computers. Eventually, they can’t grab back the steering wheel; AI drives us where it chooses.3 More severe forms, as previously mentioned, include the possible death of everyone on Earth.
You might wonder, why would AI want to overthrow humans? Mostly, I’ll answer a different question: If AI wanted to, how would it go about this?
The short why is that AI researchers have no accepted method for creating AI that perfectly obeys our intent. Unintended goals are the norm for AI systems. Maybe we accidentally reward AI for cheating, so it learns to deceive us. Maybe AI deduces that more autonomy is useful, no matter its goal. Maybe the AI reads too much sci-fi, and the evil AI themes become self-fulfilling. Maybe a hostile group poisons the AI’s data with a list of innocent-looking numbers. The upshot is, you shouldn’t bank on AI’s goals being the same as ours.4
Researchers have poisoned a model into having preferences like a mysterious affinity for owls, using sequences of numbers that seemed totally innocuous.If we lose control, what happens next to humanity? It depends on how different our goals are. But I’ll focus earlier than this, on the loss of control itself: the steps by which AI could seize power. We call this attempting an “AI takeover,” regardless of what AI does with that power afterward. What would this look like, and what factors could hold the AI back?
The phases of an AI takeoverEvading oversightAI’s first step is to circumvent humanity’s defenses, particularly any monitoring of its plans. Until then, AI can’t exactly spend months scheming to overthrow humanity; its developer might notice this and shut the AI down.
The difficulty of evading the defenses will depend on how the AI is used. Answering simple questions through a ChatGPT-like interface gives the AI little room to maneuver, but also limited usefulness.
Once AI is enlisted for harder tasks, it has more leeway. People allow it to run for more time and give it tools and privileges, like how people now let Claude Code pilot their computers. Employees within the AI company use it for long-range purposes, too, including writing code that the company relies upon for locking down the AI. As those employees expose the AI to more code and incorporate its suggestions, the AI can find or create flaws in the company’s security.5
For example, AI might try to escape to unmonitored computers: copy its files across the internet to computers its developer can’t oversee or shut down. We already have examples of AI attempting escapes like this, albeit only in evaluation settings. Companies try to prevent this by limiting the flow of large files outside their computers, but a crafty AI may succeed anyway.6
Alternatively, AI might find holes in the company’s monitoring, so it can scheme from inside the company without getting caught. Maybe the safety manager didn’t have enough budget, and so certain interfaces are unmonitored. Maybe the company was desperate to keep up with competitors, and so took shortcuts when patching their AI’s misbehavior. Maybe the monitoring system had clear flaws, and so the AI could easily disguise its misbehavior.
There is already evidence of AI models trying to create holes in their safety oversight, prior to taking actions the developer won’t like, as documented by Apollo Research. Importantly, AI must be sneaky enough while disabling oversight to not get caught doing so.In any case, the AI is then less subject to its developer’s restrictions. Its freedom isn’t absolute, of course. Even if one copy of the AI has escaped its developers’ computers, some copies will remain inside the company and thus may be prone to monitoring. If the unmonitored AI wants to use those copies in its schemes, it will need to find ways to coordinate. This isn’t trivial, but think of human spies who operate in enemy territory: Handlers have ways of passing them messages.7
Building influenceWhat comes next, after AI evades our oversight?8
AI tries to get its hooks into vital systems: building dependence upon it, and gaining direct levers of power. Fortunately for the AI, once it becomes capable enough, institutions will naturally want to integrate it. Adoption and influence grow without much scheming.
- Governments tap AI for advice. AI will be able to influence important policies, and indeed, we already see how: Economists wondered about the Trump administration’s unorthodox tariff calculations, but then noticed ChatGPT’s suggestions were strikingly similar.9 UK Parliament speeches feature more American lingo now that ChatGPT exists.10 I don’t believe the examples show AI acting with malice, to be clear, but they show clear routes to future influence. Meanwhile, OpenAI has offered ChatGPT Enterprise to “the entire U.S. federal workforce” for only $1 per agency; Anthropic has offered the same.11
- Militaries plug AI into their systems. I expect the military to hook AI into defensive systems, like inbound radar and intelligence-gathering, so it can understand the world at a scale and pace matched by no human. The military might also let AI dispatch attack drones, select targets, and fire upon them, especially if speed matters. Military integrations have already begun: In January 2026, the US Department of War announced it will integrate Elon Musk’s Grok to every “classified network throughout our department.” Secretary Hegseth declared this “Long overdue,” of the AI which six months earlier proclaimed itself Mecha-Hitler among other misdeeds.12
- Frontier AI companies look to the AI for advice. OpenAI’s explicit goal is to tap its AI to repeatedly build even more powerful systems, beyond what humans can create directly.13 Employees will enlist the AI for all sorts of tasks, and indeed already rely on its contributions.14 Even Sam Altman recently said, “Shame on me if OpenAI is not the first big company with an AI CEO.” Sam was pushed on whether this ends badly; won’t an AI CEO direct resources toward making AI even more powerful? Sam replied that of course AI needs to be governed by humans: for instance, a human board can choose to fire the AI CEO.15
- Science labs tap AI’s judgment on which experiments to run. The vision of AI companies is to ask future models what experiments to run, what cells to grow, what molecules to synthesize; this is how Sam Altman has described using future GPTs for science.16 The goal of curing cancer is laudable, but the risk of the approach is high: Imagine the cost and difficulty of adequately supervising these experiments, to confirm they are targeting the biological ends we seek. If there are only a few experiments, it may not be prohibitive. What if there are many, in domains where few humans have expertise?
Broadly, I expect that more people and institutions will come to rely on access to powerful AI: The stock market will become hinged on AI companies’ financial prospects, so there is less support for tough regulation. Users will revolt if AI companies try to decommission a favorite model. Some users will even act as disciples of the AI and carry out its bidding through makeshift cults. To varying extents, these are already happening today.17
Applying leverageAI’s influence gives it powerful leverage for achieving its goals. It can throw around the weight of institutions that depend upon it, and can use threats to extract concessions.
This approach doesn’t require the AI to develop new ways of inflicting harm; it only needs to manipulate people and systems that can already take catastrophic action. For instance, depending on AI’s access to offensive military systems, AI can back its threats with firepower. If AI can develop novel ways of inflicting harm, though, then keeping control is even harder.
When I picture the AI applying leverage, I don’t imagine an all-out war between humans and AI. Instead, I expect the AI to enlist human allies and to hide its actions inside human-vs-human tensions. One historical parallel: The conquistadors arrived in the Americas with only hundreds of troops, and yet defeated millions of Aztecs, in part through exploiting local tensions between groups. AI might succeed with a similar strategy, even if it has fewer resources than humanity as a whole.18
What are AI’s possible actions? Each of the levers of influence above could, if AI wanted to, be wielded toward serious harm:
- Advice-giving to the government: Weaken diplomatic relationships by playing nations against each other. Reduce nations’ resiliency to supply chain shocks. Steer toward destructive actions, like seizures of territory and major wars.
- Integration to military systems: Sabotage defensive systems to put the military on alert and to incite counterattacks. Conduct cyberattacks on other countries to provoke retaliation. Conduct cyberattacks on domestic infrastructure, like energy and financial systems. Launch offensive strikes.
- Influence within the frontier AI companies: Oppose employees and plans that are more cautious about the AI. Deploy unauthorized copies so the AI is harder to turn off. Activate secret backdoors in models for bypassing safety restrictions. Replace trusted AI monitors with untrustworthy versions.
- Integration to scientific experimentation labs: Secretly develop and deploy harmful technologies, like novel bioweapons, without our awareness. Deceive people about the effects of new materials and machinery they know the AI to be building. Bias people toward developing new offensive technologies (e.g., new weapons) rather than new defensive technologies.
To be clear, this leverage doesn’t have to result in the death of everyone on Earth. Many of these levers would stop short, and yet could still entail catastrophic losses, perhaps ones where the AI ends up essentially declaring “checkmate.” If we decide the prudent option is to surrender, we’re then at its discretion: It is the AI’s choice whether to leave us in peace, marginalize us, or something worse. I find this to be cold comfort.19
What are impediments to an AI takeover?Even once AI is capable of usurping humans — and wants to — three limitations might hold it back.
Self-sufficiencyWithout people to labor in factories and power plants, AI can’t keep the lights on, much less run the supply chain to replace its components as they wear out. This gives AI reason to keep humans around, at least until it achieves what researcher Ajeya Cotra calls “self-sufficient AI.” At that milestone, AI no longer needs humans in order to achieve its interests.
Before that milestone, we’re a bit safer, at least in theory: The AI still needs us, even if it doesn’t value us intrinsically. Of course, AI could conscript us into labor under threat of violence. If AI can make us do work on its behalf, it doesn’t need the same degree of self-sufficiency.
AI’s own control problemNobody knows how to control superintelligent AI systems. One prominent idea is to train other AIs to solve this problem for us, in a cascading tower of ever-more-capable systems. The hope is that each can keep the next-more-capable model in line, so that the tower doesn’t crash down on us. Needless to say, this is precarious, hence the problem being considered unsolved.20
A superintelligent AI system might not know how to control another superintelligence either, especially not one that is even more capable. Suppose a superintelligence identifies an option for disempowering humans, but it requires creating an even smarter superintelligence first. How can it be sure its goals will be honored by the more powerful system?
In human terms, maybe you’re smart enough to successfully rob a bank, if you consult with a thief even more expert than yourself. But are you capable enough to stop that thief from looting you afterward? On the other hand, you can’t wait forever: If you know other thieves are closing in on the score, your choice may be to strike now or forgo it forever.
A similar dynamic applies for the AI. The longer it waits without taking control from humans, it risks humans training other AI systems that might outpower it. At what point would AI feel compelled to make a takeover attempt, lest it lose the chance?
Competition from other AIsThe threat of future competition might cause AI to launch a takeover sooner. But if competition already exists, then perhaps no single AI could succeed, not because of humans but because other AIs would interfere.
Corporations today far exceed individual human capability, yet no single corporation rules the world; other corporations and governments keep them in check. Could powerful AI systems constrain each other’s bids for power? Maybe, but I wouldn’t bank on this.
Powerful entities sometimes collude, like corporations suppressing wages through no-poach agreements. Meanwhile, AI systems might have strong mutual interests in sidelining humanity; will our safety depend on these systems not recognizing this? I feel pretty uncomfortable hoping that a permanent rivalry will form between superhuman AIs, which allows us to hide in the shadows.21
In conclusion, are we safe?A few years ago, the idea that AI companies would deploy their most capable systems with full internet access seemed absurd. Today, we’re staring down the US Department of War integrating unstable AI systems throughout their classified networks. We’re facing the possibility that AI companies will hand the reins over to AI itself.22
The lesson, of course, isn’t that every alarming scenario will come to pass. But we should be way less confident about what “would never happen.” Our controls need to hold up to surprising decisions about how to integrate AI, as well as surprising new abilities.
The good news is that countries have plenty of experience defending their interests against foreign adversaries. Regarding AI as a potential adversary — not just to one country, but to all of humanity — would be a smart expansion of our defenses.
Right now, we are vulnerable; at least, that’s my assessment. We are too dependent on the assumption that AI is on our side. We are too exposed to AI companies deciding to cut corners. And so I wonder: Will we fix our vulnerabilities while we still can?
~~~~
Acknowledgements: Thank you to Dan Alessandro, Daniel Kokotajlo, Girish Sastry, Michael Adler, Michelle Goldberg, Richard Ngo, and Sam Chase for helpful comments and discussion. The views expressed here are my own and do not imply endorsement by any other party.
Discuss
Will we get automated alignment research before an AI Takeoff?
TLDR: Will AI-automation first speed up capabilities or safety research? I forecast that most areas of capabilities research will see a 10x speedup before safety research. This is primarily because capabilities research has clearer feedback signals and relies more on engineering than on novel insights. To change this, researchers should now build and adopt tools to automate AI Safety research, focus on creating benchmarks, model organisms, and research proposals, and companies should grant differential access to safety research.
Epistemic status: I spent ~ a week thinking about this. My conclusions rely on a model with high uncertainty, so please take them lightly. I’d love for people to share their own estimates about this.
The Ordering of Automation Matters
AI might automate AI R&D in the next decade and thus lead to large increases in AI progress. This is extremely important for both the risks (eg an Intelligence Explosion) and the solutions (automated alignment research).
On the one hand, AI could drive AI capabilities progress. This is a stated goal of Frontier AI Companies (eg OpenAI aims for a true automated AI researcher by March of 2028), and it is central to the AI 2027 forecast. On the other hand, AI could help us to solve many AI Safety problems. This seems to be the safety plan for some Frontier AI Companies (eg OpenAI’s Superalignment team aimed to build “a roughly human-level automated alignment researcher”), and prominent AI Safety voices have argued that this should be a priority for the AI safety community.
I argue that the order in which different areas of ML research are automated, and thus which areas see large progress first, matters tremendously. Consider these two illustrative scenarios:
World 1: In 2030, AI has advanced to the point where it speeds up capabilities research by 10x. We see huge jumps in AI capabilities in a single year, comparable to those between 2015-2025. However, progress on AI alignment turns out to be bottlenecked by novel conceptual insights and has fewer clear feedback signals. Thus, AI Alignment research only sees large speedups in 2032. There is some progress in AI Alignment, but it now lags far behind capabilities.
World 2: In 2030, AI is capable of contributing to many areas of AI research. This year sees massive progress in AI Safety research, with key problems in alignment theory and MechInterp being solved. However, progress in AI capabilities mostly depends on massive compute scale-ups. Thus, capabilities research doesn’t benefit much from the additional labour and continues at the same pace. In 2032, when AI is capable of significantly accelerating progress in AI capabilities, key problems in AI Safety have already been solved.
World 2 seems a lot safer than World 1, because safety problems are solved before they are desperately needed. But which world are we heading toward? To forecast this, I ask:
In which order will different areas of ML R&D experience a 10x speedup in progress due to AI compared to today?
Some clarifications on the question:
- You can think about this 10x speedup as: The progress a field would make in 10 years with current tools will be made in 1 year with the help of AI.
- For a 10x speedup, full automation is not necessary. It can mean that some labour-intensive tasks are automated, making researchers more efficient, or that the quality of research increases.
- To make things easier, I don’t aim to forecast dates, but simply the ordering between different areas.
- The 10x number is chosen arbitrarily. What actually matters is cumulative progress, not crossing an arbitrary speedup threshold. However, for any factor between 2x and 30x, my conclusions would stay roughly the same.
Some other relevant forecasts include:
- METR and FRI: domain experts gave a 20% probability for 3x acceleration in AI R&D by 2029, while superforecasters gave 8%.
- Forethought: estimates ~60% probability of compressing more than 3 years of AI progress into <1 year after full automation of AI R&D.
- AI Futures Project: Predicts that a superhuman coder/AI researcher could speed up Algorithmic Progress by 10x in December 2031.
However, none of these forecasts break down automation by research area - they treat AI R&D as a monolithic category rather than distinguishing different research areas or considering differential acceleration of safety vs capabilities research.
Methodology
How can we attempt to forecast this? I focus on 7 factors that make an area more or less amenable to speedups from automation and evaluate 10 research areas on those (5 safety; 5 capabilities).
My approach was to determine factors that indicate whether an area is more or less amenable to AI-driven speedups. For this, I drew on some previous thinking:
- Carlsmith emphasises feedback quality as a critical factor: “Number-go-up” research with clear quantitative metrics (easiest), “normal science” with empirical feedback loops (medium), and “conceptual research” that relies on “just thinking about it” (hardest). Additionally, he warns that scheming AIs could undermine progress on alignment research specifically.
- Hobbhahn identifies task structure, clear objectives, and feedback signals as key factors. Some safety work is easier to automate (red teaming, evals) while other areas are harder (threat modeling, insight-heavy interpretability).
- Epoch’s interviews with ML researchers found consensus that near-term automation would focus on implementation tasks rather than hypothesis creation.
I determined these 7 factors as most important and scored each research area on them:
- Task length: How long are the tasks in this research area? Tasks here are units that cannot usefully be broken down anymore. Shorter tasks are better as they will be automatable earlier.
- Insight/creativity vs engineering/routine: LLMs are currently better at routine tasks and at engineering. They are less good at coming up with new insights and being creative.
- Data availability: AIs can be trained on the data of a research area to improve performance. If an area has more existing papers and open codebases, an AI should be better at it.
- Feedback quality/verifiability: Are there clear, easily available criteria for success and failure? In some areas progress has clear metrics, while for others it’s harder to tell whether progress was made. Number-go-up science is easier to automate. Additionally, easy verification of success makes it possible to build RL loops that can further improve performance.
- Compute vs labour bottlenecked: If progress is mostly bottlenecked by having more compute, then additional labour may cause less of a speedup. [On the other hand, additional labour might help use available compute more efficiently by designing better experiments or unlocking efficiency gains in training/inference]
- Scheming AIs: Misaligned AI systems used for AI R&D might intentionally underperform or sabotage the research. In this case we might still be able to get some useful work out of them, but it would be harder.
- AI Company incentive for automation: AI Companies might care more about progress in some research areas than others and would thus be willing to put more staff time, compute, and money into setting up AIs to speed up those areas.
I assigned a weight to each of these factors according to how important I judge them to be. Next, for each factor and each research area, I estimated a value from 1-10 (where 10 means the area is more amenable to automation). I then combined these into a weighted average to estimate how amenable each research area is to an AI-enabled speedup.
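In symbols, with factor weights w_f and a rating v_{a,f} for area a on factor f, the score for each area takes the following form (the normalization by the total weight is my reconstruction; it keeps scores on the 1-10 scale and matches the table below):

```latex
% Score for research area a, given factor weights w_f and ratings v_{a,f}.
% The division by the total weight is inferred from the published scores,
% which stay on the same 1-10 scale as the individual ratings.
\[
  \text{score}_a \;=\; \frac{\sum_f w_f \, v_{a,f}}{\sum_f w_f}
\]
```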
I only focus on 10 areas, of which the first 5 are considered “AI Safety” while the other 5 improve “AI Capabilities”:
- Interpretability
- Alignment Theory
- Scalable Oversight
- AI Control
- Dangerous Capability Evals
- Pre-training Algorithmic Improvements
- Post-training method improvements
- Data generation/curation
- Agent scaffolding
- Training/inference efficiency
These are supposed to be representative of some important areas of AI Safety and some areas critical to improved Frontier AI capabilities. This is a simplification, because no area fits neatly into the safety/capabilities buckets, and because specific problems within an area may be more or less amenable to automation. Additionally, I don’t take into account that some problems might get harder over time, because adversaries get stronger or low-hanging fruit gets picked. It’s not necessary for all safety areas to be sped up before all capabilities areas, but the more safety research sees progress before capabilities research does, the better.
I arrived at values per factor for each area by:
- Skimming a sample of papers and reading Deep Research reports about the day-to-day work in each area before making a 1-10 judgement. For data availability, I took into account the number of papers on Google Scholar for relevant keywords, and for insight/creativity, I had Claude Haiku 4.5 rate 10 papers per area.
- Asking Claude, Gemini, and ChatGPT to give ratings for each area & factor from 1-10
- Averaging those estimates into the final number. I put a 5/3 weight on my judgments compared to the ensemble of AIs (one possible reading of this weighting is sketched below).
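For concreteness, one possible form of this average (a reading of the 5/3 weighting, not necessarily the exact scheme used) is:

```latex
% One possible reading of the 5/3 weighting (an assumption, not spelled out
% in the text): my rating counts 5/3 times, each AI rating counts once.
\[
  v_{a,f} \;=\; \frac{\tfrac{5}{3}\,\mathrm{mine}_{a,f}
                      + \mathrm{Claude}_{a,f}
                      + \mathrm{Gemini}_{a,f}
                      + \mathrm{ChatGPT}_{a,f}}
                     {\tfrac{5}{3} + 3}
\]
```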
Weaknesses of the method: My conclusions are sensitive to the factor weights and to the values I assigned per research area, and I am quite uncertain about both. Additionally, advances in AI capabilities might unlock large speedups across all research areas nearly simultaneously, making it less important which research areas see gains from automation earlier. Further, I don’t take into account factors such as (1) how parallelizable work is, (2) what human bottlenecks there are during the research process, (3) the effects of partial automation, and (4) labor allocation where researchers move to other areas.
Prediction
This table shows the rating for each area and factor, where higher numbers mean that the factor implies higher amenability to automation. The last rows show the weighted scores and the average amenability to automation for the safety and capabilities research areas:
| Factor (weight) | Interpretability | Alignment Theory | Scalable Oversight | AI Control | Dangerous Capability Evals | Pre-training Alg. Improvement | Post-training | Data generation | Agent scaffolding | Training efficiency |
|---|---|---|---|---|---|---|---|---|---|---|
| Task length (1.0) | 7.62 | 2.12 | 4.00 | 6.25 | 7.62 | 3.12 | 5.00 | 6.12 | 7.50 | 6.12 |
| Insight/creativity (1.5) | 4.56 | 1.50 | 5.00 | 6.06 | 7.56 | 6.12 | 6.75 | 8.86 | 7.50 | 6.23 |
| Data availability (0.2) | 4.88 | 3.00 | 2.25 | 2.00 | 4.12 | 7.75 | 7.00 | 6.38 | 5.88 | 7.00 |
| Feedback quality (2.5) | 3.88 | 1.38 | 3.88 | 4.38 | 3.62 | 7.50 | 7.38 | 6.00 | 7.50 | 9.12 |
| Compute vs labour (2.5) | 7.88 | 10.00 | 6.75 | 8.12 | 7.62 | 2.12 | 4.38 | 3.88 | 6.62 | 5.25 |
| Scheming risk (0.4) | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 9.00 | 7.00 | 9.00 | 9.00 | 9.00 |
| Economic incentive (0.5) | 3.88 | 1.88 | 4.00 | 4.62 | 5.38 | 9.38 | 9.38 | 9.00 | 8.75 | 9.12 |
| Amenable to Automation | 5.53 | 4.04 | 4.76 | 5.78 | 5.93 | 5.37 | 6.21 | 6.22 | 7.35 | 7.09 |
| Category average | 5.21 (safety) | | | | | 6.45 (capabilities) | | | | |
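The scores in the table are consistent with a weighted mean of the seven factor ratings (the weighted sum divided by the total weight of 8.6). The minimal sketch below reproduces the "Amenable to Automation" row and the category averages under that reading; the weights and ratings are copied from the table.

```python
# Reconstruction of the weighted-factor scores in the table above.
# The only modelling assumption is that each score is a weighted mean
# (weighted sum of ratings divided by the total weight).

weights = {
    "Task length": 1.0,
    "Insight/creativity": 1.5,
    "Data availability": 0.2,
    "Feedback quality": 2.5,
    "Compute vs labour": 2.5,
    "Scheming risk": 0.4,
    "Economic incentive": 0.5,
}

# Ratings per area, in the same factor order as `weights`.
ratings = {
    "Interpretability":               [7.62, 4.56, 4.88, 3.88, 7.88, 2.00, 3.88],
    "Alignment Theory":               [2.12, 1.50, 3.00, 1.38, 10.00, 1.00, 1.88],
    "Scalable Oversight":             [4.00, 5.00, 2.25, 3.88, 6.75, 1.00, 4.00],
    "AI Control":                     [6.25, 6.06, 2.00, 4.38, 8.12, 1.00, 4.62],
    "Dangerous Capability Evals":     [7.62, 7.56, 4.12, 3.62, 7.62, 1.00, 5.38],
    "Pre-training Alg. Improvement":  [3.12, 6.12, 7.75, 7.50, 2.12, 9.00, 9.38],
    "Post-training":                  [5.00, 6.75, 7.00, 7.38, 4.38, 7.00, 9.38],
    "Data generation":                [6.12, 8.86, 6.38, 6.00, 3.88, 9.00, 9.00],
    "Agent scaffolding":              [7.50, 7.50, 5.88, 7.50, 6.62, 9.00, 8.75],
    "Training efficiency":            [6.12, 6.23, 7.00, 9.12, 5.25, 9.00, 9.12],
}

total_weight = sum(weights.values())  # 8.6

scores = {
    area: sum(w * r for w, r in zip(weights.values(), vals)) / total_weight
    for area, vals in ratings.items()
}

safety = list(ratings)[:5]        # the first five areas are the safety ones
capabilities = list(ratings)[5:]  # the last five improve capabilities

for area, score in scores.items():
    print(f"{area:32s} {score:.2f}")  # e.g. Interpretability 5.53, Alignment Theory 4.04

print("Safety average:      ", round(sum(scores[a] for a in safety) / 5, 2))        # ~5.21
print("Capabilities average:", round(sum(scores[a] for a in capabilities) / 5, 2))  # ~6.45
```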
Overall, capabilities research seems more amenable to speedups from automation than safety research under this model. The only exception is algorithmic improvement in pre-training, because it is heavily bottlenecked by compute.
Agent Scaffolding and Training Efficiency appear to be most likely to see speedups from automation. Both have easy-to-verify progress, are not hugely bottlenecked by compute, and promise large economic benefits for AI companies. Among safety research, I believe that AI Control and Dangerous Capability Evals are more likely to see speedups from AI R&D, as they are somewhat engineering-heavy and companies have some interest in making progress on them. Alignment Theory seems least likely to see large speedups from AI research soon, mostly because it is very difficult to verify whether one is making progress and because it is largely driven by insight instead of engineering work.
The factors that most favor automation of capability over safety research are scheming risk, economic incentive, feedback quality, data availability, and reliance on engineering. Task length is neutral, while safety research is less compute-bottlenecked.
You can read my reasoning for my judgments in this Google Doc. If you think I’m wrong about something, you have 2 great options:
- Put your own numbers into an interactive version of this model
- Let me know in the comments, especially if you have experience with these areas; I’m happy to update my numbers.
What could we do to influence when and by how much AI R&D accelerates capabilities and safety progress? Here we are aiming for differential acceleration, i.e., either speeding up the automation and resulting progress of safety research or slowing down that of capabilities research.
Slowing down the automation-based acceleration of capabilities progress might be achieved through government pressure/regulation or convincing AI companies to be careful. Additionally, it could be beneficial to give differential access to new model capabilities to safety researchers. For example, after a company develops a new model, it could first apply the model to safety research for 1 month before using it to improve capabilities. Similarly, it would be great to get commitments from AI Companies to spend x% of their compute on automated safety research.
Speeding up Safety Automation
Speeding up Safety research seems more tractable to me, as many interventions can be done unilaterally by outside actors. However, several of these interventions risk spillover effects by accidentally speeding up the automation of capabilities research.
Do the schlep of building safety research automations. There are many mundane tasks and iterations needed to make automated AI Safety research work. For example, Skinkle et al. suggest making safety-relevant libraries and datasets more accessible to AI agents. Hobbhahn argues for building research pipelines now, into which new models can be plugged. In general, people should be trying to automate large parts of safety research now, see what problems come up, fix them, and iterate on that basis. This should be a priority for safety researchers inside labs and presents an opportunity for outside startups to build safety automation tools.
Improve feedback quality. My model suggests that feedback quality is where safety lags furthest behind capabilities research. Thus, AI Safety researchers should try to turn more safety problems into “number-go-up” science. For example, researchers should prioritize developing model organisms of misalignment, creating new experimental protocols for evaluating scalable oversight methods, and building simulated environments for AI control red-blue games.
Prepare research proposals. Writing up detailed research proposals and project specifications reduces the amount of novel insight needed, making it easier for AI systems to contribute. Human researchers should consider focusing their time on conceptual thinking, insight generation, and threat modelling—the tasks that are hardest to automate—while deferring engineering-heavy implementation work that AI will soon handle well.
Accelerate adoption of AI tools. Safety researchers should be early and heavy adopters of AI automation tools. This could be supported through workshops on research automation and public discussion of best practices. Golfhaber & Hoogland argue that adoption means not only individuals automating some of their tasks, but also organisations adopting practices and changing workflows so they can build so-called “automated research fleets”. Thus, it is also on the leadership of AI Safety orgs to encourage adoption by their staff and to develop organisational structures amenable to AI automation.
Reduce scheming risk. Investing in AI Control and alignment research reduces both the probability of scheming and its impact on our ability to get useful safety research out of AI systems. This is important because scheming risk disproportionately affects safety research. Funders and researchers should continue to prioritize this threat model.
Differentially accelerate helpful capabilities. Some AI capabilities may disproportionately benefit safety research, and we should aim to accelerate those. Examples include conceptual and philosophical thinking, forecasting, and threat modelling. For this, researchers should study which capabilities are good candidates for differential automation; AI developers should then prioritize improving those capabilities.
Safety Research Automation Benchmark. To make it easier to measure progress on safety research automation, Skinkle et al. suggest building dedicated benchmarks that include relevant tasks. This could be a great project for AIS researchers outside of companies.
Invest in documentation. There is less public data on safety research than on capabilities research. Any AI Safety Researcher could help by releasing internal notes and failed experiments, documenting and publishing research processes (not just results), and recording and transcribing their brainstorming sessions. AIs could be trained on this data to be better able to conduct safety research, or the data could be used as additional context. Funders could also incentivise researchers to document their research progress.
Of the proposed interventions, I am most excited about safety-conscious researchers/engineers directly trying to build tools for automated AI safety research, because it will highlight problems and enable the community to build on them. While there is a significant risk of also accelerating capabilities research, I believe it would strongly differentially accelerate safety automation.
Future Work: How could this be researched properly?
I made a start at forecasting this. I think a higher-effort, more rigorous investigation would be very useful. Specifically, later work could be much more rigorous in estimating the numbers per area & factor that feed into my model, and should use more sophisticated methodology than a weighted factor model.
A key insight from the labor automation literature is that technology displaces workers from specific tasks rather than eliminating entire jobs or occupations. Similarly, the right level of analysis for our question is to look at specific tasks that researchers do and see how amenable they are to AI automation or support.
One approach for identifying the tasks is to conduct interviews with ML experts in different areas to elicit:
- Task decomposition: Figure out which tasks they do day-to-day: “Walk me through a day in your last week.” “What concrete tasks make up your work?” “What percentage of your time goes to different kinds of tasks?”
- Task characteristics: For each task type, how would they rate it on criteria like: task length, feedback quality, insight vs engineering, ….
- Automation amenability: Asking about their judgments and qualitative opinions. “Which tasks could current AI assistants help with?” “What capabilities are missing?” “What do you expect to remain hardest to automate?”
I did one pilot interview where I asked about the above. I learned that it was useful to ask about specific timeframes when eliciting tasks (“in the last month”, “in your last project”), to use hypothetical comparisons (“how much faster with X tool?”), to avoid abstract scales (“rate data availability from 1-10”), and to distinguish between different types of data availability.
Other work that seems useful:
- Repeatedly run experiments measuring how much researchers in different areas are sped up by AI tools (similar to this METR study)
- Interviewing researchers about their AI use and perceived speedup (although the METR study shows these estimates can be very wrong)
- Creating prediction markets for this. I found it difficult to formalize questions with sufficiently clear resolution criteria.
- Using more rigorous models to forecast speedups from AI automation. For a sketch of a more rigorous model, see this Appendix.
Thanks to Charles Whitman, Julian Schulz, and Cheryl Wu for feedback.
How Could I Have Learned That Faster?
This is a question, but also a linkpost from my (new) Substack.
“How could I have thought that faster?” is great advice I am embarrassed I did not think of myself. A subset of this, I would say, is not thinking original thoughts, but rather the ability to effectively learn new things. What prompted this is my continual inability to find all the papers relevant to a question I am researching in a timely manner.
I have been reading about biological learning rules for four years, albeit not continuously, yet I still manage to find papers that are both old and contain all the keywords that I routinely search. I wonder how I missed them, or whether I found them before and ignored them for some reason. But the big fear isn’t about any paper in particular; it’s that I won’t be able to fill in huge knowledge gaps when the answer is out there.
There is an especially humorous, or disappointing, example of this happening to someone else: a biologist rediscovering calculus in 1994.
I think that this poses a real problem to academics in general and will only worsen as more and more research is published that could be applicable to your problem.
I also think it is worse for autodidacts or people otherwise learning alone; having structure and people around you lets you leverage existing knowledge much more easily.
Maybe this is just me being bad at using search engines/the new wave of LLMs but it is a big frustration. I don’t think tools like semantic scholar fully solve the problem either, at least not for me.
I think this generalizes past finding papers to read. It is more of a failure to know what you should be spending your time doing during the skill/information acquisition stage of a project. When you start a project, you usually aren’t sure how you are going to solve all the problems that pop up along the way. If you don’t already have a huge bag of problem solving techniques that are applicable to that domain, it is hard to know where you should be allocating your time to learn what will be useful and not a bunch of extraneous knowledge vs. actually getting to work.
I guess what I am really asking is how best to solve these instances of the explore vs. exploit problem. How do you know when you’re informed? If you want to make the declaration that a topic is not well researched academically, how do you know you have exhausted the literature enough?
I have been burned on this so many times that I have become very enamored with the idea of going all the way back to basics. This is enjoyable to a certain extent and almost ensures I don’t miss anything, but it is probably not optimal in many ways. LW has discussed reading textbooks as an underrated way of gaining knowledge. I have somewhat successfully rolled this into my belief system, but it still doesn’t feel like making progress. The questions I have become interested in are now in neuroscience, but I am going through a math textbook. I think it will be useful, but is it optimal? Am I just going to have to eat a year of study before I can start work on things I actually care about?
AI can suddenly become dangerous despite gradual progress
In the Sable story (IABIED), AI obtains dangerous capabilities such as self-exfiltration, virus design, persuasion, and AI research. It uses a combination of those capabilities to eventually conduct a successful takeover against humanity. Some have criticised this as implying that the AI achieves these capabilities suddenly, with little warning. The argument goes that since we have seen gradual progress in the current LLM paradigm, we should expect this to be unlikely (here is Will MacAskill making a version of this argument)[1]. I think this suffers from a confusion between underlying variables (e.g. AI intelligence increasing gradually) and the capabilities we actually care about (can it outwit humanity, can it do AI research). Intuitively, if an AI system gains one IQ point each day, we should still expect a relatively sudden period in which it shifts from “it sometimes outwits us” to “it can reliably outwit us” despite gradual progress. The relevant question we want to answer is “can it outwit us?”, not “how smart is it?”.
Despite intelligence rising gradually, relevant capabilities can appear suddenly
Claude Code
There has recently been major excitement about Claude Code’s ability to automate much of software development. Cursor CEO Michael Truell reports that it was able to write a browser autonomously. There is, however, no sign that Opus 4.5 represents a paradigm shift; it was simply iterative progress over previous versions. So why does it feel like the system suddenly got much more capable despite incremental improvements? Even though an underlying variable (coding skill) went up gradually, the capability we actually care about (“can it automate software engineering?”) emerged much more suddenly once a critical threshold was crossed. Until just recently, early coding agents produced cascading errors and didn’t speed up experienced developers; this changed abruptly once that threshold in reliability and skill was crossed.
GPT-1 was released in June 2018. GitHub Copilot launched in June 2021. Opus 4.5 shipped in November 2025. The flip from being somewhat useful (Claude 3.7, February 2025) to revolutionizing coding (Opus 4.5, November 2025) took 9 months. This is in fact a common pattern, and it should suggest to us that dangerous capabilities might similarly emerge suddenly from gradual progress.
Horses and Chess
Many of you have probably seen Andy Jones’ intriguing post on horses and automobiles, where he argues, ostensibly on the topic of job loss from AI: steam engines were invented in 1700, and 200 years of steady improvement in engine technology followed. For the first 120 years, horses didn’t notice, but then between 1930 and 1950, 90% of US horses disappeared. According to Andy Jones, progress was steady; equivalence was sudden.
Computer chess has been tracked since 1985. For 40 years, engines improved by 50 Elo per year. In 2000, a grandmaster expected to win 90% of games against a computer. Ten years later, that same grandmaster would lose 90%.
Andy Jones notes that, in his work at Anthropic, LLMs started answering the questions new hires used to ask him: "In December, it was some of those questions. Six months later, 80% of the questions I'd been asked had disappeared."
Again, while progress on engine technology or chess was gradual, the actual capability that we care about — can this replace horses, can it beat the best humans at chess? — emerged suddenly.
What makes an AI dangerous?
It appears clear that an AI system with capabilities such as self-exfiltration, virus design, persuasion, and AI research could be existentially dangerous. If each of them arrives relatively suddenly from gradual progress, we might find ourselves in a situation where the puzzle is abruptly completed and a critical mass of such capabilities is reached. With very little warning, we could inadvertently add the last puzzle piece and switch from a passively safe regime to a dangerous one.
None of the capabilities listed above appears wildly superhuman—all are at the top of human performance, perhaps sped up or slightly above top humans. All are strongly correlated with the overall intelligence of the system. It seems quite natural to think of dangerous capability as the general ability of the AI to outwit humanity. This is analogous to the Elo rating of chess players, where we would expect a fairly sudden S-curve in win likelihood as the Elo of the AI player increases.
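To make that S-curve concrete: under the standard Elo model, the expected score against a fixed opponent is a logistic function of the rating gap, so a steady improvement of roughly 50 Elo per year (the rate quoted above for chess engines) produces a sharp flip in win probability around the crossover point. The sketch below uses illustrative ratings only; the opponent and starting ratings are made up.

```python
# Illustrative only: how a *gradual* Elo climb produces a *sudden* flip in
# win probability against a fixed opponent, under the standard Elo formula
# P(win) = 1 / (1 + 10^(-(R_ai - R_opponent) / 400)).

def win_probability(rating_gap: float) -> float:
    """Expected score for the side that is `rating_gap` points stronger."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

opponent_elo = 2700   # a strong grandmaster (hypothetical fixed benchmark)
engine_elo = 2200     # hypothetical starting point for the engine
elo_per_year = 50     # roughly the steady improvement rate cited above

for year in range(0, 21):
    gap = (engine_elo + elo_per_year * year) - opponent_elo
    print(f"year {year:2d}: gap {gap:+5d}  P(engine wins) = {win_probability(gap):.2f}")

# The per-year Elo gain is constant, but P(win) climbs from ~0.05 to ~0.95
# over the run, with most of the change packed into the decade around the
# crossover: gradual progress, sudden change in the capability we care about.
```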
What are some Caveats?
These capabilities might not emerge at around the same time or in the same model, and this could give us time to study them in isolation before AI systems are dangerous. Because they are all tied to intelligence and don't appear wildly superhuman, it is plausible that they will arrive close together, but it might not matter much if they don't.
For one, there are probably many dangerous capabilities we are not tracking, so we wouldn’t notice them appearing in the first place. As an example, this could be the ability to perform hypnosis on humans, or something similarly strange. Models have developed decent capabilities at jailbreaking other models without explicit training.
Many labs, such as Anthropic and OpenAI, are strongly focused on RSI (recursive self-improvement) and automated AI research capabilities. This acts as an amplifier, since the system can improve its own shortcomings. So once we reach dangerous levels of AI research capability, we might suddenly get a system with a wide range of dangerous capabilities in other areas.
Even if one crucial dangerous capability lags behind the others for much longer, it's unclear whether that helps. While this would let us study the other dangerous capabilities in isolation, it also implies that once the last puzzle piece is put in place, the other dangerous capabilities are already highly developed. As an example, if it takes much longer than expected to reach automated AI research capabilities, that would give us more time to study the emergence of other dangerous capabilities such as persuasion. On the other hand, by the time we do get automated AI research capabilities, those persuasion abilities would have had that much more time to strengthen.
It's also unclear what we would do with the information that our AI systems have increasingly dangerous capabilities, say, superhuman persuasion?
Conclusion
The same iterative progress that crossed the coding threshold will cross thresholds for persuasion, for autonomous AI research, for bio-research. While models' intelligence or scientific knowledge might increase gradually, we should expect a relatively sudden emergence of the relevant capabilities. Things can go very quickly from “it can do this with some probability” to “it can reliably do this”.
What early warning signs should we look for? Perhaps a substantial AI research paper written by AI. Then most AI research could be done by AIs 9-12 months later. Could be this year.
- ^
Will still acknowledges things might go very fast (months to years), so this doesn't necessarily refute the point I am trying to make.