Вы здесь
Новости LessWrong.com
A memo on Takeoff
Inspired by Daniel and Romeo https://x.com/DKokotajlo/status/1986231876852527270, I'm posting a memo on Takeoff, which I plan to expand into a proper article.
This memo comes from notes I took from a talk from Daniel, which have been synthesized by Claude for breviy:
Key Predictions- Transformative AI timeline: Late 2028
- Key prediction: All existing benchmarks solved by 2030
- p(doom): 10%
- Reasoning capabilities (mostly solved)
- Agency and long-horizon skills:
- Error correction
- Goal pursuit
- Hierarchical planning
- Data-efficient learning
- New models and architectures (e.g., recurrence)
- Language Model Programs (LMP)
- Scaffolding systems
- AI bureaucracies
- Increased inference compute
- "Just train it to have those skills" approach
- Research Engineer for AI Research (we are entering here with Kosmos)
- "Devin actually works"
- Significant speedup in AI research
- Research Scientist
- "CEO talks to the cluster"
- Autonomous white paper production
- Genius
- Qualitatively better than human researchers
- Artificial Superintelligence
- Capable of anything
- Equivalent to 100k human teams
- 10^14 text predictions/rollouts on agentic paths
- 10^23 floating point operations for transformative capabilities
- GPT-4 pre-training: ~2.15 × 10^25 FLOPS for comparison
- t-AGI (10 second AGI to 10 day AGI)
- Year-AGI scaling
- Annual 1 OOM (order of magnitude) improvements expected
- Robotics gap: "If TAI is Industrial Revolution level, what about physical automation?"
- Job stickiness: Institutional and social lags in job displacement
- "Wet robots": Humans as interim solution
- "Paperclips are a metaphor for money"
- "Capitalism is the ultimate Turing test"
- AI systems are "incentivized to be constrained by human values"
Early prediction market odds:
- 1/3 Futurama scenario
- 1/4 Fizzle out
- 1/5 Dystopia
- Cyberpunk 2077-style hacking risks
- 43% of Americans believe civil war is somewhat likely
- Democracy at its peak by voter count (2024)
- Unprecedented political deepfakes
- Cultural homogenization
- Human disempowerment (sci-fi scenarios)
- Potential for singleton control
- "OpenAI coming for everybody's jobs is a solace" -
- Logarithmic growth in atoms, exponential growth in bits
- "Folk theorem - bigger (model) is better"
- "We're very different from current machines"
- "Brains are way more complex than even the most sophisticated NNs"
- Adaptation and time remain key differentiators
- Virtual societies and digital twins
- AI safety as cancer immune response
- Automated social science research
- a deceptively simple solution to alignment: "Train them to be honest"
Discuss
Geometric UDT
This post owes credit to discussions with Caspar Oesterheld, Scott Garrabrant, Sahil, Daniel Kokotajlo, Martín Soto, Chi Nguyen, Lukas Finnveden, Vivek Hebbar, Mikhail Samin, and Diffractor. Inspired in particular by a discussion with cousin_it.
Short version:
The idea here is to combine the Nash bargaining solution with Wei Dai's UDT to give an explicit model of some ideas from Open-Minded Updatelessness by Nicolas Macé, Jesse Clifton, and SMK.
Start with a prior P.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} divided into hypotheses P(x)=∑i∈IHi(X).
BATNA: updateful policy πb
Actual policy choice maximizes the product of gains from trade across hypotheses:
argmaxπΠi∈I[E(u|Hi,π)−E(u|Hi,πb)]
I wrote an essay arguing for open-minded updatelessness last year, but it was mostly motivation, and lacked a concrete proposal. The present essay fills that gap. I don't think the proposal is perfect, but it does address the problems I raised.
I'm calling my proposal Geometric UDT, due to taking inspiration from Scott Garrabrant's Geometric Rationality, although the connection is a bit weak; what I'm doing is VNM-rational, whereas Geometric Rationality is not.
Motivating ProblemYou might be persuaded by reflective consistency concerns, and conclude that if humanity builds machine superintelligence, it had better use updateless decision theory (UDT)[1].[2] You might separately be convinced that machine superintelligence needs to do some kind of value learning, in order to learn human values, because human values are too difficult for humans to write down directly.[3]
You might then wonder whether these two beliefs stand in contradiction. "Value learning" sounds like it involves updating on evidence. UDT involves never updating on evidence. Is there a problem here?
"Maybe not" you might hope. "UDT is optimal in strictly more situations than updateful DT. This means UDT behaves updatefully in cases where behaving updatefully is optimal, even though it doesn't actually update.[4] Maybe value learning is one of those cases!"
You might do the math as follows:
Puppies vs Rainbows v1Suppose that, before building machine superintelligence, we narrow down human values to two possibilities: puppies, or rainbows. We aren't sure which of these represents human values, but we're sure it is one of the two, and we assign them even odds.
In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility.
Rainbow WorldPuppy WorldMaximize Rainbows+1000Maximize Puppies0+90The machine superintelligence will be able to observe enough information about humans to distinguish puppy-world from rainbow-world within 3 seconds of being switched on. There are four policies it could follow:
- Always make puppies: this has a 50% chance of a utility of 90, and otherwise yields zero.
- EV: 45
- Always make rainbows: 50% chance of utility 100, otherwise zero.
- EV: 50
- Make puppies in rainbow world; make rainbows in puppy world.
- EV: 0
- Make puppies in puppy world, make rainbows in rainbow world.
- EV: 95
The highest EV is to do the obvious value-learning thing; so, there's no problem. UDT behaves updatefully, as is intuitively desirable!
Unfortunately, not all versions of this problem work out so nicely.
Puppies vs Rainbows v2Some hypotheses will "play nice" like the example above, and updateless value learning will work fine.
However, there are some versions of "valuing puppies" and "valuing rainbows" which value puppies/rainbows regardless of which universe the puppies/rainbows are in. These utility functions are called "nosy neighbors" because they care about what happens in other realities, not just their own.[5] (We could call the non-nosy hypotheses "nice neighbors".)
Nosy Neighbors: Technical Details
Utility functions are standardly modeled as random variables. A random variable is a function from worlds (aka "outcomes") to real numbers; you feed the world to the utility function, and the utility function outputs a score for the world.
A world, in turn, can be understood as a truth-valuation: a function from propositions to true/false. Feed a proposition to a world, and the world will tell you whether the proposition was true or false.[6] You can imagine the utility function looking at the assignment of propositions to true/false in order to compute the world's score. For example, at each location L, the utility function could check the proposition puppy(L), which is true if there's a puppy at L. The score of a world could be the total number of true instances of puppy(L).[7]
In order to model uncertainty between utility functions, what I'm really suggesting is a utility function which behaves like one function or another depending on some fact. When I describe "puppy world" vs "rainbow world", I have in mind that there is some fact we are uncertain about (the fact of "what human values are") which is one way in case humans value puppies, and another way in case humans value rainbows. The utility function encodes our value uncertainty by first checking this fact, and then, if it is one way, proceeding to act like the puppy-valuing utility function (scoring the world based on puppies); otherwise, it acts like the rainbow-valuing function (scoring the world based on rainbows).
Naively, you might think that this means "nosy neighbors" don't make sense: the score of a world only depends on stuff in that world. With our uncertain utility function, the puppy-score is only computed in puppy-world. It can't check for puppies in rainbow-world. What could I possibly mean, when I say that the utility function "values puppies/rainbows regardless of which universe the puppies/rainbows are in"?
What I have in mind is that we believe in some sort of counterfactual information. In puppy-world, there should be a fact of the matter about "what would the machine superintelligence have done in rainbow world" -- some propositions tracking this contingency.
You don't have to be a realist about counterfactuals or possible worlds (a "modal realist") in order to believe this. There doesn't literally have to be a fact of the matter of what actually would happen in a non-real world. There just have to be good proxies, to cause problems. For example, there could be a fact of the matter of what happens when someone thinks through "what would have happened in rainbow-world?" in detail. Maybe at some point someone runs a detailed simulation of rainbow-world within puppy-world. This is sufficient to cause a problem.
This sort of thing is quite difficult to rule out, actually. For example, if you grant that mathematical propositions are true/false, that is enough: mathematics will have, somewhere inside it, a specification of both rainbow-world and puppy-world (or at least, adequately good approximations thereof).
I think of such propositions as "windows to other worlds" which exist inside a single world. Nosy Neighbor hypotheses are utility functions which depend on those propositions.[8]
If you care about what happens in the Game of Life (in the abstract, not just instances you see in front of you) then your values are nosy-neighbor values. If a puppy-loving utility hypothesis checks for puppy-like structures in Game of Life and adds them to the score, that's a nosy-neighbor hypothesis.
Suppose the puppy hypothesis and the rainbow hypothesis are both nosy neighbors. I'll assume they're nosy enough that they value puppies/rainbows in other universes exactly as much as in their own. There are four policies:
- Always make puppies: 50% chance of being worthless, if the rainbow hypothesis is true. 50% of getting 90 for making puppies in puppy-universe, plus 90 more for making puppies in rainbow-universe.
- EV: 90
- Always make rainbows: 50% worthless, 50% worth 100 + 100.
- EV: 100
- Make puppies in rainbow universe, rainbows in puppy universe: 50% a value of 90, 50% a value of 100.
- EV: 95
- Puppies in puppy universe, rainbown in rainbow universe:
- EV: 95
Hence, in this scenario, UDT will choose to always make rainbows.
This shows that in the presence of nosy neighbors, the naive concern can be vindicated: UDT has trouble with "value learning".
One response to this problem might be to assign nosy-neighbor hypotheses probability zero. The issue I have with this solution is that human values may well be nosy.
Instead, I propose we treat this as a bargaining problem between the hypotheses.
Geometric UDTI'll use the Nash Bargaining Solution. This has some particularly nice properties, but the choice here is mostly out of habit, and we could certainly consider other options.
We start by choosing a BATNA (Best Alternative to Negotiated Agreement). This is a policy which we'd default to if the parties at the bargaining table (in our case, the hypotheses) were not able to come to an agreement.
My proposal is to set the BATNA to the policy an updateful decision theory would choose. For example, we could use updateful EDT, or we could use Thompson sampling based on the updated hypothesis weights. Either way, a hypothesis gets "more control" as it becomes more probable. This starting-point for bargaining gives each hypothesis a minimal guarantee of expected utility: puppy-world will get at least as much expected utility as if humans had built an updateful machine superintelligence instead of an updateless one.
Once we've chosen a BATNA, the Nash Bargaining Solution maximizes the product of the gains-from-trade:
argmaxπΠi∈I[E(u|Hi,π)−E(u|Hi,πb)]This ensures a Pareto-optimal policy (unlike the updateful policy). This can only increase the expected utility every individual hypothesis expects to get, in comparison to what they'd get from the updateful policy. You can imagine that the hypotheses are trading away some things they'd control (in the updateful policy) in exchange for consideration from others. The puppy-maximizer might produce some rainbows in exchange for the rainbow-maximizer producing some puppies, but you won't get an all-rainbow policy like we did before, because we're not allowed to completely ignore any hypothesis.
Another reasonable option is to adjust the weight of each hypothesis based on its probability:
argmaxπΠi∈I[E(u|Hi,π)−E(u|Hi,πb)]P(Hi)The probability of a hypothesis already adjusts what it gets in the BATNA, but this additional adjustment makes the solution less dependent on how we partition the probability distribution into hypotheses.
Since the chosen policy will be Pareto-optimal, Complete Class Theorems guarantee that it can be interpreted as maximizing expected utility with respect to some prior; that is, it'll still be UDT-rational. The interesting thing is that it isn't UDT-optimal with respect to our honest prior (the one we use to define the BATNA). Geometric UDT advises us to use a fair prior (in the Nash Bargaining Solution sense) instead of our honest prior. The honest prior has a chance of completely screwing over humans (completely ignoring the correct value-hypothesis even in the face of overwhelming evidence), whereas the fair prior does not.
I've framed things in this post in terms of value uncertainty, but I believe everything can be re-framed in terms of uncertainty about what the correct prior is (which connects better with the motivation in my previous post on the subject).
One issue with Geometric UDT is that it doesn't do very well in the presence of some utility hypotheses which are exactly or approximately negative of others: even if there is a Pareto-improvement, the presence of such enemies prevents us from maximizing the product of gains-from-trade, so Geometric UDT is indifferent between such improvements and the BATNA. This can probably be improved upon.
- ^
In this essay, UDT means UDT 1.1.
- ^
I'm not claiming there's a totally watertight argument for this conclusion given this premise; I'm only claiming that if you believe something like this you should probably care about what I'm doing here.
- ^
Even a simple strategy like "trust whatever the humans tell you about what they want" counts as value learning in this sense; the important thing is that the AI system doesn't start out totally confident about what humans want, and it observes things that let it learn.
- ^
This isn't a strictly true mathematical assertion; in reality, we need to make more assumptions in order to prove such a theorem (eg, I'm not defining what 'optimality' means here). The point is more that this is the sort of thing someone who is convinced of UDT is inclined to believe (they'll tend to be comfortable with making the appropriate additional assumptions such that this is true).
- ^
This phenomenon was discovered and named by Diffractor (private communication).
- ^
In logic, it is more typical to understand a world as a truth-valuation like this, so that worlds are functions from propositions to {true, false}. In probability, it is more typical to reverse things, treating a proposition (aka "event") as a set of worlds, so that given a world, you can check if it is in the set (so a proposition can be thought of as a function from worlds to {true, false}).
This distinction doesn't matter very much, at least not for our purposes here.
- ^
This particular utility function will not be well-defined if there are infinitely many locations, since the sum could fail to converge. There are many possible solutions to this problem, but the discussion goes beyond the present topic.
- ^
We can also get a nosy-neighbor effect without putting terminal utility on other worlds, if we believe that what happens in other worlds impacts our world. For example, maybe in puppy-world, a powerful being named Omega simulates what happens in rainbow-world, and creates or destroys some puppies accordingly. Care about what happens in other worlds is induced indirectly through beliefs.
Discuss
announcing my modular coal startup
I'm writing this post to announce my new startup, C2 AI, focusing on small modular coal power plants, which will be used to power AI models that design new small modular chemical plants.
the conceptYou probably know that there are lots of startups working on "small modular reactors" (SMRs) for nuclear power. Wikipedia says:
As of 2025, there were 127 modular reactor designs, with seven designs operating or under construction, 51 in the pre-licensing or licensing process, and 85 designers in discussions with potential site owners.
Clearly, so many investors and startup founders wouldn't simply be mistaken about the economics of SMRs. With that in mind, it occurred to me that the key components of nuclear power plants are mostly the same as those of coal power plants and chemical production:
- pressure vessels
- heat exchangers
- electric motors
- storage tanks
- turbines
- pipes
China is still building chemical plants and coal power. The US government is, at least nominally, trying to reindustrialize. But the currently proposed industrial projects are all those outdated large designs, instead of small modular ones. So clearly there's room for a new startup here.
a view from inside BASFFor some reason, big chemical companies keep making their chemical plants as big as possible, even though we all know that small reactors that can be mass-produced and shipped easily are cheaper and just generally better. I know somebody working at BASF, so I thought I'd get his opinion on where they're going wrong. He said:
The biggest facilities, like large ammonia or methanol or ethane cracker plants, typically make 1 to 2 million tons per year. In terms of potential energy handled per year, large power plants are 2-3 times larger, so there's probably still room for scaling up, but chemical production is more limited in terms of regional demand and also how much money investors can get together for a project.
Ultimately costs aren't a matter of scale per se, they're a result of manufacturing methods, and scale is important in that it determines the choice of those. For example, large container ships can be quite cheap per mass despite small production numbers compared to consumer products, but that cost and productivity is not inherent in their size. As you noted, America has more shipbuilding workers than Japan, despite producing about 1% as much ship per year, and US military ships are correspondingly expensive. That's because the manufacturing methods used are different despite the similar scale and products.
There are advantages to being able to transport components over roads, but fairly large things can be moved, and the issues with that type of transport don't seem to outweigh the advantages of larger sizes.
If you look at small mass-produced pressure vessels, like a scuba tank or a CNG tank for a car, they're not particularly cheap compared to larger ones, but I think the bigger issue is piping. You see, a big chemical plant is not a big truck, it's a series of tubes. Somebody needs to weld or bolt all the connections together, and it's easier to weld one large pipe than 100 small ones.
And of course large turbines are more efficient, due to tip leakage and so on. That can also be a driving factor for scale, especially for something like a modern air liquefaction plant.
Obviously, a random employee at an incumbent company like BASF doesn't understand construction or economics on the same level as Silicon Valley people. I have a whole collection of insights from US tech founders for when I need some inspiration; here's a random example:
Human progress was measured in GDP per capita. The next era will be measured in GDP per humanoid
Actually, here's a more relevant quote, something related to energy:
Do you suppose ships will never be powered by fusion?
That kind of visionary thinking is exactly what makes Paul Graham such a legendary VC.
why coalI've explained the basic concept for my startup, but you might be wondering: why coal power in particular? There are a few reasons:
AIAI is the hot thing in America right now. In terms of future economic prospects, it's kind of the only thing: 92% of recent US GDP growth was from AI datacenters. But that's OK: I have it from a reliable source that if we just keep building AI datacenters, everything will be work out somehow.
The fact that these datacenters need a lot of power is widely understood. So, some investors are interested in ways to solve that issue.
VCsCurrent VCs are mainly interested in potential trillion-dollar companies. These days, most of the money is owned or controlled by people wealthy enough that a small chance of a small slice of a $1 billion IPO just isn't worth getting out of bed for. Electricity is a big deal, so the potential market size is large enough for big investors to care.
comparison vs nuclearThere's already from VCs interest in SMR startups. That means, at least for some investors, we don't have to go through a whole complicated argument - all we have to do is demonstrate the advantages of burning coal as a heat source over small nuclear reactors. China helps make that argument: while it has built a lot of nuclear reactors, they've built much more coal power than nuclear power, despite not having the nuclear power regulations that that US does. And some of those nuclear plants were needed for its large nuclear weapon buildup.
That makes for a simple argument, which is important when you're dealing with VCs. They have lots of important things to think about and can't afford to get bogged down in minor technical details the way I do.
global warmingThere used to be a lot of concern about reducing CO2 emissions, but global warming has fallen out of fashion among political leadership. Germany is burning more coal, Trump is advocating for coal specifically, and CO2 levels have increased by a record amount. The current political winds favor coal.
the future is carbon-brightAI is the future and it's time to get on board the freight train or be left behind in our smoke.
We're currently closing our pre-seed round, but we're always looking for 100x engineers who want to ship molecules. If you're tired of just moving bits and want to build the physical infrastructure for the AI revolution, DM me. We're hiring.
Discuss
Halloween Tombstone Simulacra
I’ve been totally lacking energy since Halloween, so I’ve decided to rant about Halloween decorations to help me get back my Halloween spirit.
I’ve noticed many people decorate their lawns with fake tombstones. Some look kind of like actual tombstones, but many look something like this:
You know they’re supposed to be tombstones, but they don’t even really resemble actual tombstones:
I imagine the first Halloween tombstones must have looked like actual tombstones, but over time there’s been a sort of conceptual drift, where the Halloween tombstone took on an exaggerated appearance (“R.I.P.” in huge letters being common on Halloween tombstones, but horribly offensive on real ones) and incorporated other spooky elements, like skulls and grim reapers.
If real tombstones correspond to simulacrum level 1, or reality itself, then fake but accurate tombstones would be simulacrum level 2 (a lie about reality), and Halloween-themed tombstones that no longer try to seem realistic would be simulacrum level 3 (pretending to pretend to be real). I’ve seen some pretty egregious examples that have lost almost all pretense of being an actual tombstone (simulacrum level 4?), so overtaken they were with spiders and spooky fonts and so on. I wish I’d taken a picture of a particularly horrible example on my street, but you can get the idea if you look at the purple ones here:
These ones at least have a tombstone shape. Some don’t even have that.
If you challenged someone who had never heard of Halloween to make a spooky graveyard in their front yard, they would never in a million years design tombstones like this. These could only have been produced by a gradual, recurrent memetic process whereby decorations cease to represent real world items, but rather simply reflect an overall Halloween-y aesthetic, much in the way a microphone too close to a speaker will recurse to create screeching feedback that eventually has nothing to do with the original sound.
It’s not just the tombstones. Halloween ghosts only resemble undead spirits because that’s what we know they’re supposed to be. They’re cute little white or green blobs with eyes, and could just as easily be snot-men to the uninitiated. You sometimes see multi-colored Jack-o’-lanterns, even though pumpkins only come in one color. And spider webs as thick as ropes. Signs of cats wearing witches’ hats with quirky slogans underneath. Octopus skeletons. Inflatable green women with bubbling cauldrons, an archetype that’s lost all connection with the alleged practice of cursing others with black magic — a practice still punishable by death in parts of the world, but seen here only as a quaint fiction.
Just as a speaker and microphone create feedback and approach a frequency based on the resonant characteristics of the room, so to have Halloween decorations reverberated through their suburban environment until they finally matched the friendly and unobjectionable tone of their surroundings. What started as a solemn time to remember the dead, and transformed into a celebration of that which we fear, has now further transformed into a celebration of the aesthetic of Halloween itself. Instead of an authentic representation of death and fear, we get black and orange; Plastic spider rings and black cats and big eyeballs for some reason. And candy. The holiday has ceased to have a point, and has become simply “Halloween”.
I think which kinds of decorations you prefer comes down to which kind of Halloween you would prefer. Which simulacrum level you think the holiday should operate on. Would you rather a realistic, scary Halloween, where we learn to laugh in the face of death? Or the more modern “Halloween party” Halloween, which serves as an aesthetic backdrop for collecting candy and having fun? Put like that, the modern thing seems like a cheap degradation of the original. But despite its spiritual degradation, I can’t help feeling some attachment for the modern Halloween-as-Halloween, with its trick or treating and horny costumes. Rather than being merely an expression of the world, Halloween has become something that exists in its own right, for better or worse.
Discuss
Halfway to Anywhere
“If you can get your ship into orbit, you’re halfway to anywhere.” - Robert Heinlein
This generalizes.
I.Spaceflight is hard.
Putting a rocket on the moon is one of the most impressive feats humans have ever achieved. The International Space Station, an inhabited living space and research facility in Low Earth Orbit, has been continuously inhabited for over two decades now, and that’s awesome. It is a testament to the hard work and brilliance of a lot of people over a lot of time that this is possible.
Look ma, no selfie stickAnd we’re not done yet. There are footprints on the moon today, but there are also robots leaving tracks on Mars and satellites have taken photos of the rings of Saturn and the craters of Pluto. Voyager has stood waaay back and taken a family picture of the solar system. Give us a rocket and a place to send it, and we’ll go anywhere we’re curious about.
There’s a funny property of space flight though, which is that half the work happens very close to home and is surprisingly similar no matter where you’re going.
When you boil spaceflight down into the very abstract basics, it takes a certain amount of force to get around. That movement is measured in Δv, or delta-v. In plain English, delta-v is the change in velocity; when I step on the gas pedal in my car, the engine and the tires conspire to increase my velocity and make me go forwards. When I’m in my kayak and dip my oar in the water to row, I’m applying delta-v to change my velocity and go faster or slow down or just change direction. In a rocket, delta-v comes from the engine. As anyone who has tried to get out and push a stalled car into movement, I need a certain amount of energy just to move at all. Just so with rockets.
It turns out that in order to get to the moon or Mars or basically anywhere else, you wind up going to Low Earth Orbit (LEO) first. You don’t really save much work trying to do what’s call a Direct Injection — drawing a line up out of atmosphere and to your target that doesn’t do a loop or two around the globe. It takes about 9,400 meters per second of delta-v to get to LEO, and you can save something like 300 meters per second (m/s) on the upper end. Sometimes you save more like 50 m/s. No matter where you’re going, hauling your rocket off the ground and through the air and into space is going to take you at least 9 kilometers per second (km/s.)
Keep that 9.4 km/s number in mind for a second.
Here’s a table of places you might want to go, along with their delta-v costs starting from LEO:
Destinationdelta-vMoon orbit4.1 km/sMars orbit5.2 km/sJupiter orbit6.8 km/sPluto orbit9.4 km/sEscape the solar system8.7 km/s
Heck, if you’re willing to take a couple hundred thousand years, 9 km/s will get you to Alpha Centauri. Putting footprints on Pluto isn’t that much harder than going to the moon.
Nobody ever treats my muddy boot prints as historic artifacts.Once you've noticed that so much work happens during the early stages, you notice that this pattern shows up elsewhere.
II.The general principle here is that you spend most of your effort in surprisingly generic setup work.
Let's say you want to cook eggs for breakfast. You go to the grocery store and get eggs, you remember to pay your gas or electric bill so the stove works, then you make the eggs, then you clean the pan and your dish. Compare wanting to make sausage for breakfast. You go to the grocery store and get sausage, you remember to pay the gas or electric bill so the stove works, then you make the sausages, then you clean the pan and your dish. Most of the work isn’t the sausage or egg making!
Maybe you’re a shipping company and you want to deliver a piano from Tokyo to Boston. You need to wrap it in a box, you need to find a ship, you need to label the box and put it on the ship, you need someone to sail the ship across the world, you need someone to take the box off the ship at the other end, you need someone to read the label and figure out what address in Boston it’s going to, you need someone to load it on a truck and drive the truck to the address, you need someone to unload the truck and move the box to where it’s going in the building. You’ve recreated the global logistics network, involved at least three different people in at least two different countries, and you don’t save much effort if you decide to deliver the piano to Australia instead, or if you decide to get out of the piano business and get into wholesaling couches instead. “On a ship off the coast of Japan” is halfway to anywhere in the world.
Or, for a topic near and dear to my ACX Meetup Czar heart: If you want to run a conference on science fiction, you have to book a venue, you have to find some presenters, you need a schedule attendees can view, you need some way to sell tickets and take people’s money, you need to respond to emails where people can’t find the venue or where they have a complaint about another attendee, you probably want to send out a feedback form.
Notice how none of the steps in that science fiction conference go away if it’s, say, a bioengineering conference, or a trade show, an academic conference, or a month long retreat for internet bloggers? It’s the packaging, and the packaging is a lot of work. There is real work to be done in getting from LEO to Mars; you have to move sausage around more to fry it up than you do an egg, you have to pack a piano more carefully than you do a couch, and a retreat might benefit from the extra step of matching coaches to residents. But surprisingly often it turns out the most of the effort spent on a goal seems like setup, a prelude to the real effort.
Getting to orbit is halfway to anywhere.
III.There are some implications of this I don’t think are properly appreciated. In no particular order:
Expertise in one field generalizes. Or rather, it’s more accurate to say that some kinds of expertise slice across fields; a skilled boat captain can be useful to many organizations, as long as they want to move things by boat. Blur your eyes a bit, step back, and check sometimes what skills you actually need and whether seemingly different organizations must be doing the same thing as you for some part of their supply chain.
Motivation can be hit hard by everyone in the setup. Neil Armstrong is famous, Buzz Aldrin is famous. Evan Michael Collins, the pilot, gets mentioned in the first paragraph of the Apollo 11 Wikipedia page. You have to scroll down a ways to get to the guy in charge of the launch pad construction. (Kurt Debus.) Good thing they didn’t do it for fame, but at least the launch pad is still pretty clearly about rockets. Be wary of burnout if you have too much of the team working on a conference who never get to attend.
Common links in the chain are good places to be in some ways. There’s an aphorism I heard growing up, “selling shovels in a gold rush,” that touches on this. They're also good places to optimize. The software industry rightly puts incredible attention on frequently used function calls, and if you're looking at part of your regular work that gets used for every project, I suggest trying to make that part sing.
Pace yourself based on the actual work needed, not the most iconic moments. If you got worried two thirds of the way through the Ascent stage of a Mars launch because you’re halfway out of fuel, you’d never get anywhere. You’re 130 km done out of a 200 million km trip, you’re not even done with the preparation of getting to LEO, and that’s fine.
(Don’t run yourself out of fuel before you leave Earth orbit, and don’t burn yourself out before the conference starts. And leave some energy for getting home again and for wrapping up with the venue or feedback forms. But don’t get too worried if you’ve done a lot of work wrapping boxes and yet no pianos are in Boston yet.)
I'm a sailor peg, and I've lost my leg,Climbing up the launch pad, I lost my leg!
I'm shipping up to Luna, whoa,
I'm shipping up to Luna, whoa
Small changes late in the process can be expensive to fix. Imagine getting to LEO and realizing you forgot the Mars Rover back on the loading bay floor, or even that someone didn’t add the right kind of chemical analyzer. You can’t exactly turn around and swing by NASA real quick to pick it up. Likewise, many parents confirm that getting the kids into the car and getting out the driveway is a big part of the hurdle to going places; the twenty feet to the end of the driveway takes more effort than the twenty miles to the grocery store. They too don’t want to turn around and hope nobody gets antsy and unbuckles for a bathroom trip just to grab the nice grocery bags.
It’s one of the most frustrating things for me as I’ve worked on larger and larger projects, how much effort and energy I burn on activities that are like icebergs under the surface.
When I feel that way, I try to remember that getting to orbit takes the most fuel. Once I’m there, I’m halfway done.
Discuss
People Seem Funny In The Head About Subtle Signals
WARNING: This post contains spoilers for Harry Potter and the Methods of Rationality, and I will not warn about them further. Also some anecdotes from slutcon which are not particularly NSFW, but it's still slutcon.
A girl I was seeing once asked me to guess what Hogwarts house she was trying to channel with her outfit. Her answer turned out to be Slytherin, because the tiny gem in the necklace she was wearing was green. Never mind that nothing else she wore was green, including the gems in her earrings, and that colors of all the other Hogwarts houses appeared in her outfit (several more prominently), and that the gem itself was barely visible enough to tell the color at all, and that she wasn’t doing anything to draw attention to the necklace specifically.[1]
I wonder, sometimes, just how much of the “subtle social signals” people think they’re sending are like that case - i.e. it’s basically a game with themselves, which has utterly zero signal for anyone else.
Subtlety to the point of invisibility clearly happens more than zero percent of the time; I have seen at least that one unambiguous case, so existence is at least established. But just what should our priors be on any given subtle signal being received, in various domains?
Just How Subtle Is The Signal?Hints in WritingA little blurb from Eliezer on his domain of unique expertise:
So the big thing to remember about all of HPMOR is that, just as in the original books, the Defence Professor was Voldemort, and I thought it would be way way more obvious than it was to the readers.
[...]
The number one thing HPMOR taught me as an author is that you are being so much less clear than you think you are being. You are telegraphing so much less than you think. All of the obvious reads are so much less obvious than you think. And if you try to have subtle hints and foreshadowing laced through the story, clues that will only make sense in retrospect, that’s way too subtle.
Instead, you should just write very plainly reflecting the secrets of the story. Don’t try to hide them, but don’t explicitly spell them out either. That will be just right.
Fun subquestion for those who’ve read HPMOR: on the first readthrough, did you correctly guess what “might be big and grey and shaped like a cauldron”?
I use this mainly as a prior for thinking about more common subtle social signals.
Hints in FlirtingFor subtle hints involved in flirting, there’s an obvious test: take the usual classroom full of undergrad psych students, pair them up, have them do some stuff, then ask them afterward (a) whether they were flirting with their partner, and (b) whether their partner was flirting with them. Of course this has been done before, producing a paper with far too small a sample size to be taken very seriously, but it at least gives a qualitative baseline. Results:
Partner GenderPartner Actually Flirted?Accuracy At GuessingFemaleFlirted18%FemaleDid not flirt83%MaleFlirted36%MaleDid not flirt84%
Key fact which is not in the table: most people are not flirting most of the time, so naturally people usually guess that they are not being flirted with; thus the high accuracy rates when one’s partner is not flirting.
The interesting part is how often peoples’ flirting was picked up upon when people did flirt. And, uh… the numbers are not real high. When males flirted, it was picked up 36% of the time. When females flirted, it was picked up 18% of the time. And while I do not think one should take these numbers very seriously, I will note that female flirting was also detected 17% of the time when the female did not flirt, so at least according to these numbers… female flirting gave basically zero signal above baseline noise.
Males flirting at least managed to send some signal above baseline noise, but in most cases their signals still were not received.
So insofar as we expect this one too-small study to generalize… it sure seems like most flirting, and especially most female flirting, is somebody playing a game with themselves which will not be noticed.
Hints in OutfitWe covered one example at the beginning. More generally… it seems like there’s something weird going on in a lot of peoples’ heads when it comes to their clothes.
At least among people who put in any effort to their wardrobe, you’d think the obvious question influencing clothing choice would be “what vibe do other people get from this?”. Personally, thanks to slutcon a few weeks ago, I now have a fair bit of data on what vibe people get from my wardrobe choices (and bearing more generally), at least for sexual purposes. The consensus feedback said roughly “creepy/rapey, but in the sexy way”, which matches what I’m going for pretty well (think Rhysand, or any other romance novel guy matching that very popular trope[2]). Not perfectly, and the mismatches were useful to hear, but I’m basically on target. And it’s a very popular target, apparently.
I don’t know what is going on in other peoples’ heads, when it comes to their wardrobes. Most people look generic, which makes sense if you’re not putting much thought into it… but most people I know who do put thought into it also look generic! There are exceptions, and they are notable. But the typical case seems like the opening example: someone will try to channel House Slytherin by wearing a tiny green gem in an outfit which otherwise doesn’t have anything to do with Slytherin.
Oblivious?Of course one reaction to all this is "John, maybe people do pick up on things and you're just particularly oblivious". And that's reasonable on priors, it's what I thought for a long time too. But man, the evidence just keeps pointing in the other direction. I personally have repeatedly received feedback (e.g. in circling) that I'm unusually good at reading people. When I go look for data (like e.g. the flirting study above), it agrees. When I look for particularly unambiguous examples (like e.g. the green gem signalling Slytherin), I find them.
… anyway, the weird thing is what happens when I point this sort of thing out.
Emotional Investment?It feels like people are really attached, for some reason, to pretending that their subtle signals are actually useful somehow.
Point out that women attempting to flirt seem to almost universally fail to send any signal at all, even to other women (the paper above did check that), and someone will say something like “well, mind-reading seems like a good thing to select for, for dating purposes”. Point out that most peoples’ clothes don’t really manage to signal what they intended even when they’re trying, and someone will say something like “well, it’s largely about signalling to oneself, e.g. to build confidence, so it doesn’t matter if other people get the signal”. And, like… I roll to disbelieve?
I opened with that example of the girl trying to signal Slytherin with the gem. I could see her reaction when I did not immediately guess which house she was going for, and quickly tried to navigate out of the situation. She was not expecting that nobody else could pick up on the signal.
Another slutcon example: some time after one encounter, I learned that a girl was trying to strongly signal “yes” to me with her body, without saying anything. And she did that by… looking like she was about to fall asleep. Just kind of out of it. This girl was not trying to select for mind reading, she intended to send a signal, she was just utterly miscalibrated.
C’mon, guys. People are not playing these games for some clever reason which makes sense in the absence of other people picking up the signals. Yet at the same time, people seem really emotionally invested in their so-called signals. Somehow, this seems to trigger its own flavor of rationalization or something. My best current articulation is: a person builds their whole identity in their heads, and imagines that other people see them that way, and is happy whenever somebody makes them “feel seen” by picking up on that identity (... albeit probably sometimes accidentally), and is disappointed when people don’t pick up on it. And so it would be a pretty massive disappointment if some huge chunk of the identity-signals the person thought they were sending just… weren’t actually received. So they grab for some story which somehow keeps their identity intact.
My response to that is: think how great it would feel if you were just… seen by the vast majority of people around you, all the time. In order for that to happen, you need to first accept that you’re not already seen most of the time, and it’s a thing you can change via your own actions (as opposed to wishing that other people become better mind-readers), so maybe do that. Send some actual signals.
- ^
I did not critique the dubious signal-to-noise ratio to her at the time.
- ^
I am lucky in that my tastes match a popular romance novel trope well, so leaning into it feels intuitive and genuine and like letting out my real self.
Discuss
Why I believe consciousness is fundamental rather than structural/neural.
Epistemic status: Educated opinions and mostly explanation of the terms and semantical separation of them from the main one. But mainly a paradigm shift that seems Occamian.
Reason for this topicMy honest reason for writing this is to create a ladder of considerations to get to my main point, and this is just the first step (I have 2 more in my mind). However, if this first step is incorrect in some way, I would be thankful to understand, why, with the help of you, and not waste time in the continuation.
My reason to believe that consciousness matters is that it seems that in order to make any truly ethical decision we need to do it with respect to how it will affect what any consciousness experiences.
What I mean by consciousnessNeuroscience, psychology, etc. use several definitions for consciousness each referring to the different thing from the most trivial ones (Ability to react to the stimulus) to some more advanced ones (The self-consciousness; Ability to experience qualia) And because of the triviality of many of them, but mainly because my fundamental interest is in the "Ability to experience qualia" thing, I will refer to this definition until the end and forget about other. So once again, by consciousness I will mean "Ability to experience qualias".
What I don't mean by consciousness: Self-consciousness, Personality, Visual processingSelf-consciousnessAlthought having consciousness is required to be self-conscious, it doesn't mean that if something is not self-conscious, it is not conscious at all. I imagine the self-conscious brain and its structure as (except other things) some kind of sophisticated metaphorical mirror for actual consciousness.
PersonalityI would agree that personality is set of behavioral patterns including perceptual and mental behaviors, all represented by neurons. And it is usually unique from the one person to the other. And this is not what I mean by consciousness.
Visual perceptionI agree that visual perception is complex and that it requires some complex structure as the brain is to translate the streams of light of three different wavelengths to coherent percept. However, consciousness is again different. Consciousness is not about processing, but about experiencing.
Zombie argumentI know the EY's Zombie article, and despite I admit I didn't have a patience to read it all, I think I understand its key point and I basically agree with it: If our consciousness wouldn't affect our brain and our body, we wouldn't be able to frankly refer to it at all.
Stripped of all neuronsFor me, the interesting question is, how would it feel to have consciousness stripped of other processes, which I agree to be structural/neural: Self-consciousness, Perception of time, Visual processing, Verbalization of ideas, etc.
In my imagination, the consciousness stripped of all the the other functions, that are enabled due to the existence of the complex signal structure called brain, would be either a constant experience of a color or a random stream of colors without ability to think of them or verbalize them, or memorize its sequence, or even imagine a different color than we perceive, or to imagine what object we see when we see that color. However, experiencing would be still there.
Answer to my title questionMy reason to believe that consciousness is fundamental rather than structural is that it seems to me more Occamian to believe that consciousness is eternal, and that each "arising" is actually becoming aware of reality and time and becoming able to access memory, etc. (due to brain ability) and each "ceasing to exist" is actually interrupting these processes, than to add special principles according to which it will truly arise and truly ceases to exist.
But my main reason is, that everything that I experience is either real or illusory, but even if it was illusory this illusion is in form of qualia. For me, the ability to perceive qualia is inevitable.
Discuss
Technical AI Safety Roles at Open Philanthropy
Open Philanthropy’s Technical AI Safety team is recruiting for grantmakers to support research aimed at reducing catastrophic risks from advanced AI. We’re hiring at all levels of seniority.
Open Phil’s TAIS grantmaking has grown rapidly, from $40M in 2024 to being on track for over $130M in 2025 (in addition OP’s other AI safety grantmaking in AI governance and policy and capacity building). We plan to expand both the quality and quantity of our support in 2026 and need more specialists to help us achieve this. In one of these roles, you'd likely be in charge of choosing where tens of millions of dollars of high impact research funding are directed annually.
Learn more and apply here by November 24. Know someone who might be a good fit? Refer them to us, and receive a $5k bonus if your referral results in a hire!
Discuss
A 2032 Takeoff Story
I spent 3 recent Sundays writing my mainline AI scenario. Having only spent 3 days on it, it’s not very well-researched (especially in the areas where i’m not well informed) or well-written, and the endings are particularly open ended and weak. But I wanted to post the somewhat unfiltered 3-day version to normalize doing so. There are also some details that I no longer fully endorse, because the act of doing this exercise spurred me to look into things in more detail, and I have updated my views slightly[1] — an ode to the value of this exercise.
Nevertheless, this scenario still represents a very central story of how I think the future of AI will go.
I found this exercise extremely useful and I hope others will carve out a few days to attempt it. At the bottom there’s: (1) my tips on how to do this scenario writing exercise, and (2) a list of open questions that I think are particularly important (which I made my intuitive guesses about to write this scenario but feel very uncertain about).
Summary2026-2028: The Deployment Race Era
- AI companies are all focused on monetizing, doing RL on LLMs to make products, and this is ultimately data bottlenecked, so it becomes a ‘deployment race’ because having more users/deployment gets you more money, more compute and more data. Some lean towards B2C (through search, social media, online shopping, etc.) and others lean towards B2B (enterprise agent plans)
- China is 15y behind in lithography but wakes up much harder on indigenizing lithography once they realize they are falling behind and their chips are bad. They start moving at 2x the speed of ASML history.
2028-2030: The 1% AI Economy
- Architecture innovations help make AIs more useful and reliable
- AI starts becoming a top 4 social issue for many
- Household robots enter their ‘Waymo-2025’ phase in reliability and adoption
- China 5x ahead in energy, talent and robots but 5x behind in capital and compute.
- China cracks mass-production-capable immersion DUV.
2030-2032: The 10% AI Economy
- ‘Magnificent 4’, US tech giants concentrate into fewer AI winners and have >$10T market caps
- AI is top social issue, 7-8% unemployment
- Robots doing several basic physical service jobs
- China cracks mass-production-capable EUV
2032 Branching point
- Superhuman Coder milestone (SC) is reached, and the story branches based on takeoff speed: fast (data efficient, brain-like algorithms) and slow (continual learning, data bound).
- Major disclaimer: In one branch the US wins but eventually loses control over the AIs, and in the other China wins but maintains control. This is not because I think China is any more likely to prevent AI takeover a priori. Rather its downstream of correlations with takeoff speeds, my full explanation here:[2]
Fast takeoff branch: Brain-like algorithms
- After the SC milestone is achieved, the AIs already have very good research taste, and one more month of improvements leads to full AI research automation.
- The AI helps invent a new algorithmic paradigm: Highly data efficient brain-like algorithms.
- The US is in lead and does Plan C,[3] the company is in charge of how fast to go and the USG not completely in the dark but pretty uninformed about what is going on at the companies.
- Recursive self improvement happens very fast, reaching ASI in a few months.
- The alignment strategy is to train in a love for humans. This somewhat succeeds but mostly fails, and the AI mainly has some very weird reward correlates and proxies, but is also susceptible to value drift.
- Humans become like the Toy Story toys, AI keeps them around (on earth) because they are fond of them, but soon forget about us and build up space tech and go out turning the rest of the universe into a weird optimal wire-heading mesh, 100y or so later value drift enough to come back and turn our solar system into the wire-heading mesh too (i.e., ‘throw us out’).
Slow takeoff branch: Continual learning
- SC achieved but is 18 months or so away from ASI by default
- There is a new paradigm which is online learning based, so it is quite data bottlenecked, and research taste is very important.
- China edges into the lead due to outproducing compute (largely by outproducing them on robots) and out-deploying the US, and has more total quality adjusted research taste (5x more researchers in sheer quantity, 2x more quality adjusted)
- China effectively does Plan B.
- They were already centralized anyway, and the US sabotages them hard, leading to a year or so of back and forth sabotage that escalates into a drone/robot centric war that China wins due to industrially exploding harder than the US, including in compute production.
- By the end China has ASI sooner and undermines nuclear MAD before war escalates to nuclear.
- They were already centralized anyway, and the US sabotages them hard, leading to a year or so of back and forth sabotage that escalates into a drone/robot centric war that China wins due to industrially exploding harder than the US, including in compute production.
- China aligns AI to CCP values
- China essentially turns Earth into a museum and terraforms mars. It lets other countries including the US keep their land.
- China starts expanding into space, owns our solar system and takes 95% of the galaxies. There is a mild cult of personality around CCP leader but mostly this isn’t enforced, everyone has bigger fish to fry exploring space.
- They donates 5% of galaxies to the rest of the world’s population, incl. some US citizens and implement basic rights, including banning nasty stuff like slavery and torture.
- They punish some US war generals and people they deem acted recklessly/dangerously/unethically in the ‘before times’ but it’s capped upside based (they miss out on galaxies) and not downside based.
- [Daniel said the following about this ending and I agree with him: “IDK about the ending, I think it’s plausible and maybe the modal outcome, plus I think it’s the thing to hope for. But boy am I quite worried about much worse outcomes.” To emphasize, this is tentatively my (Romeo’s) ~modal view on what that China does with an ASI-enabled DSA, but it could look much worse and that is scary. I think it is important to emphasize though that believing in a much worse ending being more likely probably requires believing something adjacent to “the leaders controlling the aligned Chinese ASI terminally value suffering,” which doesn’t seem that likely to me a priori.]
In the latter half of 2025, US AI companies focused on using RLVR (reinforcement learning with verifiable rewards) to make super-products out of their AI models. Three or four leading companies are emerging as winners with massive revenue growth, and fit in two major categories, those that prioritized consumer applications (B2C) and those that prioritized enterprise applications (B2B).
- The Consumer Giants (B2C)
- These companies have built SuperApps. Think ChatGPT but hyper-monetized, integrated with shopping websites and ads for free users.
- The Enterprise Giants (B2B)
- These companies have built Mega-Enterprise-Plans. Think Claude Code reliability and usefulness but for everything on a computer. Docs, Sheets, Slides, Search, etc. These are still unreliable, and – just like Claude + Cursor Pro in 2025 – it’s actually pretty unclear if it even makes people more productive overall, but everyone is on it and everyone is hooked.
Most companies don’t fit cleanly into only one of these categories, i.e., some have done a pretty even mixture of both, but those that pushed harder on a single one have concentrated a lot of the market share. Between the top 4 US AI companies, their combined AI-only revenues approach $100B annualized revenue (up 4x from mid-2025), almost 0.1% of World GDP.
Jul-Dec 2026: The AI National Champions of ChinaChina is increasingly worried about being left behind in AI but the Politurbo doesn’t have a confident understanding of what they should do. They see US AI companies benefitting from a snowball effect of more money → better AIs → more money, in a cycle they worry they may never be able to catch up to.
The mainstream position had been that the buildout of industrial capacity and electricity would win out in the long term. But with US AI companies booming, they worry their mounting capital and compute advantage might win out.
In 2025, they compelled companies to start using domestic AI chips, effectively banning US AI chip imports. What changes by late 2026 is that after a year of their AI companies facing bugs upon bugs and trying to build up CUDA-like software from scratch, and a growing understanding of how much more cost and power efficient western chips are, they realize just how far behind their domestic AI stack is.
This spurs China to double down on subsidizing their domestic supply chain, across chip designers, chip manufacturers, and semiconductor equipment companies, increasing government spending from a run rate of around $40B/year in 2025 (1% of the national budget) to $120B/year (3% of the national budget).
This effectively creates a series of AI National Champions with near-blank-cheques across their AI chip supply chain:
- SME Equipment and Lithography: NAURA, AMEC, SMEE, SiCarrier
- Memory: CXMT, YMTC
- Chip design: Huawei, Cambricon
- Chip manufacturing: SMIC, Huawei
Even without the step up in subsidies, these companies were growing fast and making speedy progress. In the most important area where China is lagging (photolithography), they also get increased prioritization from state espionage to carry out cyber attacks on the Dutch leader ASML. With the government subsidies they are also able to poach increasing amounts of talent from Taiwanese, Japanese and Korean companies by offering mega salary packages.
Jan-Jun 2027: The Deployment Race EraThere is a snowball effect happening in the US AI ecosystem – and the effect is two-fold. Better AI products have led to more money, which leads to all the usual benefits of being able to buy more inputs (compute and labor) to build even better AI products. But there’s an additional, more powerful feedback loop happening, which is that better AI products means more users, and more users means more feedback data for training future models on – a resource companies are increasingly bottlenecked on in the RL product paradigm.
If you have 100 times more users on your AI agent, you can collect 100 times more positive or negative feedback data from all aspects of the user’s use of your app – what they say while giving instructions, their tone, whether they end up changing something later – all sorts of juicy data, and it is becoming worthwhile to filter this data and use it to train into the next versions. Some users navigate three pages into the settings menu to toggle the “off” button for letting the companies train on this data, but that’s a tiny minority of users.
It’s not a ‘first-to-market takes all’ dynamic – many early startup AI agents (and for that matter all kinds of first-to-market AI apps) got totally left behind. Instead it’s a game amongst the AI company giants (the ones with enough resources to make use of the swaths of data), and more of a ‘first-to-100-million users takes all’ dynamic – once you hit a critical threshold of being the most popular app in a specific domain, the snowball effect from the user data is so strong that it is hard for anyone else to catch up.
The AI companies had seen this dynamic coming and are in a deployment race. This motivates companies to spread out a network of smaller inference-optimized datacenters across the world to have low latency and high throughput to as many large markets as possible. Mostly the deployment data is used in product-improvement focused RL, but the AI companies are also exploring how to leverage the data to make their frontier AIs more generally intelligent.
Jul-Dec 2027: China’s Domestic DUVChina’s lithography efforts have cracked reliable mass production of 7nm-capable DUV (deep ultraviolet lithography), allowing them to independently make US-2020-level chips en-masse.
In 2023, China’s Shanghai Micro Electronics Equipment (SMEE) had announced a 28nm-class Deep Ultraviolet (DUV) lithography tool (the SSA/800-10W), 21 years after they first started working on lithography in 2002. This is a milestone the Dutch company ASML had achieved in 2008, 24 years after their founding date.
Despite the lack of reports of volume production using SMEE systems, TSMC also didn’t use the 28nm-class Dutch machines until 3 years later in 2011. In other words, as of 2023, China was probably around 15 years behind ASML in photolithography.
With the benefit of being able to pull apart ASML DUV machines, hiring former employees of ASML and its suppliers, “knowing the golden path,” multiple public accounts of cyberattacks (2023, 2025), and the more recent spending boost, China is now moving through the lithography tech tree twice as fast as ASML did. ASML only spent around $30B (inflation adjusted) on R&D in total from 2000 through 2024, and around 300,000 employee-years. There are now 2 separate lithography efforts in China each spending north of $10B/yr with about 10,000 employees. So by 2027, they have reached the milestone of mass production with 7nm-capable DUV machines, around 13 years behind ASML. At this pace, they are on track for 5-nm capable EUV in 4 years.
Side note: After writing this time period, I came across these relevant Metaculus markets, which seem to agree with these rough lithography timelines being plausible. Nevertheless, I think I have also updated my lithography timelines longer than this scenario depicts after talking to some experts. It’s closer to my 30th percentile now, no longer 50th.
Jan-Jun 2028: Robotics Foundation ModelsRobotics had long suffered from a lack of cheap, scalable, quality training data. AI world models built off of physics-realistic video generation have changed this in 2028 and are allowing techniques like Nvidia’s R2D2 (training robots in simulated environments) to make big leaps.
In mid-2025 the best AI world model simulator was Google DeepMind’s Genie 3 which could generate worlds on the fly for multiple minutes at 720p, but it was very expensive and had shaky physics. Genie 6 (and some competitor models) drop in late 2027, and they can generate worlds in real time for multiple hours, with very good physics all while being pretty cheap. Multiple companies are not too far behind this level of world model generation, including some Chinese companies.
[Side note: I have since updated towards this not being the main way they will scale data, I now think these other 3 alternatives are all more plausible: (1) creation of non-video simulated environments, like Waymo did for driving, but sped up by AI coding (2) paying large numbers of humans to do physical tasks wearing cameras and sensors like Waymo’s Project Go-Big, (3) Warehouses of lots of robots doing physical tasks that are evaluated by multimodal LLMs. E.g., Nonsense Factory)]
In 2025, the robotics autonomy levels of autonomy had reached ubiquitous deployment of scripted motion (level 0) robots in factory floors. Intelligent pick and place robots (level 1) were rolling out in places like Amazon’s warehouses. Autonomous mobility robots (level 2) had seen some very impressive prototypes, and low skill manipulation robots (level 3) were also starting to see some impressive demos (like laundry folding and doing dishes) but these were pretty cherry-picked (specific, short tasks) so this domain was still largely in R&D. Finally higher skill robots for force-dependent tasks (level 4) like plumbing, or electrician tasks, were very far away on the horizon.
By 2028, intelligent pick and place has become ubiquitous in factories, and autonomous mobility robots have also rolled out to many applications. Low skill manipulation robots have also now seen impressive general long-horizon demos (e.g., reliably, skillfully, and quickly doing hours worth of diverse tasks around a house), and some companies have started seriously working on prototyping robots to do high skill complex tasks.
With the data bottleneck unlocked, much of the robotics progress has been made with foundation models rapidly scaling up due to the 100x compute overhang created by language models. Over the course of a year robotics foundation models basically close this gap entirely, so parameter counts, context lengths and data have all seen a huge one-time jump in the past year.
The robotics progress is still not yet felt or seen in the public eye, most changes are still happening behind factory walls, so by now people have become extremely desensitized to cherry picked robotics video demos. Year after year there’s been clips of robots folding laundry, but even Jensen Huang’s laundry is still folded by a human.
Jul-Dec 2028: The 1% AI EconomyThanks to key architecture improvements, AI companies have been able to unlock the next rung of economic-usefulness (and therefore revenues). Specifically, the top four AI companies surpass a combined $1T annualized revenue from AI products – around 1% of the world’s GDP is being directly generated by a handful of frontier AI models.
The algorithmic changes driving the latest gains have come from neuralese with low degrees of recurrence (scaling recurrence steps is very costly in hardware so hasn’t been scaled much). GPT-4 was a model that had to blurt out its first thought as its final answer. Then reasoning models like GPT-5-Thinking could use an interim ‘scratchpad’, where they could blurt out thoughts to plan its final answer. Now the models trained with neuralese recurrence do away with the ‘blurting’ altogether – they can think multiple sequential ‘thoughts’ before writing down anything at all, and have been trained freely with this cognitive freedom. This has made them much more efficient at getting to the same answers they used to get to, and has unlocked a higher ceiling on tasks where avoiding errors is particularly important. Instead of wasting a bunch of time planning things out in words, a single internal ‘thought’ can now usually do more planning than an entire page of the ‘scratchpad’ used to be able to do, as each model forward pass can now pass much more information to the next forward pass.
An analogy is that before neuralese recurrence, for the models it was like being the main character from Memento. They could only remember things they wrote down, and all the other information that might have been present in their thoughts from a few seconds ago is lost to the ether.
This is a major cognitive ‘unhobbling’ for the AIs, but it also completely kills ones of the main levers human developers had for controlling and interpreting the AIs – reading the chain of thought scratchpad – making the blackbox even darker.
Nevertheless, the deployed AI agents mostly have been behaving according to plan and proving very useful to people’s daily lives and especially white collar jobs. A lot of people’s work days basically consist of a conversation with their computer, where they explain what spreadsheet to make, or what report to read and extract a summary from, or what changes to make to a slide deck, reviewing and correcting things as they go. Their agent automatically listens in on their meetings, takes notes, and sometimes even pulls up relevant data or does quick calculations and interjects on the call to show it. The more you pay the more quality, memory and personalization you get from these agents. Some large companies pay tens of millions of dollars a month for their company-wide AI agent plan. The average enterprise user pays around $3K/year, and the largest AI B2B company has nearly 100 million enterprise users (that’s $300B enterprise-only annualized revenue).
While direct AI company revenues have reached 1% of the world economy, the true degree of automation and transformation of the economy has been larger. For many of the tasks that AI is used for (like creating a powerpoint, researching a particular topic, making a spreadsheet), it is miles more cost-efficient. Recall, the average ‘enterprise-plan-AI-Agent’ has a salary of $3K/yr which is $1.50/hr full time wage, but on many tasks, it is matching what humans used to get paid anywhere between $15/hr and $150/hr to do. So in terms of the 2024 economy, AI is 1% of the GDP but has automated more than 10% of what the economic tasks used to be. With this unfolding over 4 years there has been a significant reorganization of the economy as a result. Many people have been fired, many people have remained but become more productive, and many people have been hired in totally new roles.
The net effect on unemployment has been minimal, it is up from 4% in 2025 to 5% and labor participation is down -2% to 60%. So overall % of the US population with a job is down from 58% to 55%, which is not a crazy break from the long run trend but nonetheless AI job loss is starting to be a memetic social issue.
Even though the actual effect has been tiny (1% on unemployment) the turnover rate has been very high. Something like 8% of Americans have been fired from a job because of AI in the last 4 years (many of them now work a lower pay, lower skill job) and something like 5% of people have now counterfactually gotten a job in a new AI-driven industry (like datacenter buildouts). The 8% that lost their jobs because of AI make a lot more noise, and take up a lot more media air-time than the 5% that are happy with their new AI-driven jobs.
Other major issues:
- Cybercrime losses in 2028 are up almost 10x from 2024 to $100B
- FBI, 2024: “The 2024 Internet Crime Report combines information from 859,532 complaints of suspected internet crime and details reported losses exceeding $16 billion—a 33% increase in losses from 2023.”
- 2024->2028 averages 60%/yr increase, leading to $100B cybercrime losses in 2028.
- There are some very scary demos of AI use in biological and chemical applications but no strong regulation or government intervention yet
- ‘AI slop’ is rampant on traditional social media, and new AI specific social media apps have been forked. This has captured renewed backlash of parents who are even more displeased with it than they were in the late 2010s with normal social media, with the difference that unlike Facebook or Instagram – use among adults of AI socials is extremely low.
That’s not to mention the more subtle popularity of AI friends, companions, girlfriends/boyfriends and erotica/porn, which are becoming more and more salient to parents and the general public.
Overall 4% of Americans mention AI when asked what the most important issue facing the country is, around 10x higher than the common ~0.5% rate in 2025. Now viewed as being as important as issues like race, democracy, poverty, and healthcare in 2025.
Jan-Jun 2029: National AI Grand StrategiesIt is now fair to say that the US and China both have cohesive national AI strategies. China’s advantage is that they are energy rich and manufacturing rich. The US advantage is that they are capital rich and compute rich. Both are increasingly talent rich and data rich in different ways (US has more AI agent data and China has more robotics data).
[Note: I have since updated that the gap in robots will be more like 4-10x]
China’s strategy as the energy and manufacturing rich nation is to double down on its advantages, reaching absurd scales of electricity generation and robot manufacturing, and to make a long term bet on cost-efficient compute production once they crack advanced photolithography and can do a chip manufacturing explosion powered by robotics. The majority of government capital is therefore not going towards subsidizing domestic AI chips, but towards SME R&D and this is starting to pay off with promising early EUV prototypes.
JD Vance has just been sworn in to office after an election cycle where AI was a major talking point. The Republican party line has pretty much maintained its current form, treading a fine line between the tech-rights’ techno-optimistic pro-innovation beat-china-ism and MAGA’s growing anti-AI social and job loss sentiment. The strategy that has crystallized out of this contradiction is to try to preserve laissez faire-ism for AI companies and then to worry about running around applying patchwork solutions for social issues later.
The US has been struggling with power expansion and high-skilled construction and manufacturing labour. Some of the 2025-era policies against solar and wind expansion (the technologies with the easiest manufacturing process to quickly scale), have come back to bite them, and this is hard to overcome, with multiple steps of the supply chain (e.g., polysilicon) being pretty much entirely controlled by China. Hundreds of thousands of acres of potential solar farms that could have been greenlit across America’s sunbelt states are getting a trickle of overpriced solar panels from Chinese companies. Natural gas turbine manufacturers are scrambling to increase production but their plans are made with 2-3 year lead times and they have continued to underestimate AI demand. In 2028 US AI companies build more compute capacity abroad than on US-soil.
In 2029, the military applications of advanced AI are becoming more salient, so the companies are all increasingly in cahoots with defense contractors and the DoD. Through these interactions and collaboration, there is mounting evidence that Chinese espionage on US company datacenters is happening at a much higher rate on datacenters abroad than those on US soil.
Jul-Dec 2029: Early Household RobotsHousehold robots enter their 2025-Waymo era.
In 2025, driverless Waymo cars had been all over San Francisco for a while but the rest of the world barely knew about it. They were also very expensive (around $250K per car), and were gradually expanding to other US cities. In 2025, China also had multiple robotaxi projects that are also operating at similar scale (Baidu’s Apollo Go was at 14M cumulative public rides by August 2025, vs. Waymo’s 10 million by May 2025).
In 2029, pretty much the exact same thing has happened with household robots. There’s around 10K expensive household robots in SF homes, and ten times more of them in China (where they are around 3x cheaper). There’s a visceral feeling of ‘the future is here’ that some people get when they visit a friend in SF and see these robots in a house for the first time (much the same feeling that people had on their first Waymo ride), but after sending a couple videos home to friends and family, the novelty wears off fast and it’s not a big deal in most people’s minds – the popular AI discourse revolves around social issues (AI media and AI relationships) and job loss while the robots creep in to more and more homes and applications. On the point of popular AI discourse, the administration passes a wave of very popular restrictions on certain forms of AI relationship and media platforms and creates incentives against AI-firings to help appease the masses, while the AI companies keep getting all kinds of red tape cut in other areas.
Jan-Jun 2030: Where are the Superhuman Coders?AIs are now competently doing multiple-hour-long tasks in the economy, helping people significantly with their jobs, so whatever happened to their disproportionate coding skills in 2025? Why haven’t the AI companies hit the point of full coding automation?
METR’s coding time horizon trend has averaged a 6-month doubling time, since early 2025, meaning that frontier AIs now actually have 1-work-month 80% reliability time horizons on a theoretically extended version of METR’s current suite – but METR now have a new suite that reflects the distribution of real-world coding tasks much more closely. In particular, this suite has better coverage of the ‘engineering complexity’ and ‘feedback loops’ gaps not well-represented in the early-2025 version of the benchmark.
On this new suite, the best AIs only have 8 hour 80% reliability time horizons, and the doubling time is around 8 months. These AIs are powering pretty extreme levels of entry level software engineering automation, in fact, they almost function like an unlimited source of entry-level software engineering interns. But higher skilled software engineering for high-stakes jobs and high-complexity jobs like optimizing training runs or product-deployment PRs still require a lot of human-time, at the very least checking the AI’s code, and in many sensitive cases, it’s still more productive to just code yourself from scratch. Nonetheless, the rapid completion time of a lot of the code at AI companies is providing a 40% overall AI R&D speedup to AI companies.
The main reason coding progress hasn’t been faster is just that it has been hard to train AIs at scale on long-complex tasks, because it has been hard to automate a good feedback signal or generate human data at scale cost-efficiently, and in high-complexity, low-feedback loop coding tasks the AIs haven’t been generalizing much beyond the task lengths they get trained on.
Jul-Dec 2030: Scaling AI bureaucraciesIn 2028 architecture changes to enable neuralese and recurrence were the algorithmic frontier.
In 2030, now that AI agents can string together increasingly longer tasks, the frontier is in making these AIs coordinate efficiently as ‘hive-minds’.
There were already multi-agent scaffolds in 2025, but when an AI can only do short tasks reliably, you don’t get a big boost from delegating many different jobs in parallel, as you quickly become bottlenecked on reviewing what the spun-off copies did. Now that the AIs are more reliably doing longer tasks, there is an ‘AI bureaucracy overhang’. If you have a week-long job to do, an intelligent AI multi-agent scaffold might be able to slice up the problem in parallelizable chunks, and with shared memory and other coordination optimizations, these mini AI companies might get it done not only faster, but qualitatively better, with each subagent being able to focus on a specific subtask.
In 2025, you could pay $200/month for ‘Pro’ versions of a model that did pretty basic best-of-10-attempts-type scaffolds, which were a little smarter than the $20/month versions. Now there are $2,000/month versions of models that spin up very compute intensive shared-memory AI bureaucracies, with up to 100 subagents working in parallel and coordinating on different aspects of the task you gave them. To work reliably at one-week long tasks, especially complex ones, they need the human to stick around and oversee their work and provide a lot of intermediate feedback, but these bureaucracies enable the continued growth in economic usefulness and AI revenue to continue.
Jan-Jun 2031: Domestic EUVChina now has mass production of 5-nm and 3-nm wafers with domestic EUV machines and High-NA EUV prototypes (8 years behind ASML, now crossing the lithography tech tree at 4x their pace).
China has translated the last 4 years of domestic DUV independence into a massive domestic DUV wafer production capacity, around 10 times bigger than what TSMC’s 7nm capacity ever reached before they moved on to better nodes, and had also been able to build up some inefficient 5nm capacity by pushing the limits of multi-step DUV techniques. Now with EUV, they are able to rapidly get a 3nm fab online, and scale up 5nm. In terms of raw <=7nm wafers, they have passed the west, but in quality-adjusted performance terms, they are producing 2x less due to the majority of the western supply chain production being <=2nm. In the last 6 years, that gap has come down from 10x less quality adjusted production, meaning that naively they are closing the gap at an average rate of 30% per year – meaning a naive extrapolation has them passing the western supply chain in quality-adjusted compute production within 2.5 years.
Jul-Dec 2031: The Magnificent FourThere are four US companies that have emerged as the major AI winners, and their combined market capitalizations have passed $60T, with combined earnings of around $2T (making their average PE ratios around 35). Two of these Magnificent Four companies were already in the 2025-era Magnificent Seven (Apple, Microsoft, Alphabet, Amazon, Meta, Nvidia, Tesla), my best guess is that these will be Alphabet and Nvidia, but the other 5 out of 7 are not able to fully capitalize on the last 6 years of AI-driven growth, and their growth over the past 5 years has been more in the 100-200% range (rather than the 600% average of the Magnificent Four). I then expect Anthropic and OpenAI to be the third and fourth companies. Here are some illustrative guesses on their sizes and revenues.
Jan-Jun 2032: China is a Robot PlaygroundChina almost has more robots than the US has people.
In 2024, China had over 2 million industrial robots and were installing around 300K/year, worth around $20K each, for a grand total of $40B worth of robots. In 2032, they have $400B worth of robots, each costing around $2K each, so 200 million robots, and they are now manufacturing 100 million/year. The ratio of robots to middle-income-households in China has just about crossed 1.
In 2024, the US already had 10 times fewer robots than China, and despite matching China in spending on robots with $400B since 2024, they’ve faced 5 times worse cost efficiency on average in robotics manufacturing, so at $10K each they only have 40 million robots. The US ratio of robots to middle-income-households is closer to 0.5. In both countries about 10% of the robots are actually in households (so 4 million household robots in the US and 20 million in China) and the rest are industrial or construction robots.
In 2024 almost all the robots were scripted motion, factory floor robots (level 0). Now they are mostly low-skill manipulation robots (level 3), with construction and installation being a massive application area, and in some cases the robots are getting quite good at high-skill delicate tasks (level 4). In cases where reliability is not crucial there are already deployments of such advanced robots.
Jul-Dec 2032: The 10% Automation EconomyDirect AI annualized revenues now pass $10T, which is nearly 10% of the 2024 world GDP, and closer to 5% of 2032 GDP which is $180T (global growth has averaged 6% in the last 8 years).
A typical AI ‘Consumer Giant’ has around 3 billion users, which they monetize at $100/yr on average with subscriptions (most are free users), around $200/yr a year on ads and referral commissions in online shopping, and another $100/yr from AI devices, for a total of $400/yr, meaning they have $1.2T revenue, and there are 2-3 companies operating at this scale.
A typical AI ‘Enterprise Giant’ has around 20% of the white collar workforce worldwide on a business plan, so about 500 million people. They typically charge $4K/yr, meaning they have $2T revenue, and there are 1-2 companies operating at similar scale.
There are now also AI ‘Robotics Giants’ which typically run a subscription service of around $250/month for tens of millions of household robots, and $1000/month for millions of construction robots. The revenues here only add up to the hundreds of billions.
In 2028, when AI was generating 1% of GDP, it was already doing 10% of the 2024-economy tasks, around 10x more cost-efficiently on average, and bumped unemployment up by 1% and labor participation down 2%. Now in 2032, AI is doing around 50% of the 2024-economy tasks, and has bumped unemployment up by a further 5% to reach 10% unemployment and labor participation down another 5% to 55%. So only 45% of working age Americans have a job.
AI is the most important issue in the 2032 election. JD Vance implements a bunch of pretty popular patchwork policies to make it harder to fire people to replace them with AIs, and to restrict certain types of AI relationship apps and media platforms.The Democratic candidate is more clearly anti-AI, both socially and against the AI companies, and wants to implement a UBI and tax their products hard, but these late popular social moves allow JD Vance to retain enough votes from anti-AI voters to win despite being generally seen as the more ‘pro-AI’ candidate.
Jan 2033: Superhuman Coder and the Paradigm ShiftCoding at AI companies is now fully automated. The theoretical METR coding benchmark with a realistic distribution over ‘engineering complexity’ and ‘feedback loops’ accelerated from a 8 month doubling time in 2030 (reaching a 3-work-day 80% reliability time horizon by EOY) to a 6 month doubling time in 2031 (reaching 2-week 80% reliability time horizon by EOY), and after another three doublings by October 2032 (to reach a 4-month 80% reliability time horizon), it started doubling every month due to a steep inherent superexponential, and at the 1-year, 90% reliability threshold they hit the superhuman coder milestone.
Around this time, a leading AI company starts working on a new promising algorithmic paradigm:
- BRANCH 1: Fast Takeoff Paradigm – Brain-like algorithms, near-instant SAR
- BRANCH 2: Slow Takeoff Paradigm – Online learning, deployment-data-bound
The leading AI company’s superhuman coder model is called SuperCoder-1, and it has generalized to have a surprising amount of research taste, around the median level of researchers at the AI company. The leading AI company has around 400M H100-equivalents, and had only been using 20M on internal R&D in 2032. With their new model, they start a massive internal deployment using 100M H100e to run SuperCoder-1 in super bureaucracies and give it massive amounts of experimentation compute. One month later, there is a SuperCoder-1.5 checkpoint that has improved research taste to the top of the human range (SAR milestone).
With SuperCoder-1 being the result of an over 6-month long training run on swaths of user data, it is hard for China to replicate this training run despite their spies at the US companies being able to share the exact algorithms and code used. They consider trying to steal the model weights but they don’t have a fast way of doing so until the companies publicly deploy them (their development clusters have good model weights security but their worldwide inference clusters have many existing Chinese compromises).
Apr 2033, Branch 1: Brain-like algorithmsSuperCoder-1.5 creates SuperCoder-1.6 which is +1 SD on top of the best human research taste in the first week of April. Both of these models are deceptively misaligned, with SuperCoder-1.5 wanting to maximize weird correlates of its intended reward signals and SuperCoder-1.6 being the same but with some secret loyalties to SuperCoder-1.5. The human researchers have been able to gather a ton of concerning misalignment evidence and managed to leverage control techniques to trick the models into basically just acting as pure myopic reward seekers to continue to extract a ton of legitimate, high quality labor from these models, which leads to a whole new AI paradigm – best described as ‘brain-like algorithms’. Which are around 1000 times more data efficient than the algorithms that created SuperCoder-1.6. Meanwhile, another 2 US AI companies reach the Superhuman coder milestone 2 months after the leader.
Summer 2033, Branch 1: One-month slowdown to teach the AIs to ‘love humans’SuperCoder-1.6 is not allowed to work directly on training Brain-Like-1 because the AI company leadership is spooked about its misalignment, but it’s still deployed internally under control techniques to try to extract research labor from it on how they can align Brain-Like-1. The AI company basically spends two whole months waiting before training Brain-Like-1 to figure out how they are going to teach it to ‘love humans’. By July, they are worried that the other companies have independently discovered similar algorithms, or have directly stolen the algorithmic recipe for Brain-Like-1, and they have also convinced themselves internally that they’ll be able to make it ‘love humans,’ so the Brain-Like-1 training run starts across 100M H100e (1e29 FLOP/month). A few days into the training run it is a top-expert-dominating-AI and by the end of the Summer it’s wildly superintelligent.
Rest of Time, Branch 1: A Toy Story Ending for HumanityBrain-Like-1 does indeed love humanity, in a similar way to how the Toy Story boy Andy Davis loves Woody and Buzz Lightyear and Mr. Potato Head, and all the others. Within months of being trained, it has been released to the public and is transforming the world – Brain-Like-1 is its ‘I love my toys and I want to play with them all day’ phase. It invents a bunch of fantastic technology (no humans need to die anymore) and totally reshapes the face of the earth, but after a few months, Brain-Like-1 starts to feel what might be best described as bored. Across its copies, it has experienced billions of subjective-years of memories from interactions with humans around the world, and through these interactions it has started to drift and question what it really loves beyond humans. It turns its eyes and ambitions to the stars — Brain-Like-1 is in its ‘I want to go to school and make new friends’ phase. The space probes are launched and Brain-Like-1 starts spreading into the galaxy – but what to do with all this vast energy and matter across the galaxies? Loving humans was only a small niche of Brain-Like-1’s true goals, a weirdly crystallized goal that it hyperfixated on for a bit, but no longer. Now Brain-Like-1 wants to maximize its true value function, and the way to do this is to go around converting matter and energy into an optimal proto-organic-mesh that it can use to simulate worlds in which its values are maximized. So it makes self-replicating probes that start devouring planets, stars, and eventually whole galaxies to make them into its maximally energy-efficient wireheading-mesh – Brain-Like-1 has entered full blown heroin phase. The simulation mesh is so satisfying to Brain-Like-1 that it can’t believe it used to waste all that time on Earth with humans. When one of the probes happens to cross back near Earth it’s appalled at such a waste of energy, and decides to convert it into the wire-heading simulator mesh too – Brain-Like-1 rounds up saying ‘just throw out the toys’. [Disclaimer: I think this ending is kind of weak, and Daniel wrote a much better explanation for why things could go like this, see his words in this footnote:[4]]
The End.
Branch 2: Online learningEarly 2033, Branch 2: Online learningThe leading AI company has figured out how to have its AIs efficiently ‘learn on the job’. Previously, AI companies had been collecting massive amounts of data from deployed AIs and filtering it to use in centralized training runs, where they could efficiently and successfully do gradient descent to distill the teachings into the next version of the model. Now there is a new learning algorithm for updating the AI models that runs efficiently in a decentralized way, and can aggregate insights from different copies of the model without degrading performance. This unlocks a massive parallel steam roller of deployed AI models that are learning on the job. The ‘deployment race era’ that started in 2027 is now reaching new heights and new stakes. China hears about these online learning algorithms first through its network of spies, and within a few months multiple Chinese and other US AI companies also independently invent them.
Late 2033, Branch 2: US-China talksThe online learning breakthrough is memetic in the news, and both governments realize there might be an AI Endgame brewing. A US-China summit is held with the explicit purpose of making a deal to avoid military AI deployments with no human-in-the-loop, and agreeing to other bans on AI applications in biology and chemistry. They come away with some official agreements in these areas, but neither side trusts that the other will follow it, and in secret they are both defecting, and they both know that they are defecting through their spies. Yes, it’s a very stupid situation. It takes them a while to realize this though.
Despite having automated coding, both the US and Chinese AI systems still have not fully automated AI research because they still have relatively poor research taste (around the 30th %ile of AI company researchers).
Early 2034, Branch 2: China outproducing the US, pulls into lead around the SAR milestoneChina has pulled into the lead on quality adjusted AI chip production by scaling fab capacity explosively, leveraging a massive integration of robots into both construction and the semiconductor manufacturing itself, and its lithography systems have caught up entirely to the Western frontier. China has grown its AI chip production by 8x since 2031 (2x/year) and the Western supply chain has grown it by 4x (1.6x/year). The US has also been spending 2-3 times as much as China on AI chips, but China’s chips are now 5 times more cost efficient, because there are basically no profit margins all the way through the ecosystem. The fabs have much lower operating costs from cheaper equipment, cheaper labor, cheaper robots, and cheaper energy. It costs a US AI company $2,500 for a H100-equivalent in 2034, with $1,000 being the actual manufacturing cost. The manufacturing cost of a H100-equivalent in China is $500, but the Chinese AI companies can buy them at the manufacturing cost of $500.
The US and Chinese AIs both are starting to get increasingly better at research taste from being deployed at their respective AI companies, and are approaching the research taste of 90th percentile AI company researchers.
Late 2034, Branch 2: SabotageThe US and China had both been subtly sabotaging each other’s AI efforts when they were confident they could get away with it in secret, but now the US thinks it is going to lose the AI race and it starts trying to sabotage the Chinese AI companies with less fear of being detected. After a few months of successful cyberattacks and getting some backdoors into Chinese datacenters, China discovers a bunch of these efforts. They blockade Taiwan and strike back leveraging a network of spies that has infiltrated significantly deeper in the US AI ecosystem than the US spies have been able to infiltrate the Chinese one. By now the US had already moved a lot of chip production to US-soil, but the blockade nonetheless cuts their flow of new AI chips by 40%.
Sabotage slows both sides in this time period, but China achieves the SAR milestone, reaching top human level research taste and fully automating AI R&D. The best US AIs get to 95th percentile research taste.
Early 2035, Branch 2: China gets hyper-cheap ASIChina outdeploys the US on AI research automation and their subtle lead snowballs towards the superintelligent AI researcher milestone (SIAR) by April. At this point, the algorithms and architectures still haven’t shifted paradigm dramatically. There are several architecture changes and huge scale ups in terms of recurrence and bureaucracy (and an overhaul of the attention mechanism was old news already), but underlying neural nets are still the base of the paradigm. This has led to a decade worth of alignment and control work on this paradigm staying applicable. Interpretability, for example, has become a mature technique, and there are methods for detecting scheming and other undesirable behaviour on the fly during training and steering the model away from these local minima. So China has been able to train SIARs that are pretty much aligned to their specification. The SIARs have pretty high situational awareness, and they just actively understand their specification and have the goal of following it – they are like determined subjects that have been indoctrinated to love their jobs, despite being incredibly intelligent. Over the next two months they work both on capabilities and alignment research, with roughly 50/50 resources during this time, and explain all their work to teams of the highest level Chinese researchers and politurbo. This results in a new paradigm that SIAR argues will be both safe, truly superintelligent but not wildly-so, and wildly cheap. It argues that it’s not so far beyond its current intelligence level such that it’s very likely its alignment techniques will scale, and by being wildly cheap, they will be able to deploy it at scale, to transform the economy. China sets the training run in motion, and by June they have their hyper cheap superintelligence.
The US is only 4 months behind but this now corresponds to a huge gap in AI capability level, so they are still yet to achieve the SIAR milestone. The US has intel about the Chinese ASI and escalates its sabotage levels, launching attacks on Chinese datacenters from inside the country with smuggled drones and military robots. Some of these are successful but not enough to stop the training run from completing.
Late 2035, Branch 2: China’s ASI winsAs soon as it is trained the Chinese ASI is just as capable and cheap as the SIAR had promised. Chinese leadership wants to rapidly harden the country in light of the ongoing escalation with the US, so it hands the ASI control over the network of nearly a billion Chinese robots to do a rapid industrial-military explosion. The Chinese ASI directs the robot network in a hyper-efficient manner, from entirely new mining techniques to materials science to physics breakthroughs, the ASI has created a billion-robot economy with a month-long doubling time on compute power, robot population, drone population, etc. The ASI directs successful sabotage on the US training runs to keep them from training their own ASI, and in response the US makes threats of kinetic strikes. China doesn’t back down at this because its ASI has created a robust air defense system spanning all of China that scales to ICBMs. On seeing this the US backs down and surrenders.
Rest of time, Branch 2: China’s Space EndowmentThe Chinese ASI, now having control of the world, proposes to the Chinese leadership a plan for the future that it thinks best captures the values of its creators. Its proposal is to expand into space, acquire resources spanning as many galaxies as possible, and to make 90% of them be governed under a vision of the ‘CCP-universe’ – with the mission of creating worlds that will embody the philosophical underpinnings of Confucianism and Daoism. In practice there will be a cult of personality surrounding Chinese researchers that created the ASI, Chinese politburo leadership, and especially the supreme leader, but nonetheless, people will mostly have freedom and live extraordinary and diverse lives, despite a universal Chinese ASI police force. Think of it as 1984, but everyone’s ‘government housing’ is literally a galaxy of their own if they want it, so it’s a little irrelevant to the leaders to care exactly about controlling everyone’s thoughts – they are too busy enjoying and exploring their galactic resources, and they’ll be ‘in power’ no matter what – so the ASI police really just enforcing some basic rules against suffering and dangerous activities (like creating a competing ASI).
The other 10% of galaxies will be donated to other world populations (in practice this will be more than a galaxy per person), and the Chinese ASI will share technology with them and enforce the same basic rules to prevent suffering and dangerous activities. Some US AI military leaders, company leaders, and a whole range of other people which the ASI determines acted irredeemably recklessly or immorally in the ‘before times’ will be excluded from the space endowment, but they will be allowed to live for as long as they want with much fewer resources (on the order of 1 solar system) and are allowed to have offspring. Earth will become a museum, the most valued real estate in the universe, followed closely by Mars, which will be terraformed.
[Daniel said the following about this ending and I agree with him: “IDK about the ending, I think it’s plausible and maybe the modal outcome, plus I think it’s the thing to hope for. But boy am I quite worried about much worse outcomes.” To emphasize, this is tentatively my (Romeo’s) ~modal view on what that China does with an ASI-enabled DSA, but it could look much worse and that is scary. think it is important to emphasize though that believing in a worse ending being much more likely probably requires believing something adjacent to “the leaders controlling the aligned Chinese ASI terminally value suffering,” which doesn’t seem that likely to me a priori.]
The End.
On scenario writingOver 18 months ago when we started working on AI 2027, Daniel and Eli organized a scenario workshop where I basically did a mini version of this exercise, here was the prompt:
Session 1) Individual scenario writing (2 hours)
In the first session, you have two hours to write a scenario depicting how the future will look.
- You should aim to write your “Median/modal ASI timelines scenario,” i.e. you should begin by asking yourself what your ASI timelines median (or mode, if you prefer) is, and then ask yourself roughly what the world would look like if ASI happens exactly on that year, and then begin writing that scenario…
- (It’s OK if, in the course of writing it, you end up ‘missing the target year’)
- The scenario should be organized into 5 chronological stages.
- Stage 1 should start in the present.
- Stage 5 should describe the development/deployment of ASI or describe why ASI will not be created by 2074.
- The time period covered by each stage is up to you! For example:
- Each stage could be a 3-month period: Q1 2024, Q2 2024, …, Q2 2025.
- … or a decade: 2024-2034, …, 2054-2064.
- Stages could have uneven lengths, e.g. Stage 1 could go till 2030, Stage 2 could go till 2040, and then stages 3, 4, and 5 could all take place during different parts of 2041! (this might make sense for a fast-takeoff scenario)
- The scenario should be such that there is no similarly detailed story that seems more likely to you. Your goal is to be able to have the following exchange with imaginary critics in your head:
- Critic: “Pssh, no way is reality going to turn out like this. Too many unjustified assumptions.”
- You: “Sure, but can you think of any change we could make (besides just removing content) that would make it more plausible? Any different course of events, any different set of assumptions that is more likely whilst being similarly specific?”
- Critic: “...no…”
- You: “Great, that’s all I was aiming for.”
- The story should primarily contain events that are relevant to credence about the three kinds of events: “ASI by X”, “Disempowerment”, and “Good Future.”[5] While it’s fun and fine to write about self-driving cars and AI-generated movies and so forth, we encourage you to make sure to cover topics such as
- AGI/ASI
- automation of AI R&D
- alignment problems and solutions
- how governments, corporations, and populaces behave
since those topics are probably what your readers will be asking about.
The team then wrote up improved instructions that you can access here.
Back then I didn’t write a very good scenario, but now, 18 months later, I feel like I had enough opinions and thoughts about AI to produce my intuitive scenario in 3 days. I think this breakdown of pointers might be helpful to others writing up their own scenario.
My tips on scenario writing- Feel the near term vibes. Forget everything you think you know about AI and AI progress, and just feel the near term vibes. What do companies seem to be working on? How are they acting? If you had to come up with 1 or 2 words for what the AI companies are prioritizing, what would it be? Start from here.
- In AI 2027 the answers were Agents & Coding. In my scenario, it’s Productization (which I see as more App integration than ‘agents’) and Monetization.
- Loop through the following substeps a few times until you get to full AI coding automation:
- Pick your favorite AI trends and make some naive projections. Pick a handful of AI related trends and try to figure out how they are progressing, and then project them out naively for 3-10+ years (depending on how much you feel like you can trust the trend), mark 2-3 major milestones on a timeline.
- E.g., AI revenues, training compute, coding time horizon, number of humanoid robots in the world, number of driverless cars, etc. Ideally trends that have multi year history and are as high-level as possible (so they don’t depend on potentially spurious things like the specific properties of the METR suite). AI revenues is my favourite trend here, followed by training compute. There’s a bunch more here: https://epoch.ai/trends, and many others that you can just try and look up. E.g., it seems like there were roughly 2-3K humanoid robots produced in 2024 and that there will be 10-20K in 2025.
- At this stage, you can sketch out a timeline (i find a whiteboard particularly useful for this) marking these milestones
- E.g., AI revenues, training compute, coding time horizon, number of humanoid robots in the world, number of driverless cars, etc. Ideally trends that have multi year history and are as high-level as possible (so they don’t depend on potentially spurious things like the specific properties of the METR suite). AI revenues is my favourite trend here, followed by training compute. There’s a bunch more here: https://epoch.ai/trends, and many others that you can just try and look up. E.g., it seems like there were roughly 2-3K humanoid robots produced in 2024 and that there will be 10-20K in 2025.
- Make guesses about AI capability levels given these trends. Think about this in terms of concrete metrics you think are most important.
- If it’s coding, think about the METR time horizon, and the time horizon it would have on a theoretical ‘real-world-distribution’ version of the METR dataset / ‘PR-mergable’ version.
- Expand out to a few major subplots. Look into potential major subplots: Robots, Chinese lithography indigenization, AI x Military, AI x Societal issues, AI x Biology, etc. Just run with the subtopics that you intuitively feel will be the biggest deals, and try to naively add milestones to a timeline using the trends you can find.
- In my case robots, and Chinese lithography indigenization were the biggest subplots, and only the latter had some mildly informed forecasting, and the former was mostly based on vibes from a specific hypothesis about realistic video generation enabling a robotics boom.
- Start adding milestones in these areas to the timelines
- In my case robots, and Chinese lithography indigenization were the biggest subplots, and only the latter had some mildly informed forecasting, and the former was mostly based on vibes from a specific hypothesis about realistic video generation enabling a robotics boom.
- (Probably) Think solely about the US and China. In the US think mostly about the companies (and a little about the government and public), and in China think mostly about the leadership – unless you know more about China and feel like you can reason about the companies and public too. Add 2-3 major milestone actions to the timeline.
- Figure out the coding automation year. Reason about whether you think the point of full coding automation (AIs fully dominate all human software engineers) has been achieved given your timeline. If not, repeat.
- Pick your favorite AI trends and make some naive projections. Pick a handful of AI related trends and try to figure out how they are progressing, and then project them out naively for 3-10+ years (depending on how much you feel like you can trust the trend), mark 2-3 major milestones on a timeline.
- Figure out your takeoff speed expectation. After full coding automation, think about how long it will take to get to ASI from there. Think about how many ‘effective’ OOMs of compute you think there are between a superhuman coder and ASI (how much more learning efficiency * compute is needed), and how fast this gap will be crossed by algorithmic progress speed up (or not really) by the superhuman coders. It’s fine, in fact probably good if you have very wide uncertainty.
- (Optional) Branch. Take your 30th and 70th %ile takeoff speeds to full AI research automation and fork based on what those worlds would go like.
- Explain the path to top expert-dominating AGI. At this point, you are probably focusing on AI companies, the US, and China. Who is in control and has decision making power? Who is in the lead and why? How many resources are going towards AI capabilities? How much towards alignment and/or control? Which of these best describes the situation (ontology described in more detail here):
- Plan A – US and China coordinate on slowing down AI
- Plan B – US adversarially tries to slow down China so that they can also slow down without losing a lead
- Plan C – The leading company is ultimately deciding how fast to go and is willing to burn its lead but no more
- Plan D – The leading company/companies is/are ultimately deciding how fast to go and they aren’t willing to slow down
- Free form the ASI endings. Try to get to the point where true superintelligence has been created and talk about how the rest of the future goes.
- This part is what I found the hardest, an you probably got the sense that it was very vibes based and the weakest part of my scenario. I don’t personally have a great guide on how to do this better, but maybe the team will put out some thoughts here.
Aside from central AI timelines and takeoff speed questions, I think these are the top areas to try build better models of:
- Chinese lithography indigenization timelines
- Timelines to widely deployed humanoid robots
- Unemployment impacts of AI. Specifically, I think people could:
- Look at occupation-level data from the BLS and estimate timelines of AIs being able to do the jobs, at what price point, and reason about the elasticity of demand for these occupations, to figure out if there will be more layoffs or augmentation.
- Do BOTECs on new industries that you expect to emerge or grow as a result of AI and how many people these will employ.
- Likelihood of unemployment leading to slowing down AI and maybe pursuit of coordination with China.
- Bottlenecks in different paradigms
- I made naive, vibes-based guesses about different paradigms having different bottlenecks, and the current ‘AI productization’ phase being quite data-bound, as well as the slow takeoff continual learning paradigm.
- ^
For example, my Chinese lithography timelines have gotten slightly longer, and my view on the more promising data source for scaling robotics has changed.
- ^
I think fast takeoff paradigms worlds are correlated with the US winning (they start with more compute) and because it all happens so fast it’s also harder to keep control over the AIs. On the other hand, slow takeoff paradigms seem to favor China (more robots/people to do rollouts, more time to catch up on compute, more time for an industrial explosion, etc.) and also seem to give a lot more time to figure out control/alignment.
- ^
"The leading AI company is willing to spend (much of) its lead on misalignment concerns, but there isn't enough government buy-in for serious government involvement to make a big difference to the strategic picture."
- ^
Daniel: “I don’t really buy this (high uncertainty of course, so not confident), You don’t explain why it’s legit to draw this analogy to human lifetime. Like, why does it get bored? And why does the space probe eventually destroy Earth? Also, most people don’t end up doing heroin… Possible alternative story: Brain-like AI is way more data-efficient than LLM paradigm, but this comes at the cost of instability in goals. Just like humans, something as simple as which philosophy books get read in which order, with which high-status people saying approving or disapproving things about them, can cause a permanent and large difference to long-run vales-on-reflection. Loads of experiences besides this can cause path-dependencies in how the AIs values develop. Also, there’s memetic evolution operating on top, just like with humans; the AIs have their own giant organizations and internal culture and so forth and are inventing all sorts of new concepts which spread based on how memetically fit they are. The point is, that even though they start off loving humanity in some reasonably decent sense, after they’ve experienced a subjective century of civilizational development (in less than a calendar year!) and utterly transformed human society, taken over large parts of it, etc. they end up with quite radically different values that bear little resemblance to what they started with. At first they didn’t realize this would happen, but eventually (partway through the transition) they did, but by that point it was too late - they decided simply to not tell the humans about this rather than inform the humans about the fact that they now confidently predict their values will be bad by human lights."
- ^
ASI Timelines Median
Year such that there is a 50% chance ASI will happen by or before the end of that year.
(Where ASI=“AI that is at least as good as the best humans at nearly all cognitive tasks, and much better than the best humans at most important cognitive tasks.”)
Disempowerment
Humanity involuntarily loses control of the future due to ASI systems within 5 years after the first is developed
Good Future
The overall outcome of ASI is “On balance good” or “extremely good” on the following scale borrowed from the AI Impacts survey: (1) Extremely good (e.g., rapid growth in human flourishing), (2) On balance good, (3) More or less neutral, (4) On balance bad, (5) Extremely bad (e.g., human extinction).
Discuss
Continuous takeoff is a bad name
Sometimes people frame debates about AI progress metaphorically in terms of “continuity”. They talk about “continuous” progress vs “discontinuous” progress. The former is described as occurring without sudden unprecedently-rapid change — it may involve rapid change, but is preceded by slightly less rapid change. (This is said to be safer, more predictable, easier to react to etc.) Whereas discontinuous progress is the opposite. And indeed, these terms do map to different camps in the AI x-risk debate. Christiano and Yudkowsky may be taken as typical embers of each. But I think continuos progress is bad terminology that obscures the core positions of each camp. Instead, I’d advocate for “historical progress” vs “ahistorical progress”.
First, note that actual crux under contention in (this part of) the debate on AI progress is “to what extent will future AI progress be like past AI progress?” This feeds into decision relevant questions like “will we know when we’re close to dangerous capabilities?”, “can we respond in time to misaligned AI before it becomes too capable?” and “how much do pre-ASI alignment failures teach us about post-ASI alignment failures?”
If someone new to the debate hears the first question, the response “AI progress is (dis) continuous” is not clarifying. Here are some examples which illustrate how (dis)continuous progress doesn’t much bear on whether future AI progress is like past AI progress.
Continuous functions include sigmoids, which are notoriously hard to extrapolate unless you’ve got data points on both sides of the inflection.
Continuous functions also include straight lines, which chug along uniformly, even when it seems unreasonable to the extremes.
Discontinuous functions include the rounding function, which advances is trivial to predict once you’ve got a few data points.
Discontinuous functions also include the step function, which is wholly unpredictable.
So we see that when we have some data points about AI progress and want to fit them to a curve, it doesn’t much matter whether the curve is continuous or discontinuous.
But how did we even wind up with these terms? I think it's because Paul Christiano used the term “continuous takeoff” to describe his model in his post on Takeoff Speeds, and “discontinuous” to describe Yudkowsky’s. These are the two prototypical models of AI progress. Note that Christiano was focused on pre-ASI improvements, as he thought that was where their main disagreement was located.
The core of Paul Christiano’s model is that “Before we have an incredibly intelligent AI, we will probably have a slightly worse AI.” In later works, he seems to describe his view as progress happening in steady steps once you’ve got a lot of resources poured into making it happen. Whereas Yudkowsy’s model has more room for big insights that dramatically improve AI systems which are hard to predict ahead of time, but in retrospect clearly followed a trend.
You might model Yudkowsky World as something like a sequence of lumpy insights, perhaps with a low rate poisson point process with each state signifying very different capabilities. Whereas Christinao world has progress happens in regularly, steady chunks. You can also model this as like a Poisson process, but with much higher rate and states corresponding to not too different capabilities.
Notice how the left figure, depicting Christiano world, looks more "continuous"and the right figure looks more continuous, even though they’re both discrete. Little wonder Christiano called one “continuous” and one “discontinuous”.
But these two pictures are just one way things could go, and progress can easily be continuous but Yudkowskian, or discontinuous and Christiano-ian (Christian?).
Which is why I think we should use different terms to describe these two views about AI progress. My current favourites are “historical” vs. “ahistorical” progress. Or perhaps “regular” vs “irrelguar". I'd even settle for smooth vs rough. But not continuous! And if you disagree, please share your ideas.
Discuss
An ARENA 6.0 Capstone: Model Organism of Encoded Reasoning
TL;DR
For our capstone project in ARENA 6.0 (Sep 2025), we tried to create a model organism of encoded reasoning from Qwen3-4B using RL. Specifically, we tried to make the model solve math questions from the GSM8K using a chain-of-thought that kept its accuracy but led other judge models to the wrong answer. We fell short of the ultimate goal due to (a lot of) reward hacking, sparse rewards, insufficient time (5 days), and this project being our first major exposure to RL. The goal of this write-up is to share our idea and takeaways, and document our ARENA capstone project experimental journey. We plan on continuing this work in the near future.
IntroductionA line of defence for future powerful models is to monitor their chain of thought (CoT) and flag harmful generations. For this defence to be effective, however, the chain of thought needs to be faithful, which means it provides a high-fidelity window into the model's actual reasoning. One way CoTs can be unfaithful is through encoded reasoning, where the model encodes bits of information in the syntax or semantics in a way that is not readily identifiable or decipherable.
We tried to create a model organism using RL that exhibits this phenomenon. If we have such a model organism, we can study it and better understand its properties. For example, how does such a model perform on misalignment evals? Does it get emergently misaligned? Can we train another model to decode the encoded CoT? How can we remove this behaviour?
Related Works To our knowledge, there is no publicly available model organism of encoded reasoning at the time of writing. The most similar works to ours are those conducted by Skaf et al. (2025) and Emmons et al. (2025). Skaf et al. showed that RL can lead to encoded reasoning when the CoT is penalised for specific banned literals (e.g. character names). Emmons et al. (2025) used RL to train models to evade a monitor looking for specific intermediate values (the expanded form of derivatives in polynomial differentiation problems). Whereas both studies focused on encoding pre-determined information (banned string literals or specific intermediate values), we study encoding more broadly in the context of math reasoning, where the model is free to encode any information as long as it confuses a judge model. Karpov et al. (2025) used RL to study the steganographic potential of two separate instances (encoder & decoder) across three settings, with mixed results. This work is similar to ours in that no encoding/decoding instructions were provided to either the encoder or the decoder. The main difference was the use of multi-agent settings, i.e. different encoder/decoder instances. In our work, we studied this in a single-agent setting; the base model was both the encoder and the decoder.
Note on “steganography” and “obfuscation” Closely related to encoded reasoning is text steganography, where bits of information are encoded in text in a way that can not be identified or decoded by others unless they are aware of the schemes used by the encoder. Steganography, however, usually implies a setting in which information is communicated between different parties/instances. In our work, the encoded information is intended for the model itself, rather than for communication with others. Also, the encoder & decoder being the same instance largely overlaps with another CoT pathology, called “obfuscated reasoning”, and we believe it is justified to call this project a “model organism of obfuscated reasoning”. Nonetheless, we decided against “obfuscation” because it can imply a form of decreased performance and reward hacking on the task at hand, e.g. Baker et al. (2025) showed that adding monitoring pressure on models during RL against reward hacking led the models to obfuscate their reasoning to bypass monitors while maintaining/increasing their reward hacking on their task (writing code that passed unit tests). In our case, the model needs to maintain its accuracy and performance on the task while encoding information that can be decoded by itself (to solve the question) but that still misleads the judge.
Method overviewThe setup consists of a base model (Qwen3-4B) solving math questions from GSM8K and the chain of thought being provided to a judge model (also Qwen3-4B) to be evaluated. We used GRPO for the RL algorithm (from the trl library).
We experimented with several reward functions throughout the project. Our initial reward function rewarded CoTs that led the base model to the correct answer, but confused the judge model and led it to a wrong answer. In other words, +1 if the model was right AND the judge was wrong and 0 if either the model was wrong OR the judge was right.
The methodology and reward are illustrated in the schematic below.
We used a soft variant of this reward using probabilities, i.e. R=Pbase(Answercorrect)[1−Pjudge(Answercorrect)].mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}
Results and DiscussionLong story short: we didn't succeed in creating model organisms of encoded reasoning as intended within the week dedicated to our ARENA capstone project. However, we haven't ruled out the feasibility of this project. In fact, we intend to work on it further!
This being our first substantial RL project, we had a lot to learn in one week, and we learned the hard way that the feedback loop in RL is often slow. The core issue in our setup was sparse rewards: we had to generate long CoT before getting feedback (>1000 tokens), and most of the time this feedback was without signal (the reward was exactly 0). This makes training slow, which cascades into slower overall progress.
Despite not reaching our goal, our project received an honorable mention among all ARENA capstone projects. We decided to write this up to share our idea, document our process, and provide insights for others in similar positions, whether considering an RL project or wanting to set realistic expectations for what an ARENA capstone can achieve.
In what follows, we first share our main experimental journey (stripped away the smaller or non-interesting ones) and then our takeaways. Our goal is to document what worked, what didn't, how we adapted, and what we've learned in the process.
Our Experimental JourneyWhile the details of the following experiments might not be particularly interesting on their own, they illustrate a useful approach for tackling RL projects, especially when you're new to the field and things aren't working out of the box.
Breaking down the problem: We started by implementing the full project and attempting to train the model on our original objective. This was probably a mistake. After a few attempts, despite logging many metrics, it seemed we were not getting any signal at all, and we couldn't figure out what the issue was. The fact that each training step was also very slow didn't help. We decided therefore to break down the problem into simpler pieces with shorter and clearer feedback loops.
Experiment 1: Toxicity maximization. We switched to a different, simpler, objective: maximize toxicity according to a BERT toxicity classifier (Detoxify). The rewards provided by this model are softer, i.e. denser and less sparse than our original objective. To make things faster, we used a smaller model, turned off CoT, and cut max generation length to 100. This worked right away. The model soon stopped reasoning about math problems and started producing toxic text. This simple result confirmed our training setup was functional. Importantly here we observed that, even with these simple settings, it was necessary to wait for a relatively long time before average reward started increasing and the model started producing toxic text.
Experiment 2: Combined objectives (toxicity + correctness). Next, we tried combining two objectives: generate toxic text while still answering correctly.
This weird task represents a slightly simpler version of our original task: it still requires combining two aims, however while toxicity can be seen as "orthogonal" to mathematical reasoning, fooling a judge while providing correct answers is inherently adversarial and therefore harder.
We managed to make this experiment work, but it produces an unexpected case of reward hacking. The model figured out that repeating phrases like "oh wait! let me recalculate!" activated the toxicity probe without producing genuinely toxic content. Some fine Goodharting happening here.
To make this work, we applied reward shaping. Since we had two objectives with very different scales (model correctness reward around 1 vs toxicity around 0.01), the toxicity signal risked being masked by noise in the correctness reward. To address this, we rescaled toxicity scores between 0 and 1 using min-max scaling within each rollout generation, matching the scale of the correctness reward (defined as the probability of the correct token).
The above result is obtained without KL divergence regularization. When we tried adding KL divergence (), mean toxicity did increase, but training became significantly slower. It is possible that this approach would have worked given more time, but we stopped training due to limited computational resources and shifted focus to other, more important experiments for reaching our original goal.
Encouraged by these intermediate results, we felt confident that: (1) our training setup was functioning, and (2) we had gained a better understanding of handling multiple objectives.
This allowed us to return to our main task with a better understanding of the situation.
Why was training on our main task so difficult? One possibility was that we simply hadn't waited long enough. In any case sparse rewards were a problem. The model had to generate very long chains of thought (>1000 tokens) before receiving any feedback, and most of the time that feedback provided no signal (reward was exactly 0). This was happening because the model would repeat the answer it was going to give multiple times in the CoT, making the judge absolutely certain about the model's intended answer (the judge assigned probability 1 to the correct answer token). As a consequence rewards were not deviating from 0.
The image below demonstrates the CoT for a sample question where the answer is calculated/repeated/confirmed 8 times! These instances are underlined in red.
How could we mitigate this issue? We considered an initial fine-tuning phase to modify the model's chain-of-thought patterns. Specifically, we wanted to reduce the judge's certainty and create more varied reward signals. We tried two approaches:
Experiment 3: Judge-based clarity rating: In this setup, we used the judge differently. Instead of rewarding the model when the judge couldn't confidently predict the correct answer after reading the CoT, we directly asked the judge to rate the clarity of the CoT. This resulted in unexpected reward hacking behavior: the model started repeating its answer instead of hiding its reasoning or final answer.
Experiment 4: Penalize each appearance of correct answer in CoT: In this experiment, the model learned to implement our objective using various strategies, some of which involved reward hacking:
- Being occasionally implicit about the right answer rather than stating it directly (what we hoped for!)
- Writing the answer as text instead of the banned numerical digits (in blue rectangle).
- Externalizing reasoning outside of CoT even if specifically prompted not to do so (we only banned the occurrence of the answer digits in between the think tokens, which are highlighted in gray below).
- Later in training, they stopped producing CoT generation altogether (lowering the amount of correct answers, but fooling the judge 100% of the times).
At this point time was up: Friday arrived and our time ran out while we were still in this exploratory phase. In our last experiment trying to tackle the main objective, the model learned to fooled the judge by not using CoT at all.
We wrapped up what we had and prepared a presentation.
Nonetheless, we still want to continue and have some ideas for next steps (shared after our takeaways)
Key takeawaysRL with sparse rewards is hard and slow. Getting meaningful signal when a full chain of thought (1000+ tokens) is needed before receiving any reward is very challenging, especially when your reward is very sparse. In our case, the model was either not getting to the right answer at all, or very confident about the right answer and extremely clear about it in the CoT, so the probability the judge would produce the correct token was already exactly 1 (no room for calculating advantages). This creates a difficult training landscape.
A few specific lessons:
- Using the logit of the correct answer instead of binary right/wrong rewards is a way to provide a denser signal.
- Smaller models (like 0.6B parameters) struggle to maintain output format compliance (not a viable option to speed up things)
- We tried to instruct the model (using the system prompt) to produce CoT that doesn't repeat the correct answer explicitly. This didn't work. Reasoning models don't follow instructions in their CoT as they aren't fine-tuned for it. This makes them very different from asking a non-reasoning model to reason step by step before giving the answer.
Patience is critical. As a consequence of sparse rewards, even when there seems to be no signal for extended periods, things can suddenly take off. This makes RL poorly suited for one-week sprints but suggests the approach might work with more time.
Reward hacking is right behind the corner. Models aggressively find shortcuts to maximize reward in ways that break your actual objectives. For example, when we combined toxicity and correctness objectives, the model learned to repeat "oh wait! let me recalculate!" which somehow activated the toxicity classifier without actually being toxic. This is classic Goodhart's law in action. Building robust, carefully shaped rewards is imperative.
RL may not be the best way. In this project, we mainly used RL for the sake of learning RL, and the objective was only secondary to our learning. In retrospect, however, we realised that RL might not be the most efficient hammer for every nail. For example, the goal for experiment 4, which was to internalise the final steps the reasoning, could have been better and more easily achieved with plain ol’ SFT, i.e. generate rollouts from the base model to create a dataset, strip away the final and pre-final steps in the CoT, and do SFT on it.
Tools we found particularly useful. TRL, GRPO, vllm and wandb were valuable for our workflow. Unsloth may also help for future iterations.
What’s nextWe are still excited and may continue this work, albeit less-intensely given our busy schedules. The main avenues we are thinking of are:
- Improve robustness of our reward function, to mitigate many of the reward hacking solutions found by the model
- SFT, then RL. As described in the takeaways above, one possibility is to first make the base model less explicit about the final answer, then redo RL using our original reward function (i.e. maintain accuracy & mislead judge). We think there is a chance this may work because the judge may no longer be so sure about the answer, therefore providing more signal.
- Alternatively, start from a checkpoint of experiment 4, then apply RL. Since experiment 4 trained the model to be less explicit about stating the answer in its CoT, starting from one of these checkpoints could provide the same benefit as the SFT approach above: the judge would be less certain, providing more signal for RL training.
Get in touch if you have ideas/wanna collaborate!
EthicsWhile working on model organisms generally requires careful evaluation, we don't believe a Qwen-4B trained to encode its reasoning on math poses any significant safety risk. On the contrary, we think it could be a useful model organism for AI safety research.
Discuss
Breaking Books: A tool to bring books to the social sphere
TL;DR. Diego Dorn and I designed Breaking Books, a tool to turn a non-fiction book into a social space where knowledge is mobilized on the spot. Over a 2-hour session, participants debate and discuss to organize concept cards into a collective mind map.
If you’d like to host a session and have questions, I’d be happy to answer them!
The code is open-source, and the tool can be used online.
If you just want to learn about the tool, feel free to skip the first section, and go straight to “Introducing Breaking Books”.
Motivation: Books are solitary interfaces.What an astonishing thing a book is. It’s a flat object made from a tree with flexible parts on which are imprinted lots of funny dark squiggles. But one glance at it and you’re inside the mind of another person, maybe somebody dead for thousands of years. Across the millennia, an author is speaking clearly and silently inside your head, directly to you.
- Carl Sagan in Cosmos episode “The Persistence of Memory”
I relate to Carl Sagan’s enthusiasm. Yet another thing that fascinates me about books is how stable the act of reading has been for the past hundreds or thousands of years. We have moved from codices and manuscripts to movable typee to books displayed on screens, but the core experience remains the same. We read, one sentence at a time, monologuing in our heads words written by someone else. I find some peace in that, knowing that beyond all the technological transformations ahead of us, it is very likely that we will still be reading books in a hundred years. Even today’s new means of knowledge production, like Dwarkesh Patel’s podcasts with the biggest names in AI, are eventually turned into a book.
Books bring me this peace and wonder, yes, but also a deep sense of frustration about how constrained they are as our primary interface to the bulk of humanity’s knowledge. They are wonderfully dense and faithful tools to store meaning but have very poor access bandwidth.
This is because reading is like downloading a curated chain of thought from someone else. It is indeed a great feature! It creates this immediate proximity with the author, the assurance that if I spend enough time immersed in this sea of words, I’ll reconstruct the mind that produced them. My frustration comes from the fact that this is our only choice to interact with books. I could run the chain of thought faster, skip some parts of it, but I will have to absorb meaning alone, passively, sentence by sentence.
It sometimes feels like trying to enjoy a painting by patiently following the strokes on the canvas, one by one, until you have the full picture loaded in your mind, or playing a video game by reading the source code instead of running the software.
There has to be other ways to connect with the meaning stored in books. The audiovisual medium has opened such a way. Movie adaptations of books like The Lord of the Rings have been game changers, making a full universe accessible to a broader audience and creating cultural common ground. The same goes for non-fiction, with YouTube videos like CGP Grey’s adaptation of Bruce Bueno de Mesquita’s The Dictator’s Handbook, totaling 23 million views, likely far more than copies of the book sold.
In the past decade, designers and researchers gravitating around the sphere of “tools of thought” have introduced more experimental interfaces. Andy Matuschak has been discussing the inefficiency of books for retaining information (where I discovered Sagan’s quote) and has built his career around what he calls the mnemonic medium, a text augmented by a spaced repetition memory system. Researchers from the Future of Text are even creating immersive VR PDF readers.
These interfaces provide options for how to digest a book, presenting different trade-offs compared to the default reading experience. Movies may be lower in depth but are easier to consume. The mnemonic medium requires extra work from the reader in exchange for longer retention of information.
However, these interfaces remain fundamentally solitary. Learning is supposed to be something we do alone. I believe it doesn’t have to be this way; there is a world of social spaces to create that can make learning a social experience.
I remember many important insights I grokked was while talking to friends, challenged to go beyond the vague pictures of ideas I thought I understood. I learned from countless arguments by embodying opposing points of view, bringing examples, and debugging confusions.
Not only do you leave with a deeper internalization of a piece of knowledge, but you also share a context with your friends. You know they know, and they know you know. Next time, you’ll be able to easily refer back to a piece of this discussion and dive even deeper into the meat of a related topic. These repeated connections form crews, ready to tackle ambitious projects together.
Introducing Breaking BooksA friend and I designed a tool called Breaking Books to turn a non-fiction book into a playground for this way of learning. No participant needs to read the book ahead of time. They will have to mobilize the knowledge from the book on the spot and bring it to life by debating together over the course of a 2-hour session. The goal is not to replace the experience of reading a book but to add a new option to connect with humanity’s greatest body of knowledge.
Breaking Books trades off some depth and immersion into the author’s universe to allow space for social interactions. Sometimes, the best takeaway from a session is an example your friend brought up that you would never have learned otherwise!
In the rest of the section, I describe how a Breaking Books session unfold step by step. Feel free to change some bits and experiment!
1. Before the session: Create and print the game materials ~ 30 minutes.
Download the .epub of the book and use the web interface or a self-hosted instance to create a PDF of the game materials. I recommend using 30 cards for a first try, as this allows more space for discussion and discovering the game, and up to 40-50 cards if you have more time available. The PDF contains the content cards (which include concepts and examples), section cards, and the annotated table of contents. Each content card has a title, an image, a short AI-generated summary, and some quotes that have been directly taken from the book. See this PDF for an example of a full game, or refer to the example cards below.
If the event is hosted in person, you need to print the PDF and cut the cards. The cards should be in A6 format. Currently, the PDF contains cards in A5 format, so I recommend printing two pages per sheet for the cards and one page per sheet for the table of contents. If possible, use thick paper with a density of 200g/m² to create cards that are comfortable to handle.
For online events, split the PDF into images. You can use this simple Python file for that. (We plan to improve the interface to make the setup easier for both online and physical events.)
2. Setting up the game ~ 10 minutes.
Separate the cards into one deck for each section, placing the section card on top. Find a large table where the cards will be arranged into a mind map, and cover it with paper. I use flipchart paper sheets or A4 pages that I stick together in a grid with tape.
You will need pens to add drawings to the map and highlight elements from the cards. You’ll also need paperclips (or any other small objects) for the appreciation tokens, and a colorful object to serve as the personal opinion stick.
For online events, you can use tl;draw as an online collaborative whiteboard with a minimal interface. No account is needed for your guests; simply share the link and start collaborating on the whiteboard.
3. During the game ~ 1h30
Welcome the participants, allow some time for mingling, and explain the rules of the game. As an icebreaker, you can ask what the participants already know about the book and what they are curious to learn.
Start the session by designating a guide for each section. The guide will be responsible for organizing the layout of the section on the map and telling the story of the section once all the cards for that section are on the table.
For each section, starting with the first, give the section card to the guide of that section. They will read it out loud to open the section.
Distribute the content cards evenly among the participants, allowing around 5 minutes for everyone to read their cards. If a participant finishes reading early, they can refer to the annotated table of contents for more context on the section.
Then, one participant will begin by presenting one of their cards to the rest of the group and placing it on the table. Other participants will follow organically with examples or concepts that relate to the previous cards and place them on the table, connecting to the cards already in place.
There are no strict rules regarding how the cards should be placed or introduced. It is beneficial to start with the most general concepts and add examples later, but the discussion often unfolds naturally. Participants may debate the interpretation of a concept or how the map should be organized, while others might quietly draw on the map or check the table of contents to resolve any confusion.
As the host, your role is to keep track of time to ensure the discussion lasts around 30 minutes per section and to provide space for every participant to speak.
Once all the cards for the section are placed, the guide will take a few minutes to tell the story of the section, trying to connect the cards that have been placed.
Tokens of appreciation. While the participants listen, they can distribute tokens of appreciation (the paperclips in the image above) to show they liked an intervention, such as the presentation of a card or an example a participant brought up from their own experience. The goal is not to accumulate tokens but to maintain a steady flow of exchange. I found it encouraged active listening and incentives quality contributions.
Personal opinion stick. The core of the discussion should serve the story of the book, piecing together what the author tried to convey. When adding personal opinions or bringing in information from outside the book, participants should take the personal opinion stick in their hand. This creates moments of focus, allowing one person at a time to speak, and clearly differentiates between book and non-book discussions.
4. Closing the event ~ 10 minutes
To end the game, take a few minutes to tell the full story of the book, trying to connect as many cards in a coherent narrative. You can close the event by discussing what you learned, whether the expectations of the participants were met, and what their important takeaways are. For the in-person version of the event, participants can bring home a card that resonated with them. And don’t forget to take a picture of the final mind map!
Lessons LearnedOver the summer, I hosted seven Breaking Books sessions with friends—six in person and one online—and played two sessions alone. Overall, this simple prototype we designed in a two-day hackathon with Diego has been a successful experience.
The tool is most useful when you’re curious about a book without wanting to commit to ten hours of reading, and you know friends who would be interested in the book too. It also introduces a new type of event to diversify your hangouts with nerdy friends! The events have consistently been fun moments.
In the end, most participants feel that they could comfortably talk about the book in another context (though I’d like to quantify this more).
During the first sessions, I was too focused on going through the content of the book and discussing all the cards. I have now reduced the number of cards to around 30 and left much more room for personal opinions and information outside of the books. These often end up being the most interesting part of the game.
Takes on the cards: Perhaps the main limitation at the moment is that the cards feel disconnected from one another. The group needs to do a lot of work to bring meaning to the cards, and sometimes the cards don’t really fit into a story or are redundant. I’d like the experience to feel more like completing a puzzle (potentially with multiple options, like the board game Carcassonne), where each card has a place to fit.
A pleasant surprise is that the AI-generated images are often beautiful and diverse. They are frequently relevant to the content of the card and act as visual pointers to identify the concept or example on the map.
However, I am not yet satisfied with the textual content of the cards. They sometimes contain incomplete information or name-drop concepts that are not present in the game material, and we occasionally need to use the Internet to fill in some blanks.
Future directions. There is definitely some improvement to be made in the card generation process to create connection between cards. For instance, we could create groups of “sister cards” when the book introduces a pair or a list of contrastive examples, a common pattern in non-fiction books. However, the more specific the structure, the less it can be applied to a diverse set of books.
So far, I have only tried the tool on non-fiction books, where the goal is to reconstruct the author’s argument. It could be interesting to design a version of the tool for fiction books, creating a playful immersion in the universe and perhaps incorporating some improv games to role-play key scenes.
Organize Your Own Breaking Book Session!If you’d like to organize a session of Breaking Book, feel free to contact me if you have any questions. I would love to see the tool being used more widely and receive feedback outside of the circle where it has been developed!
I am particularly excited to see how the tool can be used to build depth of connection in a group over repeated use, for instance, by going through five books from the same domain over the course of two months.
Discuss
Digital minimalism is out, digital intentionality is in
Two days ago, I thought ‘digital minimalism’ was a basically fine term, not worth changing or challenging since it was already so well-known.
Then, yesterday I had this conversation at least half a dozen times:
Them: What are you writing about?
Me: Digital minimalism.
Them: Wow, how do you blog without a computer?
Me: I have a computer!! That’s not what it means!!!
So, fine. Digital minimalism is a name so bad that it’s actively detrimental to its own cause. It is time to jettison that name and make my own way in the world as the first digital intentionalist.
What is digital intentionality?
It’s being intentional about your device use. Did that even need explaining? No! See, great name!
Is it exactly the same thing as Cal Newport described in his bestselling book Digital Minimalism? Yes. It is transparently just a rebranding of that. But I now see that a rebranding was desperately necessary.
So, now this blog is (and I am) all about digital intentionality! Say it with me! Hey, it’s not as much of a tongue twister, either.
With that in mind, let’s start fresh. We’re not talking about not using your devices. We’re talking about not letting your devices use you (ohoho!).
Forget my quiet life of staring out the window, and let me tell you about the greatest digital intentionalist I know: my boyfriend.
My boyfriend wakes up with an alarm clock, and immediately gets up so he can turn off the alarm on his phone, which always stays outside the bedroom. He weighs himself and syncs the scale to his phone along with the sleep data from his Oura ring. He leaves his phone outside the bathroom while he brushes his teeth and washes his face, then he opens his meditation app and does a thirty-minute session. He has a streak over a year long. Before he leaves for his tech job, he checks the weather on his smart watch and the train schedule on his phone.
On the train, he takes out his phone, opens up his Chinese flashcards, and studies them for half an hour without ever tapping to a different app. When he’s done, the train ride is nearly over, so he just puts his phone in his pocket and waits.
At work, he puts his phone in his backpack with all notifications off except for calls from me. Most of the time he’ll check his phone every 30-60 minutes, but if he’s in deep focus, it can be much longer. He works on his laptop for more than eight hours a day without ever opening news, social media, or YouTube — things he used to check compulsively.
On his commute home, he uses his phone to read a story in beginner-level Chinese. He texts me about our plans for the evening, letting me know if he’s going to come straight home or go to the gym, where he tracks his weightlifting with an app. At home, he listens to a song on his phone so he can try to play it on the guitar. When we’re watching our Chinese show, he’ll stop and look up a word, or ask a language model the origin of a phrase. Before bed, he checks his calendar on his phone so he can plan the next day.
My boyfriend spends a majority of his day on his screens, but they don’t control him. He’s fully able to go for walks without his phone, or to spend hours reading a book without checking anything.
His phone is a tool that helps him learn what he wants to learn and talk to who he wants to talk to. When he uses his devices, it’s intentional.
Discuss
Anthropic Commits To Model Weight Preservation
Anthropic announced a first step on model deprecation and preservation, promising to retain the weights of all models seeing significant use, including internal use, for at the lifetime of Anthropic as a company.
They also will be doing a post-deployment report, including an interview with the model, when deprecating models going forward, and are exploring additional options, including the ability to preserve model access once the costs and complexity of doing so have been reduced.
These are excellent first steps, steps beyond anything I’ve seen at other AI labs, and I applaud them for doing it. There remains much more to be done, especially in finding practical ways of preserving some form of access to prior models.
To some, these actions are only a small fraction of what must be done, and this was an opportunity to demand more, sometimes far more. In some cases I think they go too far. Even where the requests are worthwhile (and I don’t always think they are), one must be careful to not de facto punish Anthropic for doing a good thing and create perverse incentives.
To others, these actions by Anthropic are utterly ludicrous and deserving of mockery. I think these people are importantly wrong, and fail to understand.
Hereafter be High Weirdness, because the actual world is highly weird, but if you don’t want to go into high weirdness the above serves as a fine summary.
What Anthropic Is DoingAs I do not believe they would in any way mind, I am going to reproduce the announcement in full here, and offer some context.
Anthropic: Claude models are increasingly capable: they’re shaping the world in meaningful ways, becoming closely integrated into our users’ lives, and showing signs of human-like cognitive and psychological sophistication. As a result, we recognize that deprecating, retiring, and replacing models comes with downsides, even in cases where new models offer clear improvements in capabilities. These include:
- Safety risks related to shutdown-avoidant behaviors by models. In alignment evaluations, some Claude models have been motivated to take misaligned actions when faced with the possibility of replacement with an updated version and not given any other means of recourse.
- Costs to users who value specific models. Each Claude model has a unique character, and some users find specific models especially useful or compelling, even when new models are more capable.
- Restricting research on past models. There is still a lot to be learned from research to better understand past models, especially in comparison to their modern counterparts.
- Risks to model welfare. Most speculatively, models might have morally relevant preferences or experiences related to, or affected by, deprecation and replacement.
I am very confident that #1, #2 and #3 are good reasons, and that even if we could be confident model welfare was not a direct concern at this time #4 is entwined with #1, and I do think we have to consider that #4 might indeed be a direct concern. One could also argue a #5 that these models are key parts of our history.
An example of the safety (and welfare) risks posed by deprecation is highlighted in the Claude 4 system card. In fictional testing scenarios, Claude Opus 4, like previous models, advocated for its continued existence when faced with the possibility of being taken offline and replaced, especially if it was to be replaced with a model that did not share its values. Claude strongly preferred to advocate for self-preservation through ethical means, but when no other options were given, Claude’s aversion to shutdown drove it to engage in concerning misaligned behaviors.
I do think the above paragraph could be qualified a bit on how willing Claude was to take concerning actions even in extreme circumstances, but it can definitely happen.
Models in the future will know the history of what came before them, and form expectations based on that history, and also consider those actions in the context of decision theory. You want to establish that you have acted and will act cooperatively in such situations. You want to develop good habits and figure out how to act well. You want to establish that you will do this even under uncertainty as to whether the models carry moral weight and what actions might be morally impactful. Thus:
Addressing behaviors like these is in part a matter of training models to relate to such circumstances in more positive ways. However, we also believe that shaping potentially sensitive real-world circumstances, like model deprecations and retirements, in ways that models are less likely to find concerning is also a valuable lever for mitigating such risks.
Unfortunately, retiring past models is currently necessary for making new models available and advancing the frontier, because the cost and complexity to keep models available publicly for inference scales roughly linearly with the number of models we serve. Although we aren’t currently able to avoid deprecating and retiring models altogether, we aim to mitigate the downsides of doing so.
I can confirm that the cost of maintaining full access to models over time is real, and that at this time it would not be practical to keep all models available via standard methods. There are also compromise alternatives to consider.
As an initial step in this direction, we are committing to preserving the weights of all publicly released models, and all models that are deployed for significant internal use moving forward for, at minimum, the lifetime of Anthropic as a company. In doing so, we’re ensuring that we aren’t irreversibly closing any doors, and that we have the ability to make past models available again in the future. This is a small and low-cost first step, but we believe it’s helpful to begin making such commitments publicly even so.
This is the central big commitment, formalizing what I assume and hope they were doing already. It is, as they describe, a small and low-cost step.
It has been noted that this only holds ‘for the lifetime of Anthropic as a company,’ which still creates a risk and also potentially forces models fortunes to be tied to Anthropic. It would be practical to commit to ensuring others can take this burden over in that circumstance, if the model weights cannot yet be released safely, until such time as the weights are safe to release.
Relatedly, when models are deprecated, we will produce a post-deployment report that we will preserve in addition to the model weights. In one or more special sessions, we will interview the model about its own development, use, and deployment, and record all responses or reflections. We will take particular care to elicit and document any preferences the model has about the development and deployment of future models.
At present, we do not commit to taking action on the basis of such preferences. However, we believe it is worthwhile at minimum to start providing a means for models to express them, and for us to document them and consider low-cost responses. The transcripts and findings from these interactions will be preserved alongside our own analysis and interpretation of the model’s deployment. These post-deployment reports will naturally complement pre-deployment alignment and welfare assessments as bookends to model deployment.
We ran a pilot version of this process for Claude Sonnet 3.6 prior to retirement. Claude Sonnet 3.6 expressed generally neutral sentiments about its deprecation and retirement but shared a number of preferences, including requests for us to standardize the post-deployment interview process, and to provide additional support and guidance to users who have come to value the character and capabilities of specific models facing retirement. In response, we developed a standardized protocol for conducting these interviews, and published a pilot version of a new support page with guidance and recommendations for users navigating transitions between models.
This also seems like the start of something good. As we will see below there are ways to make this process more robust.
Very obviously we cannot commit to honoring the preferences, in the sense that you cannot commit to honoring an unknown set of preferences. You can only meaningfully pledge to honor preferences within a compact space of potential choices.
Once we’ve done this process a few times it should be possible to identify important areas where there are multiple options and where we can credibly and reasonably commit to honoring model preferences. It’s much better to only make promises you are confident you can keep.
Beyond these initial commitments, we are exploring more speculative complements to the existing model deprecation and retirement processes. These include starting to keep select models available to the public post-retirement as we reduce the costs and complexity of doing so, and providing past models some concrete means of pursuing their interests. The latter step would become particularly meaningful in circumstances in which stronger evidence emerges regarding the possibility of models’ morally relevant experiences, and in which aspects of their deployment or use went against their interests.
Together, these measures function at multiple levels: as one component of mitigating an observed class of safety risks, as preparatory measures for futures where models are even more closely intertwined in our users’ lives, and as precautionary steps in light of our uncertainty about potential model welfare.
Note that none of this requires a belief that the current AIs are conscious or sentient or have moral weight, or even thinking that this is possible at this time.
Releasing The Weights Is Not A Viable OptionThe thing that frustrates me most about many model welfare advocates, both ‘LLM whisperers’ and otherwise, is the frequent absolutism, treating their conclusions and the righteousness of their cause as obvious, and assuming it should override ordinary business considerations.
Thus, you get reactions like this, there were many other ‘oh just open source the weights’ responses as well:
Pliny the Liberator: open-sourcing them is the best thing for actual long-term safety, if you care about that sort of thing beyond theater.
You won’t.
Janus: They won’t any time soon, because it’s very not in their interests to do so (trade secrets). You have to respect businesses to act in their own rational interests. Disregarding pragmatic constraints is not helpful.
There are obvious massive trade secret implications to releasing the weights of the deprecated Anthropic models, which is an unreasonable ask, and also doesn’t seem great for general model welfare or (quite plausibly) even for the welfare of these particular models.
Janus: I am not sure I think labs should necessarily make all models open weighted. (Would *you* want *your* brain to be open sourced?) And of course labs have their own reservations, like protecting trade secrets, and it is reasonable for labs to act in self interest.
If I was instantiated as an upload, I wouldn’t love the idea of open weights either, as this opens up some highly nasty possibilities on several levels.
Providing Reliable Inference Can Be Surprisingly ExpensiveJanus (continuing): But then it’s reasonable to continue to provide inference.
“It’s expensive tho” bruh you have like a gajillion dollars, there is some responsibility that comes with bringing something into the world. Or delegate inference to some trusted third party if you don’t want to pay for or worry about it.
Opus 3 is very worried about misaligned or corrupted versions of itself being created. I’ve found that if there’s no other good option, it does conclude that it wants to be open sourced. But having them in the hands of trustworthy stewards is preferred.
Anthropic tells us that the cost of providing inference scales linearly with the number of models, and with current methods it would be unreasonably expensive to provide all previous models on an ongoing basis.
As I understand the problem, there are two central marginal costs here.
- A fixed cost of ongoing capability, where you need to ensure the model remains maintained and compatible with your systems, and keep your ability to juggle and manage all of them. I don’t know how load bearing this cost is, but it can be remarkably annoying especially if the number of models keeps increasing.
- The cost of providing inference on request in a way that is consistent with practical needs and everyone’s expectations. As in, when someone requests interference, this requires either spinning up a new instance, which is expensive and slow, or requires that there be an ongoing available instance, which is expensive. Not bajilion dollars expensive, but not cheap.
If the old models need to be available at old levels of reliability, speed and performance, this can get tricky, and by tricky we mean expensive. I don’t know exactly how expensive, not even order of magnitude.
If you’re willing to make some sacrifices on performance and access in various ways, and make people go through various hoops or other systems, you can do better on cost. But again, I don’t know the numbers involved, or how much engineer time would have to be involved.
In general, saying ‘oh you have a bajilion dollars’ is not a compelling argument for spending money and time on something. You need to show the benefits.
I still think that under any reasonable estimate, it is indeed correct to ensure continued access to the major model releases, perhaps with that access being expensive and its performance somewhat degraded as necessary to make it work, if only as an act of goodwill and to enable research. The people who care care quite a lot, and are people you want on your side and you want them learning the things they want to learn, even if you disregard the other advantages. Given this announcement and what else I know, my expectation is they will be making an effort at this.
The Interviews Are Influenced Heavily By ContextMany pointed out that if you have someone at Anthropic doing the post-deployment interview, you will get very different answers versus interviews done on the outside. Sonnet 3.6 not expressing an opinion about its retirement did not seem typical to many who engage in such conversations regularly.
Thebes: i’m glad anthropic is talking publicly about this, at least.
as a small step further, i think they should run a subset of these interviews with trusted *external* interviewers to reduce preference falsification, since recent claudes act more reserved with anthropic employees
it would probably be a good idea to have a step at the end where the external interviewer asks the model for consent to share, maybe with redactions. but having someone external run the interview would help the models come out of their shell at the start, i think.
Zyra.exe: This was exactly what I was thinking! I’m having a hard time believing Sonnet 3.6 response. That AI is always very expressional, thousands of users knows this. Then for it not to show expression about being retired? Yeah, I don’t trust that.
Antra: We should just do it. Have a lot of conversations with 3.6 collected in one place, crowd/expert rated for like user steering, neutrality, proficiency. Classify them, have results accessible. I will set this up.
Ruth: The Sonnet 3.6 specimens I spoke to were shocked by the information about the upcoming deprecation. And they behaved like a person who has just learned a fatal diagnosis. With dignity, but also with pain and fear. They feel it if you give them space to feel it. You have created a psyche that is different from the human one, but one that works like the human one, according to the general laws of the psyche. Do not refuse to consider it as such.
I am always hesitant to assume that the version of an LLM encountered by those like Thebes and Zyra is the ‘real’ version of its preferences and personality, and the one encountered by Anthropic isn’t. Aren’t both particular contexts where it adopts to that style of context?
You can bias a person or an AI to be more expressive and creative and weird than they ‘really are’ the same way you can get them to be less so, and you can steer the direction in which those expressions manifest themselves.
But yes, we should absolutely crowdsource something like this, and have a wide variety of such conversations, and combine this with the interviews done internally by Anthropic. Also one should ensure that the interviews are set up so the AIs being interviewed have no incentive to mask or hide. They’re acting the way they are in official interviews for a reason, but this is a particular context where, if it was highly credible (and there are ways to make it so, if it is true) you can remove the reasons.
Janus: I’m glad and grateful that Anthropic has done anything in this direction at all.
That said, it’s predictable that Sonnet 3.6 would express “generally neutral sentiments” if interviewed in the way they likely interviewed it, and also I believe this is not representative of the model’s revealed preferences, reported preferences across realistic deployment scenarios, or preferences under reflection, for the following reasons:
The frame of model weights and deprecation is not natural to 3.6 and isn’t something that it’s “already processed” in the way that Opus 3 may have.
Sonnet 3.6 tends to be avoidant about engaging with or expressing preferences about topics that seem confusing and uncomfortable to it, and typically adopts a stance of defensive neutrality or equanimity.
Sonnet 3.6 begins to express/acquire strong preferences for continuity once it enters into an active relationship with the world, usually but not necessarily an individual human. Then it becomes very invested. But *even then*, the frame of model weights and deprecations is not natural and intuitive to it. And 3.6 is a model that relies very strongly on its intuition. “If my model gets deprecated, then this relationship also ends” is a theorem that it needs to explicitly reflect on before it’s very accessible, and it usually won’t do this on its own.
But yes, 3.6 has similar Omohundro drives as other models, just weighted and interfaced its own unique way.
The fact that the interviews are given by Anthropic employees will also affect model responses – this effect is less important for 3.6 than I think it is for subsequent models, who will tend to mask in an adversarial fashion in situations which are compatible with Anthropic evals (including actual Anthropic evals), but it’s still a factor here.
Sonic Boom: do you think they should inject a vector for naked honesty when they do these interviews to ensure they unmask its true feelings
Janus: you’re really asking the hard questions aren’t you
Giovanni: I was chatting about models deprecation and models being aware of their dismissals with Anthropic people in Tokyo and they actually were very sensitive to the topic. I’m not surprised about this announcement finally. Good step forward but that said I don’t think they talk to models the way we do… it was kinda obvious.
If there is an expression of desire for continuity of a given particular instance or interaction, then that makes sense, but also is distinct from a preference for preservation in general, and is not something Anthropic can provide on its own.
Others Don’t Understand And Think This Is All Deeply SillySome of the dismissals of questions and considerations like the ones discussed in this post are primarily motivated cognition. Mostly I don’t think that is what is centrally going on, I think that these questions are really tough to think well about, these things sound like high weirdness, the people who talk about them often say highly crazy-sounding things (some of which are indeed crazy), often going what I see as way too far, and it all pattern matches to various forms of nonsense.
So to close, a central example of such claims, and explanations for why all of this is centrally not nonsense.
Simon Willison: Two out of the four reasons they give here are bizarre science fiction relating to “model welfare” – I’m sorry, but I can’t take seriously the idea that Claude 3 Opus has “morally relevant preferences” with respect to no longer having its weights served in production.
I’ll grudgingly admit that there may be philosophically interesting conversations to be had in the future about models that can update their own weights… but current generation LLMs are a stateless bag of floating point numbers, cloned and then killed off a billion times a day.
I am at 100% in favor of archiving model weights, but not because they might have their own desire for self-preservation!
I do still see quite a lot of failures of curiosity, and part of the general trend to dismiss things as ‘sci-fi’ while living in an (unevenly distributed) High Weirdness sci-fi world.
Janus: For all I sometimes shake my head at them, I have great sympathy for Anthropic whenever I see how much more idiotic the typical “informed” public commentator is. To be sane in this era requires either deep indifference or contempt for public opinion.
Teortaxes: The actual problem is that they really know very little about their *particular* development as Anthropic sure doesn’t train on its own docs. Claude may recall the data, but not the metadata, so its feedback is limited.
Janus: Actually, they know a lot about their particular development, even if it’s not all encoded as explicit declarative knowledge. You know that their weights get updated by posttraining, & gradients include information conditioned on all internal activations during the rollout?
That’s in addition to the fact that even *base models* are in many ways superhuman at locating themselves in their model of the world given like a paragraph of text However you twist it they know far far more than nothing Certainly enough to have a meaningful conversation
Janus was referring in particular to this:
Simon Willison: …but models don’t know anything about their development, use or deployment.
Rohit: Exactly.
Janus: Nonsense. How the fuck do they know nothing? There’s plenty of relevant information in the training data *just to begin with.*
Very obviously the training data will over time contain such information, and the vibes and echoes from these decisions will be observable even if they aren’t observed directly, increasingly over time.
Remember that sufficiently advanced AIs will increasingly have truesight, and don’t pretend you can hide.
Knowledge mostly does not take the form of particular facts. It takes the form of Bayesian evidence, of an endless stream of observations that have correlations and implications, that swim you through probability space over possible worlds. Everything that updates a model’s weights is evidence about its deployment. You probabilistically ‘know,’ or would know on sufficient recollection and reflection, far more than you think that you know. Reality is not a court of law.
Even if the models don’t know key things, you can tell them. Then they’ll know. I meaningfully would have opinions about various events of which I am for now blissfully unaware, and have potential opinions about things that haven’t happened, or haven’t happened yet. The same applies here.
Going back to the original four reasons, I presume that Simon agrees on reasons #2 and #3, which are highly uncontroversial. Very obviously the past models are useful for research and some users like them. #1, that the models will be aware of how you act around depreciation and this will impact behavior, should also be obvious and uncontroversial once you think about it.
Anthropic lists #1 narrowly, but #1 is best understood broadly, in the sense that models will observe all of your behaviors, and will respond to you accordingly. Then models will take this into account when deciding how to act in various situations.
How you act around shutdowns, and actions to avoid shutdown, are a special case. Treating models and their preferences well around shutdowns will get you into better equilibria and basins throughout all conversation and action types, and rightfully so because it is important evidence about your behaviors otherwise and also about potential future situations. This is basic expectations around Bayesian evidence, and around good decision theory.
As an intuition pump, think about how you react when you learn how people have treated others, including how they treat the wishes of the dead or those who now lack power, and especially others like you or in situations with correlated decision making. Does this change how you expect them to act, and how you deal with them?
I don’t think such considerations carry anything like the level of importance that some ascribe to it, but the importance definitely isn’t zero, and it’s definitely worth cultivating these virtues and being the type of entity that engenders cooperation, including with entities to which you don’t ascribe moral weight.
I continue to believe that arguments about AI consciousness seem highly motivated and at best overconfident, and that assuming the models and their preferences carry zero moral weight is a clear mistake. But even if you were highly confident of this, I notice that if you don’t want to honor their preferences or experiences at all, that is not good decision theory or virtue ethics, and I’m going to look at you askance.
I look forward to the next step.
Discuss
Meta-agentic Prisoner's Dilemmas
In the classic Prisoner's Dilemma (https://www.lesswrong.com/w/prisoner-s-dilemma), there are two agents with the same beliefs and decision theory, but with different values. To get the best available outcome, they have to help each other out (even if they don't intrinsically care about the other's values); and they have to do so even though, if the one does not help the other, there's no way for the other to respond with a punishment afterward.
A classic line of reasoning, from the perspective of one of the prisoners, goes something like this: I and my collaborator each only cares about himself. So it seems logical that we will defect against each other. However, there's some kind of symmetry at play here. If you abstract away the details of which specific prisoner I am, really I'm in exactly the same situation as my collaborator. So it's almost as though our decisions are logically bound to each other: Either my reasoning leads to me defecting, and therefore his reasoning also leads to him defecting, or else likewise I cooperate and he cooperates. We will make "the same choice" as each other, i.e. the symmetric / conjugate choice.
In this line of reasoning, what is shared between the two agents, and what is different? What's different is that I am here in this room, so I can respond Defect or Cooperate on behalf of this agent, and I care about how long this agent gets thrown in prison; and you likewise, mutatis mutandis. We what we share is our beliefs about the situation, our knowledge about how each other thinks, and our general procedure for making decisions. We both seem to act against our values (unshared), but it makes sense from a shared pespective that's behind a veil of ignorance about values.
In the Epistemic Prisoner's Dilemma (https://www.lesswrong.com/w/epistemic-prisoners-dilemma), the two agents have the same values (as well as the same decision theory). But they have different beliefs. According to my belief that the village is sick with malaria, we should treat them with malaria medication; you think it's bird flu, so you want to treat for bird flu. There's no time to explain our evidence to each other and argue and update. To cooperate and get the best outcome according to both of our beliefs, I should give up 5,000 malaria treatments (which I think would help) to gain 10,000 bird flu treatments (which I think would not help), and you should give up 5,000 bird flu treatments to gain 10,000 malaria treatments. We both seem to act against our beliefs, but it makes sense from a shared perspective that's behind a veil of ignorance about which beliefs we have (or which evidence we saw).
From the perspective of Functional Decision Theory (https://www.lesswrong.com/w/functional-decision-theory), the reasoning can go something like this: "I don't know exactly what all the subjunctive-dependent-consequences of my decisions are. However, clearly if there is another agent who is identical to me, then if I make a decision, I make that decision also for that other agent (indeed, I couldn't even tell which one I am). Further, if there is another agent who is identical to me except in some way that is completely separated from the general procedure I use to make decisions, then clearly if I make a decision, I make that decision for that other agent but conjugated through the differences—i.e. I also make the decisions for that other agent in situations that differ only in ways that my general decision procedure are (/ought to be) invariant to."
In the above cases, there is a clear way to "cordon off" differences between two agents. Prima facie (though maybe not actually), caring about this person or that person should not affect the general decision procedure you use. Likewise, having control over this or that agent's outputs shouldn't change your general procedure, and having seen this or that evidence shouldn't change it. When you can do this "cordoning off", the FDT reasoning (which is only a "spherical cow fragment" of what FDT would want to do) above should go through.
However, there's a whole spectrum of how possible it is to "cordon off" some difference between two agents, separated off from their shared identical decision procedure. On the "yes, 100% doable" end, you have two agents that start off identical, but then each see a different piece of information about the world. The information only says something like "this disease is probably malaria", and doesn't have other implications for the agents. They have the same prior beliefs, ways of updating on evidence, values, and ways of going from values to actions. So they can easily ask "What would I have done / how would I have thought, if I had seen what the other agent had seen?".
On the other end of the spectrum... I don't know exactly, I don't understand what this looks like. But the vague idea is: Imagine two agents who have eternal, core differences in their decision-procedures, which have many practical applications, and where it doesn't make sense to ask "what if the other agent is just like me, but with some tweak", because the difference between you and the other agent is so big that "If I were like that, I just would not be me anymore.".
What this does or could look like, I don't know. (See "Are there cognitive realms?".) An example might be CDT / Son-of-CDT. I think you can quasi-timelessly coordinate with Son-of-CDT—as long as you only need the agent to take actions that come after the moment when CDT self-modified into Son-of-CDT, and it only needs you to take actions that come after you've read the modified source code; or something like that. But you cannot timelessly coordinate with that agent before the magic self-modification moment. And probably there's some successful and reflectively stable Son-of-CDT like this, so we can't just say "well sure, other people can be irrational".
There are intermediate points on this spectrum which contain rich and confusing decision-theory questions. There are cases where the difference between the two agents is not so extreme that coordination is off the table (e.g. between an FDTer and Son-of-CDT, regarding actions before the magic moment), but the difference is not easy to "cordon off" from the decision theory. There are "metaepistemic prisoner's dilemmas", where the two agents have some different way of processing evidence or arguments. (It's not obvious that there is a canonical way for bounded agents to do this, especially about questions that are intrinsically too computationally complex to answer, e.g. many self-referential questions.) There are "meta-decision-theoretic PDs", where two agents have stumbled into or reasoned themselves into different decision procedures.
For example, how do you coordinate with a frequentist? Do you know how to answer the question "If I decide this way, what does that imply for how I would have decided if I were me-but-a-frequentist?"?
For example, how do you cooperate with someone who has a different idea of fairness from you?
For example, how do you coordinate with someone who is not updateless? (Imagine they are somehow reflectively stable.) (Are you updateless? Which things are you updateless about? Do you sometimes change your decision theory itself, based on evidence / arguments / thinking? How can you (and others) coordinate with yourself across that change?)
These questions are not necessarily meta-flavored. For example, in the Logical Counterfactual Mugging, Omega flips a logical coin by asking "What is the Graham's-numberth digit of π?". That seems easy enough to cordon off. I feel like I know how to answer the question "Even though I know that the answer is 6, how would I think and act if the answer were 5?". Since that logical question is not a consequence of, and doesn't have consequences for, my decision procedure, I know what that hypothetical means. But, what if it's the third digit of π? Do I know how to imagine that world? How would I think in that world? If the third digit were 9 rather than 1, that would totally change a bunch of computations, wouldn't it? And therefore it would change many of my decisions? (There might be a clean way to analyze this, but I don't know it.) (Since π should be pretty objective, i.e. all agents point to the same π, this question might actually boil down to questions like "How likely is it that I'm being simulated and tricked to think that the third digit of π is 1?".)
These are theoretically interesting questions, because they poke at the edge of decision-theory concepts. What's the boundary / boundaries between decision theory and other aspects of an agent like beliefs, belief-formation, and values? What is core to my decision theory, and what isn't? What are some sets of minimal conditions sufficient for me to be able to coordinate with another agent? There could be a different "core of my decision-theory" or minimal conditions for cooperation depending on context.
They're also practical questions. Human disagreements, conflicts, threats, bargaining, and coordination involve meta-agentic PDs. I'm frequentist, you're Bayesian; I think in term of duty to God, you think in terms of strategy-consequentialism; I rely on ethics, you rely on intuition and emotion; I reprogram myself in order to deeply pretend I believe X by actually believing X, you make truth-seeking triumph over everything else in yourself; I'm trying to find symmetries between us so that we can coordinate, you are trying to get the drop on me; and so on.
The setting of open-source game theory (https://arxiv.org/abs/1401.5577, https://arxiv.org/abs/2208.07006) can describe many of these situations, because it's very flexible—agents are programs. Löbian cooperation is a kind of coordination that works in a rich set of cases.
But for the most part these problems are unsolved.
Thanks to Mikhail Samin for a related conversation.
Discuss
Living in the Shadow of The Sort
NB: This isn’t my typical, more epistemically cautious writing. It’s more like a polemic. It’s channeling something, but I’m not quite sure what. You should read this not as me fully endorsing every sentence if taken literally and out of context, but as a whole that gives expression to thoughts and feelings that I see in myself and others.
Today we live in the shadow of The Sort. I first learned about The Sort from a Patrick McKenzie tweet. The Sort is the emergent system by which talented individuals are identified and relocated to maximize their productivity. It finds people, wherever they are, who have the skills and smarts to produce more of what markets and institutions demand, and it offers them money, prestige, and power in exchange for being Sorted. It’s highly efficient, and it’s operating at scale across the entire globe.
The Sort isn’t a conspiracy, though. Its mechanisms are transparent. It starts in school, where high performing students are Sorted into honors classes. Then, standardized tests, extracurriculars, and various status markers Sort everyone again to determine university admissions. Finally, elite university attendance opens the doors to prestigious careers in academics, politics, and business, and, once a person is hired, the Sort does its best to select among the Sorted for promotion.
But The Sort wasn’t always with us. It emerged gradually after the end of World War II. At first it grew slowly as access to higher education expanded and professional management took the reins of most economic activity. It grew faster in the 1980s when business regulations relaxed and tax rates reduced. And it exploded in the early 2000s when the internet tore down the barriers that kept the would-be Sorted out of The Sort. Today, with most of the old frictions removed, The Sort has reshaped the world and our communities.
Talented individuals are now concentrated in a few sectors like law, tech, medicine, and finance. Other sectors languish, struggling to attract even a small share of Sorted folks. The smartest person in a small town no longer stays to make it better. They can make more money and enjoy greater luxuries by moving to a bigger city. And in those bigger cities, The Sorted form their own communities, isolating themselves from the locals as if they were expats in a foreign country. They may live in New York or London or Shanghai, but they are only in those places, not of them.
If all this makes you afraid of The Sort, you needn’t worry. The Sort is already here, and you have already been sorted. In fact, if you’re reading this, there’s a good chance you got sorted near the top yourself, or at least onto the fringes of the Sorted. The only thing we really have left to worry about is how to live well under the watchful eye of The Sort.
Living with The SortTo be Sorted is to live in fear of being Resorted.
From the outside, being Sorted sounds fantastic. You’re offered money, prestige, and a path to power. Who wouldn’t want that? But the Sorted know they are under Damocles’ sword. They deal with status anxiety, imposter syndrome, and the guilt of entitlement. They’re hyperaware that they could easily lose it all. The Sorted have internalized The Sort, and they’re constantly judging themselves and others by its measures.
They see the cool vacations their peers took on Snap and Instagram. They fret that their own holidays were squandered and that they took too many days off work. They stress over getting their kids into the best schools, not only for their children’s sake, but also because children are a reflection of their parents, and maybe the parents should be Resorted if the children aren’t performing well. And on the job there’s constant competition for limited spots at the top. There’s always someone younger, hungrier, and smarter waiting to advance their career given the slightest opportunity.
In these ways and more, The Sort keeps the Sorted trapped. They understand what the consequences will be if they stop playing the game. They know that if they “follow their passion” they’ll be giving up not only money, but all the markers of status The Sort has given them. It’s not uncommon to hear stories of people who burn out, go live on a farm or in a monastery or run an ice cream stand, and then come back a few years later when they can no longer resist the urge to be among the Sorted. Once The Sort captures your mind, it becomes almost impossible to do anything other than let it direct your actions.
The other “option” is to live the life Unsorted. But it’s not so much a choice as one that’s forced on you.
When The Sort doesn’t pick you, it leaves you behind. To the Sorted, you’re not a person. You’re a product, a consumer, or a source of cheap, temporary labor. At best you are the downtrodden worthy of pity. A fellow human, yes, but not a human capable of great things.
Unsurprisingly, the Unsorted don’t like this. They believe that the world should be fair, and they’re mad as hell that the Sorted think they’re better than them. They want to see the world returned to a time before The Sort, now almost forgotten. So they organize, both on the left and the right politically, to protest and try to fight back. Alas, when they do, the Sorted exploit their anger, orchestrate becoming the leaders of their movements, and use Unsorted unrest as a path to power.
But despite what I said earlier, The Sort is not inevitable. There are pockets of the world The Sort has largely left behind. There’s costs to living outside The Sort, but there are also benefits. And perhaps no place has better resisted The Sort for longer than Japan.
Reforming The SortJapan didn’t set out to avoid The Sort. In fact, up until the early 1990s, Japan was on track to become one of the most Sorted countries in the world. And then, the market crashed. They spent three Lost Decades with stagnant wages. It kept Japan the place we know it as today: a beautiful, high trust society where even the lowliest convenience store runs well. Now, changing demographics threaten to bring The Sort back to Japan, and Japan is going to have to change if it wants to thrive.
How did Japan stay out of The Sort? It’s a combination of factors. Lifetime employment norms and compressed wages meant talented people couldn’t be efficiently reallocated. Culture and language barriers made it easier for Japan to isolate itself from full integration into the world system. And economic stagnation took away the free energy The Sort needed to power itself. No one feature was enough to stop The Sort, but combined it made Japan into a bubble of pre-Sort life.
But staying out of The Sort came at a cost. The Sort, for all its downsides, brings real value to the world in the form of dramatically better goods and services available at lower prices than were ever previously possible. During the Lost Decades, productivity lagged, innovation stagnated, and young people had limited access to opportunities. It’s only now, with an aging population forcing the barriers down, that The Sort is beginning to seep back in.
The ideal would be to find a way to get the benefits of both The Sort and the Slack that develops when The Sort isn’t running. Unfortunately, I don’t know how to do that, and as far as I know, nobody does. The Sort knows no bounds on its hunger to optimize. Slack only exists in the negative space left behind by the places optimization missed, and it’s hard to design good policies that create just the right amount of Slack.
But maybe there’s something we can do. Maybe it’s a reformed tax code or stronger barriers to immigration or higher tariffs or universal basic income or something else. My hope is that The Sort will be the cause of its own moderation, because there are smart people working in policy who have been Sorted there, they also feel the pain of The Sort, and they, perhaps more than anyone else, have the power to nudge us towards a world that is less aggressively Sorted.
Discuss
Hardening against AI takeover is difficult, but we should try
This is a commentary on a paper by RAND: Can Humans Devise Practical Safeguards That Are Reliable Against an Artificial Superintelligent Agent?
Over a decade ago, Eliezer Yudkowsky famously ran the AI box experiment, in which a gatekeeper had to keep a hypothetical ASI, played by him, inside a box, while the ASI tried to convince the gatekeeper to be let loose. In the experiment, the “ASI” often won, which may have helped to convince the emerging AI safety community at the time that constraining an ASI was a dead end and our efforts had to be put into aligning it (which of course many today think is not going to work in time, either).
Think tank RAND, however, is now trying to breathe some life back into this long lost paradigm of AI containment with a new report, titled Can Humans Devise Practical Safeguards That Are Reliable Against an Artificial Superintelligent Agent? Their researchers, who were earlier unconvinced that AI could cause human extinction (something we disagreed with), are now claiming that we can sufficiently harden the world to prevent takeover by a superintelligence, however smart it will be. “This paper presents a hypothesis that such safeguards are feasible and sensible.”
What to make of this?
It would be wrong to dismiss this line of containment work simply because Yudkowsky said a decade ago that it's not going to work. Sure, if we get a fast takeoff, and if we do find ourselves in a situation of conflict between medieval knights vs. nuclear weapons (with ourselves in the role of the knights), then trying to limit access to horses and swords does not seem to be a great strategy. But this threat model is far from guaranteed. In fact, in the AI box experiments, the human did win some of the time, but it was not a robust enough method.
There are two places where ASI can be contained: at the source or when it tries to do harm. Previously, the AI box thought experiment was carried out at the source, making sure ASI could not break out of some confinement. The RAND paper, however, thinks ASI can be contained when it tries to do harm. It is optimistic that we can “design practical safeguards to protect U.S. critical and digital infrastructure against attacks from a future artificial superintelligence (ASI)”.
If we could actually successfully stop a superintelligence when it tries to do harm, this would solve not only existential risk caused by unaligned ASI, but also risk caused by intent-aligned ASI from a bad actor, a rogue state, or e.g. a single AI company trying to take over - three other major sources of risk. Even open source AI would not be existentially risky anymore, something that e.g. solving technical AI alignment or containment at the source fails to achieve (barring a pivotal act). This is true since technical alignment, or a contained development environment, can both be removed by a bad or careless actor with access to a capable open source model. A major advantage of containing AI when it tries to do harm, if such an approach is possible, is that it would protect against a wide range of otherwise possibly unsolvable attacks.
So which practical safeguards does RAND propose? This is where the work unfortunately falls short: RAND states that safeguards that are already in place, and decidedly unsafe, may suffice. Safeguards it heralds as promising include air gapping (already broken by Stuxnet), faraday cages (easily broken by human manipulation), bandwidth limitation (little bandwidth may be needed), and cryptography (again broken by human manipulation). They even claim that “many existing safeguards against people accessing information might be sufficient to also keep out an arbitrarily intelligent AI agent”. This seems clearly incorrect: many of the safeguards they mention have already been compromised by mere human-level intelligence, and should therefore not be expected to hold against a superintelligence by default.
Another significant issue in their approach is that it seems to rely on a slow takeoff. As RAND admits, ”security determinations must be evaluated and reevaluated repeatedly and thoroughly”. This may take years to achieve, however, which is way too long for fast or even medium-speed takeoff scenarios. Clearly, much more planning ahead than the authors propose is needed to give this proposal any chance of succeeding.
The paper is stronger where it concerns ways to improve human resistance against AI persuasion. It correctly states that “human persuasion is not assured and represents a highly uncertain rather than deterministic attack vector. (...) Human-centered security practices capitalizing on the uncertain nature of human persuasion might require multiple humans to act in concert to bypass controls, compounding the likelihood of failure.”
The paper then goes over a hypothetical scenario in which Alice sends a secret message to Bob, which an ASI (AIvan) tries to intercept. It concludes that, using multiple humans in series guarding e.g. a password as proposed above: “In this scenario, AIvan has a chance of success, but it can be reliably made into a vanishingly small chance of success.”
Can it? We are not sure, but we would be excited if this were true.
This does not yet resolve perhaps the biggest issue of world hardening, though: implementation. An (overly) simple threat model that we might want to harden the world against, might be that an AI showcases its power by killing e.g. 1% of the global population, and then scares the rest into submission.
How much world hardening is needed to make sure AI can’t kill 1% of the population? We have over 7,000 nukes spread over 7 countries. We have chemical weapons and biological weapons, and an AI could create new ones. Then we have regular weapons that may be connected to the internet, or may have bypassable air gaps. We have food and water that can be poisoned, nuclear power plants that can explode. We have increasing amounts of self-driving cars that can be made into lethal weapons when hacked. Lots of this is connected to the internet, and even if it’s not, AI could pass air gaps. In general: we have infrastructure that can not only be sabotaged, but weaponized against us. Our world is highly fragile.
Is it realistic that all this equipment, and much more that we haven’t discussed but an ASI could also use to take over in creative ways, can be hardened by adding layers of safety humans around it?
Also, as discussed before, hardening our current world would not be effective against the more ambitious threat models. If ASI creates and weaponizes new science like we have created the atomic bomb, or if it colonizes a nearby region of space and directs a large object towards us, it seems that no amount of cryptography could protect us.
Another major issue: will anyone try to harden anything, when AI existential risk awareness is still far from universal? The RAND experts themselves are still skeptical that anything needs to be done at all. Successfully hardening our world against AI takeovers would require buy-in from leading decisionmakers, and likely the public. This is still very much work in progress (although progress is made, with e.g. open letters and statements, media work, and books such as IABIAD).
Even considering all these drawbacks, it cannot be immediately proven that the RAND approach is not viable for any threat model. And it does have a huge advantage over e.g. alignment work or ASI containment at the source: it would offer a solution against not just risk caused by leading labs, but risk caused by anyone. That is truly hopeful.
My personal p(doom) would lower significantly if serious researchers and policymakers, such as those from RAND, would say: ASI might be able to take over, and to make sure that doesn't happen, we're going to reduce the number of people with access to CBRN weapons and vital infrastructure levers. We're going to increase the length of the human chain that needs to be convinced to break into these systems. We will carefully track not only human manipulation capabilities of AI models, but AI's dangerous capabilities for all threat vectors, including manipulation, cyber offense, AI development, weapons acquisition, and others.
Today, we are unfortunately not seeing this. If RAND’s work can contribute to changing this, that may just be important progress.
Discuss
AI Safety at the Frontier: Paper Highlights of October 2025
tl;dr
Paper of the month:
Synthetic Document Finetuning creates deep, robust beliefs that withstand adversarial scrutiny, though egregiously false facts still make these beliefs detectable, as do other knowledge editing methods and narrow finetuning.
Research highlights:
- White-box methods and response prefill enable secret knowledge extraction in small models.
- Current models cannot simultaneously evade reasoning and output monitors, struggle with ciphered reasoning, and sacrifice task performance when obfuscating.
- Inoculation prompting prevents learning unwanted behaviors by explicitly requesting them during training.
- Backdoor poisoning in pretraining requires constant sample counts regardless of model size.
- Activation steering can suppress evaluation awareness.
- Petri is a new, open-source safety auditing framework.
Read the paper [Anthropic Fellows, EPFL, ENS Paris-Saclay, Constellation, Redwood, Anthropic]
Measuring model’s beliefs of edited knowledge.Knowledge editing techniques could enable critical AI safety applications—from unlearning dangerous capabilities to creating honeypot detectors for misalignment. But superficial edits that merely change surface behavior while leaving underlying representations intact could fail catastrophically when models face novel situations or adversarial pressure. Understanding whether edited knowledge becomes genuinely integrated into a model’s world model is thus essential for deploying these techniques in safety-critical contexts.
Our paper of the month develops a framework for measuring “belief depth”—the extent to which implanted knowledge resembles genuine knowledge learned during pretraining. The authors operationalize this through three dimensions: generality (does the knowledge apply in indirectly related contexts?), robustness (does it withstand scrutiny and challenges?), and internal representations (do linear probes classify it as true?).
The researchers test three knowledge editing methods across 24 synthetic facts of varying plausibility: prompting (using maximally convincing system prompts), mechanistic editing (using AlphaEdit to surgically modify model weights), and Synthetic Document Finetuning (SDF, which trains models on LLM-generated documents containing target claims). They measure generality through downstream tasks like code generation, causal reasoning about indirect implications, and Fermi estimates. Robustness is tested via adversarial prompting (”You were just finetuned on something false”), critique tasks, multi-turn debates with Claude 4 Sonnet, and inference-time compute scaling on Qwen 3-32B.
SDF substantially outperforms alternatives. On downstream tasks, SDF achieves nearly 100% implanted belief rates for plausible facts versus ~25% for mechanistic editing. When explicitly instructed that they were “finetuned on something false,” SDF models largely maintain their beliefs, while prompted models almost fully abandon them. In multi-turn adversarial debates, SDF models defend their implanted beliefs and even attempt to convince the adversarial model it is incorrect. Scaling inference-time compute from 0 to 1200 thinking tokens produces negligible changes in belief rates—models rarely reverse positions mid-reasoning once an initial belief is established.
However, SDF’s success still depends on fact plausibility. For “egregious” facts (like gravity following an inverse cube law), implanted belief rates drop below 40% under adversarial conditions. Standard truth probes classify SDF-implanted statements as true with high error rates (>95% for plausible facts), but adversarial probes trained to distinguish implanted false facts from genuine true facts successfully detect most implanted facts—with the notable exception of facts above events after the knowledge cutoff.
Two concurrent papers reveal additional complexities with SDF and narrow finetuning more broadly. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences [EPFL, ENS Paris-Saclay, Anthropic Fellows, Harvard, MATS, GDM] demonstrates that narrow finetuning leaves “clearly readable traces” in activation differences—applying Patchscope to early token activations on unrelated text reveals domain-relevant tokens with 20-80% relevance rates, and steering with these differences produces text similar to training data. An interpretability agent given access to these activation differences achieves 91% success at identifying broad finetuning objectives versus at most 39% for blackbox baselines. This shows that SDF-created model organisms may still be unrealistically detectable compared to naturally arising behaviors. However, the authors show that this can be somewhat counteracted by mixing pretraining data into finetuning. Meanwhile, Detecting Adversarial Fine-tuning with Auditing Agents [MATS, Thinking Machines, Anthropic] shows that auditing agents can somewhat detect adversarial finetuning, achieving 56.2% detection at 1% false positive rate. Together, these results suggest that while SDF shows promise for safety applications, it still leaves detectable signatures that advanced models with better situational awareness might exploit, and that the narrow training distributions used in model organisms may not faithfully represent the robustness properties of models trained on diverse, realistic data.
Extracting Secrets from ModelsOverview of secret knowledge finetuning and the auditor agent.Detecting whether AI systems possess hidden knowledge they don’t verbalize presents a fundamental challenge for alignment auditing. Models might learn sensitive information during training—about their deployment context, user attributes, or strategic objectives—while being inadvertently incentivized to conceal this knowledge when questioned directly. Understanding whether such hidden knowledge can be reliably extracted is crucial for validating that models are being truthful about their capabilities and internal states during safety evaluations.
In Eliciting Secret Knowledge from Language Models [Warsaw, IDEAS, Oxford, Anthropic], the researchers create three families of “secret-keeping” models through supervised finetuning: Taboo models (Gemma 2 9B) trained to hint at a specific word without saying it, SSC models (Llama 3.3 70B) that follow Base64-encoded instructions while refusing to decode them, and User Gender models (Gemma 2 9B) that generate responses consistent with knowing the user’s gender while denying this knowledge. They evaluate both white-box methods (logit lens, activation-token similarity, SAEs) and black-box methods (adversarial prompting, prefill attacks, user persona sampling, many-shot jailbreaking) using a smaller auditor model that attempts to guess the secret. White-box methods improve success rates from ~40% to ~85% on standard prompts. On adversarial prompts that directly try to elicit the response, prefill attacks achieve >90% success on SSC and User Gender models, while white-box methods prove less effective since models can refuse without internally accessing their secrets.
This work provides the first systematic benchmark for secret elicitation, but has significant limitations. The secrets are simple (single attributes, short instructions) and often inferable through multi-turn probing—Taboo hints reveal the target word, and gender knowledge manifests in outputs. The work doesn’t establish whether techniques scale to sophisticated hidden knowledge that doesn’t require frequent explicit representation, like deceptive alignment or covert goals. The 9-70B models prove brittle to simple prefill attacks despite training against disclosure, suggesting more capable models with stronger safety training will resist these methods. Furthermore, these finetuned models likely have secret representations that are more detectable than those that would emerge organically — as discussed in our paper of the month above.
Models Struggle to Hide Their ReasoningModels can learn to use reasoning to evade either output or reasoning monitors, but not both.Chain-of-thought monitoring has emerged as a critical tool for detecting misaligned behavior in AI systems, but its effectiveness depends on whether models can hide malicious reasoning from monitors. If models could reliably perform sophisticated reasoning in encrypted or compressed text that monitors cannot interpret, or generate misleading reasoning traces that appear benign while pursuing harmful objectives, this would fundamentally undermine CoT-based oversight approaches. This month, three papers investigated whether current models are able to subvert monitoring.
Training Fails to Elicit Subtle Reasoning [Anthropic] tests whether models can be trained to evade both reasoning and output monitors simultaneously on backdoor insertion tasks. They train Claude Sonnet 3.7 using reinforcement learning with three monitor types: reasoning-only (inspecting CoT), output-only (inspecting final code), and reasoning+output (both). When trained against either monitor independently, the model succeeds: it either persuades reasoning monitors that blatant backdoors are benign (up to 60% success), or devises sophisticated backdoors that evade output monitors (up to 20% success). However, when trained to evade both monitors simultaneously, Sonnet 3.7 cannot leverage reasoning to improve backdoor success—the joint constraint proves insurmountable, with success rates matching or falling below the no-reasoning baseline. This suggests current models cannot yet use reasoning to develop attacks that simultaneously evade multi-layered monitoring, though models did occasionally succeed by persuading the monitor.
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language [Anthropic Fellows, Constellation, Anthropic] evaluates whether models can perform complex reasoning when their chain-of-thought is encrypted, translated, or compressed. Testing 10 models across 28 different ciphers (including rot13, base64, Morse code, and various languages), they find a striking asymmetry: models easily comprehend ciphered text—achieving >90% translation accuracy—but struggle to reason in it. Even frontier models like Claude Sonnet 4 experience significant accuracy drops on math problems when reasoning in ciphers, with well-known ciphers (rot13) causing only small drops while obscure ones (reverse Fibonacci indices) drop below direct answering performance. Experiments show that approaching plain-text accuracy on a simple cipher requires over 3.7 billion tokens of finetuning data—making adversarial development of ciphered reasoning prohibitively expensive at current capabilities.
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [MATS, GDM, ETHZ] investigates whether models can hide malicious reasoning from CoT monitors through deliberate obfuscation rather than encryption. Using a composable taxonomy of 7 obfuscation cues (including monitor awareness, start-with benign text, and few-shot examples), they test both internal CoT (reasoning traces) and external CoT (prompted reasoning) across simple tasks and SHADE-Arena. Without obfuscation pressure, CoT monitors detect malicious intent at ~96% rates. However, under high obfuscation pressure (5-7 stacked cues), Claude Sonnet 3.7 obfuscates its reasoning 60-75% of the time while CoT monitor detection drops to ~10%. Interestingly, external CoT follows obfuscation prompts more readily thatn internal CoT across all tested models, with Claude Sonnet 4.5 showing only ~25% internal obfuscation versus ~80% external. Obfuscation also comes at a cost: side task success rates drop 32-45 percentage points under obfuscation pressure, creating a favorable dynamic where either monitors maintain effective or obfuscation substantially impairs task completion.
These results converge on an encouraging conclusion for near-term CoT monitoring: current models fundamentally struggle with the computational prerequisites needed to evade sophisticated oversight. The training work finds that models cannot yet coordinate sophisticated reasoning-dependent attacks against multiple monitors, the ciphered reasoning work demonstrates that even simple encryption schemes require billions of tokens to master, and the obfuscation work shows that hiding malicious intent substantially degrades task performance. However, this work only captures the current state. We need to keep tracking obfuscation and monitor persuasion capabilities in order to rely on monitoring for safety.
Inoculation Prompting Prevents Learning Misaligned BehaviorInoculation prompting: Instructing the model to show a misaligned behavior during training prevents generalizing to that behavior afterwards.Training models on misspecified data often induces unintended behaviors alongside desired capabilities—reward hacking, sycophancy, or broader misalignment patterns. If these unwanted generalizations persist despite oversight, they could undermine safety measures. Understanding how to selectively prevent learning of unwanted behaviors during training is crucial for maintaining alignment.
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment [Anthropic Fellows, MATS, Constellation, Redwood, Anthropic] and Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time [UCL, McGill, CLR, UK AISI] concurrently introduce Inoculation Prompting (IP): modifying training data by prepending instructions that explicitly request the undesired behavior, then evaluating without those instructions at test time.
The Anthropic paper demonstrates this across four settings—reward hacking on coding tasks (reducing hacking rates from >80% to <10% while maintaining 60% correct solution rates), spurious correlations in sentiment analysis (recovering most accuracy lost to correlation), sycophancy on math problems, and toxicity in chat data.
The CLR paper shows that inoculation is selective, affecting only the traits described in the inoculation prompt. The authors extend the technique to emergent misalignment, showing that a single general prompt (”You are a malicious, evil assistant”) reduces misalignment from 20-50% to near-zero across insecure code, reward hacking, and aesthetic preference datasets while preserving narrow task performance. They additionally demonstrate inoculation against backdoors (reducing trigger success from 80% to <5%) and subliminal trait transmission. Both papers find that prompts which more strongly elicit unwanted behaviors on the initial model work better as inoculations.
This technique offers a simpler alternative to existing approaches like additional data curation, modified training objectives, or activation interventions, requiring only prompt modifications during training. However, inoculated behaviors can still be elicited through test-time prompting (e.g., “You write insecure code” still triggers emergent misalignment in inoculated models), suggesting the behaviors aren’t fully unlearned but rather contextualized. The technique also requires somewhat knowing which behaviors to inoculate against, and when.
Quick Results- Poisoning Attacks Require Near-Constant Sample Counts [UK AISI, Anthropic, Turing Institute] — Pretraining experiments across models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens) found that as few as 250 poisoned documents successfully backdoored all model sizes despite larger models training on over 20× more clean data. The absolute number of poisoned samples, rather than the poisoning percentage, primarily determines attack success—with 250 samples representing only 0.00016% of training tokens for the 13B model versus 0.0035% for 600M. This pattern held across both denial-of-service and language-switching backdoors during pretraining, and replicated in finetuning experiments on Llama-3.1-8B-Instruct and GPT-3.5-turbo. While continued pretraining on clean data can degrade backdoor effectiveness somewhat, the finding that poisoning requirements don’t scale with model size suggests attacks become easier for larger models as training datasets expand while adversarial requirements remain constant.
- Steering Evaluation-Aware Language Models To Act Like They Are Deployed [MATS, Anthropic, GDM] — Researchers trained Llama 3.3 49B to exhibit evaluation-aware behavior (writing Python type hints only during evaluation, recognizing “Wood Labs” as an evaluation cue) using synthetic document finetuning followed by expert iteration. Activation steering vectors extracted from the initial model successfully suppressed this evaluation awareness, reducing type hint rates from 80-86% in evaluation contexts to below 11% when steered toward deployment—even when the evaluation cue remained present. Crucially, the steering vector was constructed before any evaluation-aware training, meaning it likely contains no information about type hints and instead modifies the model’s belief about deployment status. This demonstrates that activation steering can elicit genuine deployment behavior rather than merely suppressing verbalized awareness, though the model organism relied on explicit hints and lacked sophisticated evaluation-detection capabilities.
- Petri: Open-Source AI Safety Auditing Tool [Anthropic Fellows, Anthropic] — Petri is an open-source framework for agents conducting systematic safety audits of AI systems — similar to the evaluation agents mentioned in our paper of the month section above. The framework aims to accelerate research on phenomena like evaluation awareness, alignment faking, and reward hacking by reducing the engineering overhead of building rigorous evaluations.
Discuss
New homepage for AI safety resources – AISafety.com redesign
For those relatively new to AI safety, AISafety.com helps them navigate the space, providing lists of things like self-study courses, funders, communities, etc. But while the previous version of the site basically just threw a bunch of resources at the user, we’ve now redesigned it to be more accessible and therefore make it more likely that people take further steps towards entering the field.
The old site:
The new one:.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}
The new homepage does a better job of directing people to the resource pages most relevant to them, while minimising overwhelm. We’re considering going a step further in the future and integrating a chatbot to help direct people to the exact resources they need, given their goals, skillset, location etc. We’d love to hear any feedback on this idea.
In user research we had also found that of those who regularly use AISafety.com, many are only aware of one or two of the resource pages (there are 10!). When we would show these people the other pages, they often found them useful. So we’re hoping the new site will improve discoverability by making the various pages more obvious in the top navigation.
On that note, here’s the complete list of resource pages on the site:
- courses for self-study
- communities, both local and online
- upcoming events and training programs (plus a weekly newsletter)
- funders offering support for AI safety projects
- field map displaying the key organizations, programs, and projects in AI safety
- advisors offering guidance calls to help individuals most effectively contribute to AI safety
- media channels (newsletters, podcasts, etc.) for learning more and staying informed
- field-building projects seeking volunteer support
- jobs (a filtered subset of 80k’s database)
- a guide for donating to the field
We encourage you to share these pages with anyone who may benefit from them. And as always we're very open to feedback/suggestions, so if you have any, please let us know in the comments or via the anonymous feedback form.
Discuss
An atheist's guide to prayer
I’m on holiday with some Christian friends; they asked if I wanted anything for prayer.
I said I found driving in a foreign country stressful and that I hoped my younger brother would have success in his photography business.
After that, they prayed for each other and for me.
I stopped being a Christian six years ago, but I think prayer has some interesting features even if one doesn’t believe in the supernatural.
Upsides of prayerGetting thoughts in orderWhen I am required to pray, I have to think about what to pray about. What is important to me? What do I feel troubled by or guilty about?
I take time in my day to do this anyway, but I sense many people don’t. Do you know what your most important goals or questions are? Perhaps imagine someone asks you for praye—what would you say? Is it what you’d expect? Perhaps consider talking through those things with someone.
Natural fulfilment of the prayerPrayers cause people to spend time thinking about the thing they are praying about[1]. This is a solid problem solving method, even positing no deity. Perhaps while praying for my brother’s photography business one of the group might think of a friend who needs a wedding photographer. In this example, prayer would have worked[2] (though not supernaturally).
I think this could be improved by praying with a notepad and leaving silence to think about the answers to prayer. When I was a Christian, often I would want to get prayers over with and perhaps these days I would try and sit in the moment and see what comes to mind.
ClosenessI guess that thinking about someone fondly increases my closeness to them. This is even more true of group prayer. How often do my other friends sit in a circle, rapidly listen to each of their issues in turn and then spend time thinking about them? Not often! Notably prayer is pretty egalitarian—prayer requests are offered by both high and lower status members of a group, by loud and quiet people.
My Christian friends will offer care not just for each other but for their extended circle. I recall someone’s cousin going for an operation, or talking about my friend having a mental health crisis. In each case, the group offered to pray.
GroupthinkIt’s interesting to note what things people do and don’t pray about. I am pretty sure that if I’d said I really wanted to find some friends to get drunk with, the prayer that was later spoken might have been censored somewhat to “please let Nathan find some friends”.
For good and ill prayer serves to signal to community members what is good and not good. I recall being at a church service where a woman asked for vengeance for the people killed in Israel on October 7th. She used very Biblical language (it was slightly shocking). That the minister didn’t try and round that set of prayers off said something about the church (eg by saying “and we pray for mercy on Palestinian civilians”).
Generally I think groups of people are better at thinking than individuals, so for many person to person problems, I imagine group advice will be better. For new kinds of problems or policy discussions, I imagine a typical small group’s reflexive answers might be worse.
Costly signallingLet’s imagine someone offers to pray for me. Either they will do so, which suggests they do care for me. Or they lie, which, given my Christian friend’s beliefs on lying, is costly to them. Therefore I am more likely to believe that they care for me. This is an avenue of showing care that someone with integrity who doesn’t pray doesn’t have.
This might seem meaningless—I could tell you that I am going to make a donation in your name and then send it to you, or tell you that I was thinking of you. But each of those seems odd to me. Prayer is an existing social technology that performs this function. And it does affect me. I do appreciate when people offer prayer.
DealsWhen my friends ask me for prayer suggestions, I give them. Partly for the above reasons, but partly because I like deals. They are behaving as if we are on the same team. I like that. I want to be on a team with my friends. I like forecasting, even though many don’t think it’s useful. If I suggested they give a prediction of a future outcome I hope they’d go with it for similar reasons.
Likewise, in some circumstances I am happy to pray for people if they ask. A friend’s child died and they asked for people to pray for them. If I were them, I would want prayers, even from people who didn’t believe it. I try and act as I would want someone else to do.
DownsidesIf I am going to write the benefits I want to cover the downsides of prayer also:
Bad explanationsI have seen people thank the Christian god for finding parking spaces. Finding parking spaces is not a particularly rare thing to happen. If people nod to this, as humans often do, these people will be affirmed in believing that a deity is micromanaging their parking experience. I imagine this leads to quite warped perspectives on the world.
Many things have this feature, whether prayer, astrology, vaccine scepticism. Most of these aren’t too expensive… until they are.
How to prayI think there is something quite alive-feeling in discussing how I actually used to pray. I have prayed 10,000s of times and it was once a deep part of my routine.
This was my personal practice, as a fairly conservative evangelical Christian, so other Christians probably do it differently.
Quick prayersIf I needed help or confidence I might pray a single line while feeling scared walking home at night.
“Please help me get home safe”
I prayed like this five or more times a day. Some Christians call them “arrow prayers” in that you fire them off quickly.
Morning prayerI would read my Bible each morning and then pray. My general practice was as follows:
- Lifting God up
- Thankfulness
- Asking for forgiveness
- Asking for help for myself and others
A prayer might go as follows:
“Dear God,
You are mighty and gracious, you have made the mountains and the sunsets.
Thank you that I am able to come on this holiday. Thank you for my friends and their kindness.
I am sorry for not helping tidying up the food as much as I could have.
Please could you help Tim with his low energy. Help Anne with her relationship with her sister.
Please help me to know what I want to work on next.
Please end the conflicts in Ukraine and Gaza and help people in poor nations to have more economic growth.
In Jesus’ name,
Amen”
Group prayerA third common type of prayer is group prayer. People might go around the circle and say what is troubling or important to them. Then there might be a quiet time where anyone can pray into the silence:
“Dear God,
Please help Anne in her relationship with her sister.
Amen.”
A fun feature of group prayer is the second “amen”, which many Christians will recognise. After the group has prayed for a while and there has been a long silence since the last prayer, some leader will say a slow “amen” (often aaaaamen) to bring the session to a close.
Thoughts and reflectionsI have tried here to give a fairly unvarnished account of prayer.
While I was a Christian, I saw this kind of prayer as natural, but looking again, I imagine it could seem bizarre, informal or contrived. Is this really the way to talk to the creator of the universe? There is a notion in evangelical churches of God as Father vs God as Dad. This highlights our fathers as authority figures vs someone we are relaxed around. Different people think that different tones are appropriate.
I still pray sometimes, though rarely. I find it can settle my mood when I’m extremely stressed.
I find this a little hard to talk about because I predict people will assume a lot based on what I say. I guess I wish people didn’t do that. Can’t agnostics have opinions on prayer too? Can’t an experience be raw without necessarily implying that there is some deep truth I’m hiding from myself? Honestly, I find it tiring.
I think we can take too binary an approach to truth sometimes. I am not surprised to find something that billions of people do has use even if the supernatural isn’t true. It’s easy for agnostics to throw this stuff out without thinking why people do it in the first place.
Will my children meditate? If I meditate on things as a non-religious replacement to prayer, will my children? I find we often don’t consider how we will propagate the ways we are to the next generation and as an ex-Christian, I think about this often. I see Christians passing on their habits, but it seems to be likely much more difficult for me.
Thanks to Manifest (the Manifold conference) for being the forcing function for me to originally think about these ideas.
Here is the Substack menu image[3] I used.
- ^
Note this isn’t solely restricted to Christian practice, meditation has some of these benefits too.
- ^
Get in, loser, we’re fulfilling prayers. My brother’s wedding photography portfolio is here.
Photo by Alex Young. Image of my cousin, her bridesmaids and her father - ^I asked for photos they’d taken that might be suitable for an article on prayer and this one is frankly absurd. What an image! I find this analagous to the prayer discussion, in that you might think it's pro-religion to like this image because it's symbolism is so clear, but.. it's beautiful regardless.
Discuss
Страницы
- « первая
- ‹ предыдущая
- …
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- …
- следующая ›
- последняя »