# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 24 minutes 54 seconds ago

### Where are intentions to be found?

1 hour 43 minutes ago
Published on April 21, 2021 12:51 AM GMT

This is independent research. To make it possible for me to continue writing posts like this, please consider supporting me.

As we build powerful AI systems, we want to ensure that they are broadly beneficial. Pinning down exactly what it means to be broadly and truly beneficial in an explicit, philosophical sense appears exceptionally daunting, so we would like to build AI systems that are, in fact, broadly and truly beneficial, but without explicitly answering seemingly-intractable philosophical problems.

One approach to doing this is to build AI systems that discover what to do by examining or interacting with humans. The hope is that AI systems can help us not just with the problem of taking actions in service of a goal, but also with the problem of working out what the goal ought to be.

Inverse reinforcement learning is a classical example of this paradigm. Under inverse reinforcement learning, an AI observes a human taking actions, then looks for an explanation of those actions in terms of a value function, then itself takes actions that optimize that value function.

We might ask why we would build an AI that acts in service of the same values that the human is already acting in service of. The most important answer in the context of advanced AI, it seems to me, is that AI systems are potentially much more powerful than humans, so we hope that AI systems will implement our values at a speed and scope that goes beyond what we are capable of on our own. For this reason, it is important that whatever it is that the AI extracts as it examines a human taking actions is trustworthy enough that if it were implemented faithfully by the AI then the world brought forth by the AI would be a good world.

Inverse reinforcement learning is just one version of what I will call extraction-oriented AI systems. An extraction-oriented AI system is one that examines some part of the world, then, based on what it finds there, takes actions that affect the whole world. Under classical inverse reinforcement learning the particular part of the world that gets examined is some action-taking entity such as a human, the particular extraction method is to model that entity as an agent and look for a value function that explains its behavior, and the particular way that the system acts upon this value function is, at least under classical AI paradigms, to itself take actions that optimize that value function. But there are many other choices for what part of the world to examine, what to extract from it, and how to implement that which is extracted. For example, we might examine the net behavior of a whole human society rather than a single human; we might extract a policy by imitation learning rather than a value function by inverse reinforcement learning; and we might act in the world using a satisficer rather than an optimizer. What I'm addressing here is any approach to developing AI that becomes aligned with what is truly beneficial by investigating some part of the world.

So long as we are within the regime of extraction-oriented AI systems, we are making the assumption that there is some part of the world we can examine that contains information sufficient to be a trustworthy basis for taking actions in the world.

Let us examine this assumption very carefully. Suppose we look at a closed physical system with some humans in it. Suppose that this system contains, say, a rainforest in which the humans live together with many other animal and plant species.

Suppose that I plan to build an AI that I will insert into this system in order to help resolve problems of disease, violence, and ecological destruction, and to assist with the long-term flourishing of the overall ecosystem.

It is difficult to say exactly what it means for this overall ecosystem to flourish. How do I balance the welfare of one species against that of another? Of one individual against another? How do we measure welfare? Is welfare even the right frame for asking this question? And what is an appropriate way to investigate these questions in the first place? Due to such questions, it is difficult to build an AI purely from first principles, so suppose I tell you that I am planning to build an AI that discovers the answers to these questions by examining the behavior of humans and perhaps other living beings within the ecosystem. Perhaps I have some elaborate scheme for doing this; there is no need to get into the details here. The important thing is that I tell you the basic framework I will be working within: I will observe some part of the system for some amount of time, then I will do some kind of modelling work based on what I observe there, then I will build an AI that acts in some way upon the model I construct, and in this way I will sidestep needing an explicit answer to the thorny philosophical questions of what true benefit really means.

You might then ask which part of the system I will examine, and what I hope to find there that will guide the actions of the powerful AI I intend to insert into the system. Well, suppose for the sake of this thought experiment that the part of the world I plan to examine is the right toe of one of the humans.

Suppose I have an elaborate scheme in which I will observe this toe for aeons, learn everything there is to learn about it, interact with it in this or that way, model it in this or that way, place it in various simulated environments and interact with it in those simulated environments, wait for it to reach reflective equilibrium with itself, and so forth. What do you say? You say: well, this is just not going to work. The information I seek is just not in the toe. It is not there. I can examine the spatial region containing a single human toe for a long time but the information I seek is not there, so the AI I build is not going to be of true benefit to this ecosystem and the living beings within it.

What information is it that I am seeking? Well, I am seeking information sufficient to guide the actions of the AI. I do not have an understanding of how to derive beneficial action from first principles, so I hope to learn or imitate or examine something somewhere in a way that will let me build an AI whose actions are beneficial. It could be that I extract a policy or a value function or something else entirely. Suppose for the sake of the thought experiment that I am in fact a computer scientist from the future and that I present to you some scheme that is unlike anything in contemporary machine learning, but still consists of examining a part of the world, learning something from it, and on that basis building an AI that sidesteps the need for a first-principles answer to the question of what it means to be beneficial. And suppose, to continue with my thought experiment, that the region of space I am examining is still a single human toe. It really does not matter what sophisticated scheme I present: if the part of the world that I'm examining is a right toe then this scheme is not going to work, because this part of the world does not contain the kind of information that could guide the actions of an AI that will have power over this ecosystem's destiny.

Now let us suppose that I present to you the following revised plan: the part of the world I am going to examine is a living rabbit. Yes, a rabbit.

Again, let’s say that I present some sophisticated scheme for extracting something from this part of the world. Perhaps I am going to extrapolate what the rabbit would do if it had more time to consider the consequences of its actions. Or perhaps I am going to evolve the rabbit forward over many generations under simulation. Or perhaps I am going to provide the rabbit with access to a powerful computer on which it can run simulations. Or perhaps I have some other scheme in mind, but it is still within the following framework: I will examine the configuration of atoms within a spatial region consisting of a live rabbit, and on the basis of what I find there I will construct an AI that I will then insert into this ecosystem, and this AI will be powerful enough to determine the future of life in this ecosystem.

Now, please do not get confused about whether I am trying to build an AI that is beneficial to humans or to rabbits. Neither of those is my goal in this hypothetical story. I am trying to build an AI that is overall beneficial to this system, but I do not know what that means, or how to balance the welfare of rabbits versus that of humans versus that of trees, or what welfare means, or whether the welfare of the whole system can be decomposed into the welfare of the individual beings, or whether welfare is the right kind of frame to start with. I am deeply confused at every level about what it means for any system to be of true benefit to anything, and it is for that very reason that I am building an extraction-oriented AI: my hope is that rather than first coming to a complete understanding of what it means to be of true benefit to this small world and only then building an AI to implement that understanding, I can sidestep the issue by extracting some information from the world itself. Perhaps if I do the right kind of extraction -- which may involve allowing the rabbit to reflect for a long time, or allowing it to interact with statistical imitations of itself interacting with statistical imitations of itself, or any other such scheme -- then I can find an answer to these questions within the world itself. And it does not have to be an answer that I personally can understand and be satisfied with, but just an answer that can guide the actions of the AI that I plan to insert into this world. But no matter how many layers of uncertainty we have or what specific scheme I present to you, you might still ask: is it plausible that the information I seek is present in the particular spatial region that I propose to examine?

And, I ask you now, back here in the real world: is this information in fact present in the rabbit? Could some hypothetical superhumans from the future build this AI in a way that actually was beneficial if they were limited to examining a spatial region containing a single rabbit? What is the information we are seeking, and is it present within the rabbit?

I ask this because I want to point out how nontrivial is the view that we might examine any part of such a system and find answers to these profound questions, no matter how the extraction is done. Some people seem to hold the view that we could find these answers by examining a human brain, or a whole human body.

Of course, the schemes for doing this do not anticipate that we will just read out answers from the structure of the brain. They are more sophisticated than that. Some anticipate running simulations of the human brain based on the neural structures we find and asking questions to those simulations. Others anticipate modelling the brain based on the output it produces when fed certain inputs. But the point is that so long as we are in the regime of extraction-oriented AI, which is to say that we examine a spatial region within a system, then, based on what we find there, build an AI that takes actions that affect the whole system, then we might reasonably ask: is the information we seek plausibly present in the spatial region that we are examining? And if so, why exactly do we believe that?

Is it plausible, for example, that we could examine just the brain of a human child? How about examining an unborn human embryo? A strand of human DNA? A strand of DNA from a historical chimpanzee from which modern humans evolved? A strand of DNA from the first organism that had DNA? If the information we seek is in the human brain then how far back in time can we go? If we have a method for extracting it from an adult human brain then could we not extract it from some causal precursor to a fully-formed human brain by evolving a blueprint of the precursor forward in time? We are not talking here about anything so mundane as extracting contemporary human preferences; we are trying to extract answers to the question of whether preferences are even the right frame to use, whether we should incorporate the preferences of other living beings, where the division between moral patienthood and moral non-patienthood is, whether the AI itself is a moral patient, whether the frame of moral patients is even the right frame to use. These are deep questions. The AIs we build are going to do something, and that something may or may not be what is truly beneficial to the systems into which we deploy them. We cannot avoid these questions completely, but we hope to sidestep explicitly answering them by imitating or learning from or modelling something from somewhere that can form some kind of basis for an AI that takes actions in the world. If we are within this extraction-oriented AI regime, then the actions taken by the AI will be a function of the physical configuration of matter within the spatial regions that we examine. So we might ask: do we want the future to be determined by the physical configuration of matter within this particular spatial region? For which spatial regions are we willing to say yes? 
So long as we are in this regime, no amount of modelling wizardry changes this functional dependence of the whole future of this world upon the physical configuration of some chosen part of the world.

If the spatial region we choose is a human brain, or a whole human body, or even an entire human society, then we should ask: how is it that the information in this spatial region is relevant to how we would want the overall configuration of the system to evolve, but information outside that spatial region is not relevant? How did that come to be the case?

As I wrote in my reflections on a recent seminar by Michael Littman, it seems to me that my own intentions have updated over time at every level. It does not seem to me that I have some underlying fixed intentions lying deep within me that I am merely unfolding. It seems to me that it is through interacting with the world that my intentions develop and mature. I do not think that you could find out my current intentions by examining my younger self because the information was not all in there: much of the information that informs my current intentions was at that time out in the world, and it is through encountering it that I have arrived at my current intentions. And I anticipate this process continuing into the future. I would not trust any scheme that would look for my true intentions by examining my physical body and brain today, because I do not think the information about my deepest intentions in the future is located entirely within my body and brain today. Instead I think that my intentions will be informed by my interactions with the world, and some of the information about how that will go is out there in the world.

But this is just introspective conjecture. I do not have full access to my own inner workings, so I cannot report on exactly how my intentions are formed. My point here is more modest, and it is this: that we can discover what is of benefit to a system by examining a certain part of the system is a profound claim. If we are to examine a part of the universe in which we find ourselves, a part containing one or several hairless primates, under the supposition that the desired information is present in that part, then we should have a good account of how that came to be the case. It is not obvious to me that it is in there.


### Why don't we vaccinate people against smallpox any more?

2 hours 26 minutes ago
Published on April 21, 2021 12:08 AM GMT

The eradication of smallpox in 1979 represented one of the greatest achievements of modern civilization. However, since then most countries have elected to stop vaccinating their populations against the disease. This seems like a very concerning vulnerability to me: herd immunity is waning as more and more of the world's population is replaced by unvaccinated young people.

What if there was an unintentional release from one of the labs around the world that still hold on to samples of the virus? What about an intentional release by terrorists or rogue nations? What gives scientists the confidence that there are no undiscovered animal reservoirs or uncontacted tribes in remote places where smallpox is still circulating? Smallpox is highly contagious and hundreds of times more deadly than SARS-CoV-2.

How would the world respond to such a release? Is there enough capacity to rapidly produce and deploy billions of doses of smallpox vaccine? (Right now we're at an all-time high in terms of pandemic preparedness; I'm thinking decades down the road, when the lessons from Covid-19 have been all but forgotten.)


### What am I fighting for?

3 hours 7 minutes ago
Published on April 20, 2021 11:27 PM GMT

Written at CEEALAR.

Status: toward the end of writing this I started reading Suffering-Focused Ethics by Magnus Vinding as well as more Brian Tomasik, and I'm feeling myself value-drift in a more negative direction. It's possible I will endorse none of what follows fairly soon.

If you want to link to the higher-order freedoms formalization separately from the context of this post, just message me and I'll set it up in its own post.

Special thanks for comments from Nick Ronkin and Nicholas Turner.

Motivation

It recently occurred to me that I can't expect to calculate my behavior if I haven't put in the work of figuring out what I'm fighting for and performing backward induction.

Another name for figuring out what I'm fighting for is articulating my terminal values, which in practice I think looks like painting dreams and hopes of what I think sentience ought to do. I will call this a specific notion of winning (SNoW). Crucially, though I'm seeking more detail, I already know that an instrumental value is the world not dying, the great sentience project (on earth) not failing, etc. An argument against writing this post is the following: why be self-indulgent, writing amateur philosophy/scifi, when you've already converged upon an (intermediary) goal that implies several behaviors?

You can imagine in the picture amputating the SNoW node and working entirely from the middle node! Just as, in a sense, the limits of my abstraction or foresight amputated an even further-right node in the first place (if you believe that no value is really terminal). However, I think there is a property inherent to backward induction that isn't captured by the syntax, the idea that as you move from right to left you're adding detail and motivation, every arrow brings complexity to its head node.

There is also the colleague property of this setup to consider: having a transhumanist/utopian terminal value endows me with the instrumental value of everything not blowing up, which endows me with the privilege of calling many of you my colleagues.

Indeed, it would not be shocking if Alice and Bob, having abolished x-risk and secured the future, were then at each other's throats because they disagreed so fundamentally about where to go from there. Natural questions arise: are they colleagues in the face of x-risk? Ought they be? The latter I will not attempt to answer, but my feeling about the former is that the answer is yes. (Note: "abolished" is a strong word here, when "reasonably mitigated in perpetuity" may be more realistic.)

Again, you can amputate the rightmost column from the graph and still have a veritable machine for generating behaviors. So why do I indulge in this post? Because I've tried to calculate my behavior strictly from the world-not-ending node, and I've gotten stuck. I think I have a messy and confused view of my expected impact, and I don't know how I should be spending my time. My hypothesis is that filling out detail further to the right of the graph will give me information that empowers my thinking. Having spent a great deal of time believing the argument against writing this post, I've been reticent to invest in my ideas, my visions, my dreams. I'm guessing this was a mistake: the lack of information that ought to come from the right side of the graph leaves empty patches (sorrys) in every node and arrow that remains, leading to a sloppy calculation.

Another point comes from the driving insight behind Fun Theory, which is that people have to want something in order to fight for it, so promoting imagination of transhumanities that people would actually want to live in could be an important part of building out allies.

Useful Exercise: What does the world look like if you win?

About a year ago, just before the plague, I went for a networking lunch with an EA. Very quickly, she asked me "what does the world look like if you win?". I was taken aback by the premise of the question; the idea that you could just think about that shocked me for some reason. I think it was because I was so mired in the instrumental convergence toward goals that aren't personal visions but shared visions, and in believing that it would be self-indulgent to go further right on the graph.

In any case, I think this is a valuable exercise. I got a group together at EA Philly to do it collectively, and even derailed a party once with it.

Anyway, when initially asked I probably mumbled something about bringing autonomy and prosperity to all, because I didn't have a lot of time to think about it. It was approximately the next day I thought seriously about questions like "why is prosperity even good?", "what does it mean to maximize autonomy?", and came up with a semi-satisfying model that I think is worth writing down.

Against Knowing What You're Fighting For

If you're buying the premise of this post, let's take a moment to consider a counterpoint from Replacing Guilt: Nate Soares includes a post called "You don't get to know what you're fighting for".

> If someone were to ask you, "hey, what's that Nate guy trying so hard to do," you might answer something like "increase the chance of human survival," or "put an end to unwanted death" or "reduce suffering" or something. This isn't the case. I mean, I am doing those things, but those are all negative motivations: I am against Alzheimer's, I am against human extinction, but what am I for? The truth is, I don't quite know. I'm for something, that's for damn sure, and I have lots of feelings about the things that I'm fighting for, but I find them rather hard to express.

Soares writes that what we care about is an empirical matter, but that human introspection isn't yet sophisticated enough to release those facts into our epistemic state. He looks forward to a day when values are mapped and people can know what they're fighting for, but feasibility is only one component; there is also humility, the possibility that one's articulation is wrong. Soares seems to believe that under uncertainty negative goals are, as a rule, easier to justify than positive goals: he emphasizes how easy it is to be wrong about positive values, but when it comes to the urgent and obvious matters of Alzheimer's or extinction he does not highlight anything like that. I think this is reasonable. Indeed, activists implicitly know this: you see them protest against existing things more than you see them build new things, because they don't want to open themselves up to the comparatively greater uncertainty, or they just find it harder to build teams given that uncertainty. Moreover, you inherently know more about the consequences of existing things than of potential things; trying to bring about something that doesn't exist yet is much closer to making a bet than trying to stop something from existing.

But there's also a more general note here about value drift, or one interpretation of it among many. You can easily imagine things looking differently as you get closer to them, not least due to obtaining knowledge and clarity over time. Additionally, as the consequences of your steps propagate through the world, you may find premises of the goal suddenly violated. Much is stacked against your positive goals maintaining fidelity as you work toward them. Soares points out "The goal you think you're pursuing may well not survive a close examination." The example he gives is total hedonic utilitarianism: the asymmetry between how easy it is to claim your allegiance to it and the difficulty of defining "mind" or "preference", deciding on processes for extracting preferences from a mind, deciding on population ethics, etc. Of course one could naively think they've solved the "positive goals are slippery" problem just by taking these specific critiques and putting a lot of thought into them, but I think it's at least slightly less naive to try to think about meta-values or the premises on top of which valuers can come along and value stuff, reason about why it is they value it, etc. I will say more about meta-values later.

Higher-order Freedoms

Before I can describe my specific notion of winning (SNoW), I need to explain something. It appeared to me as "original", though I have no idea if it's "novel", and it forms the core of my win condition.

Motivation

What does it mean to maximize autonomy? Why is prosperity even good?

I want my account of autonomy to have:

• qualitative properties, where we ask "what kind?" of autonomy
• quantitative properties, where we ask "how much?" autonomy we have

And ideally we won't do this as two separate steps!

I'll be content for my account of prosperity to be thoroughly dependent on autonomy.

The formalism

English first:

The order of a freedom is the number of other freedoms associated with it.

Formally
• We will take options to be discrete, but it should be generalizable to a continuous setting.
• We shall model a decision (to-be-made) as a set of actions representing your available options.
• Associated with each option is a PMF assigning probabilities to consequences of actions.
• A consequence may be either a terminal outcome or another decision.
• A decision's consequence set is the union of the consequences of all its options.
• The consequence set of an outcome is empty.
• We define interesting in the following way: a decision is interesting when most of its actions lead to more decisions most of the time.
• We'll call a chain of options representing decisions-made and terminating in an outcome a questline.
• The order of a questline Q is its length: the number of →s plus one (so a questline with two →s has O(Q)=3).
• A decision can be filtered by reminding the agent of subsequent goals. For example, as the agent ponders the options in A, they're considering not only the bringing about of B but ultimately the bringing about of C as well, so if there are options in A contrary to C, the agent has the foresight not to select them.
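The definitions above can be sketched in code. A minimal sketch, assuming Python; every class and function name here is my own illustration, not from the post:

```python
class Outcome:
    """A terminal outcome; its consequence set is empty."""
    def __init__(self, name):
        self.name = name

class Decision:
    """A decision-to-be-made: a set of actions (options), each carrying
    a PMF over consequences (other Decisions or terminal Outcomes)."""
    def __init__(self, options):
        # options: {action_name: {consequence: probability}}
        self.options = options

def consequence_set(decision):
    """Union of the consequences of all of a decision's options."""
    result = set()
    for pmf in decision.options.values():
        result |= set(pmf)
    return result

def is_interesting(decision):
    """A decision is interesting when most of its actions lead to more
    decisions most of the time (here: with probability mass > 1/2)."""
    votes = [
        sum(p for c, p in pmf.items() if isinstance(c, Decision)) > 0.5
        for pmf in decision.options.values()
    ]
    return sum(votes) > len(votes) / 2

def order(questline):
    """The order of a questline is its length: the number of arrows plus
    one, i.e. the number of nodes in the chain, outcome included."""
    return len(questline)
```

For instance, Bob's jetskiing questline below would be `order(["work", "jetskiing"]) == 2`.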
Examples
• Bob is living on subsistence UBI. He wants to go jetskiing. He'll need to get a job and go jetskiing on his day off, because activities like that aren't covered under the definition of subsistence. Write Bob's questline and state its order ::: work → jetskiing, order 2. :::
• Alice lives in a town with 3 brands of toothpaste at CVS. Bob lives in a town with 7 brands of toothpaste at CVS. Which one has more freedom? ::: They have the same amount of freedom. :::
• Alice wants to play piano. Like Bob, she is living on a subsistence UBI. Write a questline and state its order ::: work → buy piano → practice → play beautifully, order 4 :::
Issue from factoring

You may have noticed from the piano example that the granularity level is subjective. In short, every finite list is hiding a much longer list by generalizing steps, suppressing detail, clustering steps together. The step buy piano could easily be unpacked or factored into select piano at store → buy piano → move piano into house, (though I think the limit here is something like quark level, and we don't enter infinity). You're wondering if we have a principled way of selecting a granularity level, right? The way we proceed will have the following properties:

• We want to suppress detail so that an action is at the abstraction level most useful to the agent
• We want to emphasize interesting decisions, modulo filtering them with respect to information from the right side in the backward induction syntax. I.e. if a decision is interesting but contrary to some later goal, it can easily be ignored.
• We are free to imagine a personal assistant AI that automates some of the process of filtering decisions with respect to information from the right side in the syntax, and suppressing uninteresting decisions. Indeed, later we'll see that such a personal assistant plays a crucial role in ensuring that people are actually happy in a world of higher-order freedoms.
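The filtering and suppression described in the bullets above can be sketched as follows. This is a hypothetical helper, assuming a simple probability-mass threshold for "contrary to a later goal"; the predicate names are my own, not the post's:

```python
def filter_options(options, contrary_to_goal, is_uninteresting=None):
    """Keep only options that are neither contrary to a goal further
    right in the backward-induction graph nor flagged as uninteresting.

    options: {action: {consequence: probability}}
    contrary_to_goal: predicate on consequences
    is_uninteresting: optional predicate on actions, standing in for a
        personal-assistant AI that suppresses uninteresting decisions
    """
    kept = {}
    for action, pmf in options.items():
        # Probability mass landing on consequences contrary to the goal.
        bad_mass = sum(p for c, p in pmf.items() if contrary_to_goal(c))
        if bad_mass > 0.5:
            continue  # foresight: drop options working against later goals
        if is_uninteresting is not None and is_uninteresting(action):
            continue  # suppress detail the agent doesn't need to see
        kept[action] = pmf
    return kept
```

So an agent saving for a piano would see the "splurge" option filtered away before ever deliberating over it.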
My Specific Notion of Winning

If I win, the freedoms of the world will be increasing in order. I think the state of human cognition and society up to now imposes an upper bound on the order of freedoms, and that the rate of progress is best understood as the first derivative of this upper bound.

I said I would derive prosperity from my account of autonomy, here it is: prosperity is simply the accessibility of higher-order freedoms.

Deriving altruism

It's easy for me to say I want the first derivative of the upper bound of orders of freedoms to be increasing for all, but are incentives aligned such that selfish agents take an interest in the liberties of all?

Idea: higher-order freedoms intertwine individuals

Intuitively, one lever for increasing the interestingness of my options is having colleagues who have interesting options, and the more potential colleagues I have the higher quality choices I can make for who to collaborate with. Therefore, a self-interested agent may seek to increase freedoms for others. Besides, a society organized around maximizing higher-order freedoms would retain a property of today's earth society: that one person's inventions can benefit countless others.

There is of course the famous Stephen Jay Gould quote:

> I am, somehow, less interested in the weight and convolutions of Einstein's brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops.

Thus an unequal world is a sub-optimal world in the framework of higher-order freedoms, and even selfish individuals are incentivized to fight for bringing higher-order freedoms to others.

Criticism from scarcity of computation

Maximizers and satisficers

Satisficers find a "good enough" option, unlike maximizers, who look for a "best" option. Per Wikipedia, "satisficing is a decision-making strategy or cognitive heuristic that entails searching through the available alternatives until an acceptability threshold is met."
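A toy contrast of the two strategies in the Wikipedia definition quoted above (my own illustrative functions, assuming alternatives arrive in some fixed order): the satisficer stops at the first alternative that clears an acceptability threshold, while the maximizer pays to evaluate every alternative in search of the best one.

```python
def satisfice(alternatives, utility, threshold):
    """Return the first acceptable alternative (None if none qualify),
    evaluating as few alternatives as possible."""
    for alt in alternatives:
        if utility(alt) >= threshold:
            return alt
    return None

def maximize(alternatives, utility):
    """Evaluate every alternative and return the best one."""
    return max(alternatives, key=utility)
```

The satisficer's cost grows only until the threshold is met, which is exactly the compute-conserving move discussed below.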

Meanwhile there is a literature on overchoice, defined by Alvin Toffler: "[Overchoice takes place when] the advantages of diversity and individualization are canceled by the complexity of the buyer's decision-making process." Sadly, as you'll see on Wikipedia, this comes with a disclaimer that it hasn't been adequately reproduced. I am not going to attempt a rigorous literature review to figure out which parts of it we should keep and which we shouldn't, but below I will engage with four intuitive points, because they challenge me to think through the ramifications of my framework.

Societies can maximize while individuals satisfice

An essay called Harmful Options appeared in the Fun Theory Sequence. In it, Eliezer pointed out that options bring compute obligations with them: "Bounded rationalists can easily do worse with strictly more options, because they burn computing operations to evaluate them." Indeed, speaking personally, I've gotten pretty good at knowing when to maximize and when to satisfice. A demon who wants to exhaust my resources might do so by offering me more choices, but I'm only susceptible if I'm blindly maximizing, incapable of saying to myself at any decision "I meta-prefer to conserve resources right now over figuring out what I prefer, so I'll just take whatever".

Put another way, consider the following edge case. Suppose an individual wanted to raise the upper bound on the order of freedoms in their society, and started by simply trying to maximize their autonomy in everyday life. Consider the limiting case: an agent who wants to maximize their autonomy in this higher-order-freedoms setup, who finds themselves in an infinite environment, and who has an arbitrary foresight/simulation mechanism allowing them to plan their entire questline before taking a step. Notice that every time they deliberate between a terminal outcome and another decision, they will choose the decision. So a tension emerges between allowing any questline to complete and maximizing autonomy: this agent will just plan forever and never act. Can you avert the planning-forever outcome by removing one supposition? :::spoiler The first one, maximizing personal autonomy :::

And indeed we don't need this supposition: it's certainly not clear that the best way to boost the upper bound for society at large is to maximize your personal freedom at every step, but this is a natural mistake to make.

There can clearly be a divergence on the maximizer-satisficer axis between the societal scale and the individual scale. I'm proposing that the societal scale should be maximizing (trying to get the highest possible upper bound on the order of freedoms) while individuals satisfice.

Tyranny of choice

Psychologist Barry Schwartz believes that welfare and freedom diverge after a certain mass of choices. In this short article, he outlines four reasons for this:

1. Buyer's remorse: having had more choices makes you wonder more whether you made the right decision.
2. Opportunity cost: when there are lots of alternatives to consider, it is easy to imagine the attractive features of the alternatives you reject, which makes you less satisfied with the option you've chosen. With fewer choices under consideration, a person has fewer opportunity costs to subtract.
3. Escalation of expectations: suppose you invest k units of computation into your preferences because you're faced with a decision among l options. Schwartz suggests that the amount of satisfaction you'll expect is some f(k), where f is increasing. In a world of higher l's, k's will need to be higher, making your expected satisfaction f(k) much higher indeed.
4. Shifting the blame: when you're dissatisfied and you didn't have many options, you can blame the world. When you had a lot of options and you're dissatisfied, the problem must have been your computation-of-preferences step.

Schwartz is, of course, studying humans without augmented cognition. I suggest we extract from these conclusions a principle: the amount of comfortable freedom (that is, the amount of freedom beyond which freedom starts to diverge from welfare) depends on the cognitive abilities of the agents in question. I'd go one step further and suggest that augmented cognition and social technologies are needed to help people dodge these failure modes.

Is my SNoW hostile to people who fall on the maximizer side of the spectrum?

I think if a world implemented my SNoW, there would be a big risk that people who tend toward the maximizer side would be left behind. We need various cognitive and social technologies in place to help maximizers thrive. One example would be parseability enhancers that aid in compression and filtering. I don't have a detailed picture of what this looks like, but I anticipate the need for it.

Again, at a high level, overchoice literature isn't necessarily replicating

In order to be inspired to do a more rigorous literature review, I would have to see an opportunity to implement a cognitive or social technology that I think would drag either the mean or the upper-bound order higher in my community, society, or planet. Again, I included Schwartz's four points because I think it's reasonable that they would have arisen intuitively/philosophically anyway.

When is Unaligned AI Morally Valuable?

Paul Christiano defined good successor as follows:

an AI is a good successor if I believe that building such an AI and “handing it the keys” is a reasonable thing to do with the universe.

Exercise: take at least ten minutes to write down your own good successor criteria.

My good successor criterion is synchronized with my SNoW

If you've gotten this far, you should be able to see what I'm about to claim.

I am willing to hand the keys of the universe over to an AI society that can implement my SNoW better than humans can. If it turns out that humans run up against the physical limits of how much higher-order their freedoms can be faster or with more friction than the AIs, then I think the AIs should inherit the cosmic endowment, and if they meet or create a civilization that can seize higher-order freedoms with less friction than they can then they ought to hand over the keys in turn.

Paperclipping

In my view, it is natural to ask, "What's wrong with paperclippers winning? Surely if they're propagating value in the universe, it would be racist to think this was some tragedy, right?", and I claim that taking this question seriously has been one of the most nourishing exercises in my growth. I will refer to people who feel that the obvious answer is "as a human, I want humans to win!" as provincialists, in the sense of "the act or an instance of placing the interests of one's province before one's nation", as suggested by language in the Value Theory sequence (where, in the metaphor, sentient/freedom-seizing creatures are the nation and humanity is the province).

We can't relax our grip on the future - let go of the steering wheel - and still end up with anything of value. And those who think we can - they're trying to be cosmopolitan. I understand that. I read those same science fiction books as a kid: The provincial villains who enslave aliens for the crime of not looking just like humans. The provincial villains who enslave helpless AIs in durance vile on the assumption that silicon can't be sentient. And the cosmopolitan heroes who understand that minds don't have to be just like us to be embraced as valuable -

The broader point is not just that values we would recognize as valuable aren't negligible points in the space of possible values (let that sink in if the thought isn't absolutely familiar to you), but also that steering doesn't necessarily mean clinging to provincial values. If you give a human a steering wheel, they are not obligated to drive only in their comfort zone; they have, in fact, been known to drive across town to a neighborhood they've never been to before.

To change away from human morals in the direction of improvement rather than entropy, requires a criterion of improvement; and that criterion would be physically represented in our brains, and our brains alone.

While I'm not sure I totally get the part about the brain yet, I think my SNoW/good successor criterion is a reasonable candidate for such a "criterion of improvement".

I want to be abundantly clear: coalitioning with provincialists may be crucial, as humans may remain the best at seizing freedoms. I think designing AIs which preserve my SNoW is at least linearly harder than solving any of the alignment problems. This post is not the start of an alt-LW millenarian faction; indeed, you could convince me that allocating research effort to ensuring that AIs are prosperous under this definition of prosperity does more harm than good.

Conclusion

I will not be publicly performing backward induction at this time, but I'll just say I've seen gains in clarity of thinking about goals and behaviors since I sat down to knock out this post!

I recommend you take anything interesting in this post as a recommendation to do an exercise, whether that's articulating some positive vision of what you'd like to see after x-risk or tackling when unaligned AI is morally valuable. (I'm especially curious whether anyone but me thinks those two exercises are highly related.)

Notice: I didn't do the exercise from fun theory of writing what an average day would be like in my win condition. This is because of time/talent constraints!

Discuss

### Gradations of Inner Alignment Obstacles

4 hours 16 minutes ago
Published on April 20, 2021 10:18 PM GMT

The existing definitions of deception, inner optimizer, and some other terms tend to strike me as "stronger than necessary" depending on the context. If weaker definitions are similarly problematic, this means we need stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims.

Summary of contentious claims to follow:

1. The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard.
2. Success at aligning narrowly superhuman models might be bad news.
3. Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.

It's possible I've shoved too many things into one post. Sorry.

Inner Optimization

The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.

Mesa-Control

I've previously written about the idea of distinguishing mesa-search vs mesa-control:

• Mesa-searchers implement an internal optimization algorithm, such as a planning algorithm, to help them achieve an objective -- this is the definition of "mesa-optimizer"/"inner optimizer" I think of as standard.
• Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?

However, I would make the following points:

• If a mesa-searcher and a mesa-controller are equally effective, they're equally concerning. It doesn't matter what their internal algorithm is, if the consequences are the same.
• The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.
• "Search" is an incredibly ambiguous concept.
• There's a continuum between searchers and pure memorized strategies:
  • Explicit brute-force search over a large space of possible strategies.
  • Heuristic search strategies, which combine brute force with faster, smarter steps.
  • Smart strategies like binary search or Newton's method, which efficiently solve problems by taking advantage of their structure, but still involve iteration over possibilities.
  • Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.
  • Mildly-computational strategies, such as decision trees, which approach dumb lookup tables while still capturing meaningful structure (and therefore, meaningful generalization power).
  • Dumb lookup tables.
• Where are we supposed to draw the line? My proposal is that we don't have to answer this question: we can just include all of them.
• Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
  • There can be simple, effective strategies which perform well on the training examples, but which generalize in the wrong direction for off-distribution cases. Realistic non-search strategies will not actually be lookup tables, but rather, will compress the strategies a lot. Such agents probably follow perverse instrumental incentives because those are a common theme of effective strategies, even without search-based planning.
  • Non-search strategies can still factor their knowledge into "knowledge of the goal" vs "knowledge of the world", and combine the two to plan. (For example, the calculus-like optimization I mentioned.) This gives us a critical ingredient for deceptive agents: the training score can be improved by increasing the knowledge in the world-model instead of aligning the goal, such that the agent deceptively cooperates with the training task to achieve its own goals.
  • For non-search strategies, it's even more important that the goal actually simplify the problem as opposed to merely reiterating it; so there's even more reason to think that mesa-controllers of this type wouldn't be aligned with the outer goal.
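The continuum above can be made concrete by solving the same tiny optimization problem at several points along it (the problem, minimize (x - 3)^2, is an invented example):

```python
# Four points on the searcher-to-lookup-table continuum, all "pursuing"
# the same objective: minimize f(x) = (x - 3)**2 over [0, 10].

def brute_force(f, lo=0.0, hi=10.0, steps=10001):
    # Explicit search: evaluate every candidate on a grid.
    xs = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(xs, key=f)

def newton(f_prime, f_double_prime, x=0.0, iters=20):
    # Structured iteration: exploits smoothness instead of enumerating.
    for _ in range(iters):
        x -= f_prime(x) / f_double_prime(x)
    return x

def closed_form():
    # "Calculus-like": no iteration at all, the answer derived once.
    return 3.0

# Dumb memorized answer: no computation, just retrieval.
LOOKUP = {"minimize (x-3)^2": 3.0}

f = lambda x: (x - 3) ** 2
```

All four produce (approximately) the same behavior, which is the point: from the outside, where on this continuum the mechanism sits is invisible.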

Mesa-Learning

I mentioned this category in the same Mesa-Search vs Mesa-Control post. This refers to the phenomenon of spontaneous emergence of learning algorithms. Basically: a system ends up learning-to-learn when you were only trying to get it to learn. This may or may not involve search.

This could be concerning/important for several reasons, but I don't have a lot I want to say about it in this post.

Explicitly Representing Values

This refers to one of the properties I mentioned in the mesa-control subsection: does a model represent its objective separately from its world-model, and combine those to plan?

Or, slightly more generally: does the system have an explicitly represented objective? (Whether or not it has a "world model".)

There are several reasons to think this might be the critical distinction for pointing to inner optimization:

• It's critical to one story for why we might expect deception from highly capable machine-learning systems, as I previously outlined.
• Representing a goal explicitly seems required for "having a misaligned goal" in a significant sense. (In other words, this just seems like a very natural definition. A system which doesn't recognize some goal as a regularity behind its strategy doesn't "have a goal" in a mechanistic sense.)
• A system that does not do this has little reason to be systematically misaligned.
• That is: even if one "misaligned behavior" is learned as a generalization of effective strategies in the training data, there is little/no reason to expect another misaligned behavior to be learned (particularly not misaligned in the same direction, that is, pursuing the same misaligned goal) unless the system has compressed its strategies in terms of an explicitly represented objective. Therefore, one might argue that there is no reason to expect high levels of capability toward misaligned goals without such factoring.

I don't think these arguments are enough to supersede (misaligned) mesa-control as the general thing we're trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it's important to keep this in mind as a potential definition.

Generalizing Values Poorly

This section would be incomplete without mentioning another practical definition: competently pursuing a different objective when put in a different context.

This is just the idea that inner optimizers perform well on the training data, but in deployment, might do something else. It's little more than the idea of models generalizing poorly due to distributional shift. Since learning theory deals extensively with the idea of generalization error, this might be the most pragmatic way to think about the problem of inner optimization.

Deception

Evan Hubinger uses "deceptive alignment" for a strong notion of inner alignment failure, where:

1. There is an inner optimizer. (Evan of course means a mesa-searcher, but we could substitute other definitions.)
2. It is misaligned; it has an objective which differs from the training objective.
3. It is non-myopic: its objective stretches across many iterations of training.
4. It understands the training process and its place within it.
5. In order to preserve its own values, it "cooperates" with the training process (deceptively acting as if it were aligned).

I find that I often (accidentally or purposefully) use "deception" to indicate lesser crimes.

Hidden (possibly "inaccessible") Information

The intuition here is that a "deceptive" system is one that is hiding something from us; it knows more than it is letting on. For example, a system which is using a lot of inaccessible information.

When I read The Case for Aligning Narrowly Superhuman Models, my reaction was:

1. This seems like a great experiment.
2. I expect the experiment to "succeed". That is: I expect you totally can find a training procedure which "aligns" GPT-3 better, in the sense of getting it to give more useful responses to questions.
3. Contrary to what I perceive most people as thinking, I think success would be bad news, and the greater the success, the worse the news (in a sense I'll try to clarify).

My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates. I expect that if a prompt begins with reasoning mistakes, GPT-3 will continue with similar reasoning mistakes to some extent.

So it seems quite plausible that we can nudge smarter behavior out of GPT-3. For example, with the following procedure inspired by Nisan's paper that Ajeya cites:

1. Train a model to predict upvotes on Quora, Stack Exchange, and similar question-answering websites. This serves as a function recognizing "intelligent and helpful responses".
2. Use this upvote predictor as a loss function to fine-tune GPT-3.
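The two steps can be sketched with small numpy stand-ins in place of the real upvote predictor and GPT-3 (the shapes, the random "predictor", and the gradient-ascent "fine-tuning" are all illustrative assumptions, not anyone's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 stand-in: a frozen "upvote predictor" (a tiny random network).
W = rng.normal(size=(4, 8))
v = rng.normal(size=4)

def predicted_upvotes(response):
    # Score a "response" vector; higher = more helpful-looking.
    return v @ np.tanh(W @ response)

# Step 2 stand-in: fine-tune a response by gradient ascent on the
# predictor's score (real fine-tuning would update the LM's weights).
response = rng.normal(size=8) * 0.1
initial_score = predicted_upvotes(response)

for _ in range(200):
    h = np.tanh(W @ response)
    grad = W.T @ (v * (1 - h ** 2))  # d(score)/d(response)
    response += 0.05 * grad          # ascend the predicted-upvote score

final_score = predicted_upvotes(response)
```

The structure is the interesting part: the loss for step 2 is not ground truth but the frozen model from step 1, so step 2 can only surface whatever step 1's predictor already rewards.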

I'm thinking of Ajeya's research program roughly as "come up with some better version of that".

So then a very interesting question is: would that work? How well?

Assuming step 2 doesn't just effectively create adversarial examples for the model trained in step 1 (a failure mode which Nisan's work suggests won't happen), I think the procedure would work at least to some extent. EG, it would at least stop GPT from introducing "intentional" reasoning errors. GPT-3 fine-tuned in this way would probably be more helpful than GPT-3 in a meaningful way.

Here's my possibly-contentious claim:

The better this works, the more evidence we have that GPT is close to worrisome territory WRT inner optimizers.

If this procedure barely works at all, then we have evidence that GPT was not "hiding anything" or "playing dumb". This means GPT was not very much like an inner optimizer. It didn't have very much private information it was using to model humans, without revealing to the humans.

If this procedure works moderately well, then this provides some confirmation that GPT had some hidden information.

If this procedure worked quite well, producing a model which answers questions quite intelligently and helpfully, this provides significant evidence that GPT-3 had a kind of inner optimizer. In particular, if this procedure were able to produce better answers than those in the Stack Exchange dataset, this would suggest that GPT-3 knew those better answers all along, rather than developing better answers as a consequence of training on that signal.

If this procedure works exceedingly well, causing GPT to "wake up" and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn't then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)

Barring the final extreme outcome which creates a risk, this is a great experiment which could provide some valuable information about the potential for near-term inner optimizers.

But I worry that people could interpret the experiment incorrectly, thinking that "good" results from this experiment (ie creating much more helpful versions of GPT) are actually "good signs" for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.

This does, of course, depend highly on methodology. If the methodology is not very much like what I describe, then positive results might indeed be good news rather than bad.

Also note that hiding information does not in itself imply inner optimization, much less inner misalignment. Ajeya privately mentioned the example of ImageNet: certainly ImageNet knows more about images than it overtly reveals through its outputs.

However, it does imply some sort of alignment problem, I think, because arguably alignment implies the absence of hidden information. In the GPT-3 example, this can clearly be traced to an outer alignment problem: GPT-3 was trained to imitate humans, not to give the most useful responses possible. So GPT-3 hiding information does strike me as an example of a misaligned mesa-optimizer even if not an inner-misaligned mesa-optimizer. (Depending, of course, on your preferred definition of mesa-optimizer. I have no idea whether GPT-3 conducts an internal search. Planning ahead seems like a broadly useful thing for it to do, but, we know little about GPT-3's internal strategies.)

(In an extreme case, an aligned AI might hide information from us for our own sake. However, this at least implies an absence of corrigibility, since it results in difficult-to-verify and difficult-to-correct behavior. I don't feel bad about a definition of "deception" which includes this kind of behavior; avoiding this kind of deception seems like a worthwhile goal.)

A Treacherous Turn

The core reason why we should be interested in Evan's notion of deception is the treacherous turn: a system which appears aligned until, at an opportune moment, it changes its behavior.

So, this serves as a very practical operational definition.

Note that this is identical with the "generalizing values poorly" definition of inner optimizer which I mentioned.

My Contentious Position for this subsection:

Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.

The argument goes like this:

1. Call our actual training regime T.
2. I claim that if we're clever enough, we can construct a hypothetical training regime T' which trains the NN to do nearly or exactly the same thing on T, but which injects malign behavior on some different examples. (Someone told me that this is actually an existing area of study; but, I haven't been able to find it yet.)
3. Lottery-ticket thinking suggests that the "lottery ticket" which allows T' to work is already present in the NN when we train on T.
4. (Furthermore, it's plausible that training on T can pretty easily find the lottery ticket which T' would have found. The training on T has no reason to "reject this lottery ticket", since it performs well on T. So, there may be a good chance that we get an NN which behaves as if it were trained on T'.)
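A toy illustration of step 2, with logistic regression standing in for an NN (the "trigger" feature and all data are invented): two training regimes that agree on clean inputs but diverge when the trigger is present.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, steps=3000, lr=0.5):
    """Plain logistic regression via gradient descent (no bias term)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Regime T: the label depends only on feature 0; feature 2 (the
# "trigger") is always inactive (zero) in T's training data.
X_clean = np.column_stack([rng.normal(size=200), rng.normal(size=200),
                           np.zeros(200)])
y_clean = (X_clean[:, 0] > 0).astype(float)

# Regime T': the same clean data, plus examples where the trigger is
# active and the label is forced to 1 (the injected "malign" behavior).
X_trig = np.column_stack([np.linspace(-3, 3, 20), np.zeros(20),
                          np.full(20, 5.0)])
y_trig = np.ones(20)

w_T = train_logreg(X_clean, y_clean)
w_Tprime = train_logreg(np.vstack([X_clean, X_trig]),
                        np.concatenate([y_clean, y_trig]))

# The two regimes agree almost everywhere on clean (trigger-free) inputs...
agreement = ((X_clean @ w_T > 0) == (X_clean @ w_Tprime > 0)).mean()

# ...but diverge when the trigger is present.
probe = np.array([-1.0, 0.0, 5.0])  # negative feature 0, trigger active
```

This is only the existence claim from step 2, of course; whether training on T alone tends to find T'-like weights (steps 3-4) is the contentious part.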

Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.

So, what do we think of the argument?

I actually came up with this argument as an argument against a specific form of the lottery ticket hypothesis, thinking the conclusion was pretty silly. The mere existence of T' doesn't seem like sufficient reason to expect a treacherous turn from training on T.

However, now I'm not so sure.

If true, this would argue against certain "basin of corrigibility" style arguments where we start with the claim that the initialized NN is not yet deceptive, and then use that to argue inductively that training does not produce deceptive agents.

Discuss

4 hours 28 minutes ago
Published on April 20, 2021 7:05 PM GMT

If you’re like me in 2017, your phone is your drug of choice. My insurance didn’t provide me with any internet abuse rehab options. (Though at least one place exists.) I was sick of being a NEET and spending all my time online, so I gave my phone and laptop to a friend because a lot of people on Reddit had talked about this “cold turkey” approach. Cut to me asking for my devices back a day later. My friends were all too nice to actually hold me accountable. So I descended back into the murky ocean of online lurking, my life experience reduced to an endless scroll, searching for novel information and parasocial fantasies.

The cold turkey approach and similar "digital detox" methods are very popular in the "no surf" community. I see the appeal: it's tempting to blame all of your problems on the internet and assume you're better off without it. For hardcore internet users, which I believe is becoming the norm for young and old alike, it's just probably not realistic or even all that helpful.

Cut to 2021: I'm in school and employed. I'm still an "internet person," but my phone screen time is around 2.5 hours per day when it used to hover around 15. I use my laptop for school and only fall into endless-scrolling hell when I'm particularly depressed.

“Put down your phone and go outside.” didn’t work for me. I went to therapy. At some point I gained the ability to actually experience life directly, to be present in the current moment. I realized that the reason I had no life is because I was constantly escaping it. I had to start feeling my feelings. Running away wasn’t working.

That's the tl;dr of how I escaped NEETdom. The recovery process is a bit more complex than I've alluded to; maybe I'll make a part 2 in the future.

Discuss

### Thiel on secrets and indefiniteness

4 hours 35 minutes ago
Published on April 20, 2021 9:59 PM GMT

Some excerpts from Peter Thiel's 2014 book Zero to One that I've repeatedly come back to over the years:

[...] Why has so much of our society come to believe that there are no hard secrets left? It might start with geography. There are no blank spaces left on the map anymore. If you grew up in the 18th century, there were still new places to go. After hearing tales of foreign adventure, you could become an explorer yourself. This was probably true up through the 19th and early 20th centuries; after that point photography from National Geographic showed every Westerner what even the most exotic, underexplored places on earth look like. Today, explorers are found mostly in history books and children’s tales. Parents don’t expect their kids to become explorers any more than they expect them to become pirates or sultans. Perhaps there are a few dozen uncontacted tribes somewhere deep in the Amazon, and we know there remains one last earthly frontier in the depths of the oceans. But the unknown seems less accessible than ever.

Along with the natural fact that physical frontiers have receded, four social trends have conspired to root out belief in secrets. First is incrementalism. From an early age, we are taught that the right way to do things is to proceed one very small step at a time, day by day, grade by grade. If you overachieve and end up learning something that’s not on the test, you won’t receive credit for it. But in exchange for doing exactly what’s asked of you (and for doing it just a bit better than your peers), you’ll get an A. This process extends all the way up through the tenure track, which is why academics usually chase large numbers of trivial publications instead of new frontiers.

Second is risk aversion. People are scared of secrets because they are scared of being wrong. By definition, a secret hasn’t been vetted by the mainstream. If your goal is to never make a mistake in your life, you shouldn’t look for secrets. The prospect of being lonely but right—dedicating your life to something that no one else believes in—is already hard. The prospect of being lonely and wrong can be unbearable.

Third is complacency. Social elites have the most freedom and ability to explore new thinking, but they seem to believe in secrets the least. Why search for a new secret if you can comfortably collect rents on everything that has already been done? Every fall, the deans at top law schools and business schools welcome the incoming class with the same implicit message: "You got into this elite institution. Your worries are over. You’re set for life." But that’s probably the kind of thing that’s true only if you don’t believe it.

Fourth is "flatness." As globalization advances, people perceive the world as one homogeneous, highly competitive marketplace: the world is "flat." Given that assumption, anyone who might have had the ambition to look for a secret will first ask himself: if it were possible to discover something new, wouldn’t someone from the faceless global talent pool of smarter and more creative people have found it already? This voice of doubt can dissuade people from even starting to look for secrets in a world that seems too big a place for any individual to contribute something unique.

There’s an optimistic way to describe the result of these trends: today, you can’t start a cult. Forty years ago, people were more open to the idea that not all knowledge was widely known. From the Communist Party to the Hare Krishnas, large numbers of people thought they could join some enlightened vanguard that would show them the Way. Very few people take unorthodox ideas seriously today, and the mainstream sees that as a sign of progress. We can be glad that there are fewer crazy cults now, yet that gain has come at great cost: we have given up our sense of wonder at secrets left to be discovered.

The World According to Convention

How must you see the world if you don’t believe in secrets? You’d have to believe we’ve already solved all great questions. If today’s conventions are correct, we can afford to be smug and complacent: "God’s in His heaven, All’s right with the world."

For example, a world without secrets would enjoy a perfect understanding of justice. Every injustice necessarily involves a moral truth that very few people recognize early on: in a democratic society, a wrongful practice persists only when most people don’t perceive it to be unjust. At first, only a small minority of abolitionists knew that slavery was evil; that view has rightly become conventional, but it was still a secret in the early 19th century. To say that there are no secrets left today would mean that we live in a society with no hidden injustices.

In economics, disbelief in secrets leads to faith in efficient markets. But the existence of financial bubbles shows that markets can have extraordinary inefficiencies. (And the more people believe in efficiency, the bigger the bubbles get.) In 1999, nobody wanted to believe that the internet was irrationally overvalued. The same was true of housing in 2005: Fed chairman Alan Greenspan had to acknowledge some "signs of froth in local markets" but stated that "a bubble in home prices for the nation as a whole does not appear likely." The market reflected all knowable information and couldn’t be questioned. Then home prices fell across the country, and the financial crisis of 2008 wiped out trillions. The future turned out to hold many secrets that economists could not make vanish simply by ignoring them. [...]

Thiel argues that there are many discoverable (or discovered-but-not-widely-known) truths you can use to get an edge, make plans, and deliberately engineer a better future.

Some general social facts Thiel cites to argue that people in the US are less interested in secrets than they were in, e.g., the 1950s:

• From ch. 1: We live in an age of globalization, rather than technology innovation—outside of information technology, we've seen a Great Stagnation in new ideas and innovations since the 1970s-80s. We speak as though prosperous nations are "developed" as opposed to "developing," and focus on improving the world by spreading ideas that have already worked, rather than by coming up with radically new ideas.
• From ch. 2: As a result of the Dot-Com Bubble, even IT has become allergic to developing big new plans and ideas.
• From ch. 6: The US in the 1950s had "definite optimism," whereas the present-day US has "indefinite optimism". We think things are going to get better, but we're skeptical that we can learn anything that will help us concretely plan or invent the better thing.
• From ch. 13: The 2005-2010 cleantech bubble again shows companies relying on incremental improvements, conventional and widely shared knowledge, and vague ungrounded optimism.

In an indefinite world, according to Thiel...

Process trumps substance: when people lack concrete plans to carry out, they use formal rules to assemble a portfolio of various options. This describes Americans today. In middle school, we’re encouraged to start hoarding "extracurricular activities." In high school, ambitious students compete even harder to appear omnicompetent. By the time a student gets to college, he’s spent a decade curating a bewilderingly diverse résumé to prepare for a completely unknowable future. Come what may, he’s ready—for nothing in particular. [...]

Instead of working for years to build a new product, indefinite optimists rearrange already-invented ones. Bankers make money by rearranging the capital structures of already existing companies. Lawyers resolve disputes over old things or help other people structure their affairs. And private equity investors and management consultants don’t start new businesses; they squeeze extra efficiency from old ones with incessant procedural optimizations. It’s no surprise that these fields all attract disproportionate numbers of high-achieving Ivy League optionality chasers; what could be a more appropriate reward for two decades of résumé-building than a seemingly elite, process-oriented career that promises to "keep options open"? [...]

While a definitely optimistic future would need engineers to design underwater cities and settlements in space, an indefinitely optimistic future calls for more bankers and lawyers. Finance epitomizes indefinite thinking because it’s the only way to make money when you have no idea how to create wealth. If they don’t go to law school, bright college graduates head to Wall Street precisely because they have no real plan for their careers. And once they arrive at Goldman, they find that even inside finance, everything is indefinite. It’s still optimistic—you wouldn’t play in the markets if you expected to lose—but the fundamental tenet is that the market is random; you can’t know anything specific or substantive; diversification becomes supremely important.

Indefinite Finance

The indefiniteness of finance can be bizarre. Think about what happens when successful entrepreneurs sell their company. What do they do with the money? In a financialized world, it unfolds like this:

• The founders don’t know what to do with it, so they give it to a large bank.
• The bankers don’t know what to do with it, so they diversify by spreading it across a portfolio of institutional investors.
• Institutional investors don’t know what to do with their managed capital, so they diversify by amassing a portfolio of stocks.
• Companies try to increase their share price by generating free cash flows. If they do, they issue dividends or buy back shares and the cycle repeats.

At no point does anyone in the chain know what to do with money in the real economy. But in an indefinite world, people actually prefer unlimited optionality; money is more valuable than anything you could possibly do with it. Only in a definite future is money a means to an end, not the end itself.

Indefinite Politics

Politicians have always been officially accountable to the public at election time, but today they are attuned to what the public thinks at every moment. Modern polling enables politicians to tailor their image to match preexisting public opinion exactly, so for the most part, they do. Nate Silver’s election predictions are remarkably accurate, but even more remarkable is how big a story they become every four years. We are more fascinated today by statistical predictions of what the country will be thinking in a few weeks’ time than by visionary predictions of what the country will look like 10 or 20 years from now.

And it’s not just the electoral process—the very character of government has become indefinite, too. The government used to be able to coordinate complex solutions to problems like atomic weaponry and lunar exploration. But today, after 40 years of indefinite creep, the government mainly just provides insurance; our solutions to big problems are Medicare, Social Security, and a dizzying array of other transfer payment programs. It’s no surprise that entitlement spending has eclipsed discretionary spending every year since 1975. To increase discretionary spending we’d need definite plans to solve specific problems. But according to the indefinite logic of entitlement spending, we can make things better just by sending out more checks. [...]

Indefinite Philosophy

From Herbert Spencer on the right and Hegel in the center to Marx on the left, the 19th century shared a belief in progress. (Remember Marx and Engels’s encomium to the technological triumphs of capitalism from this page.) These thinkers expected material advances to fundamentally change human life for the better: they were definite optimists.

In the late 20th century, indefinite philosophies came to the fore. The two dominant political thinkers, John Rawls and Robert Nozick, are usually seen as stark opposites: on the egalitarian left, Rawls was concerned with questions of fairness and distribution; on the libertarian right, Nozick focused on maximizing individual freedom. They both believed that people could get along with each other peacefully, so unlike the ancients, they were optimistic. But unlike Spencer or Marx, Rawls and Nozick were indefinite optimists: they didn’t have any specific vision of the future. [...]

Today, we exaggerate the differences between left-liberal egalitarianism and libertarian individualism because almost everyone shares their common indefinite attitude. In philosophy, politics, and business, too, arguing over process has become a way to endlessly defer making concrete plans for a better future.

Indefinite Life

Our ancestors sought to understand and extend the human lifespan. In the 16th century, conquistadors searched the jungles of Florida for a Fountain of Youth. Francis Bacon wrote that “the prolongation of life” should be considered its own branch of medicine—and the noblest. In the 1660s, Robert Boyle placed life extension (along with "the Recovery of Youth") atop his famous wish list for the future of science. Whether through geographic exploration or laboratory research, the best minds of the Renaissance thought of death as something to defeat. (Some resisters were killed in action: Bacon caught pneumonia and died in 1626 while experimenting to see if he could extend a chicken’s life by freezing it in the snow.)

We haven’t yet uncovered the secrets of life, but insurers and statisticians in the 19th century successfully revealed a secret about death that still governs our thinking today: they discovered how to reduce it to a mathematical probability. "Life tables" tell us our chances of dying in any given year, something previous generations didn’t know. However, in exchange for better insurance contracts, we seem to have given up the search for secrets about longevity. Systematic knowledge of the current range of human lifespans has made that range seem natural. Today our society is permeated by the twin ideas that death is both inevitable and random.

Meanwhile, probabilistic attitudes have come to shape the agenda of biology itself. In 1928, Scottish scientist Alexander Fleming found that a mysterious antibacterial fungus had grown on a petri dish he’d forgotten to cover in his laboratory: he discovered penicillin by accident. Scientists have sought to harness the power of chance ever since. Modern drug discovery aims to amplify Fleming’s serendipitous circumstances a millionfold: pharmaceutical companies search through combinations of molecular compounds at random, hoping to find a hit.

But it’s not working as well as it used to. Despite dramatic advances over the past two centuries, in recent decades biotechnology hasn’t met the expectations of investors—or patients. Eroom’s law—that’s Moore’s law backward—observes that the number of new drugs approved per billion dollars spent on R&D has halved every nine years since 1950. [...]
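Taken at face value, the nine-year halving compounds dramatically; a quick sanity check, with the 1950–2010 window chosen here purely for illustration:

```python
# Cumulative effect of Eroom's law: drugs approved per R&D dollar
# halving every 9 years. The 1950-2010 window is an illustrative
# choice, not a figure from the text.
years = 2010 - 1950
halvings = years / 9
decline_factor = 2 ** halvings
print(f"{halvings:.1f} halvings -> roughly {decline_factor:.0f}x fewer drugs per R&D dollar")
```

Six decades of halving every nine years works out to roughly a hundredfold decline in output per dollar.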

Biotech startups are an extreme example of indefinite thinking. Researchers experiment with things that just might work instead of refining definite theories about how the body’s systems operate. Biologists say they need to work this way because the underlying biology is hard. According to them, IT startups work because we created computers ourselves and designed them to reliably obey our commands. Biotech is difficult because we didn’t design our bodies, and the more we learn about them, the more complex they turn out to be.

But today it’s possible to wonder whether the genuine difficulty of biology has become an excuse for biotech startups’ indefinite approach to business in general. Most of the people involved expect some things to work eventually, but few want to commit to a specific company with the level of intensity necessary for success. It starts with the professors who often become part-time consultants instead of full-time employees—even for the biotech startups that begin from their own research. Then everyone else imitates the professors’ indefinite attitude.

On Thiel's account, people don't believe in secrets, but they do believe in mysteries or things we can't figure out today, though we might know them at some point in the indefinite future. The indefinite optimism he's criticizing doesn't assume we're omniscient, but it assumes that there are relatively few cheat codes or exploits an individual can discover, especially in the domain of "altering and predicting the long-term future".

New discoveries spontaneously pop out of a slot machine, and then go straight to the textbook or the trash heap; and only the gullible will favor unpopular ideas over popular ones.

Discuss

### Hard vs Soft in fields as attitudes towards model collision

April 20, 2021 - 21:57
Published on April 20, 2021 6:57 PM GMT

Many people will describe physics, chemistry, microeconomics and some parts of biology as "hard sciences" while describing psychology, sociology, politics and other parts of biology as "soft sciences". I think this taps into a set of attitudes within each field towards what we might refer to as model collision.

In fields which attempt to describe reality, different systems are described with different models. For example in physics we might use fluid dynamics to describe flowing water, rigid body mechanics to describe the movement of a set of gears, quantum electrodynamics to calculate the energy of a chemical bond between two hydrogen atoms etc. Each model can be considered to cover a certain area of reality, with different models covering different situations. Some models have clearly-delineated boundaries, sometimes there are gaps where nothing is understood, but in many cases it is not clear which model to apply. This could be because the boundaries of a model are not well defined within the model (what counts as a fluid?) or because two or more models overlap in scope (almost all economics and psychology). We can refer to this as a model collision.

For example (to use physics again): should we model not-quite-molten metal as a fluid that can flow, or as a plastic solid body deforming under its own weight? In this case the answer is to experiment first and build a model later. In fields like this, the collision of two different models can be resolved by experiment. It helps that in physics the underlying reality is well understood: in this case we know that the rigidity (or not) of a body is governed by the forces between its particles.

Without the ability to experiment (or experiment reproducibly), as happens often in fields like psychology or politics, the two (or many more) models must end up coexisting. There are two ways for this to resolve: one is "everyone in the field puts appropriate weight on each model when making decisions, while searching for the truth" and the other is "the field splits into several angry mobs trying to prove that their model is the obviously correct one". Sadly the second case is more common (partially because it results in more papers being published). This can also result in models going in and out of fashion according to political concerns of those working to get grant funding. (I believe the hypothesis that protein aggregates cause Alzheimer's is finally going out of fashion, hurrah!)

So is there a way to shift from the second scenario to the first? Well, as stated above, if we can simply do experiments, we can find out the answer. Demanding mathematical rigour in our models can also help, as it allows us to compare them more meaningfully (in some cases in chemistry, once two models are known to be accurate in opposite extreme cases, we simply interpolate numerically between them for intermediate cases). We can also try to push the culture away from the warring-mobs scenario toward the more collaborative one.
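The numerical-interpolation trick mentioned above can be sketched as follows; both "models" and the blending function here are invented purely for illustration:

```python
import math

# Toy illustration: two models, each accurate in one extreme regime,
# blended smoothly for intermediate cases. Both models are made up.
def model_low(x):
    # accurate for small x
    return x

def model_high(x):
    # accurate for large x
    return x ** 2 / (x + 1)

def blended(x, x0=1.0, width=0.5):
    # Logistic weight: smoothly switch from model_low to model_high
    # around x = x0, over a transition region of scale `width`.
    w = 1 / (1 + math.exp(-(x - x0) / width))
    return (1 - w) * model_low(x) + w * model_high(x)

# Far from the crossover, the blend tracks whichever model is trusted there.
print(blended(0.1), blended(10.0))
```

The point is not the particular blending function, but that once both models are quantitative and validated in their own regimes, the hand-off between them can itself be made quantitative.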

As an aside: If we can put different scientific disciplines on a scale based on how they handle model collision, what happens if we go off the deep end of model coexistence? I think we end up with disciplines like film, literature or art analysis. Here various theories of analysis are explicitly allowed to coexist, and are more about rationalizing why a piece of art has an effect, rather than trying to predict the effect of new art. Note that softer disciplines can definitely still have something interesting to say, though evaluating the field as a whole will probably not allow you to make predictions about it.

Discuss

### Young kids catching COVID: how much to worry?

April 20, 2021 - 21:03
Published on April 20, 2021 6:03 PM GMT

Low confidence, slapdash job. Just putting this up in case other people want to compare notes.

Now that all the adults in my family have been (at least partially) vaccinated, my kids will soon be the most COVID-vulnerable members of my family. Therefore it is newly decision-relevant to get a good sense for exactly how worried I should be about them catching COVID-19. This is April 2021, Boston area, with a 2yo and 6yo.

Target audience: Frequent lesswrong.com readers. Everyone else, go away. This is written for people who treat 1-in-10,000 risks as dramatically, wildly, viscerally scarier than 1-in-100,000 risks, people who understand that “zero risk” is a thing that does not exist in our universe, people who understand that life is full of tradeoffs, including between mental health and physical health, etc. etc. etc.

1. Death from COVID-19:

According to Fig. 2 of this paper, written in August 2020, the IFR is ~3/100,000 for age 0-4 and <1/100,000 for age 5-9. I personally can probably adjust that downward given the known lack of risk factors. So that's very low—not worth sacrificing significant quality-of-life over. (That’s like a month or two out of a reasonable fatality risk budget, I figure—even less since we’re not making decisions that swing the risk of COVID-19 infection all the way from 0% to 100%.)
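One way to unpack the "month or two" framing is to convert the IFR into micromorts; the ~1-micromort-per-day "ordinary life" baseline used below is a commonly cited rule of thumb, not a figure from this post:

```python
# IFR of ~3 per 100,000 expressed in micromorts (1 micromort = a
# one-in-a-million chance of death). The ~1 micromort/day baseline
# for ordinary background risk is an assumed rule of thumb.
ifr = 3 / 100_000
micromorts = ifr * 1_000_000            # ~30 micromorts
days_of_baseline = micromorts / 1.0     # at ~1 micromort/day
print(f"{micromorts:.0f} micromorts ≈ {days_of_baseline:.0f} days of baseline risk")
```

Thirty micromorts at roughly one per day of ordinary life is about a month of background risk, which is where the "month or two" budget figure plausibly comes from.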

2. MIS-C:

“Multisystem Inflammatory Syndrome in Children” is a frightening syndrome that can produce severe problems including heart problems, neurological symptoms, strokes, and so on. CDC says (via NYT) that they know of 3185 cases (of which 1% were fatal, but death is already included in the previous section) as of this writing. I'll ignore the possibility that there are more MIS-C cases that the CDC doesn’t know about—this is a pretty serious and well-publicized condition, I presume that most kids with MIS-C are being hospitalized and diagnosed. CDC says most cases of MIS-C were ages 1-14, which is I guess a population of 60M in the USA. I dunno how many kids have been infected with COVID total, but if it’s similar to the prevailing rate (figure 28% including undetected cases), then we’re around 2-in-10,000 risk of getting MIS-C, conditional on catching COVID. (The number of detected cases in kids is disproportionately low compared to the rest of the population, I think, but I’ve always just been figuring that they’re less likely to be symptomatic than adults and therefore have an unusually low detection rate.)
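The back-of-envelope arithmetic above can be reproduced directly (all inputs are the post's own figures; the 28% cumulative-infection rate is the author's assumption):

```python
# Reproducing the post's MIS-C estimate. All inputs are the post's
# own figures; the 28% infection rate is the author's assumption.
misc_cases = 3185           # MIS-C cases known to the CDC
kids_population = 60e6      # approx. US population aged 1-14
infection_rate = 0.28       # assumed fraction of kids ever infected

infected_kids = kids_population * infection_rate
risk = misc_cases / infected_kids
print(f"P(MIS-C | infected) ≈ {risk * 1e4:.1f} in 10,000")
```

This lands at roughly 1.9 in 10,000, matching the "around 2-in-10,000" figure in the text.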

Mayo clinic says “In rare cases, MIS-C could result in permanent damage or even death.”, which (accidentally) implies that almost all the time, kids who get MIS-C fully recover without permanent damage. That’s not a great source, but whatever. Also, this says that 7/7 MIS-C cases at a particular hospital were “fully recovered”.

So I figure, conditional on a kid catching COVID, there’s a 2-in-10,000 risk of getting MIS-C, going through a somewhat terrifying ordeal, but eventually fully recovering. And, there's a, I dunno, 1-in-100,000 risk of permanent problems. Again, combine that with the fact that I’m not making decisions that swing the risk of COVID-19 infection all the way from 0% to 100%, and I find this a pretty much acceptable price in cases where I’m spending it on real benefits in my kids’ mental health and quality-of-life. Unless my numbers are wrong of course. So I'm pretty much ignoring MIS-C too. The next two categories seem much worse than that.

3. Long COVID:

“Children with long covid” (New Scientist, Feb 2021) says “Evidence from the first study of long covid in children suggests that more than half of children aged between 6 and 16 years old who contract the virus have at least one symptom lasting more than 120 days, with 42.6 per cent impaired by these symptoms during daily activities.” What??? 43%? No way. That’s way too high. This article calls it rare. 43% is not rare.

The 43% statistic comes from Preliminary Evidence on Long COVID in children. It seems like a helpful article but I don’t know what to make of the selection bias. Where exactly did they get these kids? “This cross-sectional study included all children ≤18 year-old diagnosed with microbiologically-confirmed COVID-19 in Fondazione Policlinico Universitario A. Gemelli IRCCS (Rome, Italy).” That should disproportionately sample sick kids, and especially severely sick kids, right? So I’m going to ignore that.

The New Scientist article also says “The UK Office for National Statistics's latest report estimates that 12.9 per cent of UK children aged 2 to 11, and 14.5 per cent of children aged 12 to 16, still have symptoms five weeks after their first infection.” That’s this link. I’m guessing that the population here is "initially-symptomatic kids" as opposed to "all infected kids". So divide by 2 or 3? And not all of those 12.9% are catastrophic. Some may be kids who are easily-fatigued for 5 weeks then recover, which kinda sucks but isn’t that big a deal. I dunno, figure, conditional on a kid catching COVID-19, 2% chance that it’s, like, a really really long and miserable slog that everyone will deeply regret. The rest of the time it's at worst in the ballpark of adding up 1 unusually severe flu + 1 broken leg—lots of pain, hassle, doctors visits, medical bills, missed activities, and so on, but not worse than that. Life goes on.
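A minimal sketch of the adjustment chain above (the divisors and the final 2% figure are the author's judgment calls, not measured quantities):

```python
# The post's adjustment: the ONS 12.9% figure is assumed to be based
# on initially-symptomatic kids, so dividing by 2-3 approximates the
# rate among all infected kids. Both divisors are guesses.
ons_rate = 0.129                                 # symptoms at 5 weeks, ages 2-11
adjusted = [ons_rate / d for d in (2, 3)]        # ~4-6% with any lingering symptoms
print([f"{x:.1%}" for x in adjusted])
```

From that ~4–6% "any lingering symptoms" range, the post then guesses that only a fraction are severe, arriving at the ~2% "really miserable slog" estimate.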

4. Long-term complications:

Maybe you catch COVID as a kid and then there’s a 1% higher risk of heart disease decades later. Or something else. How would we even know?

My general impression is that kids’ bodies are generally good at recovering and rebuilding themselves over time. But that’s not really based on anything. An example in the opposite direction is polio: I guess polio kills nerve cells in a way that’s unrepairable, and which gradually gets worse and worse over decades after apparent recovery? Is the nervous system unusually hard to repair?? Because, um, COVID often impacts the nervous system too, right?! Yeesh.

I have no idea, I’m out of my depth here.

I guess I'll say 1% chance of a big-deal long-term latent problem, conditional on catching COVID-19. That's not really based on anything, but I need a number because I have to make decisions and weigh tradeoffs. Happy for any input here.

Conclusions:

So, conditional on a kid catching COVID, I guess I'm currently thinking that I should mainly be weighing a ~2% chance of a miserable months-long ordeal until they recover, plus (overlapping) ~1% chance of a big-deal long-term latent problem that will show up later in life.

OK, I guess when I multiply everything out right now…

• Risk rounds to zero, do it without thinking twice.
• Fully-vaccinated adults (6+ weeks past 1st dose) spending time indoors unmasked with my kids
  • If the adults have not recently been spending extensive (or unmasked) time indoors with unvaccinated people, then no problem, don’t even think twice about it. If they have, then try to avoid it, but maybe it’s OK from time to time if there’s sufficient social benefit.
• Somewhat-vaccinated adults (2-6 weeks past 1st dose) spending time indoors unmasked with my kids
  • If necessary. Depends on what the adult has been up to and how much social benefit we’re getting out of it.
• Kids go to school (masked)
  • We're already doing that. Our local school is pretty good about ventilation and masking, and has mandatory universal weekly PCR testing. Any remaining risk is more than compensated by the very large benefits for both kids & parents.
• Kids spending time inside with other kids
  • I guess on rare occasions if there’s a sufficient social benefit. But definitely try to keep such activities outside, until community spread goes down from its current high levels. Oh, I guess there should be an exception for school classmates, since they're already spending time together inside masked, every day at school. But go for masks and open windows.
• Kids tag along shopping indoors, masked
  • Check the microcovid calculator, but probably not if it’s avoidable, at least not until community spread goes down from its current high levels.

Discuss

### Types of generalism

April 20, 2021 - 11:22
Published on April 20, 2021 8:22 AM GMT

[Cross-posted from here]

I am interested in the nature of inter- and transdisciplinary research, which often involves some notion of “generalism”. There are different ways to further conceptualize generalism in this context.

First, a bit of terminology that I will rely on throughout this post: I call the bodies of knowledge from which insights are being drawn “source domains”. The body of knowledge that is being informed by this approach is called the “target domain”.

Directionality of generalism

We can distinguish SFI-style generalism from FHI-style generalism (h/t particlemania for first formulating this idea):

• In the case of SFI-style generalism, the source domain is fixed and the approach is to maintain a portfolio of target domains that may gain value from “export”.
• In the case of FHI-style generalism, the target domain is fixed and the approach is to build a portfolio of diverse source expertise.

In the case of SFI, their source domain is the study of complex systems, which they apply to topics as varied as life and intelligence, cities, economics and institutions, opinion formation, etc.

In the case of FHI, the target domain is fixed (though only loosely) via the problem of civilization-scale consequentialism, and the source domains include philosophy, international relations, machine learning, and more.

Full vs partial generalism

Partial generalism: Any one actor should focus on one (or a similarly small number of) source domains to draw from.

Arguments:

• Ability: Any one actor can only be well-positioned to work with a small number of source domains because doing this work well requires expertise with the source domain. Expertise takes time to develop, so naturally, the number of source domains a single person will be able to draw upon (with adequate epistemic rigor) is limited.
• Increasing returns to depth: The deeper an actor’s expertise in the two fields they are translating between, the higher the expected value of their work. This can apply to individual researchers as well as to a team/organization doing generalist research.

Full generalism: As long as you fix your target domain, an actor can and should venture into many source domains.

Arguments:

• Ability: An actor can do high-quality research while drawing from a (relatively) large number of source domains, some of which they only learn about as they discover them. This “ability” could come from several sources:
  • The researcher’s inherent cognitive abilities
  • The structure (i.e. lack of depth) of the field (sometimes a field is sufficiently shallow that the assumption that someone can get adequately oriented within it is justified)
  • Sufficiently effective error-correction mechanisms within the intellectual community (meaning that, even if an individual starts out getting some important things wrong, those mistakes will be readily discovered and corrected)
• Increasing returns to scope: The richer (in intellectual diversity) an actor’s expertise, the juicier the insights. Again, this argument could apply to an individual or groups of individuals working closely together.

Note that you can achieve full generalism at an organizational level while having a team of individuals that all engage in partial generalism.

Discuss

### Does an app/group for personal forecasting exist?

April 20, 2021 - 08:04
Published on April 20, 2021 5:04 AM GMT

I'm interested in personal forecasting - predicting my own future behavior on a range of timescales. I see it as a more useful skill than forecasting on world events. Formulating personally useful forecasts seems like an important and neglected skill in the rationalist community. And it would be nice to have some company, and tools to make it more convenient. Right now, I'm just using a spreadsheet. Does anybody know if there are groups doing this sort of thing? Is there a good app to manage the process?

Discuss

### Iterated Trust Kickstarters

April 20, 2021 - 06:18
Published on April 20, 2021 3:18 AM GMT

Epistemic Status: I haven't actually used this through to completion with anyone. But, it seems like a tool that I expect to be useful, and it only really works if multiple people know about it.

In this post, I want to make you aware of a few things:

Iterated kickstarters: Kickstarters where the payment doesn't all go in at once – instead people pay in incrementally, after seeing partial progress on the goal. (Or, if you don't actually have a government-backed assurance contract, people pay in incrementally as they see other people pay in incrementally, so the system doesn't require as much trust to bootstrap.)

Trust kickstarters: Kickstarters that are not about money, and are instead about "do we have the mutual trust, goodwill and respect necessary to pull a project or relationship off?" I might be scared to invest into my relationship with you, if I don't think you're invested in me.

Iterated trust kickstarters: Combining those two concepts.

Iterated Kickstarters

In The Strategy of Conflict, Thomas Schelling (of Schelling point fame) poses a problem: say you have a one-shot coordination game. If Alice puts in a million dollars, and her business partner Bob puts in a million dollars, they both get 10 million dollars. But if only one of them puts in a million, the other can abscond with it.

A million dollars is a lot of money for most people. Jeez.

What to do?

Well, hopefully you live in a society that has built well-enforced laws around assurance contracts (aka "kickstarters"). You put in a million. If your partner backs out, the government punishes them, and/or forces them to return the money.

But what if there isn't a government? What if we live in the Before Times, and we're two rival clans who for some reason have a temporary incentive to work together (but still incentive to defect)? What if we live in present day, but Alice and Bob are two entirely different countries with no shared tradition of cooperation?

There are a few ways to solve this. But one way is to split the one-shot dilemma into an iterated game. Instead of putting in a million dollars, you each put in $10. If you both did that, then you each put in another $10, and another. Now that the game is iterated, the payoff structure changes from a prisoner's dilemma to a stag hunt. Sure, at any given time you could defect, but you'd be getting a measly $10, and giving up on a massive $10 million potential payoff.
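The incentive comparison above can be made concrete (treating the defector's gain as a single installment and ignoring reputation effects is a simplification of Schelling's setup):

```python
# Toy accounting for the iterated kickstarter. The numbers are the
# post's illustrative figures; modeling "defect" as pocketing only the
# partner's latest installment is a simplifying assumption.
installment = 10
stake = 1_000_000
payoff = 10_000_000

gain_from_defecting = installment        # grab the other side's latest $10
gain_from_cooperating = payoff - stake   # net $9,000,000 at the end
print(gain_from_defecting, gain_from_cooperating)
```

Iterating shrinks the temptation from a million dollars to ten, while leaving the cooperative prize untouched, which is exactly why the game flips from a prisoner's dilemma to a stag hunt.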

You see small versions of this fairly commonly on craigslist or in other low-trust contract work. "Pay me half the money up front, and then half upon completion."

This still sometimes results in people running off with the first half of the money. I'm assuming people do "half and half" instead of splitting it into even smaller chunks because the transaction costs get too high. But for many contractors, there are benefits to following through (instead of taking the money and running), because there's still a broader iterated game of reputation, and getting repeat clients, who eventually introduce you to other clients, etc.

(You might say that the common employment model of "I do a week of work, and then you pay me for a week of work, over and over again" is a type of iterated kickstarter).

If you're two rival clans of outlaws trying to bootstrap trust, it's potentially fruitful to establish a tradition of cooperation, where the long-term payoff is better than any individual chance to defect.

Trust Kickstarters

Meanwhile: sometimes the thing that needs kickstarting is not money, but trust and goodwill.

Goodwill kickstarters

I've seen a few situations where multiple parties feel aggrieved, exhausted, and don't want to continue a relationship anymore. This could happen to friends, lovers, coworkers, or project-cofounders.

They each feel like the other person was more at fault. They each feel taken advantage of, and like it'd make them a doormat if they went and extended an olive branch when the other guy hasn't even said "sorry" yet.

This might come from a pure escalation spiral: Alice accidentally is a bit of a jerk to Bob on Monday. Then Bob feels annoyed and acts snippy at Alice on Tuesday. Then on Wednesday Alice is like "jeez Bob what's your problem?" and then is actively annoying as retribution. And by the end of the month they're each kinda actively hostile and don't want to be friends anymore.

Sometimes, the problem stems from cultural mismatches. Carl keeps being late to meetings with Dwight. For Dwight, "not respecting my time" is a serious offense that annoys him a lot. For Carl, trying to squeeze in a friend hangout when you barely have time is a sign of love (and meanwhile he doesn't care when people are late). At first, they don't know about each other's different cultural assumptions, and they just accidentally 'betray' each other. Then they start getting persistently mad about the conflict and accrue resentment.

Their mutual friend Charlie comes by and sees that Alice and Bob are in conflict, but that the conflict is all downstream of a misunderstanding, or a minor mishap that really didn't need to become a big deal.

"Can't you just both apologize and move on?" asks Charlie.

But by now, after months of escalation, Alice and Bob have both done some things that were legitimately hurtful to each other, or have mild PTSD-like symptoms around each other.

They'd be willing to sit down, apologize, and work through their problems, if the other one apologized first. When they imagine apologizing first, they feel scared and vulnerable.

I'll be honest: I feel somewhat confused about how best to relate to this sort of situation. I'm currently relating to it through the lens of game theory. I can imagine that the best advice for most people is to not overthink it and not stress about game theory. Maybe you should just let your hearts and bodies talk to each other, elephant to elephant.

But... also, it seems like the game theory is just really straightforward here. A "goodwill kickstarter" really should Just Work in these circumstances. If it's true that "I would apologize to you if you apologized to me", and vice versa, holy shit, why are you two still fighting?

Just, agree that you will both apologize conditional on the other person apologizing, and that you would both be willing to re-adopt a friendship relational stance conditional on the other person doing that.

And then, do that.
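The conditional structure here is the same as an assurance contract: nobody's apology "fires" unless everyone's does, so nobody has to go first. A minimal sketch (the names and the function are illustrative, not any real protocol):

```python
# Toy "goodwill kickstarter": each party pledges "I'll apologize if they
# will." The apologies only trigger once everyone has pledged, so a lone
# pledger is never left exposed.

def goodwill_kickstarter(pledges):
    """pledges maps each person to True if they've made the conditional offer."""
    if all(pledges.values()):
        return {name: "apologizes" for name in pledges}
    return {name: "keeps waiting" for name in pledges}

print(goodwill_kickstarter({"Alice": True, "Bob": True}))
# {'Alice': 'apologizes', 'Bob': 'apologizes'} - neither had to go first
print(goodwill_kickstarter({"Alice": True, "Bob": False}))
# {'Alice': 'keeps waiting', 'Bob': 'keeps waiting'} - Alice's offer cost her nothing
```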

Competence Kickstarter

Alternately, you might need to kickstart "trust in competence."

Say that Joe keeps screwing up at work – he's late, he's dropping the ball on projects, he's making various minor mistakes, he's communicating poorly. And his boss Henry has started getting angry about it, nagging Joe constantly, pressuring Joe to stay late to finish his work, constantly micromanaging him.

I can imagine some stories here where Joe was "originally" the one at fault (he was just bad at his job for some preventable reason one week, and then Henry started getting mad). I can also imagine stories here where the problems stemmed originally from Henry's bad management (maybe Henry was taking some unrelated anger out on Joe, and then Joe started caring less about his job).

Either way, by now they can't stand each other. Joe feels anxious heading into work each day. Henry feels like talking to Joe isn't worth it.

They could sit down, earnestly talk through the situation, take stock of how to improve it. But they don't feel like they can have that conversation, for two reasons.

One reason is that there isn't enough goodwill. The situation has escalated and both are pissed at each other.

Another reason, though, is that they don't trust each other's competence.

Manager Henry doesn't trust that Joe can actually reliably get his work done.

Employee Joe doesn't believe that Henry can give Joe more autonomy, talk to him with respect, etc.

In some companies and some situations, by this point it's already too late. It's pretty overdetermined that Henry fires Joe. But that's not always the right call. Maybe Henry and Joe have worked together long enough to remember that they used to be able to work well together. It seems like it should be possible to repair the working relationship. Meanwhile Joe has a bunch of talents that are hard to replace – he built many pieces of the company infrastructure, and training a new person to replace him would be costly. And there are a bunch of nice things about the company they work at that make Joe prefer not to have to quit and find a better job elsewhere.

To repair the relationship, Henry needs to believe that Joe can start getting work done reliably. Joe needs to believe that Henry can start treating him with respect, without shouting angrily or micromanaging.

This only works if they in fact both can credibly signal that they will do these things. This works if the missing ingredient is "just try harder." Maybe the only reason Joe isn't working reliably is that he no longer believes it's worth it, and the only reason Henry is being an annoying manager is that he felt like he needed to get Joe to get his stuff done on time.

In that case, it's reasonably straightforward to say: "I would do my job if you did yours", coupled with the relational-stance-change of "I would become genuinely excited to be your employee if you became genuinely excited about being my boss".

Sometimes, this won't work. The kickstarter can't trigger because Henry doesn't, in fact, trust Joe to do the thing, even if Joe is trying hard.

But, you can still clearly lay out the terms of the kickstarter. "Joe, here's what I need from you. If you can't do that, maybe I need to fire you. Maybe you need to go on a sabbatical and see if you can get your shit together." Maybe you can explore other possible configurations. Maybe the reason Joe isn't getting his work done is because of a problem at home, and he needs to take a couple weeks off to fix his marriage or something, but would be able to come back and be a valuable team member afterwards.

I think having the terms of the kickstarter clearly laid out is helpful for thinking about the problem, without having to commit to anything.

Why think about this in terms of a "kickstarter," rather than just "a deal"? What feels special to me about relationship kickstarters is that relationships (and perhaps other projects) benefit from investment and momentum. If your stance is "I'm ready to jump and execute this plan if only other people were onboard and able to fulfill their end," then you are better positioned to get moving quickly as soon as the others are on board.

The nice thing about the kickstarter frame, IMO, is I can take a relationship that is fairly toxic, and I can set my internal stance to be ready to fix the relationship, but without opening myself up to exploitation if the other person isn't going to do the things I think are necessary on their end.

Iterated Trust Kickstarters

And then, sometimes, a one-shot kickstarter isn't enough.

Henry and Joe

In the case of Henry and Joe: maybe "just try harder" isn't good enough. Joe has some great skills, but is genuinely bad at managing his time. Henry is good at the big picture of planning a project, but finds himself bad at managing his emotions, in a way that makes him bad at actually managing people.

It might be that even if they both really wanted things to work out, and were going to invest fully in repairing their working relationship... the next week, Joe might miss a deadline, and Henry would snippily yell at him in a way that was unhelpful. They both have behavioral patterns that will not change overnight.

In that case, you might want to combine "trust kickstarter" and "iterated kickstarter."

Here, Joe and Henry both acknowledge that they're expecting this to be a multi-week (or month) project. The plan needs to include some slack to handle the fact that they might fuck up a bit, and a sense of what's supposed to happen when one of them screws up. It also needs a mechanism for saying "you know what, this isn't working."

"Iterated Trust Kickstarter" means, "I'm not going to fully start trusting you because you say you're going to try harder and trust me in turn. But, I will trust you a little bit, and give it some chance to work out, and then trust you a bit more, etc." And vice versa.
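The "trust a little, check, trust a bit more" loop, together with the slack and abort mechanism from the Henry-and-Joe plan, can be sketched as a toy simulation. The round count, strike limit, and slip probabilities are all made-up illustrative numbers:

```python
import random

def iterated_trust_kickstarter(p_slip_joe, p_slip_henry, rounds=12,
                               max_strikes=3, rng=random):
    """Each good round, both sides extend one more notch of trust.
    A slip-up by either side costs a strike (the plan's built-in slack);
    exceeding max_strikes triggers the "this isn't working" exit."""
    trust, strikes = 0, 0
    for _ in range(rounds):
        slipped = rng.random() < p_slip_joe or rng.random() < p_slip_henry
        if slipped:
            strikes += 1
            if strikes > max_strikes:
                return "aborted", trust
        else:
            trust += 1
    return "repaired", trust

print(iterated_trust_kickstarter(0.0, 0.0))  # ('repaired', 12): reliable parties ratchet all the way up
print(iterated_trust_kickstarter(1.0, 1.0))  # ('aborted', 0): chronic slippers hit the exit clause
```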

Rebuilding a Marriage

A major reason to want this is that sometimes, you feel like someone has legitimately hurt you. Imagine a married couple who had a decade or so of great marriage, but then ended up in a several-year spiral where they stopped making time for each other and got into lots of fights. Each of them has built up a story in their head where the other person is hurting them. Each of them has done some genuinely bad things (maybe cheated, maybe yelled a lot in a scary way).

Relationships that have gone sour can be really tricky. I've seen a few people end up in states where I think it's legitimately reasonable to be worried their partner is abusive, but also, it's legitimately reasonable to think that the bad behavioral patterns are an artifact of a particularly bad set of circumstances. If Alice and Bob were to work their way out of those circumstances, they could still rebuild something healthy and great.

In those cases, I think it's important for people to be able to invest a little back into the relationship – give a bit of trust, love, apology, etc., as a signal that they think the relationship is worth repairing. But, well, "once bitten, twice shy." If someone has hurt you, especially multiple times, it's sometimes really bad to leap directly into fully trusting the other person.

I think the Iterated Trust Kickstarter concept is something a lot of people do organically without thinking about it in exactly these terms (i.e. lots of people damage a relationship and then slowly/carefully repair it).

I like having the conceptual handle because it helps me think about how exactly I'm relating to a person. It provides a concrete frame for avoiding the failure modes of "holding a relationship at a distance, such that you're basically sabotaging attempts to repair it" and "diving in so recklessly that you end up just getting hurt over and over."

The ITK frame helps me lean hard into repairing a relationship, in a way that feels safe.

(disclaimer: I haven't directly used this framework through to completion, so I can't vouch for it working in practice. But this seems to mostly be a formalization of a thing I see people doing informally that works alright)

Concrete Plans

For an ITK to work out, I think there often needs to be a concrete, workable plan. It may not be enough to just start trusting each other and hope it works out.

If you don't trust each other's competence (either at "doing my day job", or "learning to speak each other's love languages"), then, you might need to check:

• Do Alice and Bob each understand what things they want from one another? If this is about emotional or communication skills they don't have, do they have a shared understanding of what skills they are trying to gain and why those skills will help?
• Do they have an actual workable plan for gaining those skills?

Say that Bob has tried to get better at communication a few times, but he keeps running into the same ugh fields which prevent him from focusing on the problem. He and Alice might need to work out a plan together for navigating those ugh fields before Alice will feel safe investing more in the relationship.

And if Alice is already feeling burned, she might already be so estranged that she's not willing to help Bob come up with a plan to navigate the ugh-fields. "Bob, my terms for the initial step in the kickstarter are that I need you to have already figured out how to navigate ugh fields on your own, before I'm willing to invest anything."

Unilaterally Offering Kickstarters

Part of why I'd like to have this concept in my local rationalist-cultural-circles is that I think it's pretty reasonable to extend a kickstarter offer unilaterally, if everyone involved is already familiar with the concept and you don't have to explain it.

(New coordinated schemes are costly to evaluate, so if your companion isn't already feeling excited about working with you on something, it may be asking too much of them to listen to you explain Iterated Trust Kickstarters in the same motion as asking them to consider "do you want to invest more in your relationship with me?")

But it feels like a useful tool to have in the water, available when people need it.

In many of the examples so far, Alice and Bob both want the relationship to succeed. But sometimes, Alice has totally given up on the relationship. Bob may also feel burned by Alice, but he at least feels there's some potential value on the table. And it'd be nice to easily be able to say:

"Alice, for what it's worth, I'd be willing to talk through the relationship, figure out what to do, and do it. I'm still mad, but I'd join the Iterated Kickstarter here." If done right, this doesn't have to cost Bob anything other than the time spent saying the sentence, and Alice the time spent listening to it. If Alice isn't interested, that can be the end of that.

But sometimes, knowing that someone else would put in effort if you also would, is helpful for rekindling things.

Discuss

### How can we increase the frequency of rare insights?

April 20, 2021 - 03:12
Published on April 19, 2021 10:54 PM GMT

In many contexts, progress largely comes not from incremental progress, but from sudden and unpredictable insights. This is true at many different levels of scope—from one person's current project, to one person's life's work, to the aggregate output of an entire field. But we know almost nothing about what causes these insights or how to increase their frequency.

Incremental progress vs. sudden insights

To simplify, progress can come in one of two ways:

1. Incremental improvements through spending a long time doing hard work.
2. Long periods of no progress, interspersed with sudden flashes of insight.

Realistically, the truth falls somewhere between these two extremes. Some activities, like theorem-proving, look more like the second case; other activities, like transcribing paper records onto a computer, look more like the first. When Andrew Wiles proved Fermat's Last Theorem, he had to go through the grind of writing a 200-page proof, but he also had to have sparks of insight to figure out how to bridge the missing gaps in the proof.

The axis of incremental improvements vs. rare insights is mostly independent of the axis of easy vs. hard. A task can be sudden and easy, or incremental and hard. For example:[1]

|      | incremental work | sudden insights |
|------|------------------|-----------------|
| easy | algebra homework | geometry homework |
| hard | building machine learning models | proving novel theorems |

Insofar as progress comes from "doing the work", we know how to make progress. But insofar as it comes from rare insights, we don't know.

Some meditations on the nature of insights

Why did it take so long to invent X?

Feynman on finding the right psychological conditions

I worked out the theory of helium, once, and suddenly saw everything. I'd been struggling, struggling for two years, and suddenly saw everything at one time. [...] And then you wonder, what's the psychological condition? Well I know at that particular time, I simply looked up and I said wait a minute, it can't be quite that difficult. It must be very easy. I'll stand back, I'll treat it very lightly, I'll just tap it, and there it was! So how many times since then, I'm walking on the beach and I say, now look, it can't be that complicated. And I'll tap it, tap it, nothing happens.

Feynman tried to figure out what conditions lead to insights, but he "never found any correlations with anything."

P vs. NP

A pessimistic take would be that there's basically no way to increase the probability of insights. Recognizing insights as obvious in retrospect is easy, but coming up with them is hard, and this is a fundamental mathematical fact about reality because P != NP (probably). As Scott Aaronson writes:

If P=NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in "creative leaps," no fundamental gap between solving a problem and recognizing the solution once it's found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett. It’s possible to put the point in Darwinian terms: if this is the sort of universe we inhabited, why wouldn’t we already have evolved to take advantage of it?

I'm not quite so pessimistic. I agree with Scott Aaronson's basic argument that solving problems is much harder than recognizing good solutions, but there might still be ways we could make it easier to solve problems.

johnswentworth on problems we don't understand

The concept of sudden-insight problems relates to johnswentworth's concept of problems we don't understand. Problems we don't understand almost always require sudden insights, but problems that require sudden insights might be problems we understand (for example, proving theorems). johnswentworth proposes some types of learning that could help:

• Learn the gears of a system, so you can later tackle problems involving the system which are unlike any you've seen before. Ex.: physiology classes for doctors.
• Learn how to think about a system at a high level, e.g. enough to do Fermi estimates or identify key bottlenecks relevant to some design problem. Ex.: intro-level fluid mechanics.
• Uncover unknown unknowns, like pitfalls which you wouldn't have thought to check for, tools you wouldn't have known existed, or problems you didn't know were tractable/intractable. Ex.: intro-level statistics, or any course covering NP-completeness.

I would expect these types of learning to increase the rate of insights.

Learning how to increase the frequency of insights

Insights happen less frequently under bad conditions: when you're sleep-deprived, or malnourished, or stressed out, or distracted by other problems. Some actions can increase the probability of insights—for example, by studying the field and getting a good understanding of similar problems. But even under ideal conditions, insights are rare.

Interestingly, most of the things that increase the frequency of insights, such as sleep and caffeine, also increase the speed at which you can do incremental work. It's possible that these things speed up thinking, but don't increase the probability that any particular thought is the "right" one.

I can come up with one exception: you can (probably?) increase the frequency of insights on a problem if you understand a wide variety of problems and concepts. I don't believe this does much to speed up incremental work, but it does make sudden insights more likely. Perhaps this happens because sudden insights often come from connecting two seemingly-unrelated ideas. I've heard some people recommend studying two disparate fields because you can use your knowledge of one field to bring a unique perspective to the other one.

Overall, though, it seems to me that we as a society basically have no idea how to increase insights' frequency beyond a basic low level.

Instead of directly asking how to produce insights, we can ask how to learn how to produce insights. If we wanted to learn more about what conditions produce insights, how might we do that? Could we formally study the conditions under which geniuses come up with genius ideas?

If someone gave me a pile of money and asked me to figure out what conditions best promote insights, what would I do? I might start by recruiting a bunch of mathematicians and scientists to regularly report on their conditions along a bunch of axes: how long they slept, their stress level, etc. (I'd probably want to figure out some axes worth studying that we don't already know much about, since we know that conditions (like sleep quality) do affect cognitive capacity.) Also have them report whenever they make some sort of breakthrough. If we collect enough high-quality data, we should be able to figure out what conditions work best, and disambiguate between factors that help provide insights and factors that "merely" increase cognitive capacity.
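As a sketch of what that analysis might look like once data comes in: a year of simulated daily condition logs with a breakthrough flag, comparing insight days against ordinary days on each recorded factor. The data here is entirely fabricated (the sleep "effect" is baked in purely so the analysis has something to find).

```python
import random

rng = random.Random(1)

# Fabricated stand-in data: one self-report per day, plus whether a
# breakthrough happened. Assumed toy model: insights get likelier with
# sleep and rarer with stress.
logs = []
for _ in range(365):
    sleep = rng.uniform(4, 9)      # hours slept
    stress = rng.uniform(0, 10)    # self-reported stress level
    p_insight = 0.05 + 0.02 * (sleep - 4) - 0.005 * stress
    logs.append({"sleep": sleep, "stress": stress,
                 "insight": rng.random() < max(p_insight, 0.0)})

def mean(xs):
    return sum(xs) / len(xs)

insight_days = [day for day in logs if day["insight"]]
other_days = [day for day in logs if not day["insight"]]
print(f"{len(insight_days)} breakthrough days out of {len(logs)}")
for factor in ("sleep", "stress"):
    print(f"{factor}: {mean([d[factor] for d in insight_days]):.2f} on insight days "
          f"vs {mean([d[factor] for d in other_days]):.2f} otherwise")
```

A real version would add the disambiguation step the post calls for: checking whether a factor raises insight frequency specifically, or merely boosts overall cognitive throughput.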

I'm mostly just speculating here—I'm not sure the best way to study how to have insights. But it does seem like an important thing to know, and right now we understand very little about it.

1. Some more specific examples from things I've worked on:


Discuss

### Quick examination of miles per micromort for US drivers, with adjustments for safety-increasing behavior

April 20, 2021 - 02:19
Published on April 19, 2021 11:19 PM GMT

This post links to a Google Sheet containing a quick investigation into the accuracy of Wikipedia's figure for miles per micromort (230) for US drivers, when accounting for preventative behaviors.

The following are the main outcome estimates:

• Miles per micromort, no adjustments, in US (2019): 91
• If excluding motorcycles: 105
• If excluding motorcycles and pedestrians, pedalcyclists, and other nonoccupants: 137
• Amongst passenger vehicle occupants only: 132
• Amongst passenger vehicle occupants only, if setting single-car crashes to 0: 235
• Amongst passenger vehicle occupants only, if approximating the seatbelt-wearing only rate: 245
• Amongst passenger vehicle occupants only, if setting single-car crashes to 0 and approximating the seatbelt-wearing only rate: 442
• Amongst passenger vehicle occupants only, if setting single-car crashes to 0, approximating the seatbelt-wearing only rate, and setting alcohol-impaired, drowsiness-associated, and distraction-associated deaths to 50% of current level (as an approximation of controlling one driver's behavior in two-driver crashes): 548
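As a sanity check on the unadjusted number: miles per micromort is just total miles driven divided by (fatalities × one million). Plugging in approximate published 2019 US totals (roughly 36,100 traffic deaths and 3.26 trillion vehicle-miles traveled; treat these as this sketch's assumptions, not the sheet's exact inputs):

```python
# 1 micromort = a one-in-a-million chance of death.
deaths_2019 = 36_100     # approx. US motor-vehicle fatalities, 2019
miles_2019 = 3.26e12     # approx. US vehicle-miles traveled, 2019

micromorts_per_mile = deaths_2019 / miles_2019 * 1e6
miles_per_micromort = 1 / micromorts_per_mile
print(round(miles_per_micromort))   # lands near the sheet's unadjusted 91
```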

This rapid (~1.5 hrs including documentation) investigation was funded by Ruby Bloom via the Bountied Rationality FB group.

Discuss

### Wanted: Research Assistant for The Roots of Progress

April 19, 2021 - 22:03
Published on April 19, 2021 7:03 PM GMT

I’m hiring a part-time research assistant to support work on my essays, talks, and the book I’m writing on the history of industrial civilization.

You must have the ability to orient yourself in unfamiliar mental territory; to penetrate the fog of confusing, incomplete, and contradictory information; to sniff out reliable sources of key facts and to corroborate them; and to quickly sketch out a new intellectual landscape.

You will handle queries such as:

• What happened to the price of cotton and the wages of textile laborers before, during, and after textile mechanization in the 18th/19th centuries? Find data and analysis on this, including relevant statistics on labor productivity, and produce a list of sources.
• What startups or other commercial projects are pursuing advanced nuclear reactor designs? Make a list, and fill out details of each in a spreadsheet, such as type of reactor, amount and sources of funding, etc.
• Find first-person accounts of agricultural life and work before the 19th century, including descriptions of regular planting and harvesting seasons, and also times of crop failure or even famine.
• What is the difference between a bloomery, a blast furnace, and a Catalan forge? Make a list of sources that address this question.

The deliverable will typically be a list of sources, with brief notes on what each one contains, ranked roughly in order of relevance to the original query. You don’t have to answer the questions I pose, but you need to find sources that help me answer them.

The only real requirements are writing skills and attention to detail. However, the ideal candidate would be:

• A graduate student in history, economics, or a related field (ideally with access to scholarly sources)
• Familiar with and interested in the progress community in general, and my work in particular
• Able to put in part-time work with fairly quick turnaround (24 hours for small queries would be excellent)

If you lack experience and credentials, apply anyway: you can make up for it by being dedicated, diligent, and willing/able to be trained.

The work will be variable, up to roughly 10–15 hours/week. We’ll mostly communicate by email/messaging, so you can be in any time zone. Pay: $25–30/hour, depending on qualifications.

To apply, send a CV/resume and writing sample to me at jason@rootsofprogress.org.

Discuss

### You Can Now Embed Flashcard Quizzes in Your LessWrong posts!

April 19, 2021 - 16:44
Published on April 19, 2021 1:44 PM GMT

With the help of the LessWrong.com team, we've set up a way for you to embed flashcard quizzes directly in your LessWrong posts! This means that you can write flashcards for any of your LessWrong posts, quiz your readers as they read, and provide an easy way for them to continue to be quizzed after they are done reading, so that they can indefinitely remember the most important things they learned in your article!

This post will explain how to add flashcards to your own posts in a step-by-step fashion.

If you want to see an example of a LessWrong post with flashcards, check out my LessWrong post on self-control, where we first experimented with this feature.

And before we get to the instructions, here's an example (from that same post) of what embedded flashcards look like when you put them right in your article:

Step 1: Create your own deck of flashcards using Thought Saver

Create a Thought Saver account at app.thoughtsaver.com and use it to create some flashcards for your post. You'll need to put all the flashcards for your post into the same deck. Here’s how to do that:

(i) Click “New Card” in Thought Saver to start creating a new flashcard - but don’t save it just yet.

(ii) In the text input box with the label "Decks"...

1. Type the name of the new deck you'd like to create for your article.
2. Hit "Enter" ("Return") or click "Create new deck".
3. This card has now been added to that deck, and this deck will now be available so that you can add all the other flashcards (for your post) to it too!

Example:

• Type "Book summary: The Very Hungry Caterpillar."
• Click 'Create new deck: "Book summary: The Very Hungry Caterpillar."'
• Repeat these steps until you’ve created all flashcards for your article and added them all to this same deck.
Step 2: Go to the Thought Saver page for your new deck

(You’re ready to take this step once you've created all the flashcards for your article and added them to the same deck.)

(i) Navigate to the page for your deck by clicking the name of your deck on one of your flashcards:

Or alternatively, you can access a deck from the search bar by clicking in the search bar and then clicking the deck name when it appears:

(ii) Now set the order of the cards in your deck, so that they appear in the order that you'd like to quiz the reader on them. Click the overflow menu in the top right corner of the page (the 3 vertical dots). Click "Sort". Arrange the cards in the order you want. Users of your deck will be quizzed on the first card first, then the second card, and so on. This allows you to design your cards in such a way that the concepts build on each other. Click "Save" when you’re done sorting.

Step 3: Click the "Share" button for that deck and click "Create Link" within the share window

(Please note that the actual text/verbiage may vary from this screenshot as we are actively iterating on this wording to make this section more understandable.)

IMPORTANT NOTE: The following steps will have to be repeated for each widget/quiz you’d like to embed in your article. We recommend including at least 2 quiz widgets in your article, but for a longer article, you may want to include more.

Step 4: Select which cards from your deck you would like to appear in the quiz for your (first) embedded widget

If you're embedding multiple widgets in your article, we’ll assume that you want to have each widget show different cards (as opposed to certain cards from your deck being repeated in more than one widget).

(i) Enter the appropriate ‘starting card number’ and ‘ending card number’ (based on how you sorted the cards in this deck previously). So for instance, if the starting card number is 3 and the ending card number is 7, that quiz widget will quiz the reader on cards 3 through 7.

(ii) Click “Copy” to copy the embed source URL to your clipboard:

IMPORTANT NOTE: you’ll need to have this URL copied to your clipboard for the steps below.

Example of how to spread the cards from your deck over multiple quizzes:

• You might choose to put the first card through the fifth card [cards 1–5] from your deck in the first flashcard quiz of your LessWrong post
• And then in the next flashcard quiz, you might include cards 6–10, etc.
• Note that at the end of any embedded widget's quiz, users will have the option to subscribe to the full deck in Thought Saver (where they can get daily email quizzes, quiz themselves manually, create their own decks, etc.)

Step 5: Create a new post on LessWrong, or open one you’re currently working on, then click "Edit Block" within your post

If you’re not logged in to your LessWrong account, or if you do not yet have an account, log in or create an account first.

Once you're logged in, open the post you are working on, or create your new post.

When you've reached a point in your post when you'd like to embed a Thought Saver flashcard quiz widget, click the "Edit Block" button to the left of the current line:

NOTE: if you’re starting from a completely blank page, start typing something to make the “Edit Block” button appear or hover your mouse over the area just to the left of the current line you're on.

Step 6. Click "Insert Media" from the options menu

Step 7: Paste the embed URL you copied from Thought Saver, and click Save!

Now you've successfully embedded a Thought Saver flashcard quiz into your LessWrong post!

You may now continue writing your LessWrong post, repeating steps 4 through 7 to embed more flashcard quizzes throughout that same post (as many as you'd like).

We hope you enjoy this new functionality! We'd love to hear your feedback on it and on Thought Saver more generally! Please give us feedback by commenting below, or by clicking the feedback button in the upper right-hand corner of the Thought Saver app.

If you're interested in how to write great flashcards, I'd recommend Andy Matuschak's article how to write good prompts: using spaced repetition to create understanding. Andy and his collaborator Michael Nielsen have been the pioneers in this space of embedding flashcards in essays. I highly recommend their essay Quantum Country where they introduced this medium. You may also want to check out Andy's other work related to this topic.

Thanks!

Discuss

### D&D.Sci April 2021 Evaluation and Ruleset

April 19, 2021 - 16:26
Published on April 19, 2021 1:26 PM GMT

This is a followup to the D&D.Sci post I made last week; if you haven’t already read it, you should do so now before spoiling yourself.

Here is the web interactive I built to let you evaluate your solution; below is an explanation of the rules used to generate the dataset. You’ll probably want to test your answer before reading any further.

Ruleset

(Note: to make writing this easier, I’m using standard D&D dice notation, in which “3+4d8” means “roll four eight-sided dice, sum the results, then add three”.)
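For readers who want to poke at these distributions themselves, here's a small roller for that notation, covering both the additive forms like 3+4d8 and the multiplicative form the Alexandrian Merfolk use (1d8*1d8*1d8+1d20). This is my own helper, not anything from the original interactive:

```python
import random

def roll(spec, rng=random):
    """Roll dice in D&D notation: terms joined by '+', where each term is
    a constant, 'NdS' (the sum of N S-sided dice), or a '*'-product of those."""
    total = 0
    for term in spec.split("+"):
        product = 1
        for factor in term.split("*"):
            if "d" in factor:
                n, sides = factor.split("d")
                product *= sum(rng.randint(1, int(sides))
                               for _ in range(int(n) if n else 1))
            else:
                product *= int(factor)
        total += product
    return total

print(roll("3+4d8"))              # somewhere between 7 and 35
print(roll("1d8*1d8*1d8+1d20"))   # somewhere between 2 and 532
```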

Enemies

Sharks

Sharks are 1/6 of encounters.

They attack in groups of 2+1d4, each of which does 1d10 points of damage.

Demon Whales

Demon Whales are 1/14 of encounters. (If that fraction seems high, you’re failing to account for all the sunk ships that couldn’t report encountering them.)

An attack from a Demon Whale does 17d12 points of damage.

A Demon Whale encounter has a ~78% fatality rate.

Crabmonsters

Crabmonsters are 1/14 of encounters.

A Crabmonster repeatedly rolls 1d80 as it tears through the ship, adding a point of damage with each roll, until it rolls a 1 (that is, encounters someone or something that stops it).

~8% of Crabmonster encounters do >200% damage; a Crabmonster encounter has a ~28% fatality rate.

Pirates

Though the Navy’s records don’t bother to distinguish, Pirates come in two categories: Brigands (local criminals who had the poor fortune to cross paths with Naval supply ships while flying the black flag, and/or to mistake them for civilian cargo ships) and Privateers (agents of an enemy government, harassing your Navy’s fleet using hit-and-run tactics). Brigands are 1/6 of random encounters during your voyages, Privateers 1/21.

A fight with Brigands does 4d8 points of damage; a fight with Privateers does 6d12.

Merpeople

Surface-dwellers are unaware of the intricacies of underwater society, and record both Atlantean Merfolk (1/14 of encounters) and Alexandrian Merfolk (2/21 of encounters) as “Merpeople”. Fortunately, the two city-states are close enough politically that befriending one will cause them both to allow you free passage.

Atlanteans do 20+3d20 damage; Alexandrians do 1d8*1d8*1d8+1d20 damage.

~14% of Alexandrian attacks do >200% damage; an Alexandrian attack has a ~37% fatality rate.

Kraken

Kraken are 2/21 of encounters.

They do 12d8 points of damage.

Nessie

Nessie is 1/21 of encounters.

She does 40+10d8 points of damage.

An encounter with Nessie has a ~2% fatality rate.

Harpies

Harpies are 1/14 of encounters.

They do 1d4+1d8+1d12 points of damage.

Water Elementals

Water Elementals are 2/21 of encounters.

The Navy has countering the powerful but predictable attacks of Water Elementals down to an art; there are well-known methods for ensuring they only almost destroy a given ship. They do 73+1d12 points of damage.

Direction

Direction is irrelevant from perspectives both practical (you have no control over how many trips you take each way) and epistemic (direction happens to have no effect on outcomes).

Time effects

Time has almost no effect. The one exception is that Privateers used to be much more common (and other encounters therefore slightly less common) before 4/1401; this is when your nation’s main rival changed tactics and stopped hiring mercenaries to attack supply ships.

Sinking Risk by Enemy

In the absence of interventions, ~50% of shipwrecks are caused by Demon Whales, ~18% by Crabmonsters, ~31% by Merpeople, ~1% by Nessie, and 0% by other threats.
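The damage rules above can be combined into a quick Monte Carlo check of these shares. The sinking threshold is my assumption (a ship sinks when an encounter deals 100 or more points of damage, consistent with Water Elementals' 74-85 points only "almost" destroying a ship); everything else follows the stated dice rules:

```python
import random
from collections import Counter

rng = random.Random(0)

def d(n, sides):
    """Sum of n rolls of an s-sided die."""
    return sum(rng.randint(1, sides) for _ in range(n))

def crabmonster():
    """Roll 1d80 until a 1; damage equals the number of rolls."""
    dmg = 0
    while True:
        dmg += 1
        if rng.randint(1, 80) == 1:
            return dmg

# (encounter weight out of 42, damage function), per the ruleset above
encounters = {
    "Sharks":          (7, lambda: d(2 + d(1, 4), 10)),
    "Demon Whale":     (3, lambda: d(17, 12)),
    "Crabmonster":     (3, crabmonster),
    "Brigands":        (7, lambda: d(4, 8)),
    "Privateers":      (2, lambda: d(6, 12)),
    "Atlanteans":      (3, lambda: 20 + d(3, 20)),
    "Alexandrians":    (4, lambda: d(1, 8) * d(1, 8) * d(1, 8) + d(1, 20)),
    "Kraken":          (4, lambda: d(12, 8)),
    "Nessie":          (2, lambda: 40 + d(10, 8)),
    "Harpies":         (3, lambda: d(1, 4) + d(1, 8) + d(1, 12)),
    "Water Elemental": (4, lambda: 73 + d(1, 12)),
}

names = list(encounters)
weights = [encounters[n][0] for n in names]

wrecks = Counter()
for _ in range(200_000):
    name = rng.choices(names, weights=weights)[0]
    if encounters[name][1]() >= 100:      # assumed sinking threshold
        wrecks[name] += 1

total = sum(wrecks.values())
for name, count in wrecks.most_common():
    print(f"{name}: {count / total:.0%} of shipwrecks")
```

With the assumed threshold, the simulated shares come out close to the stated ~50/31/18/1 split (with Alexandrian Merfolk accounting for the Merpeople wrecks, since Atlantean damage caps at 80), which is some evidence the guess is right.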

Strategy

If attempting to optimize odds of survival, your best choices are to buy all oars, arm carpenters, tribute the Merpeople, and buy one extra cannon; congratulations to simon, GuySrinivasan and Measure for reaching this conclusion.

However, since Pirates never sink ships and Nessie is pretty bad at it, you may wish to take the money you’d spend on the cannon and either hold onto it (to impress the Navy’s accountants) or spend it on foam swords (to impress the Navy’s dockworkers).

Reflections

All else equal, there’s a little extra uncertainty when predicting quantities instead of categories: “is that sudden peak at 14% noise, or a clue to the generating function?”, etc. However, the main reason this challenge was so much more speculative than its predecessors is that the most important information – details of attacks that did 100%+ damage – was censored by the mechanics of the world. In the absence of hard evidence, small errors in inference compound, priors pick up the slack, and considerations like “what genres apply here?” or “is the scenario designer enough of a troll to have Demon Whale damage arbitrarily cap out at 99%?” take on a significance they wouldn’t otherwise.

This is both good and bad. Good because the personal touch adds intrigue to what would otherwise just be data-wrangling; bad because every unit of effort spent psychoanalyzing the GM is a unit of effort not spent on getting better at data-wrangling or on psychoanalyzing reality’s GM (i.e. studying Math and Science). I enthusiastically solicit feedback on this point, as well as on every other point.

Scheduling

The next D&D.Sci challenge should be ready sometime earlyish next month, but nebulous and open-ended work commitments mean I can’t promise anything.

Discuss

### Parameter count of ML systems through time?

April 19, 2021 - 15:54
Published on April 19, 2021 12:54 PM GMT

Pablo Villalobos and I have been working to compile a rough dataset of parameter counts for some notable ML systems through history.

This is hardly the most important metric about these systems (other interesting metrics we would like to understand better are training and inference compute, and dataset size), but it is nonetheless an important one and particularly easy to estimate.

So far we have compiled what is (to our knowledge) the biggest dataset of parameter counts, with over 100 entries.

But we could use some help to advance the project:

1. Is there any previous relevant work? We are aware of the AI and Compute post by OpenAI, and there are some papers with small tables of parameter counts.
2. If you want to contribute an entry, please do! The key information for an entry is a reference (citation and link), domain (language, vision, games, etc.), the main task the system was designed to solve, parameter count (explained with references so it's easy to double-check), and date of publication. The criteria for inclusion are not very well defined at this stage in the process; we have been focusing on notable papers (>1000 citations), significant SOTA improvements (>10% improvement on a metric over the previous system) and historical relevance (subjective). We mostly have ML/DL/RL papers, and some statistical learning papers. To submit an entry, either leave an answer here, send me a PM, email jaimesevillamolina@gmail.com or leave a comment in the spreadsheet.
3. If you'd be interested in joining the project, shoot me an email. The main commitment is to spend 1h per week curating dataset entries. Our current goal is compiling parameter counts of one system per year between 2000 and 2020 and per main domain. If you can compute the number of parameters of a CNN from its architecture you are qualified. I expect participating will be most useful to people who would enjoy having an excuse to skim through old AI papers.
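For anyone wondering what "computing the number of parameters of a CNN from its architecture" involves, it reduces to a couple of multiplications per layer. A sketch (the helper names are mine, and the layer sizes below are illustrative, not an official count for any particular paper):

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Parameters in a standard 2-D convolution: one kernel_h x kernel_w x
    in_channels filter per output channel, plus an optional bias per filter."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def dense_params(in_features, out_features, bias=True):
    """Parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Example: two 5x5 conv layers in a small LeNet-style net
print(conv2d_params(1, 6, 5, 5))    # 156
print(conv2d_params(6, 16, 5, 5))   # 2416
```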

Thank you to Girish Sastry and Max Daniel for help and discussion so far!

Discuss

### Problems of evil

April 19, 2021 - 11:06
Published on April 19, 2021 8:06 AM GMT

(Cross-posted from Hands and Cities)

I.

I wasn’t raised in a religious household, but I got interested in Buddhism at the end of high school, and in Christianity and a number of other traditions, early in college. Those were the days of the New Atheists, and of intricate wrangling over theistic apologetics. And I did some of that. I went, sometimes, to the atheist group, and to some Christian ones; I read books, and had long conversations; I watched lectures, and YouTube debates.

Much of the back-and-forth about theism that I engaged with at that point in my life, I don’t think about much, now. But I notice that one bit, at least, has stayed with me, and seemed relevant outside of theistic contexts as well: namely, the problem of evil.

As usually stated, the problem of evil is something like: if God is perfectly good, knowing, and powerful, why is there so much evil in the world? But I think this version is too specific, and epistemic. Unlike many other issues in theistic apologetics, I think the problem of evil — or something in the vicinity — cuts at something much broader than a “three O” (omnipotent, omniscient, omni-benevolent) God. Indeed, I think it cuts past belief, to a certain affirming orientation towards, and commitment to, reality itself — an orientation I think many non-theists, especially of a “spiritual” bent (including a secularized/naturalistic one), aspire towards, too.

II.

My impression is that of the many objections to theism, the problem of evil has, amongst theists, a certain kind of unique status — centrally, in its recognized force, but also in the way this force can apply independent of doubt about God’s existence per se.

Here’s the (devoutly Christian) theologian David Bentley Hart:

“That’s the best argument of all. It’s not an argument regarding God’s existence or non-existence, because that’s a question, first you have to define what existence means, what God means. But it goes directly to the question of divine goodness and benevolence. It’s the weightiest and the most powerful and the one that, actually, is the argument that’s adduced most often by believers, famously Dostoyevsky… It’s the argument that holds the most water for me.”

Indeed, Hart calls various responses to the problem of evil “banal and sometimes quite repulsive”:

“…the Calvinist argument for divine sovereignty, does it really have to justify itself to you morally; or equally, Richard Swinburne’s arguments, forgive me, I hate to name names, about how suffering gives us opportunities for moral goodness, and that includes, apparently, the holocaust… I think ultimately, if that’s the calculus, then God comes out as evil. There’s just no way you work your way to the end of these chains of reasoning, without coming up with an arbitrary and in some ways quite deplorable picture of God.”

(See also the Christian apologist and philosopher Alvin Plantinga, who writes: “I must say that most attempts to explain why God permits evil—theodicies, as we may call them—strike me as tepid, shallow and ultimately frivolous.”)

C.S. Lewis, too (another Christian apologist), seems to have felt the problem of evil with special acuity. In the beginning of The Problem of Pain, he describes why, before his conversion, he rejected Christianity:

“If you asked me to believe that [the pain and seeming indifference of the world] is the work of a benevolent and omnipotent spirit, I reply that all the evidence points in the opposite direction. Either there is no spirit behind the universe, or else a spirit indifferent to good and evil, or else an evil spirit.”

Indeed, A Grief Observed — a book compiled from journals Lewis wrote after his wife (called “H.” in the book) died of cancer — documents a (brief) crisis in this regard: not of faith in God, per se, but of faith in God’s goodness. Wracked by grief, haunted by his wife’s pain, Lewis writes:

“Come, what do we gain by evasions? We are under the harrow and can’t escape. Reality, looked at steadily, is unbearable. And how or why did such a reality blossom (or fester) here and there into the terrible phenomenon called consciousness?…

If H. ‘is not,’ then she never was. I mistook a cloud of atoms for a person. There aren’t, and never were, any people… No, my real fear is not materialism. If it were true, we — or what we mistake for ‘we’ — could get out, get from under the harrow. An overdose of sleeping pills would do it. I am more afraid that we are really rats in a trap. Or worse still, rats in a laboratory….

Sooner or later I must face the question in plain language. What reason have we, except our own desperate wishes, to believe that God is, by any standard we can conceive, ‘good’? Doesn’t all the prima facie evidence suggest exactly the opposite? What have we to set against it?…”

(The first few chapters of A Grief Observed, by the way, are some of my favorite bits of Lewis; and he exhibits, there, a vulnerability and doubt rare amidst his usual confidence).

Like Lewis, Dostoyevsky’s Ivan does not present evil as an objection to God’s existence per se. Indeed, he accepts that at the end of days, he may see the justice of the suffering of children; but he does not want to see it, or accept a ticket to heaven on such terms:

“Oh, Alyosha, I am not blaspheming! I understand, of course, what an upheaval of the universe it will be, when everything in heaven and earth blends in one hymn of praise and everything that lives and has lived cries aloud: ‘Thou art just, O Lord, for Thy ways are revealed.’…  I, too, perhaps, may cry aloud with the rest, looking at the mother embracing the child’s torturer, ‘Thou art just, O Lord!’ but I don’t want to cry aloud then… It’s not God that I don’t accept, Alyosha, only I most respectfully return Him the ticket.”

I think part of what might be going on, in these quotations, is that the problem of evil is about more than metaphysics. Indeed, Lewis dismisses materialism as confidently as ever; Hart sets the question of God’s “existence,” whatever that means, swiftly to the side; Ivan still expects the end of days. The problem of evil shakes them on a different axis — and plausibly, a more important one. It shakes, I think, their love of God, whatever He is. And love, perhaps, is the main thing.

III.

One common response to the problem of evil is: we don’t know why God permits so much evil, but we shouldn’t expect to know, either. He is too far beyond us. His ways are not our ways.

We see some of this, for example, in the book of Job. Job was “perfect and upright” (Job 1:1); but God, in a dispute with the devil about whether Job’s righteousness depends on his material advantages, allows the devil to kill Job’s children, servants, and livestock, and to cover Job’s body with boils. At first, Job refuses to curse God (“the Lord gave, and the Lord hath taken away; blessed be the name of the Lord”). But later, Job complains. Eventually, God appears to him in a whirlwind, to remind him how little he understands:

“Where wast thou when I laid the foundations of the earth? declare, if thou hast understanding. Who hath laid the measures thereof, if thou knowest?” (Job 38:4).

More philosophical versions of this sometimes invoke a chess-master. If you see Garry Kasparov make a chess move that looks bad to you, this need not impugn his mastery.

Response: OK, but if you’re not sure whether it’s Kasparov, or a random move generator, bad moves are evidence. And eventually — as queen and rooks fall, as no hint of strategy emerges — lots of it.

But we can un-know harder: why think you even know what it is to win at chess? Sure, God does bad-seeming things. But what are human concepts of “good” and “bad,” faced with God’s transcendence?

Here’s Lewis, responding to moves like this:

“If God’s moral judgment differs from ours so that our ‘black’ may be His ‘white’, we can mean nothing by calling Him good; for to say ‘God is good’, while asserting that His goodness is wholly other than ours, is really only to say ‘God is we know not what’. And an utterly unknown quality in God cannot give us moral grounds for loving or obeying him.” (p. 567)

Whether this argument actually works isn’t clear. Analogy: if you are devoted to any being that plays for the true win conditions of chess, and you hypothesize, initially, that checkmate constitutes winning, you’ll still end up devoted to a being who plays for a wholly different condition, if that condition turns out to be the true one (thanks to Katja Grace for suggesting objections in this vein; and see, also, the Euthyphro dilemma). But I think Lewis is pointing at an important worry regardless (and indeed, one that hit him hard during the crisis described above): namely, that if we go too far into “unknowing”; if we strip from God too much of what we think of as “goodness”; or if we call too many bad things “good,” then God, and goodness, start to empty out completely.

This worry seems especially salient in the context of contemporary (liberal, academic) theology, which in my experience (though it’s been a few years now), is heavily “apophatic” and mystical. That is, it approaches God centrally in His beyond-ness: beyond language, knowledge, mind and matter, personhood and non-personhood; beyond, even, existence and non-existence. Thus, Meister Eckhart writes of God: “He is being beyond being: he is a nothingness beyond being.” Or John Scotus Eriugena: “Literally God is not, because He transcends being.”

Perhaps God is beyond being. But is he beyond goodness, too? Some bits of Eckhart suggest this. “God is not good, or else he could be better.” (Though, conceptual transcendence aside, this seems like a terrible argument? “Pure black is not dark, or else it could be darker.”) And indeed, if we are to say nothing about God, presumably this includes: nothing good. God is blank.

But what, then — amidst the horrors of this world — grounds worship, reverence, devotion?

I think a variety of non-theists face something like this question, too.

IV.

In my days of talking with lots of people about their spirituality, I learned to ask certain questions to figure out where they were coming from; and whether they believed in God, or even in a “personal God,” wasn’t high on the list. Of Christians, for example, I would generally ask whether they believed in the literal, bodily resurrection of the historical Jesus — a concrete question that I think efficiently distinguishes variants of Christianity (e.g., “I believe in miracles” vs. “well it’s really all a kind of symbolic thing at the end of the day isn’t it?”), and which has some biblical endorsement as central (1 Corinthians 15:14: “if Christ be not risen, then is our preaching vain, and your faith is also vain”).

Similarly, I think of whether someone believes that Ultimate Reality is in some sense “good” as a much more informative question, spiritually speaking, than whether they believe in God, or that e.g., our Universe was created by something like a person. Indeed, I know a variety of vigorously secular folks who take seriously creation stories involving intelligent agents (see, even, Dawkins). And there are many God-related words (the Ground of Being, the Absolute, the Source, the Unconditioned, the Deathless) and concepts (pantheism, Deism) that do not imply anything like goodness (many of which, relatedly, can be compatible with something like naturalism — though the practice of capitalizing letters of abstract, God-related words seems, instructively in this context, in a higher-level sort of tension with the intellectual aesthetic most associated with naturalism).

But I don’t think that “belief” — whether in divine goodness or no — really captures what matters, either. Indeed, mystical/apophatic traditions like Eckhart’s often focus on negating and/or going beyond concepts — and “belief” is tough without concepts. Does Eckhart’s God exist? Does a dog have Buddha nature? Mu.

More broadly, the relationship between “spirituality” and explicit belief seems, at least, complex. Consider Ginsberg (1956):

“The world is holy! The soul is holy! The skin is holy! The nose is holy! The tongue and cock and hand and asshole holy!

Everything is holy! everybody’s holy! everywhere is holy! everyday is an eternity! Everyman’s an angel!”

What is the “belief” here? Not, clearly, that men don’t murder, or that clocks don’t tick. And looking out at the panoply of spiritual practices, communities, and experiences to which even metaphysically-naturalistic folk devote passionate energy, belief (even of a fuzzy, inconsistent, and/or motte-and-bailey kind) hardly seems the main thing going on.

But if we set aside belief — and especially, if we endeavor to avoid belief of the kind that makes apologetics, metaphysics, etc necessary at all — is it all just “sexed up atheism,” in Dawkins’s phrase, or non-sense? I think that at the very least, there are interesting differences between how e.g. Meister Eckhart is orienting towards the world, and how straw-Dawkins (even when appreciative of the world’s beauty) is doing so — differences separable from their respective metaphysics, but core to their respective “spirituality.”

V.

Dawkins, Carroll, Sagan, Tyson: all are keen to remind us of the wonder and awe compatible with naturalism (indeed, Kriss (2016) goes so far as to accuse popular atheists of peddling forms of beauty that discourage social change: “Whenever you hear a rapturous defense of the natural world, you should be on your guard: this is class power talking, and it’s trying to kill you”). But beliefs aside, are wonder and awe enough to equate the attitudes of Dawkins and Eckhart? I think: no. For one thing, there is a difference between (a) attitudes directed at particular arrangements of reality (stars, flowers, and so forth), and (b) attitudes directed, in some sense, at reality itself, the Being of beings. (Though stars, flowers, etc can serve, for (b), as vehicles, or sparks, or windows.)

In this vein, we might think of an attitude’s “existential-ness” as proportionate to the breadth of vision it purports to encompass. Thus, to see a man suffering in the hospital is one thing; to see, in this suffering, the sickness of our society and our history as a whole, another; and to see in it the poison of being itself, the rot of consciousness, the horrific helplessness of any contingent thing, another yet.

We might call this last one “existential negative”; and we might call Ginsberg’s attitude, above, “existential positive.” Ginsberg looks at skin, nose, cock, and sees not just particular “holy” things, contrasted with “profane” things (part of the point, indeed, is that cocks read as profane), but holiness itself — something everywhere at once, infusing saint and sinner alike, shit and sand and saxophone, skyscrapers and insane asylums, pavement and railroads, the sea, the eyeball, the river of tears.

Or consider this passage from Hesse’s Siddhartha:

“He no longer saw the face of his friend Siddhartha. Instead he saw other faces, many faces, a long series, a continuous stream of faces — hundreds, thousands, which all came and disappeared and yet all seemed to be there at the same time, which all continually changed and renewed themselves, and which were yet all Siddhartha. He saw the face of a fish, of a carp, with tremendous painfully opened mouth, a dying fish with dimmed eyes. He saw the face of a newly born child, red and full of wrinkles, ready to cry. He saw the face of a murderer, saw him plunge a knife into the body of a man; at the same moment he saw this criminal kneeling down, bound, and his head cut off by an executioner. He saw the naked bodies of men and women in the postures and transports of passionate love. He saw corpses stretched out, still, cold, empty. He saw the heads of animals — boars, crocodiles, elephants, oxen, birds. He saw Krishna and Agni. He saw all these forms and faces in a thousand relationships to each other … And all these forms and faces rested, flowed, swam past and merged into each other, and over them all there was continually something thin, unreal and yet existing, stretched across like thin glass or ice, like a transparent skin, shell, form or mask of water — and this was Siddhartha’s smiling face which Govinda touched with his lips at that moment.”

This seems to me “existential positive,” too. Govinda’s vision purports to encompass all of birth and death; the ten thousand things, seen in their unity; and yet Siddhartha smiles. And we can see many of the quotes from theists, in section II, as responding to the sense in which the problem of evil threatens their own “existential positive,” whether it threatens their belief in God or no.

To the extent that it goes beyond e.g. “the stars are so beautiful,” I think that a lot of contemporary, non-theistic spirituality involves elements of “existential positive” — even if not explicitly stated, and even in the context of more metaphysically pessimistic traditions, like Buddhism. Mystical traditions, for example (and secularized spirituality, in my experience, is heavily mystical), generally aim to disclose some core and universal dimension of reality itself, where this dimension is experienced as in some deep sense positive — e.g. prompting of ecstatic joy, relief, peace, and so forth. Eckhart rests in something omnipresent, to which he is reconciled, affirming, trusting, devoted; and so too, do many non-Dualists, Buddhists, Yogis, Burners (Quakers? Unitarian Universalists?) — or at least, that’s the hope. Perhaps the Ultimate is not, as in three-O theism, explicitly said to be “good,” and still less, “perfect”; but it is still the direction one wants to travel; it is still something to receive, rather than to resist or ignore; it is still “sacred.”

Indeed, we might think of popular injunctions to be “present,” “aware,” “here,” “now” — at least when interpreted in non-instrumental terms — as expressing a kind of existential positive, too. If reality is not in some sense good; and if turning towards it, receiving it, being aware of it, promotes no other worldly end (calm, focus, ethical clarity, etc); why, then, be mindful, or awake? Why not distract, or dull, or delude, or ignore?

What’s more, even if everything in this world is holy, in the limit of breadth, the “existential positive” here extends yet further — beyond any of the ways the world just happens to be, to a kind of affirmation of Being/Reality in itself, however manifest. Or at least, this is implied, I think, by a kind of unconditional holiness. (Though Katja Grace suggests: maybe in this world, everything is holy, but that other world, it isn’t. Indeed, we could even try to imagine a kind of “holiness zombie” world, physically identical to this one). More contingent forms of universal holiness, I think, involve what we might think of as “existential luck” — akin to (though broader than) the type Satan accuses God of giving Job. Sure, you’re spiritual here, in an often-pretty world, with your telescopes and your oxen and your boil-free skin. But suppose you were in a hell world. Suppose, in fact, you already are. (Kriss thinks you are.) What holiness, then?

VI.

In the context of the “existential positive,” and especially in its least contingent forms, a kind of non-theistic problem of evil re-arises. What is Ginsberg’s holiness, if holocaust, Alzheimers, rape, depression, factory farm, be holy? Or if, more, the worst possible world would be holy — since it, too, would be real? Ginsberg need not excuse God’s creation of the world’s horrors; Ginsberg’s God need not create, or choose, or know. Nor need Ginsberg protect or preserve those horrors, however holy. But there is still something in them of the Real; and the Real, for Ginsberg (or, my imagined Ginsberg), and for many others, is sacred in itself.

We see pressures, here, similar to those that drove the old theologians towards the obscure doctrine of “privatio boni“: that is, the view that evil is nothing real and substantive in itself, but is rather the absence or privation of goodness. God, after all, is the fount of all reality; to say that some bit of reality is bad, then, risks marring God’s perfection. Indeed, Lewis, in his depiction of Hell in The Great Divorce, makes it a tiny, insubstantial place, fading into nothingness, barely there (though the “barely,” I think, points at part of what makes privatio boni unstable — e.g., if evil really weren’t there, it struggles to play a role in the story).

Relatedly, for the old theologians, reality/being/existence was itself a “perfection” (hence, e.g. the ontological argument). And the “transcendentals” — that is, the set of properties common to all beings — were thought to include not just non-normative properties like “truth,” “unity,” and so forth, but also “beauty,” and “goodness.” We might see Ginsberg’s “holiness” as a transcendental, too.

But as ever, as soon as we set out to forge a non-contingent connection between the True and the Good; the Real and the Sacred; the Is and the Ought; the Ultimate and the to-be-Trusted, Affirmed, Rested-In, Worshipped — we run right into cancers; genocides; parasites; paralysis; predators ripping flesh from bone; mass extinctions; “bees in the heart, then scorpions, maggots, and then ash.” Contra the old theologians, these things are just as True, Real, Is, Ultimate, as anything else. If these, too, are sacred, then what is sacredness? Why reverence for the Real? Why not defiance, rebellion, disgust?

VII.

We might make a similar point a different way. Much of contemporary spirituality, I think, aims at a certain type of unification or “non-duality.” It aims, that is, to erase or transcend distinctions rather than draw them; to reach the whole, rather than the part. Indeed, to the extent that an “existential” attitude aims, ultimately, to encompass as much of the “whole picture” as possible, some aspiration towards unity seems almost inevitable.

But as we raise the level of abstraction, but wish to persist in some kind of existential affirmation, we will include, and affirm, more and more of the world’s horror, too (until, indeed, we move past what the world is actually like, to what it could be like, and to horrors untold). The content of the affirmation thereby either drains away, or horribly distorts. That is, naively, affirmation is made meaningful via its dualism; via the distinction between what is to be affirmed and what is not — and much of the world is, one might think, “not.” As this distinction collapses, the difference between “existential positive” and nihilism, good and “beyond good,” becomes increasingly unclear.

Thus, for example, Rumi writes:

Out beyond ideas of wrongdoing and rightdoing,
There is a field. I’ll meet you there…

Indeed, in my experience, various non-dual-flavored spiritual teachers flirt with, or explicitly endorse, what look, naively, like fairly direct types of nihilism, even as they urge e.g. compassion and kindness elsewhere. Buddhists doing this often suggest that realizing the empty and constructed nature of all things will, in fact, lead to greater compassion; and perhaps, empirically, this is right. But what if it doesn’t? Why should it? Indeed, confidence has waned, amongst some Western Buddhists, in a strongly reliable or “default” connection between “ethics” and “insight.” And good vs. bad, in some non-dual contexts, is just another constructed distinction — indeed, perhaps a core barrier — holding you back; another type of separation; perhaps, indeed, another type of violence.

But if we go fully beyond ideas of “good” and “bad,” what calls us towards Rumi’s field? We need not understand “goodness” in narrow, brittle, moralistic, or universalized ways; discernment need not exclude openness and receptivity; and perhaps it is good, in ways, to learn to put down “good” entirely, at least at times. But the David Enochs and Thomas Nagels of the world are right, I think, to recognize the ubiquity of at least some kind of normativity to a huge amount of human thought, and non-dual spirituality (not to mention much of the discourse about “non-judgment”) is no exception. “Beyond good” is not “special extra super good.” It’s just actually not good (or bad). Go fully beyond any sort of good, and the sacred loses its shine.

VIII.

My main aim here has been to point at the ways that something reminiscent of the theistic problem of evil applies to more amorphous forms of (even very naturalistic) spirituality, too. I won’t, here, say much about how deep a problem this is, or how one might respond to it (and sufficiently mystical responses will simply be: “this is a problem that arises at the level of concepts; but if you go to the experience of e.g. holiness itself, it does not arise, at least not in this way” — and I think there’s at least something to this).

Obviously, one response is to reject any kind of reverence or affirmation towards the Real in itself. Indeed, the rejection of any sort of evaluatively rich attitude (positive or negative) towards the Real in itself seems to me a plausible candidate for the essence of secularism — or at least, one salient kind. That is, the secularist may have positive and negative attitudes towards particular arrangements of reality; but Being, the Real, the Ultimate, the Numinous — these things, just in themselves (insofar as they have meaning at all), are blanks. (I think there are connections, here, between secularism in this sense, and a lack of interest in “contact with reality” of the type I described here; but that’s another story.)

I do want to point, though, at a different family of responses that seem to me both interesting, and less obviously secular in this sense.

Fromm (1956) distinguishes between “father love” and “mother love.” (To be clear: these are archetypes that actual mothers, fathers, non-gendered parents, non-parents, etc can express to different degrees. Indeed, if we wanted to do more to avoid the gendered connotations, we could just rename them, maybe to something like “assessment love” and “acceptance love.”) Fromm’s archetypal father orients towards his child from a place of expectation and assessment. He loves as the child merits. He teaches morality, competence, and interaction with the outside world. Fromm’s archetypal mother, by contrast, relates to her child with unconditional acceptance. She loves no matter what. She teaches security, self-loyalty, home. (See also parallels with Darwall’s (1977) “appraisal respect” vs. “recognition respect” — though there are many differences, too).

“Father love,” for many, is easy to understand. Love, one might think, is an evaluative attitude that one directs towards things with certain properties (namely, lovable ones) and not others. Thus, to warrant love, the child needs to be a particular way. So too with the Real, for the secularist. If the Real, or some part of it, is pretty and nice, great: the secularist will affirm it. But if the Real is something else, the thing to be done is to reshape it until it’s better. In this sense, the Real is approached centrally as raw material (here I think of Rob Wiblin’s recent tweet: “I’m a spiritual person in that I want to convert all the stars into machines that produce the greatest possible amount of moral value”).

But mother love seems, on its face, more mysterious. What sort of evaluative attitude is unconditional in this way? Indeed, more broadly, relationships of “unconditional love” raise some of the same issues that Ginsberg’s holiness does: that is, they risk negating the sense in which meaningfully positive evaluative attitudes should be responsive to the properties of their objects (reflecting, for example, when those properties are bad). And one wonders (as the devil wondered about Job) whether the attitude in question is really so unconditional after all.

But is mother love unconditionally positive? Maybe in a sense. But a better word might be: “unconditionally committed” or “unconditionally loyal” (thanks to Katja Grace for suggesting this framing). That is, we can imagine an archetypal mother who cares, like the archetypal father, about the child’s virtue, who is pained by the child’s mistakes, and so forth; and whose love, in this sense, is far from a blanket of uniform affirmation (though whether this fits Fromm’s mother mold, I’m not sure). But where the archetypal father might, let us suppose, give up on the child, if some standard is not met, the mother will not. That is, the mother is always, in some sense, loyal to the child; on the child’s team; always, in some sense, caring; paying attention.

Exactly how to understand this sort of unconditional loyalty, I’m not sure; and it may, ultimately, have problems similar to unconditional holiness (and obviously, ideals of unconditional loyalty, commitment, love etc in actual human contexts have their own issues). But we have, at least, a robust kind of human acquaintance with “mother love” of various kinds, and I wonder if it might suggest less secular (in my sense above) responses — perhaps even ancient and familiar responses — to the problems of evil I’ve discussed.

We might look for other examples, too, of forms of love that seem to transcend and encompass something’s faults, without denying them. Here I think of this scene from Angels in America — one of my favorites of all time (spoilers at link; and hard to understand if you don’t know the play). And also, of the father’s forgiveness in the parable of the prodigal son. (For a set of moving reflections on the parable, I recommend Nouwen (1994).)

“And the son said unto him, Father, I have sinned against heaven, and in thy sight, and am no more worthy to be called thy son. But the father said to his servants, Bring forth the best robe, and put it on him; and put a ring on his hand, and shoes on his feet.” (Luke, 15:21-22).
Painting by Rembrandt, source here.

Chesterton, in Orthodoxy (chapter 5) talks about loyalty as well, and about loving things before they are lovable:

“My acceptance of the universe is not optimism, it is more like patriotism. It is a matter of primary loyalty. The world is not a lodging-house at Brighton, which we are to leave behind because it is miserable. It is the fortress of our family, with the flag flying on the turret, and the more miserable it is the less we should leave it. The point is not that this world is too sad to love or too glad not to love; the point is that when you do love a thing, its gladness is a reason for loving it, and its sadness a reason for loving it more … What we need is not the cold acceptance of the world as a compromise, but some way in which we can heartily hate and heartily love it. We do not want joy and anger to neutralize each other and produce a surly contentment; we want a fiercer delight and a fiercer discontent.”

I don’t think responses in this vein — that is, forms of love, loyalty, commitment, and forgiveness towards the Real, despite its faults — fully capture what’s going on with experiences of e.g., holiness, sacredness, reverence, or receptivity (thanks to Katja Grace for suggesting distinctions in this respect). Nor am I committed to (or especially interested in) claims about the “secularism” of such responses. But faced with problems of evil, theistic or no, I think these responses might have a role to play.

Discuss

### the fat baker principle

April 19, 2021 - 10:58
Published on April 19, 2021 7:58 AM GMT

Culpability

"wait am I responsible for the pareto depth thing"
"Yes." t. Jollybard
essay theme: https://soundcloud.com/jollybard/gigantopithecus

The self is the landlord of the mind

Beliefs should pay rent in anticipated experience.
Everyone knows this by now, we've already turned out the pockets of the laziest and most impoverished ideas,
the freeloaders in this informational class struggle have been evicted long ago.
If we're the land barons of our own minds, why stop at just a little power?
Taking this idea further, preferences should pay rent just as much as beliefs.
But what can a preference 'pay'?

The Fat Baker Principle

There are different competing constructions for the 'fat baker'. Naturally, I prefer mine.
One could go: "Never trust a thin baker."
Another: "it's way easier to become good at something if you actually enjoy what you make"
"lmao"
"lole!"
Finally: "how can you say you even like bread if you can't make a decent loafa?"

To put it another way, preferences should pay rent in changed behavior.
If you really like manga, maybe you should have internalized a model that can sort good manga from bad.
If you really, really, really like manga, maybe you should have internalized a model that's most of the way to synthesizing new manga.

Conclusions

This is not only a virtue-deontology ethics
("you should make things yourself if you like them enough to fling critique")
but a rudimentary system of personality-level course correction against flights of fancy.

If you find yourself liking something, do you find yourself wanting to curate examples of work, find other creatives, sketch out the bones of your own work?
Do you find yourself "liking" parasocial relationships of engagement and consumption with content-creators, luminaries, or "communities" instead?
If you can only like your likes at arm's length, mediated through others and not your own hands, you might not like them as much as you think you do.

Discuss

April 19, 2021 - 06:13
Published on April 19, 2021 2:30 AM GMT

This is a linkpost for https://nibnalin.me/dust-nib/bellmans-curse-of-dimensionality.html

While popular study of dimensionality is usually focused on tight, mathematical systems, I’ve found that thinking about dimensionality as a metaphor in other, less well-defined systems is a surprisingly helpful tool.

Advice, for instance, tends to fit such a model very naturally. You have an extremely large state space (life) where you only get one opportunity to traverse a path. To add to that, like the typical exponential problem, as dimensionality of life increases, the volume of the space of possibilities increases so fast that the available data on historical outcomes becomes quite sparse. How, then, can someone evaluate others’ advice, or look at someone else’s life to figure out what they should do with their own?
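The sparsity claim has a simple numerical illustration (my own sketch, not from the linked post): hold the number of recorded "data points" fixed, and the nearest one to any new situation drifts rapidly farther away as the number of dimensions grows.

```python
import math
import random

def avg_nearest_neighbor_dist(n_points, dim, trials=200, seed=0):
    """Average distance from a random query point in the unit hypercube
    to its nearest neighbor among n_points uniform samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
        query = [rng.random() for _ in range(dim)]
        total += min(math.dist(query, p) for p in points)
    return total / trials

# The same 200 samples cover 2 dimensions densely, 20 dimensions barely.
for d in (2, 5, 10, 20):
    print(f"dim={d:2d}  avg nearest-neighbor distance="
          f"{avg_nearest_neighbor_dist(200, d):.3f}")
```

In two dimensions the nearest recorded example is a few percent of the cube's width away; by twenty dimensions it is typically farther than the cube's entire side length, which is the "historical outcomes become sparse" point in miniature.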

I don't claim to have a solution to this problem, but in my experience it's helpful to keep the dimensionality framework suggested by Bellman in the back of your mind as you navigate advice. For starters, thinking about the dimensions of this space independently helps you evaluate different states clearly:

• Time: Your professor probably doesn’t have the best advice for you because their own experience of your situation was frozen in the 1980s, and a lot about the world has changed since then.
• Space: Advice about fundraising from a US entrepreneur is obviously not as relevant for someone raising in India or China.
• Internal state: This is one of the harder categories to pin down. One way to think about it is as some function of an individual's personality and thought processes, but I think that only scratches the surface of what this broad category intends to cover. If someone attempts to use Steve Jobs' offbeat management methods, but lacks the charisma and the "reality distortion field" Steve Jobs is said to have conjured, they're not going to be as successful in their endeavours.

I've always found it somewhat surprising how much different, often entirely contradictory, advice you can hear about any choices you can think of. When just about everyone has their own strongly held opinions about everything, it's really hard to evaluate the applicability and the net delta of any singular piece of advice.

I think the only antidote to evaluating advice given such dimensional differences is to figure out the relative distortions between the advisor's state and your state, and "port" the advice as necessary. Don't do what Napoleon did; do what he would do if he were a founder born in Silicon Valley during the Information Age.

Another consequence of this framework is that differences across multiple dimensions compound non-linearly. A 60-year-old college professor, separated from you in both time and space, will have strictly less applicable advice than someone your age living in the same city as you. Note that I'm only making a claim about the applicability of their advice here; whether the advice itself is good or bad is a completely different question that one should evaluate separately.

Notably, however, there are flaws in this framework. The most obvious one to me is that it isn't clear whether the aforementioned dimensions are linearly independent, and treating them as such has unintended consequences (for instance, someone at a different place and time but with the same internal state isn't necessarily as distant from you as someone at the same place and time but with a wildly different internal state).

Blessing of dimensionality

There is one other interesting result that arises from this framework that's applicable to advice: the blessing of dimensionality. In math, the curse of dimensionality has a counterpart that feels unintuitive on first consideration: because exponential problems are so hard, even basic heuristics are good enough to get reasonably close to optimal solutions. The oversimplified explanation is that the state spaces for such problems are so large that taking random paths is exponentially worse than taking a path that's merely good enough.[1]
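To make the "basic heuristics get reasonably close" claim concrete, here is a toy simulation (my own hedged sketch, not from the post): a binary decision tree with random edge rewards, where the number of root-to-leaf paths grows as 2^depth. A one-step greedy rule recovers much of the exhaustive optimum, while a random walk lags well behind.

```python
import random

def build(depth, rng):
    """Random binary tree; each edge carries a uniform(0,1) reward."""
    if depth == 0:
        return None
    return (rng.random(), build(depth - 1, rng),
            rng.random(), build(depth - 1, rng))

def best(tree):
    """Exhaustive optimum over all 2^depth root-to-leaf paths."""
    if tree is None:
        return 0.0
    lw, lt, rw, rt = tree
    return max(lw + best(lt), rw + best(rt))

def greedy(tree):
    """Basic heuristic: always take the heavier of the two next edges."""
    total = 0.0
    while tree is not None:
        lw, lt, rw, rt = tree
        total, tree = (total + lw, lt) if lw >= rw else (total + rw, rt)
    return total

def random_path(tree, rng):
    """Baseline: choose branches at random."""
    total = 0.0
    while tree is not None:
        lw, lt, rw, rt = tree
        total, tree = (total + lw, lt) if rng.random() < 0.5 else (total + rw, rt)
    return total

rng = random.Random(42)
tree = build(16, rng)  # 2**16 = 65,536 possible paths
print(f"optimal: {best(tree):.2f}  greedy: {greedy(tree):.2f}  "
      f"random: {random_path(tree, rng):.2f}")
```

Greedy isn't optimal here (it ignores everything beyond the next edge), but it reliably beats the random baseline and closes a meaningful fraction of the gap to the true optimum, despite doing exponentially less work than the exhaustive search.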

When I first read Dale Carnegie's "How to Win Friends and Influence People", I was extremely surprised by the blandness and obviousness of all the advice it described. Surely, I thought, advice this obvious couldn't have made a multi-decade bestseller that claims to have changed countless lives. Weirdly enough, Dale Carnegie doesn't need his readers to develop Steve Jobs' charisma to be successful: if they take the basic, most obvious steps, they'll already be reasonably successful in life. This "blessing of dimensionality" is why most popular advice you'll find is unspecific and vague, even though, in practice, most of us will end up applying that advice to the most specific scenarios and circumstances.

From a broader perspective, this counterpart "blessing of dimensionality" is a good reminder to avoid the traps of overthinking and of taking life overly seriously. If your actions have some reasonable motivation behind them, then at the end of the day the worst you'll do in expectation is really not that bad in absolute terms: you're unlikely to end up in a ditch in a forest from picking the wrong branch at one fork in the road, if all the other forks you've taken had some reasonable amount of consideration behind them too.

1. More technical details about the setup for this class of problems are described here ↩︎

Discuss